Univariate Twin Analysis

S.Purcell

Introduction

In this tutorial we shall take a look at a dataset containing scores for 500 MZ twins and 500 DZ twins on a single measure. The aim of these particular analyses is to determine whether or not there is any evidence for a genetic effect on the trait.

The dataset can be downloaded here. Depending on your browser, you might have to right-click and choose 'Save As...' or 'Save Target As...'. Alternatively, you might be able to left-click on the link to view the raw data in your browser window and then copy-and-paste the dataset into WordPad or some other text-editor and save it from there onto your computer.

The dataset is in comma-separated value format so it can be read by almost any program, including Excel. Each row of the dataset represents information on one twin pair. It contains four variables:
  • An ID number for each twin (we can ignore these)
  • The zygosity of the twin pair (1 for MZ, 2 for DZ)
  • The trait score for the one twin (t0_1)
  • the trait score for the other twin (t0_2)
If you have access to Stata, open it up, after you have transfered the datafile to your computer. If you do not have access to Stata, open up whichever statistics package you are most familiar with. A spreadsheet such as Excel could even be used for most of these analyses.

In the Stata command line, type the following to load the raw data into memory, replacing the u:\ path with the appropriate path for your computer. The comma option after the filename tells Stata to expect a comma-separated value file.

Assuming it loads correctly, you should see the following text in the output window. This simply confirms the fact that our data file contains 4 columns (i.e. the variables as listed above) and 1000 observations (i.e. 500 MZ twin pairs and 500 DZ twin pairs).

The Variables window in Stata should display the names of the four variables. These were the top row of the raw datafile (often called the header row). Unlike SPSS, Stata does not provide a spreadsheet-style view of the data by default. Although it can sometimes be useful to literally see all your data on the screen, when working with large datasets, as is typical in quantitative genetic research, the spreadsheet-style view becomes less useful.

The tab command in Stata is used to tabulate values of a variable to produce a frequency table. Type in tab zyg to get a breakdown of the zygosity status of the sample. If all is well, this should confirm that you have successfully loaded 500 pairs of each twin zygosity into Stata.

The trait scores for each twin (t0_1 and t0_2) are continuously-distributed variables, so it is not informative to tabulate each the values in the same manner as above. Rather, we have to rely on summary descriptive statistics such as the mean and variance to tell us about the distribution. The command su (summarise) can be used to give these statistics. If you type su t0_1 t0_2 Stata will print the mean for all twin 1s and all twin 2s separately, as well as several other useful statistics such as the standard deviation.

We can see that there is no real difference between the mean scores for all twin 1s and all twin 2s. Also, there is no real difference in the variability of these variables: the standard deviations are very similar.

Another useful way to look at data is to plot it. To generate a histogram for all twin 1 scores, type the following:

The bin(30) option tells Stata how many bars (or bins) to draw in the histogram and the norm option is used to plot a normal curve over the histogram, drawn in red.

This should result in the graph as shown to the right. We can see that the trait is approximately normally distributed, which is good news for these analyses.

 

 

 

 

 

 

 

In genetic analyses, we are often more concerned with the relationships in scores between related individuals. An ideal way to get a sense of the relationship between two continuously-distributed measures is to draw a scatter-plot of the two variables. We can use the following command to plot the scores of one twin against the scores of the other twin.

Instead of producing a histogram, Stata realises that we have specified two variables after the graph command and so produces a bivariate scatter-plot.

Each dot represents a twin pair: the position on the vertical, or y, axis represents the score for the first twin, the position on the horizontal, or x, axis represents the score for the other twin. We can see evidence of an association between twins here: the distribution is slightly more drawn-out going from bottom-left to top-right. That is, individuals who score low are more likely to have co-twin who also scores below average, and vice-versa. The association is by no means strong, however.

 

 

 

We can quantify this association between scores with a correlation coefficient. Type the command as shown to the right for Stata to calculate the correlation between twins for this trait.

Stata displays the following correlation matrix: the correlation is 0.4241. This will undoubtably be significantly greater than zero (the command pwcorr t0_1 t0_2, o sig can be used to also give p-values for the correlations).

So far, all the graphs and statistics have not differentiated between MZ and DZ twins. The aim of this analysis is to compare twin similarity between MZ and DZ twins, so we need to analyse twins separately by their zygosity.

Type the command sort zyg to sort the dataset by this variable. It is now possible to stratify analysis by twin zygosity. That is, rather than giving the twin correlation for all twins, Stata will output the twin correlation for MZ twins and DZ twins separately.

Looking first at whether there are any differences in means or standard deviations between MZ and DZ twins, use the command as shown to the right.

The output of this command, shown left, is now broken down for MZ and DZ twins: we can see that there is no big difference between MZ and DZ twins in terms of their trait means or standard deviations. (Formally, we would test these statistically, but an 'eye-ball test' will suffice for present purposes.)

Now enter the following command, shown to the right, in order to obtain the twin correlations for MZ and DZ twins separately: these quantities are fundamental to the twin analysis we are about to perform.

We see that the MZ correlation (0.5231) is larger than the DZ correlation (0.3204). So the correlation we obtained earlier for the whole sample, ignoring twin zygosity, effectively represented an average between these two figures.

Looking at the scatter-plots by zygosity we can see this difference in correlations: the left scatter-plot is for MZ twins, the right one for DZ twins. We can see a slightly stronger relationship for MZ twins.

As mentioned in the Appendix text, a significantly larger MZ correlation compared to DZ correlation is evidence for the existence of genetic influence on a trait. The standard method of estimating heritability from twin correlations is as follows (also described in the Appendix):
  • Heritability is twice the difference between MZ and DZ correlations
  • Proportion of variance due to environmental factors shared between twins equals the MZ correlation minus the heritability
  • Proportion of variance due to environmental factors not shared between twins equals 1 minus the MZ correlation
We can use the di (for display) command in Stata to act as a pocket calculator: the estimate for heritability is therefore approximately 40% (i.e. twice the difference between MZ and DZ correlations is 0.4054). The shared environment accounts for 12% of variation, whilst nonshared environmental effects account for just under half of the variance in this population. The last line merely confirms that these proportions of variance do in fact add up to 1, as they should

Covariances and Model-fitting

These estimates of heritability are based on the comparison between correlations. As mentioned in the Appendix, it is often preferable to analyse variances and covariances instead, for a number of reasons. We can obtain the variance-covariance matrices for MZ and DZ twins from Stata by adding the cov option to the correlation command as shown to the right.

As we saw earlier, there are no major differences between four estimates of variance. We can apply model-fitting methods to these variance-covariance matrices and compare the results to the simple analysis with correlations.

 

 

 

We can use the Model-fitting 2 module from the main BGIM site. Clicking on the WWW symbol shown below enters the module designed to perform model-fitting analysis on univariate twin data.

Model-fitting 2
This on-line resource performs a maximum-likelihood analysis of univariate twin data and presents the parameter estimates for nested sub-models. The tutorial explains how to use the module and how to interpret the output.

 

We can simply enter the MZ and DZ variances and covariances estimated by Stata into this module: the picture to the right shows these. Also, we must specify how many MZ and DZ twins these variances and covariances were based on - 500 each in this case. Not shown here, the module also asks for a Type I error rate : just enter 0.05 for this. This represents the chance of falsely rejecting a model, as described in the Appendix and other modules.

 

 

 

 

 

 

The results of this analysis are shown below:

The figures represent components of variance. From the full ACE model we obtain a similar estimate of heritability as from the correlational analysis. The total variance is 2.7124 + 1.122 + 3.6693 = 7.4937. Therefore the proportion of variance due to additive genetic effects, or heritability, is 2.7124/7.4937 = 0.36. In other words, the model-fitting analysis estimates that 36% of the variation in the population is due to genetic effects. This can be regarded as a more accurate estimate than the 40% obtained from analysing correlations. The reason for the difference is that the model-fitting analysis is taking into account the small differences amongst the four variances.

We also see that the best-fitting model is in fact the AE model: that is, the effect of shared environment is not significant. A model with only additive genetic and nonshared environmental effects is shown on the second line (again, go to the module tutorial for a full explanation of all the figures in this table). The standard estimates from the AE model are given below: the impact of genes is estimated much higher now:

 

 

Further exploration: Simulation

The datafile we used in the analysis above was not collected from real individuals measuring any particular trait: it was simulated, using the SIMULATOR module at this website.

SIMULATOR
This on-line resource allows users to easily simulate multivariate ACE datasets and explore the properties of the Cholesky decomposition description of data.

When using simulated data, we have the advantage of actually knowing what the answers should be because we have simulated the dataset according to these values. In the univariate case, this module allows you to create a random dataset of a specified number of MZ and DZ twin pairs whose scores are as if they come from a population for which the various quantities we estimated above are known.

In the simulation module, the relative balance of additive genetic, shared environmental and nonshared environmental effects are specified in terms of the path-coefficients rather than as variance components or proportions of variance. See the Appendix for the distinction between path co-efficients and variance components: essentially we simply square the path co-efficient to give the variance component (think of path tracing rules- we traverse the path twice for the trait variance). The following path co-efficients were used in the simulation of the dataset:

By selecting the Calculate option, the module gives the corresponding proportions of variance attributable to A, C and E. In this case the total variance is there therefore 9 (2*2 + 1*1 + 2*2). The proportion of variance due to additive effects of genes is (2*2) / 9 = 44.4%. In a similar manner, we calculate that the shared environment accounts for 11.1% of variance, the nonshared environment for 44.4%.

In infinitely large samples, the methods of analysis used here would have yielded these exact figures. Instead, because of sampling error, the estimates were slightly off, even though the sample sizes seemed reasonably large.

Try using the simulation module to create another dataset according to the same properties, analyse it, and you will notice that the estimates will be slightly different again. But if you repeated this a large number of times, you should also notice that the estimates may be a little out each time, but there should be no consistent bias in them (i.e. such as the heritability being consistently over-estimated).

Bear in mind, however, the simulation module simulates data according to very simplistic assumptions that might not hold in real life, which potentially could lead to systematic bias. An example of this would be the equal environment assumption, as mentioned in the Appendix.

Try simulating a univariate dataset with different values, and then use Stata to calculate the correlations and covariances. Try experimenting with the approximate size of samples you need before the estimate become stable.

Further exploration: Using Mx to perform the analysis

Try installing Mx on your computer and run the basic ACE script to perform the same analyses. The basic ACE script can be found here. You will need to edit the script just to enter the covariance matrices, as follows:
Group2: MZ twin pairs
Data NInput_vars=2 NObservations=500
     CMatrix 
	7.949
	4.088 7.682
and for DZ twins
Group3: DZ twin pairs
Data NInput_vars=2 NObservations=500
     CMatrix 
	 7.012
	 2.315 7.447
Looking at the output, you will see the relevant parameters A, C and E representing the components of variance. These will correspond very closely to the values obtained from the model-fitting module above. In this case,
 
  MATRIX A
 This is a computed FULL matrix of order    1 by    1
  [=X*X']
             1
 1      2.7139
  
  MATRIX C
 This is a computed FULL matrix of order    1 by    1
  [=Y*Y']
             1
 1      1.1213
  
  MATRIX E
 This is a computed FULL matrix of order    1 by    1
  [=Z*Z']
             1
 1      3.6684

The script also contains a standardisation group, so the proportions of variance are also given (T, U and V correspond to A, C and E respectively).
  MATRIX T
 This is a computed FULL matrix of order    1 by    1
  [=D~*A*D~]
          1
 1   0.3617
  
  MATRIX U
 This is a computed FULL matrix of order    1 by    1
  [=D~*C*D~]
          1
 1   0.1494
  
  MATRIX V
 This is a computed FULL matrix of order    1 by    1
  [=D~*E*D~]
          1
 1   0.4889

This gives the same heritability estimate, 36%. The fit statistics are also important:
 Your model has    3 estimated parameters and      6 Observed statistics
  
 Chi-squared fit of model >>>>>>>     1.984
 Degrees of freedom >>>>>>>>>>>>>         3
 Probability >>>>>>>>>>>>>>>>>>>>     0.576
The non-significant fit (p=0.576) represents that the model accounts for the observed data well.

The script can be modified to test an AE model- remove the free keyword after the Y matrix from the first group:
	Begin Matrices;
		X Lower nvar nvar free    ! genetic structure
		Y Lower nvar nvar         ! shared environment 
		Z Lower nvar nvar free    ! non-shared path co-efficients
	
		H Full 1 1
	End Matrices;
As can be seen this gives similar results to the model-fitting module above:
  MATRIX T
 This is a computed FULL matrix of order    1 by    1
  [=D~*A*D~]
          1
 1   0.5252
  
  MATRIX U
 This is a computed FULL matrix of order    1 by    1
  [=D~*C*D~]
          1
 1   0.0000
  
  MATRIX V
 This is a computed FULL matrix of order    1 by    1
  [=D~*E*D~]
          1
 1   0.4748
Try using Mx to test the CE model; try analysing other univariate twin covariance matrices.



Site created by S.Purcell, last updated 20.05.2007