Skip to content

6. Heritability

It is meaningless to perform a genomic association study with a phenotype unless it is heritable. Preceding studies are needed to determine heritability estimates for especially complex traits; however, this is out of scope for this tutorial. By heritability assessment, this means estimating the proportion of variance in the phenotype explained by the genotyped SNPs. Later, it should be calculated how much of this heritability was captured by the selected markers.

GCTA software will be used to perform the GREML (Genetic Relatedness Estimation through Maximum Likelihood) analysis for heritability assessment.

The --reml command of GCTA estimates the proportion of phenotypic variance that can be explained by SNPs. Here is a more detailed explanation of the [GREML method (https://morrisanimalfoundation.github.io/grGWAS/ext_docs/GREML_analysis/) and the underlying linear mixed model.

In brief, GREML utilizes the genetic relationships between individuals in a sample population, captured in a genomic relationship matrix (GRM), to partition the phenotypic variance into its genetic and residual components. By fitting a linear mixed model, GREML estimates the proportion of variance attributable to all SNPs.

The summary result of REML analysis will be saved in a plain text file (*.hsq).

6.1. GCTA-GRM

  • This command calculates the genetic relationship matrix (GRM) from all the autosomal SNPs.
  • GCTA receives PLINK binary PED files as input. However, the Plink files have to be tweaked to update the chromosome names into a numerical format and remove all extra chromosomes.
## a) Prepare the Plink input files
gcta="gcta_${target}" && mkdir -p "$gcta"
plink2 --bfile $gwas/AxiomGT1v2.filtered.${target}.noRelatives.LD_prune --chr-set 38 no-xy --allow-extra-chr '0' \
      --make-bed --out $gcta/AxiomGT1v2.filtered.${target}.noRelatives.LD_prune.gcta

## b) Run the GCTA-GRM command
gcta64 --bfile $gcta/AxiomGT1v2.filtered.${target}.noRelatives.LD_prune.gcta --autosome-num 38 --autosome --make-grm \
       --out $gcta/AxiomGT1v2.filtered.${target}.noRelatives.LD_prune.gcta_grm --thread-num 10

6.2. Input phenotype data

The input phenotype file is a plain text file similar to that of Plink. If the phenotypic value is coded as 0 or 1 (or 1 and 2, compatible with PLINK), then it will be recognized as a case-control study (0 for controls and 1 for cases). Missing value should be represented by "-9" or "NA".

For this tutorial, the file $gwas/atopy.pheno that has 318 cases and 1664 controls will be used

6.3. Covariates

The data can be adjusted for quantitative and discrete covariates. Previous PCA analysis showed that three principal components explain most of the variance and none of them segregates with the phenotype of interest. Therefore, the GREML analysis for these 3 eigenvectors will be adjusted. Adjustments must also be made for possible gender effect.

This information has already been obtained, but it needs to be reformatted to match the GCTA expectations.

tail -n+2 $gwas/AxiomGT1v2.filtered.${target}.noRelatives.LD_prune.pca.eigenvec | cut -f1-5 > $gwas/3PCs.txt
cat $gcta/AxiomGT1v2.filtered.${target}.noRelatives.LD_prune.gcta.fam | awk '{print $1,$2,$5}' > $gwas/gender.txt

6.4. Breeding value

A breeding value is an estimate of an animal's genetic merit for a particular trait. The additional command --reml-pred-rand predicts the total genetic effect (i.e. breeding value) of each individual attributed by the aggregative effect of the SNPs used to estimate the GRM. The total genetic effects of all the individuals will be saved in a plain ext file *.indi.blp.

6.5. Run GCTA-GREML

Now that all the required pieces have been obtained, it's time to run the analysis.

Note: It is not required to have exactly the same individuals in these files. GCTA will find the individuals in common in the files and sort the order of the individuals.

gcta64 --reml --grm $gcta/AxiomGT1v2.filtered.${target}.noRelatives.LD_prune.gcta_grm \
       --pheno $gwas/${target}.pheno \
       --qcovar $gwas/3PCs.txt --covar $gwas/gender.txt \
       --reml-pred-rand \
       --out $gwas/${target}_greml --thread-num 10

Here is the output:

...

Accepted options:
--reml
--grm gcta_atopy/AxiomGT1v2.filtered.atopy.noRelatives.LD_prune.gcta_grm
--pheno gwas_atopy/atopy.pheno
--qcovar gwas_atopy/3PCs.txt
--covar gender.txt
--reml-pred-rand
--out gwas_atopy/atopy_greml
--thread-num 10

Note: the program will be running on 10 threads.

Reading IDs of the GRM from [gcta/AxiomGT1v2.filtered.atopy.noRelatives.LD_prune.gcta_grm.grm.id].
1989 IDs are read from [gcta/AxiomGT1v2.filtered.atopy.noRelatives.LD_prune.gcta_grm.grm.id].
Reading the GRM from [gcta/AxiomGT1v2.filtered.atopy.noRelatives.LD_prune.gcta_grm.grm.bin].
GRM for 1989 individuals are included from [gcta/AxiomGT1v2.filtered.atopy.noRelatives.LD_prune.gcta_grm.grm.bin].
Reading phenotypes from [gwas_atopy/atopy.pheno].
Non-missing phenotypes of 1989 individuals are included from [gwas_atopy/atopy.pheno].
Reading quantitative covariate(s) from [gwas_atopy/3PCs.txt].
3 quantitative covariate(s) of 1989 individuals are included from [gwas_atopy/3PCs.txt].
Reading discrete covariate(s) from [gwas_atopy/gender.txt].
1 discrete covariate(s) of 1989 individuals are included from [gwas_atopy/gender.txt].
Assuming a disease phenotype for a case-control study: 317 cases and 1672 controls 
Note: disease prevalence can be specified by the option --prevalence so that GCTA can transform the variance explained to the underlying liability scale.
3 quantitative variable(s) included as covariate(s).
1 discrete variable(s) included as covariate(s).
1989 individuals are in common in these files.

Performing  REML analysis ... (Note: may take hours depending on sample size).
1989 observations, 5 fixed effect(s), and 2 variance component(s)(including residual variance).
Calculating prior values of variance components by EM-REML ...
Updated prior values: 0.0671216 0.0786665
logL: 927.007
Running AI-REML algorithm ...
Iter.   logL    V(G)    V(e)
1       948.09  0.04286 0.09394
2       974.49  0.02871 0.10482
3       986.91  0.02029 0.11235
4       992.52  0.01516 0.11746
5       995.02  0.01198 0.12090
6       996.11  0.00996 0.12321
7       996.60  0.00867 0.12475
8       996.81  0.00602 0.12801
9       996.98  0.00624 0.12786
10      996.98  0.00623 0.12787
11      996.98  0.00623 0.12787
Log-likelihood ratio converged.

Calculating the logLikelihood for the reduced model ...
(variance component 1 is dropped from the model)
Calculating prior values of variance components by EM-REML ...
Updated prior values: 0.13387
logL: 995.87451
Running AI-REML algorithm ...
Iter.   logL    V(e)
1       995.88  0.13387
2       995.88  0.13387
Log-likelihood ratio converged.

Summary result of REML analysis:
Source  Variance        SE
V(G)    0.006227        0.004659
V(e)    0.127868        0.005905
Vp      0.134095        0.004273
V(G)/Vp 0.046439        0.034585

Sampling variance/covariance of the estimates of variance components:
2.170601e-05    -1.915605e-05
-1.915605e-05   3.486728e-05

Summary result of REML analysis has been saved in the file [gwas_atopy/atopy_greml.hsq].

BLUP solutions of the genetic effects for 1989 individuals have been saved in the file [gwas_atopy/atopy_greml.indi.blp].

6.6. Interpretation of results

a) $gwas/atopy_greml.hsq: The output file of the --reml command.

  • V(G), V(e) and Vp for genetic variance, residual variance, and phenotypic variance respectively.
    Note that Vp = V(G) + V(e)
  • V(G)/Vp represents the proportion of phenotypic variance explained by SNPs. The results here indicate that SNPs can explain ~4.6% only of the phenotypic variance.
  • The standard error (SE) is important to judge the reliability of the results. A 95% confidence interval (CI) is approximately h2-SNP estimate +/- 1.96 * SE. If the SE is too large, the 95% CI will cover the whole parameter space (from 0 to 1) so that it won't be possible to make any meaningful inferences from the estimate. Therefore, a value of SE < 0.1 is needed to have reliable results.
  • logL is the log likelihood for the full model (the null hypothesis that σ2g ≠ 0).
  • logL0 is the log likelihood for the reduced model (the null hypothesis that σ2g = 0).
  • LRT is the log-likelihood ratio test statistic. It is calculated as twice the difference in log-likelihood between the full (h2 ≠ 0) and reduced (h2 = 0) models. i.e. LRT = 2[logL - logL0].
  • Degree of freedom (df), p-value, and sample size.
    For more information about the full and reduced models and the interpretation of log likelihoods and LRT, check here.

b) $gwas/atopy_greml.indi.blp: The output file of the --reml-pred-rand command.

  • Columns are family ID, individual ID, an intermediate variable, the total genetic effect, another intermediate variable and the residual effect.
# Histogram of the genetic effect.
awk -v size=0.02 'BEGIN{OFS="\t";bmin=bmax=0}{ b=int($4/size); a[b]++; bmax=b>bmax?b:bmax; bmin=b<bmin?b:bmin } END { for(i=bmin;i<=bmax;++i){if(i==0) print -1*size,size,a[i]/1;else if(i<0) print (i-1)*size,i*size,a[i]/1;else print i*size,(i+1)*size,a[i]/1 }}' $gwas/${target}_greml.indi.blp > $gwas/${target}_greml.indi.blp_gen_histo

# Histogram of the residual effect.
awk -v size=0.02 'BEGIN{OFS="\t";bmin=bmax=0}{ b=int($6/size); a[b]++; bmax=b>bmax?b:bmax; bmin=b<bmin?b:bmin } END { for(i=bmin;i<=bmax;++i){if(i==0) print -1*size,size,a[i]/1;else if(i<0) print (i-1)*size,i*size,a[i]/1;else print i*size,(i+1)*size,a[i]/1 }}' $gwas/${target}_greml.indi.blp > $gwas/${target}_greml.indi.blp_res_histo

# Annotate the report by the phenotype.
awk 'BEGIN{OFS="\t"}NR==FNR{a[$2]=$3;next}{print $2,$4,$6,a[$2]}' $gwas/atopy.pheno $gwas/${target}_greml.indi.blp | sort -k2,2g > $gwas/${target}_greml.indi.blp.pheno



Note: GCTA has additional valuable options e.g. --gxe to estimate the variance of genotype-environment (GE) interaction and --prevalence to transform the estimate of variance explained, V(1)/Vp, on the observed scale to that on the underlying scale, V(1)/Vp_L. Check GCTA's documentation for details.