4. QC & filtration
Careful QC of genotyping data is an important step towards a successful GWAS. Samples and/or variants that do not meet quality expectations need to be excluded.
Note: Typically, screening and excluding 1st degree relatives is the first step. However, this step will be postponed until the target cases and controls have been defined.
4.1. Data exploration
Plink commands are used to assess heterozygosity, missing genotyping, allele frequency and Hardy-Weinberg equilibrium (HWE). The output are the files ending in ".het", ".imiss (Per-individual)/.lmiss (per-variant)", ".frq" and ".hwe".
mkdir -p inspect
plink --bfile AxiomGT1v2.comp_merge.deDup.sexConfirm.public_ids --chr-set 38 no-xy --allow-extra-chr \
--het --missing --freq --hardy 'midp' \
--output-chr 'chrM' --out inspect/AxiomGT1v2.explore
# There are currently:
# 913984 variants loaded from .bim file.
# 3224 samples (1609 males, 1615 females) loaded from .fam.
Note: The 'midp' modifier of the command
--hardy
applies the mid-p adjustment described here. The mid-p adjustment tends to bring the null rejection rate in line with the nominal p-value and reduces the filter's tendency to favor retention of variants with missing data.
4.1.1. Extreme cases of heterozygosity
The heterozygosity report of Plink compares the expected and observed number of homozygotes. Column #6 in the report is an estimate for their divergence. Postive estimates indicate high homozygosity while negative estimates indicate low homozygosity.
awk -v size=0.02 'BEGIN{OFS="\t";bmin=bmax=0}{ b=int($6/size); a[b]++; bmax=b>bmax?b:bmax; bmin=b<bmin?b:bmin } END { for(i=bmin;i<=bmax;++i){if(i==0) print -1*size,size,a[i]/1;else if(i<0) print (i-1)*size,i*size,a[i]/1;else print i*size,(i+1)*size,a[i]/1 }}' <(tail -n+2 inspect/AxiomGT1v2.explore.het) > inspect/AxiomGT1v2.explore.het.histo
cat inspect/AxiomGT1v2.explore.het | awk '{if($6>0.3)print}' > inspect/AxiomGT1v2.explore.het.highHomo ## high homozygosity (i.e. low heterozygosity)
cat inspect/AxiomGT1v2.explore.het | awk '{if(NR==1)print}{if($6<-0.3)print}' > inspect/AxiomGT1v2.explore.het.lowHomo ## low homozygosity
4.1.2. Markers with significant deviation from HWE
The Hardy-Weinberg equilibrium report reports the results of an exact test comparing the expected and observed heterozygote frequency of each variant. Column #9 in the report is the Hardy-Weinberg equilibrium exact test p-value.
awk 'BEGIN{OFS="\t";}{ if($9<1e-50)a["1e-50 or less"]++;
else if($9<1e-40)a["1e-40:1e-50"]++; else if($9<1e-30)a["1e-30:1e-40"]++; \
else if($9<1e-20)a["1e-20:1e-30"]++; else if($9<1e-10)a["1e-10:1e-20"]++; \
else if($9<1e-9)a["1e-9:1e-10"]++; else if($9<1e-8)a["1e-8:1e-9"]++; \
else if($9<1e-7)a["1e-7:1e-8"]++; else if($9<1e-6)a["1e-6:1e-7"]++; \
else if($9<1e-5)a["1e-5:1e-6"]++; else if($9<1e-4)a["1e-4:1e-5"]++; \
else if($9<0.001)a["1e-3:1e-4"]++; else if($9<0.01)a["1e-2:1e-3"]++; } \
END { for(i in a) print i,a[i] }' <(tail -n+2 inspect/AxiomGT1v2.explore.hwe) | sort -g > inspect/AxiomGT1v2.explore.hwe.histo
## Use this to further explore variants with extreme deviation from HWE:
cat inspect/AxiomGT1v2.explore.hwe | awk '{if(NR==1)print}{if($9<1e-50)print}' > inspect/AxiomGT1v2.explore.hwe.lowHWE
4.1.3. Genotype missingness in samples and variants
There are 2 missingness reports. One with the ".imiss" extension for samples and another one with the ".lmiss" extension for variants. Columns #6 and #5 are the missing call rate in both files respectively.
awk -v size=0.02 'BEGIN{OFS="\t";bmin=bmax=0}{ b=int($6/size); a[b]++; bmax=b>bmax?b:bmax; bmin=b<bmin?b:bmin } END { for(i=bmin;i<=bmax;++i) print i*size,(i+1)*size,a[i]/1 }' <(tail -n+2 inspect/AxiomGT1v2.explore.imiss) > inspect/AxiomGT1v2.explore.imiss.histo
awk -v size=0.02 'BEGIN{OFS="\t";bmin=bmax=0}{ b=int($5/size); a[b]++; bmax=b>bmax?b:bmax; bmin=b<bmin?b:bmin } END { for(i=bmin;i<=bmax;++i) print i*size,(i+1)*size,a[i]/1 }' <(tail -n+2 inspect/AxiomGT1v2.explore.lmiss) > inspect/AxiomGT1v2.explore.lmiss.histo
4.1.4. Variants with very low MAF
The allele frequency report calculates the minor allele frequencies (MAF) in column #5.
# Here are 2 different resolutions for a histogram of MAF:
awk -v size=0.01 'BEGIN{OFS="\t";bmin=bmax=0}{ b=int($5/size); a[b]++; bmax=b>bmax?b:bmax; bmin=b<bmin?b:bmin } END { for(i=bmin;i<=bmax;++i) print i*size,(i+1)*size,a[i]/1 }' <(tail -n+2 inspect/AxiomGT1v2.explore.frq) > inspect/AxiomGT1v2.explore.frq.histo
awk -v size=0.001 'BEGIN{OFS="\t";bmin=bmax=0}{ b=int($5/size); a[b]++; bmax=b>bmax?b:bmax; bmin=b<bmin?b:bmin } END { for(i=bmin;i<=bmax;++i) print i*size,(i+1)*size,a[i]/1 }' <(tail -n+2 inspect/AxiomGT1v2.explore.frq) > inspect/AxiomGT1v2.explore.frq.histo2
4.2. Filter low quality samples and variants
echo "Oldies grls2ef0ebb4" > het_excess.lst
tail -n+2 inspect/AxiomGT1v2.explore.het.lowHomo | awk '{print $1,$2}' > het_excess.lst
plink --bfile AxiomGT1v2.comp_merge.deDup.sexConfirm.public_ids --chr-set 38 no-xy --allow-extra-chr \
--hwe 1e-50 'midp' --geno 0.05 --mind 0.05 --maf 0.01 \
--remove het_excess.lst\
--make-bed --output-chr 'chrM' --out AxiomGT1v2.filtered
Now, there are 579771 variants and 3205 samples that passed filters and QC.