3. Prep Array Sets
Genotyping data are provided as 2 datasets for Affymetrix (ThermoFisher) Axiom Canine HD Array sets A and B. Both datasets must be merged into one final dataset, but this requires resolving replicate SNPs, known technical replicates and gender conflicts. Finally, sample IDs in the genotyping files must match those in the phenotype files.
3.1. Replicate SNPs
Both arrays have a number of replicate SNPs which are useful for QC but also require special attention for proper merging of the files. A QC merging mode of PLINK is used to identify mismatching non-missing calls between the 2 arrays.
plink --bfile GRLS_Genotyping/setB/AxiomGT1.bin --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
--bmerge GRLS_Genotyping/setA/AxiomGT1.bin --merge-mode 7 \
--output-chr 'chrM' --out AxiomGT1.mismatching7
Here is the output of this command:
3339 samples loaded from GRLS_Genotyping/setB/AxiomGT1.bin.fam.
3355 samples to be merged from GRLS_Genotyping/setA/AxiomGT1.bin.fam.
Of these, 17 are new, while 3338 are present in the base dataset.
547232 markers loaded from GRLS_Genotyping/setB/AxiomGT1.bin.bim.
397685 markers to be merged from GRLS_Genotyping/setA/AxiomGT1.bin.bim.
Of these, 368215 are new, while 29470 are present in the base dataset.
Warning: Variants 'Affx-206088448' and 'Affx-205939940' have the same position.
Warning: Variants 'Affx-205859745' and 'Affx-205344060' have the same position.
Warning: Variants 'Affx-206706550' and 'Affx-205359377' have the same position.
1460 more same-position warnings: see log file.
Performing 1-pass diff (mode 7), writing results to AxiomGT1.mismatching7.diff
98370860 overlapping calls, 97897061 non-missing in both filesets.
97593330 concordant, for a concordance rate of 0.996897.
The concordance rate of non-missing genotypes is pretty good (99.7%). However, there are still two issues to handle:
-
Mismatching genotypes: Use the following code to have a closer look at the mismatching genotypes to see if they are uniformly distributed among the samples or not.
tail -n+2 AxiomGT1.mismatching7.diff | awk '{print $2,$3}' | sort | uniq -c | sort -k1,1nr > AxiomGT1.mismatching7.diff.samples
It seems that two samples show exceptional higher rate of mismatching genotypes:
GRLS grlsU85RZ2FF_1
andGRLS grlsQZAMFHKK_1
have 11676 and 11440 mismatches respectively. The latter sample was predicted to be female on Array A and male on Array B according to the genotyping analysis notes. This likely indicates that these samples had something wrong. Therefore, both samples will be excluded from further analysis during the merging step. -
Variants having the same position: The mismatching analysis reveals 1463 markers that have the same position on both arrays but with different SNP IDs. Further digging in the array annotation shows that most of these markers are identical with minor differences in the flanking sequences. To avoid genotyping errors, these SNPs will be excluded from array B.
grep "Warning: Variants .* have the same position" AxiomGT1.mismatching7.log | awk -F"'" 'BEGIN{OFS="\n";}{print $2,$4}' > same_pos.arrB.lst plink --bfile GRLS_Genotyping/setB/AxiomGT1.bin --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \ --exclude same_pos.arrB.lst \ --make-bed --output-chr 'chrM' --out GRLS_Genotyping/setB/AxiomGT1.bin_noSamePos
3.2. Merging of array sets A and B genotyping data
Now, the genotyping data of both arrays should be merged. For SNPs shared between the arrays, the genotypes of array A will overwrite the non-missing calls in array B. Also, the 2 samples with higher rates of non-concordance will be excluded.
echo "GRLS grlsU85RZ2FF_1|GRLS grlsQZAMFHKK_1" | tr '|' '\n' > hiMismatches_samples.lst
plink --bfile GRLS_Genotyping/setB/AxiomGT1.bin_noSamePos --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
--bmerge GRLS_Genotyping/setA/AxiomGT1.bin --merge-mode 3 \
--remove hiMismatches_samples.lst \
--make-bed --output-chr 'chrM' --out AxiomGT1v2.comp_merge
The output of the merging command provides some useful statistics:
...
Warning: 1355 het. haploid genotypes present (see AxiomGT1v2.comp_merge.hh );
Warning: Nonmissing nonmale Y chromosome genotype(s) present;
...
Total genotyping rate in remaining samples is 0.994631.
913984 variants and 3354 samples pass filters and QC.
...
The first warning is usually caused by male heterozygous calls in the X chromosome pseudo-autosomal region. Looking at AxiomGT1v2.comp_merge.hh
shows that all the 1355 heterozygous haploid genotypes belong to sample GRLS grlsJCVWWRMM_1
. According to the genotyping analysis notes, this sample is reported as a male in metadata as well as by the genotyping algorithm on array B, but the algorithm failed to predict its gender on array A. This sample will be handled later during the check for gender accuracy.
The second warning indicates that some non-male samples have Y chromosome genotypes. Most PLINK commands (e.g. --freq) treat these genotypes as missing. This behavior is intended for safety, however it makes it tricky to identify these markers. Here is a workaround to find these markers:
mkdir -p temp_nonmale
cat AxiomGT1v2.comp_merge.fam | awk '{if($5=="2")print}' > temp_nonmale/female.lst
plink --bfile AxiomGT1v2.comp_merge --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
--keep temp_nonmale/female.lst --chr y \
--recode --output-chr 'chrM' --out temp_nonmale/AxiomGT1v2.comp_merge.fe
## Stats need to be generated but PLINK treats Y genotypes in females as missing. Fake non-Y positions are usedfor these markers
sed -i 's/chrY/chr_notY/' temp_nonmale/AxiomGT1v2.comp_merge.fe.map
plink --file temp_nonmale/AxiomGT1v2.comp_merge.fe --chr-set 38 no-y --allow-no-sex --allow-extra-chr \
--freq counts \
--output-chr 'chrM' --out temp_nonmale/AxiomGT1v2.comp_merge.fe.explore
tail -n+2 temp_nonmale/AxiomGT1v2.comp_merge.fe.explore.frq.counts | awk '{if($5!=0 || $6!=0)print}' | wc -l
3.3. Identification and removal of duplicate samples
Plink2 has an efficient function to calculate the KING-robust kinship estimator and filter duplicate samples as well. Duplicate samples have kinship coefficients ~0.5, first-degree relations (parent-child, full siblings) correspond to ~0.25, second-degree relations correspond to ~0.125, etc. The conventional cutoff of ~0.354 (the geometric mean of 0.5 and 0.25) will be used here to identify and filter duplicate samples.
plink2 --bfile AxiomGT1v2.comp_merge --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
--king-cutoff 0.354 \
--out AxiomGT1v2.comp_merge.nodup
The log file indicates that the input file has 3354 samples while the output of the deduplication step will end up having 3236 samples. Now, the samples selected to be removed by Plink2 kinship are compared to the list of samples planned to run in duplicates as listed in the input metadata file map_id_sex.tab
.
## This code will identify all the samples planned to run in duplicates in `map_id_sex.tab`.
## Then find the number of replicates remaining after exclusion of those selected by the Plink2 kinship filter.
## The expectations is that one replicate will remain from each group of replicate samples.
awk 'BEGIN{FS=OFS="\t"}FNR==NR{a[$3]+=1;next}/^Family_ID/{print}{if(a[$3]>1)print}' GRLS_Genotyping/map_id_sex.tab GRLS_Genotyping/map_id_sex.tab > dup_samples.tab
awk 'BEGIN{FS=OFS="\t"}FNR==NR{a[$1 FS $2]=1;next}{if(!a[$1 FS $2])print $0}' AxiomGT1v2.comp_merge.nodup.king.cutoff.out.id dup_samples.tab > dup_samples_remain.tab
tail -n+2 dup_samples_remain.tab | cut -f3 | sort | uniq -c | sort -k1,1nr
2 grlsFGFIVFXX
1 grls1XIBPR44
1 grls3GG6Z3GG
1 grls4IH8IF22
.
.
It seems that 2 duplicates of sample grlsFGFIVFXX
are still remaining! Let's try to calculate their KING kinship to see how similar they are.
grep grlsFGFIVFXX dup_samples.tab > failed_deDup.lst
plink2 --bfile AxiomGT1v2.comp_merge --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
--keep failed_deDup.lst --make-king-table 'cols=maybefid,id,maybesid,nsnp,hethet,ibs0,ibs1,kinship' \
--out AxiomGT1v2.comp_merge.failed_deDup
cat AxiomGT1v2.comp_merge.failed_deDup.kin0
Here is the output:
FID1 | ID1 | FID2 | ID2 | NSNP | HETHET | IBS0 | HET1_HOM2 | HET2_HOM1 | KINSHIP |
---|---|---|---|---|---|---|---|---|---|
GRLS | grlsFGFIVFXX_2 | GRLS | grlsFGFIVFXX_1 | 879956 | 0.0645396 | 0.0414475 | 0.11505 | 0.112207 | -0.0559478 |
- NSNP: Number of variants considered (autosomal, neither call missing)
- HETHET: Proportion of considered call pairs which are het-het
- IBS0: Proportion of considered call pairs which are opposite homs
- HET1_HOM2: Proportion of sample 1 het, sample 2 hom
- HET2_HOM1: Proportion of sample 1 hom, sample 2 het
- KINSHIP: KING-robust between-family kinship estimate
It is obvious that these two samples are unrelated. Therefore, both of them should be excluded with the duplicate samples identified by the Plink2 kinship filter. There are 120 samples included in this total, so there will end up being 3234 samples in the output PLINK file
plink2 --bfile AxiomGT1v2.comp_merge --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
--remove <(cat AxiomGT1v2.comp_merge.nodup.king.cutoff.out.id failed_deDup.lst) \
--make-bed --output-chr 'chrM' --out AxiomGT1v2.comp_merge.deDup
3.4. Check for gender accuracy & remove samples with the incorrect gender identities
According to the genotyping analysis notes, there are two samples GRLS grlsQZAMFHKK_1
and GRLS grlsJCVWWRMM_1
that had discordant computed gender on the two arrays. The former was removed already because of the high rate of mismatching genotypes between the 2 arrays (see 3.1. Replicate SNPs). The latter showed high rate of heterozygous haploid genotypes despite being reported as male in metadata (see 3.2. Merging of array sets). Therefore, this sample should be discarded as well from further analysis.
There are an additional 9 samples with concordant gender on both arrays but different from the metadata. This information can be reproduced by comparing known gender in metadata file map_id_sex.tab
versus the computed gender based on genotyping data in the PLINK files.
echo "Family_ID Individual_ID computed_sex metadata_sex" | tr ' ' '\t' > gender_conflict.lst
echo "GRLS grlsJCVWWRMM_1 ? 1" | tr ' ' '\t' >> gender_conflict.lst
awk 'BEGIN{FS=OFS="\t"}FNR==NR{a[$1 FS $2]=$4;next}{if(a[$1 FS $2] && a[$1 FS $2]!=$5)print $1,$2,$5,a[$1 FS $2]}' GRLS_Genotyping/map_id_sex.tab AxiomGT1v2.comp_merge.deDup.fam >> gender_conflict.lst
Incorrect gender could be a mistake in the metadata or an indication for sample swap. Therefore, it is safer to exclude these samples from the genotyping data.
plink2 --bfile AxiomGT1v2.comp_merge.deDup --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
--remove gender_conflict.lst \
--make-bed --output-chr 'chrM' --out AxiomGT1v2.comp_merge.deDup.sexConfirm
3.5. Update sample IDs in the genotyping files to match the phenotyping files
The sample IDs in the genotyping files have a suffix that identify the duplicates of the same sample. The samples have already been deduplicated in the genotyping files (see 3.3. Removal of duplicate samples). Now the IDs need to be updated to exactly match the public_ids (e.g. grlsH764T844) used in phenotypic data files.
tail -n+2 GRLS_Genotyping/map_id_sex.tab | awk 'BEGIN{FS=OFS="\t"}{print $1,$2,$1,$3}' > public_ids.update.lst
plink2 --bfile AxiomGT1v2.comp_merge.deDup.sexConfirm --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
--update-ids public_ids.update.lst \
--make-bed --output-chr 'chrM' --out AxiomGT1v2.comp_merge.deDup.sexConfirm.public_ids