Skip to content

3. Prep Array Sets

Genotyping data are provided as 2 datasets for Affymetrix (ThermoFisher) Axiom Canine HD Array sets A and B. Both datasets must be merged into one final dataset, but this requires resolving replicate SNPs, known technical replicates and gender conflicts. Finally, sample IDs in the genotyping files must match those in the phenotype files.

3.1. Replicate SNPs

Both arrays have a number of replicate SNPs which are useful for QC but also require special attention for proper merging of the files. A QC merging mode of PLINK is used to identify mismatching non-missing calls between the 2 arrays.

plink --bfile GRLS_Genotyping/setB/AxiomGT1.bin --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
      --bmerge GRLS_Genotyping/setA/AxiomGT1.bin --merge-mode 7 \
      --output-chr 'chrM' --out AxiomGT1.mismatching7

Here is the output of this command:

3339 samples loaded from GRLS_Genotyping/setB/AxiomGT1.bin.fam.
3355 samples to be merged from GRLS_Genotyping/setA/AxiomGT1.bin.fam.
Of these, 17 are new, while 3338 are present in the base dataset.
547232 markers loaded from GRLS_Genotyping/setB/AxiomGT1.bin.bim.
397685 markers to be merged from GRLS_Genotyping/setA/AxiomGT1.bin.bim.
Of these, 368215 are new, while 29470 are present in the base dataset.
Warning: Variants 'Affx-206088448' and 'Affx-205939940' have the same position.
Warning: Variants 'Affx-205859745' and 'Affx-205344060' have the same position.
Warning: Variants 'Affx-206706550' and 'Affx-205359377' have the same position.
1460 more same-position warnings: see log file.
Performing 1-pass diff (mode 7), writing results to AxiomGT1.mismatching7.diff
98370860 overlapping calls, 97897061 non-missing in both filesets.
97593330 concordant, for a concordance rate of 0.996897.

The concordance rate of non-missing genotypes is pretty good (99.7%). However, there are still two issues to handle:

  1. Mismatching genotypes: Use the following code to have a closer look at the mismatching genotypes to see if they are uniformly distributed among the samples or not.

    tail -n+2 AxiomGT1.mismatching7.diff | awk '{print $2,$3}' | sort | uniq -c | sort -k1,1nr > AxiomGT1.mismatching7.diff.samples
    

    It seems that two samples show exceptional higher rate of mismatching genotypes: GRLS grlsU85RZ2FF_1 and GRLS grlsQZAMFHKK_1 have 11676 and 11440 mismatches respectively. The latter sample was predicted to be female on Array A and male on Array B according to the genotyping analysis notes. This likely indicates that these samples had something wrong. Therefore, both samples will be excluded from further analysis during the merging step.

  2. Variants having the same position: The mismatching analysis reveals 1463 markers that have the same position on both arrays but with different SNP IDs. Further digging in the array annotation shows that most of these markers are identical with minor differences in the flanking sequences. To avoid genotyping errors, these SNPs will be excluded from array B.

    grep "Warning: Variants .* have the same position" AxiomGT1.mismatching7.log | awk -F"'" 'BEGIN{OFS="\n";}{print $2,$4}' > same_pos.arrB.lst
    plink --bfile GRLS_Genotyping/setB/AxiomGT1.bin --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
      --exclude same_pos.arrB.lst \
      --make-bed --output-chr 'chrM' --out GRLS_Genotyping/setB/AxiomGT1.bin_noSamePos
    

3.2. Merging of array sets A and B genotyping data

Now, the genotyping data of both arrays should be merged. For SNPs shared between the arrays, the genotypes of array A will overwrite the non-missing calls in array B. Also, the 2 samples with higher rates of non-concordance will be excluded.

echo "GRLS grlsU85RZ2FF_1|GRLS grlsQZAMFHKK_1" | tr '|' '\n' > hiMismatches_samples.lst
plink --bfile GRLS_Genotyping/setB/AxiomGT1.bin_noSamePos --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
      --bmerge GRLS_Genotyping/setA/AxiomGT1.bin --merge-mode 3 \
      --remove hiMismatches_samples.lst \
      --make-bed --output-chr 'chrM' --out AxiomGT1v2.comp_merge

The output of the merging command provides some useful statistics:

...
Warning: 1355 het. haploid genotypes present (see AxiomGT1v2.comp_merge.hh );
Warning: Nonmissing nonmale Y chromosome genotype(s) present;
...
Total genotyping rate in remaining samples is 0.994631.
913984 variants and 3354 samples pass filters and QC.
...

The first warning is usually caused by male heterozygous calls in the X chromosome pseudo-autosomal region. Looking at AxiomGT1v2.comp_merge.hh shows that all the 1355 heterozygous haploid genotypes belong to sample GRLS grlsJCVWWRMM_1. According to the genotyping analysis notes, this sample is reported as a male in metadata as well as by the genotyping algorithm on array B, but the algorithm failed to predict its gender on array A. This sample will be handled later during the check for gender accuracy.

The second warning indicates that some non-male samples have Y chromosome genotypes. Most PLINK commands (e.g. --freq) treat these genotypes as missing. This behavior is intended for safety, however it makes it tricky to identify these markers. Here is a workaround to find these markers:

mkdir -p temp_nonmale
cat AxiomGT1v2.comp_merge.fam | awk '{if($5=="2")print}' > temp_nonmale/female.lst
plink --bfile AxiomGT1v2.comp_merge --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
      --keep temp_nonmale/female.lst --chr y \
      --recode --output-chr 'chrM' --out temp_nonmale/AxiomGT1v2.comp_merge.fe

## Stats need to be generated but PLINK treats Y genotypes in females as missing. Fake non-Y positions are usedfor these markers       
sed -i 's/chrY/chr_notY/' temp_nonmale/AxiomGT1v2.comp_merge.fe.map 

plink --file temp_nonmale/AxiomGT1v2.comp_merge.fe --chr-set 38 no-y --allow-no-sex --allow-extra-chr \
      --freq counts \
      --output-chr 'chrM' --out temp_nonmale/AxiomGT1v2.comp_merge.fe.explore
tail -n+2 temp_nonmale/AxiomGT1v2.comp_merge.fe.explore.frq.counts | awk '{if($5!=0 || $6!=0)print}' | wc -l
37 markers non-missing Y chromosome genotypes in females are identified. These markers could be in the pseudo-autosomal region, but unfortunately there are no known boundaries to those regions in dogs. As mentioned above, it is safe to ignore these markers in PLINK.

3.3. Identification and removal of duplicate samples

Plink2 has an efficient function to calculate the KING-robust kinship estimator and filter duplicate samples as well. Duplicate samples have kinship coefficients ~0.5, first-degree relations (parent-child, full siblings) correspond to ~0.25, second-degree relations correspond to ~0.125, etc. The conventional cutoff of ~0.354 (the geometric mean of 0.5 and 0.25) will be used here to identify and filter duplicate samples.

plink2 --bfile AxiomGT1v2.comp_merge --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
       --king-cutoff 0.354 \
       --out AxiomGT1v2.comp_merge.nodup

The log file indicates that the input file has 3354 samples while the output of the deduplication step will end up having 3236 samples. Now, the samples selected to be removed by Plink2 kinship are compared to the list of samples planned to run in duplicates as listed in the input metadata file map_id_sex.tab.

## This code will identify all the samples planned to run in duplicates in `map_id_sex.tab`.
## Then find the number of replicates remaining after exclusion of those selected by the Plink2 kinship filter.
## The expectations is that one replicate will remain from each group of replicate samples.

awk 'BEGIN{FS=OFS="\t"}FNR==NR{a[$3]+=1;next}/^Family_ID/{print}{if(a[$3]>1)print}' GRLS_Genotyping/map_id_sex.tab GRLS_Genotyping/map_id_sex.tab > dup_samples.tab
awk 'BEGIN{FS=OFS="\t"}FNR==NR{a[$1 FS $2]=1;next}{if(!a[$1 FS $2])print $0}' AxiomGT1v2.comp_merge.nodup.king.cutoff.out.id dup_samples.tab > dup_samples_remain.tab
tail -n+2 dup_samples_remain.tab | cut -f3 | sort | uniq -c | sort -k1,1nr
Hear is a list for the count of remaining replicates per sample:

2     grlsFGFIVFXX
1     grls1XIBPR44
1     grls3GG6Z3GG
1     grls4IH8IF22
.
.

It seems that 2 duplicates of sample grlsFGFIVFXX are still remaining! Let's try to calculate their KING kinship to see how similar they are.

grep grlsFGFIVFXX dup_samples.tab > failed_deDup.lst
plink2 --bfile AxiomGT1v2.comp_merge --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
       --keep failed_deDup.lst --make-king-table 'cols=maybefid,id,maybesid,nsnp,hethet,ibs0,ibs1,kinship' \
       --out AxiomGT1v2.comp_merge.failed_deDup
cat AxiomGT1v2.comp_merge.failed_deDup.kin0

Here is the output:

FID1 ID1 FID2 ID2 NSNP HETHET IBS0 HET1_HOM2 HET2_HOM1 KINSHIP
GRLS grlsFGFIVFXX_2 GRLS grlsFGFIVFXX_1 879956 0.0645396 0.0414475 0.11505 0.112207 -0.0559478
  • NSNP: Number of variants considered (autosomal, neither call missing)
  • HETHET: Proportion of considered call pairs which are het-het
  • IBS0: Proportion of considered call pairs which are opposite homs
  • HET1_HOM2: Proportion of sample 1 het, sample 2 hom
  • HET2_HOM1: Proportion of sample 1 hom, sample 2 het
  • KINSHIP: KING-robust between-family kinship estimate

It is obvious that these two samples are unrelated. Therefore, both of them should be excluded with the duplicate samples identified by the Plink2 kinship filter. There are 120 samples included in this total, so there will end up being 3234 samples in the output PLINK file

plink2 --bfile AxiomGT1v2.comp_merge --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
       --remove <(cat AxiomGT1v2.comp_merge.nodup.king.cutoff.out.id failed_deDup.lst) \
       --make-bed --output-chr 'chrM' --out AxiomGT1v2.comp_merge.deDup

3.4. Check for gender accuracy & remove samples with the incorrect gender identities

According to the genotyping analysis notes, there are two samples GRLS grlsQZAMFHKK_1 and GRLS grlsJCVWWRMM_1 that had discordant computed gender on the two arrays. The former was removed already because of the high rate of mismatching genotypes between the 2 arrays (see 3.1. Replicate SNPs). The latter showed high rate of heterozygous haploid genotypes despite being reported as male in metadata (see 3.2. Merging of array sets). Therefore, this sample should be discarded as well from further analysis.

There are an additional 9 samples with concordant gender on both arrays but different from the metadata. This information can be reproduced by comparing known gender in metadata file map_id_sex.tab versus the computed gender based on genotyping data in the PLINK files.

echo "Family_ID Individual_ID computed_sex metadata_sex" | tr ' ' '\t' > gender_conflict.lst
echo "GRLS grlsJCVWWRMM_1 ? 1" | tr ' ' '\t' >> gender_conflict.lst
awk 'BEGIN{FS=OFS="\t"}FNR==NR{a[$1 FS $2]=$4;next}{if(a[$1 FS $2] && a[$1 FS $2]!=$5)print $1,$2,$5,a[$1 FS $2]}' GRLS_Genotyping/map_id_sex.tab AxiomGT1v2.comp_merge.deDup.fam >> gender_conflict.lst

Incorrect gender could be a mistake in the metadata or an indication for sample swap. Therefore, it is safer to exclude these samples from the genotyping data.

plink2 --bfile AxiomGT1v2.comp_merge.deDup --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
       --remove gender_conflict.lst \
       --make-bed --output-chr 'chrM' --out AxiomGT1v2.comp_merge.deDup.sexConfirm
The output of the last PLINK command indicates that the final file has 3224 samples (1615 females, 1609 males; 3224 founders)

3.5. Update sample IDs in the genotyping files to match the phenotyping files

The sample IDs in the genotyping files have a suffix that identify the duplicates of the same sample. The samples have already been deduplicated in the genotyping files (see 3.3. Removal of duplicate samples). Now the IDs need to be updated to exactly match the public_ids (e.g. grlsH764T844) used in phenotypic data files.

tail -n+2 GRLS_Genotyping/map_id_sex.tab | awk 'BEGIN{FS=OFS="\t"}{print $1,$2,$1,$3}' > public_ids.update.lst 
plink2 --bfile AxiomGT1v2.comp_merge.deDup.sexConfirm --chr-set 38 no-xy --allow-no-sex --allow-extra-chr \
       --update-ids public_ids.update.lst \
       --make-bed --output-chr 'chrM' --out AxiomGT1v2.comp_merge.deDup.sexConfirm.public_ids