Site Tools


integrarray_qc_guidelines

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
integrarray_qc_guidelines [2025/12/15 19:16] 98.60.169.32integrarray_qc_guidelines [2025/12/23 16:14] (current) – [Molecular validation of findings.] 206.192.168.19
Line 2: Line 2:
  
  
-**//1. Genotype Calling//** +==== 1. Genotype Calling ====
  
 Genotypes were called for many samples at the Center for Inherited Disease Research (CIDR). The genotyping platform used was the Global Screening Plus Custom Array (here called the GSA array but also referenced as product by BPM Amos_Custom_20032937X382854_A1, genome build GRCh38/hg38), with the calling algorithm GenomeStudio version 2011.1, Genotyping Module version 1.9.4, GenTrain Version 1.0. The array consists of a total of 741,786 genotyped SNPs and 0 intensity-only probe. The custom component was configured with about 46,789 additional genotypes (based on 50,000 available beadtypes). The annotated file of SNPs configured for the array is denoted as‘Annotated file’. The requested SNPs and genes are included in ‘VariantListsforArray’. The CIDR quality control (QC) steps follow the R packages GWASTools (Gogarten SM et al. 2012), GENESIS (Conomos MP et al. 2017) and SNPRelate (Zheng X et al. 2012). The methods of QA/QC used by CIDR were described in (Laurie CC et al. 2010). Genotypes were called for many samples at the Center for Inherited Disease Research (CIDR). The genotyping platform used was the Global Screening Plus Custom Array (here called the GSA array but also referenced as product by BPM Amos_Custom_20032937X382854_A1, genome build GRCh38/hg38), with the calling algorithm GenomeStudio version 2011.1, Genotyping Module version 1.9.4, GenTrain Version 1.0. The array consists of a total of 741,786 genotyped SNPs and 0 intensity-only probe. The custom component was configured with about 46,789 additional genotypes (based on 50,000 available beadtypes). The annotated file of SNPs configured for the array is denoted as‘Annotated file’. The requested SNPs and genes are included in ‘VariantListsforArray’. The CIDR quality control (QC) steps follow the R packages GWASTools (Gogarten SM et al. 2012), GENESIS (Conomos MP et al. 2017) and SNPRelate (Zheng X et al. 2012). The methods of QA/QC used by CIDR were described in (Laurie CC et al. 2010).
  
-===== //2. Sample QC// =====+===== 2. Sample QC ===== 
  
-===== 2.1 Initial call rate filtering (by groups) ===== +==== 2.1 Initial call rate filtering (by groups) ====
  
 We propose to use less stringent QC filters than those implemented by CIDR. We propose to use less stringent QC filters than those implemented by CIDR.
Line 19: Line 19:
   *  Then, exclude SNPs with call rate <95% .   *  Then, exclude SNPs with call rate <95% .
  
-===== 2.2 Duplicate calling concordance =====+==== 2.2 Duplicate calling concordance ====
  
 If possible, consortia should share at least 10,000 SNPs among de-identified individuals to allow for identification of potential duplicated individuals. If possible, consortia should share at least 10,000 SNPs among de-identified individuals to allow for identification of potential duplicated individuals.
Line 26: Line 26:
  
  
-===== 2.3 Duplicate concordance ===== +==== 2.3 Duplicate concordance ====
  
 Identify duplicates within study. Identify duplicates within study.
Line 36: Line 36:
 This step has already been completed by CIDR for samples run by it. This step has already been completed by CIDR for samples run by it.
  
-===== 2.4 Sex checks ===== +==== 2.4 Sex checks ====
  
 Check that the reported sex of each participant aligns with the genetically inferred sex. The --check-sex option in PLINK will be used to assess consistency between reported and inferred sex. The pseudoautosomal regions need to be excluded first using the --split-x function. Check that the reported sex of each participant aligns with the genetically inferred sex. The --check-sex option in PLINK will be used to assess consistency between reported and inferred sex. The pseudoautosomal regions need to be excluded first using the --split-x function.
Line 45: Line 45:
 We will include XO, XXY, and XYY karyotypes when they occur but mark these unusual sex chromosome patterns. These karyotypes were likely called correctly in GenomeStudio; however, their inferred sex based on the X-chromosome inbreeding coefficient (F value) typically falls within the 0.2–0.8 range (ambiguous). Therefore, such samples should be manually retained. Samples with low X heterozygosity (<5%) or rare karyotypes (e.g., XXX) wil be removed as they are difficult to genotype will be excluded. Pipelines need to track all deleted individuals and the reasons for their deletion, as some investigators are specifically interested in sex chromosome anomalies and how these may relate to cancer risk and aging. We will include XO, XXY, and XYY karyotypes when they occur but mark these unusual sex chromosome patterns. These karyotypes were likely called correctly in GenomeStudio; however, their inferred sex based on the X-chromosome inbreeding coefficient (F value) typically falls within the 0.2–0.8 range (ambiguous). Therefore, such samples should be manually retained. Samples with low X heterozygosity (<5%) or rare karyotypes (e.g., XXX) wil be removed as they are difficult to genotype will be excluded. Pipelines need to track all deleted individuals and the reasons for their deletion, as some investigators are specifically interested in sex chromosome anomalies and how these may relate to cancer risk and aging.
  
-===== 2.4 Ancestry =====+==== 2.4 Ancestry ====
  
 A set of ~ 10,000 uncorrelated markers will be selected for ancestry inference. These markers will be used to classify individuals as European/East Asian/African American ancestry, or other ancestry using Fastpop (https://github.com/biomedicaldatascience/FastPop4) or Structure. The UNM group will perform principal components analysis (PCA) to identify continental ancestry using R. In addition, the UNM group will implement a PCA-based approach using a panel of about 10,000 GSA-specific markers or 25,732 pre-selected intra-European ancestry informative markers ([[https://github.com/biomedicaldatascience/AIPS]]) for deriving intra-European ancestry. A set of ~ 10,000 uncorrelated markers will be selected for ancestry inference. These markers will be used to classify individuals as European/East Asian/African American ancestry, or other ancestry using Fastpop (https://github.com/biomedicaldatascience/FastPop4) or Structure. The UNM group will perform principal components analysis (PCA) to identify continental ancestry using R. In addition, the UNM group will implement a PCA-based approach using a panel of about 10,000 GSA-specific markers or 25,732 pre-selected intra-European ancestry informative markers ([[https://github.com/biomedicaldatascience/AIPS]]) for deriving intra-European ancestry.
Line 52: Line 52:
 When performing PCA for ancestry inference, we usually exclude the 6p (HLA) region and regions of 8p23 and 17q21 that show inversions and strong LD patterns. There are additional regions with strong LD patterns that should to be considered for exclusion. Many of these regions have inversion polymorphisms driving the strong LD patterns (PMID: [[https://pubmed.ncbi.nlm.nih.gov/38617337/|38617337]]). When performing PCA for ancestry inference, we usually exclude the 6p (HLA) region and regions of 8p23 and 17q21 that show inversions and strong LD patterns. There are additional regions with strong LD patterns that should to be considered for exclusion. Many of these regions have inversion polymorphisms driving the strong LD patterns (PMID: [[https://pubmed.ncbi.nlm.nih.gov/38617337/|38617337]]).
  
-===== 2.5 Heterozygosity ====+==== 2.5 Heterozygosity ==== 
  
 Exclude samples with heterozygosity <5% or > 40%, or with heterozygosity deviation if p<10<sup>-6</sup>, (|Z|>4.892). Inflated heterozygosity suggests contaminated samples, while low heterozygosity suggested poor DNA quality. Perform heterozygosity test within each ancestral groups separately. Exclude samples with heterozygosity <5% or > 40%, or with heterozygosity deviation if p<10<sup>-6</sup>, (|Z|>4.892). Inflated heterozygosity suggests contaminated samples, while low heterozygosity suggested poor DNA quality. Perform heterozygosity test within each ancestral groups separately.
  
-===== 2.6 Relatives ===== +==== 2.6 Relatives ====
  
 To identify relatives, it is important to establish a set of markers with low fixation index (FST) across major populations. The UNM group will provide a set of the markers from the GSA component of the array. We use KING to identify relatives. For fixed-effects analysis, relatives who share an identify-by-descent (IBD) coefficient greater than 0.125 will be removed to prevent inflation of type I error. For mixed-effect model, all individuals will be retained as the model accounts for relatedness among the samples. To identify relatives, it is important to establish a set of markers with low fixation index (FST) across major populations. The UNM group will provide a set of the markers from the GSA component of the array. We use KING to identify relatives. For fixed-effects analysis, relatives who share an identify-by-descent (IBD) coefficient greater than 0.125 will be removed to prevent inflation of type I error. For mixed-effect model, all individuals will be retained as the model accounts for relatedness among the samples.
  
-===== 2.7 Cross study duplicates ===== +==== 2.7 Cross study duplicates ====
  
 It is useful if there are some shared duplicates across groups. If HapMap controls were studied we can provide a list of those that were genotyped by CIDR. It is useful if there are some shared duplicates across groups. If HapMap controls were studied we can provide a list of those that were genotyped by CIDR.
  
-===== //3. SNP QC across groups// =====+===== 3. SNP QC across groups ===== 
 + 
 + 
 +==== 3.1 Call rate ====
  
-===== 3.1 Call rate =====  
  
   *  Exclude SNPs zeroed by the cluster file with no genotypes.   *  Exclude SNPs zeroed by the cluster file with no genotypes.
Line 73: Line 76:
   *  Exclude samples with call rate <95%   *  Exclude samples with call rate <95%
   *  Exclude SNPs with call rate <95%   *  Exclude SNPs with call rate <95%
-**3.2 Hardy-Weinberg** +==== 3.2 Hardy-Weinberg ====
 Check Hardy-Weinberg: exclude SNP if P < 10<sub>-7</sub> in controls or P < 10<sub>-12</sub> in cases. Check Hardy-Weinberg: exclude SNP if P < 10<sub>-7</sub> in controls or P < 10<sub>-12</sub> in cases.
  
-===== //4. SNP QC Exclusions Combined Across Groups// =====+===== 4. SNP QC Exclusions Combined Across Groups ===== 
 + 
 +==== 4.1 Combine list of failures ====
  
-===== 4.1 Combine list of failures =====  
  
 All consortia should exclude SNPS that fail call rate or HWE thresholds in any participating group. All consortia should exclude SNPS that fail call rate or HWE thresholds in any participating group.
  
  
-===== 4.2 Duplicate probes ====+==== 4.2 Duplicate probes ==== 
 + 
  
 There are a number of variants on the chip sharing the same probe position (or a few with the same alleles but the sequence from the opposite strand.) I will ask CIDR if they identified any duplicated probes. Note that while the standard Gentrain algorithms exclude triallelic SNPs, this is not strictly necessary, and we may choose to retain these. There are a number of variants on the chip sharing the same probe position (or a few with the same alleles but the sequence from the opposite strand.) I will ask CIDR if they identified any duplicated probes. Note that while the standard Gentrain algorithms exclude triallelic SNPs, this is not strictly necessary, and we may choose to retain these.
  
-===== //5. Additional Steps Before Imputation// =====+===== 5. Additional Steps Before Imputation =====
  
-===== 5.1 Rare SNPs with poor call rate ===== +==== 5.1 Rare SNPs with poor call rate ====
  
 Exclude SNPs with call rate below 95% and MAF <0.001 in any group from the imputation input files. However, genotyped calls for these SNPs will remain available for analysis. Exclude SNPs with call rate below 95% and MAF <0.001 in any group from the imputation input files. However, genotyped calls for these SNPs will remain available for analysis.
  
-===== 5.2 Non-ideal cluster plots ===== +==== 5.2 Non-ideal cluster plots ====
  
 SNPs flagged as Possible (P) or Subset interference (S) in the second round of cluster plot checking will be excluded. These are either rare SNPs where there is no clear heterozygote cluster or SNPs with more than three clouds due to interference from other SNPs or potential copy number variation. SNPs flagged as Possible (P) or Subset interference (S) in the second round of cluster plot checking will be excluded. These are either rare SNPs where there is no clear heterozygote cluster or SNPs with more than three clouds due to interference from other SNPs or potential copy number variation.
  
  
-===== 5.3 Imputation standard ===== +==== 5.3 Imputation standard ====
  
 Imputation preparation for variants from chromosomes 1–22 and X was performed using preparation tool (version 4.3.0) available at https://www.chg.ox.ac.uk/~wrayner/tools/. Imputation was conducted using the TOPMed R3 reference panel (GRCh38 build) on the Michigan Imputation Server, with EAGLE used for phasing and Minimac4 applied for imputation. Imputation preparation for variants from chromosomes 1–22 and X was performed using preparation tool (version 4.3.0) available at https://www.chg.ox.ac.uk/~wrayner/tools/. Imputation was conducted using the TOPMed R3 reference panel (GRCh38 build) on the Michigan Imputation Server, with EAGLE used for phasing and Minimac4 applied for imputation.
  
-===== 6. Principal Components ===== +===== 6. Principal Components =====
  
 **PCs will be** //Defined for the Integrarray and validated against some consortium-specific PC definitions.// Separate PC sets will be generated for the European and Asian subsets, which could serve as covariates in ancestry-specific analysis, plus a global set to use for those of mixed ethnicity. It may also be important, where possible, to look at genomic inflation in individual studies to determine whether specific PCs are required (e.g., Finland, HMBCS). **PCs will be** //Defined for the Integrarray and validated against some consortium-specific PC definitions.// Separate PC sets will be generated for the European and Asian subsets, which could serve as covariates in ancestry-specific analysis, plus a global set to use for those of mixed ethnicity. It may also be important, where possible, to look at genomic inflation in individual studies to determine whether specific PCs are required (e.g., Finland, HMBCS).
  
 The figure below describes either i) classification of ancestry using PCA (shown by most likely descent ellipses) or ii) assignment of continental origin to individuals based on the closest location on the continental ancestry triangle. We prefer the latter approach as the ancestry can then be used as a covariate in analyses or for subsequent selection. The figure below describes either i) classification of ancestry using PCA (shown by most likely descent ellipses) or ii) assignment of continental origin to individuals based on the closest location on the continental ancestry triangle. We prefer the latter approach as the ancestry can then be used as a covariate in analyses or for subsequent selection.
 +{{ :cluster_in_pca_figure.png?500 |}}
  
 (All lists will become available on the Integarray wiki: https://www.integralu19.org/index.php/Integarray/dokuwiki (All lists will become available on the Integarray wiki: https://www.integralu19.org/index.php/Integarray/dokuwiki
Line 113: Line 119:
  
  
-Analytical plans 
  
  
-Main Integral Site 
  
 +
 +
 +
 +===== Analytical plans =====
 +
 +
 +
 +
 +Main Integral Site
 The Integral analytical team has developed and implemented procedures using a mixed-effect model with the SAIGE program (https://saigegit.github.io/SAIGE-doc/). The advantage of the mixed effects model is that it allows all samples to be jointly analyzed irrespective of ancestry or relatedness, but there are several challenges associated with using this software package. First, one must derive a set of markers that show low levels of variation among ancestral groups. This set of markers is used with The Integral analytical team has developed and implemented procedures using a mixed-effect model with the SAIGE program (https://saigegit.github.io/SAIGE-doc/). The advantage of the mixed effects model is that it allows all samples to be jointly analyzed irrespective of ancestry or relatedness, but there are several challenges associated with using this software package. First, one must derive a set of markers that show low levels of variation among ancestral groups. This set of markers is used with
  
Line 123: Line 136:
  
  
-Collaborating sites+===== Collaborating sites ===== 
  
 We imagine that collaborating centers will rather use a fixed effects model. With fixed effects modelling, the first step is to remove related individuals. The same approach to identify markers with low FST across ethnic populations should be applied, otherwise within population similarity in allele frequencies will inflate the IBD estimates derived from KING or other similar software, leading to an unnecessary reduction in sample size. Second, individuals showing IBD values more than 0.10 should be removed to reduce false positives due to relatives decreasing the standard errors. Third, Fastpop or similar programs should be used to classify individuals into disjoint populations that show similar allele frequencies to avoid confounding between allele frequencies and case/control prevalences (these are likely to vary among populations, resulting in confounding). GWAS are then conducted within populations and meta-analysis is conducted across populations. While the main Integral site will conduct its primary analysis using a mixed-effect analytical scheme, we will also conduct a fixed-effects analysis to understand variability in effect sizes among populations and to assist with combining data within populations across all the data sites. We imagine that collaborating centers will rather use a fixed effects model. With fixed effects modelling, the first step is to remove related individuals. The same approach to identify markers with low FST across ethnic populations should be applied, otherwise within population similarity in allele frequencies will inflate the IBD estimates derived from KING or other similar software, leading to an unnecessary reduction in sample size. Second, individuals showing IBD values more than 0.10 should be removed to reduce false positives due to relatives decreasing the standard errors. Third, Fastpop or similar programs should be used to classify individuals into disjoint populations that show similar allele frequencies to avoid confounding between allele frequencies and case/control prevalences (these are likely to vary among populations, resulting in confounding). GWAS are then conducted within populations and meta-analysis is conducted across populations. While the main Integral site will conduct its primary analysis using a mixed-effect analytical scheme, we will also conduct a fixed-effects analysis to understand variability in effect sizes among populations and to assist with combining data within populations across all the data sites.
  
  
-Covariates and stratifying variables.+===== Covariates and stratifying variables. ===== 
  
 We have consistently found that different genetic factors are prominent by histological subgroups. Therefore, we will perform analyses for overall lung cancer and also according to major strata that include adenocarcinoma, squamous carcinoma and small cell carcinoma. We will not adjust for smoking behaviors because that will reduce the observed effects of genetic factors that affect lung cancer risk through their effects on smoking behavior. However, for variants that reach genome-wide significance, we will conduct post-GWAS analyses to identify the risk associated with variants according to smoking status (never, ever or never, current, former). We also do not anticipate adjusting for age or sex because these factors rarely substantially change the observed odds ratios for associations with lung cancer risk in large studies. Future studies will evaluate sex-specific, age-specific effects and smoking-specific effects on risk. We have consistently found that different genetic factors are prominent by histological subgroups. Therefore, we will perform analyses for overall lung cancer and also according to major strata that include adenocarcinoma, squamous carcinoma and small cell carcinoma. We will not adjust for smoking behaviors because that will reduce the observed effects of genetic factors that affect lung cancer risk through their effects on smoking behavior. However, for variants that reach genome-wide significance, we will conduct post-GWAS analyses to identify the risk associated with variants according to smoking status (never, ever or never, current, former). We also do not anticipate adjusting for age or sex because these factors rarely substantially change the observed odds ratios for associations with lung cancer risk in large studies. Future studies will evaluate sex-specific, age-specific effects and smoking-specific effects on risk.
  
  
-Transcriptome-wide association study (TWAS):+===== Transcriptome-wide association study (TWAS): ===== 
  
 There has been an increasing emphasis in understanding mechanisms by which genetic factors influence lung cancer risk. In particular, single-cell TWAS has emerged as a very prominent approach for understanding how specific genes and SNPs influence cancer development. We anticipate that a post-GWAS analysis will include both bulk and single-cell TWAS to help with the interpretation of findings from this study. If this is not feasible, we will perform bulk RNA-based TWAS analysis. There has been an increasing emphasis in understanding mechanisms by which genetic factors influence lung cancer risk. In particular, single-cell TWAS has emerged as a very prominent approach for understanding how specific genes and SNPs influence cancer development. We anticipate that a post-GWAS analysis will include both bulk and single-cell TWAS to help with the interpretation of findings from this study. If this is not feasible, we will perform bulk RNA-based TWAS analysis.
  
  
-Post-GWAS annotation of findings.+===== Post-GWAS annotation of findings. ===== 
  
 Our Harvard collaborators have developed FAVOR (https://favor.genohub.org/), which annotates GWAS findings using a wide range of features, including epigenetic changes, functional effects on proteins and the effects on transcription of the genetic changes observed. We will use this software to help in understanding the findings from our study. Our Harvard collaborators have developed FAVOR (https://favor.genohub.org/), which annotates GWAS findings using a wide range of features, including epigenetic changes, functional effects on proteins and the effects on transcription of the genetic changes observed. We will use this software to help in understanding the findings from our study.
  
  
-Molecular validation of findings.+===== Molecular validation of findings. ===== 
  
 If possible, we will establish collaboration with molecular biologists to validate some of the top new findings from this study. We will provide suggestions for validations to collaborators once the main analysis has been completed so that the molecular biologists have sufficient time to pursue studies. If possible, we will establish collaboration with molecular biologists to validate some of the top new findings from this study. We will provide suggestions for validations to collaborators once the main analysis has been completed so that the molecular biologists have sufficient time to pursue studies.
Line 153: Line 171:
 Overall PCA for independent markers: Overall PCA for independent markers:
  
 +{{ :31949_ind_pc1_vs_pc2.png?500 |}}
  
 The QC steps prior to imputation were refined based on Chris’ suggestions. The QC steps prior to imputation were refined based on Chris’ suggestions.
Line 164: Line 182:
  
  
-No samples were dropped; all samples will be included in the imputation+No samples were dropped; all samples will be included in the imputation, the marker lists of all steps can be found in file: {{ :integrarray_qc_guidelines_marker_list.xlsx |}}
- +
-· 735,370 markers in the original release (chr1–chr23) +
- +
-· 942 markers with ≥3 discordant calls among 149 duplicate pairs +
- +
-· 3,552 markers with p < 1×10⁻⁷ in controls +
- +
-· 2,052 markers with p < 1×10⁻¹² in cases +
- +
-· 23,112 markers with call rate < 0.95 +
- +
-· 27,119 total unique markers affected by filters (3), (4), (5), and (6)+
  
-· 708,672 markers remaining after removing the filtered markers+  *  **735,370** markers in the original release (chr1–chr23) 
 +  *  **942** markers with ≥3 discordant calls among 149 duplicate pairs, missing genotypes were not included 
 +  *  **3,552** markers with p < 1×10⁻⁷ in 14,537 unrelated caucasian controls 
 +  *  **2,052** markers with p < 1×10⁻¹² in 9895 unrelated caucasian cases 
 +  *  **23,112** markers with call rate < 0.95 
 +  *  **27,119** total unique markers affected by filters discordant calls, HWE and call rate 
 +  *  **708,672** markers remaining after removing the filtered markers 
 +  *  **632,677** markers retained after running the McCarthy Group Tool workflow against the TOPMed R3 reference panel
  
-· 615,894 markers retained after running the McCarthy Group Tool workflow against the TOPMed R3 reference panel 
integrarray_qc_guidelines.1765826199.txt.gz · Last modified: by 98.60.169.32