Site Tools


integrarray_qc_guidelines

Integrarray QC Guidelines – November 25, 2025

1. Genotype Calling

Genotypes were called for many samples at the Center for Inherited Disease Research (CIDR). The genotyping platform used was the Global Screening Plus Custom Array (here called the GSA array but also referenced as product by BPM Amos_Custom_20032937X382854_A1, genome build GRCh38/hg38), with the calling algorithm GenomeStudio version 2011.1, Genotyping Module version 1.9.4, GenTrain Version 1.0. The array consists of a total of 741,786 genotyped SNPs and 0 intensity-only probe. The custom component was configured with about 46,789 additional genotypes (based on 50,000 available beadtypes). The annotated file of SNPs configured for the array is denoted as‘Annotated file’. The requested SNPs and genes are included in ‘VariantListsforArray’. The CIDR quality control (QC) steps follow the R packages GWASTools (Gogarten SM et al. 2012), GENESIS (Conomos MP et al. 2017) and SNPRelate (Zheng X et al. 2012). The methods of QA/QC used by CIDR were described in (Laurie CC et al. 2010).

2. Sample QC

2.1 Initial call rate filtering (by groups)

We propose to use less stringent QC filters than those implemented by CIDR. Outlined below are the QC steps that will ensure adequate alignment with these processes.

  • Exclude samples with call rate <80%.
  • Then, exclude SNPs with call rate <80%.
  • Then, exclude samples with call rate <95%.
  • Then, exclude SNPs with call rate <95% .

2.2 Duplicate calling concordance

If possible, consortia should share at least 10,000 SNPs among de-identified individuals to allow for identification of potential duplicated individuals.

For duplicated samples in the CIDR genotyping pipeline, duplicates that showed more than 2 differences per sample were removed. This seems an appropriate threshold to minimize genotyping error, but we will relaxed that to allow for 3 errors among the known duplicates. The CIDR analysis included 274 reference duplicates and identified 191 SNPs to exclude.

2.3 Duplicate concordance

Identify duplicates within study.

Verify expected duplicates: if the duplicate samples are consistent, retain the sample with the higher call rate and exclude the sample with lower call rate.

Identify unexpected duplicates within studies: Liaise with study data-managers to attempt to resolve any discrepancies, remove both if discrepancies cannot be resolved.

This step has already been completed by CIDR for samples run by it.

2.4 Sex checks

Check that the reported sex of each participant aligns with the genetically inferred sex. The –check-sex option in PLINK will be used to assess consistency between reported and inferred sex. The pseudoautosomal regions need to be excluded first using the –split-x function.

Males with low Y-intensity can be retained. When there are irreconcilable differences between the reported and inferred sex, we will drop those individuals as they may indicate errors in sample processing. The inbreeding coefficient (F value) derived from X-chromosome heterozygosity, together with the Y-chromosome call rate, can be used jointly to infer biological sex based on defined thresholds.

We will include XO, XXY, and XYY karyotypes when they occur but mark these unusual sex chromosome patterns. These karyotypes were likely called correctly in GenomeStudio; however, their inferred sex based on the X-chromosome inbreeding coefficient (F value) typically falls within the 0.2–0.8 range (ambiguous). Therefore, such samples should be manually retained. Samples with low X heterozygosity (<5%) or rare karyotypes (e.g., XXX) wil be removed as they are difficult to genotype will be excluded. Pipelines need to track all deleted individuals and the reasons for their deletion, as some investigators are specifically interested in sex chromosome anomalies and how these may relate to cancer risk and aging.

2.4 Ancestry

A set of ~ 10,000 uncorrelated markers will be selected for ancestry inference. These markers will be used to classify individuals as European/East Asian/African American ancestry, or other ancestry using Fastpop (https://github.com/biomedicaldatascience/FastPop4) or Structure. The UNM group will perform principal components analysis (PCA) to identify continental ancestry using R. In addition, the UNM group will implement a PCA-based approach using a panel of about 10,000 GSA-specific markers or 25,732 pre-selected intra-European ancestry informative markers (https://github.com/biomedicaldatascience/AIPS) for deriving intra-European ancestry.

When performing PCA for ancestry inference, we usually exclude the 6p (HLA) region and regions of 8p23 and 17q21 that show inversions and strong LD patterns. There are additional regions with strong LD patterns that should to be considered for exclusion. Many of these regions have inversion polymorphisms driving the strong LD patterns (PMID: 38617337).

2.5 Heterozygosity

Exclude samples with heterozygosity <5% or > 40%, or with heterozygosity deviation if p<10-6, (|Z|>4.892). Inflated heterozygosity suggests contaminated samples, while low heterozygosity suggested poor DNA quality. Perform heterozygosity test within each ancestral groups separately.

2.6 Relatives

To identify relatives, it is important to establish a set of markers with low fixation index (FST) across major populations. The UNM group will provide a set of the markers from the GSA component of the array. We use KING to identify relatives. For fixed-effects analysis, relatives who share an identify-by-descent (IBD) coefficient greater than 0.125 will be removed to prevent inflation of type I error. For mixed-effect model, all individuals will be retained as the model accounts for relatedness among the samples.

2.7 Cross study duplicates

It is useful if there are some shared duplicates across groups. If HapMap controls were studied we can provide a list of those that were genotyped by CIDR.

3. SNP QC across groups

3.1 Call rate

  • Exclude SNPs zeroed by the cluster file with no genotypes.
  • Exclude samples with call rate <80%
  • Exclude SNPs with call rate <80%
  • Exclude samples with call rate <95%
  • Exclude SNPs with call rate <95%

3.2 Hardy-Weinberg

Check Hardy-Weinberg: exclude SNP if P < 10-7 in controls or P < 10-12 in cases.

4. SNP QC Exclusions Combined Across Groups

4.1 Combine list of failures

All consortia should exclude SNPS that fail call rate or HWE thresholds in any participating group.

4.2 Duplicate probes

There are a number of variants on the chip sharing the same probe position (or a few with the same alleles but the sequence from the opposite strand.) I will ask CIDR if they identified any duplicated probes. Note that while the standard Gentrain algorithms exclude triallelic SNPs, this is not strictly necessary, and we may choose to retain these.

5. Additional Steps Before Imputation

5.1 Rare SNPs with poor call rate

Exclude SNPs with call rate below 95% and MAF <0.001 in any group from the imputation input files. However, genotyped calls for these SNPs will remain available for analysis.

5.2 Non-ideal cluster plots

SNPs flagged as Possible (P) or Subset interference (S) in the second round of cluster plot checking will be excluded. These are either rare SNPs where there is no clear heterozygote cluster or SNPs with more than three clouds due to interference from other SNPs or potential copy number variation.

5.3 Imputation standard

Imputation preparation for variants from chromosomes 1–22 and X was performed using preparation tool (version 4.3.0) available at https://www.chg.ox.ac.uk/~wrayner/tools/. Imputation was conducted using the TOPMed R3 reference panel (GRCh38 build) on the Michigan Imputation Server, with EAGLE used for phasing and Minimac4 applied for imputation.

6. Principal Components

PCs will be Defined for the Integrarray and validated against some consortium-specific PC definitions. Separate PC sets will be generated for the European and Asian subsets, which could serve as covariates in ancestry-specific analysis, plus a global set to use for those of mixed ethnicity. It may also be important, where possible, to look at genomic inflation in individual studies to determine whether specific PCs are required (e.g., Finland, HMBCS).

The figure below describes either i) classification of ancestry using PCA (shown by most likely descent ellipses) or ii) assignment of continental origin to individuals based on the closest location on the continental ancestry triangle. We prefer the latter approach as the ancestry can then be used as a covariate in analyses or for subsequent selection.

(All lists will become available on the Integarray wiki: https://www.integralu19.org/index.php/Integarray/dokuwiki

currently request an email to receive the information from ciamos@salud.unm.edu)

Analytical plans

Main Integral Site The Integral analytical team has developed and implemented procedures using a mixed-effect model with the SAIGE program (https://saigegit.github.io/SAIGE-doc/). The advantage of the mixed effects model is that it allows all samples to be jointly analyzed irrespective of ancestry or relatedness, but there are several challenges associated with using this software package. First, one must derive a set of markers that show low levels of variation among ancestral groups. This set of markers is used with

KING to estimate the relationships among all the individuals in the sample and so ensure that test statistics for association are not inflated due to inclusion of related individuals (which can reduce standard errors and inflate test statistics). Second, one has to include principal components to adjust for ancestry and other factors that may also cause correlations among samples. To control possible noise, we will include the first 10 principal components. Because we will use data from several different studies in a mega-analysis, we will need to identify markers that are shared among the platforms to construct these principal components. Last, because we will perform mega-analysis we will include a study-specific covariate for each study. We will have at least 9 studies joined together for this analysis.

Collaborating sites

We imagine that collaborating centers will rather use a fixed effects model. With fixed effects modelling, the first step is to remove related individuals. The same approach to identify markers with low FST across ethnic populations should be applied, otherwise within population similarity in allele frequencies will inflate the IBD estimates derived from KING or other similar software, leading to an unnecessary reduction in sample size. Second, individuals showing IBD values more than 0.10 should be removed to reduce false positives due to relatives decreasing the standard errors. Third, Fastpop or similar programs should be used to classify individuals into disjoint populations that show similar allele frequencies to avoid confounding between allele frequencies and case/control prevalences (these are likely to vary among populations, resulting in confounding). GWAS are then conducted within populations and meta-analysis is conducted across populations. While the main Integral site will conduct its primary analysis using a mixed-effect analytical scheme, we will also conduct a fixed-effects analysis to understand variability in effect sizes among populations and to assist with combining data within populations across all the data sites.

Covariates and stratifying variables.

We have consistently found that different genetic factors are prominent by histological subgroups. Therefore, we will perform analyses for overall lung cancer and also according to major strata that include adenocarcinoma, squamous carcinoma and small cell carcinoma. We will not adjust for smoking behaviors because that will reduce the observed effects of genetic factors that affect lung cancer risk through their effects on smoking behavior. However, for variants that reach genome-wide significance, we will conduct post-GWAS analyses to identify the risk associated with variants according to smoking status (never, ever or never, current, former). We also do not anticipate adjusting for age or sex because these factors rarely substantially change the observed odds ratios for associations with lung cancer risk in large studies. Future studies will evaluate sex-specific, age-specific effects and smoking-specific effects on risk.

Transcriptome-wide association study (TWAS):

There has been an increasing emphasis in understanding mechanisms by which genetic factors influence lung cancer risk. In particular, single-cell TWAS has emerged as a very prominent approach for understanding how specific genes and SNPs influence cancer development. We anticipate that a post-GWAS analysis will include both bulk and single-cell TWAS to help with the interpretation of findings from this study. If this is not feasible, we will perform bulk RNA-based TWAS analysis.

Post-GWAS annotation of findings.

Our Harvard collaborators have developed FAVOR (https://favor.genohub.org/), which annotates GWAS findings using a wide range of features, including epigenetic changes, functional effects on proteins and the effects on transcription of the genetic changes observed. We will use this software to help in understanding the findings from our study.

Molecular validation of findings.

If possible, we will establish collaboration with molecular biologists to validate some of the top new findings from this study. We will provide suggestions for validations to collaborators once the main analysis has been completed so that the molecular biologists have sufficient time to pursue studies.

Status of pipeline: 11/25/2025

Overall PCA for independent markers:

The QC steps prior to imputation were refined based on Chris’ suggestions.

The main change is that unrelated samples, together with 89 CEU samples, were clustered and visually inspected.

Samples within the circle around the CEU centroid (highlighted in red) represent a stringent cutoff for defining a homogeneous European cluster.

The distance to the centroid is somewhat arbitrary, but is intentionally conservative in the current setting.

No samples were dropped; all samples will be included in the imputation, the marker lists of all steps can be found in file: integrarray_qc_guidelines_marker_list.xlsx

  • 735,370 markers in the original release (chr1–chr23)
  • 942 markers with ≥3 discordant calls among 149 duplicate pairs, missing genotypes were not included
  • 3,552 markers with p < 1×10⁻⁷ in 14,537 unrelated caucasian controls
  • 2,052 markers with p < 1×10⁻¹² in 9895 unrelated caucasian cases
  • 23,112 markers with call rate < 0.95
  • 27,119 total unique markers affected by filters discordant calls, HWE and call rate
  • 708,672 markers remaining after removing the filtered markers
  • 632,677 markers retained after running the McCarthy Group Tool workflow against the TOPMed R3 reference panel
integrarray_qc_guidelines.txt · Last modified: by 206.192.168.19