A test for balanced coverage across cases and controls as a qualifying criterion in collapsing analysis.


Frank Cannon
6 years ago

Background and Motivation:

Collapsing analyses test the association of qualifying rare variants in defined genomic regions (e.g., all consensus coding sequence, or CCDS, boundaries) with a disease phenotype. Qualifying criteria for variants in these analyses include variant quality (e.g., read depth, genotype quality), predicted functional effect (e.g., a protein-coding change) and population frequency (minor allele frequency in ExAC or control datasets). Before these analyses can be carried out, however, it is essential to control and minimize signal artifacts arising from differences in sequencing coverage between cases and controls.

In the current IGM collapsing analysis framework, the penultimate step before the collapsing analysis on cases and controls is site coverage harmonization (SCH) [1]. For each genomic site interrogated in the collapsing run (e.g., CCDS sites), we calculate the fraction of cases and the fraction of controls covered at a predetermined coverage threshold (e.g., 10X), and then the absolute difference in fractional coverage between cases and controls. We then calculate the mean absolute difference over all sites and subtract it from the absolute difference at each site to obtain the deviation from the mean difference, which is squared to define the variation value for that site. The resulting variation estimates across the millions of CCDS sites are sorted from largest to smallest and plotted as a cumulative sum of variation. The plot is then rotated by 45° to find its peak. In other words, (y - x) is plotted against x, and the x value at which (y - x) is maximized gives the suggested cutoff index. Any site where the absolute fractional difference is above the resulting threshold is excluded from the subsequent collapsing analysis.

This method is effective at pruning sites where the fractional difference between cases and controls is high enough to bias collapsing studies. However, because the absolute fractional coverage difference is not normalized to the cohort mean, we prune well-covered sites that have a marginally larger coverage difference while retaining poorly covered sites with a slightly smaller difference. For instance, at an absolute difference threshold of 0.05, a site with fractional coverage 0.89 in cases and 0.95 in controls (difference 0.06) will be pruned, while a site with fractional coverage 0.12 in cases and 0.17 in controls (difference 0.05) will be retained. Additionally, computing the fractional coverage difference across all sites adds considerable computational effort to collapsing runs. In a typical rare variant collapsing run, only about 300K sites in the CCDS regions have a qualifying variant. The CCDS comprises ~33M bases, so for every analysis the pre-computation of coverage balance constitutes a roughly 100-fold excess of computational load.
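
The SCH ranking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ATAV implementation: the function name `sch_rank_cutoff` is hypothetical, both axes of the cumulative plot are assumed to be rescaled to [0, 1] before (y - x) is taken, and the sketch stops at the cutoff rank because the text does not spell out how that index is converted into a difference threshold.

```python
def sch_rank_cutoff(case_frac, ctrl_frac):
    """Rank sites by coverage-difference variation and locate the (y - x) peak.

    case_frac / ctrl_frac: per-site fraction of cases / controls covered at
    the chosen depth threshold (e.g. 10X).
    Returns (cutoff_rank, ranked_sites), where ranked_sites holds site indices
    sorted from largest to smallest variation.
    """
    n = len(case_frac)
    if n == 0 or n != len(ctrl_frac):
        raise ValueError("need two non-empty lists of equal length")

    # Absolute difference in fractional coverage at each site.
    abs_diff = [abs(a - b) for a, b in zip(case_frac, ctrl_frac)]
    mean_diff = sum(abs_diff) / n
    # Squared deviation from the mean difference: the per-site variation.
    variation = [(d - mean_diff) ** 2 for d in abs_diff]

    ranked_sites = sorted(range(n), key=lambda i: variation[i], reverse=True)
    total = sum(variation)
    if total == 0.0:
        return 0, ranked_sites  # every site has the same difference

    cutoff_rank, best_gap, running = 0, float("-inf"), 0.0
    for rank, site in enumerate(ranked_sites, start=1):
        running += variation[site]
        y = running / total  # assumption: cumulative sum scaled to [0, 1]
        x = rank / n         # assumption: rank scaled to [0, 1]
        if y - x > best_gap:
            cutoff_rank, best_gap = rank, y - x
    return cutoff_rank, ranked_sites


# Toy data: eight balanced sites and two sites with a 0.30 coverage gap.
case_frac = [0.90] * 8 + [0.90, 0.50]
ctrl_frac = [0.91] * 8 + [0.60, 0.80]
cutoff_rank, ranked_sites = sch_rank_cutoff(case_frac, ctrl_frac)
top_ranked = ranked_sites[:cutoff_rank]  # sites ranked before the peak
```

Note that the ranking needs the variation value of every site before a single cutoff can be chosen, which is why the pre-computation has to cover all CCDS bases rather than only the sites carrying a qualifying variant.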

Methods:

To reduce the retention of poorly covered sites at the expense of highly covered sites, we impose a statistical test of independence between case/control status and coverage. At a given site, let x be the number of cases covered at 10X, y the number of controls covered at 10X, s the total number of cases and t the total number of controls. We can model the number of covered cases X as a binomial random variable:

$$X \sim \mathrm{Bin}\left(n = \text{number of covered samples},\; p = P(\text{case} \mid \text{covered})\right)$$

If case/control status and coverage status are independent, then:

$$P(\text{case} \mid \text{covered}) = P(\text{case}) = \frac{s}{s + t}$$

This allows us to perform a two-sided binomial test on the observed number of covered cases, x:

$$\mathrm{BinomTest}\left(k = x,\; n = x + y,\; p = \frac{s}{s + t}\right)$$

The binomial test described above can be executed independently at each site, which allows the computation to be parallelized. It also removes the need to pre-compute the fractional coverage difference at all CCDS sites in order to identify a threshold difference, as the SCH method requires. We can apply the binomial test of coverage bias as an additional qualifying criterion only at sites where an otherwise qualifying variant has been identified in a sample, resulting in a roughly 100-fold decrease in computational burden.

Results:

We implemented a binomial test of independence between coverage and case/control status as an additional qualifying criterion in ATAV. We used two IGM cohorts to compare the binomial test method with the SCH method: (1) a chronic kidney disease (CKD) cohort with ~10,000 controls and ~1,700 cases, and (2) an idiopathic pulmonary fibrosis (IPF) cohort with ~4,000 controls and ~200 cases. For each cohort, we analyzed CCDS sites (S) using the SCH method and compiled the list of sites (S_SCH) that would be pruned before subsequent collapsing analysis. Independently, we performed the binomial coverage test described above at every CCDS site for the same cohort and identified the sites (S_binom) with a nominally significant p-value (p < 0.05) to be pruned prior to collapsing analysis. Finally, we executed a collapsing analysis on the cohort over all CCDS sites without any coverage analysis method (S_QV).
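
The per-site test from the Methods can be sketched with the Python standard library alone. This is an illustrative sketch rather than the ATAV code: the function names are hypothetical, and the two-sided p-value uses the common convention of summing the probabilities of all outcomes no more likely than the observed one, which is an assumption because the text does not state which two-sided definition was used.

```python
import math


def binom_log_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def coverage_binom_test(x, y, s, t):
    """Two-sided binomial test of coverage balance at one site.

    x, y: number of cases / controls covered at the depth threshold.
    s, t: total number of cases / controls in the cohort.
    Returns the p-value under H0: P(case | covered) = s / (s + t).
    """
    if s <= 0 or t <= 0:
        raise ValueError("both cases and controls are required")
    n = x + y
    if n == 0:
        return 1.0  # nothing covered, so no evidence of imbalance
    p = s / (s + t)
    log_obs = binom_log_pmf(x, n, p)
    tol = 1e-7  # tolerance (log scale) so ties with the observed mass count
    total = 0.0
    for k in range(n + 1):
        log_pk = binom_log_pmf(k, n, p)
        if log_pk <= log_obs + tol:
            total += math.exp(log_pk)
    return min(1.0, total)


# Hypothetical counts for a cohort of 200 cases and 4,000 controls.
balanced = coverage_binom_test(x=180, y=3600, s=200, t=4000)
imbalanced = coverage_binom_test(x=100, y=1000, s=200, t=4000)
prune_site = imbalanced < 0.05  # nominal threshold used in the text
```

Because each call depends only on the four counts for one site, the test can be evaluated lazily at sites that already carry a qualifying variant, with no cohort-wide pre-computation.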

S_QV represents the set of sites where a qualifying variant satisfying typical qualifying criteria for variant quality, function and minor allele frequency is present in at least one sample. We then calculated:

Qualifying sites pruned by both methods = S_SCH ∩ S_binom
Qualifying sites uniquely pruned by the SCH method = S_SCH - S_binom
Qualifying sites uniquely pruned by the binomial test method = S_binom - S_SCH

Table 1. Sites pruned by coverage analysis methods. Columns: CKD cohort (MAF ), CKD cohort (MAF ), IPF cohort. Rows: sites pruned by both methods; sites pruned by SCH only; sites pruned by binomial test only.

For all analyses, we found that the SCH method pruned sites vastly in excess of those pruned by the binomial test method (S_SCH - S_binom >> S_binom - S_SCH). We then examined the mean coverage of the pruned sites to evaluate the overall coverage of the sites pruned by each method. We are typically interested in sites with high coverage across the cohort, where a sample has an increased probability of carrying a variant that satisfies the qualifying criteria. We evaluated the fractional coverage difference determined by the SCH method against the binomial test p-value at each site (Figure 1A). Sites pruned by the SCH method but retained by the binomial test had high overall coverage across the cohort (mean fractional coverage across all sites = 0.86, Figure 1B), while sites pruned by the binomial test but retained by SCH had low coverage (mean fractional coverage across all sites = 0.13, Figure 1C), implying that the binomial test is capable of rescuing sites with high coverage that are otherwise pruned by the SCH method.
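
The three site categories compared above are plain set operations. A minimal sketch, assuming sites are represented as hypothetical (chromosome, position) tuples and that both pruned lists are first restricted to sites carrying a qualifying variant:

```python
def compare_pruned_sites(s_qv, s_sch, s_binom):
    """Split qualifying sites by which coverage method would prune them."""
    sch = s_sch & s_qv      # qualifying sites pruned by SCH
    binom = s_binom & s_qv  # qualifying sites pruned by the binomial test
    return {
        "both": sch & binom,
        "sch_only": sch - binom,
        "binom_only": binom - sch,
    }


# Made-up sites, for illustration only.
s_qv = {("1", 100), ("1", 200), ("2", 300), ("2", 400)}
s_sch = {("1", 100), ("1", 200), ("3", 999)}
s_binom = {("1", 200), ("2", 300)}
counts = {k: len(v) for k, v in compare_pruned_sites(s_qv, s_sch, s_binom).items()}
```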

Figure 1. (A) Scatter plot of the absolute difference in coverage fraction against the binomial test p-value for 100,000 CCDS sites. The lower left quadrant represents sites that are pruned due to a nominally significant binomial test p-value (p < 0.05) but retained by the SCH method. The upper right quadrant represents sites that are retained by the binomial test but pruned by the SCH method. (B) Frequency histogram of cohort fractional coverage for sites retained by the SCH method and pruned by the binomial test. (C) Frequency histogram of cohort fractional coverage for sites retained by the binomial test method and pruned by SCH.

Inflation:

Additionally, we measured the inflation in collapsing results using lambda (the ratio of observed to expected p-value at the 50th percentile of gene p-values after collapsing) to check for unforeseen biases introduced by the binomial test. In the two cohorts we evaluated, there was no significant difference in the inflation factor between the two methods, with the binomial test method performing nominally better.

Table 2: Lambda from collapsing analysis using SCH or the binomial test to control for coverage imbalance. Columns: Lambda SCH, Lambda Binom-test. Rows: IPF cohort; CKD cohort ( MAF); CKD cohort ( MAF).

Qualifying variants in top collapsing genes:

We counted the number of variants pruned uniquely by either SCH or the binomial test method within the ten most significant collapsing analysis genes for each analysis. The binomial test method rescued several qualifying variants (QVs) in top collapsing genes in each analysis, while the SCH method did not rescue any top-gene QVs in any of the analyses.

Table 3: Number of rescued qualifying variants in the top 10 most significant collapsing analysis genes

Cohort               # Binom. test rescued QVs   # SCH rescued QVs
IPF cohort           42                          0
CKD cohort ( MAF)    5                           0
CKD cohort ( MAF)    2                           0

ATAV runtime:

Eliminating the SCH step in ATAV substantially reduces the overall time needed to complete a full collapsing analysis. For the ~11,700-sample CKD cohort, replacing the SCH step with the binomial test method reduced ATAV runtime by ~26 hours, while runtime for the IPF cohort decreased by ~13 hours. These reductions are equivalent to around half of the total runtime. Although runtime measurements are affected by the overall ATAV load at the time of analysis and are therefore subject to variation, it is clear that the binomial test method has the potential to greatly improve the speed of collapsing analysis.

Conclusions:

We implemented a test of independence between coverage and case/control status as a qualifying criterion in collapsing analysis. Our test of coverage independence rescued sites with reasonably balanced coverage that were pruned by the SCH method. In general, we found a large overlap between the sites pruned by the two methods for reasons of coverage imbalance. However, the binomial test uniquely retained […]-fold more sites than it uniquely pruned when compared to SCH. The binomial test method could evaluate several thousand additional variant sites in the CCDS region that are pruned by SCH. The inflation factor, measured by lambda, was not significantly altered between the two methods.

Typical collapsing runs already require coverage data for the entire cohort to establish the minor allele frequency of a variant. Adding a coverage comparison test on otherwise qualifying variants therefore added only marginally to the compute time for an analysis. Implementing the coverage test as part of the collapsing run resulted in a 50% reduction in ATAV compute load and collapsing analysis time through the elimination of a previously necessary coverage harmonization step. The binomial test for independence of coverage and case/control status is thus a computationally efficient and robust method to control for coverage imbalance in collapsing analysis.

REFERENCES:

1. Petrovski, S., et al., An Exome Sequencing Study to Assess the Role of Rare Genetic Variation in Pulmonary Fibrosis. Am J Respir Crit Care Med, (1): p
