Evaluation of three read-depth based CNV detection tools using whole-exome sequencing data
© The Author(s). 2017
Received: 26 April 2017
Accepted: 15 August 2017
Published: 23 August 2017
Whole exome sequencing (WES) has been widely accepted as a robust and cost-effective approach for clinical genetic testing of small sequence variants. Detection of copy number variants (CNVs) within WES data has become possible through the development of various algorithms and software programs that use read depth as the main source of information. The aim of this study was to evaluate three commonly used WES read-depth based CNV detection programs using high-resolution chromosomal microarray analysis (CMA) as the standard.
Paired CMA and WES data were acquired for 45 samples. A total of 219 CNVs (size range 2.3 kb – 35 Mb) identified on three CMA platforms (Affymetrix, Agilent and Illumina) were used as standards. CNVs were called from WES data using XHMM, CoNIFER, and CNVnator with modified settings.
All three software packages detected an elevated proportion of small variants (< 20 kb) compared to CMA. XHMM and CoNIFER had poor detection sensitivity (22.2 and 14.6%), which correlated with the number of capture probes involved. CNVnator detected the most variants and had better sensitivity (87.7%); however, it called an overwhelming number of small CNVs below 20 kb, which required further confirmation. Variant size was overestimated by CNVnator and underestimated by XHMM and CoNIFER.
The low concordance among CNVs detected by three different read-depth based programs indicates the immature status of WES-based CNV detection. The low sensitivity and uncertain specificity of WES-based CNV detection in comparison with CMA-based CNV detection suggest that CMA will continue to play an important role in detecting clinical grade CNVs in the NGS era, which is largely based on WES.
Keywords: Clinical sequencing, Copy number variants, Whole exome sequencing, Structural variation
Copy number variants (CNVs) are important human genomic variants known to be responsible for Mendelian disorders as well as for common genetic conditions such as autism, intellectual disability, and schizophrenia [1–3]. Chromosomal microarray analysis (CMA) has demonstrated its technical validity and has remained the method of choice for the detection of genome-wide CNVs in clinical settings. It has also demonstrated its clinical validity for both pre- and postnatal diagnostic testing [4, 5]. CMA is currently regarded as the gold standard for detection of CNVs that range from several kilobases to several megabases in size [6, 7].
The advent of next-generation sequencing (NGS) technology has dramatically improved our capability for examining small-scale sequence variants; it has also provided new options for the evaluation of large-scale structural variants such as CNVs. Whole-exome sequencing (WES) has been accepted as the most comprehensive test currently implemented in the clinical setting for small sequence variants [9, 10]. Much effort has been devoted to generating CNV information from WES data; however, low sensitivity and high false positive rates have been reported in previous studies that used cancer cell lines or publicly available exome data, or that compared against CNV calling based on whole genome sequencing data [14–16]. Thus, the technical validity of WES-based CNV detection has yet to be thoroughly evaluated.
Here, we evaluated three representative and popular read-depth based CNV detection programs using clinical grade WES data: the eXome-Hidden Markov Model (XHMM), Copy Number Inference From Exome Reads (CoNIFER), and CNVnator. XHMM and CoNIFER detect rare CNVs based on a batched-comparison principle, while CNVnator detects CNVs based on a mean-shift approach within single samples. CNVs detected on the CMA platforms were used as the reference standard.
Samples and ethics statement
A total of 45 clinical diagnostic samples were collected from the Shanghai Children’s Medical Centre and the Maternal and Child Health Hospital of the Guangxi Zhuang Autonomous Region with the approval of the respective institutional ethics review committees. Genomic DNA was extracted using the QIAamp Blood DNA Mini kit® (Qiagen GmbH, Hilden, Germany).
WES and WES-based CNV detection
Exome targets were captured using the Agilent SureSelect Human All Exon V4 or V5 kit (Agilent Technologies, Santa Clara, CA). Raw sequencing data (FASTQ format) were generated on the Illumina HiSeq 2000 platform (Illumina, Inc., San Diego, CA). The Burrows-Wheeler Alignment tool (BWA) v0.2.10 was used to align the sequencing data to the Human Reference Genome (NCBI build 37, hg19). All data were assessed for quality using FastQC (version 0.11.2) (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
CNVs were called using the following three CNV detection programs: (1) XHMM v1.0, (2) CoNIFER v0.2.2, and (3) CNVnator v0.2.7. XHMM includes several analytic steps and involves a number of parameters. In our study, we kept all parameters for filtering samples and targets at their defaults (minTargetSize: 10; maxTargetSize: 10,000; minMeanTargetRD: 10; maxMeanTargetRD: 500; minMeanSampleRD: 25; maxMeanSampleRD: 200; maxSdSampleRD: 150) and prepared the data for normalization via XHMM. The only parameter that could be adjusted in CoNIFER was SVD, which was set to 1. For CNVnator, we set the bin size to 50–60 according to the average coverage depth of our sequencing data (45–70 X). XHMM and CoNIFER took pooled samples as input for calling, whereas CNVnator called CNVs sample by sample after generating a baseline for each sample individually.
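The target and sample filters listed above can be illustrated with a minimal Python sketch. It is not XHMM code: the function names, the data layout (one list of per-target depths per sample), and the toy numbers are all hypothetical, and the upper bound on mean sample depth (200) is an assumption made for this sketch. It only shows how thresholds of this kind would exclude short or poorly covered targets and low-depth samples before normalization.

```python
from statistics import mean, pstdev

# Toy inputs (hypothetical): capture targets and per-sample mean depth per target.
targets = [("chr1", 100, 300), ("chr1", 500, 505), ("chr2", 1000, 1400)]
depth = {
    "S1": [60, 40, 3],
    "S2": [55, 35, 5],
    "S3": [8, 4, 2],
}


def filter_targets(targets, depth, min_size=10, max_size=10_000,
                   min_mean_rd=10, max_mean_rd=500):
    """Return indices of targets that pass the size and mean-depth thresholds."""
    keep = []
    for i, (_chrom, start, end) in enumerate(targets):
        size = end - start
        target_mean = mean(row[i] for row in depth.values())
        if min_size <= size <= max_size and min_mean_rd <= target_mean <= max_mean_rd:
            keep.append(i)
    return keep


def filter_samples(depth, min_mean_rd=25, max_mean_rd=200, max_sd_rd=150):
    """Return names of samples whose mean depth and depth SD are within bounds.

    max_mean_rd=200 is an assumed value for this sketch.
    """
    keep = []
    for name, row in depth.items():
        if min_mean_rd <= mean(row) <= max_mean_rd and pstdev(row) <= max_sd_rd:
            keep.append(name)
    return keep


kept_targets = filter_targets(targets, depth)
kept_samples = filter_samples(depth)
print(kept_targets, kept_samples)
```

In this toy input, the 5 bp target and the target with a mean depth below 10 are dropped, as is the sample with a mean depth below 25. Regions removed at this stage cannot yield CNV calls later, whatever their true copy number.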
CMA and CMA-based CNV detection
CMA was performed using three different array platforms: the SurePrint G3 customized array (Agilent Technologies, Santa Clara, CA), CytoScan HD (Affymetrix, Santa Clara, CA), and Infinium iSelect HD and HTS Custom Genotyping BeadChips (Illumina, San Diego, CA). Previously validated settings for each platform were used consistently for CNV detection and filtering. CNVs in the size range of 2 kb – 400 kb were detected via CMA and were further confirmed by manual inspection.
Quality control of WES data
Fourteen samples were prepared using the Agilent SureSelect Human All Exon V4 kit and the remaining samples were prepared using the V5 kit. The mean read depth of all samples was approximately 50 X and the average read quality was well above the standard of 20 X. Details of the sequence data are available in the supplemental data (Additional file 1: Table S1).
Size distribution of CNV detected via CMA and WES
Precision of CNV detection
Characteristics of CNV missed by exome data (Fig. 3 (C))
Poor concordance among three programs (Fig. 3 (D))
Detection of clinically relevant variants
All variants were evaluated against our in-house standard for clinically relevant variants, and eight CNVs ranging from 306 kb to 35 Mb were categorized as pathogenic or likely pathogenic. Six of these variants were detected by the WES programs. A 306 kb variant on chromosome 3 remained undetected due to particularly low capture probe coverage within the variant region. Another variant of 11 Mb on chromosome 2 remained undetected despite sufficient capture probes and depth of coverage (Additional file 3: Table S3).
Copy number variants (CNVs) are a very important target in the clinical diagnosis of genetic diseases. CMA has proven to be the most stable and accurate platform for CNV detection and has been implemented as a clinical test for more than a decade. NGS now provides a new approach for detecting CNVs, which could potentially replace CMA. Before NGS-based CNV detection is implemented, extensive validation is required to establish the validity of the new method.
Numerous WES-based CNV detection programs have been developed, including the 15 read-depth based CNV detection tools currently available. We selected three representative and well-known methods for this study. XHMM is the most widely adopted of these tools. It employs a classical hidden Markov model (HMM) for CNV identification, and a sensitivity of 8–14% has been reported for it against CNV detection based on WGS data. The XHMM framework starts from aligned BAM files to calculate the depth of coverage and then normalizes the read depths via principal component analysis (PCA). Finally, XHMM uses the normalized data to train and run the HMM for CNV detection.
CoNIFER was the first tool developed to handle rare CNVs across multiple samples and has been chosen as a representative program that can serve as a reference when evaluating newer software. CoNIFER calculates RPKM (reads per kilobase per million mapped reads) values for each sample and uses singular value decomposition (SVD), a method from linear algebra, to reduce the dimensionality of the data so that clear CNV signals can be detected. An evaluation against the array CGH platform in breast cancer samples characterized CoNIFER as producing many false positives, low sensitivity, and an obvious duplication bias. Another study showed that CoNIFER achieves higher precision, but at the cost of a sensitivity below 5%. XHMM and CoNIFER have been evaluated in parallel in patients with nonsyndromic hearing loss, showing poor concordance in the size of detected CNVs. However, both tools are noted for their ability to identify rare CNVs from a population of WES samples.
CNVnator was previously used on whole genome data for read-depth based CNV identification and was assessed to achieve better resolution of CNV borders than other WGS data-based tools. Its main methodology is a mean-shift approach. The software first divides the whole genome into equal sized, non-overlapping bins and treats the number of mapped reads in each bin as the read depth (RD) signal. To estimate the copy number change in each genome segment, it then calculates the P-value of a one-sample t-test of whether the mean RD signal of the segment differs from the genome average. In a comprehensive comparison study, CNVnator was assessed to be outstanding in breakpoint position and copy number estimation; however, discordance of variants was also found among all tools evaluated in that study.
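Two of the read-depth signals described above can be made concrete with a short Python sketch: the RPKM value used by CoNIFER, and the binned read depth with a one-sample t statistic as described for CNVnator. This is an illustration under stated assumptions, not code from either tool. The function names and toy numbers are hypothetical, reads are assigned to bins by their start position only (a simplification), and the t statistic is left unconverted because the Python standard library has no t distribution; a statistics package would be needed to obtain the P-value.

```python
from math import sqrt
from statistics import mean, stdev


def rpkm(read_count, target_length_bp, total_mapped_reads):
    """Reads per kilobase of target per million mapped reads."""
    return read_count * 1e9 / (target_length_bp * total_mapped_reads)


def bin_read_counts(read_starts, region_length, bin_size):
    """Count reads per equal-sized, non-overlapping bin (by read start position)."""
    n_bins = region_length // bin_size
    counts = [0] * n_bins
    for pos in read_starts:
        idx = pos // bin_size
        if idx < n_bins:
            counts[idx] += 1
    return counts


def one_sample_t(segment_signal, genome_mean):
    """t statistic for H0: the segment's mean RD signal equals the genome mean."""
    n = len(segment_signal)
    standard_error = stdev(segment_signal) / sqrt(n)
    return (mean(segment_signal) - genome_mean) / standard_error


# 500 reads on a 1 kb target in a sample with 50 million mapped reads.
example_rpkm = rpkm(500, 1000, 50_000_000)

# Seven read starts in a 150 bp region, 50 bp bins.
example_bins = bin_read_counts([0, 10, 55, 60, 99, 120, 149], 150, 50)

# A segment whose bins average 5 reads against a genome-wide average of 10.
example_t = one_sample_t([4, 6, 5, 5], genome_mean=10)

print(example_rpkm, example_bins, round(example_t, 3))
```

A strongly negative t statistic, as in the last example, is the pattern expected for a deletion; a strongly positive one would point to a duplication.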
In our study, large differences were observed in the number and size distribution of CNVs detected by CMA and by the three WES-based tools. Microarray platforms have a limited capacity to detect small variants that are not covered by a sufficient number of probes. Several studies have tried to understand the role of these small variants. The detection of small, non-recurrent pathogenic or likely pathogenic CNVs could help to increase the diagnostic yield of CMA clinical testing by ~3% [27, 28]. WES-based tools such as XHMM and CoNIFER are capable of detecting small variants as long as the region is covered by a sufficient number of capture probes (> 10); they reached sensitivities of 14.6 and 22.2%, respectively, indicating the importance of probe number for CNV detection. The overwhelming number of variants that CNVnator detected was due to the extreme resolution of the algorithm. This resolution depends on sequencing depth: high resolution makes the algorithm more sensitive to small variants, but it can also split large CNVs into small pieces. A larger bin size setting in CNVnator could help to merge consecutive small CNVs into single variants; however, this parameter was limited by the average sequencing depth of our clinical WES data.
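The fragmentation effect can be illustrated with a small Python sketch that stitches consecutive calls of the same type back together. This is a hypothetical post-processing step shown only to clarify the idea; it is not a CNVnator feature and was not part of the analysis described here. The function name, the tuple layout, and the gap threshold are assumptions.

```python
def merge_adjacent_calls(calls, max_gap):
    """Merge calls of the same type on the same chromosome separated by <= max_gap bp.

    calls: list of (chrom, start, end, kind) tuples, with kind such as "DEL" or "DUP".
    """
    merged = []
    for chrom, start, end, kind in sorted(calls):
        if merged:
            p_chrom, p_start, p_end, p_kind = merged[-1]
            same_event = chrom == p_chrom and kind == p_kind
            if same_event and start - p_end <= max_gap:
                # Extend the previous call instead of opening a new one.
                merged[-1] = (p_chrom, p_start, max(p_end, end), p_kind)
                continue
        merged.append((chrom, start, end, kind))
    return merged


# Three deletion fragments that plausibly belong to one larger deletion,
# plus an unrelated duplication further along the chromosome.
fragments = [
    ("chr2", 1000, 4000, "DEL"),
    ("chr2", 4500, 9000, "DEL"),
    ("chr2", 9200, 12000, "DEL"),
    ("chr2", 50000, 52000, "DUP"),
]
stitched = merge_adjacent_calls(fragments, max_gap=1000)
print(stitched)
```

With a 1 kb gap threshold the three deletion fragments collapse into a single 11 kb call, which changes both the number of calls and their apparent size distribution.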
The 125 CMA-confirmed CNVs that were not detected by XHMM and CoNIFER were further investigated for possible explanations. Low sequencing depth (< 10 X) and limited capture probes (< 10) were found for 42 variants, and these regions were automatically excluded during the normalization step of both tools. The detection of these CNVs may improve if sequencing depth is increased. During data processing, the programs filter out capture probes located in recurrent variants that are detected within the same batch; thus, 43 polymorphic CNVs were ignored during detection, which was also confirmed by our in-house array database [http://database.gdg-fudan.org/DB_HTML/DataSub.html]. Thus, only 40 (23.4%) CNVs remained undetected without a clear technical explanation. The limited sample number and sequencing depth available to XHMM and CoNIFER could be a possible explanation for these undetected variants. CoNIFER requires at least 50 million mapped reads and a minimum of eight exome samples per run, while XHMM recommends ~50 exome samples with at least 60–100 X coverage [18, 29]. The characteristics of the samples in each batch also contribute to the effectiveness of CNV detection in XHMM and CoNIFER. Recurrent pathogenic or likely pathogenic variants may be filtered out erroneously if they are present in multiple samples; therefore, including normal reference samples in the batch could help to detect these CNVs. Conservative predefined thresholds in the default settings of CoNIFER might be a further reason for missed variants. Read-depth based tools perform poorly in repeated regions of the reference genome; thus, the sequence context of specific locations also hinders the detection of variants. CNVnator was designed for CNV discovery and genotyping from read-depth analysis based on a mean-shift approach. The number of nucleotides covered in each shift is called the bin size (50–60 in our study), which can be determined from the average coverage of the sequencing data (45–70 X for our samples).
CNVnator had the highest sensitivity, 87.7%, since it detected 150 of the 171 CMA-confirmed variants. A Venn diagram was used to show the poor concordance among the WES-based tools, which was attributed to insufficient sequencing depth and an inadequate number of batched samples. Therefore, CNV detection by WES-based tools was affected by the following factors (ranked in order of importance): probe number, read depth, sample composition of the batch, software parameters, and the sequence context of variants.
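The figures in this part of the discussion follow from simple counts, which the Python sketch below reproduces. The counts (150 of 171, and 42 + 43 + 40 = 125) are taken from the text. The matching rule that decides whether a CMA variant counts as detected is not stated here, so the any-overlap rule, the function names, and the toy intervals are assumptions made purely for illustration.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_detected(standard, calls):
    """Number of reference CNVs overlapped by at least one call (assumed rule)."""
    return sum(any(overlaps(ref, call) for call in calls) for ref in standard)


def sensitivity_percent(detected, total):
    """Detected reference variants as a percentage of all reference variants."""
    return round(100 * detected / total, 1)


# Counts reported in the text.
cnvnator_sensitivity = sensitivity_percent(150, 171)
missed_total = 42 + 43 + 40          # low depth/probes + polymorphic + unexplained
unexplained_share = sensitivity_percent(40, 171)

# Toy data (hypothetical) showing how the counts could be derived from intervals.
cma_standard = [("chr1", 1000, 5000), ("chr3", 20000, 326000), ("chr2", 100, 900)]
wes_calls = [("chr1", 1200, 4000), ("chr3", 300000, 400000)]
toy_detected = count_detected(cma_standard, wes_calls)
toy_sensitivity = sensitivity_percent(toy_detected, len(cma_standard))

print(cnvnator_sensitivity, missed_total, unexplained_share, toy_sensitivity)
```

A stricter rule, such as requiring a minimum reciprocal overlap, would lower the detected count for the same calls, so the choice of rule matters when sensitivities from different studies are compared.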
Using CMA-detected variants as the standard, the three tested WES-based CNV detection tools were not able to determine the size of variants accurately from WES data. XHMM and CoNIFER had lower sensitivity but estimated CNV size more accurately than CNVnator. CNVnator reached higher sensitivity at the cost of a high false positive rate and an exaggerated readout of variant size. Poor concordance of CNV detection was observed in the study. Increasing the number of samples per batch and the effective sequencing depth are the most practical approaches to improving the performance of these WES-based tools. At this stage, CMA remains the first choice and gold standard for CNV detection for clinical diagnostic purposes. CNV detection tools using WES data could be used as a screening tool.
The low concordance of CNV detection observed among three different read-depth based programs indicates that WES-based CNV detection remains immature and unstable compared to CMA. Since WES-based CNV detection was found to have low sensitivity and uncertain specificity in comparison with CMA-based CNV detection, CMA will continue to play an important role in detecting clinical grade CNVs in the NGS era, which is largely based on WES. CNV detection tools using WES data could be considered a complementary approach that requires only computational effort, but further validation is advised before use in clinical diagnosis.
The authors would like to thank all family members for their participation in this study.
This research was supported by the National Natural Science Foundation of China (Grant No. 81371903 and 81472051), the Project of Shanghai Municipal Science and Technology Commission (Grant No. 15410722800), and the Project of Shanghai Municipal Education Commission-Gaofeng Clinical Medicine (Grant No. 20152529).
Availability of data and materials
All data that were generated or analyzed during this study are included in this publication and its supplemental data. Please contact the corresponding author for further data requests.
RY was the major contributor in writing the manuscript and analyzing the data; CZ, TY, and XH contributed to the computational analysis of CMA data and WES data; NL and XW helped in summarizing data and interpretation; JW and YS performed a critical review of the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
This study was approved by the Committee on Ethics of the Shanghai Children’s Medical Centre and the Maternal and Child Health Hospital of the Guangxi Zhuang Autonomous Region. Written informed consent was obtained from the patients, and all experiments were performed in accordance with the Declaration of Helsinki.
Consent for publication
The authors declare that they have no competing interest.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Feuk L, Carson AR, Scherer SW. Structural variation in the human genome. Nat Rev Genet. 2006;7(2):85–97.
- Sharp AJ, Cheng Z, Eichler EE. Structural variation of the human genome. Annu Rev Genomics Hum Genet. 2006;7:407–42.
- Martin CL, Kirkpatrick BE, Ledbetter DH. Copy number variants, aneuploidies, and human disease. Clin Perinatol. 2015;42(2):227–42, vii.
- Fiorentino F, Napoletano S, Caiazzo F, Sessa M, Bono S, Spizzichino L, Gordon A, Nuccitelli A, Rizzo G, Baldi M. Chromosomal microarray analysis as a first-line test in pregnancies with a priori low risk for the detection of submicroscopic chromosomal abnormalities. Eur J Hum Genet. 2013;21(7):725–30.
- Manning M, Hudgins L; Professional Practice and Guidelines Committee. Array-based technology and recommendations for utilization in medical genetics practice for detection of chromosomal abnormalities. Genet Med. 2010;12(11):742–5.
- Liang D, Peng Y, Lv W, Deng L, Zhang Y, Li H, Yang P, Zhang J, Song Z, Xu G, Cram DS, Wu L. Copy number variation sequencing for comprehensive diagnosis of chromosome disease syndromes. J Mol Diagn. 2014;16(5):519–26.
- Boone PM, Bacino CA, Shaw CA, Eng PA, Hixson PM, Pursley AN, Kang SH, Yang Y, Wiszniewska J, Nowakowska BA, del Gaudio D, Xia Z, Simpson-Patel G, Immken LL, Gibson JB, Tsai AC, Bowers JA, Reimschisel TE, Schaaf CP, Potocki L, Scaglia F, Gambin T, Sykulski M, Bartnik M, Derwinska K, Wisniowiecka-Kowalnik B, Lalani SR, Probst FJ, Bi W, Beaudet AL, Patel A, Lupski JR, Cheung SW, Stankiewicz P. Detection of clinically relevant exonic copy-number changes by array CGH. Hum Mutat. 2010;31(12):1326–42.
- Mills RE, Walter K, Stewart C, Handsaker RE, Chen K, Alkan C, Abyzov A, Yoon SC, Ye K, Cheetham RK, Chinwalla A, Conrad DF, Fu Y, Grubert F, Hajirasouliha I, Hormozdiari F, Iakoucheva LM, Iqbal Z, Kang S, Kidd JM, Konkel MK, Korn J, Khurana E, Kural D, Lam HY, Leng J, Li R, Li Y, Lin CY, Luo R, Mu XJ, Nemesh J, Peckham HE, Rausch T, Scally A, Shi X, Stromberg MP, Stütz AM, Urban AE, Walker JA, Wu J, Zhang Y, Zhang ZD, Batzer MA, Ding L, Marth GT, McVean G, Sebat J, Snyder M, Wang J, Ye K, Eichler EE, Gerstein MB, Hurles ME, Lee C, McCarroll SA, Korbel JO, 1000 Genomes Project. Mapping copy number variation by population-scale genome sequencing. Nature. 2011;470(7332):59–65.
- Ng SB, Turner EH, Robertson PD, Flygare SD, Bigham AW, Lee C, Shaffer T, Wong M, Bhattacharjee A, Eichler EE, Bamshad M, Nickerson DA, Shendure J. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461(7261):272–6.
- Rabbani B, Tekin M, Mahdieh N. The promise of whole-exome sequencing in medical genetics. J Hum Genet. 2014;59(1):5–15.
- Miyatake S, Koshimizu E, Fujita A, Fukai R, Imagawa E, Ohba C, Kuki I, Nukui M, Araki A, Makita Y, Ogata T, Nakashima M, Tsurusaki Y, Miyake N, Saitsu H, Matsumoto N. Detecting copy-number variations in whole-exome sequencing data using the eXome hidden Markov model: an 'exome-first' approach. J Hum Genet. 2015;60(4):175–82.
- Guo Y, Sheng Q, Samuels DC, Lehmann B, Bauer JA, Pietenpol J, Shyr Y. Comparative study of exome copy number variation estimation tools using array comparative genomic hybridization as control. Biomed Res Int. 2013;2013:915636.
- Samarakoon PS, Sorte HS, Kristiansen BE, Skodje T, Sheng Y, Tjønnfjord GE, Stadheim B, Stray-Pedersen A, Rødningen OK, Lyle R. Identification of copy number variants from exome sequence data. BMC Genomics. 2014;15:661.
- Tan R, Wang Y, Kleinstein SE, Liu Y, Zhu X, Guo H, Jiang Q, Allen AS, Zhu M. An evaluation of copy number variation detection tools from whole-exome sequencing data. Hum Mutat. 2014;35(7):899–907.
- Belkadi A, Bolze A, Itan Y, Cobat A, Vincent QB, Antipenko A, Shang L, Boisson B, Casanova JL, Abel L. Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proc Natl Acad Sci U S A. 2015;112(17):5473–8.
- Hehir-Kwa JY, Pfundt R, Veltman JA. Exome sequencing and whole genome sequencing for the detection of copy number variation. Expert Rev Mol Diagn. 2015;15(8):1023–32.
- Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60.
- Fromer M, Moran JL, Chambert K, Banks E, Bergen SE, Ruderfer DM, Handsaker RE, McCarroll SA, O'Donovan MC, Owen MJ, Kirov G, Sullivan PF, Hultman CM, Sklar P, Purcell SM. Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth. Am J Hum Genet. 2012;91(4):597–607.
- Krumm N, Sudmant PH, Ko A, O'Roak BJ, Malig M, Coe BP, NHLBI Exome Sequencing Project, Quinlan AR, Nickerson DA, Eichler EE. Copy number variation detection and genotyping from exome sequence data. Genome Res. 2012;22(8):1525–32.
- Abyzov A, Urban AE, Snyder M, Gerstein M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011;21:974–84.
- Kadalayil L, Rafiq S, Rose-Zerilli MJ, Pengelly RJ, Parker H, Oscier D, Strefford JC, Tapper WJ, Gibson J, Ennis S, Collins A. Exome sequence read depth methods for identifying copy number changes. Brief Bioinform. 2015;16(3):380–92.
- Bansal V, Dorn C, Grunert M, Klaassen S, Hetzer R, Berger F, Sperling SR. Outlier-based identification of copy number variations using targeted resequencing in a small cohort of patients with Tetralogy of Fallot. PLoS One. 2014;9(1):e85375.
- Bademci G, Diaz-Horta O, Guo S, Duman D, Van Booven D, Foster J 2nd, Cengiz FB, Blanton S, Tekin M. Identification of copy number variants through whole-exome sequencing in autosomal recessive nonsyndromic hearing loss. Genet Test Mol Biomarkers. 2014;18(9):658–61.
- Zhao M, Wang Q, Wang Q, Jia P, Zhao Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics. 2013;14(Suppl 11):S1.
- Legault MA, Girard S, Lemieux Perreault LP, Rouleau GA, Dubé MP. Comparison of sequencing based CNV discovery methods using monozygotic twin quartets. PLoS One. 2015;10(3):e0122287.
- Duan J, Zhang JG, Deng HW, Wang YP. Comparative studies of copy number variation detection methods for next-generation sequencing technologies. PLoS One. 2013;8(3):e59128.
- Hollenbeck D, Williams CL, Drazba K, Descartes M, Korf BR, Rutledge SL, Lose EJ, Robin NH, Carroll AJ, Mikhail FM. Clinical relevance of small copy-number variants in chromosomal microarray clinical testing. Genet Med. 2017;19(4):377–85.
- Poultney CS, Goldberg AP, Drapeau E, Kou Y, Harony-Nicolas H, Kajiwara Y, De Rubeis S, Durand S, Stevens C, Rehnström K, Palotie A, Daly MJ, Ma'ayan A, Fromer M, Buxbaum JD. Identification of small exonic CNV from whole-exome sequence data and application to autism spectrum disorder. Am J Hum Genet. 2013;93(4):607–19.
- Fromer M, Purcell SM. Using XHMM Software to detect copy number variation in whole-exome sequencing data. Curr Protoc Hum Genet. 2014;81:7.23.1–21.
- Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, Kitzman JO, Baker C, Malig M, Mutlu O, Sahinalp SC, Gibbs RA, Eichler EE. Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet. 2009;41(10):1061–7.