We propose a CNV prioritization algorithm, “data laundering”, suitable for both diagnostic and basic research. The algorithm is based on the idea that brain diseases result from genomic alterations that directly affect the brain [13, 14] and that, consequently, predominant expression of a gene in the central nervous system increases the probability of its contribution to a neurobehavioral phenotype [12]. We call the algorithm “laundering” because it resembles machine-washing: each step processes the output of the previous stage, so that the data are filtered several times using different criteria. Figure 1 schematically outlines the procedure.
First, a pool of CNVs is obtained by molecular karyotyping. At this stage, CNVs are checked for recurrence against in-house and web databases. In-house databases of genomic variants obtained with similar microarray types are used to spot recurrent aberrations. Note that the listing of a CNV in the Database of Genomic Variants, or in any other database dedicated to non-pathogenic genome variations, is not a criterion for exclusion at the following stages. Next, the localization and ontology of CNV genes are obtained (e.g. using the UCSC Genome Browser, NCBI Gene, OMIM, PubMed etc.). At this stage, genes lacking appropriate ontology, CNVs encompassing introns, and recurrent/non-pathogenic CNVs are excluded from further analysis. Here, CNVs are defined as copy number DNA gains/losses < 500 kbp.
Second, the genes are analyzed in silico for their expression in the central nervous system. As brain pathology is suggested to be mainly associated with neurobehavioral phenotypes, we recommend proceeding to the next step with a pool of genes highly expressed in the brain.
The third step is the retrieval of gene-gene interactions. Because databases differ, it is suggested to use several resources (e.g. NCBI Gene, BioGRID, STRING). Here, we merged data from NCBI Gene, BioGRID and STRING.
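The merging of interaction data can be illustrated with a minimal sketch. It assumes that each resource has already been exported as a list of gene-symbol pairs and that "merging" means taking the union of undirected, de-duplicated pairs; the function name `merge_interactions` and the toy gene symbols are hypothetical and are not part of the original procedure.

```python
from typing import Iterable, Set, Tuple, FrozenSet


def merge_interactions(*sources: Iterable[Tuple[str, str]]) -> Set[FrozenSet[str]]:
    """Union undirected gene-gene interactions from several sources.

    Assumption: an interaction A-B is the same as B-A, so each pair is
    stored as a frozenset; self-interactions are skipped.
    """
    merged: Set[FrozenSet[str]] = set()
    for source in sources:
        for gene_a, gene_b in source:
            if gene_a != gene_b:
                merged.add(frozenset((gene_a, gene_b)))
    return merged


# Hypothetical toy exports from three resources.
ncbi_gene = [("GENE_A", "GENE_B")]
biogrid = [("GENE_B", "GENE_A"), ("GENE_B", "GENE_C")]
string_db = [("GENE_C", "GENE_D")]

merged = merge_interactions(ncbi_gene, biogrid, string_db)
print(len(merged))  # 3 unique undirected interactions
```

Storing pairs as frozensets means that the same interaction reported in opposite orientations by two resources is counted only once.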
In the fourth step, the gene list is evaluated to uncover interactions and interaction-enriched gene clusters (sets of interacting genes). Only large groups of interacting genes are analyzed further; small clusters of interacting elements are set aside. This criterion is based on the hypothesis that highly interacting genes (proteins) are more likely to be involved in the same processes or to influence diseases with similar symptoms [15].
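One possible way to express this selection is sketched below. It assumes that a "cluster" can be approximated by a connected component of the interaction graph and that "large" is decided by a user-chosen size cutoff; both the function `interaction_clusters` and the parameter `min_size` are hypothetical, since no numeric cutoff is specified here.

```python
from collections import defaultdict


def interaction_clusters(edges, min_size):
    """Return connected components of the interaction graph with at
    least `min_size` genes (assumed definition of a "large" cluster)."""
    adjacency = defaultdict(set)
    for gene_a, gene_b in edges:
        adjacency[gene_a].add(gene_b)
        adjacency[gene_b].add(gene_a)

    seen = set()
    clusters = []
    for start in list(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collecting one connected component.
        component = set()
        stack = [start]
        seen.add(start)
        while stack:
            gene = stack.pop()
            component.add(gene)
            for neighbour in adjacency[gene]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        if len(component) >= min_size:
            clusters.append(component)
    return clusters


# Hypothetical toy interactions: one chain of four genes and one isolated pair.
edges = [("G1", "G2"), ("G2", "G3"), ("G3", "G4"), ("G5", "G6")]
clusters = interaction_clusters(edges, min_size=3)
print(len(clusters))  # 1: only the four-gene group is kept
```

Connected components are the simplest notion of an interacting group; a denser definition (for example, community detection) could be substituted without changing the rest of the workflow.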
Fifth, pathway lists are obtained for the set of interacting genes. When selecting databases, one should consider parameters such as the nature and curation of the pathway data. Here, Gene Ontology (GO), KEGG, Reactome and NCBI BioSystems were used.
Sixth, to process the pathway lists, we introduce a parameter (prioritization criterion) to determine significantly enriched pathways. To calculate the parameter, the total number of genes in each pathway is obtained. Pathways in which fewer than 25 genes are affected by CNVs are excluded. The remaining pathways are ranked using the index of pathway prioritization (IPP):
$$ {I}_{PP}=\frac{\sum {N}_{CNV\ genes}}{\sum {N}_{pathway\ genes}} $$
where $I_{PP}$ is the index of pathway prioritization, $N_{CNV\ genes}$ is the number of CNV genes in a pathway found in the molecularly karyotyped cohort, and $N_{pathway\ genes}$ is the total number of pathway genes. If the IPP is higher than average (i.e. as evaluated by the three-sigma rule), the pathway is prioritized.
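The filtering and ranking in this step can be sketched as follows. The sketch assumes that the three-sigma rule is applied as "IPP greater than the mean plus three standard deviations of the IPP values of the retained pathways" and that the population standard deviation is used; neither detail is stated explicitly here. The function `prioritize_pathways`, its parameters and the toy counts are hypothetical.

```python
from statistics import mean, pstdev


def prioritize_pathways(counts, min_cnv_genes=25, n_sigma=3.0):
    """Compute IPP per pathway and flag outliers.

    counts: {pathway: (n_cnv_genes, n_pathway_genes)}
    Pathways with fewer than `min_cnv_genes` CNV genes are excluded.
    Assumption: a pathway is prioritized if IPP > mean + n_sigma * SD.
    """
    ipp = {
        name: n_cnv / n_total
        for name, (n_cnv, n_total) in counts.items()
        if n_cnv >= min_cnv_genes and n_total > 0
    }
    if len(ipp) < 2:
        return ipp, []
    values = list(ipp.values())
    threshold = mean(values) + n_sigma * pstdev(values)
    prioritized = sorted(
        (name for name, value in ipp.items() if value > threshold),
        key=ipp.get,
        reverse=True,
    )
    return ipp, prioritized


# Hypothetical toy counts: eleven background pathways, one enriched
# pathway, and one pathway with too few CNV genes to be considered.
counts = {f"background_{i}": (30, 300) for i in range(11)}
counts["enriched"] = (90, 100)
counts["small"] = (10, 20)

ipp, prioritized = prioritize_pathways(counts)
print(prioritized)  # ['enriched']
```

Note that with very few retained pathways no value can exceed the mean by three population standard deviations, so a criterion of this kind is only informative when many pathways are ranked together.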
Seventh, the ontologies attributed to the elements of prioritized pathways are considered, and pathways are clustered according to their involvement in shared networks (cascades of processes) [16]. Thus, the algorithm provides a set of enriched processes (clusters of pathways) for a disease or for an individual patient.
Using the algorithm and the Affymetrix CytoScan HD microarray, we analyzed 191 genomes (DNA isolated from peripheral blood) of children with ID, ASD and congenital abnormalities without gross chromosomal and genomic rearrangements (i.e. only CNVs smaller than 500 kbp were included). The raw results of the algorithm processing are shown in Fig. 2.