Single copy probe design and development of probes for fluorescence in situ hybridization (FISH)
Methods for scFISH have been described previously [14,15,16,17, 31]. FISH probe development began with the precise definition of each single copy (sc) interval by specific human genome coordinates; intervals ranged in length from ~ 1.4 to 4 kilobases (kb). The sequence of each sc interval was amplified from human genomic DNA by polymerase chain reaction (PCR) optimized for long products, followed by gel purification of the amplicons and labelling by nick translation with a modified nucleotide (digoxigenin-11-dUTP) prior to hybridization to metaphase chromosomes. Following hybridization, probes were detected with a fluorescently labeled antibody against digoxigenin on metaphase chromosomes stained with 4’,6-diamidino-2-phenylindole (DAPI). Cells were imaged using a Metasystems computer-assisted epifluorescence microscope system.
Sc DNA probes consisted of either unique DNA sequences or highly divergent repetitive sequences (> 20% divergence) that behave as unique sequence targets during chromosomal hybridization [14,15,16,17, 31]. Sc genomic intervals were excluded if they were present in copy number variants (CNVs) with ≥ 1% population frequency [14] and were observed in independent microarray datasets. These datasets included Ontario Population Genomics Platforms (n = 873 individuals of European ancestry; minimum 25 probes per CNV; Database of Genomic Variants) and the Healthy sample set (n = ~ 400 individuals; minimum 35 probes per CNV, Affymetrix), which were used to identify common CNVs with ChAS (Chromosome Analysis Suite) software analysis of ThermoFisher (formerly Affymetrix) CytoScan HD arrays (Additional file 5: Table S4).
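The CNV exclusion rule can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure used in the study: the `Interval` and `CNV` types, the function names, and the coordinates in the usage note are hypothetical, and "present in" a CNV is interpreted here as any overlap with it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # inclusive
    end: int    # inclusive


@dataclass(frozen=True)
class CNV:
    region: Interval
    frequency: float  # population frequency as a fraction (0.01 = 1%)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def passes_cnv_filter(sc: Interval, cnvs: list, min_freq: float = 0.01) -> bool:
    """Keep an sc interval only if no common CNV (>= min_freq) overlaps it.

    Assumption: any overlap counts as "present in" the CNV.
    """
    return not any(
        cnv.frequency >= min_freq and overlaps(sc, cnv.region) for cnv in cnvs
    )
```

With a hypothetical common CNV at `chr2:1000-5000` (frequency 0.02), an sc interval at `chr2:4000-6000` would be rejected, whereas an interval overlapping only a rare CNV (frequency below 0.01) would be retained.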
Oligonucleotide primer design and sc amplicon production
Primer pairs for each selected sc interval were designed using Primer-BLAST [39]. Sc intervals were identified using RepeatMasker (University of California Santa Cruz (UCSC) Genome Browser). The DNA sequence (GRCh37/hg19) of the full sc interval, obtained from the UCSC Genome Browser [40], was the PCR template used to generate all primer pair options. Generally, 15–20 primer pairs were designed for each sc interval. The maximum size of the PCR product was limited by the length of the sc interval in base pairs (bp), and the minimum size was set 200–500 bp below the maximum. The selected primer melting temperature (Tm) range was 58.0–65.0 °C, with an optimal Tm of 62.0 °C. The maximum Tm difference between the primers of a pair was limited to 2 °C. Primer pair specificity was verified using the “RefSeq representation genome” database for alignment with the human genome by BLAST® (Basic Local Alignment Search Tool) [39] as well as by separate assessment with BLAT (GRCh37/hg19 and CHM13 [41]). The nucleotide coordinates of the primer pairs reported from the GRCh38/hg38 genome assembly were converted to GRCh37/hg19 coordinates using the UCSC Genome Browser. Optimal primer pairs minimized the self-complementarity of individual primers and the Tm difference between the pair. Primers whose PCR products had unintended targets were avoided, as were, generally, primers outside the 40–60% GC content range. Longer primers (> 25 bp) were preferred. Primer pairs were synthesized by Integrated DNA Technologies, Inc (Toronto, ON). Long PCR conditions using the hot-start DNA polymerase Kappa HiFi (Promega Corporation), following the manufacturer’s instructions, were optimized for each sc interval using a gradient PCR thermocycler (Eppendorf vapo.protect™, Hamburg, Germany). Optimized PCR conditions were then used for scale-up of the target amplicons. The amplicons were gel purified and labelled by nick translation for use in fluorescence in situ hybridization [14, 17, 42].
The primer details and PCR optimization cycling parameters are provided in Additional file 6: Table S5.
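The numerical primer criteria above (Tm of 58.0–65.0 °C, Tm difference of at most 2 °C within a pair, 40–60% GC content, preference for longer primers) can be expressed as a simple filter. The following Python sketch is illustrative and is not the Primer-BLAST workflow itself: the `Primer` type and function names are hypothetical, Tm values are assumed to be supplied by the design tool rather than computed, GC content is treated as a hard cutoff although the text describes it as a general preference, and the self-complementarity and specificity checks are omitted.

```python
from dataclasses import dataclass

TM_MIN, TM_MAX = 58.0, 65.0   # accepted primer Tm range (deg C)
MAX_TM_DIFF = 2.0             # maximum Tm difference within a pair (deg C)
GC_MIN, GC_MAX = 40.0, 60.0   # accepted GC content range (%)


@dataclass(frozen=True)
class Primer:
    seq: str
    tm: float  # assumed to be reported by the primer design tool


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a primer sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def primer_ok(p: Primer) -> bool:
    return TM_MIN <= p.tm <= TM_MAX and GC_MIN <= gc_percent(p.seq) <= GC_MAX


def pair_ok(fwd: Primer, rev: Primer) -> bool:
    return primer_ok(fwd) and primer_ok(rev) and abs(fwd.tm - rev.tm) <= MAX_TM_DIFF


def rank_pairs(pairs):
    """Return acceptable pairs, smallest Tm difference first, then longest primers."""
    accepted = [(f, r) for f, r in pairs if pair_ok(f, r)]
    return sorted(
        accepted,
        key=lambda fr: (abs(fr[0].tm - fr[1].tm), -(len(fr[0].seq) + len(fr[1].seq))),
    )
```

Ranking by Tm difference and then by length mirrors the stated preferences; a fuller version would also score self-complementarity and off-target amplification.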
Cytogenetic preparations
Cytogenetic fixed cell preparations were obtained from phytohemagglutinin (PHA)-stimulated peripheral blood, bone marrow, and dermal fibroblast samples. The cytogenetic cell preparations were derived from de-identified residual cell pellets that remained after routine cytogenetic diagnostic procedures were completed at the London Health Sciences Center Clinical Cytogenetics Laboratory (University of Western Ontario Office of Research Ethics, CER approval #5453). Cytogenetically normal cell pellets were used for bone marrow samples. Cell pellets were produced following routine cytogenetic protocols for cell culture and harvest [14] and fixed with 3 parts methanol: 1 part glacial acetic acid (Carnoy’s fixative).
Fibroblast metaphase cells from normal adults were also prepared in the research laboratory by culturing dermal fibroblasts that had been stored in liquid nitrogen in our research laboratory cell bank [43]. Fibroblasts were cultured in T25 flasks at 37 °C/5% CO2 in Dulbecco’s Modified Eagle Medium (DMEM; Gibco #11960-044) supplemented with 15% fetal bovine serum (Hyclone #SH30396.03) and 1% penicillin/streptomycin (Hyclone #SV30010). Cultures were grown until ~ 70% confluent, arrested in metaphase with colcemid (Gibco #15212-012) and harvested [43]. Fibroblasts were treated with hypotonic solution (0.075 M KCl) at 37 °C and fixed with Carnoy’s fixative. Fixed cell preparations were placed on glass microscope slides and aged at room temperature (1–3 days) prior to performing scFISH.
Sc probe selection for examining DA domains
All sc probes in Table 1 were developed in this study, with the exception of sc probe 3.3_1p36 [14, 17], which is a control probe showing EA (Additional file 7: Figure S2). For each domain, the probes consisted of anchor probes with confirmed DA as well as multiple scFISH probes linked in the genome to these anchor sc probes. The anchor probes were designed and produced from genomic regions corresponding to published legacy chromosomal localization studies of XDH [22], HMGB1P5 and HMGB1P1 [23], FGF6 [24], TPM1 [25], and COX5A [26]. These genes map to chromosome bands 2p23 (XDH), 3p24 (HMGB1P5), 12p13 (FGF6), 15q22 (TPM1), 15q25 (COX5A), and 20q13 (HMGB1P1). Legacy publications that mapped genes on human chromosomes by FISH were identified through PubMed and journal searches. Many of these gene mapping studies were published prior to the initial assembly of the complete human genome sequence in 2001. The ‘gene mapping’ FISH probes [22,23,24,25,26, 29] generally consisted of recombinant DNA with long human genomic inserts that ranged in length from ~ 50 kb to several hundred kb, and for which the full genomic sequence was not known. We scrutinized the FISH images in these publications to identify potential differences in the fluorescence intensities of signals hybridizing to each chromosome homolog, which are characteristic of DA. Images that appeared to exhibit differential hybridization were further characterized in our laboratory by scFISH to determine whether the published intensity differences met our criteria for DA. The locations of the FISH probe genomic targets were determined using probe-specific gene mapping details, such as restriction enzyme mapping and partial gene sequencing in these or related publications, which were then used to computationally localize sc intervals in the current human genome assembly. Sc probes were developed from within the large genomic target regions using previously published methods [14,15,16,17,18,19,20].
If DA was determined to be present by scFISH, the sc probe then served as an anchor probe from which to develop neighboring probes. The neighboring probes were used to determine if DA extended beyond the anchor sequence and formed a larger DA domain.
All sc probes developed in this study were hybridized to lymphocyte metaphase chromosomes to confirm the expected chromosomal band location and then scored for hybridization pattern (i.e., DA or EA) as summarized below using our previously described methods [14, 15, 17]. Domains were named based on the HUGO-approved gene name in the corresponding legacy gene mapping publication from which the anchor probe was derived. Sc probes are named according to their location within or adjacent to the gene from which they were derived. In intergenic regions, probes are identified by the coding gene closest to the sc interval, with centromeric (cen) or telomeric (tel) indicating the position of the probe relative to that gene, followed by the distance in nucleotides between the gene and the interval. Probes localizing within genes are named with the gene and the interval of exons and introns spanned, guided by conventions stipulated by Human Genome Variation Society (HGVS) nomenclature.
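The naming convention can be illustrated with two small helper functions. This Python sketch is hypothetical (the function names and argument layout are not from the study) and covers only the two simplest cases; names that combine an intergenic offset with an intragenic feature, such as PCK1_cen209-IVS6, are not handled.

```python
from typing import Optional


def intergenic_probe_name(gene: str, side: str, distance_nt: int) -> str:
    """Name a probe by its nearest coding gene, side (cen/tel), and distance in nucleotides."""
    if side not in ("cen", "tel"):
        raise ValueError("side must be 'cen' or 'tel'")
    return f"{gene}_{side}{distance_nt}"


def intragenic_probe_name(gene: str, first: str, last: Optional[str] = None) -> str:
    """Name a probe by gene and the exon/intron feature(s) it spans (e.g. IVS1 to IVS3)."""
    span = first if last is None else f"{first}-{last}"
    return f"{gene}_{span}"
```

For example, `intergenic_probe_name("TPM1", "tel", 3200)` yields `TPM1_tel3200`, and `intragenic_probe_name("DUOX1", "IVS1", "IVS3")` yields `DUOX1_IVS1-IVS3`, matching probe names used in this study.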
Scoring differential (DA) and equivalent accessibility (EA) of sc probe hybridization between metaphase homologous chromosomes—qualitative and quantitative
Evaluation of differences in the hybridized probe fluorescence intensity between homologs was performed as previously reported [14, 15]. Chromosome identification and scoring of the intensity of hybridized probe fluorescence signals (dim, medium, bright) were performed independently by a minimum of 2 analysts. A metaphase cell was considered to show differential accessibility (DA) if homologs were scored with different intensities (e.g. bright/medium, bright/dim, medium/dim, bright/nil). A cell was scored as equivalently accessible (EA) when homologs were scored with equivalent intensities (e.g. bright/bright, medium/medium). Any scores of dim/dim, nil/nil, or dim/nil were excluded. Cells in which the hybridized chromosomes overlapped another chromosome at or near the location of probe hybridization were also excluded to rule out potential hybridization effects on the targets. Twenty-five or more cells were scored for most samples, and a minimum of 2 samples were evaluated per scFISH probe for probe validation. A two-tailed binomial test with normal approximation was used to determine if there was a significant difference between the proportion of DA cells and that of EA cells [14]. Additionally, a two-proportion Z-test was used to test whether the proportion of DA cells differed between samples. Both statistical tests were performed at α = 0.05.
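The scoring rules and the two tests can be sketched as follows. This Python sketch uses only the standard library and rests on stated assumptions: the function names are hypothetical, any unequal pair not explicitly excluded (including medium/nil) is treated as DA, the binomial null hypothesis is taken to be equal proportions of DA and EA cells (p0 = 0.5), and the two-proportion test uses a pooled standard error. None of these details is specified beyond what the text above states.

```python
import math

LABELS = {"bright", "medium", "dim", "nil"}
# Score pairs that are excluded from analysis: dim/dim, nil/nil, dim/nil.
EXCLUDED = {frozenset(["dim"]), frozenset(["nil"]), frozenset(["dim", "nil"])}


def classify_cell(h1: str, h2: str):
    """Return 'DA', 'EA', or None (excluded) for the two homolog scores of one cell."""
    if h1 not in LABELS or h2 not in LABELS:
        raise ValueError("unknown intensity label")
    if frozenset([h1, h2]) in EXCLUDED:
        return None
    return "EA" if h1 == h2 else "DA"


def two_sided_p(z: float) -> float:
    """Two-tailed p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def binomial_z_test(n_da: int, n_total: int, p0: float = 0.5):
    """Normal approximation to a two-tailed binomial test (assumed null: p0 = 0.5)."""
    z = (n_da - n_total * p0) / math.sqrt(n_total * p0 * (1.0 - p0))
    return z, two_sided_p(z)


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Pooled two-proportion Z-test comparing DA proportions of two samples."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    if se == 0.0:  # both samples are all-DA or all-EA; no detectable difference
        return 0.0, 1.0
    z = (x1 / n1 - x2 / n2) / se
    return z, two_sided_p(z)
```

As a worked case, 20 DA cells out of 25 scored cells gives z = (20 - 12.5) / 2.5 = 3.0, which is significant at α = 0.05 under these assumptions.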
Visual differences in hybridized probe fluorescence intensities between homologs within the same cell were quantified using the gradient vector flow algorithm (GVF) that we previously developed [14, 27]. GVF determines FISH probe boundaries for each chromosomal hybridization as a binary contour and integrates the probe fluorescence across the subset of pixels comprising each signal [27]. Integrated signal intensities for homologs 1 and 2 are defined as \(I_{1}\) and \(I_{2}\), respectively. To determine differences between the signals of each homolog within a cell, a normalized intensity ratio was calculated:
$$ \text{Intensity Ratio} = \frac{|I_{1}-I_{2}|}{I_{1}+I_{2}} $$
Values close to 0 indicate homologs with EA, whereas values close to 1 indicate the differences in signal intensity that are characteristic of DA [14]. A bias in hybridization signal intensities between homologous regions was tested for statistical significance using a Mann–Whitney U test.
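As a worked illustration of the normalized intensity ratio, the following minimal Python sketch computes it from two integrated intensities. The function name and the input values in the usage note are hypothetical, and the GVF segmentation that produces the integrated intensities is not reproduced here.

```python
def intensity_ratio(i1: float, i2: float) -> float:
    """Normalized intensity ratio |I1 - I2| / (I1 + I2) for the two homologs of a cell."""
    if i1 < 0 or i2 < 0:
        raise ValueError("integrated intensities must be non-negative")
    total = i1 + i2
    if total == 0:
        raise ValueError("ratio is undefined when both intensities are zero")
    return abs(i1 - i2) / total
```

For example, integrated intensities of 300 and 100 give a ratio of 200 / 400 = 0.5, equal intensities give 0 (EA-like), and a signal present on only one homolog gives 1 (the DA extreme).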
Sc probe selection for investigating DA in different cell types
To avoid confounding factors such as differential tissue expression that could influence chromatin accessibility, sc probes were selected from within genes that had little to no expression (0.0–5.0 transcripts per million [TPM]) across all tissues of interest (lymphocytes/blasts, bone marrow, fibroblast). Expression data in TPM were downloaded from the Genotype-Tissue Expression (GTEx) [44] and Human Protein Atlas [45, 46] databases. GTEx expression data were from EBV transformed lymphocytes and fibroblasts with multiple samples representing each tissue. The mean and standard deviation across samples were computed with a custom Python script. The Human Protein Atlas data were derived from multiple bone marrow samples and obtained as mean expression values. A subset of the sc probes developed during this study that demonstrated DA in T-lymphocytes was selected to assess whether DA at these loci was conserved in bone marrow cells and fibroblasts. DA intervals present within genes (intronic and exonic) as well as in intergenic intervals were selected to establish DA across different tissues in both gene coding and noncoding intervals. The probes selected within genes were XDH_IVS30-IVS27, PCK1_cen209-IVS6, and DUOX1_IVS1-IVS3. Intergenic DA regions that were assumed to be transcriptionally inactive from UCSC Genome Browser annotations included TPM1_tel3200 and CTCFL_cen34302. The DUOX1_IVS1-IVS3 sc probe (chr 15q23) was developed and validated after review of historical FISH images within a SORD gene mapping study [29].
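The expression filter can be sketched in Python as follows. This is not the script used in the study: the function names and the data layout (a mapping from tissue name to per-sample TPM values) are hypothetical, and the 5.0 TPM threshold is assumed to apply to the per-tissue mean.

```python
from statistics import mean, stdev


def summarize_tpm(samples):
    """Mean and sample standard deviation of TPM values for one gene in one tissue."""
    sd = stdev(samples) if len(samples) > 1 else 0.0
    return mean(samples), sd


def low_expression_everywhere(tpm_by_tissue, max_tpm: float = 5.0) -> bool:
    """True if the mean TPM is at or below max_tpm in every tissue (assumed criterion)."""
    return all(summarize_tpm(v)[0] <= max_tpm for v in tpm_by_tissue.values())
```

A gene with hypothetical per-sample values of 0.2, 0.4, and 0.3 TPM in lymphocytes and 1.0 and 2.0 TPM in fibroblasts would pass, whereas a gene averaging 12 TPM in either tissue would not.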
Sequence comparison of epigenetic open chromatin marks between single copy probe genomic intervals exhibiting DA or EA
Epigenetic features characteristic of open chromatin were analyzed following the same approach that we have previously reported for other EA and DA genomic intervals [14]. The open chromatin properties extracted from the Encyclopedia of DNA Elements (ENCODE) [28] that were compared with mitotic accessibility included: DNase I hypersensitivity (Duke, Dnase1 HS), Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE) (University of North Carolina, FAIRE seq) and histone marks H3K4me1, H3K9ac, H3K27ac, and H3K4me2 (Broad Institute, histone modifications). All open chromatin marks reported were derived from data collected from the Epstein-Barr virus (EBV) transformed lymphoblastoid cell line, GM12878, in which DA had previously been characterized [14], and untransformed dermal fibroblast lines: GM03348 (DNase I HS) and NHDF-Ad (H3K4me1, H3K9ac, H3K27ac, and H3K4me2). All histone modification data were derived from ChIP-seq (chromatin immunoprecipitation assay with sequencing) signal intensities. The cumulative sum of signals for each open chromatin mark was determined for all sc intervals, and a mean integrated intensity was calculated for the DA and EA groups individually. Box-and-whisker plots of each mark for both DA and EA were used to visualize these distributions. Unpaired t-tests with Welch correction were used to test for significant differences (α = 0.05) in the mean integrated intensity of each chromatin mark between DA and EA intervals in lymphocytes and fibroblasts, as well as in integrated intensity per base pair between full DA domains and scFISH domain coverage. The open chromatin marks for new DA probes developed in this investigation were compared to previously reported EA probe intervals [14]. Open chromatin mark data for SCAMP2_IVS1 were censored from the other DA interval data set prior to statistical testing between DA and EA loci.
SCAMP2_IVS1 is within intron 1, a gene segment in which promoters have been identified [47, 48]. Together with the pronounced enrichment of open chromatin marks, this is consistent with SCAMP2_IVS1 localizing within the highly accessible SCAMP2 promoter. This sequence is not representative of the predominantly intergenic locations (n = 10) that characterize the other DA probes; therefore, SCAMP2_IVS1 was excluded from the analysis of the above interphase chromatin features to prevent biased weighting of the total integrated intensities by probe sequences.
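The integration and comparison steps can be sketched as follows. This Python sketch is illustrative only: the function names are hypothetical, the per-interval signal is assumed to be available as a list of per-position values, and only the Welch t statistic and its Welch–Satterthwaite degrees of freedom are computed. A p-value would additionally require the Student t distribution, for instance via SciPy's `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
from statistics import mean, variance


def integrated_intensity(signal):
    """Cumulative sum of one chromatin-mark signal across an sc interval."""
    return sum(signal)


def welch_t(a, b):
    """Welch's unequal-variance t statistic and degrees of freedom for two groups.

    Each group needs at least two values, and at least one group must have
    non-zero variance.
    """
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

In use, `a` and `b` would hold the integrated intensities of one mark for the DA and EA intervals, respectively, with the censored interval removed beforehand.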
Higher order chromatin structures in DA domains
The organization of DA domain intervals with respect to higher-order chromatin structures, specifically topologically associated domains (TADs), was analyzed using the public 3-D Genome Browser [38] with chromatin capture data (Hi-C) from the lymphoblast cell line GM12878 [37]. Chromatin interaction frequency heatmaps were generated at a resolution of 25 kb spanning DA domain and sc probe locations (GRCh37/hg19) within the UCSC Genome Browser [40]. Correspondence of DA domains with TADs and other intra-TAD interactions was analyzed from scaled heatmap and genome browser outputs from the 3-D Genome Browser and UCSC Genome Browser, respectively [38, 40].