Chromatin architecture reorganization during stem cell differentiation

Jesse R. Dixon; Inkyung Jung; Siddarth Selvaraj; Yin Shen; Jessica E. Antosiewicz-Bourget; Ah Young Lee; Zhen Ye; Audrey Kim; Nisha Rajagopal; Wei Xie; Yarui Diao; Jing Liang; Huimin Zhao; Victor V. Lobanenkov; Joseph R. Ecker; James A. Thomson; Bing Ren

doi:10.1038/NATURE14222

Outline

Chromatin architecture reorganization during stem cell differentiation

Yin Shen

2015, Nature

https://doi.org/10.1038/NATURE14222

visibility

…

description

23 pages

ARTICLE OPEN doi:10.1038/nature14222 Chromatin architecture reorganization during stem cell differentiation Jesse R. Dixon1,2*, Inkyung Jung1*, Siddarth Selvaraj1,3*, Yin Shen1, Jessica E. Antosiewicz-Bourget4, Ah Young Lee1, Zhen Ye1, Audrey Kim1, Nisha Rajagopal1, Wei Xie5, Yarui Diao1, Jing Liang6, Huimin Zhao6, Victor V. Lobanenkov7, Joseph R. Ecker8, James A. Thomson4,9,10 & Bing Ren1,11 Higher-order chromatin structure is emerging as an important regulator of gene expression. Although dynamic chromatin structures have been identified in the genome, the full scope of chromatin dynamics during mammalian development and lineage specification remains to be determined. By mapping genome-wide chromatin interactions in human embryonic stem (ES) cells and four human ES-cell-derived lineages, we uncover extensive chromatin reorganization during lineage specification. We observe that although self-associating chromatin domains are stable during differentiation, chromatin interactions both within and between domains change in a striking manner, altering 36% of active and inactive chromo- somal compartments throughout the genome. By integrating chromatin interaction maps with haplotype-resolved epigenome and transcriptome data sets, we find widespread allelic bias in gene expression correlated with allele-biased chromatin states of linked promoters and distal enhancers. Our results therefore provide a global view of chromatin dynamics and a resource for studying long-range control of gene expression in distinct human cell lineages. Three-dimensional genome organization is in- and chromatin structure. This represents the most creasingly considered an important regulator EPIGENOME ROADMAP extensive data set generated to date, to our know- of gene expression1–4. Recent high-throughput A Nature special issue ledge, for the analysis of higher-order chromatin studies of chromatin structure have begun to nature.com/epigenomeroadmap structure, allele-specific chromatin structure and shed light on the global organization of our state, and allele-specific gene expression. genome4–10. For instance, we and others recently discovered that inter- phase chromosomes are partitioned into megabase-sized topological Data generation and validation domains and smaller sub-domains (also known as topologically assoc- We performed Hi-C experiments5 in two biological replicates in H1 iated domains or TADs)6–9. These TADs form the basis for higher-level human ES cells and each of the four H1-derived lineages, generating structures referred to as the ‘A’ and ‘B’ compartments5,6. The A and B a total of 3.85-billion unique read pairs (Supplementary Table 1). We compartments are closely linked to other functional partitions of the normalized the intrinsic biases in Hi-C data16, and confirmed the high genome, such as early or late DNA replication timing and nuclear lamina reproducibility and accuracy of our Hi-C data sets using several metrics association11,12. Despite these advances, our understanding of the dynamic (Extended Data Fig. 1a–d, Supplementary Information and Supplemen- nature of chromatin architecture across human cell types and its effect tary Table 2). on cellular identity is incomplete. Here we analyse genome-wide higher- order chromatin interactions in H1 human ES cells and four human ES- Extensive A/B compartment switching cell-derived lineages, mesendoderm (ME), mesenchymal stem (MS) cells, Hi-C interaction maps provide information on multiple hierarchical neural progenitor (NP) cells and trophoblast-like (TB) cells13. These line- levels of genome organization4. Previous studies demonstrated that the ages represent extra-embryonic and embryonic lineages at early stages of genome is organized into A and B compartments, containing relatively development and have been extensively characterized by the Epigenome active and inactive regions, respectively5,11. Currently, it is unclear if the Roadmap project13, with data sets including mRNA-seq, ChIP-seq for A and B compartments change during differentiation and how this relates 13-24 histone modifications, base-resolution methylC-seq and DNaseI to lineage specification. We observe a large degree of spatial plasticity in hypersensitivity (DHS) in each lineage13,14. As such, this experimental sys- the arrangement of the A/B compartments across cell types, with 36% of tem provides an opportunity to compare variability in higher-order the genome switching compartments in at least one of the lineages ana- chromatin structure with underlying gene expression and chromatin lysed (Methods; Fig. 1a and Extended Data Fig. 2a–c). Many of the A/B state in a genome-wide manner. Further, using a newly developed me- compartment transitions are lineage-restricted (Fig. 1b). Notably, there thod to phase two parental alleles into chromosome-span haplotypes from appears to be a large expansion of the B compartment upon differenti- high-resolution chromosome conformation capture (Hi-C) data15, we ation of human ES cells to MS cells or in IMR90 fibroblasts. These two cell have phased the H1 genome to allow for analysis of allele-specific activity types have previously been shown to undergo an expansion of repressive 1 Ludwig Institute for Cancer Research, 9500 Gilman Drive, La Jolla, California 92093-0653, USA. 2Medical Scientist Training Program, University of California, San Diego, 9500 Gilman Drive, La Jolla, California 92093, USA. 3Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, 9500 Gilman Drive, La Jolla, California 92093, USA. 4The Morgridge Institute for Research, 309 North Orchard Street, Madison, Wisconsin 53715, USA. 5Tsinghua University–Peking University Center for Life Sciences, School of Life Sciences, Tsinghua University, Beijing 100084, China. 6 Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA. 7Laboratory of Immunogenetics, National Institute of Allergy and Infectious Diseases, Twinbrook I NIAID Facility, Room 1417, 5640 Fishers Lane, Rockville, Maryland 20852, USA. 8Howard Hughes Medical Institute, The Salk Institute for Biological Studies, 10010 North Torrey Pines Road, La Jolla, California 92037, USA. 9Department of Cell and Regenerative Biology, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin 53706, USA. 10 Department of Molecular, Cellular, and Developmental Biology, University of California Santa Barbara, Santa Barbara, California 93106, USA. 11University of California, San Diego School of Medicine, Department of Cellular and Molecular Medicine, Institute of Genomic Medicine, 9500 Gilman Drive, La Jolla, California 92093-0653, USA. *These authors contributed equally to this work. 1 9 F E B R U A RY 2 0 1 5 | VO L 5 1 8 | N AT U R E | 3 3 1 ©2015 Macmillan Publishers Limited. All rights reserved RESEARCH ARTICLE a Chr7 5 Mb 0.02 b 90 ES –0.02 R MS ME NP ES c TB IM 0.02 ME –0.02 ES MS ES MS switch Stable 0.02 Cell-type-specific TB –0.02 Stable A activation 0.02 NP –0.02 0.02 Domain switch MS –0.02 ES Hi-C interaction New switch Switch heat maps at boundary loss ME Stable B IMR90/MS-specific TB repression NP b b MS –0.02 0.02 b b RefSeq genes Normalized interaction frequency PC1 value 15 0 e + d ME NP TB MS IMR90 ES ES – 1.2 × 10–7 *** *** *** ME + 4 *** ME (Differentiated cell/ 4.5 × 10–4 3 × 10 –4 9 × 10–10 *** – human ES cell) 1.6 × 10–3 NP + log2 RPKM 2 NP TB – TB + 0 MS – IMR90 MS + −2 – + IMR90 le B A le B A le B A le B A le B A – ab to to ab to to tab to to tab to to tab to to St A B St A B S A B S A B S A B TMEM260 OTX2 Figure 1 | Dynamic reorganization of chromatin structure during at least one lineage. c, K-means clustering of PC1 values surrounding TAD differentiation of human ES cells. a, First principal component (PC1) values boundaries (‘b’ denotes boundary location). d, Distribution of fold-change in and Hi-C interaction heat maps in H1 ES cells and H1-derived lineages. PC1 gene expression for genes that change compartment status (‘A to B’ or ‘B to A’) values are used to determine the A/B compartment status of a given region, or that remain the same (‘stable’) upon differentiation (***P , 2.2 3 10216, where positive PC1 values represent A compartment regions (blue), and P values by Wilcoxon test; whiskers correspond to interquartile range). negative values represent B compartment regions (yellow). Dashed lines e, Genome browser for two genes of which one (OTX2) shows concordance indicate TAD boundaries in ES cells. b, K-means clustering (k 5 20) of PC1 between expression and PC1 values, whereas a second (TMEM260) does not. values for 40-kb regions of the genome that change A/B compartment status in heterochromatin modifications during differentiation13,17. In this regard, types (Fig. 2a), numerous changes in chromatin structure occur within there appears to be a similar redistribution of the spatial organization of domains. We observed a phenomenon that within some domains, a large their genomes as well. We observe that the regions that change their A/B portion of the interactions appears to increase or decrease across the entire compartment status typically correspond to a single or series of TADs domain between cell types (Fig. 2b). This suggests that a subset of TADs (Fig. 1a, c and Extended Data Fig. 2d, e), suggesting that TADs are the in a given lineage undergo concerted, domain-wide changes in interaction units of dynamic alterations in chromosome compartments. Consis- frequency. Hundreds of TADs underwent such alterations in each lineage tent with previous studies of individual loci18–20, we found that genes (Fig. 2b and Extended Data Fig. 3d), with the changes in interaction fre- that change from compartment A to B tend to show reduced expression, quency correlated positively with active marks such as DHS, H3K27ac and whereas genes that change from B to A tend to show higher expression with CTCF binding, and negatively correlated with repressive chromatin (Fig. 1d). In addition, lineage-restricted compartment A regions tend modifications such as H3K27me3 and H3K9me3 (Fig. 2c, see Methods to include more lineage-restricted genes compared to other regions for details). TADs that have a concerted increase in intra-domain inter- (Extended Data Fig. 3a). Although statistically significant, the overall action frequency tend to shift from the B to A compartments, while patterns of change in expression are subtle. Reasoning that this modest domains that have a concerted decrease in interaction frequency tend correlation may be due to the possibility that only a subset of genes to shift from A to B (Extended Data Fig. 3e, f). Consistent with the changes may be affected by compartment changes, although most genes remain in chromatin state activity, genes within domains that have increased unaffected, we identified a subset of 718 genes with co-variation between intra-domain interaction frequency tend to be upregulated, while genes gene expression and compartment switching (Fig. 1e, Extended Data within domains that decrease intra-domain interaction frequency tend Fig. 3b, c, and Methods). These genes were enriched for low CpG con- to be downregulated (Extended Data Fig. 3g, h). tent promoters (21.8% versus 15.6% for non-concordant genes, P value 8 3 10211, Fisher’s exact test), and several significant Gene Ontology Chromatin state and dynamic interactions (GO) terms, most notably related to extracellular proteins and extra- In order to understand the relationship between chromatin dynamics cellular matrix (Supplementary Table 3). Taken together, these results and other genomic and epigenomic features, we performed integrative indicate that at a global level, there is a high degree of plasticity in the A analysis of the Hi-C data along with the histone modifications, DHS, and and B compartments, yet relatively subtle corresponding changes in CTCF binding data in the six lineages. Specifically, we asked if particular gene expression, indicating that the A and B compartments have a con- chromatin state patterns predict changes in chromatin interaction fre- tributory but not deterministic role in determining cell-type-specific quency. We divided the genome into 40-kb bins and computed changes patterns of gene expression. in chromatin features in each bin upon differentiation. We then built a Random Forest classification model based on chromatin features to clas- Domain-level chromatin dynamics sify local interacting bins as having either increased or decreased inter- We next examined higher-order chromatin structure at a sub-chromosomal action frequency (see Methods for details). The model was able to classify scale. Previous studies indicated that chromosomes are composed of cell- regions of the genome that increased or decreased interaction frequency type-invariant TADs6,8. Across the six lineages analysed in this study, we with 73% accuracy (Fig. 2d, 100% graph; Extended Data Fig. 4a), which observe that although the positioning of TADs remains stable between cell increased to over 80% when we consider only the highest confidence 3 3 2 | N AT U R E | VO L 5 1 8 | 1 9 F E B R U A RY 2 0 1 5 ©2015 Macmillan Publishers Limited. All rights reserved ARTICLE RESEARCH predictions as based on the vote frequency difference (Fig. 2d, 30% graph). The Random Forest model not only indicates that chromatin state features a provide information on changes in interaction frequency, it also allows us ES to determine which chromatin marks are most predictive. Specifically, the ‘mean decrease’ of the Gini index for each chromatin mark indicates the ME importance of a given feature during classification. In this regard, we found that change in H3K4me1 density is the most important feature in NP predicting changes in long-range chromatin interactions (Fig. 2e and Ex- tended Data Fig. 4b, c). As H3K4me1 is present mostly at poised or active enhancers21,22, and as enhancers are known to engage in looping interac- TB tions that exist in a cell-type-specific manner23, these results suggest that enhancer dynamics may play a role in regulating local interaction Normalized MS changes during lineage specification. Consistent with this hypothesis, interaction frequency 40-kb regions with increased interaction frequency tend to have in- 0 15 IMR90 creased enhancer density (Extended Data Fig. 4d, e). Chr12: 85,000,000 Domains 90,000,000 Allele-specific chromatin organization 30 ES –30 Normal diploid human cells contain two copies of each chromosome. 30 ME –30 The collection of variants on a given parental chromosome (also known 30 DI NP –30 as the parental haplotype) can be used to determine functional differ- 30 TB –30 ences between two homologous chromosomes. Previous studies have 30 MS –30 revealed substantial differences between alleles in gene expression, DNA IMR90 30 –30 methylation, and chromatin states24–29. Apart from studies of individual b +4 loci in the genome30–32, little is known about the variability in higher- Δ Normalized order chromatin structure between homologous chromosomes. Recent interaction frequency –4 work from our laboratory15 has demonstrated that Hi-C data can be re-purposed to reconstruct chromosome-span haplotypes, which allows Chr1: 5 Mb 60,000,000 0.85 0.81 65,000,000 0.92 for the study of chromatin state and gene expression as a true diploid. Domains 0.95 0.88 0.95 0.82 We generated chromosome-span haplotypes incorporating ,93.5% of 10 ES 0.5 DNaseI 10 all heterozygous variants for H1 from a combination of Hi-C data sets, MS 0.5 30 ES –30 whole genome sequencing, and local conditional phasing15 (Fig. 3a). DI 30 MS –30 We observe a high level of concordance among the predicted haplo- RefSeq genes types and paired sequence reads from data sets with ‘long insert’ sizes c d e (Extended Data Fig. 5a), indicating that the reconstructed haplotypes 0.6 Pearson correlation H3K4me1 are of high quality. Next, we re-analysed data sets from Hi-C, mRNA-seq, Classification accuracy between LOI and GOI 0.8 Ranked variables 0.4 DHS 0.2 H3K27me3 ChIP-seq, methylC-seq, and DNase-seq experiments and determined 0.6 0 H3K9me3 from which parental haplotype each sequence read was derived (arbi- 0.4 H3K27ac −0.2 H3K36me3 trarily termed the ‘p1’ and ‘p2’ allele, as we cannot determine which 0.2 CTCF is the maternal or paternal copy from sequence information alone) 3 H3 7me3 H3 K4me 3 1 3 K ac S H3 CTCF me H3 6me H3 4me DH H3K4me3 (Fig. 3b and Extended Data Fig. 5b). 7 K2 K9 0 K2 K3 30% 40% 50% 100% 6,000 6,500 7,000 From the haplotype-resolved A and B compartment patterns across H3 Top labelled predictions (%) Mean decrease Gini index the p1 and p2 alleles in each lineage, we found that homologous chro- mosomes have highly similar A/B compartment patterns (Fig. 3c and Figure 2 | Domain-wide alterations in chromatin interaction frequency and chromatin state. a, Chromatin interaction heat maps in H1 lineages Extended Data Fig. 5c–e), with only 0.6–2.3% of the genome having and IMR90 fibroblasts. Also shown are domain calls in ES cells and the different A/B compartments between alleles in any given cell type (Ex- directionality index (DI) in each lineage. b, Changes in interaction frequency tended Data Fig. 5f). Notably, rare regions of the genome do show changes between ES and MS cells. Regions with higher interaction frequency in ES in A/B compartment status between alleles (Fig. 3d), but are not enriched cells are shown in blue, while regions with higher interaction frequency in MS for either allele-biased or known imprinted genes (Extended Data cells are shown in yellow. TADs having a concerted increase or decrease in Fig. 5g, h). On the contrary, regions of the genome containing allele- intra-domain interaction frequency are labelled yellow or blue, respectively, biased or imprinted genes have a subtle but statistically significant increase with the fraction of the domain showing increased or decreased interaction in the variability of A/B compartment scores between alleles (Fig. 3e). frequency listed. Domains that do not show a concerted change are shown in Likewise, the genomic regions with allelic chromatin states have greater grey. c, Boxplots of Pearson correlations coefficients between interaction variability in A/B compartment scores (Fig. 3f). This indicates that frequency changes and chromatin mark changes across TADs for each chromosome (n 5 23). Whiskers correspond to the highest and lowest points although most allele-biased and imprinted genes do not have differen- within 1.53 the interquartile range. d, Classification accuracy of the tial compartment status between alleles, there may be subtle differences Random Forest model in predicting whether a bin increases or decreases in in higher-order chromatin structure between homologous chromo- interaction frequency (n 5 768,793), tested on 10 randomly selected subsets of somes at allele-biased regions, reflecting their underlying allele biases in Hi-C data. Accuracy was also checked using actual data (blue), circularized activity. Lastly, similar to A/B compartment patterns, topological domain permutation (green) and a random permutation (yellow) of the data. As patterns appear consistent between alleles (Extended Data Fig. 6a, b). expected, randomly permuting the data yields 50% accuracy. Accuracy was Together, these results suggest that the global folding patterns of homol- also assessed considering the top 30, 40, 50% or all predictions based on ogous chromosomes are highly similar. vote frequency difference (error bars show the standard deviation of accuracies from the 10 randomly selected data subsets). e, Ranked chromatin features shown according to importance in classification as boxplots of the Allelic imbalances in gene expression mean decrease in Gini index from 10 randomly selected data subsets. Previous studies of allele-resolved gene expression have identified wide- Whiskers correspond to the highest and lowest points within 1.53 the spread imbalances in gene expression between different alleles24–27,33. interquartile range. However, it remains unclear to what degree allele-biased gene expression 1 9 F E B R U A RY 2 0 1 5 | VO L 5 1 8 | N AT U R E | 3 3 3 ©2015 Macmillan Publishers Limited. All rights reserved RESEARCH ARTICLE _ a All variants 700 a 8% 6% 14% b c 0 __ 1.0 Constitutively testable genes Phased 700 Median fold 0 __ change = 1.74 ES Density ES ME NP ME Unphased 700 0_ NP Chromosome 1 13% 14% 24% TB MS 500 kb hg18 0 224,000,000 224,500,000 225,000,000 MS TB Total +1 –1 _ 0 14 All 1 log2(p1/p2) _ Allelic Non-allelic log2 allelic expression ratio Phased 1 _ Unphased 1 10 kb hg18 d e f 1,800 g Imprinted genes (~1%) 1.0 Fraction of SNPs 123,740,000 *** Allelic SNPs b 100 kb hg18 Other SNPs Allelic genes Chr15 22,700,000 22,800,000 p1 0 1 22,900,000 Imprinted (%) –30 ES Het. variants 0 1,760 p2 –30 Me me3 DNaseI p1 15 * 0 15 0 150 p2 p1 –30 0 P < 2.2 × 10–16 MS 20 0 0.0 H3K36 DNA H3K4 p1 0 p2 –30 0 20 Allelic 0 1 p2 0 PARP9 Distance to 100 Non-allelic allelic TSS (Mb) p1 0 p2 100 0 h i 10 kb p1 10 TDG (chr12) me3 0 Both active/repressive p1 mRNA p2 10 0 Active marks p2 mRNA + 500 Repressive marks p1 H3K4me3 mRNA-Seq p1 0 Other mechanisms p2 H3K4me3 – 0 p1 Ac –500 MS p2 Ac + 500 TB p1 DHS p2 0 0 NP p2 DHS – –500 p1 DNA meth SNRPN PWAR5 IPW ES p2 DNA meth SNURF SNORD109A-B PWAR1 ME p1 H3K9me3 p2 H3K9me3 SNORD64,107-8 SNORD116 1-29 PWARSN 0 40 80 120 p1 Gseq Testable allelic genes p2 Gseq c e *** PC1 value difference 100 Mb Chr2 *** 50,000,000 100,000,000 0.025 Figure 4 | Allelic biases in gene expression in H1 lineages. a, Proportion of 0.05 p1 genes with detectable allelic expression with statistically significant allelic bias. –0.05 0.05 b, Density plot of the absolute value of the fold change in expression (log2) p2 –0.05 between alleles. c, Heat map showing k-means (k 5 20) clustering of the allelic 0 expression ratios (log2) at genes with constitutively testable expression d 206,500,000 208,500,000 Other Allelic 0.01 Imprinted (a minimum of 10 reads in each lineage). d, Genome browser image of variable p1 –0.01 f *** allelic expression of the PARP9 gene. e, Fraction of imprinted genes among 0.01 PC1 value difference p2 *** –0.01 0.02 allele-biased genes and other genes. (P 5 4.4 3 1025, Fisher’s exact test). Refseq f, Fraction of allele-biased genes that are known imprinted genes. g, Cumulative density plot of distances from variants to the nearest allele-specific gene. mRNA-seq 206,860,000 206,880,000 ZDBF2 Allele specific variants are defined using histone acetylation, H3K9me3, p1 30 0 H3K27me3, DHS and H3K4me3 (n 5 3,920, P , 2.2 3 10216, KS test). p2 30 Other Allelic Top 0.1% allelic h, Number of allele-biased genes showing consistent allele specific chromatin states in their promoter regions. Active variants are defined by H3K4me3, Figure 3 | Haplotype-resolved chromatin organization in H1 lineages. DHS or histone acetylation. Inactive promoter variants are defined by DNA a, Variants per megabase for all (green), phased (orange) and unphased variants methylation and H3K9me3/27me3. i, Genome browser image of mRNA-seq (purple) along chromosome 1. The inset zooms in on a ,1 Mb region, and chromatin features surrounding the TDG gene. where the presence of a variant at each base is indicated by a value of 1. b, Genome browser image of allele specific chromatin features and strand- specific mRNA sequencing. c, Genome browser image of PC1 values along ,1% of allele-biased genes (Fig. 4f). Although imprinted genes often chromosome 2 for the p1 and p2 allele. d, Allele specific compartment A/B occur in clusters, the majority of allele-biased gene expression is not patterns and mRNA-seq surrounding the imprinted ZDBF2 gene. e, Boxplots of clustered in the genome (Extended Data Fig. 6e). Taken together, these the difference between alleles of PC1 values. Regions with imprinted genes data suggest that most instances of allele-biased gene expression are due (P 5 0.003) and allelic genes (P 5 0.002) have more variable PC1 values to mechanisms other than genomic imprinting. One possible regulatory (Kolmogorov–Smirnov (KS) test). Whiskers correspond to the highest and mechanism that could give rise to allele-biased expression would be lowest points within 1.53 the interquartile range. f, Similar to e, but for regions allelic bias in activity of cis-regulatory elements near these genes. Indeed, with differential allelic chromatin activity (the number of allelic biased variants per 200-kb bin). Regions in the top 0.1% of differential allelic activities regions of the genome that show allele bias in histone acetylation, histone (orange) show greater differences in PC1 values compared other regions methylation, CTCF binding, and DHS are closer to allele-biased genes (P 5 1.6 3 1028 and P 5 0.0015, respectively, KS test). than randomly selected genomic regions (Fig. 4g). Furthermore, allelic gene expression is strongly correlated with DNA methylation or chro- varies among different lineages of a single individual. To address this, we matin modification state at promoters (Fig. 4h, i). Of the 247 genes that re-analysed haplotype-resolved mRNA-seq data and identified allelic contain heterozygous variants in their promoter regions and display biases in gene expression across the five H1 lineages. A total of 1,787 biased transcription in at least one lineage, a majority exhibit allele- genes showed allelic bias in gene expression in one or more lineages biased chromatin modifications or DNA methylation at the promoter studied here, representing ,24% of all testable genes (false discovery rate (Fig. 4h). Interestingly, 29% of the testable genes that have allele-biased (FDR) 10%, Fig. 4a). Most allelic differences in expression are not ‘on/off’ expression show no evidence of allelic bias in chromatin state or DNA events, but instead reflect biases in the level of expression from each allele methylation at the promoter (Fig. 4h), raising the possibility that elements (Fig. 4b). Further, allele-biased genes include both lineage-specific and outside of promoters may be responsible for the allelic gene expression. constitutively expressed genes (Extended Data Fig. 6c, d), and patterns of We identified 726, 969, and 5,769 allelic enhancers13 that showed allele allelic bias can also be constitutive or cell-type variable (Fig. 4c, d). Only bias in histone acetylation, DHS, and DNA methylation, respectively in rare cases do genes switch expression from one allele to the other (Fig. 5a). We observed a general concordance in allelic biases between between cell types. enhancers exhibiting allelic histone acetylation and enhancers showing As expected, genes subject to genomic imprinting are enriched among allelic DHS (Fig. 5a). However, we observe only modest concordance genes with allelic biases in expression (Fig. 4e), though these represent between DHS or acetylation defined enhancers with those identified 3 3 4 | N AT U R E | VO L 5 1 8 | 1 9 F E B R U A RY 2 0 1 5 ©2015 Macmillan Publishers Limited. All rights reserved ARTICLE RESEARCH a Allelic enhancers b c Gene linked to: d Acetylation DNaseI DNA Acetylation Short-range enhancer Pearson correlation coefficients Pearson correlation coefficients Acetylation DHS methylation 0.10 Long-range enhancer –1.0 0 1.0 –1.0 0 1.0 1.2 Random P = 4 × 10–4 Cumulative density Acetylation Both None P = 0.026 mapped reads (10–3) 0.035 8 × 10–3 Reads per million MS TB 0 NP ME Strong Intermediate Weak 3.5 ES Acetylation mRNA 4C f DHS P = 5 × 10–4 P = 0.003 P = 0.18 0 100 200 300 Acetylation read counts 0.00 200 200 Interaction frequency HAPLN1 mRNA RPKM 0 Distance (Mb) 1.0 Testable allelic expressed genes 0 160 100 160 1.0 e Chr5 500 kb methylation BIRs 120 90 120 500 DNA 4C-seq 80 80 0 H3K4me3 16 80 4C bait locus 40 40 0.3 (MS) 0 –2.5 kb +2.5 kb Active allele Inactive allele VCAN EDIL3 0 70 0 HAPLN1 p1 p2 p1 p2 p1 p2 Figure 5 | Allele biases at enhancers in H1 lineages. a, Enrichment of n 5 1,601). Gene-enhancer pairs are grouped into strongly interacting (top acetylation (top row), DHS (middle) and DNA methylation (bottom) at 30%), weakly interacting (bottom 30%), and intermediately interacting pairs enhancers defined as allelic by acetylation (left column), DHS (middle), or (others) based on Hi-C interaction frequency (P values using Welch’s t-test). DNA methylation (right). The active allele is in blue, inactive allele in red. Whiskers correspond to the highest and lowest points within 1.53 the b, The distance between allelic genes and enhancers as defined by allelic histone interquartile range. e, Normalized 4C-seq interaction frequencies near the acetylation (purple) compared with randomly selected enhancers (grey). HAPLN1 gene. The 4C-seq bait region is in an allele-biased enhancer near the Random enhancers were selected to match the read coverage of allele-biased 39 end of the EDIL3 gene. Specific interactions called by the LOWESS enhancers. c, Number of allele specific genes linked to concordantly biased regression model are shown in black as ‘bait interacting regions’ (BIRs). allele specific enhancers. Genes linked by ‘long-range enhancers’ are defined f, Allele-biased expression of the two alleles of the HAPLN1 gene, histone using Hi-C interaction frequencies, whereas ‘short-range enhancers’ are acetylation levels at the nearby interacting allele-biased enhancer and allele defined as any enhancer less than 20 kb from a genes transcription start resolved 4C-seq data (4C-seq P value from t-test, n 5 2 for p1 allele, n 5 2 for site. d, Boxplots of the Pearson correlation coefficients between allelic p2 allele). gene-enhancer pairs defined by acetylation (left, n 5 1,388) or DHS (right, based on allelic DNA methylation (Fig. 5a). This may reflect greater Discussion power in identifying differentially methylated regions between the two We have presented genome-wide chromatin interaction maps in H1 alleles. Alternatively, this may reflect the presence of ‘poised’ enhancers, human ES cells and four H1-derived lineages. We observed dynamic re- where there is not a strict relationship between differences in DNA meth- organization of higher-order chromatin structure during ES cell differ- ylation and enhancer or DHS state34,35. Enhancers with allele-biased entiation at multiple hierarchical scales. We found extensive switching acetylation are generally located closer to genes that also show allele-biased between the A and B compartments during ES cell differentiation, and expression when compared with enhancers that lack allele bias (Fig. 5b observed that distinct subsets of genes have concordant A/B compart- and Extended Data Fig. 6f). A majority (66%) of the 640 allelic genes ments status and expression levels. In this regard, these results are similar that display strong Hi-C interactions with allelic enhancers also show to what has been seen with nuclear lamina tethering studies20,37–39, where concordant allelic activity between the enhancer and promoter (Fig. 5c, the expression of only a subset of genes is affected by compartment changes, Extended Data Fig. 7, and Methods). Additionally, enhancer-gene pairs while other genes remain unaffected. Changes in compartment status may linked by relatively strong Hi-C interactions show greater correlation influence the accessibility of genomic regions to transcription factors or between allelic enhancer activity and allelic gene expression compared other regulatory proteins, which may be particularly important for certain with pairs linked by weaker Hi-C interactions (Fig. 5d). To test if allelic subsets of genes. enhancers indeed form specific contacts with allele-biased genes, we In addition, we have observed local alterations in chromatin interaction performed 4C-seq31,36 with 6 allele-biased enhancers and identified that frequency within TADs. These local changes are best predicted by changes 4 out of these 6 allelic enhancers showed specific 4C interactions with a in levels of H3K4me1 and the density of enhancer elements. This is in agree- nearby allele-biased gene (Fig. 5e, Extended Data Fig. 8 and Supplemen- ment with recent 5C studies demonstrating that cell-type specific inter- tary Table 4). Taken together, our results strongly support that allele- action regions are enriched for Smc1, mediator, and transcription factor biased enhancer activity is a possible mechanism underlying allele-biased binding sites7. Taken together, these results suggest that enhancer elements gene expression. likely play an important role in shaping local higher-order chromatin To determine if part of the mechanism of regulation by allele-biased structure throughout the genome. In addition, by analysing patterns of enhancers also involved allelic chromatin looping between distal enhancers chromatin interactions on each parental allele, we observe relatively minor and putative target genes, we tested for the presence of allele-biased Hi-C global changes in higher-order chromatin structure between alleles. reads at allele-biased enhancers throughout the H1 genome by aggre- The chromatin interaction maps generated in this study also allowed gating all Hi-C reads between allelic enhancers and the promoters of the reconstruction of chromosome-span haplotypes for the H1 genome. nearby allelic genes. We observed that alleles containing enhancer activity This data set represents one of the first studies of allele-biased expression generally have higher numbers of chromatin interactions with the tar- across multiple cell types of a single individual, as well as analysis of chro- get promoters (Extended Data Fig. 9a). This result is confirmed by re- matin state at the linked cis regulatory elements. Our data set will serve as a analysis of previous high-resolution 4C-seq results31. Two loci (HAPLN1 valuable tool for the community to better understand the gene regulatory and MAN1C1) show a similar trend between allele bias in enhancer– networks controlling pluripotency and differentiation of human embry- promoter interactions with the allelic enhancer acetylation and gene onic stem cells. expression levels (Fig. 5f and Extended Data Fig. 9), though the trend in Online Content Methods, along with any additional Extended Data display items the allelic 4C-seq does not meet statistical significance. The remaining and Source Data, are available in the online version of the paper; references unique two loci (FAM65B, PXK) appear to have nearly equal interaction fre- to these sections appear only in the online paper. quencies with the target promoters. Taken together, these results suggest Received 25 November 2013; accepted 12 January 2015. that the allele-biased enhancers can impart allele-biased gene expression either through stable higher-order DNA looping between the two alleles 1. Smallwood, A. & Ren, B. Genome organization and long-range regulation of gene or through potential allele-specific enhancer–promoter interactions. expression by enhancers. Curr. Opin. Cell Biol. 25, 387–394 (2013). 1 9 F E B R U A RY 2 0 1 5 | VO L 5 1 8 | N AT U R E | 3 3 5 ©2015 Macmillan Publishers Limited. All rights reserved RESEARCH ARTICLE 2. Phillips, J. E. & Corces, V. G. CTCF: master weaver of the genome. Cell 137, 29. Xie, W. et al. Base-resolution analyses of sequence and parent-of-origin dependent 1194–1211 (2009). DNA methylation in the mouse genome. Cell 148, 816–831 (2012). 3. Lettice, L. A. et al. Disruption of a long-range cis-acting regulator for Shh causes 30. Splinter, E. et al. The inactive X chromosome adopts a unique three-dimensional preaxial polydactyly. Proc. Natl Acad. Sci. USA 99, 7548–7553 (2002). conformation that is dependent on Xist RNA. Genes Dev. 25, 1371–1383 (2011). 4. Gorkin, D. U., Leung, D. & Ren, B. The 3D genome in transcriptional regulation and 31. Holwerda, S. J. et al. Allelic exclusion of the immunoglobulin heavy chain locus is pluripotency. Cell Stem Cell 14, 762–775 (2014). independent of its nuclear localization in mature B cells. Nucleic Acids Res. 41, 5. Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions 6905–6916 (2013). reveals folding principles of the human genome. Science 326, 289–293 32. de Wit, E. et al. The pluripotent genome in three dimensions is shaped around (2009). pluripotency factors. Nature 501, 227–231 (2013). 6. Dixon, J. R. et al. Topological domains in mammalian genomes identified by 33. Gimelbrant, A., Hutchinson, J. N., Thompson, B. R. & Chess, A. Widespread analysis of chromatin interactions. Nature 485, 376–380 (2012). monoallelic expression on human autosomes. Science 318, 1136–1140 (2007). 7. Phillips-Cremins, J. E. et al. Architectural protein subclasses shape 3D organization 34. Hon, G. C. et al. Epigenetic memory at embryonic enhancers identified in DNA of genomes during lineage commitment. Cell 153, 1281–1295 (2013). methylation maps from adult mouse tissues. Nature Genet. 45, 1198–1206 8. Nora, E. P. et al. Spatial partitioning of the regulatory landscape of the (2013). X-inactivation centre. Nature 485, 381–385 (2012). 35. Thurman, R. E. et al. The accessible chromatin landscape of the human genome. 9. Sexton, T. et al. Three-dimensional folding and functional organization principles Nature 489, 75–82 (2012). of the Drosophila genome. Cell 148, 458–472 (2012). 36. van de Werken, H. J. et al. Robust 4C-seq data analysis to screen for regulatory DNA 10. Rao, S. S. et al. A 3D map of the human genome at kilobase resolution reveals interactions. Nature Methods 9, 969–972 (2012). principles of chromatin looping. Cell 159, 1665–1680 (2014). 37. Reddy, K. L., Zullo, J. M., Bertolino, E. & Singh, H. Transcriptional repression 11. Ryba, T. et al. Evolutionarily conserved replication timing profiles predict long- mediated by repositioning of genes to the nuclear lamina. Nature 452, 243–247 range chromatin interactions and distinguish closely related cell types. Genome (2008). Res. 20, 761–770 (2010). 38. Finlan, L. E. et al. Recruitment to the nuclear periphery can alter expression of 12. Peric-Hupkes, D. et al. Molecular maps of the reorganization of genome-nuclear genes in human cells. PLoS Genet. 4, e1000039 (2008). lamina interactions during differentiation. Mol. Cell 38, 603–613 (2010). 13. Xie, W. et al. Epigenomic analysis of multilineage differentiation of human 39. Kumaran, R. I. & Spector, D. L. A genetic locus targeted to the nuclear periphery in embryonic stem cells. Cell 153, 1134–1148 (2013). living cells maintains its transcriptional competence. J. Cell Biol. 180, 51–65 14. Maurano, M. T. et al. Systematic localization of common disease-associated (2008). variation in regulatory DNA. Science 337, 1190–1195 (2012). Supplementary Information is available in the online version of the paper. 15. Selvaraj, S., Dixon, J. R., Bansal, V. & Ren, B. Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing. Nature Acknowledgements We thank the members of the Ren laboratory for support and Biotechnol. 31, 1111–1118 (2013). suggestions throughout the course of this work. This work is funded in part by the 16. Hu, M. et al. HiCNorm: removing biases in Hi-C data via Poisson regression. Ludwig Institute for Cancer Research, the NIH Roadmap Epigenome Project (U01 Bioinformatics 28, 3131–3133 (2012). ES017166 and R01 ES024984), and the California Institute of Regenerative Medicine 17. Hawkins, R. D. et al. Distinct epigenomic landscapes of pluripotent and lineage- (RN2-00905). Y.D. is supported by a postdoctoral fellowship from the Human Frontier committed human cells. Cell Stem Cell 6, 479–491 (2010). Science Program (LT000576/2014-L). 18. Brown, K. E. et al. Association of transcriptionally silent genes with Ikaros complexes at centromeric heterochromatin. Cell 91, 845–854 (1997). Author Contributions J.R.D., I.J., S.S., and B.R. conceived the study. J.R.D. performed 19. Kosak, S. T. et al. Subnuclear compartmentalization of immunoglobulin loci during Hi-C and 4C experiments with assistance from A.K. J.R.D., I.J., and S.S. performed the lymphocyte development. Science 296, 158–162 (2002). data analysis. A.Y.L. performed ChIP-seq experiments for CTCF with assistance from 20. Holwerda, S. & de Laat, W. Chromatin loops, gene positioning, and gene Z.Y. J.E.A.-B. performed ES cell culture, differentiation and sample collection. N.R. and expression. Front. Genet. 3, 217 (2012). W.X. provided expertise and assistance in data analysis. Y.S., Y.D., V.V.L. J.L., H.Z., J.R.E., 21. Heintzman, N. D. et al. Distinct and predictive chromatin signatures of and J.T. provided insights and support for the design and execution of the project. transcriptional promoters and enhancers in the human genome. Nature Genet. 39, J.R.D. and I.J. prepared the manuscript with assistance from S.S. and B.R. 311–318 (2007). 22. Heintzman, N. D. et al. Histone modifications at human enhancers reflect global Author Information All data from this study have been deposited in the GEO database cell-type-specific gene expression. Nature 459, 108–112 (2009). under accession number GSE52457. Reprints and permissions information is 23. Sanyal, A., Lajoie, B. R., Jain, G. & Dekker, J. The long-range interaction landscape of available at www.nature.com/reprints. The authors declare competing financial gene promoters. Nature 489, 109–113 (2012). interests: details are available in the online version of the paper. Readers are welcome 24. Heinz, S. et al. Effect of natural genetic variation on enhancer selection and to comment on the online version of the paper. Correspondence and requests for function. Nature 503, 487–492 (2013). materials should be addressed to B.R. ([email protected]). 25. McVicker, G. et al. Identification of genetic variants that affect histone modifications in human cells. Science 342, 747–749 (2013). This work is licensed under a Creative Commons Attribution- 26. Kasowski, M. et al. Extensive variation in chromatin states across humans. Science NonCommercial-ShareAlike 3.0 Unported licence. The images or other 342, 750–752 (2013). third party material in this article are included in the article’s Creative Commons licence, 27. Kilpinen, H. et al. Coordinated effects of sequence variation on DNA binding, unless indicated otherwise in the credit line; if the material is not included under the chromatin structure, and transcription. Science 342, 744–747 (2013). Creative Commons licence, users will need to obtain permission from the licence holder 28. Kuleshov, V. et al. Whole-genome haplotyping using long reads and statistical to reproduce the material. To view a copy of this licence, visit http://creativecommons. methods. Nature Biotechnol. 32, 261–266 (2014). org/licenses/by-nc-sa/3.0 3 3 6 | N AT U R E | VO L 5 1 8 | 1 9 F E B R U A RY 2 0 1 5 ©2015 Macmillan Publishers Limited. All rights reserved ARTICLE RESEARCH METHODS Estimation of random collision events in Hi-C data. We estimated random col- lision events by calculating the intermolecular ligation rate between a nuclear chro- Cell culture and previous data sets analysed. H1 human ES cells and H1-derived mosome (chrN) and the mitochondrial chromosome (chrM). The interacting space cells were cultured as previously described13. ChIP-seq experiments for CTCF were between chrN and chrM can be defined by multiplying (roughly 16 kb per chrM 3 performed using previously published methods and antibodies13,40. Hi-C libraries number of chrM per chrN) and (roughly 6.16 Gb per diploid nucleus). The number were generated as previously described5. Two biological replicates of Hi-C data of chrM per chrN was calculated from ChIP-seq input sequencing data. were generated for each lineages in order to assess the reproducibility of the data. Hi-C and ChIP-seq libraries were sequenced on the Illumina Hi-Seq 2000 and Hi- Number of chrM per chrN 5 Number of read counts for chrM/number of read Seq 2500 platforms. mRNA-seq, ChIP-seq for histone modifications and methylC- counts for chrN 3 6.16 Gb per chrN 3 16 kb per chrM. seq data sets have been previously published13. DNase-seq experiments have been The number of random collision events between any given two loci (40-kb bin previously described elsewhere14. size) was estimated as following. Sequence read alignment. The following description applies for the alignment of Number of random collision events per 40 kb2 5 number of intermolecular inter- DNA methylation, ChIP-seq and DNase-seq data sets. Single-end sequencing data was actions between chrN and chrM/interacting space between chrN and chrM 3 40 kb2. mapped to a variant masked reference genome (hg18) using Novoalign. Unmapped The estimated random collision events are summarized in Supplementary Table 2. and non-uniquely mapping reads were removed, and PCR duplicate reads were Topological domain calling. We systemically identified topological domains based removed with Picard. Reads were processed with the Genome Analysis Toolkit on the directionality index (DI) score and a Hidden Markov Model (HMM) as pre- (GATK)41. Specifically, reads underwent indel recalibration and variant realignment. viously described6. The number of identified topological domains across human Lastly, reads that overlapped with variant loci were split into the ‘p1’ and ‘p2’ allele genome was 2,468, 2,489, 2,202, 2,144 and 2,407 for ES, ME, MS, NP and TB cells, according to whether the base in each sequencing read matched the sequence from respectively. According to the topological domain patterns, genomes were partitioned either the p1 or the p2 alleles. into domains, boundaries and unstructured regions as previously described. For Hi-C data sets, read pairs were mapped independently to the variant masked Identification of A and B compartments. Identification of A and B compartments genome using Novoalign. Reads were then manually paired using in house scripts. was performed conceptually similarly to what has been previously described5, though Non-uniquely mapping, unmapped reads and PCR duplicate read pairs were removed. with several modifications. We used the normalized 40-kb interaction matrices for Reads pairs were then split into single reads and processed through the same GATK each cell type and calculated the expected interaction frequency between two 40-kb pipeline described above including indel re-alignment and variant recalibration. Finally, bins given the distance separating them in the genome. We used a sliding window read pairs were manually re-paired using in house scripts. approach with a bin size of 400 kb and a step size of 40-kb to generate an observed/ For mRNA-seq, we mapped the paired-end data to a variant masked reference. We expected matrix. The observed frequency was the sum of all observed interaction used Useq software to first process the variant masked genome to create a splice frequencies of the 40-kb bins making up the larger 400-kb bin. Likewise, the junction reference. Reads were then mapped to the Useq processed reference genome expected frequency was the sum of the expected frequencies of each of the 40- using Novoalign. Lastly, we converted the read alignment locations from the Useq kb bins making up the larger 400 kb bin. This value was used to generate the processed genome back to hg18 coordinates using Useq. observed/expected. This was then converted to a Pearson correlation matrix and Whole-genome sequencing, genotyping and haplotyping. Whole genome sequen- subsequently used for principal component analysis as previously described5. Speci- cing (WGS) data for the H1 genome was downloaded from the Sequence Read fically, we used the ‘cov’ function in R to generate a covariance matrix from the Pearson Archive database (SRA049981). Reads were mapped to the hg18 reference genome correlation matrix, and then we used the ‘eigen’ function in R to generate Eigen vectors using Novoalign. Unmapped and non-uniquely mapping reads were removed using and Eigen values from the covariance matrix. The first principal component for each in house scripts. PCR duplicate reads were removed using Picard. The data was chromosome was used to identify regions of the genome as belonging to either the A or processed through the Genome Analysis Toolkit (GATK) best practices guidelines. B compartment. The direction of the Eigen values is arbitrary, and therefore positive We performed indel recalibration, variant realignment, variant calling using the values were set to ‘A’ and negative values were set to ‘B’ based on their association with Unified Genotyper, and variant recalibration. gene density. Haplotyping was performed using the previously described HaploSeq method15. To identify regions of the genome that switched A/B compartment status with Briefly, Hi-C reads from each of the H1 derived lineages were used as input sequencing differentiation, we first identified regions with statistically significant variability in into the HapCUT software42 in order to generate haplotype predictions. For final PC1 values across all cell types using ANOVA. Second, we considered only regions haplotype calls, Hi-C data was combined with WGS mate-pair data for the H1 gen- where both biological replicates showed changes in PC1 values from positive to ome. HapCUT generates several ‘blocks’ for each chromosome. The vast majority of negative or vice versa. This allowed us to define the 36% of the genome that changes variants on each chromosome are in the ‘most variants phased’ (MVP) block. The compartment status in at least one lineage. MVP block for each chromosome was used as a ‘seed haplotype’ for local conditional Identification of genes with concordant expression and A/B compartment status. phasing using population sequencing data from the 1000 genomes project using the To define genes with concordant changes in expression and compartment status, Beagle v.4.0 software43. This generates two haplotypes for each chromosome, one for we calculated the covariance between the vector of the log2 of gene expression values the maternal allele and one for the paternal allele. As we do not have information and vector of PC1 values for each gene across the six lineages analysed. We use this regarding the parent of origin in the H1 genome, we arbitrarily define each allele as the calculated covariance as a metric to quantitatively define ‘concordance’. To calcu- p1 or p2 allele (p1 and p2 for parent 1 and 2, respectively). The p1 and p2 allele for late a P value for the covariance for each gene, we compared these observed covari- different chromosomes are not necessarily derived from the same parent, as this ance values to a random background distribution. The background distribution information is only accessible if the sequence of H1’s parents were also available. was generated by randomly shuffling the vector of log2 of gene expression for each Haplotype alignment bias. Although we mapped the ChIP-seq, DNase-seq, Hi-C gene and then calculating the covariance between the random gene expression and DNA methylation data sets to a variant masked genome, we recognize that vector and the PC1 values. This was repeated 1,000 times for each gene, and a rank- there could still be local alignment biases favouring a given allele. To account for based P value could then be calculated for the observed covariance values. These this, we performed a two-step filtering process. First, we generated simulated reads genes were shown to be enriched for low CpG content promoters, which is defined that span each position surrounding a variant location in the genome. SNPs and here by an observed/expected CpG content of ,0.35. GO terms analysis of this indels that showed .5% and .10% biases, respectively, were excluded from all subset of genes was performed using the DAVID GO terms website. downstream analyses, as these variants show an inherent mapping bias. Second, Identification of A and B compartments in each allele. Identification of A and B for each variant in the genome, we calculated the coverage over the variant based compartments in each allele was performed similarly as described in the above sec- on WGS data. Based on the WGS data, we expect each variant to have near equal tion, though with several modifications. Due to the low density of Hi-C interaction coverage between the two alleles. Any variant that had sequencing coverage greater frequencies in each allele, we used a sliding window approach with a bin size of than 3 standard deviations above the mean for each haplotype along a chromo- 1-Mb and a step size of 200-kb to generate an observed/expected matrix. The first some was excluded, as were variants that showed a Benjamini corrected binomial principal component in each allele was used to identify regions of the genome as P value of #0.05 when comparing the WGS read coverage on each allele. Lastly, belonging to either the A or B compartment. The direction of the Eigen values is analysis of allele-biased coverage at a SNP level can be very sensitive to genotyping arbitrary, and therefore the direction was determined according to the correlation errors, in particular if a homozygous variant is erroneously called as heterozygous. coefficient values with the PC1 values generated in the above section. To account for this we made a null hypothesis that all called heterozygous variants Changes in intra-domain interaction frequency. To compute the change in inter- were actually homozygous. We excluded any heterozygous variant with a GATK action frequency between cell types, we first merged the Hi-C data between two derived genotype P value of greater than 0.05 (after Benjamini correction). This replicates for each cell type. The merged, normalized interaction matrices were quan- excluded roughly 2% of all heterozygous SNPs in the genome as having genome tile normalized between all lineages to accommodate for differences in frequency sequencing coverage that could be expected for a homozygous variant. strictly due to sequencing depth. The differences between cell types were computed ©2015 Macmillan Publishers Limited. All rights reserved RESEARCH ARTICLE by simply subtracting the interaction frequency of each bin Iij of ES cells from the As a final result, the Random Forest model gives vote frequencies for predicting differentiated cell types (as shown in Fig. 2b). whether a given interaction is increased or decreased. The difference in vote frequency To assess for concerted domain-wide changes in interaction frequency, we calcu- between the two states reflects the confidence of the model in a given prediction, with a lated two values for each domain: the fraction of interacting bins in the domain that larger vote frequency difference indicating a higher degree of confidence. The sum of showed an increase in interaction frequency and the fraction of bins that showed a the vote frequencies is equal to 1. As an example, in the case where the model could not decrease in interaction frequency. To compare these numbers to what would be predict any changes in interaction frequency, the vote frequencies would be expected expected at random, we calculated the same two values for each domain where the to both equal 0.5. If the vote frequency for the ‘loss of interaction’ class was greater than bins of the domain where made up of randomly selected intra-domain interacting 0.5, the interacting bin would be classified as having undergone a loss of interaction. bins from throughout the genome, keeping the portion of bins in each domain Likewise, if the ‘gain of interaction’ vote frequency was greater than 0.5, the bin was separated by a given genomic distance constant. This randomization was performed classified as a gain of interaction. Again, the difference in vote frequencies between the 10,000 times for each domain. At random, each domain on average had roughly 50% two classes reflects the degree of confidence of the model in a given prediction. of bins that increased in interaction frequency and 50% that decreased in interaction When we built the classification model, the balance for the number of inputs frequency. By seeing deviations from these expected values, we could assess for ‘con- between two classes is important. If the model includes more gain of interaction fea- certed’ changes in interaction frequency. We assigned a rank-based P value of the tures rather than loss of interaction features, the model is more likely trained to predict degree of ‘concertedness’ for each domain by comparing the actual observed portion of a gain of interactions. To avoid this issue, we randomly selected the same number of the domain that was either increased or decreased in interaction frequency with what gain of interaction and loss of interaction feature vectors while building the classifica- was observed at random for each domain. These P values were adjusted for multiple tion model. testing using Benjamini correction, and we considered any domain as having under- The Random Forest model also provides a measure of the importance of each gone a concerted change if the final corrected P value was less than 0.001 (0.1% FDR). variable during classification as the ‘mean decrease’ metric of the Gini index. For a Changes in intra-domain interaction frequency between alleles. Domain-wide given variable, higher the mean decrease in Gini index, the more important the interaction frequency differences between alleles were calculated by using the variable is during classification. same approach described in the above section. If the domain-wide average inter- Identification of allelic biased genes, enhancers and SNPs action frequency difference between alleles was significantly more than randomized Allelic genes. We considered the two replicates of mRNA-seq data and used a data (P value 0.001), the corresponding domains are considered as having allele negative binomial distribution (10% FDR) to calculate significantly biased genes specific domain-wide interaction frequency changes. between the two alleles, where genes are defined by merging isoforms (from RefSeq). Correlation coefficient between domain-wide interaction frequency changes We used the edgeR software package in R for calculating the P values. and modification changes. The domain-wide correlations shown in Fig. 2c between Allelic SNPs. We estimated if a SNP is allele-biased on different types of readouts. changes in interaction frequency and various chromatin marks were calculated as In particular, we used ChIP-seq, DHS, and CTCF data sets independently to obtain follows. For each domain, the intra-domain interaction frequency differences between readouts of each SNP between the two alleles. We then used a binomial statistic ES cells and each differentiated lineage was calculated for each 40-kb interacting (with an expectation P 5 0.5) to identify significantly biased SNPs for a given data bin of the domain (where we define a single ‘interacting bin’ as being formed by the set. FDR was based on 1,000 random permutations. interaction of two underlying 40-kb genomic bins). These values were considered Differential methylation among alleles (DMRs). Bisulfite sequencing reads were as the first vector for the correlations. The vector of histone modification values mapped using Novoalign methylation aligner to an H1 variant masked hg18 ref- was calculated as follows. For each 40-kb interacting bin, the enrichment of a given erence genome. Duplicated and poorly mapped reads were removed, and the reads chromatin mark in the two 40-kb bins that compose the interaction was averaged. that contain SNPs were retained for downstream analyses. Reads were then assigned The average enrichment was then multiplied by a weight proportional to the genomic to either the p1 or the p2 allele on the basis of the SNPs present in each read. During distance between the two 40-kb bins. This weight was based on the global average this assignment, certain SNPs could not be resolved between the two alleles because of Hi-C interaction frequencies from six lineages analysed between loci separated of considerations of bisulfite conversion. Specifically, when a SNP is C/T (or listed by a given genomic distance. The two vectors were used to calculate a Pearson cor- as A/G on the reverse strand), the conversion of methyl-C to T by bisulfite will relation in each chromosome, which reflects how change in domain-wide inter- make it impossible to distinguish whether a given read is a methylated cytosine action frequency correlates with domain-wide chromatin mark changes. from one allele or a thymidine from the other allele. In these cases, these SNPs were The Random Forest classification model. We built a Random Forest model to excluded from distinguishing from which allele a given read was derived. After resolv- better understand which chromatin modifications may be most predictive of changes ing into each allele, CpGs were called and nearby CpG were merged (within 100 bp). in interaction frequency between any two given loci. The aim of the Random Forest Of note, in instances where a SNP contains a cytosine, it would be impossible to model was to classify 40-kb interacting bins as either increased or decreased in inter- distinguish whether a difference between two alleles is due to the polymorphism action frequency given information about the enrichment of various chromatin marks, or due to the change in methylation. As such, any position in the genome with a DHS and CTCF binding sites. The utility of the Random Forest model is twofold: first, SNP was excluded from our calculation of the percentage methylation over a given by assessing the accuracy of the model using observed data, we can learn whether the window. We called ASM in each of these CpGs using Fisher’s exact test with 10% information supplied to the model (in this case the chromatin state, DHS and CTCF FDR after multiple testing correction as a threshold for significance. We randomly data) is predictive of the outcome, namely changes in interaction frequency. The shuffled the methylation and unmethylation values for a given haplotype (for a second powerful aspect of the model is that it allows us to assess which input data CpG) and used these random estimates to obtain FDR. supplied to the model is most informative, allowing us to determine which chromatin Allelic enhancers. To study allele bias at enhancers, we first calculated the com- state features may be most predictive of changes in higher-order chromatin structure. bined coverage of whole genome sequencing data and bisulfite sequencing (without The model was built as follows: 40-kb interacting bins in the genome were classified regard for methylation status). Any enhancer where one of the two alleles con- into two groups, ones that increased in interaction frequency, and ones that decreased tained less than 35% of the total allele resolved reads at the enhancer was excluded in interaction frequency. These changes were defined if the 40-kb based interaction as having an inherent bias in mapping between the two alleles. To systematically frequencies increased or decreased more than twofold in the differentiated lineage study allelic enhancers, we combined several enhancer marks to obtain a combined compared to those in H1 ES cells. We only considered interacting bins separated by acetylation bam file. This combined bam file gives us the required coverage in an less than 2 Mb. We added a pseudocount value to the average interaction frequencies allelic context to perform an in-depth analyses. In particular, we combined data from when we calculate fold changes to allow for comparison of zero values. The resulting H4K8ac, H4K91ac, H2BK120ac, H3K18ac, H3K23ac, H3K27ac, H3K4ac, H2AK5ac criteria yielded 768,793 interacting bins as either losses or gains. Chromatin state and H3K9ac marks. Using this combined bam file, we examined allelic SNPs described changes of H3K4me1, H3K4me3, H3K9me3, H3K27ac, H3K27me3, H3K36me3, as above. For evaluating allelic enhancers, we obtained readout for enhancers defined DHS signal, and CTCF were also calculated. For each 40-kb bin, RPKM values for in ref. 13 (6 2.5 kb from enhancer peaks) between the two alleles. Then we used each chromatin mark were calculated. Fold changes of RPKM values were calculated binomial to obtain significance at an FDR of 10%, as evaluated by the random per- by comparing with RPKM values in H1 ES cells. Those 8 chromatin marks were mutation analyses (1,000 permutations). The same analysis was used to call allele- assigned to each interacting region, thus for each interaction (consisting of two inter- biased enhancers based on DHS data. For the analysis of allele bias in DNA methylation acting 40-kb bins) we can construct a 1 by 16 feature vector. at enhancers, we considered any enhancer as having allele-biased DNA methyla- Using those feature vectors, we built a classification model between gain and loss of tion if at least one ASM bin overlapped with the enhancer. If more than one bin of interactions using a Random Forest R package with default parameters except for ASM overlapped an enhancer, we checked to see whether the patterns of ASM were specifying the model to use 500 trees. The performance was measured according to two concurrent between all bins. If there were divergent patterns between ASM bins at criteria, the out-of-back error rate achieved from the Random forest model and the an enhancer, these enhancers were excluded. tenfold cross validation. We compared out-of-back error rate to tenfold cross valid- Distance of allelic enhancers to allelic genes. We compared the distance between ation and observed very similar results (shown in Extended Data Fig. 4a). allele-biased enhancers, as identified by histone acetylation levels with randomly ©2015 Macmillan Publishers Limited. All rights reserved ARTICLE RESEARCH selected enhancers, to test the hypothesis that if allele-biased enhancers regulate the interactions of all versus all targeted regions. For the H1 ES cell lines, we down- allele-biased genes, they should generally be closer to allele-biased genes than should loaded previously generated 5C data23 and compared this to our interaction frequency randomly chosen enhancers (Fig. 5b). This analysis was complicated by the fact that between enhancers and promoters. We observe very strong correlative patterns the rates of heterozygous SNPs near allele-biased genes are higher than for non- between 5C and our interaction frequency scores (Extended Data Fig. 1b). We can allele-biased genes in the genome (Extended Data Fig. 6f). This creates a situation also observe that interacting pairs tend to show higher 5C scores compared to non- of possible ascertainment bias, owing to the fact that enhancers near allele-biased interacting pairs. genes will therefore tend to have slightly higher allele-resolved read coverage as We also compared interaction frequency scores to dsQTL relationships. dsQTL compared with randomly chosen enhancers throughout the genome. To account data provide functional relationships between DHS and their target promoters based for this, when comparing the distance of allele-biased enhancers to allele-biased on QTL information45. We calculated enhancer–promoter interactions scores again genes with randomly chosen enhancers, we selected random enhancers to match for all pairs of DHS and promoter regions based on dsQTL data. Interactions defined the coverage profile of allele-biased enhancers. This was accomplished by binning by dsQTL were considered as target relationships, otherwise off-target relationships. all enhancers into increments of 50 sequencing reads, from 0 to 49, 50 to 99, etc, up According to interaction distance, enhancer–promoter interaction scores were calcu- to 1,700 reads. For each identified allele-biased enhancer, we selected 100 random lated between target and off-target gene relationships. We observe that target gene enhancers from the same coverage bin. This limits the effects of local variation in relationships tend to have higher interaction frequency scores (Extended Data Fig. 7d). heterozygosity rates throughout the genome on the likelihood of identifying allele- Pearson correlation coefficient between allele-biased gene-enhancer pairs. We biased enhancers near allele-biased genes. As such, the results in Fig. 5b are probably calculated Pearson correlation coefficients between allele-biased gene-enhancer pairs. not due to the possibility of having greater statistical power for calling allele-biased First we generate 1 by 10 vectors for each allelic gene and allelic enhancer using the enhancers near allele-biased genes (because of greater heterozygosity rates and higher data from H1 human ES cells and the four H1-derived lineages. For each lineage, numbers of allele-resolved reads). we assigned log2 (p2 allele read counts / p1 allele read counts) and log2 (p1 allele Enhancers, gene expression levels, lineage-specific genes, housekeeping genes read counts / p2 allele read counts) values as allelic bias information. After con- and imprinting genes. The enhancer regions were defined as previously described44. structing two 1 by 10 vectors for both allelic gene and allelic enhancer, we calcu- Briefly, enhancer chromatin signatures were trained for p300 binding sites in H1 lated the Pearson correlation coefficient between them. ES cells using RFECS algorithm based on H3K4me1, H3K4me3 and H3K27ac sig- Identification of allelic Hi-C interactions. We investigated allelic Hi-C inter- nals at 100-bp bin size. Next, these modification signals in all cell lines were tested actions for allelic gene-enhancer pairs. We considered allelic interactions between to predict enhancers. The predicted enhancers that overlap with H3K4me3 peaks 10-kb surrounding regions for both the transcription start site and enhancer, respec- or within 2.5 kb of the transcription start site were removed. Enhancers were merged tively. Many of allelic gene-enhancer pairs do not have any allelic interactions, but from all cell types if they are located close to each other (,2 kb) by taking the mid- allelic gene-enhancer pairs show concordance if they are connected by allelic Hi-C point at the centre of the new enhancer. interactions. For the gene list, gene expression levels, housekeeping genes and lineage-specific 4C-seq experiments and data analysis. 4C seq was performed essentially as described genes we used the same data set as described in ref. 13. For imprinting genes, we previously36. Six bait regions were chosen at allele-biased enhancer elements con- obtained known imprinted genes downloaded from publicly available imprinting taining SNPs that would allow for performance of allele specific 4C-seq, as has been gene database (http://www.geneimprint.com/). previously described31. Primers were designed such that the first read of a paired- Linking between allelically expressed genes and allele-biased promoter activ- end read would sequence the primer sequence derived from the bait region and ities. To investigate how many allele-biased gene promoter activities are consistent read into the target region of interest. The second read in the pair would read a with allelic gene expression levels, first we selected allelic genes that contain at least one portion of the bait region containing the SNP of interest (see Extended Data Fig. 8 allelic SNP in their promoter regions (1.5 kb upstream and downstream from tran- for a diagram of the experimental strategy). The primers were designed to include scription start site). We only considered allelic SNPs defined by DHS, H3K4me3, the Illumina adaptor sequences necessary for sequencing as well as the presence of histone acetylation, combined H3K9me3 and H3K27me3, and DNA methylation barcodes derived from Illumina’s TruSeq adaptors that allowed for multiplexing of because the signatures of those chromatin marks at the promoter regions are well 4C-seq reactions. We used two 4 base cutters, NlaIII and DpnII, for the first or defined. If promoters are marked by allelic SNPs from H3K9me3/H3K27me3 or DNA second restriction enzyme digestion, depending on the locus in question (See Sup- methylation and the allelic gene expression levels are consistent with the allele-biased plementary Table 4). 4C-seq templates were prepared as previously described36. 16 promoter activities, the genes can be explained by allelic repressive marks. If promoters PCR reactions using 200 ng of 4C template were performed for 30 cycles for each are marked by allelic SNP from histone acetylations, H3K4me3, and DHS, and allelic bait region and pooled together. The PCR reactions underwent a final purification gene expression levels are consistent with the allele-biased promoter activities, the step using AMPure beads (Beckman-Coulter) according to the manufacturer’s instruc- genes can be explained by allelic active marks. tions (using a bead-to-sample ratio of 1.8). The concentrations of each 4C library Identification of enhancer–promoter interactions. To investigate the linking were calculated using the KAPA qPCR system using a standard curve. The libraries between allelic genes and allelic enhancers we first defined enhancer-promoter inter- were then combined and spiked in with other non-4C sequencing libraries for actions using Hi-C interaction frequency data. Hi-C interaction frequencies were sequencing on the Illumina Hi-Seq 2500 machine. calculated in terms of 5-kb windows and normalized using HiCNorm. After that, Sequence reads were processed as follows. For each read, the first and second we considered all pairs of promoters and enhancers in each chromosome. Pro- sequencing reads were checked to identify the presence of the primer sequences and moter regions were fixed as 6 5 kb surrounding transcriptional start sites and any expected portion of the bait region. Any sequence with greater than 20% mis- enhancer regions were defined by using different window sizes as: 5 kb, 10 kb, matches to the expected bait region was discarded. The reads were trimmed such that 20 kb, 30 kb, 40 kb, 50 kb, 75 kb, 100 kb, 300 kb and 500 kb surrounding the centre each read was represented as a 36-mer, with 20 bp derived from the bait region and the of each enhancer (Extended Data Fig. 7). The interaction frequencies between a subsequent 16 bp, presumably containing the target region of interest. Based on the promoter and an enhancer at a certain window size were calculated as (interaction SNP identified in the second sequencing read derived from the bait region, each of frequency / window size of an enhancer) 3 5 kb. Final interaction scores were de- these files were split into allele specific 4C-seq FASTQ files for further analysis. fined as summation of interaction frequencies between promoter and enhancer 4C-seq data was mapped to a version of the hg18 genome with known SNPs in the with multiple window sizes. To calculate significance of each enhancer–promoter H1 genome masked to N, similar to other the strategy of mapping other sequence read interaction, we generated a random interaction frequency score by randomly permu- data sets performed in this study. Custom indexes for this H1-masked hg18 genome tated interaction frequencies between the promoter and enhancer in each window size. were built using the 4cseqpipe ‘‘-build_re_db’’ command. The reads were mapped The distribution of random interaction frequency scores was fit to Weibull distribution using the 4Cseqpipe software ‘‘-map’’ command to custom built indexes. Normalized and then P values of the interaction frequency in each enhancer–promoter pair were contact intensities were derived using the 4seqpipe ‘‘-nearcis’’ command for a 1-Mb calculated. At a given P value cutoff, we defined enhancer–promoter interactions. At a region upstream and downstream of the bait locus. We then took the normalized P value cutoff of 1 3 1023, more than 80% of interactions are reproducible between fragment level interaction frequency tables and removed any fragments where a SNP two biological replicates (Extended Data Fig. 7b). By taking this P value cutoff, we either could create or disrupt a potential restriction enzyme site between the two alleles. defined 339,761, 354,529, 319,169, 158,453, 250,495, and 210,010 enhancer–promoter In addition, given the short sequencing read length, any fragment with an insertion or interactions for ES, ME, MS, NP, TB and IMR90 cell lines between 103,982 enhancers deletion mapping within 16 bp of the fragment end was removed. These final filtered and 18,532 promoter regions. These enhancer–promoter interaction numbers can be sets of normalized fragment level interaction frequencies were then processed using a changed according to cutoff P values. sliding window approach with the window size of 5 kb and step size of 1 kb using the Comparison of enhancer–promoter interaction with other experiments. To average fragment interaction frequency over the 5-kb window. These sliding inter- validate predicted enhancer–promoter interactions we compared the interaction fre- action frequency files were then quantile normalized across all replicates in order for quency scores to 5C scores and DNaseI quantitative trait loci (dsQTL) informa- comparison between experiments using the ‘‘normalize.quantiles.robust’’ function tion. 5C is a chromosome conformation capture (3C)-based approach to measure (with use.median 5 TRUE) in the ‘‘preprocessCore’’ library in R. For display purposes, ©2015 Macmillan Publishers Limited. All rights reserved RESEARCH ARTICLE the average of two replicates was converted to bedGraph format and displayed in the for each experiment). In Supplementary Table 5, the Illumina barcode adaptors UCSC genome browser. are shown in red, with the region matching the bait region shown in blue. There is To identify regions that showed specific interactions with the bait region con- also an additional variable region in the Illumina TruSeq adaptors that has been trolling for the genomic distance between loci, we developed a LOWESS regression incorporated and shown in green. The phosphorothioate bond is indicated by an model. We pooled the sliding window interaction frequency files from each of the asterisk. 4C-seq replicates and performed LOWESS regression in R with the function ‘‘low- ess’’ (with f 5 0.01) on the log10 transformed interaction frequencies controlling 40. Kim, T. H. et al. Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell 128, 1231–1245 (2007). for the distance between the bait and potential interaction locus. We considered 41. DePristo, M. A. et al. A framework for variation discovery and genotyping any region as showing ‘specific’ interactions if it showed an increase in interaction using next-generation DNA sequencing data. Nature Genet. 43, 491–498 frequency greater than 2.5-fold over expected given the distance between the bait (2011). and target loci. These were considered to be the bait interacting regions. 42. Bansal, V. & Bafna, V. HapCUT: an efficient and accurate algorithm for the To test for any allelic bias in 4C-seq interaction frequencies, the average nor- haplotype assembly problem. Bioinformatics 24, i153–i159 (2008). malized fragment level interaction frequency was calculated for each allele of each 43. Browning, B. L. & Browning, S. R. A unified approach to genotype imputation and replicate over the bait interacting regions nearest to the transcription start site haplotype-phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 84, 210–223 (2009). (TSS) of the putative target gene. A t-test was performed using these average values 44. Rajagopal, N. et al. RFECS: a random-forest based algorithm for enhancer (n 5 2 for each allele) to determine statistical significance. identification from chromatin state. PLOS Comput. Biol. 9, e1002968 (2013). The primers used for each 4C-seq experiment are listed in Supplementary 45. Degner, J. F. et al. DNase I sensitivity QTLs are a major determinant of human Table 5 (please see Supplementary Table 4 for information regarding the bait regions expression variation. Nature 482, 390–394 (2012). ©2015 Macmillan Publishers Limited. All rights reserved ARTICLE RESEARCH a b Hi-C rep1 P C C = 0.94 P C C = 0.72 P C C = 0.58 Promoter region (chr7) Hi-C rep2 P C C = 0.71 P C C = 0.57 5C rep1 P C C = 0.68 5C rep2 Low High : Hi-C score / 5C score Distal region (chr7) d ES ME MSC 0.06 0.06 0.04 c 0.00 0.00 0.00 0.8 Fraction of reproducible boundaries −0.04 −0.06 −0.06 Pearson = 0.99 Pearson = 0.97 Pearson = 0.99 between rep1 and rep2 0.6 −0.06 0.00 0.06 −0.06 0.00 0.06 −0.04 0.00 0.04 NPC TB IMR90 0.06 0.06 0.04 0.4 0.00 0.00 0.00 0.2 −0.06 −0.04 −0.06 0.0 Pearson = 0.99 Pearson = 0.99 Pearson = 0.99 ES ME MSC NPC TB −0.06 0.00 0.06 −0.04 0.00 0.04 −0.04 0.00 0.04 Extended Data Figure 1 | Reproducibility of Hi-C data. a, Scatter plots 0.58, 0.57) are similar to the correlation coefficient between two biological showing the correlation of Hi-C interaction frequencies between two biological replicates of 5C data (PCC 0.68). c, Bar plots showing the fraction of topological replicates for H1 ES cells and H1-derived lineages. The Pearson correlation domain boundaries that are reproducible between biological replicates over coefficient between replicates is shown in each plot. b, Heat maps showing H1 and H1-derived lineages (x axis). d, Scatter plot showing the PC1 values interaction frequencies of Hi-C and 5C data over the chromosome 7 ENCODE derived from compartment A/B analysis between biological replicates. PC1 loci. Pearson correlation coefficients between heat maps are shown together. values are used to determine the A and B compartments in each cell type. The The correlation coefficients between Hi-C data and 5C data (PCC 0.72, 0.71, Pearson correlation coefficient is shown in each graph. ©2015 Macmillan Publishers Limited. All rights reserved RESEARCH ARTICLE a Scale 20 Mb hg18 chr7: 100,000,000 150,000,000 0.2 - 1Mb -0.2 _ 40kb 0.03 - Sliding Window -0.03 _ b ES ME MSC c d 0.0 0.2 0.4 0.6 0.8 1.0 Empirical Cumulative Density ME IMR90 51% 41% 54% 38% 60% 32% 3.8% changes (1.1% B to A) ES 21.8% changes (6.4% B to A) (2.7% A to B) (15.4% A to B) MSC TB 8% 8% 8% 25.0% changes 14.1% changes (8.1% B to A) NPC NPC TB IMR90 (6.8% B to A) (16.9% A to B) 10.0% changes (7.3% A to B) (3.8% B to A) (6.2% A to B) 53% 39% 51% 41% 62% 32% 0 10 20 30 40 50 Distance (in Mbp) In total, 36% of the genome changes compartment in at least 1 cell type 8% 8% 6% e ES ME ES NPC ES TB ES IMR90 Stable A Stable A Stable A Stable A Stable B Stable B Stable B switch Stable Stable B Switch A/B switch Stable switch Stable switch Switch Stable Loss Switch A/B 0.02 Switch PC1 Value A/B Switch Loss Switch switch Loss New -0.02 Switch Loss Switch A/B switch New Boundary Boundary Boundary Boundary Boundary Boundary Boundary Boundary Extended Data Figure 2 | A/B compartments changes are concordant with randomly generated topological domain boundaries. Domain boundaries are topological domain boundaries. a, Genome browser image of the A/B closer to A/B compartment transitions when compared with random (P value compartments determined using the previously described 1-Mb bin algorithm 2.2 3 10216, Wilcoxon rank sum test). e, K-means clustering of PC1 values in (1-Mb track) compared with the 40-kb sliding window approach used in human ES cells and differentiated lineages surrounding topological domain our analysis (40-kb sliding window track). b, Pie-charts demonstrating the boundaries. Similar to Fig. 1c, domain boundaries correspond to the transition fraction of the genome in the A (blue) or B (yellow) compartments in each points between the A/B compartments, and changes in A/B compartments that of the six lineages studied. Shown in black are regions with a PC1 of zero, occur during differentiation tend occur at domain boundaries. Regions that often corresponding to centromeric and telomeric regions of the chromosomes. stay as A or B compartment are termed stable A or stable B. Regions that stay as c, Percentage of the genome that changes A/B compartment upon A/B compartment switching are labelled as stable switch. Regions where the differentiation of ES cells into each of the five differentiated lineages. boundary becomes a new switching point for the A/B compartment are labelled d, Cumulative density plot of the distance between topological domain new switch. Regions that previously were A/B compartment switching but are boundaries and transition points between the A and B compartments. The red no longer after differentiation are labelled switch loss. Regions that entirely line represents the observed distances and the grey line represents distances for switch from A to B or vice versa are labelled as switch A/B. ©2015 Macmillan Publishers Limited. All rights reserved ARTICLE RESEARCH a * ** **** *** **** **** 6 8 6 restricted genes (%) 8 Fraction of lineage 8 4 4 4 4 4 4 2 2 0 0 0 0 0 0 ES ME MSC NPC TB IMR90 Lineage-restricted compartment A regions Other remaining genomic regions b c d NPC IMR90 2500 ME 1.0 362 525 564 685 777 Empirical Cumulative Density 1,028 2000 0.8 1,495 1,211 616 0.6 1500 Frequency TB MSC 0.4 1000 550 974 777 846 0.2 500 1,025 0.0 670 −0.02 −0.01 0.00 0.01 0.02 0 Increased Interaction Frequency Covariance 0.0 0.2 0.4 0.6 0.8 1.0 Decreased Interaction Frequency p-value No Change e ME NPC MSC TB IMR90 Total B->A A->B B->A A->B B->A A->B B->A A->B B->A A->B B->A A->B Increase 5 2 Increase 13 9 Increase 116 35 Increase 55 14 Increase 87 41 Increase 276 101 Decrease 0 5 Decrease 9 28 Decrease 1 167 Decrease 1 37 Decrease 0 114 Decrease 11 341 OR: Not a number OR: 4.37 OR: 533 OR: 137 OR: Not a number OR: 84 p-value 0.446 p-value 0.012 p-value 2.2e-16 p-value 5.3e-16 p-value 2.2e-16 p-value 2.2e-16 f g ME NPC TB MSC IMR90 10 Average difference of intra-domain A -> B + o - + o - + o - + o - + o - interaction frequencies (H1- diff) B -> A 2.2e-16 6.7e-14 6 2.2e-16 1.6e-4 log2(Diff FPKM/ES FPKM) 5 1.4e-8 2.2e-16 5.6e-7 4 0.0059 2 0 0 −2 -5 −4 ME MSC NPC TB IMR90 −6 h ES ME MSC NPC TB IMR90 **** **** **** **** **** **** Lineage-restricted genes Z-score of intra-domain 2 2 2 2 2 2 Other remaining genes interactions 1 1 1 1 1 1 0 0 0 0 0 0 -1 -1 -1 -1 -1 -1 -2 -2 -2 -2 ©2015 Macmillan Publishers Limited. All rights reserved RESEARCH ARTICLE Extended Data Figure 3 | A/B compartment changes and gene expression. e, Relationship between A/B compartment and intra-domain interactions. a, Fraction of lineage-restricted genes in lineage-restricted compartment A 2 3 2 tables for each lineage (or for all lineages combined, labelled as total) for regions and other remaining regions. If only one or two cell lines are assigned as domains that show a concerted increase or decrease in interaction frequency compartment A across the six lineages, the region is defined as a lineage- and whether they show a change from A to B or B to A compartments. Domains restricted compartment A region. For all six lineages, lineage-restricted genes are considered to undergo a compartment change if .80% of individual tend to be enriched in lineage-restricted compartment A regions compared to bins within the domain change compartment. Odds ratio (OR) for each lineage other genomic regions (P values ,0.05, Fisher’s exact test). b, Empirical and the total are listed, as are P values for the association (Fisher’s exact test). cumulative density plot of covariance values between gene expression and f, Box plots showing the average difference of intra-domain interaction PC1 score across the 6 lineages analysed. Shown in red are the observed frequencies between H1 human ES cells and the five lineages analysed. Regions covariance values, while in grey are covariance values calculated after randomly that change from compartment B to A (blue) tend to show increased intra- shuffling the vector of gene expression values for each lineage. The slight shift of domain interaction frequencies compared to regions that change from the red curve to the right indicates that the observed data has a subset of compartment A to B (orange). P values are less than 2.2 3 10216 for all lineages genes with higher covariance values than would be expected at random, (KS test). g, Box plots of the fold-change in gene expression of genes located indicating that a subset of genes have concordant gene expression and PC1 in domains that have a significant increase (1), decrease (2), or no change (0) values. c, Histogram of P values for the covariance between gene expression and in intra-domain interaction frequency between ES cells and each of the PC1 values for each gene. To calculate the P value, a random background differentiated lineages specified. The fold-change in expression is the log2 of the distribution of covariance values was generated by calculating the covariance expression of a gene in the differentiated cells over ES cells (P values from between the PC1 values and a randomly shuffling of the vector of gene Wilcoxon rank-sum test). h, Box plots showing Z-scores of intra-domain expression values for each gene. This shuffling was performed 1,000 times. The interactions between lineage-restricted genes and other remaining genes. actual observed covariance can then be assigned a rank based P value given the The average intra-domain interaction frequency was calculated for each random background distribution for that gene. The plot shows that a subset domain in six lineages analysed and converted to a Z-score. The Z-score of each of genes is enriched for having low P values, consistent with the idea that a gene was assigned by the Z-score of corresponding domain that includes the subset of genes shows concordant gene expression and compartment status. gene. Lineage-restricted genes tend to reside in domains with higher Z-scores d, Pie charts showing the fraction of domains that are identified as having a compared to other remaining genes. The P values were less than 1 3 1024 concerted increase (yellow) or decrease (blue) in intra-domain interaction from the KS test. frequency between H1 human ES cells and the five lineages analysed. ©2015 Macmillan Publishers Limited. All rights reserved ARTICLE RESEARCH a b 0.8 ES chr14 Interaction Freq. 30 ∆ Normalized between LOI and GOI (40kb) 200kb Classification accuracy 0.6 0 -30 0.4 MSC 0.2 H3K4me1 10 ES 10 0.0 MSC 30% 40% 50% 100% 10 Sorted by vote frequency difference ES CTCF OOB (Out-of-back) 10 10-fold cross validation MSC 56,200,000 56,600,000 c d e *** 0.4 Ranked variables in randomized data H3K4me1 Fraction of interactions DHS 0.3 H3K27me3 ∆ Number of enhancers 5 0.2 (diff - ES per 40kb) H3K9me3 H3K27ac 0 0.1 H3K36me3 CTCF 0.0 -5 -5 -4 -3 -2 -1 1 2 3 4 5 H3K4me3 ∆ Number of lineage specific of enhancers (diff - ES per 40kb) 6000 6200 6400 in n of Ga ctio ss n Gain of interaction Mean decrease Gini index t era Lo ctio i n era Loss of interaction int Extended Data Figure 4 | Random Forest model to predict Hi-C interaction d, Box plots demonstrating the difference in the number of enhancers in each changes. a, Comparison of the classification accuracy between tenfold cross 40-kb bin that undergoes a gain of interactions (GOI) or loss of interactions validation and the out-of-back (OOB) error rates. The two methods show (LOI) upon differentiation. We observe that regions that are involved gain of similar classification accuracies at each vote frequency threshold. b, Heat map interactions tend to contain more enhancers in differentiated lineages showing the difference of normalized interaction frequencies between H1 compared to H1 cells. We only considered regions containing more than and MS cells. The boxes indicate the regions with relatively strong higher 10 enhancers in total for H1 and differentiated cell lines (***P value interaction frequencies in H1 ES cells. H3K4me1 and CTCF ChIP-seq signals , 2.2 3 10216, KS test). e, Histogram showing the fraction of interactions are also shown together. H3K4me1 ChIP-seq signals are highly correlated with classified as GOI (orange) or LOI (blue) according to the difference of the changes of interaction frequency. c, Similar to Fig. 2e, ranked Gini index of number of lineage-specific enhancers between differentiated lineages and H1 different chromatin features of the Random Forest model when using ES cells. The regions with more lineage-specific enhancers in differentiated randomized data. The red line represents the centre of the Gini index. lineages are enriched by gain of interactions. ©2015 Macmillan Publishers Limited. All rights reserved RESEARCH ARTICLE a b TAAGATCAGATGAGTCAGTTTATTGATCTGGGTGG (i) mRNA 99.06% concordance (ii) WGS 98.85% concordance * GGGCCATAAGATCAGATGAGTCAGTTTATTGATCTG CGGTTGGGGCCATAAGATCAGATGAGTCAGTTTATT p1 TCCCCGGTTGGGGCCATAAGATCAGATGAGTCAGT AGTTCCCCGGTTGGGGCCATAAGATCAGATGAGTCA CTGAGTCAGTTCCCCGGTTGGGGCCATAAGATCAGA CGCNGAGTCAGTTCCCCGGTTGGGGCGATAAGATCA TGCGCNGAGTCAGTTCCCCGGTGGGGGCCATAAGA TCTTGCGCTGAGTCAGTTCCCCGGTTGGGGCCATAA TCTTGCGCTGAGTCAGTTCCCCGGTTGGGGCCATAAGATCAGATGAGTCAGTTTATTGATCTGGGTGG Haplotype concordant reads p2 TGGGGCCATGAGATCAGATGAGTCAGTTTATTGATC GTCAGTTCCCCGGTTGGGGCCATGAGATCAGATG Haplotype disconcordant reads TCTTGCGCTGAGTCAGTTCCCCGGTTGGGGCCATGAGATCAGATGAGTCAGTTTATTGATCTGGGTGG c ES p1 allele 0.1 ME p1 allele MSC p1 allele NPC p1 allele TB p1 allele 0.1 0.1 0.1 0.1 r=0.92 r=0.92 r=0.98 r=0.89 r=0.94 0.0 0.0 0.0 -0.05 0.0 0.0 PC1 score of replicate 2 -0.1 -0.1 -0.1 -0.1 -0.1 0.0 0.1 -0.1 0.0 0.1 -0.05 0.0 0.05 -0.1 0.0 0.1 -0.1 0.0 0.1 0.1 ES p2 allele ME p2 allele MSC p2 allele NPC p2 allele TB p2 allele 0.1 0.1 0.1 0.1 r=0.88 r=0.92 r=0.98 r=0.89 r=0.95 0.0 0.0 0.0 0.0 -0.05 0.0 -0.1 -0.1 -0.1 -0.1 -0.1 0.0 0.1 -0.1 0.0 0.1 -0.1 0.0 0.05 -0.1 0.0 0.1 -0.1 0.0 0.1 PC1 score of replicate 1 d 100 Mb hg18 chr2 50,000,000 100,000,000 150,000,000 200,000,000 ES total PC1 ES p1 allele PC1 ES p2 allele PC1 ME total PC1 ME p1 allele PC1 ME p2 allele PC1 MSC total PC1 MSC p1 allele PC1 MSC p2 allele PC1 NPC total PC1 NPC p1 allele PC1 NPC p2 allele PC1 TB total PC1 TB p1 allele PC1 TB p2 allele PC1 e g h Allele-Biased Genes Imprinted Genes 0.1 r = 0.96 Other Genes Other Genes Percent of Genes with Compartment Change p2 allele PC1 3 Percent of Genes with Compartment Change 3 p=0.01 0.0 2 2 -0.1 1 1 -0.1 0.0 0.1 0 0 p1 allele PC1 ES ME MS NP TB ES ME MS NP TB f 1.3% 0.6% 2.3% 1.9% 1.6% Compartment A/B changes between alleles Same compartment A/B patterns between alleles ES ME NPC MSC TB ©2015 Macmillan Publishers Limited. All rights reserved ARTICLE RESEARCH Extended Data Figure 5 | Allele-specific chromatin structure. a, Validation without resolving the two alleles. PC1 scores are highly consistent, suggesting of haplotypes by (i) RNA-sequencing (i) and whole-genome sequencing (WGS) that homologous chromosomes fold in highly similar patterns. e, Scatter plot of (ii). Shown in dark blue is the percentage of reads connecting variants in the PC1 values between the p1 and p2 alleles in H1 and H1-derived lineages. same predicted haplotype, while in light blue is the percentage of reads The Pearson correlation coefficient value is 0.96. f, Fraction of the genome that connecting variants predicted to be on different haplotypes. b, Inset labelled shows changes in A/B compartment status across alleles. For ES, ME, MS, with an asterisk is from Fig. 3b showing DHS sequencing reads over a SNP NP and TB cells, 1.3%, 0.6%, 1.9%, 2.3% and 1.6% of total genomic regions upstream from the SNRPN gene, demonstrating how different chromatin shows allelic compartment A/B patterns, respectively. g, Percentage of features can be assigned to a given haplotype. c, Scatter plots showing the allele-biased (purple) or non-allele-biased (grey) genes that have different A/B correlation coefficient of PC1 values obtained from compartment A/B analysis compartment status in each lineage. Only in ES cells is there a significant between the two biological replicates for each allele. Despite the reduction in association between allele-biased genes and regions with variable A/B reads when Hi-C data are split into two alleles, the PC1 scores were highly compartment between alleles (Fisher’s exact test). h, Similar to g, but showing reproducible between replicates. d, Shown is a genome browser image of PC1 the association between imprinted genes and changes in A/B compartment values in chromosome 2 for the p1 allele, p2 allele, and for all Hi-C reads between alleles. No lineage shows a significant association. ©2015 Macmillan Publishers Limited. All rights reserved RESEARCH ARTICLE a b Allele specific domain-wide interactions ES ME MSC 6 r = 0.94 p2 allele domain intactness 1.6% 1.2% 2.2% 4 1.6% 0.9% 1.9% 96.8% 97.9% 95.9% 2 NPC TB 0 1.3% 1.3% -2 0.6% 0.7% 98.1% 98.0% -4 -4 -2 0 2 4 6 p1 specific domain-wide interaction p1 allele domain intactness p2 specific domain-wide interaction Not significant c d ES ME NPC TB MSC 12 12 Housekeeping (%) Lineage-specific (%) 8 8 4 4 0 0 Allelically expressed genes Non-allelically expressed genes e Allelic Genes Fraction of allelic gene Randomized allelic genes 0.6 0.4 0.2 p value = 0.0482 0.0 +3 -3 Normalized Expression level 0k 200k 400k RPKM/average RPKM per gene Distance to nearest allelic TSS f Heterozygous SNPs 0.9 0.9 0.8 0.8 SNP density per 1kb SNP density per 1kb 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 -500k -200k 0 0 200k 500k Distance from TSS Distance from TES Allelic genes Non-allelic genes ©2015 Macmillan Publishers Limited. All rights reserved ARTICLE RESEARCH Extended Data Figure 6 | Domain-wide structural changes and allele-biased level across each of the five lineages. Allele-biased genes consist of both cell-type genes. a, Scatterplot showing domain ‘intactness’ between the p1 and p2 alleles. specific and constitutively expressed genes. d, Fraction of housekeeping genes Domain intactness is defined as the log2 ratio of the total number of intra- and lineage-restricted genes that show allele-biased expression. There is no domain interactions versus total number of inter-domain interactions for each statistically significant enrichment between allele-biased genes (orange) and topological domain. The highly correlated domain intactness scores between non-allele-biased genes (blue) among housekeeping or lineage-restricted genes. the p1 and p2 alleles support the similar topological domain patterns e, Empirical cumulative density plot of the distance between each allele-biased between two homologous chromosomes. b, Pie charts showing the fraction of gene and the nearest allele-biased gene (purple) as compared with randomly domains that are identified as having a concerted p1 allele specific increase chosen genes (yellow). The difference from an allele-biased gene to the (blue) or p2 allele specific increase (yellow) in interaction frequency. Grey in the nearest allele-biased gene is less than what would be expected at random pie charts indicates the fraction of domains that do not show allele specific (P 5 0.0482, Wilcoxon rank sum test), however, the difference is subtle, patterns compared to the random model (P value cutoff is 0.001). c, Heat map indicating that most allele-biased expression does not occur in clusters. f, Rate showing K-means clustering (k 5 12) of gene expression levels of allele-biased of heterozygous SNPs near both allele-biased (gold) and non-allele-biased genes across each of the five H1 lineages. The expression levels are shown as (black) genes. See Supplementary Information for further details. the fold-change of expression in each lineage relative to the average expression ©2015 Macmillan Publishers Limited. All rights reserved RESEARCH ARTICLE a 1 Mb hg18 60,500,000 61,000,000 61,500,000 62,000,000 Enhancer H3K27ac RefSeq genes ES mRNA-seq 5kb Hi-C 10kb inter. freq. 20kb ... 500kb sum(Hi-C) Random permutation sum(random) Anchor point : IDO1 gene promoter region b enhancer-promoter interactions Reproducibility of predicted 0.8 0.8 0.8 0.8 0.8 0.4 0.4 0.4 0.4 0.4 ES ME MSC NPC TB 0.0 0.0 0.0 0.0 0.0 -2 -2 -3 -3 -4 -4 -2 -2 -3 -3 -4 -4 -2 -2 -3 -3 -4 -4 -2 -2 -3 -3 -4 -4 -2 -2 -3 -3 -4 -4 5E 1E 5E 1E 5E 1E 5E 1E 5E 1E 5E 1E 5E 1E 5E 1E 5E 1E 5E 1E 5E 1E 5E 1E 5E 1E 5E 1E 5E 1E p value cutoff threshold to define interactions from Hi-C data 40 Hi-C interaction score (GM06990) c d dsQTL off-target dsQTL target 200 30 5C signal 100 20 10 0 Interaction Others Interaction Others 0 1e-3 1e-4 50kb 100kb 200kb 200kb< P value cutoff to define interactions from Hi-C Extended Data Figure 7 | Identification of enhancer–promoter biological replicates for each chromosome with different P value cutoffs. If the interactions. a, Shown is the Hi-C interaction frequency between the IDO1 P value is less than 0.001, the reproducibility between replicates is over 80%. gene promoter regions and 6 1 Mb surrounding regions. Each entry in the heat c, Distribution of 5C signals between interacting pairs (interaction) and non- map of Hi-C inter. freq. indicates Hi-C interaction frequency between the interacting pairs (others) defined by Hi-C interaction frequency score with promoter and the surrounding regions. Each row indicates the Hi-C interaction different P value cutoffs. Interaction pairs defined by Hi-C interactions are also frequencies for a given window size. The heat map of random permutation strongly enriched by 5C signals at both P value cutoffs (n 5 11,461 for 1 3 1023 was generated by randomizing each row in Hi-C interaction frequency. The and n 5 1,841 for 1 3 1024). d, Relationship between Hi-C interaction sum (Hi-C) and sum (random) indicate summation of Hi-C interaction frequency scores and dsQTL target-gene pairs according to distance between frequencies for each 5-kb window. Predicted enhancers, H3K27ac, RNA-seq, gene and its target DHS regions. Target-gene relationships tend to show higher and RefSeq gene information are shown together. b, Box plots showing the Hi-C interaction frequency scores compared to off-target-gene relationships. reproducibility of predicted enhancer–promoter interactions between two ©2015 Macmillan Publishers Limited. All rights reserved ARTICLE RESEARCH b c a Pearson Pearson (log) barcode 1,000,000 GCLM 0.999 0.850 Prim Normalized Interaction Freuqency 100,000 er 2 MAN1C1 0.999 0.598 Sequence r1 Read 2 PXK 0.998 0.929 10,000 me Bait SNP Pri FAM65B 0.970 0.824 1,000 Sequence HAPLN1 0.999 0.898 Read 1 MT2A 0.999 0.878 100 10 1 0 200,000 400,000 600,000 800,000 Genomic Distance (bp) d Scale 200 kb hg18 mRNA Acetylation chr6: 24,900,000 25,000,000 25,100,000 25,200,000 p = 2.8e-75 p = 5e-5 Bait Interacting 90 200 Regions 80 Acetylation Read Counts FAM65B mRNA RPKM 500 70 160 Normalized 60 4C-seq 50 120 0 40 16 4C bait 30 80 H3K4me3 (ES) 20 40 0 10 GMNN FAM65B 0 FAM65B CMAHP 0 le le le le FAM65B le le le le FAM65B al al al al C6orf229 p1 p2 p1 p2 e Scale chr1: 500 kb hg18 26,000,000 mRNA Acetylation Bait Interacting p = 0.0026 p = 0.00038 Regions 9 100 Acetylation Read Counts MAN1C1 mRNA RPKM 500 8 Normalized 7 80 4C-seq 6 0 60 5 16 4C bait 4 40 H3K4me3 (MS) 3 0 2 20 LDLRAP1 SEPN1 EXTL1 1 RHCE MAN1C1 PAQR7 PAFAH2 0 0 RHD TMEM57 STMN1 SLC30A2 le le le le le le MTFR1L TRIM63 le le al al LOC646471 al al AUNIP p1 p2 p1 p2 f MIR3917 Scale 100 kb hg18 mRNA Acetylation chr3: 58,250,000 58,300,000 58,350,000 58,400,000 58,450,000 58,500,000 p = 9.5e-11 p = 2.5e-6 Bait Interacting 250 300 Acetylation Read Counts Regions 200 250 500 PXK mRNA RPKM Normalized 200 150 4C-seq 150 0 100 16 4C bait 100 H3K4me3 (MS) 50 0 50 RPP14 PDHB KCTD6 0 0 e e le le ABHD6 PXK ACOX2 l el l el le le al al al al p1 p2 p1 p2 g Scale 200 kb chr1: hg18 mRNA Acetylation 94,200,000 94,300,000 94,400,000 94,500,000 94,600,000 p = 0.0001 p = 9.2e-19 Bait Interacting 30 Acetylation Read Counts Regions 400 25 GCLM mRNA RPKM 500 Normalized 20 300 4C-seq 0 15 200 16 4C bait 10 H3K4me3 (MS) 100 0 5 MIR760 ABCA4 0 0 DNTTIP2 ARHGAP29 ABCD3 le le le le le le le le al al al al GCLM p1 p2 p1 p2 h 200 kb hg18 55,100,000 55,200,000 55,300,000 55,400,000 mRNA Acetylation Bait Interacting p = 4.2e-7 p = 2.4e-5 Regions 6000 160 Acetylation Read Counts MT2A mRNA RPKM 500 140 5000 Normalized 120 4C-seq 4000 100 0 3000 4C bait 80 16 H3K4me3 (ES) 2000 60 0 40 1000 BBS2 MT4 MT2A MT1B NUP93 20 NUDT21 MT3 MT1E MT1G 0 0 OGFOD1 MT1L MT1F le le le le MT1M MT1H le le le le MT1JP MT1IP al al al al MT1A MT1X p1 p2 p1 p2 MT1DP ©2015 Macmillan Publishers Limited. All rights reserved RESEARCH ARTICLE Extended Data Figure 8 | 4C-seq between allelic enhancers and allelic genes. regions (BIRs). d, Normalized 4C-seq interaction frequencies surrounding a a, Diagram of experimental design for 4C-seq and allelic 4C-seq. The orange bait region located in an allelic enhancer near the FAM65B gene. The region depicts the 4C bait locus, and the green region is the interacting location of the bait is labelled as 4C bait. Regions with significant interactions target region. Primers containing the Illumina adaptor sequences and a according to the LOWESS regression model are labelled as black lines in the bait-specific sequence are used for inverse PCR of the target region. Barcodes track marked bait interacting regions. Shown to the right is the level of based on the Illumina TruSeq adaptors are incorporated into the primer mRNA-seq data for each allele of the FAM65B gene, the level of histone sequences to allow for multiplexing. The second primer will read a sequence acetylation at the allelic enhancer bait region. Significance for mRNA-seq data from the bait region with a SNP that determines the allele from with the bait was was calculated using the edgeR software package in R. Acetylation P values derived. b, Pearson correlation coefficients between replicates for each of the were calculated using a binomial test. e, Similar to d, but for a 4C seq bait located loci tested. Also shown is the Pearson correlation coefficient between replicates in the MAN1C1 gene. f, Similar to d, but for an allelic enhancer located in the after log-transformation of the interaction frequency. c, Scatter plot of PXK gene. g, Similar to d, but for an enhancer located in near the GCLM LOWESS regression of 4C-seq data. The x axis shows the genomic distance gene. Of note, this allele-biased enhancer forms no specific contacts with any between the bait region and the putative target region. The y axis is the log10 of allelic genes. h, Similar to d, but for an enhancer located near the MT2A the quantile normalized interaction frequencies. LOWESS was performed to gene. There are no specific interactions between the allelic enhancer and the generate an expected interaction frequency at each genomic distance (green MT2A gene. There are specific interactions between the allelic enhancer and the line). A cut off of 2.5-fold over expected (shown in the red dashed line) is used to MT1H and MT1G genes. However, neither gene has an exonic SNP and determine if a region shows specific interactions, so-called bait interacting therefore we cannot determine if these genes have allele-biased expression. ©2015 Macmillan Publishers Limited. All rights reserved ARTICLE RESEARCH a Acetylation biased DNase I biased DNA Methylation biased All allelic biased enhancers enhancers enhancers enhancers p1 p2 p1 p2 p1 p2 p1 p2 p1 p1 p1 p1 Allelic Hi-C Allelic Hi-C Allelic Hi-C Allelic Hi-C 325 101 213 142 625 610 1,163 853 Reads Reads Reads Reads 140 105 155 131 545 591 840 827 p2 p2 p2 p2 OR: 2.41 OR: 1.26 OR: 1.11 OR: 1.34 P value = 3.9E-7 P value = 0.14 P value = 0.2 P value = 9.86e-6 b 4C p=0.18 Interaction Frequency 1000 Normalized 4C-seq 200 Interaction Frequency 160 120 0 80 Scale 100 kb hg18 chr5: 83,000,000 83,050,000 83,100,000 83,150,000 83,200,000 83,250,000 40 HAPLN1 EDIL3 0 p1 p2 c 4C Interaction Frequency 500 Normalized 4C-seq p = 0.52 Normalized Interaction Frequency 70 60 50 40 0 30 Scale 100 kb hg18 chr1: 25,800,000 25,850,000 25,900,000 25,950,000 26,000,000 20 10 0 LDLRAP1 MAN1C1 MTFR1L le le SEPN1 le le LOC646471 al al p1 p2 d 300 4C Interaction Frequency Normalized 4C-seq Normalized Interaction Frequency p = 0.93 350 300 250 200 0 Scale 50 kb hg18 150 chr6: 25,000,000 25,050,000 25,100,000 100 50 FAM65B 0 FAM65B le le le le FAM65B al al FAM65B p1 p2 FAM65B e 4C Interaction Frequency 500 Normalized 4C-seq Normalized Interaction Frequency p = 0.82 25 20 15 0 10 Scale 100 kb hg18 chr3: 58,250,000 58,300,000 58,350,000 58,400,000 5 0 ABHD6 PXK le le RPP14 le le PDHB al al p1 p2 Extended Data Figure 9 | 4C-seq interacting regions from allelic enhancers. confidence intervals for the interaction frequency. Shown to the right are the a, Allelic Hi-C interaction reads shown for allelic gene-enhancer pairs defined allele-specific normalized 4C interaction frequencies for each allele. 4C-seq using either allelic histone acetylation, DHS or DNA methylation. Odds ratios interaction frequencies for each allele were computed over the significant bait (OR) and P values (Fisher’s exact test) are shown. For enhancers defined by interacting regions nearest to the target gene TSS. Significance testing for histone acetylation and the pooled set of enhancers, a statistically significant allelic 4C-seq data was performed by t-test (n 5 2 for each allele). Black bars association between allele-biased Hi-C reads and allele-biased enhancer activity below the plot indicate regions identified as bait-interacting regions (BIRs). is observed. b, Normalized 4C-seq interaction frequencies surrounding a bait Of note, the panel to the right is the same as that found in Fig. 5f. c, Similar to region located in an allelic enhancer near the HAPLN1 gene. The blue line b, but for an allelic enhancer located in the MAN1C1 gene. d, Similar to b, but shows the interaction frequency for the p1 allele and the red line shows for a 4C seq bait located in the FAM65B gene. e, Similar to b, but for an interaction frequencies for the p2 allele. The shaded regions represent 95% enhancer located in near the PXK gene. ©2015 Macmillan Publishers Limited. All rights reserved

References (44)

Phillips, J. E. & Corces, V. G. CTCF: master weaver of the genome. Cell 137, 1194-1211 (2009).
Lettice, L. A. et al. Disruption of a long-range cis-acting regulator for Shh causes preaxial polydactyly. Proc. Natl Acad. Sci. USA 99, 7548-7553 (2002).
Gorkin, D. U., Leung, D. & Ren, B. The 3D genome in transcriptional regulation and pluripotency. Cell Stem Cell 14, 762-775 (2014).
Lieberman-Aiden, E. et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289-293 (2009).
Dixon, J. R. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376-380 (2012).
Phillips-Cremins, J. E. et al. Architectural protein subclasses shape 3D organization of genomes during lineage commitment. Cell 153, 1281-1295 (2013).
Nora, E. P. et al. Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature 485, 381-385 (2012).
Sexton, T. et al. Three-dimensional folding and functional organization principles of the Drosophila genome. Cell 148, 458-472 (2012).
Rao, S. S. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665-1680 (2014).
Ryba, T. et al. Evolutionarily conserved replication timing profiles predict long- range chromatin interactions and distinguish closely related cell types. Genome Res. 20, 761-770 (2010).
Peric-Hupkes, D. et al. Molecular maps of the reorganization of genome-nuclear lamina interactions during differentiation. Mol. Cell 38, 603-613 (2010).
Xie, W. et al. Epigenomic analysis of multilineage differentiation of human embryonic stem cells. Cell 153, 1134-1148 (2013).
Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190-1195 (2012).
Selvaraj, S., Dixon, J. R., Bansal, V. & Ren, B. Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing. Nature Biotechnol. 31, 1111-1118 (2013).
Hu, M. et al. HiCNorm: removing biases in Hi-C data via Poisson regression. Bioinformatics 28, 3131-3133 (2012).
Hawkins, R. D. et al. Distinct epigenomic landscapes of pluripotent and lineage- committed human cells. Cell Stem Cell 6, 479-491 (2010).
Brown, K. E. et al. Association of transcriptionally silent genes with Ikaros complexes at centromeric heterochromatin. Cell 91, 845-854 (1997).
Kosak, S. T. et al. Subnuclear compartmentalization of immunoglobulin loci during lymphocyte development. Science 296, 158-162 (2002).
Holwerda, S. & de Laat, W. Chromatin loops, gene positioning, and gene expression. Front. Genet. 3, 217 (2012).
Heintzman, N. D. et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nature Genet. 39, 311-318 (2007).
Heintzman, N. D. et al. Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 459, 108-112 (2009).
Sanyal, A., Lajoie, B. R., Jain, G. & Dekker, J. The long-range interaction landscape of gene promoters. Nature 489, 109-113 (2012).
Heinz, S. et al. Effect of natural genetic variation on enhancer selection and function. Nature 503, 487-492 (2013).
McVicker, G. et al. Identification of genetic variants that affect histone modifications in human cells. Science 342, 747-749 (2013).
Kasowski, M. et al. Extensive variation in chromatin states across humans. Science 342, 750-752 (2013).
Kilpinen, H. et al. Coordinated effects of sequence variation on DNA binding, chromatin structure, and transcription. Science 342, 744-747 (2013).
Kuleshov, V. et al. Whole-genome haplotyping using long reads and statistical methods. Nature Biotechnol. 32, 261-266 (2014).
Xie, W. et al. Base-resolution analyses of sequence and parent-of-origin dependent DNA methylation in the mouse genome. Cell 148, 816-831 (2012).
Splinter, E. et al. The inactive X chromosome adopts a unique three-dimensional conformation that is dependent on Xist RNA. Genes Dev. 25, 1371-1383 (2011).
Holwerda, S. J. et al. Allelic exclusion of the immunoglobulin heavy chain locus is independent of its nuclear localization in mature B cells. Nucleic Acids Res. 41, 6905-6916 (2013).
de Wit, E. et al. The pluripotent genome in three dimensions is shaped around pluripotency factors. Nature 501, 227-231 (2013).
Gimelbrant, A., Hutchinson, J. N., Thompson, B. R. & Chess, A. Widespread monoallelic expression on human autosomes. Science 318, 1136-1140 (2007).
Hon, G. C. et al. Epigenetic memory at embryonic enhancers identified in DNA methylation maps from adult mouse tissues. Nature Genet. 45, 1198-1206 (2013).
Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75-82 (2012).
van de Werken, H. J. et al. Robust 4C-seq data analysis to screen for regulatory DNA interactions. Nature Methods 9, 969-972 (2012).
Reddy, K. L., Zullo, J. M., Bertolino, E. & Singh, H. Transcriptional repression mediated by repositioning of genes to the nuclear lamina. Nature 452, 243-247 (2008).
Finlan, L. E. et al. Recruitment to the nuclear periphery can alter expression of genes in human cells. PLoS Genet. 4, e1000039 (2008).
Kumaran, R. I. & Spector, D. L. A genetic locus targeted to the nuclear periphery in living cells maintains its transcriptional competence. J. Cell Biol. 180, 51-65 (2008).
Kim, T. H. et al. Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell 128, 1231-1245 (2007).
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genet. 43, 491-498 (2011).
Bansal, V. & Bafna, V. HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics 24, i153-i159 (2008).
Browning, B. L. & Browning, S. R. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 84, 210-223 (2009).
Rajagopal, N. et al. RFECS: a random-forest based algorithm for enhancer identification from chromatin state. PLOS Comput. Biol. 9, e1002968 (2013).
Degner, J. F. et al. DNase I sensitivity QTLs are a major determinant of human expression variation. Nature 482, 390-394 (2012).

About the author

Yin Shen

University of California, San Francisco, Faculty Member

Papers

Followers

View all papers from Yin Shenarrow_forward

Chromatin architecture reorganization during stem cell differentiation

Sign up for access to the world's latest research

Related papers

References (44)

Related papers

Related topics

Cited by