Omics Data Standards WG - Minutes 08-13-2018

AGENDA

1.- PLAC-seq data analysis by Ivan Juric and Miao Yu.

INTRODUCTION

PLAC-seq technology detects 3D interactions across the genome.

PLAC-seq involves multiple steps, some of which happen in the nucleus and some after that: restriction enzyme cutting, ligation, sonication, DNA purification and pull-down, and finally paired-end sequencing.

The advantage of this method is that, unlike other methods, it does not need a large number of cells to detect reliable 3D interactions.

PLAC-seq data is biased. There are biases from GC content, mappability, restriction cut site frequency and 1D distance between interacting regions, and there is also a bias towards ChIP-seq peak regions, which needs to be taken into account.

PLAC-SEQ DATA ANALYSIS

When using the PLAC-seq contact matrix, we restrict our analysis to rows and columns of the matrix that intersect with ChIP-seq peaks. Bin pairs that intersect with ChIP-seq peaks fall into two sets: the XOR set and the AND set.

XOR set: one of the two bins intersects a ChIP-seq peak, while the other does not.

AND set: both bins intersect ChIP-seq peaks.

There is also a NOT set: bin pairs in which neither bin intersects a ChIP-seq peak. This set is ignored.
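The set definitions above can be sketched as a small classifier. This is a minimal illustration, not code from MAPS: the function name and the `peak_bins` input (a set of bin indices already known to overlap a ChIP-seq peak) are assumptions made for the example.

```python
def classify_bin_pair(bin_a, bin_b, peak_bins):
    """Label a bin pair as 'AND', 'XOR' or 'NOT'.

    peak_bins is a hypothetical input: the set of bin indices that
    overlap at least one ChIP-seq peak. How bins are intersected with
    peaks is not shown here.
    """
    hits = (bin_a in peak_bins) + (bin_b in peak_bins)
    return {2: "AND", 1: "XOR", 0: "NOT"}[hits]


# Toy example: bins 3 and 6 overlap ChIP-seq peaks.
peak_bins = {3, 6}
pairs = [(2, 6), (3, 6), (1, 2)]
labels = {pair: classify_bin_pair(pair[0], pair[1], peak_bins) for pair in pairs}

# Only AND and XOR pairs are kept for the analysis; NOT pairs are ignored.
kept = [pair for pair, label in labels.items() if label != "NOT"]
```

In this toy example the pair (2, 6) is XOR, (3, 6) is AND and (1, 2) is NOT, so only the first two pairs are kept.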

There is another set, called the SHORT set, made up of paired-end reads whose two ends are very close to each other (within 1 kb). We have seen that this information needs to be included in our analysis because the results are biased without it. Once the SHORT set is included, the results are much better.
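As a rough sketch of how read pairs could be split by the distance between their two ends, using the thresholds reported in these minutes (short: less than 1 kb; long: 10 or 20 kb up to 1 Mb). The function and constant names are hypothetical, the 10 kb lower bound is one of the two values mentioned, and both ends are assumed to lie on the same chromosome.

```python
SHORT_MAX_BP = 1_000       # short: ends less than 1 kb apart
LONG_MIN_BP = 10_000       # long: lower bound (10 kb assumed; 20 kb was also mentioned)
LONG_MAX_BP = 1_000_000    # long: upper bound of 1 Mb


def classify_read_pair(pos1, pos2):
    """Return 'short', 'long' or 'other' for an intra-chromosomal read pair.

    pos1 and pos2 are the mapped coordinates (bp) of the two read ends.
    """
    distance = abs(pos2 - pos1)
    if distance < SHORT_MAX_BP:
        return "short"
    if LONG_MIN_BP <= distance <= LONG_MAX_BP:
        return "long"
    # Pairs outside both ranges are not used in this sketch.
    return "other"
```

Short read pairs would then be counted per bin and used as a covariate, while long read pairs provide the contact counts.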

DETECTING 3D INTERACTIONS

It starts with pre-processing and mapping. Mapping is done with BWA, followed by removal of invalid reads and bins (these can be removed for several reasons, such as low mappability or no restriction enzyme cut site). Reads are then split into two types: ‘long’ and ‘short’. Long reads have a distance of 10 or 20 kb to 1 Mb between the two ends; short reads have a distance of less than 1 kb between the two ends.

Modeling the expected number of interactions between bin pairs: the XOR set and the AND set need to be modeled separately because we have seen that they behave differently statistically. How do we model them?

* First, reads are binned into bin pairs at an appropriate resolution (5 or 10 kb).
* Second, we use Poisson regression to model the counts, which follow a positive (zero-truncated) Poisson distribution, with covariates such as GC content, mappability, restriction enzyme cut site frequency, 1D distance between bins and number of short reads.

FINDING SIGNIFICANT 3D INTERACTIONS

Significant interactions come as singletons or clusters. We are less confident that singletons are meaningful than we are for clusters. Both are called at FDR < 1%, and then a filter is applied to both (observed/expected count ratio >= 2 and number of reads >= 12, inspired by HiCCUPS). To ensure confidence in singletons, an additional criterion of FDR < 1e-4 is applied to them, so singletons must pass more stringent filtering to be called significant.

MAPS PIPELINE

MAPS is a computational pipeline for analyzing PLAC-seq-like data. It starts with fastq files and ends with peak calls and various analyses. MAPS contains several parts: mapping and pre-processing, data normalization, peak calling, and visualization and other analysis tools.
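The significance rules described above (FDR, observed/expected ratio, read count, and the stricter FDR for singletons) can be sketched as follows, together with the upper-tail probability of a positive (zero-truncated) Poisson distribution. This is an illustration under stated assumptions, not the MAPS implementation: the function names are hypothetical, the expected count is taken as given rather than fitted by regression, and the multiple-testing step that turns p-values into FDR values is not shown.

```python
import math


def ztp_upper_tail(k, lam):
    """P(X >= k) for a zero-truncated Poisson with rate lam (k >= 1).

    Standard formula: (1 - PoissonCDF(k - 1)) / (1 - exp(-lam)).
    Plain summation is used, so precision degrades for extreme tails.
    """
    if k < 1 or lam <= 0:
        raise ValueError("k must be >= 1 and lam must be > 0")
    term = math.exp(-lam)      # Poisson P(X = 0)
    cdf = term
    for j in range(1, k):      # accumulate P(X = 1) .. P(X = k - 1)
        term *= lam / j
        cdf += term
    return (1.0 - cdf) / (1.0 - math.exp(-lam))


def passes_filter(observed, expected, fdr, is_singleton):
    """Apply the thresholds quoted in the minutes to one bin pair."""
    if fdr >= 0.01:                                   # FDR < 1%
        return False
    if observed < 12:                                 # at least 12 reads
        return False
    if expected <= 0 or observed / expected < 2:      # observed/expected >= 2
        return False
    if is_singleton and fdr >= 1e-4:                  # stricter FDR for singletons
        return False
    return True
```

For example, a bin pair with 20 observed reads, 5 expected reads and FDR 1e-3 passes when it belongs to a cluster but fails when it is a singleton.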

H1 H3K4me3 PLAC-seq data analysis: For this data set, we have two biological replicates plus the merged dataset, all mapped to hg38. Interactions were called on autosomal chromosomes at 10 kb resolution, using ENCODE H3K4me3 ChIP-seq peaks as 1D anchors. An overview of the two replicates showed reproducibility. For the combined dataset, we reported the number of bin pairs, and the percentage of initial bin pairs, in the AND and XOR sets for significant interactions, singletons and cluster summits. For the sensitivity analysis (number of HiCCUPS loops overlapped by MAPS peaks), we used 819 filtered HiCCUPS loops; for MAPS peaks, chromosome 4 was removed.
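A sensitivity measure of this kind (fraction of reference loops overlapped by called peaks) could be computed along the following lines. The overlap rule is an assumption made for illustration: a loop counts as recovered when some peak on the same chromosome has both anchors within `slack_bp` of the loop anchors. The minutes do not state the criterion actually used, and the function name and tuple layout are hypothetical.

```python
def sensitivity(loops, peaks, slack_bp=10_000):
    """Fraction of reference loops recovered by called peaks.

    loops and peaks are lists of (chrom, anchor1_bp, anchor2_bp) tuples.
    slack_bp is an assumed tolerance (one 10 kb bin by default).
    """
    if not loops:
        raise ValueError("at least one reference loop is required")
    recovered = 0
    for chrom, a1, a2 in loops:
        if any(
            c == chrom and abs(p1 - a1) <= slack_bp and abs(p2 - a2) <= slack_bp
            for c, p1, p2 in peaks
        ):
            recovered += 1
    return recovered / len(loops)


# Toy data: the chr1 loop is recovered, the chr2 loop is not.
toy_loops = [("chr1", 100_000, 500_000), ("chr2", 200_000, 900_000)]
toy_peaks = [("chr1", 110_000, 500_000), ("chr2", 200_000, 700_000)]
toy_sensitivity = sensitivity(toy_loops, toy_peaks)
```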

CTCF motif orientation at identified interactions: We subset our peaks to those where both bins in the bin pair overlap CTCF ChIP-seq peaks, and within that subset we look for the CTCF motif. Thus, we end up with a set of bin pairs where both bins contain CTCF motifs. We then ask what the orientation of the CTCF motifs is. By chance we expect 25% to be convergent (one of the four possible orientation combinations), but when we look at all bin pairs that pass the filter, the convergent orientation is more frequent than that. If we restrict the analysis to cluster summits or singletons, the number goes up to 59%. This is encouraging, because we know that CTCF loops are associated with CTCF motifs in the convergent orientation, and we see that the CTCF motifs at our interactions are much more often in the convergent orientation.

NOTE: Not all the bin pairs contain CTCF motifs.
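The 25% chance expectation comes from the four possible strand combinations of two motifs, only one of which is convergent (forward motif at the upstream bin, reverse motif at the downstream bin). A minimal sketch, assuming each bin carries a single CTCF motif with a known strand; the function names and the toy data are hypothetical.

```python
from itertools import product

ORIENTATION = {
    ("+", "-"): "convergent",   # motifs point towards each other
    ("-", "+"): "divergent",    # motifs point away from each other
    ("+", "+"): "tandem",
    ("-", "-"): "tandem",
}


def orientation(upstream_strand, downstream_strand):
    """Classify a motif pair by the strands at the upstream and downstream bins."""
    return ORIENTATION[(upstream_strand, downstream_strand)]


def convergent_fraction(strand_pairs):
    """Fraction of bin pairs whose motifs are convergent."""
    if not strand_pairs:
        raise ValueError("at least one bin pair is required")
    hits = sum(orientation(u, d) == "convergent" for u, d in strand_pairs)
    return hits / len(strand_pairs)


# Chance expectation: all four strand combinations equally likely.
chance = convergent_fraction(list(product("+-", repeat=2)))

# Toy observed data (made up for illustration only).
toy_observed = convergent_fraction([("+", "-"), ("+", "-"), ("-", "+"), ("+", "+")])
```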

Do the bin pairs overlap CTCF-mediated loops? Yes. Each bin contains a CTCF motif.

Are these loops called by HiCCUPS or by MAPS? These loops are called by MAPS; not all of them are called by HiCCUPS.

COMPARISON WITH HICHIPPER

Additional analysis was done using hichipper, a preprocessing pipeline for calling DNA loops from HiChIP data (published in Nature Methods). It calls 1D peaks from HiChIP data and uses Mango to call 3D interactions. We used three PLAC-seq/HiChIP datasets to compare MAPS and hichipper.

MAPS vs hichipper peaks: hichipper peaks show bumps, a feature that appears consistently in hichipper results but not in MAPS. MAPS outperforms hichipper (higher reproducibility, higher recovery of HiCCUPS loops and a higher percentage of convergent CTCF motifs).

‘XOR’ target bins are enriched for cis-regulatory elements: Briefly, an XOR bin pair is one in which one bin intersects a ChIP-seq peak while the other does not. For example, in the bin pair [2,6], bin 6 overlaps a ChIP-seq peak and is called the anchor bin, while bin 2 does not overlap a ChIP-seq peak and is called the target bin (min 46 in the webex ppt). We collected all the XOR bin pairs and checked whether the target bins are enriched for cis-regulatory elements. This was done in the GM12878 datasets. We see that the target bins are enriched for multiple marks associated with cis-regulatory elements.