Omics Data Standards WG - Minutes 06-26-2017

AGENDA:

  • Discussion of norm test (Burak/DCIC)
  • Finishing up discussion of hg19 vs hg38
  • Compartment calling
  • Voting on specific quantitative thresholds for successful Hi-C experiments
  • Concepts of chromatin loops, domains (if time permits)

Discussion of norm test (Burak/DCIC)

ENCODE agreed to provide files in the formats requested by 4DN. The contents of the files will be discussed further if more convergence is desired. Agreement between 4DN and ENCODE on the pipelines would also be preferable.

For the side-by-side normalization comparison, Burak’s group ran into a bug in Juicer, which was fixed immediately, and they proceeded with the data processing.

They are using the four metrics currently in use by ENCODE 3 to check reproducibility, but these appear to be inadequate.

The normalization that would be “perfectly reproducible” is the one that replaces every entry with 1. We should be mindful that correlation can also rise as an artifact of the normalization process itself. Therefore, a visual assessment might be a better way to evaluate the normalization steps.
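
To make the concern concrete, here is a minimal sketch (not any group’s actual metric) of how a processing step can inflate a reproducibility score without adding information; the toy “normalization” here is simple smoothing, with the degenerate limit being the map-everything-to-1 normalization mentioned above:

```python
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(0)
n = 200

# Shared smooth "signal" plus independent per-replicate noise.
signal = 5 * uniform_filter(rng.normal(size=(n, n)), size=15)
rep1 = signal + rng.normal(size=(n, n))
rep2 = signal + rng.normal(size=(n, n))

def pearson(a, b):
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

print("raw correlation:   %.3f" % pearson(rep1, rep2))

# The same smoothing applied to both "replicates" suppresses the independent
# noise while keeping the shared signal, so the correlation rises even though
# no information was added.
print("after smoothing:   %.3f" % pearson(uniform_filter(rep1, size=15),
                                          uniform_filter(rep2, size=15)))
```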

Burak’s group will produce Juicer files for visualization but will need to convert their HiC files to that format. Erez’s group is considering providing the conversion tools (from HiC files to Juicer) in the future. The visualization files are expected to be available next week, although this is hard to predict.

The 4DN pipeline may need BAM file output; however, Juicer does not produce BAM files. Adding this is being planned, since it appears to be highly desired. Making BAM possible as an input/output option would take no more than a week of work, so BAM support is expected by next month.

Producing BAM files does not affect running the visualizations side by side.

(Job) How do we evaluate whether one normalization is better than another if we don’t know what to look at?

(Erez) The visualization check is more or less a sanity check for cases where you can pick an obvious winner. If not, we can run a statistical analysis. The other method is to build simulated data sets with known biases and see whether the normalizations can correct them.
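
A minimal sketch of that simulation idea, assuming a simple multiplicative bias model b_i·b_j and a naive iterative-correction (matrix-balancing) step; this stands in for, and is not, the actual Juicer/cooler implementations:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# "True" map: distance decay plus symmetric noise.
idx = np.arange(n)
true = 1.0 / (1.0 + np.abs(idx[:, None] - idx[None, :]))
true *= np.exp(rng.normal(scale=0.1, size=(n, n)))
true = (true + true.T) / 2

# Known per-bin biases, applied multiplicatively as b_i * b_j.
bias = np.exp(rng.normal(scale=0.5, size=n))
observed = true * bias[:, None] * bias[None, :]

def iterative_correction(m, n_iter=50):
    """Naive ICE: repeatedly divide by row/column sums until balanced."""
    m = m.copy()
    total_bias = np.ones(len(m))
    for _ in range(n_iter):
        s = m.sum(axis=1)
        s /= s.mean()
        total_bias *= s
        m /= s[:, None]
        m /= s[None, :]
    return m, total_bias

corrected, est_bias = iterative_correction(observed)

# Check whether the known, injected biases were recovered (up to scale).
print("log-bias correlation: %.3f"
      % np.corrcoef(np.log(est_bias), np.log(bias))[0, 1])
```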

(Mike from Mirny’s group) It is hard to simulate biases we do not know about. For example, the length of flanking sequence also introduces biases, but we were not aware of this previously.

(Erez) Exploratory methods will be important so that we can learn to see those biases.

(Max from Mirny’s group) How do we distinguish genuinely different data from data with different biases? The restriction enzyme will also cause different results (a 4-base cutter vs. a 6-base cutter, for example).

(Erez) A 6-base cutter costs much more for a loop-resolution map; therefore it is very rare these days for people to run it as a control.

The best approach might be to eyeball the results first and rule out some of the methods. CRISPRing out the regions may be the best way to verify the features. However, CRISPR will also generate other changes.

If CRISPR removes the motif anchoring a particular loop and the signal disappears in subsequent Hi-C, such an experiment will give considerable confidence that the loop calls are reliable.

However, this tests two things at once: the loop caller, and the hypothesis that the loop is mediated by CTCF. If either one fails, the loop will be called a false positive (even though it may not be a false positive of the caller; the CTCF hypothesis may simply not hold).

Such compound assertions are necessary because the atomic hypotheses are sometimes not easy to test; as a compromise, the compound assertions have to be used instead.

We should do some of this (CRISPR) as part of the verification process, and the joint analysis group may be interested in that. This would be more valuable than running statistical tests alone.

Finishing up discussion of hg19 vs hg38

ENCODE will be using hg38 and is reprocessing its data with hg38. Processing with hg19 at the same time would be costly. Important legacy papers will be reprocessed by the DCIC if a short list can be determined.

The issue is that most deeply processed Hi-C papers are still published on hg19.

There is no single bottleneck to supporting both references; however, supporting both requires more manpower (front end, etc.). We would also like to lead the community in shifting to hg38.

Is it possible to simply convert the coordinates from hg19 to hg38?

There are liftOver tools for 1-D tracks; for 2-D tracks, you might need to do a double liftOver. A problem arises when only one of the two coordinates can be lifted over. If 4DN does not support hg19, then a good liftOver tool would be needed to convert all old coordinates to the new reference.
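
As an illustration of the double liftOver, here is a sketch using the third-party pyliftover package (an assumption for illustration; no tool has been settled on), lifting both anchors of a contact and dropping the pair when either anchor fails:

```python
from pyliftover import LiftOver

lo = LiftOver('hg19', 'hg38')  # fetches the UCSC chain file on first use

def lift_point(chrom, pos):
    """Return (chrom, pos) on hg38, or None if the point does not map."""
    hits = lo.convert_coordinate(chrom, pos)
    if not hits:  # None (unknown chrom) or [] (unmapped position)
        return None
    return hits[0][0], hits[0][1]

def lift_contact(chrom1, pos1, chrom2, pos2):
    """Lift both anchors of a 2-D contact; drop the pair if either fails."""
    a, b = lift_point(chrom1, pos1), lift_point(chrom2, pos2)
    if a is None or b is None:
        return None  # the problematic one-sided case discussed above
    return a + b

print(lift_contact('chr1', 1000000, 'chr1', 2000000))
```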

ENCODE is going to use a flat hg38 (removing all the alternate contigs), and 4DN will use a flat one as well.

The DCIC will write up a summary of the choice of hg38 patch 15 as the reference and of how to proceed in the future.

Compartment calling

Who has compartment-calling code that they think the consortium should adopt?

The Mirny lab has one that does not treat compartments as a strict division but instead computes a compartment profile based on eigenvectors. It would also be possible to classify regions into two compartments simultaneously.

The Erez lab also has code to call compartments; nested A/B compartments will behave differently.

There are probably a larger number of compartments than just A and B, but the community is not good at finding them.

There is also the possibility of using one continuous vector to represent all compartments. This helps when there are compartments that lie between A and B but behave differently (for example, one closer to A and the other closer to B). With compartmentalization, it is very hard to project the 2-D structure onto the 1-D genome.
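
For reference, a minimal sketch of the standard eigenvector computation alluded to above (not the Mirny or Aiden lab code): the sign of the leading eigenvector of the observed/expected correlation matrix gives a binary A/B call, while its continuous values serve as the compartment profile:

```python
import numpy as np

def compartment_profile(contact_map):
    """First eigenvector of the correlation of the observed/expected map."""
    n = len(contact_map)
    obs_exp = contact_map.astype(float).copy()
    # Divide each diagonal by its mean to remove the distance-decay expectation.
    for d in range(n):
        i = np.arange(n - d)
        diag = obs_exp[i, i + d]
        mean = diag.mean()
        if mean > 0:
            obs_exp[i, i + d] = diag / mean
            if d > 0:
                obs_exp[i + d, i] /= mean
    corr = np.corrcoef(obs_exp)
    vals, vecs = np.linalg.eigh(corr)   # eigh returns ascending eigenvalues
    return vecs[:, -1] * np.sqrt(vals[-1])

# Strict A/B division vs. the continuous profile:
# profile = compartment_profile(m)
# ab = np.where(profile > 0, 'A', 'B')
```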

TADs appear to be better defined and more suitable for this type of task. Compartments are not as clean.

Compartment profiles actually add some information. Compartments may need to be called in different ways, and current papers are not utilizing them very much. This may be because there are few reliable data sets and tools for compartments. The DCIC could contribute a great deal by looking carefully at the compartment callers to find ways to improve the status quo.

Compartments may also benefit TAD calling. For example, if a compartment boundary overlaps a TAD boundary, the TAD call will be much stronger. Some papers call compartments and TADs with the same algorithm, and there appears to be much confusion about this in the field.
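
A sketch of that boundary-support idea, with hypothetical positions and tolerance (real callers would work on binned coordinates):

```python
import numpy as np

def supported_boundaries(tad_bounds, compartment_bounds, tol=50_000):
    """Boolean mask: True where a TAD boundary lies within `tol` bp of some
    compartment boundary (and so deserves extra confidence)."""
    tad = np.asarray(tad_bounds)
    comp = np.sort(np.asarray(compartment_bounds))
    # Distance from each TAD boundary to the nearest compartment boundary.
    pos = np.searchsorted(comp, tad)
    left = np.abs(tad - comp[np.clip(pos - 1, 0, len(comp) - 1)])
    right = np.abs(tad - comp[np.clip(pos, 0, len(comp) - 1)])
    return np.minimum(left, right) <= tol

tads = [120_000, 480_000, 1_030_000]
comps = [100_000, 1_000_000]
print(supported_boundaries(tads, comps))  # [ True False  True]
```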

We can include the reliability of compartment callers in our evaluation of the reliability of domain callers, since the former can contribute to the latter.