Omics Data Standards WG - Minutes 11-27-2017

Agenda

  1. Update on PLAC sub working group discussion (Miao Yu)
  2. Update on single cell Hi-C (Burak Alver)
  3. Update on allelic analysis of Omics data (Burak Alver or someone)
  4. Feedback on and discussion of DNase Hi-C and sciHi-C

Update on PLAC Hi-ChIP sub-working group

The first discussion was held in mid-August. The main focus was on experimental protocols. Some groups are still optimizing their protocols.

The most popular marks and TFs are H3K4me3, H3K27Ac, Pol2 and CTCF.

The Hi-C steps, such as cell lysis and digestion, are largely the same as in standard Hi-C protocols.

The most uncertain part is the IP step, which depends on the type of tissues and cells being studied. The open parameters include sonication, IP conditions, etc.

The sub-working group is collecting ideas from the participants and will discuss them at the next meeting.

Ren lab shared their protocols for H3K4me3, H3K27Ac and Pol2. Interested groups can follow those protocols if they do not yet have their own. The sub-working group is also open to other protocols as long as the QC metrics can be used to evaluate the results.

The QC metrics for PLAC seq data include the in situ Hi-C metrics, such as mappability, trans rate and the ratio of long-range reads. The second part of QC covers the IP step, for example the enrichment of the targeted transcription factor or marker in the final data. If no enrichment is observed, the IP is considered failed.
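As an illustration of the Hi-C part of these metrics, the following is a minimal sketch of how trans rate and a long-range cis fraction could be computed from a list of mapped contacts. The `Contact` record, the `hic_qc` function and the 20 kb long-range cutoff are assumptions made for this example; they were not specified in the meeting, and mappability is not covered here.

```python
from typing import Iterable, NamedTuple


class Contact(NamedTuple):
    """One mapped read pair (hypothetical record layout for this sketch)."""
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int


def hic_qc(contacts: Iterable[Contact], long_range_bp: int = 20_000) -> dict:
    """Compute simple Hi-C QC ratios.

    long_range_bp is an assumed cutoff, not a value agreed on by the group.
    """
    total = trans = long_cis = 0
    for c in contacts:
        total += 1
        if c.chrom1 != c.chrom2:
            # Both ends on different chromosomes: a trans contact.
            trans += 1
        elif abs(c.pos2 - c.pos1) >= long_range_bp:
            # Same chromosome and at least long_range_bp apart.
            long_cis += 1
    if total == 0:
        raise ValueError("no contacts supplied")
    cis = total - trans
    return {
        "total": total,
        "trans_rate": trans / total,
        "long_range_cis_fraction": long_cis / cis if cis else 0.0,
    }
```

The same trans and cis counts would also give the trans-to-cis ratio; which thresholds count as passing is left to the sub-working group.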

For the IP part of PLAC, it is suggested that a regular ChIP-seq experiment be conducted with the antibody as a test. If the ChIP-seq fails, the antibody should not be used.

Ideally, when people choose antibodies for PLAC seq, they choose the ones that have been validated by ENCODE. However, those batches are very likely to be sold out already.

There may be a separate 4DN process to approve antibodies (4DN and ENCODE may be interested in different antibodies). ChIA-PET antibodies may also be considered in building this common QC standard for antibodies.

The procedure for validating antibodies matters as much as the specific antibodies themselves. ENCODE has such a procedure, which specifies the assays that need to be done for a new antibody. 4DN will adopt the ENCODE validation procedure rather than restrict itself to ENCODE-validated antibodies.

Tomorrow's sub-working group meeting will hopefully finalize the protocols for markers and transcription factors. More emphasis will also be put on antibody validation. Different antibodies for markers such as CTCF have been tested; the sensitivity appears to be good, and the performance is still being evaluated.

The ENCODE protocol also covers specificity in addition to sensitivity. This should not be a problem for CTCF, but for other markers, more validation may be needed.

For specificity testing, there is a company called Ciphergen that builds synthetic nucleosomes for use as controls in antibody testing. They have tested the ENCODE datasets and found that a significant portion of the antibodies used by ENCODE show cross-reactivity with their synthetic controls. ENCODE may change its procedure based on this finding, and 4DN will likely need to respond as well by finding a systematic way to evaluate specificity.

Single-cell Hi-C update

All information about this sub-working group is available on the wiki and the OMICS calendar. The first meeting was on data standards. The experiments had very different protocols and different representations (thousands of cells, cells with different barcodes, one cell, etc.). An open question is how to represent multiplexed cells in single cell experiments.

One possible solution is to use demultiplexed FASTQ files as the common meeting point for every experiment. The other is to process each experiment type separately into a common downstream format.
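To make the first option concrete, here is a minimal sketch of demultiplexing FASTQ records into per-cell groups. It assumes, purely for illustration, that the cell barcode is the first `barcode_len` bases of each read; actual single cell Hi-C protocols place barcodes differently, and the function names `parse_fastq` and `demultiplex` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input is not a multiple of 4 lines")
    for i in range(0, len(rows), 4):
        header, seq, _plus, qual = rows[i:i + 4]
        yield header, seq, qual


def demultiplex(records: Iterable[FastqRecord],
                barcode_len: int) -> Dict[str, List[FastqRecord]]:
    """Group reads by an inline barcode (assumed to lead each read).

    The barcode is trimmed from both the sequence and the quality string.
    """
    per_cell: Dict[str, List[FastqRecord]] = defaultdict(list)
    for header, seq, qual in records:
        barcode = seq[:barcode_len]
        per_cell[barcode].append(
            (header, seq[barcode_len:], qual[barcode_len:])
        )
    return dict(per_cell)
```

Each key of the returned dictionary could then be written out as one file per cell, or all cells could be kept together with the barcode recorded per read; the sketch does not settle that choice.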

After demultiplexing, should there be one file per cell, or should all data from the same lane or batch go in one file? Different groups are making different choices; most use one cell per experiment. However, this may cause data from multiple experiments to be distributed among different files. More data standards discussion will be needed to address these questions.

Currently there has been little effort to standardize single cell datasets, nor is there a public repository dedicated to single cell datasets (although GEO has a huge number of single cell dataset entries).

Multiple groups plan to produce single-cell datasets (Jay’s group, Bill’s group, Peter Fraser’s group and David Gilbert’s group).

The sub-working group may need two to three more calls to finalize the standard.

Allelic analysis update

Topics for this sub-working group:

  • Samples. Which samples have been phased? How to measure phasing quality?
  • Standards for reporting the data
  • Data processing standards

David Gilbert’s group has presented their data processing approach, and Bill Noble’s group will present at the next meeting.

DCIC suggests separating samples from hybrid mice (high SNP density, easier to phase) from human samples (low SNP density, harder to phase) and deciding on two separate standards. This question will be discussed within the sub-working group.

The sub-working group may need six to eight more calls to finalize the standard.

Erez’s group has some published workflows for allelic data and will discuss internally whether to present them in the sub-working group.

DNase Hi-C and sciHi-C protocols

The DNase Hi-C protocol is quite mature and has been published in Nature Protocols. It is very detailed, including troubleshooting and QC metrics. The only QC metric is the same as in the in situ Hi-C protocols, that is, the ratio of trans reads to cis reads.

The sciHi-C protocol has less detail but is still detailed enough for people to start the experiments.

Any comments and questions are welcome. Please mark them on the PDF file and send the marked file back to Bing, Erez or the OMICS working group.

Both protocols will be discussed and voted on at the next meeting (Dec. 11th, 2017).