Omics Data Standards WG - Minutes 06-13-2016

Experiment Procedures, meta-data

  • Finalizing the experiment protocol. Derive the Hi-C standard from existing, published in situ Hi-C protocols. Determine which steps are variable, which should be unified, and how to document such variables precisely.
    • Digestion time may vary, depending on the exact type of DNA (e.g. mitochondrial vs. genomic).
    • In situ Hi-C is a fairly standard protocol already used by different labs. Two variations were brought up as worthwhile improvements:
    • Job Dekker: Dangling sequence removal is an important step that should be added to the standard protocol to improve library quality.
    • ?: For the temperature during overhang filling, 23°C is better than 37°C (possibly enzyme dependent?)
  • Other similar procedures may need a different protocol. We may define separate categories (like the narrow and broad peak categories for ChIP-seq in ENCODE), each possibly derived from the Hi-C protocol:
    • Micro-C procedure
    • DNase Hi-C (a different protocol from the ones with restriction enzymes)
    • How will the protocol be encoded in the meta-data structure?
    • Trade-off between tracking information in searchable fields vs. ease of logging.
    • Follow ENCODE's example: standard protocols are provided as PDFs, and variations become metadata fields.
    • Metadata for ENCODE ChIP-seq:
      • Biosample: description of cell, condition, treatment. We may need to expand to describe genome editing.
      • Antibody: we will not need this.
      • Protocol: fragmentation method, fragment length, and a handful of others(?)
    • For Hi-C, we will need to add a field for the restriction enzyme. (Information about crosslinking time, temperature, etc. may go in the PDF or in separate fields; the PDF is probably fine.)
    • Do we need to keep metadata for Hi-C QC prior to sequencing?
      • e.g. like antibody validation in ENCODE
      • We can't think of examples for Hi-C that need to be reported.
      • Assume data prep passed basic wet-lab QC in the prep stages. (There are natural go/no-go checkpoints; all data submitted to the DCIC is go/go/go.)
    • Computational QC data (e.g. read depth, PCR duplicate proportion, reproducibility, intra/inter-chromosomal contact ratio)
      • can be determined after the data is taken in by the DCIC.
  • Cross consortium data sharing/merging with ENCODE
    • DCIC: this will naturally be easy, since we are starting from ENCODE's metadata and database model.
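The metadata fields discussed above can be sketched as a minimal record. This is an illustrative mock-up only: all field names, the example values (cell type, enzyme, file name), and the helper function are assumptions for discussion, not an agreed-upon schema.

```python
# Hypothetical sketch of a Hi-C experiment metadata record, modeled on the
# ENCODE-style fields discussed in the minutes. Field names are illustrative.
hic_experiment = {
    "biosample": {
        "cell_type": "GM12878",            # description of cell (example value)
        "treatment": None,                  # condition / treatment
        "genome_modification": None,        # possible expansion for genome editing
    },
    "protocol": {
        "type": "in situ Hi-C",             # vs. Micro-C, DNase Hi-C
        "restriction_enzyme": "MboI",       # searchable field proposed for Hi-C
        "protocol_pdf": "in_situ_hic_v1.pdf",  # full details attached as a PDF
    },
    "computational_qc": {                   # filled in by the DCIC after intake
        "read_depth": None,
        "pcr_duplicate_fraction": None,
        "cis_trans_ratio": None,            # intra/inter-chromosomal contacts
    },
}

def required_fields_present(record):
    """Check the minimal searchable fields discussed in the meeting."""
    return ("restriction_enzyme" in record.get("protocol", {})
            and "cell_type" in record.get("biosample", {}))
```

The split mirrors the ENCODE discussion: searchable fields for the variables worth querying (enzyme, cell type), a PDF attachment for everything else, and a QC block populated downstream by the DCIC.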

Data formats

  • ENCODE provides raw files (fastq) and processed files in standard formats (bam, bed, narrowPeak, etc.)
  • DCIC proposes doing something analogous. For now, keep all intermediate files. We can get rid of some later (e.g. bam).
    • fastq: raw reads submitted to DCIC
    • bam: aligned, specifics of alignment TBD
    • valid pairs: gzipped text file of filtered read pairs (e.g. after PCR duplicate removal)
    • sparse contact matrices as gzipped text
    • a binary representation of contact matrices that allows indexing and fast access
  • Trade-off between using a commonly used format vs. developing a “better” one.
  • It is unclear whether commonly used formats exist for some of the above.
  • We need to agree on the columns of the specific text formats above.
  • Bill Noble and Job Dekker strongly support valid pairs being made available. (yes, they will be.)
  • DCIC has listed some requirements for a binary representation of contact matrices and the tools it should come with. It is not clear that a format meeting all of them exists. But even if not, the DCIC can serve one or multiple of the existing formats. New ones may come about, or the tooling around existing ones may grow.
  • Discussion to be continued.
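To make the "valid pairs" to "sparse contact matrix" step above concrete, here is a minimal sketch. The column layout (read_id, chrom1, pos1, strand1, chrom2, pos2, strand2) and the bin size are assumptions for illustration only; as noted above, the actual columns still need to be agreed.

```python
import gzip
from collections import Counter

BIN_SIZE = 10_000  # illustrative resolution; not an agreed value

def pairs_to_sparse(pairs_path, bin_size=BIN_SIZE):
    """Aggregate a gzipped valid-pairs text file into a sparse contact matrix.

    Assumes tab-separated columns (hypothetical layout):
    read_id, chrom1, pos1, strand1, chrom2, pos2, strand2.
    Returns a Counter mapping (chrom1, bin1, chrom2, bin2) -> contact count.
    """
    counts = Counter()
    with gzip.open(pairs_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):  # skip header/comment lines
                continue
            _, c1, p1, _, c2, p2, _ = line.rstrip("\n").split("\t")
            b1, b2 = int(p1) // bin_size, int(p2) // bin_size
            # keep one canonical orientation so (a, b) and (b, a) merge
            key = min((c1, b1), (c2, b2)) + max((c1, b1), (c2, b2))
            counts[key] += 1
    return counts
```

Writing the resulting (bin1, bin2, count) triplets back out as gzipped text gives exactly the sparse-matrix format listed above, while a binary indexed representation would be built on top for fast access.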