Omics Data Standards WG - Minutes 06-13-2016

Experiment Procedures, meta-data

  • Finalizing the experiment protocol. Derive the Hi-C standard from existing, published in situ Hi-C protocols. Determine which steps are variable, which should be unified, and how to document such variables precisely.
    • Digestion time may vary, depending on the exact type of DNA (e.g. mitochondrial vs. genomic).
    • In situ Hi-C is a fairly standard protocol already used by different labs. Two variations were brought up as worthwhile improvements:
    • Job Dekker: Dangling sequence removal is an important step that should be added to the standard protocol to improve library quality.
    • ?: For the temperature during overhang filling, 23°C is better than 37°C (possibly enzyme dependent?)
  • Other similar procedures may need a different protocol. We may define separate categories (like the narrow and broad peak categories for ChIP-seq in ENCODE), each possibly derived from the Hi-C protocol:
    • Micro-C procedure
    • DNase Hi-C (a different protocol from the ones with restriction enzymes)
    • How will the protocol be encoded in the meta-data structure?
    • Trade-off between tracking information in searchable fields vs. ease of logging.
    • Follow ENCODE's example: standard protocols are provided as PDFs, and variations become metadata fields.
    • Metadata for ENCODE ChIP-seq:
      • Biosample: description of cell, condition, treatment. We may need to expand to describe genome editing.
      • Antibody: we will not need this.
      • Protocol: fragmentation method, fragment length, and a handful of others(?)
    • For Hi-C, we will need to add a field for the restriction enzyme. (Information about crosslinking time, temperature, etc. may go in the PDF or in separate fields; the PDF is probably fine.)
    • Do we need to keep metadata for Hi-C QC prior to sequencing?
      • e.g. like antibody validation in ENCODE
      • We can't think of examples for Hi-C that need to be reported.
      • Assume data prep passed basic wet-lab QC in the prep stages. (There are natural go/no-go checkpoints; all data submitted to the DCIC is go/go/go.)
    • Computational QC data (e.g. read depth, PCR duplicate proportion, reproducibility, intra/inter-chromosomal contact ratio)
      • can be determined after the data is taken in by the DCIC.
  • Cross consortium data sharing/merging with ENCODE
    • DCIC: this will naturally be easy, since we are starting from ENCODE's metadata and database model.
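The metadata fields discussed above can be sketched as a minimal record. This is an illustrative mock-up only: all field names, the example values (cell type, enzyme, file name), and the helper function are assumptions for discussion, not an agreed-upon schema.

```python
# Hypothetical sketch of a Hi-C experiment metadata record, modeled on the
# ENCODE-style fields discussed in the minutes. Field names are illustrative.
hic_experiment = {
    "biosample": {
        "cell_type": "GM12878",            # description of cell (example value)
        "treatment": None,                  # condition / treatment
        "genome_modification": None,        # possible expansion for genome editing
    },
    "protocol": {
        "type": "in situ Hi-C",             # vs. Micro-C, DNase Hi-C
        "restriction_enzyme": "MboI",       # searchable field proposed for Hi-C
        "protocol_pdf": "in_situ_hic_v1.pdf",  # full details attached as a PDF
    },
    "computational_qc": {                   # filled in by the DCIC after intake
        "read_depth": None,
        "pcr_duplicate_fraction": None,
        "cis_trans_ratio": None,            # intra/inter-chromosomal contacts
    },
}

def required_fields_present(record):
    """Check the minimal searchable fields discussed in the meeting."""
    return ("restriction_enzyme" in record.get("protocol", {})
            and "cell_type" in record.get("biosample", {}))
```

The split mirrors the ENCODE discussion: searchable fields for the variables worth querying (enzyme, cell type), a PDF attachment for everything else, and a QC block populated downstream by the DCIC.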

Data formats

  • ENCODE provides raw files (fastq) and processed files in standard formats (bam, bed, narrowPeak, etc.)
  • DCIC proposes doing something analogous. For now, keep all intermediate files. We can get rid of some later (e.g. bam).
    • fastq: raw reads submitted to DCIC
    • bam: aligned, specifics of alignment TBD
    • valid pairs: gzipped text file of filtered read pairs (e.g. after PCR duplicate removal)
    • sparse contact matrices as gzipped text
    • a binary representation of contact matrices that allows indexing and fast access
  • Trade-off between using a commonly used format vs. developing a “better” one.
  • It is unclear whether commonly used formats exist for some of the above.
  • We need to agree on the columns of the specific text formats above.
  • Bill Noble and Job Dekker strongly support valid pairs being made available. (yes, they will be.)
  • DCIC has listed some requirements for a binary representation of contact matrices and the tools it should come with. It is not clear that a format meeting all of them exists. But even if not, the DCIC can serve one or multiple of the existing formats. New ones may come about, or the tooling around existing ones may grow.
  • Discussion to be continued.
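To make the "valid pairs" to "sparse contact matrix" step above concrete, here is a minimal sketch. The column layout (read_id, chrom1, pos1, strand1, chrom2, pos2, strand2) and the bin size are assumptions for illustration only; as noted above, the actual columns still need to be agreed.

```python
import gzip
from collections import Counter

BIN_SIZE = 10_000  # illustrative resolution; not an agreed value

def pairs_to_sparse(pairs_path, bin_size=BIN_SIZE):
    """Aggregate a gzipped valid-pairs text file into a sparse contact matrix.

    Assumes tab-separated columns (hypothetical layout):
    read_id, chrom1, pos1, strand1, chrom2, pos2, strand2.
    Returns a Counter mapping (chrom1, bin1, chrom2, bin2) -> contact count.
    """
    counts = Counter()
    with gzip.open(pairs_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):  # skip header/comment lines
                continue
            _, c1, p1, _, c2, p2, _ = line.rstrip("\n").split("\t")
            b1, b2 = int(p1) // bin_size, int(p2) // bin_size
            # keep one canonical orientation so (a, b) and (b, a) merge
            key = min((c1, b1), (c2, b2)) + max((c1, b1), (c2, b2))
            counts[key] += 1
    return counts
```

Writing the resulting (bin1, bin2, count) triplets back out as gzipped text gives exactly the sparse-matrix format listed above, while a binary indexed representation would be built on top for fast access.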