==== Omics Data Standards WG - Minutes 09-26-2016 ====

|~~TABLE_CELL_WRAP_START~~<WRAP>
Omics Year 1 Report by Bing Ren

Achievements

  * Common experiment protocol
  * Data file formats

Goals for Year 2

  * Data standards for Hi-C and ChIA-PET
  * Quality metric for chromatin organization features

Main challenges and obstacles for OMICS group

  * Gold standards for chromatin features
  * Definition of a “reference 4D genome” (probably a reference sweep list of terms, etc., rather than a detailed “reference genome”)

Build consensus on common terminology

Data standards for Hi-C:

  * Needed when external experimental groups submit their own Hi-C data
  * The data standards should reflect which types of features the datasets are trying to resolve, because different features require vastly different levels of resolution (low for compartments/domains but high for loops)
      * Do we need to enforce a minimum sequencing depth for Hi-C data? ENCODE has a requirement of 20M reads for phase II and 40M for phase III to ensure the number of binding sites (peaks) does not appear limited.
      * Datasets may be divided into two (or more) categories by resolution.
  * Data will have a large amount of heterogeneity
  * Standard libraries can provide a good “sanity test” for newly generated data for better aggregation analysis
  * Categorize quality control libraries
      * Should groups run their internal controls individually, or should a QC standard be mandated for all datasets?

Determine the minimum numbers of reads required for datasets

  * The number of distinct reads also appears to be very important, because a library can have many reads but very low molecular complexity, which wastes reads
  * Reproducibility analysis
  * However, saturation analysis by the Erez group showed that there may be less benefit once the number of unique reads reaches a certain level (~2B-3B contacts, i.e. read pairs in the library, assuming a high-quality library).
  * Inter-chromosomal and intra-chromosomal data may need to be considered separately because of the underlying biological processes.
  * Define a minimal standard covering minimum read counts, read depths and quality control of the libraries.
  * Rolling out a standard early may be more beneficial (it can be raised later on), so that people don’t have to regenerate data once a standard is implemented for the “reference genome”.
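
As an illustration of the points above on distinct reads and on treating intra- and inter-chromosomal contacts separately, the following is a minimal sketch and not a method agreed on in the meeting. The read-pair representation ''(chrom1, pos1, chrom2, pos2)'', the function names ''summarize_library'' and ''meets_minimum'', and the threshold arguments are all hypothetical placeholders; the group has not set any actual cutoff values.

```python
from collections import Counter


def summarize_library(read_pairs):
    """Summarize a Hi-C library.

    read_pairs: iterable of (chrom1, pos1, chrom2, pos2) tuples.
    This representation is an assumption made for illustration only.
    """
    counts = Counter(read_pairs)
    total = sum(counts.values())   # all read pairs, duplicates included
    distinct = len(counts)         # unique read pairs (molecular complexity)
    # Count distinct contacts whose two ends lie on the same chromosome.
    intra = sum(1 for c1, _, c2, _ in counts if c1 == c2)
    return {
        "total": total,
        "distinct": distinct,
        "distinct_fraction": distinct / total if total else 0.0,
        "intra": intra,
        "inter": distinct - intra,
    }


def meets_minimum(summary, min_distinct, min_fraction):
    """Check a summary against placeholder thresholds (not agreed values)."""
    return (summary["distinct"] >= min_distinct
            and summary["distinct_fraction"] >= min_fraction)


# Toy input: four read pairs, one of which is an exact duplicate.
pairs = [
    ("chr1", 100, "chr1", 5000),
    ("chr1", 100, "chr1", 5000),  # duplicate of the previous pair
    ("chr1", 200, "chr2", 300),   # inter-chromosomal contact
    ("chr2", 50, "chr2", 900),
]
summary = summarize_library(pairs)
print(summary)
```

A library with many reads but a low ''distinct_fraction'' would be flagged by such a check even if its total read count looked sufficient, which is the concern raised about wasted reads.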

</WRAP>~~TABLE_CELL_WRAP_STOP~~|