Omics Year 1 Report by Bing Ren
Achievements
Goals for Year 2
Main challenges and obstacles for OMICS group
Build consensus on common terminology
Data standards for Hi-C:
Needed when external experimental groups submit their own Hi-C data
The data standards should reflect the types of features each dataset is trying to resolve, because different features require vastly different resolution (low for compartments/domains, high for loops)
Do we need to enforce a minimum sequencing depth for Hi-C data? For comparison, ENCODE requires 20M reads for phase II and 40M for phase III so that the number of detected binding sites (peaks) does not appear limited by depth.
Datasets may be divided into two (or more) categories by resolution.
Data will have a large amount of heterogeneity
Standard libraries can provide a good “sanity test” for newly generated data, which supports better aggregate analysis
Categorize quality control libraries
Determine the minimum number of reads required for datasets
The number of distinct reads also appears to be very important: a library can have many reads but very low molecular complexity, which wastes reads
Reproducibility analysis
However, saturation analysis by the Erez group showed that there may be less benefit once the number of unique reads reaches a certain level (~2B-3B contacts, i.e. read pairs in the library, assuming a high-quality library).
Inter-chromosomal and intra-chromosomal data may need to be considered separately because of the different underlying biological processes.
Define a minimal standard covering minimum read counts, read depth, and library quality control.
Releasing a standard early (it can be raised later) may be more beneficial, so that people don’t have to regenerate data for the “reference genome” once a standard is implemented.
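A minimal sketch of the library-level checks discussed in these notes (distinct contacts as a proxy for molecular complexity, the intra- vs. inter-chromosomal split, and a minimum-depth test). It assumes contacts are available as simple pairs of (chromosome, position) ends. The names `Contact`, `summarize_library`, and `min_distinct_contacts` are hypothetical, and the threshold used in the example is a placeholder, not a proposed standard.

```python
from typing import Iterable, NamedTuple


class Contact(NamedTuple):
    """One Hi-C read pair (hypothetical minimal representation)."""
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int


def summarize_library(contacts: Iterable[Contact], min_distinct_contacts: int) -> dict:
    """Summarize depth, complexity, and cis/trans composition of a library.

    min_distinct_contacts is an assumed, user-supplied threshold; the notes
    above do not fix a value for it.
    """
    total = 0
    distinct = set()
    for c in contacts:
        total += 1
        # Order the two ends so that (A, B) and (B, A) count as the same contact.
        key = tuple(sorted([(c.chrom1, c.pos1), (c.chrom2, c.pos2)]))
        distinct.add(key)

    n_distinct = len(distinct)
    # Intra-chromosomal (cis) contacts have both ends on the same chromosome.
    cis = sum(1 for end_a, end_b in distinct if end_a[0] == end_b[0])
    trans = n_distinct - cis

    return {
        "total": total,
        "distinct": n_distinct,
        # A low fraction means many reads but low molecular complexity.
        "distinct_fraction": n_distinct / total if total else 0.0,
        "cis": cis,
        "trans": trans,
        "passes_min_depth": n_distinct >= min_distinct_contacts,
    }


# Toy example: four read pairs, one of which is a duplicate.
example = [
    Contact("chr1", 100, "chr1", 5000),
    Contact("chr1", 100, "chr1", 5000),  # duplicate of the first pair
    Contact("chr1", 200, "chr2", 300),   # inter-chromosomal
    Contact("chr3", 10, "chr3", 90),
]
report = summarize_library(example, min_distinct_contacts=3)
print(report)
```

Reporting cis and trans counts separately keeps the two classes available for separate evaluation, and the threshold is a parameter so that an early, modest standard could be raised later without changing the check itself.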