==== Omics Data Standards WG - Minutes 09-26-2016 ====

|~~TABLE_CELL_WRAP_START~~
Omics Year 1 Report by Bing Ren

Achievements
  * Common experiment protocol
  * Data file formats

Goals for Year 2
  * Data standards for Hi-C and ChIA-PET
  * Quality metric for chromatin organization features

Main challenges and obstacles for OMICS group
  * Gold standards for chromatin features
  * Definition of a “reference 4D genome” (probably a reference sweep list of terms, etc., rather than a detailed “reference genome”)

Build consensus on common terminology

Data standards for Hi-C:
  * Needed when external experimental groups submit their own Hi-C data
  * The data standards should reflect the types of features the datasets are trying to resolve, because different features require vastly different levels of resolution (low for compartments/domains but high for loops)
  * Do we need to enforce a minimum sequencing depth for Hi-C data? ENCODE has a requirement of 20M reads for phase II and 40M for phase III to ensure the number of binding sites (peaks) does not appear limited.
  * Datasets may be divided into two (or more) categories by resolution.
  * Data will have a large amount of heterogeneity
  * Standard libraries can provide a good “sanity test” for newly generated data, for better aggregation analysis
  * Categorize quality control libraries
  * Should groups run their internal controls individually, or should a QC standard be mandated for all datasets?

Determine the minimum number of reads required for datasets
  * The number of distinct reads also appears to be very important, because a library can have many reads but very low molecular complexity, which wastes reads
  * Reproducibility analysis
  * However, saturation analysis by the Erez group showed that there may be less benefit once the number of unique reads reaches a certain level (~2B-3B contacts, i.e. read pairs in the library, assuming a high quality library)
  * Inter-chromosomal and intra-chromosomal data may need to be considered separately because of the underlying biological processes.
  * Define a minimal standard covering the minimal read counts, read depths and quality control of the libraries.
  * Pushing out a standard early may be more beneficial (it can be raised later on), so that people don’t have to regenerate data once a standard is implemented in the future for the “reference genome”.
~~TABLE_CELL_WRAP_STOP~~|
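The point about distinct reads versus total reads can be illustrated with a minimal Python sketch. It is not a method agreed at the meeting. The function name `summarize_complexity`, the `(chrom1, pos1, chrom2, pos2)` key layout and the toy data are all hypothetical, and exact-match deduplication is only a simplifying assumption. The sketch reports how many read pairs are distinct and splits the distinct pairs into intra- and inter-chromosomal contacts, since the minutes suggest considering the two separately.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple

# Hypothetical key layout for one read pair: (chrom1, pos1, chrom2, pos2).
PairKey = Tuple[str, int, str, int]


class ComplexitySummary(NamedTuple):
    total_pairs: int        # all read pairs, duplicates included
    distinct_pairs: int     # read pairs left after exact-match deduplication
    distinct_fraction: float
    intra_distinct: int     # distinct pairs with both ends on one chromosome
    inter_distinct: int     # distinct pairs spanning two chromosomes


def summarize_complexity(pairs: Iterable[PairKey]) -> ComplexitySummary:
    """Count total vs. distinct read pairs (assumption: exact-key duplicates)."""
    counts = Counter(pairs)
    total = sum(counts.values())
    distinct = len(counts)
    intra = sum(1 for key in counts if key[0] == key[2])
    fraction = distinct / total if total else 0.0
    return ComplexitySummary(total, distinct, fraction, intra, distinct - intra)


# Toy library (made-up data): many reads, low molecular complexity.
toy_pairs = [
    ("chr1", 100, "chr1", 5000),
    ("chr1", 100, "chr1", 5000),
    ("chr1", 100, "chr1", 5000),
    ("chr2", 300, "chr5", 900),
]
summary = summarize_complexity(toy_pairs)
print(summary)
```

In this toy library only half of the four read pairs are distinct, which is the situation the minutes describe as a waste of reads. A minimal standard could state its read-count requirement in terms of distinct pairs rather than raw reads.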