This shows you the differences between two versions of the page.
4dn:phase1:working_groups:omics_data_standards:minutes-10-24-2016 [2019/02/15 09:01] oh |
4dn:phase1:working_groups:omics_data_standards:minutes-10-24-2016 [2025/04/22 16:21] (current) |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ==== Omics Data Standards WG - Minutes 09-26-2016 ==== | + | ==== Omics Data Standards WG - Minutes 10-24-2016 ==== |
|~~TABLE_CELL_WRAP_START~~<WRAP> | |~~TABLE_CELL_WRAP_START~~<WRAP> | ||
- | Omics Year 1 Report by Bing Ren | + | Finalize recommendations on the Hi-C data standard, and particularly regarding sequencing depth, replicate number, and other criteria required for data submission. |
- | Achievements | + | Replicates |
- | * Common experiment protocol | + | * Do we need minimum data requirements (such as a minimum number of replicates) to ensure the quality of results? |
- | * Data file formats | + | * How would replicates help the final results? E.g., are three replicates always better than two? |
- | \__ | + | * Some metrics are needed to tell whether the replicates are concordant |
+ | * Bill Noble’s group (and several other groups) is working on the reproducibility of 4DN results. However, it is still hard to set a threshold at the moment. More datasets may be needed to establish a threshold, and some datasets may need to be redone at that point. | ||
+ | * Code for computing the metrics may be shared once it is ready. | ||
+ | * Two biological replicate samples have been used previously. (Metrics are being developed by other groups. The metrics have been compared, but thresholds haven’t been decided.) | ||
+ | * Are two different cultures of the same cell line considered biological replicates? Traditionally yes: two independently cultured samples are biological replicates (Job Dekker). But one culture split into two samples may be considered technical replicates instead. (See below) | ||
+ | * Nomenclature of replicates. | ||
- | Goals for Year 2 | + | Anisogenic biological replicate: same cell type from different individuals.\\ Isogenic biological replicate: same cell type from same individual, separate cultures.\\ Technical replicate: same culture, library prepared separately.\\ ENCODE dropped “biological replicates” and uses “isogenic replicates” universally. (There are non-isogenic biological replicates, though they may have worse reproducibility.) |
- | * Data standards for Hi-C and ChIA-PET | + | * For Hi-C we will need two biological replicates (independently grown / engineered / manipulated). Both isogenic and non-isogenic replicates count as biological replicates, although isogenic ones are preferred. (no recommendations, see below) |
- | * Quality metric for chromatin organization features | + | * Non-isogenic information is not incorporated in the metadata yet. This needs to be made more explicit to data-generating and submitting groups. |
- | \__ | + | * Would non-isogenic samples actually be more beneficial because of the variability? However, isogenic samples are still needed to see how much of the difference comes from the non-isogenic part. We probably won’t recommend either isogenic or non-isogenic replicates and will leave that decision to data submitters. |
+ | * There may be cases when only one clone is available so we may need to just recommend two biological replicates instead of requiring them | ||
+ | * Such recommendations / requirements may also need to be coordinated with imaging groups to meet their imaging needs | ||
- | Main challenges and obstacles for OMICS group | + | Sequencing Depth |
- | * Gold standards for chromatin features | + | * Saturation of sequencing depth reaches 50% at 3B reads (from the curve presented at the 4DN meeting); however, that is still much more than most datasets have now. |
- | * Definition for “reference 4D genome” (probably a reference sweep list for terms and etc. rather than a detailed “reference genome”) | + | * Requiring the datasets to be saturated may be too restrictive for most (~0.5B reads per lane). |
- | \__ | + | * 1B raw reads (~40% of which are high-quality inter-chromosomal / long-range pairs, so ~400M inter-chromosomal / long-range contacts) across replicates might be needed for loop calling. If the quality is low, then more raw reads may be needed. |
- | + | * The more reads, the better, but a practical standard may be 500M raw reads across all replicates. | |
- | Build consensus in common terminology | + | * Metrics indicating a high-quality sequencing library |
- | + | * Many of them, like the cis/trans ratio, may vary across sample types |
- | \__ | + | * Some metrics still apply, such as number of PCR duplicates and unmappable reads |
- | + | * Distributions of such parameters may be used after accepting data to establish a standard | |
- | Data standards for Hi-C: | + | * Better to phrase this as a recommendation as lots of groups still cannot reach 500M raw reads/replicate? |
- | + | * Groups may make an effort to meet such a standard once it’s determined and it’s doable with current technology. |
- | * Needed when external experimental groups submit their own Hi-C data | + | * If 500M is the typical number of reads per sample, the low bar needs to be less than that. (for example, 400M/replicate) |
- | * The data standards should reflect what types of feature the datasets are trying to resolve because different features will require vastly different levels of resolution (low for compartment/domains but high for loops) | + | * Can more replicates be combined to achieve a similar result? (1B across replicates) |
- | * Do we need to enforce a minimum sequencing depth for Hi-C data? ENCODE has a requirement of 20M reads for phase II and 40M for phase III to ensure the number of binding sites (peaks) does not appear limited. | + | * Filtering may be used to select datasets with enough reads for loop calling, etc. |
- | * Datasets may be divided into two (or more) categories by resolution. | + | * We can start by recommending 500M/replicate x 2 replicates and having 400M/replicate x 2 as a requirement and establishing a higher tier for loop-calling datasets. |
- | * Data will have a large amount of heterogeneity | + | * May need to share this requirement / recommendation with other groups in the 4DN Network for comments. |
- | * Standard libraries can provide a good “sanity test” for newly generated data for better aggregation analysis | + | |
- | * Categorize quality control libraries | + | |
- | * Do groups do their internal controls individually or mandate a QC standard for all datasets | + | |
- | \__ | + | |
- | + | ||
- | Determine the minimum numbers of reads required for datasets | + | |
- | + | ||
- | * The number of distinct reads also appears to be very important, because it is possible to have lots of reads but very low molecular complexity, which wastes reads | |
- | * Reproducibility analysis | + | |
- | * However, saturation analysis by Erez group showed that there may be less benefit once the number of unique reads reaches certain level. (~2B-3B contacts, i.e. read pairs in the library assuming a high quality library) | + | |
- | * Inter-chromosomal and intra-chromosomal data may need to be considered separately because of the underlying biological process. | |
- | * Define a minimal standard with the minimal read counts, read depths and quality control of the libraries. | |
- | * Early standard pushout may be more beneficial (can be raised later on) so that people don’t have to regenerate data once a standard is implemented in the future for the “reference genome”. | + | |
</WRAP>~~TABLE_CELL_WRAP_STOP~~| | </WRAP>~~TABLE_CELL_WRAP_STOP~~| | ||
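The depth figures discussed under Sequencing Depth can be checked with a short script. This is only a sketch of the starting proposal in the minutes (400M raw reads/replicate x 2 as a requirement, 500M/replicate x 2 as a recommendation, and roughly 1B raw reads across replicates for loop calling, of which ~40% are usable long-range pairs). The function and class names are hypothetical, the loop-calling cutoff is an assumption because the minutes leave that tier undefined, and none of the numbers are a finalized 4DN standard.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the starting proposal in the minutes (not final).
REQUIRED_READS_PER_REPLICATE = 400_000_000
RECOMMENDED_READS_PER_REPLICATE = 500_000_000
MIN_BIOLOGICAL_REPLICATES = 2
# Assumption: the higher loop-calling tier is not defined in the minutes;
# 1B raw reads across replicates is used here as a placeholder cutoff.
LOOP_CALLING_TOTAL_READS = 1_000_000_000
# Rough share of raw reads that are high-quality long-range pairs (~40%).
USABLE_PERCENT = 40


@dataclass
class DepthAssessment:
    total_reads: int
    meets_requirement: bool
    meets_recommendation: bool
    loop_calling_candidate: bool


def estimated_usable_contacts(raw_reads: int, usable_percent: int = USABLE_PERCENT) -> int:
    """Estimate usable long-range contacts from a raw read count."""
    return raw_reads * usable_percent // 100


def replicates_at_least(reads_per_replicate: List[int], threshold: int) -> int:
    """Count the replicates whose raw read count reaches the threshold."""
    return sum(1 for reads in reads_per_replicate if reads >= threshold)


def assess_depth(reads_per_replicate: List[int]) -> DepthAssessment:
    """Compare per-replicate raw read counts with the proposed thresholds."""
    total = sum(reads_per_replicate)
    return DepthAssessment(
        total_reads=total,
        meets_requirement=replicates_at_least(
            reads_per_replicate, REQUIRED_READS_PER_REPLICATE
        ) >= MIN_BIOLOGICAL_REPLICATES,
        meets_recommendation=replicates_at_least(
            reads_per_replicate, RECOMMENDED_READS_PER_REPLICATE
        ) >= MIN_BIOLOGICAL_REPLICATES,
        loop_calling_candidate=total >= LOOP_CALLING_TOTAL_READS,
    )


if __name__ == "__main__":
    print(estimated_usable_contacts(1_000_000_000))
    print(assess_depth([450_000_000, 420_000_000]))
```

Reporting the three checks separately, rather than as a single tier label, keeps the sketch neutral on the open question of whether pooling more replicates should count toward the loop-calling depth.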