4dn:phase1:working_groups:omics_data_standards:minutes-10-24-2016

Differences

This shows you the differences between two versions of the page.

4dn:phase1:working_groups:omics_data_standards:minutes-10-24-2016 [2019/02/15 08:59]
oh created
4dn:phase1:working_groups:omics_data_standards:minutes-10-24-2016 [2025/04/22 16:21] (current)
Line 1: Line 1:
-==== Omics Data Standards WG - Minutes 09-26-2016 ====
+==== Omics Data Standards WG - Minutes 10-24-2016 ====
  
 |~~TABLE_CELL_WRAP_START~~<WRAP> |~~TABLE_CELL_WRAP_START~~<WRAP>
-Omics Year 1 Report by Bing Ren
+Finalize recommendations on the Hi-C data standard, particularly regarding sequencing depth, replicate number, and other criteria required for data submission.
  
-Achievements
+Replicates
  
-  *
+  * Do we need minimum data requirements (such as a number of replicates) to ensure the quality of results?
 +      * How would replicates help in the final results? E.g., are three replicates always better than two?
 +      * Metrics are needed to tell whether the replicates are concordant.
 +      * Bill Noble’s group (and several other groups) are working on reproducibility of 4DN results. However, it is still hard to set a threshold at the moment. More datasets may be needed to establish a threshold, and some datasets may need to be redone at that point.
 +      * Code for computing the metrics may be shared once it is ready.
 +  * Two biological replicate samples have been used previously. (Metrics are being developed by other groups. The metrics have been compared, but thresholds have not been decided.)
 +      * Are two different cultures of the same cell line considered biological replicates? Traditionally yes: two independently cultured samples are biological replicates (Job Dekker). But one culture split into two samples may be considered technical replicates instead. (See below)
 +  * Nomenclature of replicates.
  
-Common experiment protocol
+Anisogenic biological replicate: same cell type from different individuals.\\ Isogenic biological replicate: same cell type from same individual, separate cultures.\\ Technical replicate: same culture, libraries prepared separately.\\ ENCODE dropped “biological replicates” and uses “isogenic replicates” universally. (There are non-isogenic biological replicates, which may have worse reproducibility, though.)
  
-  *
+  * For Hi-C we will need two biological replicates (independently grown / engineered / manipulated). Both isogenic and non-isogenic replicates count toward biological replicates, although isogenic ones are more desired. (no recommendations, see below)
 +      * Non-isogenic information is not incorporated in the metadata yet. It needs to be made more explicit to data-generating and submitting groups.
 +      * Would non-isogenic samples actually be more beneficial because of the variability? However, isogenic samples are still needed to see how much of the difference comes from the non-isogenic part. We probably will not recommend isogenic or non-isogenic and will leave that decision to data submitters.
 +  * There may be cases when only one clone is available, so we may need to just recommend two biological replicates instead of requiring them.
 +  * Such recommendations / requirements may also need to be coordinated with imaging groups to meet their imaging needs.
  
-Data file formats
+Sequencing Depth
  
-\__
+  * Saturation of sequencing depth reaches 50% at 3B reads (from the curve presented at the 4DN meeting); however, that is still much larger than most datasets now.
-
+  * Requiring the datasets to be saturated may be too restrictive for most (~0.5B reads per lane).
-Goals for Year 2
+  * 1B raw reads (~40% of which are high-quality inter-chromosomal / long-range pairs, so ~400M inter-chromosomal / long-range contacts) across replicates might be needed for loop calling. If the quality is low, then more raw reads may be needed.
-
+  * The more reads, the better, but a practical standard may be 500M raw reads across all replicates.
-  *
+  * Metrics indicating a high-quality sequencing library
-
+      * Many of them, like cis/trans ratio, may be variable for different types of sample
-Data standards for Hi-C and ChIA-PET
+      * Some metrics still apply, such as the number of PCR duplicates and unmappable reads
-
+      * Distributions of such parameters may be used after accepting data to establish a standard
-  *
+  * Better to phrase this as a recommendation, as lots of groups still cannot reach 500M raw reads/replicate?
-
+      * Groups may make an effort to meet such a standard once it is determined and is doable with current technology.
-Quality metric for chromatin organization features
+      * If 500M is the typical number of reads per sample, the low bar needs to be less than that (for example, 400M/replicate).
-
+      * Can more replicates be combined to achieve a similar result? (1B across replicates)
-\__
+      * Selection may be employed to select for datasets with enough reads for loop calling, etc.
-
+      * We can start by recommending 500M/replicate x 2 replicates, having 400M/replicate x 2 as the requirement, and establishing a higher tier for loop-calling datasets.
-Main challenges and obstacles for OMICS group
+  * May need to share this requirement / recommendation with other groups in the 4DN Network for comments.
-
-  *
-
-Gold standards for chromatin features
-
-  *
-
-Definition for “reference 4D genome” (probably a reference sweep list for terms, etc., rather than a detailed “reference genome”)
-
-\__
-
-Build consensus in common terminology
-
-\__
-
-Data standards for Hi-C:
-
-  *
-
-Needed when external experimental groups submit their own Hi-C data
-
-  *
-
-The data standards should reflect what types of feature the datasets are trying to resolve, because different features will require vastly different levels of resolution (low for compartment/domains but high for loops)
-
-      *
-
-Do we need to enforce a minimum sequencing depth for Hi-C data? ENCODE has a requirement of 20M reads for phase II and 40M for phase III to ensure the number of binding sites (peaks) does not appear limited.
-
-      *
-
-Datasets may be divided into two (or more) categories by resolution.
-
-  *
-
-Data will have a large amount of heterogeneity
-
-  *
-
-Standard libraries can provide a good “sanity test” for newly generated data for better aggregation analysis
-
-  *
-
-Categorize quality control libraries
-
-      *
-
-Do groups do their internal controls individually or mandate a QC standard for all datasets
-
-\__
-
-Determine the minimum numbers of reads required for datasets
-
-  *
-
-Distinct reads also appear to be very important, because it is possible to have lots of reads but a very low molecular complexity, leading to wasted reads
-
-  *
-
-Reproducibility analysis
-
-  *
-
-However, saturation analysis by Erez group showed that there may be less benefit once the number of unique reads reaches a certain level (~2B-3B contacts, i.e. read pairs in the library, assuming a high-quality library)
-
-  *
-
-Inter-chromosomal and intra-chromosomal data may need to be separately considered because the underlying biological process
-
-  *
-
-Define minimal standard \__with the minimal read counts, read depths and quality control of the libraries.
-
-  *
-
-Early standard pushout may be more beneficial (can be raised later on) so that people don’t have to regenerate data once a standard is implemented in the future for the “reference genome”.
  
 </WRAP>~~TABLE_CELL_WRAP_STOP~~| </WRAP>~~TABLE_CELL_WRAP_STOP~~|
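
The replicate nomenclature recorded in the 10-24-2016 minutes (anisogenic, isogenic, technical) can be read as a simple decision rule. The sketch below is only an illustration of that rule: the `Sample` record, its field names, and the `replicate_relation` function are hypothetical and do not correspond to any existing 4DN metadata schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical sample record; field names are assumptions for illustration."""
    cell_type: str
    individual: str  # donor or individual the cells come from
    culture: str     # identifier of the independently grown culture
    library: str     # identifier of the library preparation


def replicate_relation(a: Sample, b: Sample) -> str:
    """Classify a pair of samples using the nomenclature from the minutes."""
    if a.cell_type != b.cell_type:
        return "not replicates"
    if a.individual != b.individual:
        # Same cell type from different individuals.
        return "anisogenic biological replicate"
    if a.culture != b.culture:
        # Same cell type, same individual, separate cultures.
        return "isogenic biological replicate"
    if a.library != b.library:
        # Same culture, libraries prepared separately.
        return "technical replicate"
    return "same library"


base = Sample("example_cell_type", "donor1", "cultureA", "lib1")
print(replicate_relation(base, Sample("example_cell_type", "donor2", "cultureB", "lib2")))
print(replicate_relation(base, Sample("example_cell_type", "donor1", "cultureB", "lib3")))
print(replicate_relation(base, Sample("example_cell_type", "donor1", "cultureA", "lib4")))
```

Under this reading, one culture split into two samples comes out as a technical replicate, which matches the point attributed to Job Dekker in the minutes.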
  
  
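
The sequencing-depth figures discussed in the 10-24-2016 minutes can be worked through as arithmetic. The sketch below assumes the numbers exactly as proposed in the discussion (400M raw reads/replicate x 2 as the requirement, 500M/replicate x 2 as the recommendation, about 1B raw reads across replicates for loop calling, and roughly 40% of raw reads being usable long-range contacts). These were proposals circulated for comment, not an adopted standard, and the names `assess_depth` and `DepthAssessment` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

# Figures taken from the discussion in the minutes; all are proposals, not final.
REQUIRED_PER_REPLICATE = 400_000_000
RECOMMENDED_PER_REPLICATE = 500_000_000
LOOP_CALLING_TOTAL = 1_000_000_000  # raw reads summed across replicates
USABLE_PERCENT = 40                 # ~40% of raw reads assumed usable
MIN_REPLICATES = 2


@dataclass(frozen=True)
class DepthAssessment:
    tier: str
    total_raw_reads: int
    estimated_contacts: int
    loop_calling_candidate: bool


def assess_depth(raw_reads_per_replicate: Sequence[int]) -> DepthAssessment:
    """Place a dataset in a depth tier from its per-replicate raw read counts."""
    total = sum(raw_reads_per_replicate)
    n_recommended = sum(r >= RECOMMENDED_PER_REPLICATE for r in raw_reads_per_replicate)
    n_required = sum(r >= REQUIRED_PER_REPLICATE for r in raw_reads_per_replicate)
    if n_recommended >= MIN_REPLICATES:
        tier = "recommended"
    elif n_required >= MIN_REPLICATES:
        tier = "required"
    else:
        tier = "below requirement"
    return DepthAssessment(
        tier=tier,
        total_raw_reads=total,
        estimated_contacts=total * USABLE_PERCENT // 100,
        loop_calling_candidate=total >= LOOP_CALLING_TOTAL,
    )


print(assess_depth([500_000_000, 500_000_000]))
print(assess_depth([400_000_000, 450_000_000]))
```

Two replicates at 500M raw reads each give 1B raw reads in total and about 400M estimated contacts, which is the worked figure quoted in the minutes. Combining more, shallower replicates can reach the 1B total without meeting the per-replicate bar, which is why the sketch reports the loop-calling flag separately from the tier.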
4dn/phase1/working_groups/omics_data_standards/minutes-10-24-2016.1550249998.txt.gz · Last modified: 2025/04/22 16:21 (external edit)