Omics Data Standards WG - Minutes 10-24-2016

Finalize recommendations on the Hi-C data standard, particularly regarding sequencing depth, number of replicates, and other criteria required for data submission.

Replicates

  • Do we need minimum data requirements (such as a number of replicates) to ensure the quality of results?
    • How would replicates help the final results? E.g., are three replicates always better than two?
    • We need metrics to tell whether the replicates are concordant.
    • Bill Noble’s group (and several other groups) are working on reproducibility of 4DN results. However, it is still hard to set a threshold at the moment. More datasets may be needed to establish a threshold, and some datasets may need to be redone at that point.
    • Code for computing the metrics may be shared once it is ready.
  • Two biological replicates have been used previously. Other groups have been developing reproducibility metrics, and the metrics have been compared, but thresholds have not been decided.
    • Are two different cultures of the same cell line considered biological replicates? Traditionally yes: two independently cultured samples are biological replicates (Job Dekker). However, one culture split into two samples may be considered technical replicates instead (see below).
  • Nomenclature of replicates.

Anisogenic biological replicate: same cell type from different individuals.
Isogenic biological replicate: same cell type from same individual, separate cultures.
Technical replicate: same culture, library prepared separately.
ENCODE dropped “biological replicates” and uses “isogenic replicates” universally. (There are also non-isogenic biological replicates, which may have worse reproducibility.)

  • For Hi-C we will need two biological replicates (independently grown / engineered / manipulated). Both isogenic and non-isogenic replicates count as biological replicates, although isogenic ones are preferred. (No recommendation between the two; see below.)
    • Non-isogenic information is not incorporated in the metadata yet. This needs to be made more explicit to data-generating and data-submitting groups.
    • Would non-isogenic samples actually be more beneficial because of the variability? However, isogenic samples are still needed to see how much of the difference is due to the non-isogenic component. We will probably not recommend isogenic over non-isogenic or vice versa, and will leave that decision to data submitters.
  • There may be cases when only one clone is available, so we may need to recommend two biological replicates rather than require them.
  • Such recommendations / requirements may also need to be coordinated with the imaging groups to meet their imaging needs.
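
The replicate nomenclature above could be encoded in submission metadata roughly as follows. This is only a minimal sketch: the `Sample` fields (`individual`, `culture`, `library`) and the function name are hypothetical and are not part of any existing 4DN metadata schema, and both samples are assumed to be of the same cell type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # All three fields are hypothetical identifiers used for illustration only.
    individual: str  # donor or source individual of the cell type
    culture: str     # independently grown / engineered / manipulated culture
    library: str     # separately prepared sequencing library


def classify_replicates(a: Sample, b: Sample) -> str:
    """Name the relationship between two samples of the same cell type."""
    if a.individual != b.individual:
        # Same cell type, different individuals.
        return "anisogenic biological replicate"
    if a.culture != b.culture:
        # Same individual, separate cultures.
        return "isogenic biological replicate"
    if a.library != b.library:
        # Same culture, libraries prepared separately.
        return "technical replicate"
    return "same library"


if __name__ == "__main__":
    s1 = Sample(individual="donor1", culture="cultureA", library="lib1")
    s2 = Sample(individual="donor1", culture="cultureB", library="lib2")
    print(classify_replicates(s1, s2))
```

Recording the individual and culture explicitly would also make the non-isogenic information available in the metadata, which the notes above flag as missing.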

Sequencing Depth

  • Sequencing-depth saturation reaches 50% at 3B reads (from the curve presented at the 4DN meeting); however, that is still much larger than most current datasets.
  • Requiring datasets to be saturated may be too restrictive for most groups (~0.5B reads per lane).
  • 1B raw reads (~40% of which are high-quality inter-chromosomal / long-range pairs, so ~400M inter-chromosomal / long-range contacts) across replicates might be needed for loop calling. If the quality is low, then more raw reads may be needed.
  • The more reads, the better, but a practical standard may be 500M raw reads across all replicates.
  • Metrics indicating a high-quality sequencing library
    • Many of them, such as the cis/trans ratio, may vary between sample types.
    • Some metrics still apply, such as number of PCR duplicates and unmappable reads
    • Once data have been accepted, the distributions of such parameters may be used to establish a standard.
  • Better to phrase this as a recommendation as lots of groups still cannot reach 500M raw reads/replicate?
    • Groups may make an effort to meet such a standard once it is determined, provided it is doable with current technology.
    • If 500M is the typical number of reads per sample, the minimum bar needs to be lower than that (for example, 400M/replicate).
    • Can more replicates be combined to achieve a similar result? (1B across replicates)
    • Selection may be employed to select for datasets with enough reads for loop calling, etc.
    • We can start by recommending 500M/replicate x 2 replicates, requiring 400M/replicate x 2, and establishing a higher tier for loop-calling datasets.
  • We may need to share this requirement / recommendation with other groups in the 4DN Network for comments.
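
The depth figures discussed above can be checked with a short calculation. The sketch below is illustrative only: the thresholds are the numbers floated in this discussion (none are finalized), the function names are hypothetical, and judging a dataset by its shallowest replicate is an assumption rather than something the group decided.

```python
from typing import Sequence

# Figures taken from the discussion above; none are finalized standards.
REQUIRED_PER_REPLICATE = 400_000_000     # proposed requirement
RECOMMENDED_PER_REPLICATE = 500_000_000  # proposed recommendation
LOOP_CALLING_TOTAL = 1_000_000_000       # raw reads across replicates
USABLE_FRACTION = 0.40                   # ~40% high-quality long-range pairs
MIN_REPLICATES = 2


def estimated_contacts(total_raw_reads: int,
                       usable_fraction: float = USABLE_FRACTION) -> int:
    """Estimate usable inter-chromosomal / long-range contacts from raw reads."""
    return round(total_raw_reads * usable_fraction)


def depth_tier(raw_reads_per_replicate: Sequence[int]) -> str:
    """Classify a dataset by its shallowest replicate (an assumption)."""
    if len(raw_reads_per_replicate) < MIN_REPLICATES:
        return "below requirement"
    shallowest = min(raw_reads_per_replicate)
    if shallowest >= RECOMMENDED_PER_REPLICATE:
        return "meets recommendation"
    if shallowest >= REQUIRED_PER_REPLICATE:
        return "meets requirement"
    return "below requirement"


def enough_for_loop_calling(raw_reads_per_replicate: Sequence[int]) -> bool:
    """Check the ~1B raw reads across replicates mentioned for loop calling."""
    return sum(raw_reads_per_replicate) >= LOOP_CALLING_TOTAL


if __name__ == "__main__":
    reads = [450_000_000, 600_000_000]
    print(depth_tier(reads), enough_for_loop_calling(reads))
    print(estimated_contacts(sum(reads)))
```

Note that two replicates at the proposed requirement (400M each) total 800M raw reads, which falls short of the ~1B figure discussed for loop calling; this is one reason a separate higher tier was suggested.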
4dn/phase1/working_groups/omics_data_standards/minutes-10-24-2016.txt · Last modified: 2025/04/22 16:21 (external edit)