Finalizing the experiment protocol. Derive the Hi-C standard from existing, published in situ Hi-C protocols. Determine what is variable and what should be unified, and precisely document such variables.
Digestion time may vary depending on the exact type of DNA (e.g. mitochondrial vs. genomic).
In situ Hi-C is a fairly standard protocol that different labs are using. Two variations were brought up as worthwhile improvements:
Job Dekker: Dangling sequence removal is an important step that should be added to the standard protocol to improve library quality.
?: For the temperature during overhang filling, 23 °C is better than 37 °C (enzyme dependent?)
Other, similar procedures may need a different protocol. We may define different categories (like narrow and broad for ChIP-seq in ENCODE), and these may build on the Hi-C protocol above:
Micro-C procedure
DNase Hi-C (a different protocol from the ones with restriction enzymes)
How will the protocol be encoded in the meta-data structure?
Trade-off between tracking information in searchable fields vs. ease of logging.
Follow ENCODE's example: standard protocols are provided as pdf, variations become meta-data fields.
Metadata for ENCODE ChIP-seq:
Biosample: description of cell, condition, treatment. We may need to expand to describe genome editing.
Antibody: We will not need.
Protocol: fragmentation method, fragment length, and a handful of others(?)
For Hi-C, we will need to add a field for the restriction enzyme. Information about crosslinking time, temperature, etc. may be in the pdf or have separate fields; the pdf is probably fine.
Do we need to keep metadata for Hi-C QC prior to sequencing?
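A minimal sketch of how the "standard protocol as pdf, variations as metadata fields" idea could look for a Hi-C record. All class names, field names, the pdf file name, and the example values are hypothetical illustrations, not an agreed DCIC or ENCODE schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Biosample:
    """Hypothetical biosample record: cell, condition, treatment."""
    description: str
    condition: Optional[str] = None
    treatment: Optional[str] = None
    # Possible extension mentioned in the notes: describe genome editing.
    genome_editing: Optional[str] = None


@dataclass
class HiCExperiment:
    """Hypothetical Hi-C experiment record (no antibody field needed)."""
    biosample: Biosample
    # Searchable field that Hi-C needs in addition to the ChIP-seq fields.
    restriction_enzyme: str
    # The standard protocol itself stays in a document (file name assumed).
    protocol_document: str
    # Only deviations from the standard protocol are logged as fields.
    protocol_variations: dict = field(default_factory=dict)


def to_json(experiment: HiCExperiment) -> str:
    """Serialize a record, including the nested biosample, to JSON."""
    return json.dumps(asdict(experiment), indent=2, sort_keys=True)


example = HiCExperiment(
    biosample=Biosample(description="example cell line"),
    restriction_enzyme="MboI",
    protocol_document="in_situ_hic_standard.pdf",
    protocol_variations={
        "overhang_fill_temperature_c": 23,
        "dangling_end_removal": True,
    },
)

print(to_json(example))
```

Keeping variations in a free-form mapping favors ease of logging; promoting a key such as the fill-in temperature to a named field would favor searchability, which is the trade-off noted above.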
e.g. like antibody validation in ENCODE
We can't think of examples for Hi-C that need to be reported.
Assume data prep passed basic wet-lab QC in the prep stages. (There are natural go/no-go checkpoints; all data submitted to DCIC is go/go/go.)
DCIC: this will naturally be easy since we are starting with their metadata and database model.
Data formats
ENCODE provides raw files (fastq) and processed files in standard formats (fastq, bam, bed, narrowPeak etc.)
DCIC proposes doing something analogous. For now, keep all intermediate files. We can get rid of some later (e.g. bam).
fastq: raw reads submitted to DCIC
bam: aligned, specifics of alignment TBD
valid pairs: gzipped text file of filtered reads (e.g. after PCR duplicate removal)
sparse contact matrices as gzipped text
a binary representation of contact matrices that allows indexing and fast access
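To make the relationship between the valid pairs file and the sparse contact matrix concrete, here is a rough sketch. The column layout is not agreed yet, so the seven tab-separated columns assumed here (read id, chrom1, pos1, strand1, chrom2, pos2, strand2), the five-column sparse output, and all function names are hypothetical. Duplicate removal by exact coordinates is a simplification of real PCR duplicate filtering.

```python
import gzip
from collections import Counter


def parse_pairs(lines):
    """Yield (chrom1, pos1, strand1, chrom2, pos2, strand2) tuples.

    Assumes a hypothetical 7-column tab-separated layout whose first
    column is the read id; blank lines and '#' comments are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        _read_id, c1, p1, s1, c2, p2, s2 = line.split("\t")
        yield (c1, int(p1), s1, c2, int(p2), s2)


def dedup(pairs):
    """Drop pairs with identical coordinates and strands (simplified)."""
    seen = set()
    for pair in pairs:
        if pair not in seen:
            seen.add(pair)
            yield pair


def bin_pairs(pairs, bin_size):
    """Count contacts per pair of fixed-size bins (upper triangle only)."""
    counts = Counter()
    for c1, p1, _s1, c2, p2, _s2 in pairs:
        a = (c1, p1 // bin_size)
        b = (c2, p2 // bin_size)
        if b < a:  # store each bin pair in one canonical order
            a, b = b, a
        counts[(a, b)] += 1
    return counts


def write_sparse(counts, path):
    """Write non-zero entries as gzipped text: chrom, bin, chrom, bin, count."""
    with gzip.open(path, "wt") as out:
        for (a, b), n in sorted(counts.items()):
            out.write(f"{a[0]}\t{a[1]}\t{b[0]}\t{b[1]}\t{n}\n")


example_lines = [
    "r1\tchr1\t1500\t+\tchr1\t25000\t-",
    "r2\tchr1\t1500\t+\tchr1\t25000\t-",  # same coordinates as r1
    "r3\tchr1\t27000\t+\tchr1\t900\t-",
    "r4\tchr1\t100\t+\tchr2\t5000\t+",
]
counts = bin_pairs(dedup(parse_pairs(example_lines)), bin_size=10000)
```

In this toy input, r2 is dropped as a duplicate of r1, and r1 and r3 fall into the same bin pair once the order is canonicalized, so only non-zero entries need to be stored.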
Trade-off between using a commonly used format vs. developing a “better one”.
It is unclear if commonly used formats do exist for some of the above.
We need to agree on the columns of the specific text formats above.
Bill Noble and Job Dekker strongly support valid pairs being made available. (yes, they will be.)
DCIC has listed some requirements for a binary representation of contact matrices and the tools it should come with. It is not clear that a format meeting all of them exists. But even if not, DCIC can serve one or more of the existing formats. New ones may come about, or the tools around existing ones may grow.
Discussion to be continued.
4dn/phase1/working_groups/omics_data_standards/minutes-06-13-2016.1550188631.txt.gz · Last modified: 2025/04/22 16:21 (external edit)