==== Omics Data Standards WG - Minutes 06-13-2016 ====

===== Experiment Procedures, meta-data =====

  * Finalizing the experiment protocol. Derive the Hi-C standard from existing, published in situ Hi-C protocols. Determine what is variable and what should be unified, and document such variables precisely.
    * Digestion time may vary depending on the type of DNA (e.g. mitochondrial vs. genomic).
  * In situ Hi-C is a fairly standard protocol that different labs are already using. Two variations were brought up as worthwhile improvements:
    * Job Dekker: dangling sequence removal is an important step that should be added to the standard protocol to improve library quality.
    * ?: for the temperature during overhang filling, 23°C is better than 37°C (enzyme dependent?).
  * Other similar procedures may need a different protocol. These could be handled as separate categories (like the narrow and broad categories for ChIP-seq in ENCODE), possibly based on the existing Hi-C protocol:
    * Micro-C procedure
    * DNase Hi-C (a different protocol from the restriction-enzyme-based ones)
  * How will the protocol be encoded in the meta-data structure?
    * Trade-off between tracking information in searchable fields vs. ease of logging.
    * Follow ENCODE's example: standard protocols are provided as PDF, variations become meta-data fields.
  * Metadata for ENCODE ChIP-seq:
    * Biosample: description of cell, condition, treatment. We may need to expand this to describe genome editing.
    * Antibody: not needed for Hi-C.
    * Protocol: fragmentation method, fragment length, and a handful of others(?).
    * For Hi-C, we will need to add a field for the restriction enzyme (information about crosslinking time, temperature, etc. may go in the PDF or in separate fields; the PDF is probably fine). A hypothetical record following this layout is sketched at the end of these minutes.
  * Do we need to keep metadata for Hi-C QC performed prior to sequencing?
    * e.g. like antibody validation in ENCODE.
    * We can't think of examples for Hi-C that need to be reported.
    * Assume data prep passed basic wet-lab QC in the prep stages. (There are natural go/no-go checkpoints; all data submitted to the DCIC is go/go/go.)
  * Computational QC data (e.g. read depth, PCR duplicate proportion, reproducibility, intra/inter ratio):
    * Can be determined after the data is taken in by the DCIC (see the QC sketch at the end of these minutes).
  * Cross-consortium data sharing/merging with ENCODE:
    * DCIC: this will naturally be easy since we are starting from their metadata and database model.

===== Data formats =====

  * ENCODE provides raw files (fastq) and processed files in standard formats (bam, bed, narrowPeak, etc.).
  * DCIC proposes doing something analogous. For now, keep all intermediate files; we can get rid of some later (e.g. bam).
    * fastq: raw reads submitted to the DCIC
    * bam: aligned reads; specifics of alignment TBD
    * valid pairs: gzipped text file of filtered reads (e.g. after PCR duplicate removal)
    * sparse contact matrices as gzipped text (an illustrative conversion is sketched at the end of these minutes)
    * a binary representation of contact matrices that allows indexing and fast access
  * Trade-off between utilizing a commonly used format vs. developing a "better" one.
    * It is unclear whether commonly used formats even exist for some of the above.
  * We need to agree on the columns of the specific text formats above.
  * Bill Noble and Job Dekker strongly support valid pairs being made available. (Yes, they will be.)
  * DCIC has listed some requirements for the binary representation of contact matrices and the tools it should come with. It is not clear that a format meeting all of them exists. Even if not, the DCIC can serve one or more of the existing formats; new ones may come about, or the tooling around existing ones may grow.
  * Discussion to be continued.
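
===== Illustrative sketches (hypothetical, for discussion) =====

The sketch below shows what a minimal Hi-C experiment record could look like if we follow the ENCODE pattern discussed above: the full standard protocol served as a PDF, with the agreed variations promoted to searchable metadata fields. Every field name and example value here (''GM12878'', ''MboI'', the PDF filename) is a hypothetical illustration, not an agreed schema.

<code python>
# A minimal sketch of a Hi-C experiment metadata record, assuming we
# follow the ENCODE pattern (protocol PDF + searchable variation fields).
# Every field name and value below is a hypothetical illustration.
hic_experiment = {
    "biosample": {
        "cell_line": "GM12878",        # description of cell
        "treatment": None,             # condition / treatment
        "genome_modification": None,   # possible expansion for genome editing
    },
    "protocol": {
        "document": "in_situ_hic_v1.pdf",  # full standard protocol as PDF
        "protocol_type": "in situ Hi-C",   # vs. Micro-C, DNase Hi-C, ...
        "restriction_enzyme": "MboI",      # the field we agreed to add
        # Crosslinking time/temperature etc. could stay in the PDF or
        # become separate fields; the discussion leaned toward the PDF.
    },
}
</code>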
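
Next, a sketch of the kind of computational QC the DCIC could run after intake: total valid pairs and the intra/inter-chromosomal contact ratio mentioned above, computed from a gzipped valid-pairs file. The column layout it assumes (read id, chrom1, pos1, strand1, chrom2, pos2, strand2) is purely hypothetical, since the columns are still to be agreed on; the PCR duplicate proportion would be captured earlier, during the filtering that produces the valid pairs.

<code python>
import gzip

# Assumed (hypothetical) valid-pairs columns, tab-separated:
#   read_id  chrom1  pos1  strand1  chrom2  pos2  strand2
def contact_qc(valid_pairs_path):
    """Compute simple post-intake QC metrics from a gzipped valid-pairs file."""
    intra = inter = 0
    with gzip.open(valid_pairs_path, "rt") as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if fields[1] == fields[4]:   # same chromosome on both sides
                intra += 1
            else:
                inter += 1
    total = intra + inter
    return {
        "total_valid_pairs": total,    # proxy for usable read depth
        "intra_fraction": intra / total if total else 0.0,
        "intra_over_inter": intra / inter if inter else float("inf"),
    }
</code>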
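
Finally, a sketch of how the gzipped sparse contact matrix text format could be produced from valid pairs: bin both read ends and write one (chrom1, start1, chrom2, start2, count) line per non-empty bin pair. The bin size and output columns are illustrative only, pending the column agreement noted above.

<code python>
import gzip
from collections import defaultdict

BIN_SIZE = 1_000_000  # illustrative 1 Mb resolution

def pairs_to_sparse_text(valid_pairs_path, matrix_path):
    """Bin valid pairs and write a sparse contact matrix as gzipped text.

    Assumes the same hypothetical valid-pairs columns as the QC sketch.
    Output columns (also illustrative): chrom1, start1, chrom2, start2, count.
    """
    counts = defaultdict(int)
    with gzip.open(valid_pairs_path, "rt") as fh:
        for line in fh:
            f = line.rstrip("\n").split("\t")
            bin1 = (f[1], int(f[2]) // BIN_SIZE)
            bin2 = (f[4], int(f[5]) // BIN_SIZE)
            # Canonical ordering so each unordered bin pair is counted once.
            counts[tuple(sorted((bin1, bin2)))] += 1
    with gzip.open(matrix_path, "wt") as out:
        for (b1, b2), n in sorted(counts.items()):
            out.write(f"{b1[0]}\t{b1[1] * BIN_SIZE}\t{b2[0]}\t{b2[1] * BIN_SIZE}\t{n}\n")
</code>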