Omics Data Standards WG - Minutes 05-23-2016

High-quality genome-wide contact mapping

A standardized in situ Hi-C protocol may need further discussion once more results are available from the participating labs (for example, those working with the quick-cutters).

Two Google Docs will be used to discuss candidate data formats (either currently available ones or a newly created type) for visualization and analysis of genome-wide contact mapping data. Burak will generate a draft document and share it with the WG.

  • Visualization formats may need to differ from the formats used in analysis (e.g. the former may be plain text while the latter may be binary).
  • Documentation of the file formats will be needed (it can accompany the documentation of existing tools).
  • One Google Doc will cover the existing formats; the other will cover the requirements that other groups may have for the format.
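
As a rough illustration of the text versus binary point above, the sketch below round-trips a few contact records between a tab-separated text form and a fixed-width binary form. The record layout (bin1, bin2, count as three unsigned 32-bit integers) and the function names are assumptions made for this example only; they are not a format the WG has agreed on.

```python
import struct

# Hypothetical record layout: (bin1, bin2, count) as three unsigned 32-bit
# little-endian integers. This is an illustration, not an agreed format.
RECORD = struct.Struct("<III")


def text_to_binary(text):
    """Pack tab-separated 'bin1<TAB>bin2<TAB>count' lines into bytes."""
    out = bytearray()
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        bin1, bin2, count = (int(field) for field in line.split("\t"))
        out += RECORD.pack(bin1, bin2, count)
    return bytes(out)


def binary_to_text(blob):
    """Unpack the fixed-width records back into tab-separated lines."""
    lines = [
        "\t".join(str(value) for value in RECORD.unpack_from(blob, offset))
        for offset in range(0, len(blob), RECORD.size)
    ]
    return "\n".join(lines)


example_text = "0\t0\t12\n0\t1\t5\n1\t1\t9"
example_blob = text_to_binary(example_text)
round_trip = binary_to_text(example_blob)
```

With this layout each record takes 12 bytes, and the conversion is lossless in both directions, which is the property a visualization/analysis format pair would need.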

Data file format specifics

  • Data tends to be very large, so conversion between text and binary formats may be needed.
  • Hi-C data is two-dimensional, unlike regular genomic data; therefore, a two-dimensional index will be needed.
  • Should data for validation and QC be included in the final data file (that is, should the metadata and some pre-computed summary statistics be aggregated into the data file itself)?
    • Need to consider that other pipelines may not calculate the same metadata
    • Therefore, optional attributes may be included in the final file format
    • However, required metadata fields will make querying easier
    • There should be ways to verify/recalculate the pre-computed numbers
  • Raw results (FASTQ, SAM, BAM files) are not strictly required as part of the data set used and shared between groups, as long as there is a format that is not very lossy.
    • Contact matrices may be a good intermediate candidate between raw results and downstream data.
    • If more information is needed later, people can regenerate it from the intermediate instead of going all the way back to FASTQ.
    • Read IDs may be needed in the processed file to allow pointing back to the original reads and to enable features like “ultra-zoom”.
    • But smaller files may also have their advantages, since they leave out data not used in most downstream studies (like BED).
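
To make the two-dimensional index and metadata points above more concrete, here is a minimal sketch, assuming a sparse contact matrix stored as (bin1, bin2, count) records sorted by (bin1, bin2). The record values, the metadata field names (`bin_size`, `total_contacts`, `optional`) and the helper functions are all hypothetical and chosen only for illustration. Row offsets index the first dimension, a binary search within each row handles the second, and a pre-computed total is recalculated from the records to verify it.

```python
from bisect import bisect_left

# Hypothetical sparse contact matrix: (bin1, bin2, count) records sorted by
# (bin1, bin2). Values are made up for illustration.
records = [(0, 0, 12), (0, 1, 5), (1, 1, 9), (1, 3, 2), (2, 2, 7), (3, 3, 4)]

# Hypothetical metadata: required fields plus an optional attribute that
# not every pipeline would be expected to compute.
metadata = {
    "bin_size": 10000,
    "total_contacts": 39,
    "optional": {"pipeline": "example"},
}


def build_row_offsets(records, n_bins):
    """First dimension of the index: offsets[i] is where row i starts."""
    rows = [record[0] for record in records]
    return [bisect_left(rows, i) for i in range(n_bins + 1)]


def query(records, offsets, row_range, col_range):
    """Return records with bin1 in row_range and bin2 in col_range.

    Both ranges are half-open (start, stop) pairs.
    """
    hits = []
    for row in range(*row_range):
        start, end = offsets[row], offsets[row + 1]
        cols = [record[1] for record in records[start:end]]
        # Second dimension: binary search within the sorted row.
        lo = start + bisect_left(cols, col_range[0])
        hi = start + bisect_left(cols, col_range[1])
        hits.extend(records[lo:hi])
    return hits


def verify_total(records, metadata):
    """Recalculate the pre-computed total and compare it to the metadata."""
    return sum(record[2] for record in records) == metadata["total_contacts"]


offsets = build_row_offsets(records, n_bins=4)
region = query(records, offsets, (0, 2), (1, 4))
total_ok = verify_total(records, metadata)
```

Keeping the optional attributes in a separate sub-dictionary is one possible way to let tools query the required fields uniformly while tolerating pipelines that do not compute the rest.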
4dn/phase1/working_groups/omics_data_standards/minutes-05-23-2016.1600883212.txt.gz · Last modified: 2025/04/22 16:21 (external edit)