====== Omics Data Standards WG - Minutes 05-23-2016 ======

|~~TABLE_CELL_WRAP_START~~<WRAP>

===== High-quality genome-wide contact mapping =====

A standardized in situ Hi-C protocol may need to be discussed once further results are available from the participating labs (such as the ones working with the quick-cutters).

Two Google Docs will be used to discuss candidate data formats (either existing formats or a newly created one) for visualization and analysis in genome-wide contact mapping. Burak will generate a draft document and share it with the WG.

  * Visualization formats may need to differ from the formats used in analysis (e.g. the former may be plain text while the latter may be binary).
  * Documentation of the file formats will be needed (it can accompany the documentation of existing tools).
  * One Google Doc will cover the existing formats; the other will cover the format requirements that other groups may have.

===== Data file format specifics =====

  * Data tends to be very large, so conversion between text and binary formats may be needed.
  * Unlike regular genomic data, Hi-C data is two-dimensional; therefore, a two-dimensional index will be needed.
  * Should data for validation and QC be included in the final data file (i.e. aggregate the metadata and some pre-computed summary statistics into the data file itself)?
      * Need to consider that other pipelines may not calculate the same metadata
      * Therefore, optional attributes may be included in the final file format
      * However, required metadata fields will make querying easier
      * There should be ways to verify/recalculate the pre-computed numbers
  * Raw results (FASTQ, SAM, BAM files) do not necessarily need to be part of the data set used and shared between groups, as long as there is a format that is not very lossy.
      * Contact matrices may be a good intermediate candidate between raw results and downstream data.
      * If more information is needed later, people can re-generate them from the intermediate instead of going all the way back to FASTQ.
      * Read IDs may be needed in the processed file to enable pointing back to the original reads and to enable features like "ultra-zoom".
      * But smaller files may also have their advantages by reducing data not used in most downstream studies (like BED).
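
The points above (text/binary conversion, a two-dimensional index, required versus optional metadata, and re-checkable summary statistics) can be illustrated with a small sketch. It is only an illustration under stated assumptions and not a proposed format: the record layout, the tile-based index, the field names in ''REQUIRED_META'', and the example values (such as the assembly name) are all hypothetical and were not agreed on by the WG.

```python
import json
import struct

# Hypothetical required metadata fields; the WG has not defined any.
REQUIRED_META = ("genome_assembly", "bin_size", "total_contacts")

def text_to_records(lines):
    """Parse tab-separated 'bin1<TAB>bin2<TAB>count' lines into sorted tuples."""
    records = []
    for line in lines:
        b1, b2, count = (int(x) for x in line.split("\t"))
        # Keep the upper triangle only, assuming a symmetric contact matrix.
        records.append((min(b1, b2), max(b1, b2), count))
    return sorted(records)

def pack(records, meta):
    """Serialize a JSON metadata header followed by fixed-width binary records."""
    missing = [key for key in REQUIRED_META if key not in meta]
    if missing:
        raise ValueError(f"missing required metadata: {missing}")
    header = json.dumps(meta).encode("utf-8")
    body = b"".join(struct.pack("<III", *rec) for rec in records)
    return struct.pack("<I", len(header)) + header + body

def unpack(blob):
    """Inverse of pack(): return (records, meta)."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    meta = json.loads(blob[4:4 + header_len].decode("utf-8"))
    records = list(struct.iter_unpack("<III", blob[4 + header_len:]))
    return records, meta

def build_tile_index(records, tile=2):
    """Group records by (row tile, column tile) as a toy two-dimensional index."""
    index = {}
    for b1, b2, count in records:
        index.setdefault((b1 // tile, b2 // tile), []).append((b1, b2, count))
    return index

def query(index, rows, cols, tile=2):
    """Return records with bin1 in rows and bin2 in cols (half-open ranges)."""
    hits = []
    for tr in range(rows[0] // tile, (rows[1] - 1) // tile + 1):
        for tc in range(cols[0] // tile, (cols[1] - 1) // tile + 1):
            for b1, b2, count in index.get((tr, tc), []):
                if rows[0] <= b1 < rows[1] and cols[0] <= b2 < cols[1]:
                    hits.append((b1, b2, count))
    return sorted(hits)

def verify(records, meta):
    """Recalculate a pre-computed statistic and compare it with the stored value."""
    return sum(count for _, _, count in records) == meta["total_contacts"]

# Made-up example data; "pipeline" stands in for an optional attribute.
text = ["0\t0\t5", "0\t3\t2", "1\t2\t7", "3\t3\t1"]
records = text_to_records(text)
meta = {"genome_assembly": "hg19", "bin_size": 10000,
        "total_contacts": 15, "pipeline": "example"}
restored, restored_meta = unpack(pack(records, meta))
index = build_tile_index(restored)
print(query(index, rows=(0, 2), cols=(2, 4)))
print(verify(restored, restored_meta))
```

Only the tiles that overlap the requested rectangle are scanned, which is the property a two-dimensional index is meant to provide. Optional attributes pass through the header untouched, while a missing required field is rejected at write time.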

</WRAP>~~TABLE_CELL_WRAP_STOP~~|