A standardized in situ Hi-C protocol may need to be discussed once further results (such as the ones with the quick-cutters) are available from the participating labs.
Two Google Docs will be used to discuss candidate data formats (either existing ones or a newly created type) for visualization and analysis in genome-wide contact mapping. Burak will generate a draft document and share it with the WG.
Visualization formats may need to differ from the formats used in analysis (e.g. the former may be plain text while the latter may be binary).
Documentation of the file formats will be needed (it can accompany the documentation of existing tools).
One Google Doc will cover the existing formats, and the other will cover the requirements that other groups may have for the format.
The data tend to be very large, so conversion between text and binary formats may be needed.
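A minimal sketch of such a conversion, for illustration only. It assumes a hypothetical three-column text layout (row bin, column bin, count) and a hypothetical fixed-width little-endian binary record; neither layout was decided by the WG.

```python
import struct

# Hypothetical record layout: two unsigned 32-bit bin indices and one
# unsigned 32-bit count, little-endian (12 bytes per contact).
RECORD = struct.Struct("<III")


def text_to_binary(text: str) -> bytes:
    """Pack tab-separated 'bin1<TAB>bin2<TAB>count' lines into fixed-width records."""
    out = bytearray()
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        bin1, bin2, count = (int(field) for field in line.split("\t"))
        out += RECORD.pack(bin1, bin2, count)
    return bytes(out)


def binary_to_text(blob: bytes) -> str:
    """Unpack fixed-width records back into tab-separated text lines."""
    lines = [
        f"{bin1}\t{bin2}\t{count}"
        for bin1, bin2, count in RECORD.iter_unpack(blob)
    ]
    return "\n".join(lines) + ("\n" if lines else "")


example_text = "0\t0\t120\n0\t1\t35\n1\t1\t98\n"
example_blob = text_to_binary(example_text)
round_trip = binary_to_text(example_blob)
```

Fixed-width records keep the sketch short; a real format would probably also need a header and compression.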
Unlike regular genomic data, Hi-C data is two-dimensional; therefore, a two-dimensional index will be needed.
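One possible way to build a two-dimensional index is to group contacts into square tiles of bin coordinates, so that a rectangular query only touches the tiles it overlaps. The sketch below is an assumption for discussion, not a format proposal; the tile width `TILE` and the `(bin1, bin2, count)` record layout are hypothetical.

```python
from collections import defaultdict

TILE = 100  # hypothetical tile width, in bins


def build_tile_index(contacts):
    """Group (bin1, bin2, count) records by the 2D tile they fall into."""
    index = defaultdict(list)
    for bin1, bin2, count in contacts:
        index[(bin1 // TILE, bin2 // TILE)].append((bin1, bin2, count))
    return index


def query(index, row_start, row_end, col_start, col_end):
    """Return records with row_start <= bin1 < row_end and col_start <= bin2 < col_end."""
    hits = []
    # Only tiles overlapping the query rectangle are visited.
    for tile_row in range(row_start // TILE, (row_end - 1) // TILE + 1):
        for tile_col in range(col_start // TILE, (col_end - 1) // TILE + 1):
            for bin1, bin2, count in index.get((tile_row, tile_col), []):
                if row_start <= bin1 < row_end and col_start <= bin2 < col_end:
                    hits.append((bin1, bin2, count))
    return sorted(hits)


contacts = [(5, 7, 12), (5, 250, 3), (120, 130, 40), (310, 315, 8)]
index = build_tile_index(contacts)
region = query(index, 0, 200, 0, 200)
```

Here `region` contains only the two records whose row and column bins both fall below 200; the tiles holding the other records are never scanned.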
Should validation and QC data be included in the final data file (i.e., should the metadata and some pre-computed summary statistics be aggregated into the data file itself)?
We need to consider that other pipelines may not calculate the same metadata.
Therefore, optional attributes may be included in the final file format.
However, required metadata fields would make querying easier.
There should be ways to verify or recalculate the pre-computed numbers.
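The points above on required fields, optional attributes, and verification can be sketched as follows. All field names (`genome_assembly`, `bin_size`, `total_contacts`, `cis_fraction`) and values are illustrative assumptions, not fields agreed on by the WG.

```python
# Hypothetical set of required metadata fields.
REQUIRED_FIELDS = ("genome_assembly", "bin_size", "total_contacts")


def validate_metadata(metadata):
    """Return the list of required fields missing from a metadata dict."""
    return [name for name in REQUIRED_FIELDS if name not in metadata]


def verify_total_contacts(metadata, contacts):
    """Recompute the contact total from the records and compare with the stored value."""
    recomputed = sum(count for _bin1, _bin2, count in contacts)
    return recomputed == metadata["total_contacts"]


contacts = [(0, 0, 120), (0, 1, 35), (1, 1, 98)]
metadata = {
    "genome_assembly": "hg19",  # illustrative value
    "bin_size": 10000,
    "total_contacts": 253,
    # Optional attribute that only some pipelines would fill in (hypothetical).
    "optional": {"cis_fraction": None},
}
missing = validate_metadata(metadata)
totals_match = verify_total_contacts(metadata, contacts)
```

Keeping optional attributes in their own sub-dictionary lets a reader query the required fields without knowing which pipeline produced the file.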
Raw results (FASTQ, SAM, BAM files) are not strictly required as part of the data set used and shared between groups, as long as there is a format that is not too lossy.
Contact matrices may be a good intermediate candidate between raw results and downstream data.
If more information is needed later, people can regenerate it from the intermediate format instead of going all the way back to FASTQ.
Read IDs may be needed in the processed file to allow pointing back to the original reads and to enable features such as “ultra-zoom”.
However, smaller files also have advantages, since they leave out data not used in most downstream studies (like BED).
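The idea of regenerating downstream data from an intermediate contact matrix, rather than from FASTQ, can be sketched as below. The example derives a coarser-resolution matrix from a finer one, assuming a hypothetical sparse `(bin1, bin2, count)` representation; the function name `coarsen` is illustrative.

```python
from collections import Counter


def coarsen(contacts, factor):
    """Merge bins in groups of `factor` and sum the counts of merged bin pairs."""
    merged = Counter()
    for bin1, bin2, count in contacts:
        merged[(bin1 // factor, bin2 // factor)] += count
    return sorted((b1, b2, c) for (b1, b2), c in merged.items())


# Fine-resolution contacts stored in the intermediate file (made-up values).
fine = [(0, 0, 10), (0, 1, 4), (1, 1, 6), (2, 3, 5), (0, 3, 2)]
coarse = coarsen(fine, 2)
```

Going in the other direction (from coarse to fine) is not possible, which is one reason the resolution of the intermediate matters.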