====== Omics Data Standards WG - Minutes 05-23-2016 ======

|~~TABLE_CELL_WRAP_START~~
===== High-quality genome-wide contact mapping =====

A standardized in situ Hi-C protocol may need discussion when further results are available in the participating labs (such as the ones with the quick-cutters).

Two Google Docs will be used to discuss candidate data formats (either chosen from the currently available ones or newly created) for visualization and for analysis in genome-wide contact mapping. Burak will generate a draft document and share it with the WG.

  * Visualization formats may need to differ from the formats used in analysis (e.g. the former may be plain text while the latter may be binary).
  * Documentation of the file formats will be needed (it can accompany the documentation of existing tools).
  * One Google Doc will cover the existing formats; the other will cover the requirements that other groups may have for the format.

===== Data file format specifics =====

  * Data tends to be very large, so conversion between text and binary formats may be needed.
  * Hi-C data is two-dimensional, unlike regular genomic data, so a two-dimensional index will be needed.
  * Is there a need to include validation and QC data in the final data file (i.e. aggregate the metadata and some pre-computed summary statistics into the data file itself)?
    * Need to consider that other pipelines may not calculate the same metadata.
    * Therefore, optional attributes may be included in the final file format.
    * However, required metadata fields will make querying easier.
    * There should be ways to verify or recalculate the pre-computed numbers.
  * Raw results (FASTQ, SAM, BAM files) are not strictly required as part of the data set used and shared between groups, as long as there is a format that is not very lossy.
  * Contact matrices may be a good intermediate candidate between raw results and downstream data.
    * If more information is needed later, it can be regenerated from the intermediate instead of going all the way back to FASTQ.
    * Read IDs may be needed in the processed file to allow pointing back to the original reads and to enable features such as "ultra-zoom".
    * But smaller files may also have their advantages, since they drop data not used in most downstream studies (like BED).
~~TABLE_CELL_WRAP_STOP~~|
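
To make the format discussion above more concrete, here is a minimal Python sketch of a contact matrix used as an intermediate, keyed by bin pair (a simple two-dimensional lookup) and convertible between text and binary. It is only an illustration under stated assumptions: the bin size, the function names, the single-chromosome input, and the fixed 12-byte record layout are all hypothetical and are not a format agreed on by the WG.

```python
import struct
from collections import Counter

BIN_SIZE = 10_000                # assumed resolution in base pairs (hypothetical)
RECORD = struct.Struct("<III")   # bin_i, bin_j, count as little-endian uint32 (hypothetical layout)


def bin_contacts(pairs, bin_size=BIN_SIZE):
    """Count read pairs per bin pair on one chromosome.

    Only the upper triangle (i <= j) is stored, on the assumption that
    the contact matrix is symmetric.
    """
    counts = Counter()
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        if i > j:
            i, j = j, i
        counts[(i, j)] += 1
    return counts


def lookup(counts, i, j):
    """Two-dimensional lookup by bin coordinates; the order of i and j is irrelevant."""
    return counts.get((min(i, j), max(i, j)), 0)


def to_text(counts):
    """Plain-text form: one 'bin_i<TAB>bin_j<TAB>count' line per non-empty cell."""
    return "".join(f"{i}\t{j}\t{n}\n" for (i, j), n in sorted(counts.items()))


def to_binary(counts):
    """Binary form: fixed-width records in the same sorted order as the text form."""
    return b"".join(RECORD.pack(i, j, n) for (i, j), n in sorted(counts.items()))


def from_binary(blob):
    """Rebuild the counts from the binary form."""
    return Counter({(i, j): n for i, j, n in RECORD.iter_unpack(blob)})


# Example: four read pairs given as (position of end 1, position of end 2).
pairs = [(5, 25_000), (25_500, 1_000), (12_000, 13_000), (40_000, 41_000)]
counts = bin_contacts(pairs)
blob = to_binary(counts)
```

In this sketch the text and binary forms carry the same information, so converting between them is lossless. Read IDs and QC metadata are deliberately left out; carrying them would need extra fields, which is the trade-off against file size noted above.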