==== Omics Data Standards WG - Minutes 07-25-2016 ====

  * Matrix formats and intermediate formats to facilitate module swapping (mapping, etc.)
  * FASTQ is definitely needed as the initial raw data
  * Discuss which format should be the candidate
  * Valid read-pair format may be extensible (columns can be added, e.g. cellular barcodes)
  * Discuss data structures and the pieces of information needed first?
  * Pair-wise contacts can be stored as a sparse array, while higher-order contacts may be stored as matrices
  * Reading and writing the candidate formats should have a low barrier to entry for the broader community
  * Which columns will be required and which ones will be optional?
  * Standardize mapping procedures? (AWG?)
  * Re-cap on known formats
    * BUTLR file: a binary form of the contact matrix for visualization purposes only, like bigWig
    * HDF5 file: a binary file format/platform built on multi-dimensional arrays that can store meta-data along with the core data; the platform includes indexing and compression modules that can be switched
    * .hic format: a family of formats including validpairs/mergednodups; a wrapper is available for query and compression; supports multiple contact matrices and meta-data
  * Factors to consider
    * How will the format be used (e.g. visualization)?
    * Current state of the ecosystem (software support, popular tools for visualization, community usage, documentation, etc.)
    * Performance (query speed, etc.)
    * Analysis and visualization may have different needs (pulling out the whole matrix vs. rapid resolution changes and querying)
    * Benchmarks, however, can be heavily influenced by implementation
  * Discuss the details of the benchmarks for all three formats and design tests for them
  * The interaction format is not finalized in ENCODE because it is outside the core formats
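
As a rough illustration of two points from the discussion (an extensible read-pair table and sparse storage of pair-wise contacts), the sketch below parses tab-separated read pairs and counts contacts per pair of genomic bins, keeping only non-zero entries. The column names (''read_id'', ''chrom1'', ''pos1'', ''chrom2'', ''pos2''), the optional ''cell_barcode'' column, the bin size and the function names are all assumptions made for this example; the minutes do not fix a schema, and this is not any of the candidate formats.

```python
from collections import Counter

# Hypothetical required columns; the working group had not fixed a schema.
REQUIRED = ("read_id", "chrom1", "pos1", "chrom2", "pos2")


def parse_pairs(lines):
    """Yield one dict per read pair from tab-separated text with a header row.

    Columns beyond REQUIRED (e.g. a cell barcode) are kept as optional fields.
    """
    rows = iter(lines)
    header = next(rows).rstrip("\n").split("\t")
    missing = [name for name in REQUIRED if name not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    for line in rows:
        if not line.strip():
            continue
        record = dict(zip(header, line.rstrip("\n").split("\t")))
        record["pos1"] = int(record["pos1"])
        record["pos2"] = int(record["pos2"])
        yield record


def bin_contacts(records, bin_size):
    """Count contacts per pair of genomic bins, storing only non-zero entries."""
    counts = Counter()
    for rec in records:
        a = (rec["chrom1"], rec["pos1"] // bin_size)
        b = (rec["chrom2"], rec["pos2"] // bin_size)
        # Keep the upper triangle only: a contact (a, b) equals (b, a).
        key = (a, b) if a <= b else (b, a)
        counts[key] += 1
    return counts


# Made-up example data with one optional column (cell_barcode).
example = [
    "read_id\tchrom1\tpos1\tchrom2\tpos2\tcell_barcode\n",
    "r1\tchr1\t1500\tchr1\t52000\tACGT\n",
    "r2\tchr1\t51000\tchr1\t1200\tACGT\n",
    "r3\tchr1\t1000\tchr2\t300\tTTGA\n",
]
contacts = bin_contacts(parse_pairs(example), bin_size=10_000)
for (bin_a, bin_b), n in sorted(contacts.items()):
    print(bin_a, bin_b, n)
```

Because extra columns are carried along without being required, a reader written this way keeps working when columns such as cellular barcodes are added later. A dictionary keyed by bin pair is only one simple sparse layout; the binary formats discussed above add indexing and compression on top of the same idea.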