Matrix formats and intermediate formats to facilitate module swapping (mapping, etc.)
FASTQ is definitely needed as the initial raw data
Discuss which format should be the candidate
Valid read-pair format may be extensible (columns can be added, e.g. cellular barcodes)
Discuss data structures and bits of information needed first?
Pairwise contacts may be stored as sparse arrays, while higher-order contacts may be stored as matrices
Reading and writing the candidate formats should have a low barrier to entry for the broader community
What columns will be required and which ones will be optional?
Standardize mapping procedures? (AWG?)
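To make the points above on extensible read-pair columns and sparse storage concrete, here is a minimal Python sketch. It assumes a tab-separated file with a header row. The column names in REQUIRED and the cell_barcode extra are illustrative assumptions only, since the minutes leave the required and optional columns undecided.

```python
import csv
import io
from collections import Counter

# Assumed required columns; the actual set was still an open question.
REQUIRED = ["read_id", "chrom1", "pos1", "chrom2", "pos2", "strand1", "strand2"]


def read_pairs(handle):
    """Yield one dict per read pair; columns beyond REQUIRED are kept as optional extras."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    for row in reader:
        row["pos1"] = int(row["pos1"])
        row["pos2"] = int(row["pos2"])
        yield row


def bin_pairs(pairs, bin_size):
    """Count pairwise contacts per bin pair, storing only non-zero entries."""
    counts = Counter()
    for p in pairs:
        a = (p["chrom1"], p["pos1"] // bin_size)
        b = (p["chrom2"], p["pos2"] // bin_size)
        # Keep a canonical order so (a, b) and (b, a) land in the same cell.
        key = (a, b) if a <= b else (b, a)
        counts[key] += 1
    return counts


# Hypothetical example data with one optional column (cell_barcode).
example = (
    "read_id\tchrom1\tpos1\tchrom2\tpos2\tstrand1\tstrand2\tcell_barcode\n"
    "r1\tchr1\t1500\tchr1\t98000\t+\t-\tACGT\n"
    "r2\tchr1\t99000\tchr1\t1200\t-\t+\tACGT\n"
    "r3\tchr1\t2000\tchr2\t5000\t+\t+\tTTGA\n"
)
pairs = list(read_pairs(io.StringIO(example)))
contacts = bin_pairs(pairs, bin_size=10_000)
```

A reader written this way ignores columns it does not know about, so adding an optional column such as a cellular barcode does not break existing tools. The dictionary of bin pairs is one simple form of sparse storage; a real format would likely use an indexed binary layout instead.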
Recap of known formats
BUTLR file: a binary form of the contact matrix for visualization purposes only, analogous to bigWig;
HDF5 file: a binary file format/platform that stores multi-dimensional arrays and can keep metadata alongside the core data; the platform includes indexing/compression modules that can be switched;
.hic format: a family of formats including validpairs/mergednodups; a wrapper is available for query/compression; supports multiple contact matrices and metadata;
Factors to consider: How to use the format? Visualization,
Current state of the ecosystem (software support, popular tools for visualization, community usage, documentation, etc.)
Performance (query speed, etc.)
Analysis and visualization may have different needs (pull out the whole matrix vs rapid resolution change and querying)
Yet benchmarks can be heavily influenced by implementation
Discuss the details of the benchmarks for all three formats and design tests for them.
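As a starting point for designing such tests, the sketch below times the two access patterns mentioned above (pulling out the whole matrix versus querying a region) against a toy in-memory sparse matrix. Everything here is an assumption for illustration: the function names are hypothetical, the data are random, and the region query is a deliberately naive linear scan rather than the indexed lookup a real format would provide.

```python
import random
import time


def make_sparse_matrix(n_bins, n_contacts, seed=0):
    """Build a toy upper-triangular sparse contact matrix as {(i, j): count}."""
    rng = random.Random(seed)
    matrix = {}
    for _ in range(n_contacts):
        i, j = sorted((rng.randrange(n_bins), rng.randrange(n_bins)))
        matrix[(i, j)] = matrix.get((i, j), 0) + 1
    return matrix


def fetch_whole(matrix):
    """Analysis-style access: pull every stored entry."""
    return list(matrix.items())


def fetch_region(matrix, start, end):
    """Visualization-style access: entries with both bins in [start, end)."""
    return [
        (k, v)
        for k, v in matrix.items()
        if start <= k[0] < end and start <= k[1] < end
    ]


def time_call(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over several repeats."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - t0)
    return best


matrix = make_sparse_matrix(n_bins=1000, n_contacts=20_000)
timings = {
    "whole_matrix": time_call(fetch_whole, matrix),
    "region_query": time_call(fetch_region, matrix, 100, 200),
}
```

To compare the three formats, fetch_whole and fetch_region would be replaced by each format's own reader while the same timing harness and the same input data are kept. Swapping the linear scan for an indexed lookup changes the region timing substantially, which illustrates why the implementation needs to be controlled for in any benchmark.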
The interaction format is not finalized in ENCODE, as it falls outside the core formats.
4dn/phase1/working_groups/omics_data_standards/minutes-07-25-2016.1553551611.txt.gz · Last modified: 2025/04/22 16:21 (external edit)