4D Nucleome Network Wiki


Usability study for different file formats.

Members (grad students and postdocs) of labs within 4DN were recruited and surveyed for their opinions on using both formats.
Reports can be seen here on Google Docs:
https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnw0ZG5kYXdpa2l8Z3g6Mjc0NjQzOWIwYTEyOTI0Ng
Main opinion: if the analysis is already implemented in the HiC ecosystem, it makes more sense to use HiC; if novel analyses are needed (new algorithms, etc.), the Python APIs provided by Cooler would be the better choice
- HiC also has Python APIs, although they are not public.

About the properness of the user audience

About the structure of survey questionnaires

Subjective answers may introduce bias; the questionnaire should include an explicit choice of which format the respondent prefers
Simple multiple-choice questions on other aspects would help filter responses

About supporting multiple formats

In the short term, supporting both formats is fine
Files can be optimized in the meantime: prepare for a second round, with a time window beforehand for both data formats to be optimized, APIs polished, etc.
Since more software will be developed for 4DN, ease of use will become more important
However, as a long-term solution, a single format is probably preferable to a two-format solution
Supporting more file formats would cause more confusion
If all conversions were perfect, multiple formats would be OK, but every converter and component adds another potential point of error.

About the analysis pipeline

Keep data in lossless BAM files, adding information for downstream filtering, then discard the original FASTQ files (they can be regenerated from the BAM files if needed).
A pairs file can be generated from the BAM file by applying filters
Add information about every pair (mapping quality, pair validity, etc.) to the BAM file
Both BAM files and pairs files are needed for the information they carry and therefore both need to be kept
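
The "annotate every pair, then filter into a pairs file" idea above can be sketched as follows. This is only an illustration, not the pipeline agreed on in the meeting: the PairRecord class, its field names, the "valid" / "duplicate" labels, the MAPQ threshold of 30 and the output column order are all assumptions made for the example, and in-memory records stand in for annotations that would really be read from a BAM file.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass(frozen=True)
class PairRecord:
    """Hypothetical per-pair annotation, standing in for data stored in a BAM file."""
    read_id: str
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str
    mapq1: int      # mapping quality of the first mate
    mapq2: int      # mapping quality of the second mate
    pair_type: str  # assumed validity label, e.g. "valid" or "duplicate"


def filter_pairs(
    records: Iterable[PairRecord],
    min_mapq: int = 30,                      # assumed threshold, not from the notes
    keep_types: Tuple[str, ...] = ("valid",),
) -> Iterator[PairRecord]:
    """Yield only pairs with an accepted label and both mates at or above min_mapq."""
    for rec in records:
        if rec.pair_type not in keep_types:
            continue
        if min(rec.mapq1, rec.mapq2) < min_mapq:
            continue
        yield rec


def to_pairs_line(rec: PairRecord) -> str:
    """Format one record as a tab-separated line (column order is an assumption)."""
    fields = [rec.read_id, rec.chrom1, str(rec.pos1),
              rec.chrom2, str(rec.pos2), rec.strand1, rec.strand2]
    return "\t".join(fields)


# Made-up example records: one good pair, one low-MAPQ pair, one duplicate.
records = [
    PairRecord("r1", "chr1", 100, "+", "chr1", 5000, "-", 60, 42, "valid"),
    PairRecord("r2", "chr1", 200, "+", "chr2", 300, "+", 60, 5, "valid"),
    PairRecord("r3", "chr2", 10, "-", "chr2", 900, "+", 60, 60, "duplicate"),
]

kept = list(filter_pairs(records))
pairs_lines = [to_pairs_line(rec) for rec in kept]
for line in pairs_lines:
    print(line)
```

Because the filter only reads the annotations and never modifies them, the same annotated records can be filtered again later with different thresholds, which is the reason given above for keeping the BAM file alongside the pairs file.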