Hi-C Data Processing and Analysis

At the last meeting, the OMICS group discussed the mapping of Hi-C reads. Results were largely consistent across groups; the main differences were single-end vs. paired-end mode and filtering. Burak at DCIC is still trying different tools for comparison. The real question is whether OMICS will give specific recommendations. Over the last week, Soo and Neva compared different mapping parameters.

Hi-C Read Alignment and Data Standards
(Please refer to Soo Lee’s slides.)

Reporting of chimeric alignment (-5 and -M flags)
A chimeric alignment is reported as one soft-clipped and one hard-clipped record. By default, the longer portion is reported as primary and is soft-clipped. With -5, the 3’-end is always soft-clipped (and the 5’-end hard-clipped). By default, the hard-clipped record is annotated as a “supplementary alignment”. Some tools, such as Picard, cannot handle multiple “primary alignments” for the same read. The -M flag marks the hard-clipped record as a “secondary alignment” instead. This reporting style is very widely used in the genomics community.

Single-end vs. Paired-end mode
We reported that paired-end mode with -SP produces results equivalent to single-end mode. We (Soo and Neva) investigated this further.
Simulations:
Real Hi-C reads:
Possible follow-ups:
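
Returning to the chimeric alignment reporting described earlier in this section, the difference between the default and -M styles can be checked directly on alignment records. The following is a minimal sketch, assuming the aligner writes standard SAM flag bits (0x100 for secondary, 0x800 for supplementary, as defined in the SAM specification). The function name and the example flag/CIGAR values are hypothetical and not taken from the group's data.

```python
# SAM flag bits as defined in the SAM specification.
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def classify_record(flag, cigar):
    """Return (role, clipping) for one alignment record.

    role     : "primary", "secondary" or "supplementary"
    clipping : "hard", "soft" or "none", judged from the CIGAR string
    """
    if flag & SUPPLEMENTARY:
        role = "supplementary"
    elif flag & SECONDARY:
        role = "secondary"
    else:
        role = "primary"

    if "H" in cigar:
        clipping = "hard"
    elif "S" in cigar:
        clipping = "soft"
    else:
        clipping = "none"
    return role, clipping


# Hypothetical records for one chimeric read (illustrative values only).
primary_part = classify_record(65, "60M41S")         # paired, first in pair
default_style_part = classify_record(2113, "60H41M")  # 65 + 0x800
m_style_part = classify_record(321, "60H41M")         # 65 + 0x100
```

Counting records per role in a small test file is a quick way to confirm which reporting style a given mapping run used.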
Hi-C Normalization Procedures
Some methods model particular bias modalities and attempt to correct them; others are matrix balancing methods (KR balancing, with the matrix pre-filtered).
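
To make the matrix balancing idea concrete, here is a minimal sketch that pre-filters low-coverage bins and then rescales the remaining bins until every row sum is equal. It uses plain iterative scaling (in the spirit of Sinkhorn-Knopp), not the Knight-Ruiz algorithm itself, which reaches the same kind of balanced matrix with a faster solver. The function names, the `min_coverage` threshold, and the toy matrix are assumptions made for illustration.

```python
def balance_bias(matrix, min_coverage=0.0, max_iter=500, tol=1e-9):
    """Estimate per-bin bias factors for a symmetric contact matrix.

    Bins whose raw row sum is <= min_coverage are filtered out and get
    a bias of None. For the kept bins, matrix[i][j] / (bias[i] * bias[j])
    has equal row sums once the iteration has converged.
    """
    n = len(matrix)
    keep = [i for i in range(n) if sum(matrix[i]) > min_coverage]
    if not keep:
        raise ValueError("no bins left after pre-filtering")

    bias = [1.0] * n
    for _ in range(max_iter):
        # Row sums of the currently corrected matrix, kept bins only.
        sums = {
            i: sum(matrix[i][j] / (bias[i] * bias[j]) for j in keep)
            for i in keep
        }
        if any(s == 0 for s in sums.values()):
            raise ValueError("a kept bin has no contacts with kept bins")
        mean = sum(sums.values()) / len(keep)
        if max(abs(s / mean - 1.0) for s in sums.values()) < tol:
            break
        # Bins with above-average coverage get a larger bias factor.
        for i in keep:
            bias[i] *= sums[i] / mean

    kept = set(keep)
    return [bias[i] if i in kept else None for i in range(n)]


def apply_bias(matrix, bias):
    """Divide each entry by the bias of its row and column bins.

    Entries that touch a filtered bin are set to 0.0.
    """
    n = len(matrix)
    return [
        [
            0.0 if bias[i] is None or bias[j] is None
            else matrix[i][j] / (bias[i] * bias[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy 3-bin matrix; the last bin is empty and is filtered out.
toy = [[2.0, 1.0, 0.0],
       [1.0, 1.0, 0.0],
       [0.0, 0.0, 0.0]]
toy_bias = balance_bias(toy)
toy_balanced = apply_bias(toy, toy_bias)
```

Returning the bias vector rather than only the corrected matrix keeps the result in the same form as a per-bin normalization factor.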
Reproducibility needs to be evaluated carefully, and any acceptable normalization method should give good reproducibility. Which reproducibility metric to use also needs to be decided; HiCRep may be a good candidate for evaluation. We can apply different normalization methods to the same dataset (the same Hi-C file) and use known features (SHH, for example) as criteria.
A very high-depth data file at different resolutions is preferred, such as the 1 kb-resolution ones.
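
One simple way to compare two versions of the same matrix (for example, the same dataset under two normalization methods, or two replicates) is to correlate them separately at each genomic distance, so that the strong distance decay does not dominate the score. The sketch below is only an illustration of that stratification idea and is not HiCRep's stratum-adjusted correlation coefficient; the function names and the toy matrices are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists, or None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # constant stratum, correlation is undefined
    return sxy / sqrt(sxx * syy)


def stratum_correlations(mat_a, mat_b, max_distance):
    """Correlate two square matrices diagonal by diagonal.

    Returns {distance_in_bins: correlation or None}. Strata with fewer
    than three entries are reported as None.
    """
    n = len(mat_a)
    result = {}
    for d in range(max_distance + 1):
        xs = [mat_a[i][i + d] for i in range(n - d)]
        ys = [mat_b[i][i + d] for i in range(n - d)]
        result[d] = pearson(xs, ys) if len(xs) >= 3 else None
    return result


# Toy matrices: mat_b is mat_a scaled by a constant factor.
mat_a = [[10, 5, 2, 1],
         [5, 12, 6, 3],
         [2, 6, 9, 4],
         [1, 3, 4, 11]]
mat_b = [[2 * v for v in row] for row in mat_a]
per_distance = stratum_correlations(mat_a, mat_b, max_distance=2)
```

A per-distance profile like this can also be inspected around a known feature instead of being collapsed into a single number.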
Normalization methods may be tied to the type of feature calls people are using, but that would make it hard to converge on standards. Using correlation between replicates may be misleading because of shared biases. Normalization methods work very well when the correct one is chosen, and the results show whether the correct method was used. We can use the GM12878 5 kb Hi-C datasets, chromosome 1 or chromosome 18, and let people send in normalization factors.

AGENDA
Chimeric reads simulation (Neva, Burak and Soo)
Continue discussion on Hi-C normalization
Ask other groups about their research