
Omics Data Standards WG - Minutes 05-08-2017

Hi-C Data Processing and Analysis

At the last meeting, the OMICS group discussed the mapping of Hi-C reads. The approaches used by the different groups were largely consistent; the main differences were single-end vs. paired-end mode and filtering. Burak at DCIC is still trying different tools for comparison. The real issue is whether OMICS will give specific recommendations.

Over the last week, Soo and Neva compared different parameters for the mapping.

Hi-C Read Alignment and Data Standards

(Please refer to Soo Lee’s slides)

Reporting of chimeric alignment (-5 and -M flags)

A chimeric alignment is reported as two records, one soft-clipped and one hard-clipped.

By default the longer portion is reported as primary, and is soft-clipped.

-5 means the 3’-end is always soft-clipped (5’-end hard-clipped).

By default, the hard-clipped record is annotated as a “supplementary alignment”. Some tools, such as Picard, cannot handle multiple “primary alignments” for the same read.

The -M flag marks the hard-clipped record as a “secondary alignment” instead. This flag and reporting style is widely used in the genomics community.
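The reporting styles above can be told apart from the FLAG and CIGAR fields of each SAM record. The following is a minimal sketch, assuming the standard SAM FLAG bits (0x100 for secondary, 0x800 for supplementary). The function name classify_record and the example records are hypothetical and are not taken from the slides.

```python
import re

SECONDARY = 0x100      # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM FLAG bit: supplementary alignment


def classify_record(flag, cigar):
    """Return (kind, clipping) for one SAM record (hypothetical helper)."""
    if flag & SUPPLEMENTARY:
        kind = "supplementary"
    elif flag & SECONDARY:
        kind = "secondary"
    else:
        kind = "primary"

    # Collect the CIGAR operation letters, e.g. "60M40S" -> {"M", "S"}.
    ops = set(re.findall(r"\d+([MIDNSHP=X])", cigar))
    if "H" in ops:
        clipping = "hard"
    elif "S" in ops:
        clipping = "soft"
    else:
        clipping = "none"
    return kind, clipping


# Made-up records for one chimeric 100-bp read.
print(classify_record(0, "60M40S"))     # primary portion, soft-clipped
print(classify_record(2048, "60H40M"))  # default style: supplementary, hard-clipped
print(classify_record(256, "60H40M"))   # -M style: secondary, hard-clipped
```

Counting records per read by kind is one quick way to check which reporting style a given alignment file uses.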

Single-end vs. Paired-end mode

We reported that paired-end mode with -SP produces equivalent results to single-end mode. We (Soo and Neva) investigated this further.

Simulations:

  • not including chimeric reads
  • excluding MAPQ=0 (multimappers which are randomly assigned)
  • gave identical results: 100% accuracy for both SE and PE with -SP.

Real Hi-C reads:

  • 1M reads
  • excluding MAPQ=0
  • All non-chimeric alignments were identical.
  • Only 1 out of 1M chimeric segments aligned to a different place in the two modes (both alignments had low MAPQ and looked odd).
  • Only 28 others out of 1M showed a difference, and only in the MAPQ score. Usually, both MAPQ scores were low but not identical.
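A comparison of this kind can be sketched as below. This is not the script Soo and Neva used. The input layout (a dict from segment id to chromosome, position and MAPQ), the function name compare_alignments, and the choice to skip a segment when either mode gives MAPQ=0 are all assumptions made for illustration, and the data are made up.

```python
def compare_alignments(se, pe):
    """Tally agreement between single-end and paired-end alignments.

    se, pe: dict mapping segment id -> (chrom, pos, mapq).
    Segments with MAPQ=0 in either mode are skipped (an assumption).
    """
    counts = {"identical": 0, "position_differs": 0,
              "mapq_differs": 0, "skipped": 0}
    for seg_id in se.keys() & pe.keys():
        chrom_a, pos_a, mapq_a = se[seg_id]
        chrom_b, pos_b, mapq_b = pe[seg_id]
        if mapq_a == 0 or mapq_b == 0:
            counts["skipped"] += 1
        elif (chrom_a, pos_a) != (chrom_b, pos_b):
            counts["position_differs"] += 1
        elif mapq_a != mapq_b:
            counts["mapq_differs"] += 1
        else:
            counts["identical"] += 1
    return counts


# Made-up toy data: one segment of each kind.
se = {"r1": ("chr1", 100, 60), "r2": ("chr2", 500, 3),
      "r3": ("chr3", 900, 0), "r4": ("chr1", 700, 12)}
pe = {"r1": ("chr1", 100, 60), "r2": ("chr2", 500, 5),
      "r3": ("chr3", 900, 0), "r4": ("chr5", 40, 9)}
print(compare_alignments(se, pe))
```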

Possible follow-ups:

  • Neva will try to see how much effort is required to simulate chimeric reads.
  • DCIC will contact Heng Li to let him know of the issue; we'll see if he has ideas about the difference.
  • Given that the difference is small, and we are not even sure which is really better, we can be ok with proceeding without figuring out exactly what is going on.
  • DCIC will confirm that SE and PE runtimes are roughly identical.

Hi-C Normalization Procedures

Two kinds of approach were discussed:

Methods that model particular bias modalities and attempt to correct them;

Matrix balancing methods (KR balancing, with pre-filtering of the matrix).

  • Is this the right thing to do? It assumes that every part of the genome has the same probability of contacting some other part. However, this may not be true.
  • Each method has its pros and cons; the resulting chromosomal features may differ between the two approaches.
  • A comparison could be presented for a particular chromosomal feature under different normalization methods.
  • We need to agree on common acceptance criteria for choosing normalization methods; different normalizations may be used for detecting different features.
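To make the equal-visibility assumption behind matrix balancing concrete, here is a minimal sketch of balancing by damped iterative scaling. This is not the KR (Knight-Ruiz) algorithm named above, which reaches a similar fixed point with a faster solver. The function name balance and the toy matrix are hypothetical, and the sketch assumes a symmetric matrix in which empty rows have already been filtered out.

```python
def balance(matrix, n_iter=200, tol=1e-9):
    """Scale a symmetric contact matrix so that all row sums are equal.

    Returns (balanced, bias) with balanced[i][j] = matrix[i][j] / (bias[i] * bias[j]).
    Assumes every row has a positive sum (filter empty rows first).
    """
    n = len(matrix)
    bias = [1.0] * n
    for _ in range(n_iter):
        sums = [sum(matrix[i][j] / (bias[i] * bias[j]) for j in range(n))
                for i in range(n)]
        mean = sum(sums) / n
        if max(abs(s / mean - 1.0) for s in sums) < tol:
            break
        # Square-root step: damped update that avoids oscillation.
        bias = [b * (s / mean) ** 0.5 for b, s in zip(bias, sums)]
    balanced = [[matrix[i][j] / (bias[i] * bias[j]) for j in range(n)]
                for i in range(n)]
    return balanced, bias


# Made-up symmetric 3x3 contact matrix.
raw = [[10.0, 4.0, 2.0],
       [4.0, 8.0, 1.0],
       [2.0, 1.0, 6.0]]
balanced, bias = balance(raw)
print([round(sum(row), 6) for row in balanced])  # row sums are equal after balancing
print([round(b, 4) for b in bias])
```

The per-bin bias vector is the kind of "normalization factor" that different groups could submit for the same dataset and compare.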

Reproducibility needs to be evaluated carefully, and any accepted normalization method should give good reproducibility. We also need to decide which reproducibility metric to use. HiCRep may be a good candidate for evaluation.

We can apply different normalization methods on the same dataset (same Hi-C file) and use known features (SHH, for example) as criteria.

  • This comparison will be coordinated with DCIC.
  • Neva can apply different normalization methods to the file and let people view the resulting maps across the genome to check the results.

A very high-depth data file that can be used at different resolutions is preferred, such as the 1kb-resolution ones.

  • However, those high-res maps are expensive to run (and few labs other than Erez’s lab are generating them), so focusing on 5kb or 10kb resolution may fit actual usage better.
  • The standards will be for the community and will last for a long time. Even if we choose a hi-res dataset, we can still sub-sample it to generate lower-res ones. Also, visualization may not be a very good method of inspection, and some quantitative methods may need to be agreed upon.
  • The normalization was initially applied to low-resolution datasets; in theory it can be applied at any resolution (although it has not previously been tested on hi-res datasets).
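One way to derive a lower-resolution map from a higher-resolution one is to merge adjacent bins, for example turning 5kb bins into 10kb bins. The sketch below shows only this bin aggregation; thinning the reads to a lower sequencing depth is a separate step and is not shown. The sparse upper-triangular layout, the function name coarsen, and the counts are assumptions made for illustration.

```python
from collections import defaultdict


def coarsen(contacts, factor):
    """Merge every `factor` adjacent bins of a sparse contact map.

    contacts: dict {(bin_i, bin_j): count} with bin_i <= bin_j, so each
    contact is stored once. The result uses the same convention.
    """
    coarse = defaultdict(int)
    for (i, j), count in contacts.items():
        # i <= j implies i // factor <= j // factor, so the output
        # stays upper-triangular.
        coarse[(i // factor, j // factor)] += count
    return dict(coarse)


# Made-up counts at 5kb resolution (bins 0..3).
contacts_5kb = {(0, 0): 5, (0, 1): 3, (1, 1): 4,
                (0, 2): 2, (1, 3): 1, (2, 3): 6}
contacts_10kb = coarsen(contacts_5kb, 2)
print(contacts_10kb)
```

Storing each contact once avoids double counting the off-diagonal contacts that fall inside a merged diagonal bin, which would happen if the blocks of a full symmetric matrix were simply summed.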

Normalization methods may be tied to the type of feature calls people are making, but that would make it hard to converge on standards.

Using the correlation between replicates may be misleading because of shared biases.
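A toy illustration of this concern: in the made-up data below, two replicates share a strong distance-decay trend but disagree on everything else. Distance decay stands in here for any structure common to both replicates. The overall Pearson correlation is close to 1, while the correlation within each distance is negative. All names and numbers are hypothetical, and stratifying by distance is only one possible remedy; this is not an implementation of any particular published metric.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def toy_replicates(n_dist=4, n_per_dist=4):
    """Rows of (distance, rep_a, rep_b) with a shared decay trend and
    opposite fluctuations around it."""
    rows = []
    for d in range(n_dist):
        base = 100.0 / (d + 1)  # shared distance decay
        for k in range(n_per_dist):
            e = 1.0 if k % 2 == 0 else -1.0
            rows.append((d, base + e, base - e))
    return rows


rows = toy_replicates()
overall = pearson([a for _, a, _ in rows], [b for _, _, b in rows])
per_distance = {
    d: pearson([a for dd, a, _ in rows if dd == d],
               [b for dd, _, b in rows if dd == d])
    for d in sorted({d for d, _, _ in rows})
}
print("overall:", overall)
print("per distance:", per_distance)
```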

Normalization methods work very well if the correct one is chosen, and the results show whether the correct method has been used.

We can use GM12878 5kb Hi-C datasets, chromosome 1 or chromosome 18 and let people send in normalization factors.

AGENDA

Chimeric reads simulation (Neva, Burak and Soo)

Continue discussion on Hi-C normalization

Ask other groups about their research

4dn/phase1/working_groups/omics_data_standards/minutes-05-08-2017.1553551602.txt.gz · Last modified: 2025/04/22 16:21 (external edit)