DAWG Meeting Notes 20170601

Hi-C read filtering. Presentation by Anton, Max, Mirny Lab.

(Minutes by Burak; Feel free to edit.)

4 types of filters

  • - PCR
  • - Dangling ends / self circles
  • - Poorly mapped reads (MAPQ)
  • - Random ligations (distance to restriction site)

Data set:

  • IMR90 HindIII Jin et al
  • IMR90 MboI Rao et al

1. PCR: looking at pairs vs. max-bp difference

  • - large peak at 0
  • - flattens out at max_delta=3
  • - mismatch of 1 or 2 does not have a lot of excess but they might be worth removing.
  • - Bing's question on tool: Using home-grown pairsamtools.
  • - Burak's question on comparison to juicer: basically agree.
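
A minimal sketch of the max_delta idea above, for illustration only. This is not the pairsamtools implementation; the function name, tuple layout, and example coordinates are assumptions made here. Two pairs are treated as duplicates when chromosomes and strands match and both ends differ by at most max_delta bp.

```python
from collections import defaultdict


def mark_duplicates(pairs, max_delta=0):
    """Flag pairs that duplicate an earlier-kept pair.

    pairs: list of (chrom1, pos1, strand1, chrom2, pos2, strand2) tuples
           (hypothetical layout, chosen for this sketch).
    Returns a list of booleans, True where the pair is flagged as a duplicate.
    """
    # Only pairs with identical chromosomes and strands can be duplicates.
    groups = defaultdict(list)
    for idx, (c1, p1, s1, c2, p2, s2) in enumerate(pairs):
        groups[(c1, s1, c2, s2)].append((p1, p2, idx))

    is_dup = [False] * len(pairs)
    for members in groups.values():
        members.sort()  # sort by pos1, then pos2
        window = []     # kept pairs whose pos1 is still within max_delta
        for p1, p2, idx in members:
            window = [(q1, q2) for q1, q2 in window if p1 - q1 <= max_delta]
            if any(abs(p2 - q2) <= max_delta for _, q2 in window):
                is_dup[idx] = True
            else:
                window.append((p1, p2))
    return is_dup


# Made-up coordinates, not data from the presentation.
example_pairs = [
    ("chr1", 100, "+", "chr1", 5000, "-"),
    ("chr1", 100, "+", "chr1", 5000, "-"),  # exact duplicate
    ("chr1", 102, "+", "chr1", 5001, "-"),  # off by 2 bp and 1 bp
    ("chr1", 100, "+", "chr1", 9000, "-"),  # different second end
    ("chr1", 100, "-", "chr1", 5000, "-"),  # different strand
]
exact_only = mark_duplicates(example_pairs, max_delta=0)
with_mismatch = mark_duplicates(example_pairs, max_delta=3)
```

With max_delta=0 only the exact copy is flagged; with max_delta=3 the near-identical pair is flagged as well. Each pair is compared only against pairs that were kept, which is one of several possible conventions.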

2. Dangling ends / self circles

  • - evidence in read pair orientation discrepancy at short distances.
  • - Dangling ends: ~+- pairs out to 1kb
  • - self circles: ~-+ pairs out to 10kb in HindIII; not so evident in MboI since molecules don't circle back as much.
  • - Conclusion: We cannot trust contacts within ~10kb for 6-cutters or within ~1kb for 4-cutters.
  • - Yunjiang: The error structure is to some extent library specific. Anton/Max: All HindIII data we've looked at has similar trends with roughly 10kb.
  • - Max: restriction efficiency can also have an effect in determining features. So we should look at the scale of ~2-3 restriction fragments.
  • - Burak: Did you look at restriction fragment instead of bp distance? Max: Yes, in a paper before.
  • - Anton/Max: propose to keep the pairs because they can be useful. But we might want to flag them. But don't include them in corrections (normalization).
  • - Bing: need to make sure these are removed in matrix.

3. MAPQ

  • - cis/total ratio does have a MAPQ dependence out to MAPQ=60.
  • - Try to estimate sensitivity/specificity vs. MAPQ threshold.
  • - MAPQ>0: 17% of filtered reads come from mismappers; MAPQ>=60: 3-5% of filtered reads come from mismappers.
  • - MAPQ>=60: 20% of reads are removed
  • - the effect of MAPQ>0 vs MAPQ>59 does not appear to be big in terms of cis contact probability distributions
  • - Conclusion: MAPQ>0 or MAPQ>10 might be the optimal cutoff.
  • - Followup
    • - - One should compare the matrices as a follow up.
    • - - Leonid asks: any A/B compartment features?
  • - Another solution:
    • - - report pure and less pure outputs, similar to juicer.
  • - Hoachen: Similar studies in WGS from same cell type may be informative

4. Distance to restriction site

  • - very close to rest-site (1-few bp): dangling ends
  • - very far from rest-site (: random ligation
  • - Also note low rate of within 30bp; because of bwa-mem requirement.

very close:

  • - can't really study cis/trans.
  • - cis contact probability distributions: some bias in very short distances, but overall have similar features to total.
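
For reference, a sketch of how the distance from a read position to the nearest restriction site could be computed. The function name and site positions are hypothetical, and using the nearest site in either direction (ignoring read strand) is a simplification made here, not necessarily what was presented.

```python
import bisect


def distance_to_nearest_site(pos, sites):
    """Distance in bp from pos to the closest restriction site.

    sites: sorted list of restriction-site positions on one chromosome
           (assumed to be precomputed for the enzyme in use).
    """
    if not sites:
        raise ValueError("no restriction sites given")
    i = bisect.bisect_left(sites, pos)  # index of first site >= pos
    candidates = []
    if i < len(sites):
        candidates.append(sites[i] - pos)      # nearest site at or after pos
    if i > 0:
        candidates.append(pos - sites[i - 1])  # nearest site before pos
    return min(candidates)


# Made-up site positions for illustration.
example_sites = [100, 500, 1200]
```

For example, distance_to_nearest_site(480, example_sites) gives 20, since the site at 500 is closer than the one at 100.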

very far:

  • - similar approach to above; use cis-trans to estimate sensitivity/specificity of a distance threshold. Low
  • - cis contact probability distributions: appears normal.
  • - 4cutters: you can't even assign to the right restriction fragment. Very very few reads at large distances
  • - 6cutters: 15% are far from cut site, but don't show bad properties; and are estimated to have relatively low noise.
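
The notes do not record how the cis/trans-based estimates were made for the MAPQ and distance thresholds. One possible approach, sketched here under a simple two-component assumption: reads passing a threshold are a mix of "clean" pairs with cis fraction cis_clean and "noise" pairs (mismapped or randomly ligated) with cis fraction cis_noise. All names and numbers below are hypothetical.

```python
def cis_fraction(n_cis, n_trans):
    """Fraction of pairs that are cis (same chromosome)."""
    total = n_cis + n_trans
    if total == 0:
        raise ValueError("no pairs")
    return n_cis / total


def estimate_noise_fraction(cis_observed, cis_clean, cis_noise):
    """Estimate the noise fraction f from the mixture assumption

        cis_observed = (1 - f) * cis_clean + f * cis_noise

    cis_clean: cis fraction assumed for trustworthy pairs
               (e.g. taken from the most stringent subset).
    cis_noise: cis fraction assumed for noise pairs
               (e.g. what random pairing of loci would give).
    """
    if not cis_noise < cis_clean:
        raise ValueError("expected cis_noise < cis_clean")
    f = (cis_clean - cis_observed) / (cis_clean - cis_noise)
    return min(1.0, max(0.0, f))  # clamp to a valid fraction


# Placeholder numbers, not results from the call.
observed = cis_fraction(n_cis=700, n_trans=300)
noise = estimate_noise_fraction(observed, cis_clean=0.8, cis_noise=0.2)
```

Repeating this per threshold (MAPQ cutoff or distance-to-site cutoff) would give a rough noise-vs-retained-reads curve; the result depends heavily on the two assumed reference fractions.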

Conclusion:

  • Propose: remove duplicates and short distance pairs.
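
A sketch of flagging short-distance pairs, using the ~10kb (6-cutter) and ~1kb (4-cutter) scales from section 2. The function name, argument layout, and exact cutoffs are assumptions made for illustration. It returns a flag rather than deleting the pair, in line with the suggestion to keep such pairs but mark them.

```python
# Approximate cutoffs taken from the discussion in section 2.
SHORT_DISTANCE_BP = {"6-cutter": 10_000, "4-cutter": 1_000}


def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2, cutter):
    """Return (orientation, is_short) for one read pair.

    orientation: strands of the two ends ordered by position, e.g. "+-"
                 (dangling-end-like) or "-+" (self-circle-like);
                 None for trans pairs, where it is not meaningful.
    is_short:    True for cis pairs closer than the cutoff for this cutter.
    """
    if chrom1 != chrom2:
        return None, False
    # Order the two ends by genomic position before reading off strands.
    if pos2 < pos1:
        pos1, strand1, pos2, strand2 = pos2, strand2, pos1, strand1
    orientation = strand1 + strand2
    is_short = (pos2 - pos1) < SHORT_DISTANCE_BP[cutter]
    return orientation, is_short


# Made-up example pairs.
inward_short = classify_pair("chr1", 5000, "-", "chr1", 4800, "+", "6-cutter")
trans_pair = classify_pair("chr1", 100, "+", "chr2", 200, "-", "6-cutter")
far_4cutter = classify_pair("chr1", 1000, "-", "chr1", 6000, "+", "4-cutter")
```

Flagged pairs could then be excluded from normalization and from the matrix while remaining available in the pairs file.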

Action items:

  • Build matrices for different filtering outputs and compare.

Next call:

  • 2 weeks. We will follow up on read filtering.