==== DAWG Meeting Notes 20170601 ====

|~~TABLE_CELL_WRAP_START~~
===== Hi-C read filtering. Presentation by Anton, Max, Mirny Lab. =====

(Minutes by Burak; feel free to edit.)

4 types of filters
  * - PCR duplicates
  * - Dangling ends / self circles
  * - Poorly mapped reads (MAPQ)
  * - Random ligations (distance to restriction site)

Data set:
  * IMR90 HindIII Jin et al
  * IMR90 MboI Rao et al

1. PCR: looking at number of pairs vs. max bp difference
  * - large peak at 0
  * - flattens out at max_delta=3
  * - a mismatch of 1 or 2 bp shows little excess, but such pairs might still be worth removing.
  * - Bing's question on tool: using home-grown pairsamtools.
  * - Burak's question on comparison to juicer: results basically agree.

2. Dangling ends / self circles
  * - evidence in read pair orientation discrepancy at short distances.
  * - Dangling ends: ~+- pairs out to 1kb
  * - Self circles: ~-+ pairs out to 10kb in HindIII; not so evident in MboI since molecules don't circle back as much.
  * - Conclusion: We cannot trust contacts within ~10kb for 6 cutters or within ~1kb for 4 cutters.
  * - Yunjiang: The error structure is to some extent library specific. Anton/Max: All HindIII data we've looked at has similar trends with roughly 10kb.
  * - Max: restriction efficiency can also have an effect in determining features. So we should look at ?2-3 restriction fragment scale
  * - Burak: Did you look at restriction fragment instead of bp distance? Max: Yes, in a paper before.
  * - Anton/Max: propose to keep the pairs because they can be useful, but we might want to flag them and not include them in corrections (normalization).
  * - Bing: need to make sure these are removed in the matrix.

3. MAPQ
  * - cis/total ratio does have a MAPQ dependence out to MAPQ=60.
  * - Try to estimate sensitivity/specificity vs. MAPQ threshold.
  * - MAPQ>0: 17% of filtered reads come from mismappers; MAPQ>=60: 3-5% of filtered reads come from mismappers.
  * - MAPQ>=60: 20% of reads are removed
  * - the effect of MAPQ>0 vs MAPQ>59 does not appear to be big in terms of cis contact probability distributions
  * - Conclusion: MAPQ>0 or MAPQ>10 might be the optimal cutoff.
  * - Followup
  * - - One should compare the matrices as a follow up.
  * - - Leonid asks: any A/B compartment features?
  * - Another solution:
  * - - report pure and less pure outputs, similar to juicer.
  * Hoachen: Similar studies in WGS from the same cell type may be informative

4. Distance to restriction site
  * - very close to rest-site (1-few bp): dangling ends
  * - very far from rest-site: random ligation
  * - Also note the low rate of reads within 30bp, because of the bwa-mem requirement.

very close:
  * - can't really study cis/trans.
  * - cis contact probability distributions: some bias at very short distances, but overall similar features to total.

very far:
  * - similar approach to above; use cis-trans to estimate sensitivity/specificity of a distance threshold. Low
  * - cis contact probability distributions: appear normal.
  * - 4cutters: you can't even assign to the right restriction fragment. Very very few reads at large distances
  * - 6cutters: 15% are far from cut site, but don't show bad properties, and are estimated to have relatively low noise.

Conclusion:
  * Propose: remove duplicates and short distance pairs.

Action items:
  * Build matrices for different filtering outputs and compare.

Next call:
  * 2 weeks. We will follow up on read filtering.
~~TABLE_CELL_WRAP_STOP~~|
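
Illustrative sketch (not from the presentation): one way the filters discussed in these notes could be combined into per-pair labels, so that flagged pairs are kept in the output but can be excluded from matrices and normalization. This is not the pairsamtools implementation. The ''Pair'' record, the ''label_pairs'' and ''same_molecule'' helpers, and the label names are hypothetical. The defaults are assumptions taken from the numbers in the notes: ''max_mismatch=3'' (where the duplicate curve flattens), ''min_mapq=1'' (MAPQ>0), and ''min_cis_dist=1000'' (the ~1kb limit for 4 cutters; ~10kb would apply to 6 cutters).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """Hypothetical minimal record for one Hi-C read pair."""
    chrom1: str
    pos1: int
    strand1: str
    mapq1: int
    chrom2: str
    pos2: int
    strand2: str
    mapq2: int


def same_molecule(a, b, max_mismatch=3):
    """True if both ends of a and b match within max_mismatch bp."""
    return (
        a.chrom1 == b.chrom1
        and a.chrom2 == b.chrom2
        and a.strand1 == b.strand1
        and a.strand2 == b.strand2
        and abs(a.pos1 - b.pos1) <= max_mismatch
        and abs(a.pos2 - b.pos2) <= max_mismatch
    )


def label_pairs(pairs, min_mapq=1, min_cis_dist=1000, max_mismatch=3):
    """Return one label per pair: low_mapq, duplicate, short_distance or ok.

    Quadratic-time duplicate search, for illustration only.
    """
    labels = []
    seen = []  # well-mapped, non-duplicate pairs encountered so far
    for p in pairs:
        if min(p.mapq1, p.mapq2) < min_mapq:
            labels.append("low_mapq")
            continue
        if any(same_molecule(p, k, max_mismatch) for k in seen):
            labels.append("duplicate")
            continue
        seen.append(p)
        is_cis = p.chrom1 == p.chrom2
        if is_cis and abs(p.pos2 - p.pos1) < min_cis_dist:
            # Flag rather than drop, as proposed by Anton/Max.
            labels.append("short_distance")
        else:
            labels.append("ok")
    return labels
```

Only pairs labeled ''ok'' would go into the matrices under this sketch; the other labels let the different filtering outputs be compared, as in the action item.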