Omics Data Standards WG - Minutes 04-24-2017

Mapping in Hi-C

Four groups presenting: Aiden, Mirny, Ren and DCIC

Juicer from Aiden Lab

Pipeline

Align reads as Single-end in parallel with bwa
Chimeric handling, merge sorting
“Wobble” removing duplicates
.hic files with contact maps at multiple resolutions

Aligner

bwa was chosen because it reports chimeric reads
Simulated 36bp, 76bp and 101bp reads with multiple aligners:
- BWA-SW
- BWA ALN
- BWA MEM Paired-end
- BWA MEM Single-end
BWA Single-end has the best accuracy and works with both short and long reads
Intra short results have the largest number of errors (the correct type is intra long or inter)

Chimeric reads

One end of a read pair may be split across the two interacting regions (while the other end lies within one region), in which case the overlapping part is discarded
Some ambiguous reads may be present (two regions per read, non-overlapping); those are collected separately and filtered by MAPQ >= 10
There may be “triples” or “quadruples” from those ambiguous reads
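
The handling described above can be illustrated with a minimal Python sketch. It is not Juicer's code: the input layout (one tuple of chrom, start, end, MAPQ per alignment), the function names, and the choice to apply the MAPQ >= 10 threshold to every alignment of a pair are assumptions made for illustration. Overlapping alignments are merged into one locus, and the number of remaining loci decides whether the pair is a normal pair, a triple or a quadruple.

```python
def count_loci(alignments):
    """Merge overlapping alignments on the same chromosome and count the loci."""
    loci = []
    for chrom, start, end, _mapq in sorted(alignments):
        if loci and loci[-1][0] == chrom and start <= loci[-1][2]:
            # Overlaps the previous locus: extend it instead of adding a new one.
            prev_chrom, prev_start, prev_end = loci[-1]
            loci[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            loci.append((chrom, start, end))
    return len(loci)


def classify_pair(alignments, min_mapq=10):
    """Return (label, passes_mapq) for all alignments of one read pair.

    Labels are hypothetical names: 'pair' for two loci, 'triple' and
    'quadruple' for ambiguous pairs, 'other' for anything else.
    """
    labels = {2: "pair", 3: "triple", 4: "quadruple"}
    label = labels.get(count_loci(alignments), "other")
    passes_mapq = all(mapq >= min_mapq for _c, _s, _e, mapq in alignments)
    return label, passes_mapq


if __name__ == "__main__":
    chimeric = [("chr1", 100, 150, 60), ("chr1", 5000, 5030, 60), ("chr1", 120, 160, 60)]
    print(classify_pair(chimeric))
```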

Duplicate removal

PCR duplicates / optical or machine duplicates
The former are informative about library complexity, but the latter are not
- PCR duplicates: adjust each position to the 5’ end, where the sequencer began to read; if both ends of two pairs have the same positions, they are called duplicates
- Optical duplicates are called based on camera position: reads that are close to each other are called optical duplicates
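
The 5’ adjustment can be sketched in Python as follows. This is an illustration under stated assumptions, not the Juicer implementation: alignments are given as SAM-style (chrom, 1-based leftmost position, CIGAR, reverse-strand flag) tuples, clipped bases are ignored, and the function names are hypothetical. For a forward-strand read the 5’ base is the leftmost aligned position; for a reverse-strand read it is the rightmost one.

```python
import re

# CIGAR operations that consume reference bases (per the SAM specification).
_REF_CONSUMING = set("MDN=X")
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_length(cigar):
    """Number of reference bases covered by an alignment."""
    return sum(int(n) for n, op in _CIGAR_RE.findall(cigar) if op in _REF_CONSUMING)


def five_prime_pos(pos, cigar, reverse):
    """1-based reference coordinate of the 5' aligned base of a read."""
    if not reverse:
        return pos
    return pos + reference_length(cigar) - 1


def same_five_prime_ends(pair_a, pair_b):
    """True if two read pairs share chromosome, strand and 5' position at both ends."""
    def key(end):
        chrom, pos, cigar, reverse = end
        return (chrom, reverse, five_prime_pos(pos, cigar, reverse))
    return [key(e) for e in pair_a] == [key(e) for e in pair_b]
```

With this adjustment, two reverse-strand alignments that start at different leftmost positions because of different clipping or read length can still be recognized as starting at the same 5’ base.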

“Wobble”

There are multiple “jackpot” bins with very large numbers of reads that appear to be PCR duplicates: the bins have a huge pileup of reads, and the reads are not evenly distributed within the bin but concentrated within 1~2bp.
These appear to come from a machine error in the Illumina sequencer (apparently a known bug acknowledged by Illumina, and not fixed due to low priority).
Therefore, reads within 4 bases are treated as duplicates.
If a read pair is seen at coordinate (x, y) in the genome, the likelihood that another pair is seen at (x ± 2bp, y ± 2bp) is much elevated. This must be accounted for to prevent large numbers of false positives in chromatin loop calling.
This phenomenon has also been seen in other analyses (duplicates shifted by 1~2bp)
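
A minimal Python sketch of duplicate removal with a “wobble” tolerance follows. It is not the Juicer code: the pair layout (chrom1, pos1, strand1, chrom2, pos2, strand2), the function name and the rule of comparing each pair only against already kept pairs are assumptions. The default tolerance of 4 bases follows the figure quoted in these minutes and is exposed as a parameter.

```python
from collections import defaultdict


def dedup_with_wobble(pairs, wobble=4):
    """Keep one representative of each group of near-identical read pairs.

    Two pairs count as duplicates when chromosomes and strands match and
    both 5' positions differ by at most `wobble` bases.
    """
    groups = defaultdict(list)
    for chrom1, pos1, strand1, chrom2, pos2, strand2 in pairs:
        groups[(chrom1, strand1, chrom2, strand2)].append((pos1, pos2))

    kept = []
    for (chrom1, strand1, chrom2, strand2), positions in groups.items():
        positions.sort()
        window = []  # kept pairs whose pos1 is within `wobble` of the current pos1
        for pos1, pos2 in positions:
            window = [(a, b) for a, b in window if pos1 - a <= wobble]
            if any(abs(pos2 - b) <= wobble for _a, b in window):
                continue  # near-identical to a kept pair: treat as duplicate
            window.append((pos1, pos2))
            kept.append((chrom1, pos1, strand1, chrom2, pos2, strand2))
    return kept


if __name__ == "__main__":
    example = [
        ("chr1", 100, "+", "chr1", 5000, "-"),
        ("chr1", 102, "+", "chr1", 5001, "-"),  # shifted by 1~2bp: duplicate
        ("chr1", 100, "+", "chr1", 6000, "-"),
    ]
    print(dedup_with_wobble(example))
```

Setting `wobble=0` reduces this to exact-position duplicate removal, which makes it easy to compare how many extra pairs the tolerance removes.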

Hi-C pipeline in Ren lab

Pipeline is almost identical to Juicer pipeline

Alignment is done with BWA MEM
Remove low-quality alignments (MAPQ < 10)
Remove PCR duplicates with Picard
Filter out:
- Read pairs that are > 500bp away from cutting sites;
- Intra-fragment reads;
- Short distance reads (<15kb)
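
The filters listed above can be sketched in Python as follows. This is an illustration, not the Ren lab code: the pair layout (chrom1, pos1, mapq1, chrom2, pos2, mapq2), the `cut_sites` dictionary of sorted cut-site positions per chromosome, and the function names are hypothetical. Two points are interpretations of the minutes: a pair is dropped when either end is more than 500bp from its nearest cutting site, and the 15kb distance filter is applied only to pairs on the same chromosome. Duplicate removal with Picard is not shown.

```python
from bisect import bisect_left, bisect_right


def dist_to_nearest_site(sites, pos):
    """Distance from pos to the nearest position in the sorted list `sites`."""
    i = bisect_left(sites, pos)
    candidates = []
    if i < len(sites):
        candidates.append(sites[i] - pos)
    if i > 0:
        candidates.append(pos - sites[i - 1])
    return min(candidates) if candidates else float("inf")


def keep_pair(pair, cut_sites, min_mapq=10, max_site_dist=500, min_separation=15_000):
    """Return True if a read pair passes all filters."""
    chrom1, pos1, mapq1, chrom2, pos2, mapq2 = pair
    if mapq1 < min_mapq or mapq2 < min_mapq:
        return False  # low-quality alignment
    sites1, sites2 = cut_sites[chrom1], cut_sites[chrom2]
    if (dist_to_nearest_site(sites1, pos1) > max_site_dist
            or dist_to_nearest_site(sites2, pos2) > max_site_dist):
        return False  # too far from any cutting site
    if chrom1 == chrom2:
        if bisect_right(sites1, pos1) == bisect_right(sites1, pos2):
            return False  # both ends fall in the same restriction fragment
        if abs(pos2 - pos1) < min_separation:
            return False  # short-distance pair
    return True
```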

Chimeric reads

Consider the 5’ end of every read: it is assumed to be closer to the actual partner, while the 3’ end will be closer to the other side. (The chimeric end appears to be further away from the 5’ end of either read.)
Erez: Suppose two reads A and B come from one pair. If the 5’ end of A is actually closer to B and the 3’ end of A is somewhere else, how will such an exceptional case be handled?
Yunjiang: We assume the 5’ end is always closer to the correct partner. Therefore, we don’t use the 3’ end of A in such cases.

Distiller from Mirny Lab

Data flow

fastq → (distiller) → .bam + .pairs files
.pairs file → (cooler) → .cool file(s)

Data organization

SRA-like hierarchy: runs → libraries → library groups (arbitrary grouping)
The hierarchy and experiment grouping are defined in a configuration file

Distiller pipeline

Alignment: map reads in “pseudo-paired” mode with “bwa mem -SP”, which disables bwa-mem pairing but keeps the mate information in the output
- -S: skip mate rescue
- -P: skip pairing
Parse the sam output from bwa into Hi-C pairs; preserve all alignment information in an intermediate hybrid pairs+sam file (.pairsam) for efficient processing
Chimeric reads: rescue molecules with chimeric alignments if the 3’ alignment on one side maps downstream of the alignment on the other side and has the same genomic direction
Classification: each read pair is assigned an alignment classification code according to a classification table. Unambiguous pairs correspond to LL (unique linear alignments on either side) and CX (rescued chimeras)
Calculate various statistics per run
Merge runs into libraries
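
The “pseudo-paired” alignment call can be written out as a short Python sketch. Only the bwa mem options named in these minutes (-S, -P) plus the standard -t thread option are used; the function name, file names and thread count are placeholders, and this is not taken from the distiller code.

```python
import shlex


def bwa_mem_sp_command(index_prefix, fastq1, fastq2, threads=4):
    """Build the argument list for a pseudo-paired bwa mem run.

    -S skips mate rescue and -P skips pairing; both mates are still given
    to bwa so that mate information is kept in the SAM output.
    """
    return ["bwa", "mem", "-SP", "-t", str(threads), index_prefix, fastq1, fastq2]


if __name__ == "__main__":
    # Placeholder file names; the list can be passed to subprocess.run().
    cmd = bwa_mem_sp_command("genome.fa", "reads_1.fastq", "reads_2.fastq", threads=8)
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.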

Filtering (set aside and stored, not deleted)

Unmapped / multimapped reads (only alignments with MAPQ > 0 are kept)
Multi-ligation molecules (mapping to 3 or more genomic locations)
Random breaks, optionally
Duplicates: detection of PCR and other duplicates allowing a ± N bp mismatch on either side (same as “wobble”)

Products

Generate 4DN-DCIC standard .pairs and .bam files with all Hi-C information
Produce lossless bam output with pairing information intact
Aggregate pairs into binned matrices at specified resolutions (.cool files)
Generate single- or multi-resolution cool files
Merge contact matrices from libraries and experiment groups

Implementation

Each step is implemented as a standalone command-line tool for modularity, so modules can be reused
Workflow control is delegated to a workflow manager (snakemake/nextflow)
Installation of dependencies is done automatically via docker/conda
>50% of computational time is spent on mapping
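
Aggregating pairs into binned matrices at several resolutions can be illustrated with a small Python sketch. It does not use the cooler library or the .cool format: pairs are assumed to be (chrom1, pos1, chrom2, pos2) tuples with 1-based positions, the matrix is a plain sparse dictionary, and the function names are hypothetical.

```python
from collections import Counter


def bin_pairs(pairs, resolution):
    """Count contacts per pair of bins, storing only one triangle of the matrix."""
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        a = (chrom1, (pos1 - 1) // resolution)  # 1-based position to 0-based bin
        b = (chrom2, (pos2 - 1) // resolution)
        if b < a:
            a, b = b, a  # contacts are symmetric: store (a, b) with a <= b
        counts[(a, b)] += 1
    return counts


def bin_pairs_multires(pairs, resolutions):
    """Build one sparse contact matrix per requested resolution."""
    pairs = list(pairs)
    return {res: bin_pairs(pairs, res) for res in resolutions}


if __name__ == "__main__":
    example = [("chr1", 1, "chr1", 10_001), ("chr1", 10_500, "chr1", 5)]
    print(bin_pairs_multires(example, [10_000, 100_000]))
```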

bwa flags and alignment file standards by Soo Lee and Burak Alver from DCIC

Use BWA for alignment

Data standards

A lossless bam file conforming to existing bam standards
All information is preserved, including chimeric alignments
Alignment information for the mate is provided

Alignment issues

Discordant alignments (from Paired-end alignment) must be kept
Chimeric reads need to be handled

Single-end vs Paired-end mode

Single-end mode results do not contain complete mate information (mate info missing and/or incorrect identifier flag); this needs to be fixed manually
Paired-end mode will try to force mapping if only one mate is mapped; this can be switched off (-SP option) to make results more consistent with Single-end mode
Single-end mode may miss “mate unmapped” flags, which can make the Single-end result appear to contain more reads than the Paired-end result
Paired-end mode with -SP doesn’t appear to penalize discordant reads, so Paired-end mode with -SP seems to be preferred for alignment
Paired-end mode takes the same amount of time as Single-end mode. No simulations have been conducted for Paired-end -SP yet, but it’s worth trying
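
The mate-related flags discussed above live in the SAM FLAG field. The following Python sketch decodes that field using the bit values defined in the SAM specification; the short flag names and the helper functions are chosen here for illustration and are not part of any tool mentioned in these minutes.

```python
# FLAG bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def lacks_mate_info(flag):
    """True if the record is not marked as paired, as in Single-end output."""
    return not flag & 0x1


if __name__ == "__main__":
    print(decode_flag(73))    # a first-in-pair read whose mate is unmapped
    print(decode_flag(2064))  # a reverse-strand supplementary (chimeric) alignment
```

Records for which `lacks_mate_info` is true carry no read1/read2 or mate-unmapped information, which is the gap in Single-end output noted above.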

Caveats and Issues

Currently the sam/bam files from Juicer do not contain some standard SAM information (mates etc.); this is still being worked on.

Performance of different alignment tools can be evaluated.

Data formats that can be handled by DCIC are also important.

The bwa mem -5 flag can be discussed in the next call.

How to structure a careful simulation study can also be discussed during the next call.
