Agenda
DamID Protocol
This is the protocol presented by the von Schaik group on June 11th, 2018. There were no additional comments on the protocol. Bill Noble proposed to hold the vote, seconded by Burak Alver. The protocol was approved by voice vote and will be sent to the steering committee for approval.

Cut&Run Protocol
This is the protocol published in Nature Protocols and presented by the Henikoff group on June 25th, 2018. There were no additional comments on the protocol. Bill Noble proposed to hold the vote, seconded by Burak Alver. The protocol was approved by voice vote and will be sent to the steering committee for approval.

Allele-specific Alignment of HiC Reads
Goals: software for phasing, and a gold standard of phased genomes for 4DN cell lines.
Focus: Hi-C and other 3C methods, and 1D-rich experiments (Repli-seq, ChIP-seq).
Two datasets were used: GM12878 and the Patski mouse hybrid cell line (M. spretus x C57BL/6J) from embryonic kidney. Read filtering differs considerably between the volunteering groups (multi-mappers, unmapped reads, PCR duplicates, etc.), especially in Patski. The reads that are selected are mostly aligned to the same place (consistency over 99%).
For all groups, the ratio (ref-alt calls + alt-ref calls) / (ref-ref + alt-alt) is small for both cell lines, which is expected. The bias towards one allele (alt-alt / ref-ref) appears close to 1.
For contact probability vs genomic distance, all combinations among all groups appear to have a downward trend, even though alt-ref and ref-alt pairs, which lie on different chromosomes, are expected to flatten out. There may be some biological mechanism behind this phenomenon, but these pairs were not expected to show this pattern. Bing wondered whether the high degree of ref-alt calls is due to mapping or to the underlying biology. Bill suggested that the spretus is a wild individual and that the reference genome is from Sanger but from a different individual of the spretus population, so the variants in the spretus line may differ from those in the Sanger genome.
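The two summary ratios described above, (ref-alt + alt-ref) / (ref-ref + alt-alt) and alt-alt / ref-ref, can be sketched as follows. This is only a minimal illustration, not any group's pipeline: the function name `allele_pair_ratios` and the input format (one `(end1, end2)` tuple per read pair, each end labelled `"ref"` or `"alt"`) are assumptions made for the example.

```python
from collections import Counter


def allele_pair_ratios(pair_calls):
    """Summarize allele calls for read pairs where both ends were assigned.

    pair_calls: iterable of (end1, end2) tuples, each end "ref" or "alt"
    (hypothetical input format chosen for this sketch).
    Returns (mixed_ratio, allele_bias).
    """
    counts = Counter(pair_calls)
    ref_ref = counts[("ref", "ref")]
    alt_alt = counts[("alt", "alt")]
    mixed = counts[("ref", "alt")] + counts[("alt", "ref")]
    same = ref_ref + alt_alt

    # (ref-alt + alt-ref) / (ref-ref + alt-alt): expected to be small
    mixed_ratio = mixed / same if same else float("nan")
    # alt-alt / ref-ref: expected to be close to 1 without allelic bias
    allele_bias = alt_alt / ref_ref if ref_ref else float("nan")
    return mixed_ratio, allele_bias


if __name__ == "__main__":
    # Toy counts for illustration only, not data from the meeting.
    calls = (
        [("ref", "ref")] * 50
        + [("alt", "alt")] * 48
        + [("ref", "alt")]
        + [("alt", "ref")]
    )
    print(allele_pair_ratios(calls))
```

Returning NaN when a denominator is zero keeps the function usable on small or heavily filtered subsets without raising an error.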
Erez stated that the overwhelming source of error may be the lack of a gold standard in the community; in other words, the initial annotation of the variants in spretus may be incorrect, causing apparent errors. Even with the best genome, there still appear to be differences among the groups. Bing Ren suggested that F129 Hi-C data, with a better characterized genome than spretus and more data from different groups, would be a better dataset.
For GM12878, the Platinum Genomes dataset from Illumina was used as the reference; however, Erez stated that this dataset may not be reliable and that they used only variants called by multiple groups in their publication. This reflects the difficulty of phasing genomes. There is also the observation that GM12878 cells are not all identical: aliquots from different groups carry different variants.
Pairwise comparison of reported allele assignments appears fairly consistent among the groups for GM12878. The disagreements appear more likely to come from one group assigning more calls than the other (possibly causing false positives). For Patski the difference is slightly higher.
To try to build a gold standard, DCIC selected ~200 GM12878 reads that were assigned differently and had the groups assign them again manually in a genome browser. Most of the manual assignments agree with each other, and Giancarlo’s and Yunjiang’s results are more favored. For Patski reads, the results are more divergent, but Giancarlo’s and Yunjiang’s results are still favored.
The allelic assignment working group will come back to the OMICS WG to determine future steps, such as finalizing the phasing pipeline. For now, Yunjiang’s and Giancarlo’s pipelines can be used as a starting point.
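The pairwise comparison of allele assignments mentioned above can be sketched as follows. It separates reads that both groups assigned (where agreement can be measured) from reads that only one group assigned (where one group is simply making more calls). The function name `compare_assignments` and the dictionary input format (read ID mapped to `"ref"` or `"alt"`) are assumptions for this sketch and do not describe how DCIC actually ran the comparison.

```python
def compare_assignments(group_a, group_b):
    """Compare allele assignments from two groups.

    group_a, group_b: dicts mapping read ID -> "ref" or "alt"
    (hypothetical format; reads a group left unassigned are absent).
    """
    shared = group_a.keys() & group_b.keys()
    agree = sum(1 for read_id in shared if group_a[read_id] == group_b[read_id])
    return {
        # reads assigned by both groups
        "shared": len(shared),
        # fraction of shared reads given the same allele
        "agreement": agree / len(shared) if shared else float("nan"),
        # reads assigned by only one group (extra calls, possible false positives)
        "only_a": len(group_a.keys() - group_b.keys()),
        "only_b": len(group_b.keys() - group_a.keys()),
    }


if __name__ == "__main__":
    # Toy assignments for illustration only.
    a = {"r1": "ref", "r2": "alt", "r3": "ref", "r4": "alt"}
    b = {"r1": "ref", "r2": "alt", "r3": "alt"}
    print(compare_assignments(a, b))
```

Reporting the `only_a` and `only_b` counts alongside the agreement fraction makes it easier to tell genuine conflicts apart from differences in how many reads each group chose to call.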
Discussion
Most of the groups are using BWA as the mapper. Anton’s tool is the same one used by DCIC for filtering, but given the performance results, DCIC is considering adapting Yunjiang’s and Giancarlo’s tools into the existing one.
Erez has published some results for the 1000 Genomes Project that look at mismatch rates (how often the two ends do not align to the same molecule), which are an order of magnitude higher. He agreed with Burak that evaluation is very hard unless the SNPs are reliable. Burak commented that we were running into a circularity problem, since the tools were used to improve the SNP sets and the SNP sets were used to choose the better tool.
Erez suggested that a very basic control could be paired-end DNA-seq with the same read length as the Hi-C, because paired-end DNA-seq reads have to come from the same molecule 100% of the time. If the DNA-seq control gave elevated mismatch (conflicting) rates, then something was wrong. A DNA-seq control could be generated by following exactly the same protocol without the ligase, and these sorts of controls were very important.
Giancarlo commented that there was going to be a SNP issue. He generated a SNP list for the spretus mouse and validated the list with datasets that had not been used previously, to avoid introducing a specific bias; the list might be used in future experiments.
Erez highlighted the usage of “inbred” in discussions: a lot of things called “inbred” are actually only mostly inbred, not totally inbred. There might still be some portion of the genome that is not homozygous. The level of inbreeding required for one type of genetic study may differ from that required for others. To assess whether a mouse is inbred enough, an accurate phasing method would be needed, which again creates a circularity problem. Crosses from such unclear “inbred” mice would cause further confusion.
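The DNA-seq control suggested above can be sketched as a simple comparison of conflicting-pair rates. Both ends of a DNA-seq pair come from the same molecule, so pairs whose ends are assigned to different alleles point to SNP or assignment errors rather than biology. The function names, the `(concordant, conflicting)` count format, and the `max_control_rate` threshold are all assumptions for this sketch; no threshold was discussed in the meeting.

```python
def conflict_rate(concordant, conflicting):
    """Fraction of phased read pairs whose two ends disagree on the allele."""
    total = concordant + conflicting
    if total == 0:
        raise ValueError("no phased read pairs")
    return conflicting / total


def check_control(control_counts, hic_counts, max_control_rate=0.01):
    """Compare a DNA-seq control against a Hi-C library.

    control_counts, hic_counts: (concordant, conflicting) pair counts.
    max_control_rate: hypothetical cutoff, chosen only for illustration.
    """
    control = conflict_rate(*control_counts)
    hic = conflict_rate(*hic_counts)
    return {
        "control_rate": control,
        "hic_rate": hic,
        # an elevated control rate suggests unreliable SNPs or assignment
        "control_ok": control <= max_control_rate,
    }


if __name__ == "__main__":
    # Toy counts for illustration only.
    print(check_control((9990, 10), (9500, 500)))
```

If the control passes, the excess conflict rate in the Hi-C library over the control gives a rough sense of how much of the signal is not explained by assignment error alone.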
The consensus was that we would need a better SNP set, and that the first iteration of the approach might be to identify the false SNP calls and filter more stringently, as in Giancarlo’s or Yunjiang’s pipelines. Erez urged starting a rigorous conversation about creating a cell type and an accurate variant set, which might become a valuable resource for 4DN to contribute. Giancarlo suggested that we might use some other hybrid cell lines, since Patski was chosen because it was readily available. Bing concurred that F129 might be used as an additional cell line if time and manpower were available. Bing’s group deposited some F123 data, but they might not be as deep as the 129 ones. Some 129 data sets are being prepared by the Gilbert lab; once they are updated, the sub-working group will try to work on them and report back.