4D Nucleome Network Wiki

Omics Data Standards WG - Minutes 04-10-2017

AGENDA:

The agenda changed today because the presenters for the microC protocol were not ready for today’s meeting.
This call will instead cover issues left open from earlier OMICS discussions, such as ‘Hi-C data standards’. We will highlight the points that are still unclear.

Hi-C DATA STANDARDS:

The current metadata system includes:
- Sample information: genetic modification, biological sample.
- Generic information: variations on the SOPs, as well as genetic experiments involved in the Hi-C experiment.
- Data information: QC metrics, etc.

All of this information will be included when the metadata is submitted, and metadata can be submitted without a data file. The questions here are: How are we going to get these data files? Which quality control metrics should be used? What thresholds should apply to those metrics?

An ID for the biological sample and the experiment will be created by DCIC upon receiving the FASTQ file. Since DCIC wants to apply its own SOP to the raw data, submitted QC metrics may not be necessary (DCIC will generate them).
Still to be decided for the DCIC SOP: mapping tools, parameters, the way reads are mapped (individual reads, separately), mapping using paired-end features, how peaks/domains/other compartments are called, normalization, etc. The content of the metadata will be determined on a joint DCIC/OMICS call ten days from now; the SOP itself may need to be settled in future discussions.
Before considering which methods to use for calling peaks/domains/other compartments, a written, well-defined set of metrics should be in place to evaluate these methods. This can be done via Google Docs or other means and will not be discussed today.

PRESENTATION:

First slide: Hi-C Experimental structure

The slide shows an overall map of all the different objects in our database that could possibly be associated with a Hi-C experiment.
Experiment set replicate: the OMICS working group has discussed that we must have at least two biological replicates and two technical replicates.
Experiment Hi-C: the Hi-C experiment metadata includes the biosample, cell line or tissue. We need information on how the sample was prepared and where it came from.
It also includes the individuals submitting the data, etc.
These are all the inputs that are needed to have an Experiment Set.
A submission can carry relatively little information about the experiment if the SOP was followed. However, the submitter still needs to indicate which cell line was used and how it was prepared, the passage number, etc., because there are variations in the SOPs for each cell line.

Second and third slides: Biosample

Biosample: this has been divided into different objects. There is a description and a type for the biosample.
There is a controlled dictionary for all keywords, and these will be available as options for submitters on the system. If you are using an existing Biosource for a particular Tier1 or other cell line, in most cases there is no need to fill out the information on the organism (just choose the keywords).

Fourth slide: Biosample control information

* These information entries were put forward by the Cell working group.
* A reference to the SOP document is also needed.
* Which entries are required and which are optional?
* There will be essential checks (the standard is to be jointly finalized by the Cell WG and OMICS WG) that a submission is required to pass before the dataset can be released.
* Karyotype image: the Cell working group mentioned that we might use Hi-C data to determine whether the karyotype is normal. Erez and David Gilbert are working together to create a systematic pipeline that is sensitive to karyotypic abnormalities. It is planned to be used as a verification system.
* Karyotype information is only required for unstable cell lines that have been passaged more than 10 times. If any entry is required, it is important to tell the consortium ASAP so that people can prepare before generating any data. This will be put to the steering committee call.
* There is a Google Doc by DCIC about these metadata entries, which Burak will share with OMICS.

NUMBER OF READS:

* The number of reads was previously decided not on the basis of the features to be called, but by convenience: ~400-500 million per sample.
* Below a certain depth it is very difficult to distinguish features on these maps, so it is very important to provide some guidance to the consortium. The features that we are trying to capture depend on higher sequencing depths. A single cell will have ~1B contacts, corresponding to 2B read-pairs.
* For this, we need a power analysis for a loop-caller to see how good the results are and how many loops it will call.
* We can have two standards: 500M required for TAD calling but much deeper (for example, 2B) for loop-calling. However, the current domain-calling tools all need deeper sequencing; otherwise the resulting domains will not be reliable.
* DAWG is currently discussing domain/loop-calling methods, and OMICS may join that discussion. DAWG has its agenda posted on the wiki and is currently working on topics such as alignment filtering, matrix balancing, and replicable analysis; it will come to those methods afterwards.
* There appears to be a scheduling problem, however. The DAWG meeting used to be once per month but has been moved to once every other week, so we might be able to move more quickly.
* The power analysis is crucial. One on TAD callers was just published and can be used as an example.

ALIGNMENT, NORMALIZATION, AND OTHER ASPECTS OF THE SOP:

* The groups can present their own in-house normalization procedures so that people can compare them. The Ren lab has already prepared such a presentation, which can be scheduled at two weeks' notice. The metrics for method evaluation can be determined after the presentations.
* The same applies to domain-callers and loop-callers. We can make a Google Doc to list all these matters and schedule when they are going to be discussed. We will follow up by email first about when to discuss them.
* The next DAWG meeting (Thu. Apr. 20th, 8:30am PDT) will be focused on pipeline discussion (alignment first; the rest of the meeting can be devoted to normalization). Two sessions may be needed for the three talks scheduled.

We will need to know where the methods diverge and then consider which one is going to be adopted. This may also take another two sessions.
For people to better evaluate the methods, a dataset needs to be provided. The computational and time costs may also be taken into consideration.
Burak can provide the dataset, at different data sizes created by subsetting, for people to benchmark. This may also serve as a way to measure how deep the sequencing needs to go.
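
One possible way to create benchmark subsets of different sizes is sketched below in Python. This is only an illustration under stated assumptions and not the DCIC procedure: the function name `nested_subsamples`, the fractions, and the seed are hypothetical. It samples read-pair indices rather than individual reads, so both mates of a pair stay together, and it makes each smaller subset a subset of every larger one so that results at different depths are directly comparable.

```python
import random


def nested_subsamples(n_pairs, fractions, seed=0):
    """Map each fraction to a sorted list of read-pair indices.

    Every read pair gets one uniform random score in [0, 1). A pair is in
    the subset for fraction f when its score is below f, so the subset for
    a smaller fraction is always contained in the subset for a larger one.
    """
    rng = random.Random(seed)  # fixed seed makes the subsets reproducible
    scores = [rng.random() for _ in range(n_pairs)]
    return {
        f: [i for i, s in enumerate(scores) if s < f]
        for f in sorted(fractions)
    }


if __name__ == "__main__":
    # Hypothetical example: 10%, 50% and 100% of 1,000 read pairs.
    subsets = nested_subsamples(1000, [0.1, 0.5, 1.0], seed=42)
    for fraction, indices in subsets.items():
        print(fraction, len(indices))
```

To apply this to paired FASTQ files, one would keep the records whose pair index is in the chosen list, taking the same index from both mate files. Subset sizes are approximate (each pair is kept independently), which should be acceptable when the goal is to compare callers across depths.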
