In depth metadata guide

ESapenaVentura edited this page Jul 20, 2023 · 3 revisions

Contributing metadata to the DRACC

In MorPhiC, metadata is defined and validated by the MorPhiC Metadata Schema. To make metadata entry more accessible, the input is provided as a metadata spreadsheet, which users fill in to describe their experiment.

This document is an in-depth walkthrough for filling out the metadata spreadsheet.

What is in this document?

How we capture experimental designs

Prior to metadata entry, a spreadsheet is generated to reflect the experimental requirements of the user's lab. To do that, we have produced a series of diagrams that capture the experiments we expect to receive. Each user receives a tailored version of the spreadsheet that adheres to the metadata schema while reflecting their needs.

A description of each of the tabs is given below:

Dataset

This is the general information about the user's dataset. We define a dataset as a minimal collection of data and metadata:

  • That shares a unique set of Data Usage restrictions, as defined by the Data Use Ontology (DUO) under data use modifier
  • That utilises a single parental cell line (e.g. KOLF2.1J)
  • That uses a single type of biomaterial (iPSCs, hESCs, Organoid, Embryoid body)
  • Whose parameters extracted from the biomaterials were obtained following the same set of procedures (Readout assay)
  • Tied to a single data-producing institute (DPC)
  • Expected to be produced by a DPC and delivered to the DRACC according to a timeline.
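
The criteria above amount to a rule that certain attributes take a single value across the whole dataset. The sketch below illustrates that rule only; it is not part of the MorPhiC tooling, and the field names (`data_use_modifier`, `parental_cell_line`, `biomaterial_type`, `readout_assay`, `institute`) and the example values other than KOLF2.1J and iPSCs are assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical field names: each must hold one value across a dataset.
SHARED_FIELDS = [
    "data_use_modifier",
    "parental_cell_line",
    "biomaterial_type",
    "readout_assay",
    "institute",
]


def fields_breaking_dataset(records: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Return the shared fields that hold more than one distinct value."""
    conflicts = {}
    for field in SHARED_FIELDS:
        values = sorted({record[field] for record in records})
        if len(values) > 1:
            conflicts[field] = values
    return conflicts


# Placeholder record; only KOLF2.1J and iPSCs come from this guide.
base = {
    "data_use_modifier": "modifier_A",
    "parental_cell_line": "KOLF2.1J",
    "biomaterial_type": "iPSCs",
    "readout_assay": "assay_A",
    "institute": "DPC_1",
}
consistent = [dict(base), dict(base)]
mixed = [dict(base), dict(base, parental_cell_line="OTHER_LINE")]

print(fields_breaking_dataset(consistent))
print(fields_breaking_dataset(mixed))
```

Under this reading, records that disagree on any of these fields would belong to separate datasets.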

On this tab, the user fills out general details about their experiment, such as the point of contact, a general description of the data the dataset is going to contain, the readout assay used to generate this data, and many other fields. For specific details about all the fields in this tab, please refer to the [Dataset documentation] TBD

Cell line/Differentiated cell line

This is the metadata gathered for the cell lines. The user will find 2 different tabs:

  • Cell line: Attributes related to the cell line generated after modification with an expression alteration strategy protocol. That includes:
    • Which parental cell line it derives from
    • Experiment-related metadata to identify it (e.g. Clone ID) or to define it (e.g. Cell line type)
    • Other biologically relevant metadata (e.g. Zygosity, Model system)
  • Differentiated cell line: Attributes related to the cell line after a period of differentiation. That includes:
    • Which Cell line it derives from
    • Other biologically relevant metadata (e.g. Differentiation protocol)

On these tabs, more than one row can be filled out, allowing multiple cell lines or differentiated cell lines to be registered at the same time. Once registered, these get associated with a dataset.

Expression alteration strategy

These are the attributes gathered about the method used to alter the gene expression in the cell lines used for the experiment. The information registered here will be used to understand which genes are affected by this technique, how the technique has affected them (e.g. targeted genomic region), and a general description of the protocol itself.

Library preparation

For datasets that generate transcriptomics/genomics data, this tab collects the information related to the library preparation. Currently, most of the information is extracted through the links, such as which method was used to generate the library and which method was used to dissociate the cells (if any).

Sequence file

For datasets that generate transcriptomics/genomics data, this tab collects the attributes of the different sequence files. These attributes are then used to understand the content of each file; they include fields such as type of read (read 1, read 2, Index 1) and lane index.

Understanding the spreadsheet

In this section, we go into detail on how to interpret the spreadsheet.

What each column tells you about the field

Each column of the metadata spreadsheet aims to be self-explanatory; if you find that this is not the case, please contact us and we will do our best to improve the description we provide for the field.

Each row serves a different purpose:

  1. Row 1: User friendly name of the field. This should be a short title for the attribute the rows below need to describe.
  2. Row 2: Short description of the field.
  3. Row 3: Guidelines. This should include, when necessary, a list of the controlled vocabulary used, examples, and other important directions when filling out the field.
  4. Row 4: Programmatic name of the field. As a contributor, you don't need to worry about this row: this is later used to import the spreadsheet and ensure the fields map to the JSON metadata schema.
  5. Row 5: This is a separator row. Nothing will be extracted from this row.
  6. Rows 6+: This is where the information is extracted. Please fill each row as a separate entity.
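
The row layout above can be sketched as a small parser. This is a minimal illustration that assumes a tab has already been loaded as a list of rows of strings; it is not the actual MorPhiC importer, and the programmatic names (`cell_line.id`, `cell_line.clone_id`) and the description and guideline texts are hypothetical.

```python
from typing import Dict, List

HEADER_ROWS = 5             # rows 1-5: name, description, guidelines, programmatic name, separator
PROGRAMMATIC_ROW_INDEX = 3  # row 4, counted from zero


def extract_entities(rows: List[List[str]]) -> List[Dict[str, str]]:
    """Map each data row (rows 6+) onto the programmatic names from row 4."""
    keys = rows[PROGRAMMATIC_ROW_INDEX]
    entities = []
    for row in rows[HEADER_ROWS:]:
        # Skip rows the contributor left completely empty.
        if not any(cell.strip() for cell in row):
            continue
        entities.append({key: cell for key, cell in zip(keys, row) if key})
    return entities


example_tab = [
    ["Cell line ID", "Clone ID"],                           # row 1: user friendly name
    ["A unique ID for the cell line.", "The clone ID."],    # row 2: short description
    ["Must be unique within the dataset.", "Example: A"],   # row 3: guidelines
    ["cell_line.id", "cell_line.clone_id"],                 # row 4: hypothetical programmatic names
    ["", ""],                                               # row 5: separator, ignored
    ["kolf2_knock1_a", "A"],                                # rows 6+: one entity per row
    ["kolf2_knock1_b", "B"],
]

entities = extract_entities(example_tab)
for entity in entities:
    print(entity)
```

Each dictionary produced this way corresponds to one entity, keyed by the names that map to the JSON metadata schema.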

What are IDs: Defining materials, protocols and provenance

The word ID appears on almost all the tabs in the metadata spreadsheet. In the context of metadata gathering, there is an important distinction to make here: an ID is an identifier, unique within the dataset, with which certain attributes are associated. In a real-world scenario, a Cell line ID could be kolf2_knock1_a, and could refer to the metadata associated with clone A, derived from KOLF2.1J after an expression alteration strategy protocol.

That ID is later used to link to the next step in the experiment, e.g. to the differentiated cell line kolf2_knock1_a_differentationA_1day, and later to a file. Linking the experimental layout in this way allows the data portal and data consumers to retrieve, starting from a file, all the relevant metadata from each stage of the experimental process.
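
The linking described here can be sketched as a walk from a file back through the IDs it derives from. This is an illustration under stated assumptions, not the data portal's implementation: the `derived_from` field, the record structure, and the file name `example_R1.fastq.gz` are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical records keyed by ID; "derived_from" is an assumed field name.
records: Dict[str, Dict[str, Optional[str]]] = {
    "kolf2_knock1_a": {
        "type": "cell line",
        "derived_from": None,
    },
    "kolf2_knock1_a_differentationA_1day": {
        "type": "differentiated cell line",
        "derived_from": "kolf2_knock1_a",
    },
    "example_R1.fastq.gz": {
        "type": "sequence file",
        "derived_from": "kolf2_knock1_a_differentationA_1day",
    },
}


def provenance(start_id: str, records: Dict[str, Dict[str, Optional[str]]]) -> List[str]:
    """Follow derived_from links from start_id back to the first stage."""
    chain: List[str] = []
    current: Optional[str] = start_id
    while current is not None:
        if current in chain:
            raise ValueError(f"circular link at ID {current!r}")
        if current not in records:
            raise KeyError(f"ID {current!r} is not registered in the dataset")
        chain.append(current)
        current = records[current]["derived_from"]
    return chain


chain = provenance("example_R1.fastq.gz", records)
for entity_id in chain:
    print(records[entity_id]["type"], "->", entity_id)
```

Because every ID is unique within the dataset, a lookup by ID is unambiguous, which is what makes this kind of walk possible.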