In-depth metadata guide
In MorPhiC, metadata is defined and validated by the MorPhiC Metadata Schema. To make the metadata more accessible to users, however, the input is collected through a metadata spreadsheet, which users fill in to describe their experiment.
This document is an in-depth walkthrough for filling out the metadata spreadsheet.
Prior to metadata entry, a spreadsheet is generated that reflects the experimental requirements of the user's lab. To do that, we have created a series of diagrams that capture the experiments we expect to receive. Each user receives a tailored version of the spreadsheet that adheres to the metadata schema while reflecting their needs.
A description of each of the tabs is given below:
This is the general information about the user's dataset. We define a dataset as a minimal collection of data and metadata:
- That shares a unique set of Data Usage restrictions, as defined by the Data Use Ontology (DUO) under data use modifier
- That utilises a single parental cell line (e.g. KOLF2.1J)
- That uses a single type of biomaterial (iPSCs, hESCs, Organoid, Embryoid body)
- Whose parameters extracted from the biomaterials were obtained following the same set of procedures (Readout assay)
- Tied to a single data-producing institute (DPC)
- Expected to be produced by a DPC and delivered to the DRACC within a set timeline.
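Most of the criteria above amount to "exactly one value per dataset" checks. The following is a minimal sketch of that idea, not part of the MorPhiC tooling. The field names (`data_use_modifier`, `parental_cell_line`, `biomaterial_type`, `readout_assay`, `institute`) and the placeholder values are assumptions made for illustration, and the delivery timeline criterion is not modelled.

```python
# Hypothetical field names; the real schema may name these differently.
SINGLE_VALUED_FIELDS = (
    "data_use_modifier",
    "parental_cell_line",
    "biomaterial_type",
    "readout_assay",
    "institute",
)


def fields_breaking_dataset_rule(records):
    """Return the fields that take more than one distinct value.

    An empty result means the records could belong to a single dataset
    under the criteria listed above.
    """
    offending = {}
    for field in SINGLE_VALUED_FIELDS:
        values = {record.get(field) for record in records}
        if len(values) > 1:
            offending[field] = sorted(str(value) for value in values)
    return offending


# Illustrative records only; the values are placeholders.
base = {
    "data_use_modifier": "example modifier",
    "parental_cell_line": "KOLF2.1J",
    "biomaterial_type": "iPSCs",
    "readout_assay": "example assay",
    "institute": "example DPC",
}
consistent_records = [dict(base), dict(base)]
mixed_records = [dict(base), {**base, "biomaterial_type": "Organoid"}]

print(fields_breaking_dataset_rule(consistent_records))
print(fields_breaking_dataset_rule(mixed_records))
```

In this sketch, records that mix two biomaterial types would need to be split into two datasets.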
On this tab, the user fills out general details about their experiment, such as the point of contact, a general description of the data the dataset is going to contain, the readout assay used to generate this data, and many other fields. For specific details about all the fields in this tab, please refer to the [Dataset documentation] TBD
This is the metadata gathered for the cell lines. The user will find 2 different tabs:
- Cell line: Attributes related to the generated cell line after being modified with an expression alteration strategy protocol. That includes:
  - Which parental cell line it derives from
  - Experiment-related metadata to identify it (e.g. Clone ID) or to define it (e.g. Cell line type)
  - Other biologically relevant metadata (e.g. Zygosity, Model system)
- Differentiated cell line: Attributes related to the cell line after a period of differentiation. That includes:
  - Which Cell line it derives from
  - Other biologically relevant metadata (e.g. Differentiation protocol)
On these tabs, more than one row can be filled out, allowing multiple cell lines or differentiated cell lines to be registered at the same time. Once registered, these get associated with a dataset.
These are the attributes gathered about the method used to alter the gene expression in the cell lines used for the experiment.
The information registered here is used to understand which genes are affected by this technique and how the technique has affected them (e.g. targeted genomic region), alongside a general description of the protocol itself.
For datasets that generate transcriptomics/genomics data, this tab collects the information related to the library preparation. Currently, most of the information is extracted through the links, such as which method was used to generate the library and which method was used to dissociate the cells (if any).
For datasets that generate transcriptomics/genomics data, this tab collects the information related to the attributes of the different sequence files. These attributes, such as type of read (read 1, read 2, Index 1) and lane index, are then used to understand the content of each file.
In this section, we go into detail on how to interpret the spreadsheet.
Each column of the metadata spreadsheet aims to be self-explanatory; if you find that is not the case, please contact us and we will do our best to improve the description we provide for the field.
Each row serves a different purpose:
- Row 1: User-friendly name of the field. This should be a short title for the attribute the rows below need to describe.
- Row 2: Short description of the field.
- Row 3: Guidelines. This should include, when necessary, a list of the controlled vocabulary used, examples, and other important directions when filling out the field.
- Row 4: Programmatic name of the field. As a contributor, you don't need to worry about this row: this is later used to import the spreadsheet and ensure the fields map to the JSON metadata schema.
- Row 5: This is a separator row. Nothing will be extracted from this row.
- Rows 6+: This is where the information is extracted. Please fill each row as a separate entity.
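The row layout above can be sketched in code. This is a minimal illustration of how an importer might use row 4 as the keys and rows 6+ as the entities; it is not the actual MorPhiC importer. The tab is represented as a plain list of rows, and the programmatic names (`cell_line_id`, `clone_id`) and cell values are hypothetical.

```python
HEADER_ROWS = 5            # rows 1-5: name, description, guidelines, programmatic name, separator
PROGRAMMATIC_NAME_ROW = 4  # 1-based, matching the list above


def extract_entities(rows):
    """Turn the rows of one tab into a list of dicts, one per filled row."""
    keys = rows[PROGRAMMATIC_NAME_ROW - 1]
    entities = []
    for row in rows[HEADER_ROWS:]:
        # Skip rows that were left completely blank.
        if all(cell in (None, "") for cell in row):
            continue
        entities.append(dict(zip(keys, row)))
    return entities


# Illustrative tab content; names and values are made up for this sketch.
tab = [
    ["Cell line ID", "Clone ID"],                # row 1: user-friendly names
    ["Short description", "Short description"],  # row 2: descriptions
    ["Guidelines", "Guidelines"],                # row 3: guidelines
    ["cell_line_id", "clone_id"],                # row 4: programmatic names
    ["", ""],                                    # row 5: separator, ignored
    ["kolf2_knock1_a", "A"],                     # row 6: first entity
    ["", ""],                                    # blank row, skipped
    ["kolf2_knock1_b", "B"],                     # another entity
]

entities = extract_entities(tab)
print(entities)
```

Because only row 4 supplies the keys, the text in rows 1 to 3 can be reworded for readability without affecting what is extracted in this sketch.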
The word ID is repeated on almost all the tabs in the metadata spreadsheet. In the context of metadata gathering, there is an important distinction to make here: an ID is an identifier, unique within the dataset, with which certain attributes are associated.
In a real-world scenario, a Cell line ID could be kolf2_knock1_a, and could refer to the metadata associated with clone A, derived from KOLF2.1J after an expression alteration strategy protocol.
That ID is later used to link to the next step in the experiment, e.g. to the differentiated cell line kolf2_knock1_a_differentationA_1day, and later to a file. Linking the experimental layout in this way allows the data portal and data consumers to extract, from a file, all the metadata that is relevant from each stage in the experimental process.
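The ID linking described above can be illustrated with a small lookup. This is a simplified sketch that assumes a file points directly at a differentiated cell line, which in turn points at a cell line; the real spreadsheet may include intermediate steps. All field names and the file name `example_R1.fastq.gz` are hypothetical.

```python
# Field names are hypothetical; the two IDs follow the example in the text.
cell_lines = {
    "kolf2_knock1_a": {"clone_id": "A"},
}
differentiated_cell_lines = {
    "kolf2_knock1_a_differentationA_1day": {"cell_line_id": "kolf2_knock1_a"},
}
files = {
    "example_R1.fastq.gz": {
        "differentiated_cell_line_id": "kolf2_knock1_a_differentationA_1day",
        "read_type": "read 1",
    },
}


def metadata_for_file(file_name):
    """Follow the ID links from a file back to its cell line."""
    file_record = files[file_name]
    differentiated_id = file_record["differentiated_cell_line_id"]
    differentiated_record = differentiated_cell_lines[differentiated_id]
    cell_line_id = differentiated_record["cell_line_id"]
    return {
        "file": file_record,
        "differentiated_cell_line": differentiated_record,
        "cell_line": cell_lines[cell_line_id],
    }


linked = metadata_for_file("example_R1.fastq.gz")
print(linked["cell_line"])
```

Each lookup uses the ID from the previous stage as its key, which is why IDs need to be unique within the dataset.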