Skip to content

Preparing Data for Entry into Expression Database

Alexia edited this page May 5, 2020 · 2 revisions

Schema Overview

The full database schema can be viewed here.

Processing Data for Database

The following image shows the general flow of how data was converted from the format provided by Dr. Dahlquist's Biological Databases class into the format used in the expression database: A diagram explaining data manipulations for expression database Essentially, we want to create a new row for each expression data timepoint. This means that a good deal of data is repeated, but this is necessary for the sake of making the database easy to query.

I manually made these manipulations in an Excel workbook. This workbook can be found in the shared GRNsight Box folder within the GRNsight Expression folder. The file is named ExpressionDataManipulations.xlsx, and is from this workbook that I was able to extract the finalized ExpressionData.csv which contains all the data used to populate the expressiondata table in the database. The ExpressionDataManipulations workbook will be the most helpful resource for future users who wish to upload their data to the expression database in the proper format. I did not automate this process, but I imagine it would not be terribly difficult for a future student to do so, especially since all of the formulas exist in the workbook.

Uploading to the Database

I put my data into the database by

  1. creating the table within the database with all of the correct columns and data types
  2. putting the desired data for the table into a CSV file format
  3. using scp to securely copy the CSV file from my local machine to the EC2 instance
  4. using the \copy process within the database to copy the data within the CSV file from its location on the EC2 to the desired table in the database

I never appended data to an existing table, although I am sure it will involve some INSERT command (look here) combined with some method of using \copy within the database to insert new data from a CSV into the table. (Whoever is tasked with doing this should populate this wiki with more data.)

Table Columns and Details

expressiondata

  • sortindex: an arbitrary ordered numbering of each row (Ex: 1, 2, 3...)
  • id: a unique identifier for each gene, same as systematic_name in degradationrates and productionrates tables (Ex: YAL001C)
  • standardname: a unique identifier for each gene, same as standard_name in degradationrates and productionrates tables (Ex: TFC3)
  • author: last time of the primary author of the publication from which the data originates (Ex: Barreto)
  • sampleid: contains the full entry of strain, expression type, and timepoints -- was used in original data given to me to create these tables (Ex: dCIN5_LogFC_t15-1)
  • expression: the actual expression data (Ex: -0.2983)
  • timepoints: the timepoint value drawn from sampleid (Ex: 15, 30, 60...)
  • dataset: full dataset name that user selects from front end -- contains author, year of publication, and strain data (Ex: Dahlquist_2018_dcin5)

degradationrates

  • systematic_name: a unique identifier for each gene, same as systematic_name in productionrates table and id in expressiondata table (Ex: YAL051W)
  • standard_name: a unique identifier for each gene, same as systematic_name in productionrates table and standardname in expressiondata table (Ex: PDR3)
  • degradation_rate: the actual degradation rate data (Ex: 0.0835)

productionrates

  • systematic_name: a unique identifier for each gene, same as systematic_name in degradationrates table and id in expressiondata table (Ex: YAL051W)
  • standard_name: a unique identifier for each gene, same as systematic_name in degradationrates table and standardname in expressiondata table (Ex: PDR3)
  • degradation_rate: the actual production rate data (Ex: 0.2718)

expressionmetadata

  • sample_id: a combination of the pubmed id and some digit listing the sample number in some experiment/publication (Ex: 122697421)
  • pubmed_id: the pubmed id associated with the experiment/publication (Ex: 12269742)
  • control_yeast_strain: name of the control yeast strain in the experiment/publication (Ex: S288C)
  • treatment_yeast_strain: name of the treatment yeast strain in the experiment/publication (Ex: S288C)
  • control: controlled variable of the experiment/publication (Ex: No Thiuram)
  • treatment: treatment used in the non-control group in the experiment/publication (Ex: Thiuram)
  • concentration_value: concentration of treatment (Ex: 75)
  • concentration_unit: concentration unit of treatment (Ex: uM)
  • time_value: timepoint values (Ex: 15, 30, 120...)
  • time_units: units of timepoint values (Ex: m)
  • number_of_replicates: number of replicates in the experiment/publication (Ex: 3)
  • expression_table: name of the expression table associated with all this data (Ex: Kitagawa_2002_log2_expression)

refs

  • pubmed_id: the pubmed id associated with the experiment/publication (Ex: 12269742)
  • authors: list of authors credited for the experiment/publication (Ex: Kitagawa E., Takahashi J., Momose Y., Iwahashi H.)
  • publication_year: year of publication of the experiment/publication (Ex: 2002)
  • title: title of the experiment/publication (Ex: The short-term response of yest to potassium starvation)
  • doi: the doi associated with the experiment/publication (Ex: 10.1021/es015705v)
  • ncbi_geo_id: the geo id associated with the experiment/publication (Ex: GSE9336)
Clone this wiki locally