
[MVP] Define DB using SQLAlchemy #4400

Closed
btylerburton opened this issue Jul 21, 2023 · 12 comments · Fixed by GSA/datagov-harvester#15

@btylerburton
Contributor

User Story

In order to maintain a list of harvest sources and their states, the data.gov team wants to define the harvest DB using SQLAlchemy.

Related to:

Acceptance Criteria

Provided below are the DB tables and their fields:

harvest_source — contains the config for each harvest source
  • id (str)
  • name (str)
  • notification_emails (array)
  • organization_name (str) (*self-reported)
  • frequency (str) (cron)
  • config (obj)
  • url (arr)
  • schema_validation_type (str)
harvest_job — contains job state information for each job that's run through the pipeline
  • id (str)
  • status (str)
  • date_created (str)
  • date_finished (str)
  • extract_started (str)
  • extract_finished (str)
  • compare_started (str)
  • compare_finished (str)
  • records_added (int)
  • records_updated (int)
  • records_deleted (int)
  • records_errored (int)
  • records_ignored (int)
harvest_error — contains all errors in the pipeline
  • id (str)
  • job_id (str)
  • record_id (str) (*blank for extract errors)
  • record_reported_id (str) (metadata identifier present in the record or catalog)
  • date_created (str)
  • type (str) (where the error occurred)
  • severity (str)
    • CRITICAL: pipeline & system error
    • ERROR: record-level processing error
    • WARN: we don't log these
  • message (str)

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

  • Use SQLAlchemy to create the tables as defined above
  • Record any alterations to the schema that were necessary
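The tables above could be sketched with SQLAlchemy's declarative mapping roughly as follows. This is a sketch, not the final schema: the column types, the JSON stand-ins for the array/object fields (`ARRAY` would be the Postgres-native choice), and the SQLite sanity check at the end are all assumptions.

```python
from sqlalchemy import JSON, Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class HarvestSource(Base):
    __tablename__ = "harvest_source"
    id = Column(String, primary_key=True)
    name = Column(String)
    notification_emails = Column(JSON)  # ARRAY(String) on Postgres
    organization_name = Column(String)
    frequency = Column(String)  # cron expression
    config = Column(JSON)
    url = Column(String)
    schema_validation_type = Column(String)

class HarvestJob(Base):
    __tablename__ = "harvest_job"
    id = Column(String, primary_key=True)
    status = Column(String)
    date_created = Column(String)
    date_finished = Column(String)
    extract_started = Column(String)
    extract_finished = Column(String)
    compare_started = Column(String)
    compare_finished = Column(String)
    records_added = Column(Integer)
    records_updated = Column(Integer)
    records_deleted = Column(Integer)
    records_errored = Column(Integer)
    records_ignored = Column(Integer)

class HarvestError(Base):
    __tablename__ = "harvest_error"
    id = Column(String, primary_key=True)
    job_id = Column(String, ForeignKey("harvest_job.id"))
    record_id = Column(String, nullable=True)  # blank for extract errors
    record_reported_id = Column(String)
    date_created = Column(String)
    type = Column(String)  # where the error occurred
    severity = Column(String)  # CRITICAL | ERROR | WARN
    message = Column(String)

# Sanity-check that the schema compiles by creating it in an in-memory DB
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
```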
@rshewitt
Contributor

rshewitt commented Aug 1, 2023

we could use an ENUM for the Frequency field in harvest_source (e.g. "daily" = "0 1 * * *", "weekly" = "0 1 * * 1", "monthly" = "0 1 1 * *"). I suppose these values could be custom, though, which would nullify the use of an ENUM.
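If the ENUM route is taken, one hedged sketch is a Python enum backed by SQLAlchemy's `Enum` type; the cron values below are just the illustrative examples from this comment:

```python
import enum

from sqlalchemy import Column, Enum, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Frequency(enum.Enum):
    # Illustrative values from the comment above; a custom cron string
    # would not fit this type without a DB migration.
    daily = "0 1 * * *"
    weekly = "0 1 * * 1"
    monthly = "0 1 1 * *"

class HarvestSource(Base):
    __tablename__ = "harvest_source"
    id = Column(String, primary_key=True)
    frequency = Column(Enum(Frequency))  # stored as the enum's name, e.g. "daily"
```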

@rshewitt
Contributor

rshewitt commented Aug 1, 2023

if we use an array dtype, querying for a specific email or URL means we would have to grab all the values to find whatever it is we're looking for, right? Are we going to query for these kinds of things? We could create a contacts table to store this information instead.
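A rough sketch of that normalized alternative (the `contact` table and its column names are hypothetical), where each email lives in its own row and can be queried directly:

```python
from sqlalchemy import Column, ForeignKey, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class HarvestSource(Base):
    __tablename__ = "harvest_source"
    id = Column(String, primary_key=True)
    contacts = relationship("Contact", back_populates="source")

class Contact(Base):
    __tablename__ = "contact"
    id = Column(String, primary_key=True)
    source_id = Column(String, ForeignKey("harvest_source.id"))
    email = Column(String)
    source = relationship("HarvestSource", back_populates="contacts")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    src = HarvestSource(id="s1")
    src.contacts.append(Contact(id="c1", email="admin@example.gov"))
    session.add(src)
    session.commit()
    # Look up one email directly instead of unpacking an array column
    found = session.query(Contact).filter_by(email="admin@example.gov").one()
    found_source_id = found.source_id
```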

@rshewitt
Contributor

rshewitt commented Aug 2, 2023

can a harvest job occur without an associated harvest source?

@rshewitt rshewitt self-assigned this Aug 2, 2023
@rshewitt rshewitt moved this from New Dev to 🏗 In Progress [8] in data.gov team board Aug 2, 2023
@nickumia-reisys
Contributor

To answer some of the questions above:

we could use an ENUM for the Frequency field in harvest_source (e.g. "daily" = "0 1 * * *", "weekly" = "0 1 * * 1", "monthly" = "0 1 1 * *"). I suppose these values could be custom, though, which would nullify the use of an ENUM.

It depends on the customization we want to allow. Currently, we only support a generic subset of time periods. If we choose an ENUM type, that set would be hard to extend without a DB migration. In either case, though, if we don't need granular specificity, then ENUM is fine.

if we use an array dtype, querying for a specific email or URL means we would have to grab all the values to find whatever it is we're looking for, right? Are we going to query for these kinds of things? We could create a contacts table to store this information instead.

I don't think we'll ever have to search for a specific email. We don't support that currently. It's mostly a one-way operation: just force-update the notification emails. However, the contacts table would be more flexible. With that flexibility comes more complication. If the following questions are not important, we can leave it as is... (If they are important, we'll have to re-evaluate):

can data be harvested without an associated harvest source record?

No. But there's also something wrong about this question. Harvest sources define groups of datasets. I guess my point is, what is a "harvest source record"? Just a harvest source? Or a record within the catalog that exists at the URL in the harvest source config?

@rshewitt
Contributor

rshewitt commented Aug 2, 2023

I've updated my question. I'm asking to determine the relationship between tables. So we'll want to add another field to harvest_job, like source_id, and relate them via a FK?

@nickumia-reisys
Contributor

I've updated my question. I'm asking to determine the relationship between tables. So we'll want to add another field to harvest_job, like source_id, and relate them via a FK?

Yes, that is a very interesting consideration. I was discussing this with @FuhuXia and there's not a simple answer to this. If we want to replicate CKAN completely, then there are FOUR components to this:

  1. Harvest Source,
  2. Harvest Job,
  3. Harvest "object" (complete + incomplete harvested dataset)
  4. Harvest data (complete harvested dataset; once it has been completed, harvest "object" is no longer necessary)

The relationship between these is as follows:

  • One source can have multiple jobs associated with it.
  • One job can have multiple objects associated with it.
  • One job can have multiple data associated with it.
  • For an initial harvest,
    • Let total number of datasets = 100
    • Let successfully harvested datasets = 90
    • Let datasets in error = 10
    • In this scenario, in the end, there is one source with one job (10 objects and 90 data).
  • If we run this harvest again,
    • Let total number of datasets = 110
    • Let successfully harvested datasets = 95
    • Let datasets in error = 15
    • In this scenario, in the end, there is one source with two jobs:
      • job1 has the old data (10 objects and 90 data).
      • job2 has the new data (15 objects and 95 data).

The "motivation" for separating harvest object from harvest job is to decrease the size of the job table. If we store data in the job table, lookup might slow down because of how large each record is. The harvest object table would have the data with a FK referenced in the job table.

All of that is historically how it has worked in CKAN. I don't believe it is most efficient; however, Fuhu thinks it is robust enough that we should take it as is. If we do replicate this logic, then YES... job would reference the id of source as a FK. The reason I went into this detail is because your original question asked about data being harvested and I wanted to clarify that a job does not necessarily equate to data harvested.
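Under those assumptions, the source-to-job FK from this discussion might look like the following sketch (the harvest object/data tables are omitted, and all names are provisional):

```python
from sqlalchemy import Column, ForeignKey, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class HarvestSource(Base):
    __tablename__ = "harvest_source"
    id = Column(String, primary_key=True)
    # One source can have multiple jobs associated with it
    jobs = relationship("HarvestJob", back_populates="source")

class HarvestJob(Base):
    __tablename__ = "harvest_job"
    id = Column(String, primary_key=True)
    source_id = Column(String, ForeignKey("harvest_source.id"))
    source = relationship("HarvestSource", back_populates="jobs")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # One source, two jobs, mirroring the re-run scenario above
    session.add(
        HarvestSource(id="src", jobs=[HarvestJob(id="job1"), HarvestJob(id="job2")])
    )
    session.commit()
    job_count = session.query(HarvestJob).filter_by(source_id="src").count()
```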

@nickumia-reisys
Contributor

can a harvest job occur without an associated harvest source?

Oh... but to this question, no. A job cannot occur without a source. Without a source, what will the job do? It doesn't know what to harvest without a source.

@rshewitt
Contributor

rshewitt commented Aug 3, 2023

The circumstance I'm thinking of is if a user can submit a source to be harvested, but for whatever reason that source was never added to a harvest source table (or equivalent). This may not be possible; in that case, it's irrelevant.

@rshewitt
Contributor

rshewitt commented Aug 3, 2023

Miro board of the ERD. Let me know who needs access.

@rshewitt
Contributor

rshewitt commented Aug 9, 2023

I got flagged by Cloudlock for using Miro, so I'm going to delete my Miro account. Here's a picture of the ERD as it stands now.
[Screenshot of the ERD, taken 2023-08-09]

@rshewitt
Contributor

rshewitt commented Aug 11, 2023

ERD as PlantUML, located here. This is a WIP; any comments or suggestions are welcome! This diagram is slightly different from the one mentioned above.

@rshewitt rshewitt linked a pull request Aug 15, 2023 that will close this issue
@hkdctol hkdctol moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Aug 22, 2023
@github-project-automation github-project-automation bot moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Aug 25, 2023
@nickumia-reisys
Contributor

❗ Ooooo mermaid is testing an experimental ERD diagram: https://mermaid.js.org/syntax/entityRelationshipDiagram.html
