
[MVP] Define DB using SQLAlchemy #4400

Closed
btylerburton opened this issue Jul 21, 2023 · 12 comments · Fixed by GSA/datagov-harvester#15

@btylerburton
Contributor

User Story

In order to maintain a list of harvest sources and their states, the data.gov team wants to define the harvest DB using SQLAlchemy.

Related to:

Acceptance Criteria

Provided below are the DB tables and their fields:

harvest_source — contains the config for each harvest source
  • id (str)
  • name (str)
  • notification_emails (array)
  • organization_name (str) (*self-reported)
  • frequency (str) (cron)
  • config (obj)
  • url (arr)
  • schema_validation_type (str)
harvest_job — contains job state information for each job that's run through the pipeline
  • id (str)
  • status (str)
  • date_created (str)
  • date_finished (str)
  • extract_started (str)
  • extract_finished (str)
  • compare_started (str)
  • compare_finished (str)
  • records_added (int)
  • records_updated (int)
  • records_deleted (int)
  • records_errored (int)
  • records_ignored (int)
harvest_error — contains all errors in the pipeline
  • id (str)
  • job_id (str)
  • record_id (str) (*blank for extract errors)
  • record_reported_id (str) (metadata identifier present in the record or catalog)
  • date_created (str)
  • type (str) (where the error occurred)
  • severity (str)
    • CRITICAL: pipeline & system error
    • ERROR: record-level processing error
    • WARN: we don't log these
  • message (str)

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

  • Use SQLAlchemy to create the tables as defined above
  • Record any alterations to the schema that were necessary
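The tables above could be sketched with SQLAlchemy's declarative mapping roughly as follows. This is a sketch, not the final schema: the column types, the JSON stand-ins for the array/object fields (`ARRAY` would be the Postgres-native choice), and the SQLite sanity check at the end are all assumptions.

```python
from sqlalchemy import JSON, Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class HarvestSource(Base):
    __tablename__ = "harvest_source"
    id = Column(String, primary_key=True)
    name = Column(String)
    notification_emails = Column(JSON)  # ARRAY(String) on Postgres
    organization_name = Column(String)
    frequency = Column(String)  # cron expression
    config = Column(JSON)
    url = Column(String)
    schema_validation_type = Column(String)

class HarvestJob(Base):
    __tablename__ = "harvest_job"
    id = Column(String, primary_key=True)
    status = Column(String)
    date_created = Column(String)
    date_finished = Column(String)
    extract_started = Column(String)
    extract_finished = Column(String)
    compare_started = Column(String)
    compare_finished = Column(String)
    records_added = Column(Integer)
    records_updated = Column(Integer)
    records_deleted = Column(Integer)
    records_errored = Column(Integer)
    records_ignored = Column(Integer)

class HarvestError(Base):
    __tablename__ = "harvest_error"
    id = Column(String, primary_key=True)
    job_id = Column(String, ForeignKey("harvest_job.id"))
    record_id = Column(String, nullable=True)  # blank for extract errors
    record_reported_id = Column(String)
    date_created = Column(String)
    type = Column(String)  # where the error occurred
    severity = Column(String)  # CRITICAL | ERROR | WARN
    message = Column(String)

# Sanity-check that the schema compiles by creating it in an in-memory DB
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
```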
@rshewitt
Contributor

rshewitt commented Aug 1, 2023

we could use an ENUM for the Frequency field in harvest_source (e.g. "daily" = "0 1 * * *", "weekly" = "0 1 * * 1", "monthly" = "0 1 1 * *"). I suppose these values could be custom, though, which would nullify the use of an ENUM.
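If the ENUM route is taken, one hedged sketch is a Python enum backed by SQLAlchemy's `Enum` type; the cron values below are just the illustrative examples from this comment:

```python
import enum

from sqlalchemy import Column, Enum, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Frequency(enum.Enum):
    # Illustrative values from the comment above; a custom cron string
    # would not fit this type without a DB migration.
    daily = "0 1 * * *"
    weekly = "0 1 * * 1"
    monthly = "0 1 1 * *"

class HarvestSource(Base):
    __tablename__ = "harvest_source"
    id = Column(String, primary_key=True)
    frequency = Column(Enum(Frequency))  # stored as the enum's name, e.g. "daily"
```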

@rshewitt
Contributor

rshewitt commented Aug 1, 2023

if we use an array dtype, querying for a specific email or URL means we would have to grab all the values to find whatever it is we're looking for, right? Are we going to query for these kinds of things? We could create a contacts table to store this information instead.
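A rough sketch of that normalized alternative (the `contact` table and its column names are hypothetical), where each email lives in its own row and can be queried directly:

```python
from sqlalchemy import Column, ForeignKey, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class HarvestSource(Base):
    __tablename__ = "harvest_source"
    id = Column(String, primary_key=True)
    contacts = relationship("Contact", back_populates="source")

class Contact(Base):
    __tablename__ = "contact"
    id = Column(String, primary_key=True)
    source_id = Column(String, ForeignKey("harvest_source.id"))
    email = Column(String)
    source = relationship("HarvestSource", back_populates="contacts")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    src = HarvestSource(id="s1")
    src.contacts.append(Contact(id="c1", email="admin@example.gov"))
    session.add(src)
    session.commit()
    # Look up one email directly instead of unpacking an array column
    found = session.query(Contact).filter_by(email="admin@example.gov").one()
    found_source_id = found.source_id
```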

@rshewitt
Contributor

rshewitt commented Aug 2, 2023

can a harvest job occur without an associated harvest source?

@rshewitt rshewitt self-assigned this Aug 2, 2023
@rshewitt rshewitt moved this from New Dev to 🏗 In Progress [8] in data.gov team board Aug 2, 2023
@nickumia-reisys
Contributor

To answer some of the questions above:

we could use an ENUM for the Frequency field in harvest_source (e.g. "daily" = "0 1 * * *", "weekly" = "0 1 * * 1", "monthly" = "0 1 1 * *"). I suppose these values could be custom, though, which would nullify the use of an ENUM.

It depends on the customization we want to allow. Currently, we only support a generic subset of time periods. If we choose an ENUM type, that set would be hard to extend without a DB migration. In either case, though, if we don't need granular specificity, then ENUM is fine.

if we use an array dtype, querying for a specific email or URL means we would have to grab all the values to find whatever it is we're looking for, right? Are we going to query for these kinds of things? We could create a contacts table to store this information instead.

I don't think we'll ever have to search for a specific email. We don't support that currently. It's mostly a one-way operation: just force-update the notification emails. However, the contacts table would be more flexible. With that flexibility comes more complication. If the following questions are not important, we can leave it as is... (If they are important, we'll have to re-evaluate):

can data be harvested without an associated harvest source record?

No. But there's also something wrong about this question. Harvest sources define groups of datasets. I guess my point is, what is a "harvest source record"? Just a harvest source? Or a record within the catalog that exists at the URL in the harvest source config?

@rshewitt
Contributor

rshewitt commented Aug 2, 2023

I've updated my question. I'm asking to determine the relationship between tables. So we'll want to add another field to harvest_job, like source_id, and relate them via a FK?

@nickumia-reisys
Contributor

I've updated my question. I'm asking to determine the relationship between tables. So we'll want to add another field to harvest_job, like source_id, and relate them via a FK?

Yes, that is a very interesting consideration. I was discussing this with @FuhuXia and there's not a simple answer to this. If we want to replicate CKAN completely, then there are FOUR components to this:

  1. Harvest Source,
  2. Harvest Job,
  3. Harvest "object" (complete + incomplete harvested dataset)
  4. Harvest data (complete harvested dataset; once it has been completed, harvest "object" is no longer necessary)

The relationship between these is as follows:

  • One source can have multiple jobs associated with it.
  • One job can have multiple objects associated with it.
  • One job can have multiple data associated with it.
  • For an initial harvest,
    • Let total number of datasets = 100
    • Let successfully harvested datasets = 90
    • Let datasets in error = 10
    • In this scenario, in the end, there is one source with one job (10 objects and 90 data).
  • If we run this harvest again,
    • Let total number of datasets = 110
    • Let successfully harvested datasets = 95
    • Let datasets in error = 15
    • In this scenario, in the end, there is one source with two jobs:
      • job1 has the old data (10 objects and 90 data).
      • job2 has the new data (15 objects and 95 data).

The "motivation" for separating harvest object from harvest job is to decrease the size of the job table. If we store data in the job table, lookup might slow down because of how large each record is. The harvest object table would have the data with a FK referenced in the job table.

All of that is historically how it has worked in CKAN. I don't believe it is most efficient; however, Fuhu thinks it is robust enough that we should take it as is. If we do replicate this logic, then YES... job would reference the id of source as a FK. The reason I went into this detail is because your original question asked about data being harvested and I wanted to clarify that a job does not necessarily equate to data harvested.
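Under those assumptions, the source-to-job FK from this discussion might look like the following sketch (the harvest object/data tables are omitted, and all names are provisional):

```python
from sqlalchemy import Column, ForeignKey, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class HarvestSource(Base):
    __tablename__ = "harvest_source"
    id = Column(String, primary_key=True)
    # One source can have multiple jobs associated with it
    jobs = relationship("HarvestJob", back_populates="source")

class HarvestJob(Base):
    __tablename__ = "harvest_job"
    id = Column(String, primary_key=True)
    source_id = Column(String, ForeignKey("harvest_source.id"))
    source = relationship("HarvestSource", back_populates="jobs")

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # One source, two jobs, mirroring the re-run scenario above
    session.add(
        HarvestSource(id="src", jobs=[HarvestJob(id="job1"), HarvestJob(id="job2")])
    )
    session.commit()
    job_count = session.query(HarvestJob).filter_by(source_id="src").count()
```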

@nickumia-reisys
Contributor

can a harvest job occur without an associated harvest source?

Oh... but to this question, no. A job cannot occur without a source. Without a source, what will the job do? It doesn't know what to harvest without a source.

@rshewitt
Contributor

rshewitt commented Aug 3, 2023

The circumstance I'm thinking of is if a user can submit a source to be harvested, but for whatever reason that source was never added to a harvest source table (or equivalent). This may not be possible; in that case, it's irrelevant.

@rshewitt
Contributor

rshewitt commented Aug 3, 2023

Miro board of the ERD. Let me know who needs access.

@rshewitt
Contributor

rshewitt commented Aug 9, 2023

I got flagged by Cloudlock for using Miro, so I'm going to delete my Miro account. Here's a picture of the ERD as it stands now.
[Screenshot of the ERD, taken 2023-08-09]

@rshewitt
Contributor

rshewitt commented Aug 11, 2023

ERD as PlantUML, located here. This is a WIP; any comments or suggestions are welcome! This diagram is slightly different from the one mentioned above.

@rshewitt rshewitt linked a pull request Aug 15, 2023 that will close this issue
@hkdctol hkdctol moved this from 🏗 In Progress [8] to 👀 Needs Review [2] in data.gov team board Aug 22, 2023
@github-project-automation github-project-automation bot moved this from 👀 Needs Review [2] to ✔ Done in data.gov team board Aug 25, 2023
@nickumia-reisys
Contributor

❗ Ooooo mermaid is testing an experimental ERD diagram: https://mermaid.js.org/syntax/entityRelationshipDiagram.html
