new: start harvesting pipeline documentation
- Old vs new general pipeline structure
nickumia-reisys authored Sep 11, 2023
1 parent 9aab59f commit 5de600c
Showing 1 changed file with 43 additions and 0 deletions.
43 changes: 43 additions & 0 deletions harvesting.md
@@ -0,0 +1,43 @@
# Harvesting Pipeline Structure

## Old Harvesting Logic
The logic is unique to each file and schema format.
```mermaid
flowchart LR
sc([SOURCE CREATION])
gs([GATHER STAGE])
fs([FETCH STAGE])
is([IMPORT STAGE])
sc --> gs
gs --> fs
fs --> is
```
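
A minimal Python sketch of the three-stage flow above may help make the ordering concrete. The function names (`run_old_pipeline`, `gather_json`, `fetch_json`, `import_json`) and the example URL are hypothetical and are not taken from the harvester code. They only illustrate that each format supplies its own gather, fetch, and import stages.

```python
# Hypothetical stage signatures; the real harvester's interfaces may differ.
def run_old_pipeline(source_url, gather, fetch, import_):
    """Run gather -> fetch -> import, in order, for one harvest source."""
    imported = []
    for object_id in gather(source_url):   # GATHER: list record identifiers
        record = fetch(object_id)          # FETCH: retrieve one record
        imported.append(import_(record))   # IMPORT: write it to the catalog
    return imported


# Toy stages for one made-up format. Under the old logic, every
# file + schema format would need its own trio of functions like these.
def gather_json(source_url):
    return [f"{source_url}#a", f"{source_url}#b"]


def fetch_json(object_id):
    return {"id": object_id}


def import_json(record):
    return record["id"]


result = run_old_pipeline(
    "https://example.gov/data.json", gather_json, fetch_json, import_json
)
```

Supporting a second format in this sketch means writing three more functions, which is the duplication that a format-independent design would avoid.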

## New Harvesting Logic
The logic is universal to all file and schema formats.
```mermaid
flowchart TD
sc([SOURCE CREATION])
extract([Extract Catalog Source])
compare([Compare Source Catalog to Data.gov Catalog])
nochanges{No Changes?}
deletions{Datasets to Delete?}
updates{Datasets to Add or Update?}
load([Load into Data.gov Catalog])
validate([Validate Dataset])
transform([Transform Schema of Dataset])
completed([End])
sc --> extract
extract --> compare
compare --> deletions
compare --> updates
deletions --> load
updates --> validate
validate --> transform
transform --> validate
validate --> load
load --> completed
compare --> nochanges
nochanges --> completed
```
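
The flow above can be sketched in Python under a few stated assumptions. Both catalogs are modeled as plain dictionaries keyed by dataset id, and the `validate`/`transform` cycle is read as "validate, and if that fails, transform and validate again". The names `compare`, `harvest`, and `max_transforms` are hypothetical, and the diagram does not say what happens to a record that never validates. This sketch simply skips it.

```python
def compare(source, catalog):
    """Return (ids to delete, ids to add or update).

    Both arguments map dataset id -> record (a dict).
    """
    to_delete = sorted(set(catalog) - set(source))
    to_upsert = sorted(
        i for i, rec in source.items() if i not in catalog or catalog[i] != rec
    )
    return to_delete, to_upsert


def harvest(source, catalog, validate, transform, max_transforms=1):
    """Return the catalog after one harvest of `source`."""
    to_delete, to_upsert = compare(source, catalog)
    if not to_delete and not to_upsert:
        return catalog                      # "No Changes?" -> End
    updated = dict(catalog)
    for i in to_delete:                     # "Datasets to Delete?" -> Load
        del updated[i]
    for i in to_upsert:                     # "Datasets to Add or Update?"
        record = source[i]
        attempts = 0
        # Validate -> Transform -> Validate, bounded so it always terminates.
        while not validate(record) and attempts < max_transforms:
            record = transform(record)
            attempts += 1
        if validate(record):
            updated[i] = record             # Load into the catalog
    return updated


# Toy data: "b" disappeared from the source, "c" is new and needs a transform.
def has_title(record):
    return "title" in record


def name_to_title(record):
    return {"title": record["name"]} if "name" in record else record


catalog = {"a": {"title": "A"}, "b": {"title": "B"}}
source = {"a": {"title": "A"}, "c": {"name": "C"}}
result = harvest(source, catalog, has_title, name_to_title)
```

Because the comparison operates on ids and records only, the format-specific work is confined to the `validate` and `transform` callables.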

1 comment on commit 5de600c

@github-actions

Coverage

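Coverage Report

| File | Stmts | Miss | Cover |
| --- | --- | --- | --- |
| `harvester/__init__.py` | 3 | 0 | 100% |
| `harvester/db/models/__init__.py` | 5 | 0 | 100% |
| `harvester/db/models/models.py` | 53 | 0 | 100% |
| `harvester/extract/__init__.py` | 19 | 2 | 89% |
| `harvester/extract/dcatus.py` | 11 | 2 | 82% |
| `harvester/utils/__init__.py` | 0 | 0 | 100% |
| `harvester/utils/json.py` | 22 | 6 | 73% |
| `harvester/utils/pg.py` | 35 | 4 | 89% |
| `harvester/utils/s3.py` | 24 | 6 | 75% |
| `harvester/validate/__init__.py` | 0 | 0 | 100% |
| `harvester/validate/dcat_us.py` | 24 | 0 | 100% |
| TOTAL | 196 | 20 | 90% |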

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 29 | 0 💤 | 0 ❌ | 0 🔥 | 13.625s ⏱️ |
