Harvesting pipeline documentation #17

Merged
merged 23 commits into from
Jan 4, 2024

Conversation

nickumia-reisys
Contributor

@nickumia-reisys nickumia-reisys commented Sep 11, 2023

Notes:

  • Adds mermaid flowchart documentation of our current harvesting pipelines, plus some new system diagrams too
  • Also adds an npm script, npm run makeDoc, to generate new SVGs.
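
The body of the makeDoc script is not shown in this conversation. As a rough sketch only, assuming it wraps the mermaid-cli Docker image that the commits below link to, the render step for one diagram could look like the following. The image name `minlag/mermaid-cli`, the `/data` mount point, and the `-i`/`-o` flags come from the mermaid-cli README, not from this PR; `build_render_command`, the directory, and the file names are hypothetical.

```python
from pathlib import Path

# Assumption: image name as published in the mermaid-cli README, not confirmed by this PR.
MERMAID_IMAGE = "minlag/mermaid-cli"


def build_render_command(diagram_dir, mmd_name, uid=1000, gid=1000):
    """Build the docker argv that renders one .mmd file to an SVG beside it.

    diagram_dir is mounted at /data, which the README example uses as the
    container's working directory (assumption), so file names are relative.
    """
    svg_name = Path(mmd_name).with_suffix(".svg").name
    return [
        "docker", "run", "--rm",
        "-u", f"{uid}:{gid}",            # run as the host user so output files are not root-owned
        "-v", f"{diagram_dir}:/data",    # expose the diagram sources to the container
        MERMAID_IMAGE,
        "-i", mmd_name,                  # input Mermaid source
        "-o", svg_name,                  # output SVG
    ]
```

To actually render, the returned list could be passed to `subprocess.run(cmd, check=True)` once per `*.mmd` file; that call is left out of the sketch so it stays free of side effects.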

This is not an exhaustive set of the new docs for Harvester 2.0, but it is good to merge as is and can be extended in the future.

References:

@github-actions

github-actions bot commented Sep 11, 2023

Coverage

Coverage Report
File                  Stmts   Miss   Cover   Missing
harvester
   __init__.py           12      0    100%
   compare.py            12      0    100%
   extract.py            48      7     85%
   load.py              100     10     90%
   transform.py          13      7     46%
harvester/utils
   __init__.py            2      0    100%
   json.py                4      0    100%
   util.py                7      0    100%
harvester/validate
   __init__.py            2      0    100%
   dcat_us.py            24      3     88%
TOTAL                   224     27     88%

Tests Skipped Failures Errors Time
28 0 💤 0 ❌ 0 🔥 17.428s ⏱️

nickumia-reisys and others added 20 commits September 12, 2023 11:04
- Track mermaid code in separate files (*.mmd)
- Generate diagrams using mermaid cli courtesy of docker image: https://github.com/mermaid-js/mermaid-cli#use-dockerpodman
- Bold arrows ==> logic control
- Normal arrows --> data access
- Dashed arrows -.> error/skip conditions
- Added a few connections that were previously missing (minor update)
There are a lot of moving pieces, so this diagram does not capture the finest level of detail.
- I am confident that more than 90% of the important steps are accounted for. There should be two checks on these diagrams: (1) a review from a fresh perspective by various team members to catch any glaring high-level issues, and (2) an in-depth review during pipeline consolidation, cross-checking against the code when the time comes.
- All xml files are geospatial
- fetch is empty, same as datajson
- lots of repeated code...
- seems pretty straightforward
The code is a mess, so the diagram shows two instances of a step wherever there are two instances in the code. This makes the diagram easier to read, but it takes more time to notice that things are duplicated. The code should deduplicate well though.

In hindsight, the dcat structure is messy because the code works in set operations rather than explicit iterations, so the logic is harder to ascertain. This should probably be refactored to follow the xml diagram, but the current design stays truer to the code, so I'm torn **shrug**
This abstracts the different sections better
To highlight connection to import
unify import stage for single xml and waf
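
The arrow legend from the commits above (bold for logic control, normal for data access, dashed for error/skip conditions) can be illustrated with a minimal sketch that emits Mermaid flowchart text. This is not the PR's tooling: `Edge`, `to_mermaid`, and the node names are hypothetical. Note that Mermaid's flowchart syntax writes the dashed link as `-.->`.

```python
from enum import Enum


class Edge(Enum):
    CONTROL = "==>"   # bold arrow: logic control
    DATA = "-->"      # normal arrow: data access
    SKIP = "-.->"     # dashed arrow: error/skip conditions


def to_mermaid(edges, direction="TD"):
    """Render (source, Edge, target) triples as Mermaid flowchart source text."""
    lines = [f"flowchart {direction}"]
    for src, kind, dst in edges:
        lines.append(f"    {src} {kind.value} {dst}")
    return "\n".join(lines)


# Hypothetical node names, chosen only to show one edge of each kind.
example = to_mermaid([
    ("extract", Edge.CONTROL, "validate"),
    ("validate", Edge.DATA, "harvest_source"),
    ("validate", Edge.SKIP, "skip_record"),
])
print(example)
```

Saving the printed text to a `*.mmd` file gives a diagram source in the same style as the ones this PR tracks, with one link per legend entry.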
@btylerburton btylerburton marked this pull request as ready for review December 19, 2023 21:32
@btylerburton btylerburton requested a review from a team December 19, 2023 21:33
Contributor

@btylerburton btylerburton left a comment


:shipit:

@btylerburton btylerburton merged commit f64d6cb into main Jan 4, 2024
5 checks passed
@btylerburton btylerburton deleted the data-pipeline-docs branch January 4, 2024 22:19
2 participants