Merge pull request #28 from GSA/feature/compare
Feature/compare
rshewitt authored Jan 4, 2024
2 parents a9aede2 + c31a32e commit 9a278fa
Showing 20 changed files with 1,586 additions and 358 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -11,7 +11,7 @@ clean-dist: ## Cleans dist dir
rm -rf dist/*

test: up ## Runs poetry tests, ignores ckan load
poetry run pytest --ignore=./tests/load/ckan
poetry run pytest --ignore=./tests/integration

up: ## Sets up local docker environment
docker compose up -d
73 changes: 62 additions & 11 deletions README.md
@@ -5,35 +5,86 @@ transformation, and loading into the data.gov catalog.

## Features

The datagov-harvesting-logic offers the following features:

- Extract
- general purpose fetching and downloading of web resources.
- catered extraction to the following data formats:
- General purpose fetching and downloading of web resources.
- Catered extraction to the following data formats:
- DCAT-US
- Validation
- DCAT-US
- jsonschema validation using draft 2020-12.
- `jsonschema` validation using draft 2020-12.
- Load
- DCAT-US
- conversion of dcatu-us catalog into ckan dataset schema
- create, delete, update, and patch of ckan package/dataset
- Conversion of dcat-us catalog into ckan dataset schema
- Create, delete, update, and patch of ckan package/dataset

## Requirements

This project is using poetry to manage this project. Install [here](https://python-poetry.org/docs/#installation).
This project uses `poetry` for dependency management. Install it [here](https://python-poetry.org/docs/#installation).

Once installed, `poetry install` installs dependencies into a local virtual environment.

## Testing

### CKAN load testing

- CKAN load testing doesn't require the services provided in the `docker-compose.yml`.
- [catalog-dev](https://catalog-dev.data.gov/) is used for ckan load testing.
- Create an api-key by signing into catalog-dev.
- Create an api-key by signing into catalog-dev.
- Create a `credentials.py` file at the root of the project containing the variable `ckan_catalog_dev_api_key` assigned to the api-key.
- run tests with the command `poetry run pytest ./tests/load/ckan`
- Run tests with the command `poetry run pytest ./tests/load/ckan`

### Harvester testing
- These tests are found in `extract`, and `validate`. Some of them rely on services in the `docker-compose.yml`. run using docker `docker compose up -d` and with the command `poetry run pytest --ignore=./tests/load/ckan`.

- These tests are found in `extract`, and `validate`. Some of them rely on services in the `docker-compose.yml`. Run using docker `docker compose up -d` and with the command `poetry run pytest --ignore=./tests/load/ckan`.

If you followed the instructions for `CKAN load testing` and `Harvester testing` you can simply run `poetry run pytest` to run all tests.

## Comparison

- `./tests/harvest_sources/ckan_datasets_resp.json`
- Represents what ckan would respond with after querying for the harvest source name
- `./tests/harvest_sources/dcatus_compare.json`
- Represents a changed harvest source
- Created:
- datasets[0]

```diff
+ "identifier" = "cftc-dc10"
```

- Deleted:
- datasets[0]

```diff
- "identifier" = "cftc-dc1"
```

- Updated:
- datasets[1]

```diff
- "modified": "R/P1M"
+ "modified": "R/P1M Update"
```

- datasets[2]

```diff
- "keyword": ["cotton on call", "cotton on-call"]
+ "keyword": ["cotton on call", "cotton on-call", "update keyword"]
```

- datasets[3]

```diff
"publisher": {
"name": "U.S. Commodity Futures Trading Commission",
"subOrganizationOf": {
- "name": "U.S. Government"
+ "name": "Changed Value"
}
}
```

- `./tests/harvest_sources/dcatus.json`
- Represents an original harvest source prior to the change occurring.
10 changes: 2 additions & 8 deletions harvester/__init__.py
@@ -22,14 +22,8 @@
# TODO these imports will need to be updated to ensure a consistent api
from .compare import compare
from .extract import download_waf, extract, traverse_waf
from .load import (
create_ckan_package,
dcatus_to_ckan,
load,
patch_ckan_package,
purge_ckan_package,
update_ckan_package,
)
from .load import (create_ckan_package, dcatus_to_ckan, load,
patch_ckan_package, purge_ckan_package, update_ckan_package)
from .transform import transform
from .utils import *
from .validate import *
19 changes: 16 additions & 3 deletions harvester/compare.py
@@ -3,9 +3,22 @@
logger = logging.getLogger("harvester")


# stub, TODO complete
def compare(compare_obj):
def compare(harvest_source, ckan_source):
"""Compares records"""
logger.info("Hello from harvester.compare()")

return compare_obj
output = {
"create": [],
"update": [],
"delete": [],
}

harvest_ids = set(harvest_source.keys())
ckan_ids = set(ckan_source.keys())
same_ids = harvest_ids & ckan_ids

output["create"] += list(harvest_ids - ckan_ids)
output["delete"] += list(ckan_ids - harvest_ids)
output["update"] += [i for i in same_ids if harvest_source[i] != ckan_source[i]]

return output
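
The set arithmetic in the new `compare()` can be tried in isolation. The sketch below restates the function body from the diff and feeds it two small dictionaries. It assumes, which the diff does not state, that each dictionary maps a dataset identifier to a comparable value such as a hash. The hash strings are made up for illustration.

```python
def compare(harvest_source, ckan_source):
    """Classify identifiers as create, update, or delete (logic taken from the diff)."""
    harvest_ids = set(harvest_source.keys())
    ckan_ids = set(ckan_source.keys())
    same_ids = harvest_ids & ckan_ids

    return {
        # Present in the harvest source only: needs to be created in CKAN.
        "create": list(harvest_ids - ckan_ids),
        # Present on both sides but with different values: needs an update.
        "update": [i for i in same_ids if harvest_source[i] != ckan_source[i]],
        # Present in CKAN only: no longer in the harvest source.
        "delete": list(ckan_ids - harvest_ids),
    }


# Hypothetical inputs: identifier -> hash of the dataset (values are placeholders).
harvest_source = {"cftc-dc10": "h10", "cftc-dc2": "h2-changed", "cftc-dc3": "h3"}
ckan_source = {"cftc-dc1": "h1", "cftc-dc2": "h2", "cftc-dc3": "h3"}

result = compare(harvest_source, ckan_source)
print(result)
```

Because each list is built from a set, its order is not guaranteed once more than one identifier lands in the same bucket; sorting the lists is one option if a stable order matters.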
64 changes: 47 additions & 17 deletions harvester/load.py
@@ -3,6 +3,8 @@

import ckanapi

from harvester.utils.util import sort_dataset

logger = logging.getLogger("harvester")


@@ -21,7 +23,7 @@ def create_ckan_extra_base(*args):
return [{"key": d[0], "value": d[1]} for d in data]


def create_ckan_extras_additions(dcatus_catalog, additions):
def create_ckan_extras_additions(dcatus_dataset, additions):
extras = [
"accessLevel",
"bureauCode",
@@ -35,10 +37,13 @@ def create_ckan_extras_additions(dcatus_catalog, additions):

for extra in extras:
data = {"key": extra, "value": None}
val = dcatus_dataset[extra]
if extra == "publisher":
data["value"] = dcatus_catalog[extra]["name"]
data["value"] = val["name"]
else:
data["value"] = dcatus_catalog[extra]
if isinstance(val, list): # TODO: confirm this is what we want.
val = val[0]
data["value"] = val
output.append(data)

return output + additions
@@ -70,21 +75,28 @@ def get_email_from_str(in_str):
return res.group(0)


def create_ckan_resources(dists):
def create_ckan_resources(dcatus_dataset):
output = []

for dist in dists:
if "distribution" not in dcatus_dataset:
return output

for dist in dcatus_dataset["distribution"]:
url_key = "downloadURL" if "downloadURL" in dist else "accessURL"
resource = {"url": dist[url_key], "mimetype": dist["mediaType"]}
resource = {"url": dist[url_key]}
if "mediaType" in dist:  # test for the DCAT-US key that is read on the next line
resource["mimetype"] = dist["mediaType"]

output.append(resource)

return output


def simple_transform(dcatus_catalog):
def simple_transform(dcatus_dataset):
output = {
"name": "-".join(dcatus_catalog["title"].lower().split()),
"owner_org": "test",
"name": "-".join(dcatus_dataset["title"].lower().split()),
"owner_org": "test", # TODO: CHANGE THIS!
"identifier": dcatus_dataset["identifier"],
}

mapping = {
@@ -93,14 +105,17 @@ def simple_transform(dcatus_catalog):
"title": "title",
}

for k, v in dcatus_catalog.items():
for k, v in dcatus_dataset.items():
if k not in mapping:
continue
if isinstance(mapping[k], dict):
temp = {}
to_skip = ["@type"]
for k2, v2 in v.items():
if k2 == "hasEmail":
v2 = get_email_from_str(v2)
if k2 in to_skip:
continue
temp[mapping[k][k2]] = v2
output = {**output, **temp}
else:
@@ -116,7 +131,7 @@ def create_defaults():
}


def dcatus_to_ckan(dcatus_catalog):
def dcatus_to_ckan(dcatus_dataset, harvest_source_name):
"""
example:
- from this:
@@ -126,23 +141,34 @@ def dcatus_to_ckan(dcatus_catalog):
"""

output = simple_transform(dcatus_catalog)
output = simple_transform(dcatus_dataset)

resources = create_ckan_resources(dcatus_catalog["distribution"])
tags = create_ckan_tags(dcatus_catalog["keyword"])
pubisher_hierarchy = create_ckan_publisher_hierarchy(dcatus_catalog["publisher"])
resources = create_ckan_resources(dcatus_dataset)
tags = create_ckan_tags(dcatus_dataset["keyword"])
pubisher_hierarchy = create_ckan_publisher_hierarchy(
dcatus_dataset["publisher"], []
)

extras_base = create_ckan_extra_base(
pubisher_hierarchy, "Dataset", dcatus_catalog["publisher"]["name"]
pubisher_hierarchy, "Dataset", dcatus_dataset["publisher"]["name"]
)
extras = create_ckan_extras_additions(dcatus_catalog, extras_base)
extras = create_ckan_extras_additions(dcatus_dataset, extras_base)

defaults = create_defaults()

output["resources"] = resources
output["tags"] = tags

output["extras"] = extras_base
output["extras"] += extras
output["extras"] += [
{
"key": "dcat_metadata",
"value": str(sort_dataset(dcatus_dataset)),
}
]

output["extras"] += [{"key": "harvest_source_name", "value": harvest_source_name}]

return {**output, **defaults}

@@ -167,3 +193,7 @@ def update_ckan_package(ckan, update_data):

def purge_ckan_package(ckan, package_data):
return ckan.action.dataset_purge(**package_data)


def search_ckan(ckan, query):
return ckan.action.package_search(**query)
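
The reworked `create_ckan_resources` tolerates datasets without a `distribution` and distributions without a media type. The following is a minimal standalone sketch of that behavior, not the project's implementation. `build_resources` is a hypothetical name, the URLs are placeholders, and the sketch assumes the intent is to copy the DCAT-US `mediaType` field into CKAN's `mimetype` whenever a distribution carries one.

```python
def build_resources(dcatus_dataset):
    """Hypothetical helper: map DCAT-US distributions to CKAN resource dicts."""
    resources = []
    # A dataset without "distribution" simply yields no resources.
    for dist in dcatus_dataset.get("distribution", []):
        # Prefer a direct download link and fall back to the access page.
        url_key = "downloadURL" if "downloadURL" in dist else "accessURL"
        resource = {"url": dist[url_key]}
        # Assumption: mediaType is optional and maps to CKAN's mimetype field.
        if "mediaType" in dist:
            resource["mimetype"] = dist["mediaType"]
        resources.append(resource)
    return resources


dataset = {
    "identifier": "example-1",
    "distribution": [
        {"downloadURL": "https://example.com/data.csv", "mediaType": "text/csv"},
        {"accessURL": "https://example.com/landing"},
    ],
}

print(build_resources(dataset))
print(build_resources({"identifier": "example-2"}))
```

Using `dict.get` with an empty-list default folds the early return for a missing `distribution` into the loop itself; a distribution with neither URL key would still raise `KeyError`, as it would in the diff.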
4 changes: 2 additions & 2 deletions harvester/utils/__init__.py
@@ -1,3 +1,3 @@
from . import json
from . import json, util

__all__ = ["json"]
__all__ = ["json", "util"]
12 changes: 12 additions & 0 deletions harvester/utils/util.py
@@ -0,0 +1,12 @@
import hashlib
import json

import sansjson


def sort_dataset(d):
    return sansjson.sort_pyobject(d)


def dataset_to_hash(d):
    return hashlib.sha256(json.dumps(d, sort_keys=True).encode("utf-8")).hexdigest()
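
`dataset_to_hash` serializes with `json.dumps(..., sort_keys=True)`, so the digest does not depend on dictionary key order, while the order of list items still affects it. The sketch below copies the function body from the diff and checks both properties on made-up datasets. That list ordering is the reason the diff also adds `sort_dataset` is an inference, not something the commit states.

```python
import hashlib
import json


def dataset_to_hash(d):
    """Same body as in the diff: SHA-256 over key-sorted JSON."""
    return hashlib.sha256(json.dumps(d, sort_keys=True).encode("utf-8")).hexdigest()


# Illustrative datasets: b reorders the keys of a, c reorders the keyword list.
a = {"identifier": "cftc-dc1", "keyword": ["cotton on call", "cotton on-call"]}
b = {"keyword": ["cotton on call", "cotton on-call"], "identifier": "cftc-dc1"}
c = {"identifier": "cftc-dc1", "keyword": ["cotton on-call", "cotton on call"]}

same_despite_key_order = dataset_to_hash(a) == dataset_to_hash(b)
same_despite_list_order = dataset_to_hash(a) == dataset_to_hash(c)
print(same_despite_key_order, same_despite_list_order)
```

The first comparison holds and the second does not, so a dataset whose keywords are merely reordered would be flagged as changed unless its lists are normalized before hashing.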
15 changes: 13 additions & 2 deletions poetry.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 2 additions & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "datagov-harvesting-logic"
version = "0.0.4"
version = "0.1.0"
description = ""
# authors = [
# {name = "Jin Sun", email = "[email protected]"},
@@ -25,6 +25,7 @@ deepdiff = ">=6"
pytest = ">=7.3.2"
ckanapi = ">=4.7"
beautifulsoup4 = "^4.12.2"
sansjson = "^0.3.0"

[tool.poetry.group.dev.dependencies]
pytest = "^7.3.0"

1 comment on commit 9a278fa

@github-actions

Coverage

Coverage Report
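| File | Stmts | Miss | Cover |
|---|---|---|---|
| **harvester** | | | |
| `__init__.py` | 12 | 0 | 100% |
| `compare.py` | 12 | 0 | 100% |
| `extract.py` | 48 | 7 | 85% |
| `load.py` | 100 | 10 | 90% |
| `transform.py` | 13 | 7 | 46% |
| **harvester/utils** | | | |
| `__init__.py` | 2 | 0 | 100% |
| `json.py` | 4 | 0 | 100% |
| `util.py` | 7 | 0 | 100% |
| **harvester/validate** | | | |
| `__init__.py` | 2 | 0 | 100% |
| `dcat_us.py` | 24 | 3 | 88% |
| **TOTAL** | 224 | 27 | 88% |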

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 28 | 0 💤 | 0 ❌ | 0 🔥 | 17.543s ⏱️ |
