Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert DP+ to a CKAN extension #98

Open
Tracked by #5
jqnatividad opened this issue Jun 18, 2023 · 21 comments
Open
Tracked by #5

Convert DP+ to a CKAN extension #98

jqnatividad opened this issue Jun 18, 2023 · 21 comments
Assignees
Labels
1.x will be done in DP+ 1.x - DP+ running as CKAN extension enhancement New feature or request wip Work in Progress

Comments

@jqnatividad
Copy link
Contributor

jqnatividad commented Jun 18, 2023

DP+ was forked from DP which uses ckanserviceprovider.

ckanserviceprovider was meant to be a general library for CKAN for making web services, and was originally created in 2013.

Right now, DP and DP+ are really the only users of the library, and in the 10 years since, a lot has changed.

DP+ 0.x will continue to use ckanserviceprovider so older CKAN installations using DataPusher can still use it.

However, especially now that CKAN v2.9 uses Python 3.8+ which DP+ requires to call qsv using subprocess options only available in Python 3.7+, v1.x of DP+ will work as a CKAN extension.

This has the following benefits:

  • easier to maintain as it just works as a regular extension
  • no need to maintain a separate web service running in its own virtualenv
  • we can drop the ckanserviceprovider dependency
  • we can use the ckan_default database to store DP+ job history instead of creating our own DP+ database
  • we can add additional UI elements beyond the existing Datastore tab. Here are some examples of what a DP+ extension can do:
    • Have the ability to set resource-specific DP+ job settings on the UI
    • Have the ability to see how the resource is stored in the Datastore (with datastore_info API)
    • Have a more advanced Data Dictionary interface that goes beyond setting label/description and overriding types

This is already WIP in the https://github.com/dathere/datapusher-plus/tree/dev-v1.0 branch and we hope to the make the first release summer 2023.

@jqnatividad jqnatividad added the 1.x will be done in DP+ 1.x - DP+ running as CKAN extension label Jun 18, 2023
@jqnatividad jqnatividad added wip Work in Progress enhancement New feature or request labels Jun 18, 2023
@jqnatividad
Copy link
Contributor Author

jqnatividad commented Jun 26, 2023

DP+ 1.x should also be more interactive - i.e. users have the ability to configure the DP+ job on a per resource basis.

99% of the time, the default configuration should work fine. But when the user wants to tweak DP+ configuration for that particular job (e.g ask for PII screening using a particular PII screening configuration, add summary stats, check for dupes, etc.), there should be some DP+ knobs exposed in the Datastore tab to change DP+ settings just for that resource.

@sagargg
Copy link

sagargg commented Jul 25, 2023

Hi @jqnatividad Any idea, when it will be ready for release ??

@tino097
Copy link
Collaborator

tino097 commented Jul 25, 2023

@sagargg its almost done, any testing on the dev-v1.0 branch would be helpfull

@jhbruhn
Copy link

jhbruhn commented Jan 11, 2024

What is the current state of this migration? I am considering setting up datapusher-plus, but would like to avoid to set up an external service if future updates require it to be an ckanext instead. Could I already use the dev-1.0 branch for this, and if so, how?

@tino097
Copy link
Collaborator

tino097 commented Jan 11, 2024

@jhbruhn dev-1.0 can be used to set the DP+ as extension and you are more than welcome to give it a try

To use it, just checkout to that branch and follow same steps as any other extension for installation

@jhbruhn
Copy link

jhbruhn commented Jan 11, 2024

Thanks. I installed the extension in my ckan 2.10 test instance and included it as the "datapusher" plugin (according to this

datapusher=ckanext.datapusher_plus.plugin:DatapusherPlusPlugin
). I then defined CKAN_DATAPUSHER_URL as ckan would not start if not defined to the value I was using with the original datapusher, but stopped that service.

But when I upload a file and try to push it to the datastore (did happen automatically as well), I get this error in the Web Interface: Could not connect to DataPusher. without a more specific error in the ckan logs.

@tino097
Copy link
Collaborator

tino097 commented Jan 11, 2024

You should start a job worker
ckan jobs worker

@jhbruhn
Copy link

jhbruhn commented Jan 11, 2024

I did (now), but the result is still the same. Could the original, ckan internal datapusher be interfering with datapusher-plus here? Should I open another issue?

@jhbruhn
Copy link

jhbruhn commented Jan 11, 2024

If I provide my original datapusher configuration and service next to the activated dataplusher configuration, it uses the classic datapusher to push data to my resources. While a task in the workerqueue for datapusher-plus seems to be created, it is not used, and nothing is happening on my ckan jobs worker worker.

@tino097
Copy link
Collaborator

tino097 commented Jan 12, 2024

@jhbruhn in the plugins it should be added as ckan.plugins = stats ... datapusher_plus ...

@jhbruhn
Copy link

jhbruhn commented Jan 12, 2024

That leads to this exception at startup:

ckan   | Traceback (most recent call last):
ckan   |   File "/srv/app/wsgi.py", line 20, in <module>
ckan   |     application = make_app(config)
ckan   |   File "/srv/app/src/ckan/ckan/config/middleware/__init__.py", line 27, in make_app
ckan   |     load_environment(conf)
ckan   |   File "/srv/app/src/ckan/ckan/config/environment.py", line 69, in load_environment
ckan   |     p.load_all()
ckan   |   File "/srv/app/src/ckan/ckan/plugins/core.py", line 222, in load_all
ckan   |     load(*plugins)
ckan   |   File "/srv/app/src/ckan/ckan/plugins/core.py", line 238, in load
ckan   |     service = _get_service(plugin)
ckan   |   File "/srv/app/src/ckan/ckan/plugins/core.py", line 346, in _get_service
ckan   |     raise PluginNotFoundException(plugin_name)
ckan   | ckan.plugins.core.PluginNotFoundException: datapusher_plus

Perhaps due to this commit? 1cfba1c

The plugin is definitely installed, as some of it is run when including the datapusher plugin.

@tino097
Copy link
Collaborator

tino097 commented Jan 12, 2024

@jhbruhn my bad, forgot that i had that changed

Are you using dockerized environment for your setup ?

@jhbruhn
Copy link

jhbruhn commented Jan 12, 2024

Yes, this is the ckan Dockerfile I am using, based off of the official image:

FROM ckan/ckan-base:2.10

### DCAT ###
ARG CKANEXT_DCAT_VERSION="1.5.1"
RUN pip3 install -e git+https://github.com/ckan/ckanext-dcat.git@v${CKANEXT_DCAT_VERSION}#egg=ckanext-dcat && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-dcat/v${CKANEXT_DCAT_VERSION}/requirements.txt

### LDAP with ignore referrals patch ###
ARG CKANEXT_LDAP_VERSION="add_ignore_referrals"
RUN apk add --no-cache \
        openldap-dev
RUN pip3 install -e git+https://github.com/jhbruhn/ckanext-ldap.git@${CKANEXT_LDAP_VERSION}#egg=ckanext-ldap

### PDFView
ARG CKANEXT_PDFVIEW_VERSION="0.0.8"
RUN pip3 install -e git+https://github.com/ckan/ckanext-pdfview.git@${CKANEXT_PDFVIEW_VERSION}#egg=ckanext-pdfview

### Geoview
ARG CKANEXT_GEOVIEW_VERSION="0.1.0"
RUN pip3 install ckanext-geoview==${CKANEXT_GEOVIEW_VERSION}

### Spatial
ARG CKANEXT_SPATIAL_VERSION="2.1.1"
RUN apk add --no-cache \
        proj-util \
        proj-dev \
        geos-dev
RUN pip3 install -e git+https://github.com/ckan/ckanext-spatial.git@v${CKANEXT_SPATIAL_VERSION}#egg=ckanext-spatial && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-spatial/v${CKANEXT_SPATIAL_VERSION}/requirements.txt

### Hierarchy
ARG CKANEXT_HIERARCHY_VERSION="1.2.1"
RUN pip3 install -e git+https://github.com/ckan/ckanext-hierarchy.git@v${CKANEXT_HIERARCHY_VERSION}#egg=ckanext-hierarchy && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-hierarchy/v${CKANEXT_HIERARCHY_VERSION}/requirements.txt

### Pages
ARG CKANEXT_PAGES_VERSION="0.5.2"
RUN pip3 install -e git+https://github.com/ckan/ckanext-pages.git@v${CKANEXT_PAGES_VERSION}#egg=ckanext-pages && \
    pip3 install -r https://raw.githubusercontent.com/ckan/ckanext-pages/v${CKANEXT_PAGES_VERSION}/requirements.txt

### Datapusher Plus
ARG CKANEXT_DATAPUSHER_PLUS_VERSION="1cfba1c1373b6890f7338df1fa0f46a28297f0ef"
RUN pip3 install -e git+https://github.com/dathere/datapusher-plus.git@${CKANEXT_DATAPUSHER_PLUS_VERSION}#egg=datapusher-plus && \
    pip3 install -r https://raw.githubusercontent.com/dathere/datapusher-plus/${CKANEXT_DATAPUSHER_PLUS_VERSION}/requirements.txt

# Activate plugins
ENV CKAN__PLUGINS "stats text_view image_view datatables_view video_view audio_view ldap pdf_view dcat dcat_json_interface structured_data envvars datapusher_plus datastore resource_proxy geo_view geojson_view shp_view wmts_view spatial_metadata spatial_query hierarchy_display hierarchy_form hierarchy_group_form pages"

qsv is obviously still missing from that Dockerfile, but it doesn't event get that far to call datapusher_plus, it is only calling the classic datapusher functions as far as I can see.

@tino097
Copy link
Collaborator

tino097 commented Jan 12, 2024

Thanks for the input, i will try to locate whats the issue with dockerized setup. Starting DP+ from source setup was working and it is testing on one of our development environments. I hope i could fix this soon and start using the DP+ as extension more often

@jhbruhn
Copy link

jhbruhn commented Mar 28, 2024

I've just looked into this again, and it seems that I am getting further than before, yay!

I now get this error while executing a datapusher-plus job:

[SQL: INSERT INTO logs (job_id, timestamp, message, level, module, "funcName", lineno) VALUES (%(job_id)s, %(timestamp)s, %(message)s, %(level)s, %(module)s, %(funcName)s, %(lineno)s) RETURNING logs.id]
[parameters: {'job_id': '9de313e1-1969-45af-a07d-67031ff65fd0', 'timestamp': datetime.datetime(2024, 3, 28, 10, 51, 29, 238720), 'message': 'Fetching from: http://localhost:5000/dataset/a68c293b-d3ea-4318-9bac-c122ff4552b9/resource/1900fa9d-baf6-442a-9e58-d86e4e7d873c/download/kanton_zuerich_266.csv...', 'level': 'INFO', 'module': 'jobs', 'funcName': '_push_to_datastore', 'lineno': 441}]
(Background on this error at: https://sqlalche.me/e/14/f405) (Background on this error at: https://sqlalche.me/e/14/7s2a)
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
    self.dialect.do_execute(
  File "/usr/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.UndefinedColumn: column logs.id does not exist
LINE 1: ...'INFO', 'jobs', '_push_to_datastore', 441) RETURNING logs.id
                                                                ^

I have run the ckan datapusher init-db command, so the migrations should've been run. This is running on a (testing) original datapusher/datastore database. Should I move this to a new issue, or is it expected behaviour for DP+ dev-1.0 right now?

@tino097
Copy link
Collaborator

tino097 commented Mar 28, 2024

@jhbruhn thanks for testing the DP+

Yes, there is a issue with the migration script, i think i have a fix for it. Currently is on fixing-issues branch but not ready for merging into dev-v1.0 yet

@jhbruhn
Copy link

jhbruhn commented Mar 28, 2024

Thanks for your reply, that branch is indeed working quite well with my limited tests.

@pdelboca
Copy link

pdelboca commented Apr 26, 2024

@jqnatividad is there any roadmap for the v1.0 of datapusher-plus? I'm looking forward to start using it, but currently the fact that it is based in ckan-serviceprovider is a limitation. Even more knowing that soon it will be a CKAN extension.

I have been testing the dev-v1.0 (and fixing-issues) branch and for normal workflows I didn't find any major issue.

Please let me know if there is anything else we can help with!

@jqnatividad
Copy link
Contributor Author

jqnatividad commented Apr 26, 2024

@pdelboca , it's really on me. DP+ 0.x stalled a bit because the main implementation that was informing its build was paused for a few months.

We started up again a month ago and we're currently writing specs for that implementation (we just actually submitted the "DP+ - Data Resource Upload First (DRUF) edition" specs for review by the customer today)

Good thing is @tino097 has been working diligently on DP+ 1.0 regardless given that ckanext-serviceprovider has been deprecated, and I don't intend to keep working on 0.x.

I'll generalize the spec and publish it here as it will functionally serve as the DP+ 1.0 roadmap.

We are breaking down DP+ DRUF development into three phases.


Phase 1 will be primarily adapted to the customer's requirements and is targeted for a 2024 summer release.

Phase 2 is still being scoped out but its focus will be being able to apply DRUF metadata inferencing to spatial data formats; being able to specify formulas for computed metadata fields in the scheming yaml file, and using the iDataDictionaryForm to implement an expanded data dictionary with to store summary statistics and frequency data for each field.

Phase 3 is even more aspirational, but its looking to do "smarter analysis" using machine learning techniques (perhaps, leveraging @amercader's work on related datasets; and maybe using qsv describegpt to do tag suggestions.

I'm also thinking of using qsv's schema and validate commands.

schema to automatically create a JSONSchema validation files that can be used to enforce field data types and domain/range values when updating a resource. For each column in a CSV, it will enforce the following:

  • data type
  • enum - range of valid literal values that can be accepted for the field (if the cardinality of the column is less than the default threshold of 50 unique values, it will create an enumeration for that field)
  • for selected fields using the qsv schema --pattern-columns option, it can infer a regex pattern constraint based on the field's frequency table
  • minLength
  • maxLength
  • min
  • max

The generated JSONschema file can be further fine-tuned by the Data Steward if required to apply some fairly complex validation rules as the crate I'm using fully implements JSONSchema validation and is widely deployed and used in production systems in the Rust ecosystem.

Currently, qsv validate command is now being used by DP+ to ensure that CSVs are well-formed and conform to RFC4180 standard and is UTF8 encoded. In Phase 3, I want to use validate's ability to use a JSONSchema generated by qsv schema to validate the CONTENTS of the CSV conform to the JSONschema definition.

And it's quite performant - on benchmarks using a 1 million row sample of NYC's 311 data with a non-trivial JSONschema, it validates the CSV in 1.3 seconds!

And yeah, we would love to take your help! Perhaps, we can carve out some time at CSVConf next month?

cc @twdbben @wardi

@pdelboca
Copy link

That's an exciting roadmap @jqnatividad ! Thanks for sharing and let's definitely catchup at the CSVConf :)

As a suggestion/thought: In terms of releases, it would be nice to split architectural change and features into separated releases. It will be easier to develop, easier to deploy, easier to maintain and will give the community time to change. Mixing behavioral changes with architectural can be messy because new errors can come from two fronts :). In that sense, an initial release of DP+ v1.0 as a CKAN extension would be great.

@u10313335
Copy link

Since the DP+ has been a CKAN extension and we no longer need the built-in datapusher plugin, I think we should include the assets from the datapusher plugin to prevent from the following import errors:

2024-07-02 08:49:09,763 ERROR [ckan.lib.webassets_tools] Trying to include unknown asset:
<ckanext-datapusher/datapusher-css>
2024-07-02 08:49:09,823 ERROR [ckan.lib.webassets_tools] Trying to include unknown asset:
<ckanext-datapusher/datapusher>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1.x will be done in DP+ 1.x - DP+ running as CKAN extension enhancement New feature or request wip Work in Progress
Projects
None yet
Development

No branches or pull requests

7 participants