Add Europeana integration #200

Joyakis · 2025-10-15T19:48:07Z

Fixes

Fixes Add Europeana as a data source #186 by @Joyakis

Description

Add Europeana API integration for metrics collection

This PR adds a new script europeana_fetch.py that fetches and aggregates data from the Europeana Search API.
The script collects high-level statistics about cultural heritage content available through Europeana, focusing on data provider distribution and content types rather than fragile license parsing.

Technical details

Script Location: scripts/1-fetch/europeana_fetch.py

Data Output: data/2025Q4/1-fetch/europeana_1_count.csv
Key Features:
- Fetches data from Europeana Search API using multiple content queries
- Aggregates by DATA_PROVIDER and content metadata
- Includes proper error handling and API rate limiting
- Integrates with existing project structure and git workflows
Environment: Updated env.example with EUROPEANA_API_KEY placeholder

Tests

Set up Europeana API key in environment variables
Run the script: python scripts/1-fetch/europeana_fetch.py --enable-save
Verify CSV output is generated in data/2025Q4/1-fetch/europeana_1_count.csv
Check that data contains aggregated counts by DATA_PROVIDER

Checklist

I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
My pull request doesn't include code or content generated with AI.
My pull request has a descriptive title (not a vague title like Update index.md).
My pull request targets the default branch of the repository (main or master).
My commit messages follow best practices.
My code follows the established code style of the repository.
I added or updated tests for the changes I made (if applicable).
I added or updated documentation (if applicable).
I tried running the project locally and verified that there are no
visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

TimidRobot · 2025-10-17T19:07:05Z

scripts/1-fetch/europeana_fetch.py

+        return []
+
+    # Try different queries to get diverse content
+    queries = ["art", "history", "science", "music", "photography"]


What is your source for these categories?

Hello @TimidRobot. I chose the query categories based on likely searches on Europeana. They are similar to the theme parameter, which already has predefined options. The query can be customised depending on the data to be retrieved.

Are these categories documented anywhere? Are there categories not represented here?

Hello @TimidRobot
I updated the script so that, instead of a generic list of possible searches, it now uses a curated list of themes listed directly on the Europeana website (e.g., Art, Fashion, Music, Sport, Photography, Archaeology).
The script now first searches for everything, then filters by theme.

You can see the full list of themes here: themes

Thanks for your guidance and I'd be happy to adjust if you prefer a different approach.

The more authoritative source of themes is https://europeana.atlassian.net/wiki/spaces/EF/pages/2385739812/Search+API+Documentation#Request (in the Search API Request Parameter accordion). Excerpt:

Parameter Datatype Description

theme String Restrict the query over one of the Europeana Thematic Collections. The possible values are: archaeology, art, fashion, industrial, manuscript, map, migration, music, nature, newspaper, photography, sport, ww1.

Yes, it is. Thank you for the guidance. I guess that will do unless further clarification is needed. @TimidRobot

I would create two data files: 1) without themes 2) with all themes. That should give you an indication of whether all entries have a theme or not.

@TimidRobot Brilliant! I've done just that.

TimidRobot · 2025-10-21T08:54:47Z

@Joyakis I resolved the conflicts with the main branch (due to changes from another pull request being merged). Please remember to fetch the changes to your computer.

Joyakis · 2025-10-21T11:07:10Z

@Joyakis I resolved the conflicts with the main branch (due to changes from another pull request being merged). Please remember to fetch the changes to your computer.

@TimidRobot Done!

TimidRobot · 2025-10-21T13:22:44Z

scripts/1-fetch/europeana_fetch.py

+    for theme in themes:
+        params = {
+            "wskey": EUROPEANA_API_KEY,
+            "rows": min(items_per_query, 20),
+            "profile": "rich",
+            "query": "*",
+            "theme": theme,
+        }
+
+        try:
+            LOGGER.info(
+                f"Fetching {params['rows']} records for theme: '{theme}'"
+            )
+            with session.get(BASE_URL, params=params, timeout=30) as response:
+                response.raise_for_status()
+                results = response.json()
+                items = results.get("items", [])
+
+            # Tag each item with the theme used for easy tracking
+            for item in items:
+                item["theme_used"] = theme


You're not looping through the pages, so the data is incomplete. However, I don't think you need to loop through the pages. You should be able to use totalResults from the first page. This will be significantly faster than evaluating each and every record in the database.

If you're going to use totalResults then you need to do all of your selection in the search queries.

You are already looping through theme. You also need to look into looping through provider_aggregation_edm_dataProvider. To do so, you'll need to get a complete list of data providers.

You can also search for different licenses using the RIGHTS search field (ex. query=RIGHTS:("http://creativecommons.org/licenses/by-sa/4.0/")). See https://europeana.atlassian.net/wiki/spaces/EF/pages/2385739812/Search+API+Documentation#Result-Fields

I find Europeana's APIs to be fairly confusing. To successfully add this data source you'll need to spend a lot of time in the documentation.
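A minimal sketch of the count-only approach described above, assuming the selection (theme, license) is done entirely in the request so that totalResults can be read from the first response. The helper names `build_count_params` and `extract_total` are hypothetical, the endpoint URL is an assumption about the script's BASE_URL, and `rows=0` is assumed to be accepted by the API.

```python
from typing import Optional

# Assumed endpoint for the Europeana Search API; the script's BASE_URL may differ
BASE_URL = "https://api.europeana.eu/record/v2/search.json"


def build_count_params(
    api_key: str,
    theme: Optional[str] = None,
    rights: Optional[str] = None,
) -> dict:
    """Build parameters for a count-only request (hypothetical helper)."""
    # Select by license in the query itself, as in the RIGHTS example above
    query = f'RIGHTS:("{rights}")' if rights else "*"
    params = {
        "wskey": api_key,
        "query": query,
        # Assumption: rows=0 is accepted, so no records are transferred
        "rows": 0,
    }
    if theme:
        params["theme"] = theme
    return params


def extract_total(results: dict) -> int:
    """Return totalResults from a parsed response (0 if the key is absent)."""
    return int(results.get("totalResults", 0))
```

With a requests session, the parsed JSON of `session.get(BASE_URL, params=params, timeout=30)` would be passed to `extract_total`, giving one request per theme or license rather than one per page of records.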

Hello @TimidRobot, I think I get the logic now. Please review it and advise; then I can start working on sources.md.

I have updated the code.

TimidRobot

Please resolve all comments before requesting a new review

TimidRobot · 2025-10-21T18:48:20Z

scripts/1-fetch/europeana_fetch.py

+def simplify_legal_tool(legal_tool):
+    """Simplify and standardize Creative Commons or license URLs."""
+    if (
+        legal_tool
+        and isinstance(legal_tool, str)
+        and legal_tool.startswith("http")
+    ):
+        parts = legal_tool.strip("/").split("/")
+        last_parts = parts[-2:]
+        if last_parts:
+            joined = " ".join(part.upper() for part in last_parts if part)
+            if "creativecommons.org" in legal_tool:
+                return f"CC {joined}"
+            else:
+                return joined
+        else:
+            return "Unknown"
+    return legal_tool


Please add documentation (docstring and/or comments) to this function. Add examples of input and output.

I suspect it does not handle both non-ported (no jurisdiction) and ported (with jurisdiction) license URLs.

non-ported example:

http://creativecommons.org/licenses/by-nc-nd/2.0/

ported example:

http://creativecommons.org/licenses/by-nc-nd/3.0/at/
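For illustration: taking the last two path segments, as the function under review does, turns the ported example into "CC 3.0 AT" and loses the license code. One possible way to handle both forms is to read the path from the front. This is a sketch, not the script's implementation; the name `simplify_cc_url` and the output format are assumptions.

```python
from urllib.parse import urlparse


def simplify_cc_url(legal_tool: str) -> str:
    """Simplify a Creative Commons license URL (illustrative sketch).

    Examples:
        http://creativecommons.org/licenses/by-nc-nd/2.0/
            -> "CC BY-NC-ND 2.0"
        http://creativecommons.org/licenses/by-nc-nd/3.0/at/
            -> "CC BY-NC-ND 3.0 AT"
    """
    parsed = urlparse(legal_tool)
    if parsed.netloc not in ("creativecommons.org", "www.creativecommons.org"):
        return legal_tool
    # Path segments, e.g. ["licenses", "by-nc-nd", "3.0", "at"]
    segments = [part for part in parsed.path.split("/") if part]
    if len(segments) < 3 or segments[0] != "licenses":
        # Not a /licenses/<code>/<version>/ URL; leave it unchanged
        return legal_tool
    # Reading from the front makes the jurisdiction an optional 4th segment
    code, version, *jurisdiction = segments[1:4]
    parts = ["CC", code.upper(), version]
    parts.extend(part.upper() for part in jurisdiction)
    return " ".join(parts)
```

URLs outside the /licenses/ path (for example public domain tools) are returned unchanged here and would need their own handling.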

TimidRobot · 2025-10-21T18:53:52Z

scripts/1-fetch/europeana_fetch.py

+    providers = get_facet_list(session, "DATA_PROVIDER")
+    rights_list = get_facet_list(session, "RIGHTS")


Please fetch this data in the main function and pass the variables.
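A rough sketch of the requested structure, assuming a query function that currently fetches the lists itself. `get_facet_list` is replaced by a stand-in with fixed values, and `query_europeana` is a hypothetical name for the function that receives them.

```python
def get_facet_list(session, facet_field):
    """Stand-in for the script's helper; returns fixed values here."""
    sample = {"DATA_PROVIDER": ["Provider A"], "RIGHTS": ["rights-1"]}
    return sample[facet_field]


def query_europeana(session, providers, rights_list):
    """Hypothetical query function: receives the lists as arguments."""
    return [
        (provider, rights)
        for provider in providers
        for rights in rights_list
    ]


def main(session=None):
    # Fetch the facet lists once, in main ...
    providers = get_facet_list(session, "DATA_PROVIDER")
    rights_list = get_facet_list(session, "RIGHTS")
    # ... and pass them down instead of fetching inside the query function
    return query_europeana(session, providers, rights_list)
```

Passing the lists in also makes the query function easier to exercise with small fixed inputs.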

TimidRobot · 2025-10-21T18:54:04Z

scripts/1-fetch/europeana_fetch.py

+    providers = get_facet_list(session, "DATA_PROVIDER")
+    rights_list = get_facet_list(session, "RIGHTS")


Please fetch this data in the main function and pass the variables.

TimidRobot · 2025-10-21T18:54:27Z

scripts/1-fetch/europeana_fetch.py

+    themes = [
+        "art",
+        "fashion",
+        "music",
+        "industrial",
+        "sport",
+        "photography",
+        "archaeology",
+    ]


Please add the other 6 themes and order the list
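For reference, the Search API documentation excerpt quoted earlier in this thread lists 13 values for the theme parameter. A sketch of the complete list in alphabetical order (the constant name `THEMES` is an assumption):

```python
# The 13 values the Search API documentation lists for the theme
# parameter, in alphabetical order
THEMES = [
    "archaeology",
    "art",
    "fashion",
    "industrial",
    "manuscript",
    "map",
    "migration",
    "music",
    "nature",
    "newspaper",
    "photography",
    "sport",
    "ww1",
]
```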

TimidRobot · 2025-10-21T18:58:50Z

scripts/1-fetch/europeana_fetch.py

+            else:
+                return joined
+        else:
+            return "Unknown"


The script needs to also handle:

http://rightsstatements.org/vocab/CNE/1.0/

http://rightsstatements.org/vocab/InC-EDU/1.0/

http://rightsstatements.org/vocab/InC-OW-EU/1.0/

http://rightsstatements.org/vocab/InC/1.0/

http://rightsstatements.org/vocab/NoC-NC/1.0/

http://rightsstatements.org/vocab/NoC-OKLR/1.0/
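A minimal sketch of one way to handle these URLs, assuming the desired label is the statement identifier followed by the version. The function name `simplify_rights_statement` and the output format are assumptions. Note that upper-casing the segments, as the function under review does, would change the mixed-case identifiers (InC-EDU would become INC-EDU).

```python
from urllib.parse import urlparse


def simplify_rights_statement(legal_tool: str) -> str:
    """Simplify a rightsstatements.org URL (illustrative sketch).

    Example:
        http://rightsstatements.org/vocab/InC-EDU/1.0/ -> "InC-EDU 1.0"
    """
    parsed = urlparse(legal_tool)
    # Path segments, e.g. ["vocab", "InC-EDU", "1.0"]
    segments = [part for part in parsed.path.split("/") if part]
    if (
        parsed.netloc != "rightsstatements.org"
        or len(segments) != 3
        or segments[0] != "vocab"
    ):
        # Not a /vocab/<identifier>/<version>/ URL; leave it unchanged
        return legal_tool
    _, identifier, version = segments
    # Keep the identifier's mixed case (InC, NoC) as published
    return f"{identifier} {version}"
```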

TimidRobot · 2025-10-21T19:03:31Z

scripts/1-fetch/europeana_fetch.py

+    facet_values = [
+        f["label"] for f in data.get("facets", [])[0].get("fields", [])
+    ]
+    return facet_values


First, great job on figuring out how to do this!!

Second, this returns exactly 50 results for both DATA_PROVIDER and RIGHTS. I am confident both have more. Please loop through the pagination.

Hello @TimidRobot, I managed to fix the other errors, but I'm still trying to figure out the pagination: I'm still getting 50 entries for both, even after adding cursor pagination according to their docs.

@Joyakis Please update the pull request with your changes.

@TimidRobot Hello I have now

The script continues to return exactly 50 results for both DATA_PROVIDER and RIGHTS. This is incorrect. There are 4203 data providers and 64 rights.

https://europeana.atlassian.net/wiki/spaces/EF/pages/2385739812/Search+API+Documentation#Offset-and-limit-for-Facets:

Parameter Datatype Description

f.[FACET_NAME].facet.limit Number Number of values an individual facet should contain. Set a limit of "0" to not return anything for that facet. By default, the limit of values of an individual facet is 50. This can be overridden by setting a custom limit e.g. via &f.DEFAULT.facet.limit=100.

You are using an invalid parameter for facet limit.
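A hedged sketch of paging through facet values with the documented per-facet limit. The companion parameter `f.[FACET_NAME].facet.offset` is an assumption based on the "Offset and limit for Facets" section title, as are `rows=0` and the helper names; `fetch_page` is a hypothetical callable that performs the request and returns one page of labels.

```python
from typing import Callable, List


def build_facet_params(
    api_key: str, facet: str, limit: int, offset: int
) -> dict:
    """Build parameters for one page of facet values (hypothetical helper)."""
    return {
        "wskey": api_key,
        "query": "*",
        "rows": 0,  # assumption: only the facet values are needed
        "profile": "facets",
        "facet": facet,
        f"f.{facet}.facet.limit": limit,
        # Assumption: offset parameter named after the doc section title
        f"f.{facet}.facet.offset": offset,
    }


def collect_facet_values(
    fetch_page: Callable[[str, int, int], List[str]],
    facet: str,
    limit: int = 100,
) -> List[str]:
    """Request pages until one comes back shorter than the limit."""
    values: List[str] = []
    offset = 0
    while True:
        page = fetch_page(facet, limit, offset)
        values.extend(page)
        if len(page) < limit:
            break
        offset += limit
    return values
```

Keeping the HTTP call behind `fetch_page` means the loop can be checked against a plain list before it is pointed at the API.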

Yes, as I mentioned previously, I was still getting the same 50 providers and rights. Thank you for pointing out what the issue might be. Let me have a look and respond in due course.

TimidRobot · 2025-10-21T19:18:03Z

scripts/1-fetch/europeana_fetch.py

+    facet_values = [
+        f["label"] for f in data.get("facets", [])[0].get("fields", [])
+    ]
+    return facet_values


It would be helpful to log the length of facet_values, using facet_field in the message.
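Something along these lines might work. This is a sketch: the stdlib logger stands in for the project's own LOGGER, the function name `extract_facet_labels` is hypothetical, and the "name" key on each facet is an assumption about the response shape. Matching by name also avoids the IndexError that `[0]` raises when the facets list is empty.

```python
import logging

# Stand-in for the project's shared LOGGER
LOGGER = logging.getLogger(__name__)


def extract_facet_labels(data: dict, facet_field: str) -> list:
    """Return the labels of the requested facet and log how many were found."""
    facet_values = []
    for facet in data.get("facets") or []:
        # Assumption: each facet object carries its field name under "name"
        if facet.get("name") == facet_field:
            facet_values = [f["label"] for f in facet.get("fields", [])]
            break
    LOGGER.info(
        f"Fetched {len(facet_values)} values for facet: {facet_field}"
    )
    return facet_values
```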

TimidRobot · 2025-10-24T10:10:22Z

Please focus on resolving conversations before you add new features.

Please remove custom logging (print statements) and follow logging conventions in other scripts.

Joyakis · 2025-10-24T12:53:55Z

Please focus on resolving conversations before you add new features.

Please remove custom logging (print statements) and follow logging conventions in other scripts.

Hello @TimidRobot, the script now gets all providers and rights.

Thank you for pointing out where the problem was.

I have also removed the unnecessary print statements.

Joyakis requested review from a team as code owners October 15, 2025 19:48

Joyakis requested review from TimidRobot and possumbilities and removed request for a team October 15, 2025 19:48

cc-open-source-bot moved this to In review in TimidRobot Oct 15, 2025

cc-open-source-bot added this to TimidRobot Oct 15, 2025

Babi-B reviewed Oct 15, 2025

Joyakis force-pushed the europeana-feature branch from 741d35d to 8f80f86 Compare October 16, 2025 04:30

TimidRobot requested changes Oct 17, 2025

TimidRobot self-assigned this Oct 18, 2025

TimidRobot changed the title ~~Added Europeana integration~~ Add Europeana integration Oct 20, 2025

TimidRobot reviewed Oct 20, 2025

Joyakis force-pushed the europeana-feature branch from b05fe07 to 34671b3 Compare October 21, 2025 11:00

TimidRobot requested changes Oct 21, 2025

Joyakis force-pushed the europeana-feature branch from 80045f1 to cd0ce9e Compare October 24, 2025 13:49

Joyakis and others added 9 commits October 24, 2025 17:03

Added Europeana integration

502f9b9

Remove unnecessary CSV file

7a7766d

Revert "Done the necessary changes"

9e52f73

This reverts commit 3190e3c.

Done the necessary changes

73f9aa1

Fix formatting and linting issues in europeana_fetch.py to comply with pre-commit hooks

cc2e742

update variable name (due to merge with main)

a7a1402

Add functionality to generate separate Europeana data files with and without themes

e26de97

Updated script to use Europeana API’s totalResults

ee36206

Added updated files

add3aaf

Joyakis added 4 commits October 24, 2025 17:08

Removed print statements

57cc578

Used facet pagination instead of cursor

000addc

Uses offset based pagination and updates sources

71c8b58

Removed unnecessary logger info

87c3c1e

Joyakis force-pushed the europeana-feature branch from cd0ce9e to 87c3c1e Compare October 24, 2025 14:10
