@Joyakis Joyakis commented Oct 15, 2025

Fixes

Description

Add Europeana API integration for metrics collection

This PR adds a new script europeana_fetch.py that fetches and aggregates data from the Europeana Search API.
The script collects high-level statistics about cultural heritage content available through Europeana, focusing on data provider distribution and content types rather than fragile license parsing.

Technical details

  • Script Location: scripts/1-fetch/europeana_fetch.py
  • Data Output: data/2025Q4/1-fetch/europeana_1_count.csv
  • Key Features:
    • Fetches data from Europeana Search API using multiple content queries
    • Aggregates by DATA_PROVIDER and content metadata
    • Includes proper error handling and API rate limiting
    • Integrates with existing project structure and git workflows
  • Environment: Updated env.example with EUROPEANA_API_KEY placeholder

Tests

  1. Set up Europeana API key in environment variables
  2. Run the script: python scripts/1-fetch/europeana_fetch.py --enable-save
  3. Verify CSV output is generated in data/2025Q4/1-fetch/europeana_1_count.csv
  4. Check that data contains aggregated counts by DATA_PROVIDER

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI.
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@Joyakis Joyakis requested review from a team as code owners October 15, 2025 19:48
@Joyakis Joyakis requested review from TimidRobot and possumbilities and removed request for a team October 15, 2025 19:48
@cc-open-source-bot cc-open-source-bot moved this to In review in TimidRobot Oct 15, 2025
return []

# Try different queries to get diverse content
queries = ["art", "history", "science", "music", "photography"]
Member

What is your source for these categories?

Author

Hello @TimidRobot. I chose the query categories based on possible searches on Europeana. This is similar to the theme parameter, which already has predefined options. The query can be customised, though, depending on the data to be retrieved.

Member

Are these categories documented anywhere? Are there categories not represented here?

Author

Hello @TimidRobot
I updated the script so that, instead of a generic list of possible searches, it now uses a curated list of themes listed directly on the Europeana website (e.g., Art, Fashion, Music, Sport, Photography, Archaeology).
The script now first searches for everything and then filters by theme.

You can see the full list of themes here: themes

Thanks for your guidance and I'd be happy to adjust if you prefer a different approach.

Member

The more authoritative source of themes is https://europeana.atlassian.net/wiki/spaces/EF/pages/2385739812/Search+API+Documentation#Request (in the Search API Request Parameter accordion). Excerpt:

Parameter | Datatype | Description
theme | String | Restrict the query over one of the Europeana Thematic Collections. The possible values are: archaeology, art, fashion, industrial, manuscript, map, migration, music, nature, newspaper, photography, sport, ww1.

Author

@Joyakis Joyakis Oct 21, 2025

Yes, it is. Thank you for the guide. I guess that will do unless further clarification is needed. @TimidRobot

Member

I would create two data files: 1) without themes 2) with all themes. That should give you an indication of whether all entries have a theme or not.

Author

@TimidRobot Brilliant! I've done just that.

@TimidRobot TimidRobot self-assigned this Oct 18, 2025
@TimidRobot TimidRobot changed the title Added Europeana integration Add Europeana integration Oct 20, 2025
@TimidRobot
Member

@Joyakis I resolved the conflicts with the main branch (due to changes from another pull request being merged). Please remember to fetch the changes to your computer.

@Joyakis
Author

Joyakis commented Oct 21, 2025

@Joyakis I resolved the conflicts with the main branch (due to changes from another pull request being merged). Please remember to fetch the changes to your computer.

@TimidRobot Done!

Comment on lines 139 to 159
for theme in themes:
    params = {
        "wskey": EUROPEANA_API_KEY,
        "rows": min(items_per_query, 20),
        "profile": "rich",
        "query": "*",
        "theme": theme,
    }

    try:
        LOGGER.info(
            f"Fetching {params['rows']} records for theme: '{theme}'"
        )
        with session.get(BASE_URL, params=params, timeout=30) as response:
            response.raise_for_status()
            results = response.json()
            items = results.get("items", [])

            # Tag each item with the theme used for easy tracking
            for item in items:
                item["theme_used"] = theme
Member

You're not looping through the pages, so the data is incomplete. However, I don't think you need to loop through the pages. You should be able to use totalResults from the first page. This will be significantly faster than evaluating each and every record in the database.

If you're going to use totalResults then you need to do all of your selection in the search queries.

You are already looping through theme. You also need to look into looping through provider_aggregation_edm_dataProvider. To do so, you'll need to get a complete list of data providers.

You can also search for different licenses using the RIGHTS search field (ex. query=RIGHTS:("http://creativecommons.org/licenses/by-sa/4.0/")). See https://europeana.atlassian.net/wiki/spaces/EF/pages/2385739812/Search+API+Documentation#Result-Fields

I find Europeana's APIs to be fairly confusing. To successfully add this data source you'll need to spend a lot of time in the documentation.
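
A minimal sketch of the counting approach described above, for illustration only. It assumes the Search API accepts rows=0 and reports the number of matches in totalResults; the helper names build_count_params and extract_total are hypothetical and not part of the PR. Leaving theme unset yields the count across all records, which covers the "without themes" case.

```python
def build_count_params(api_key, theme=None, rights=None):
    """Build Search API parameters for a count-only request (hypothetical helper)."""
    # All selection happens in the query, because only totalResults is read.
    query = f'RIGHTS:("{rights}")' if rights else "*"
    params = {
        "wskey": api_key,
        "query": query,
        "rows": 0,  # assumption: zero rows is accepted and skips item payloads
    }
    if theme:
        params["theme"] = theme
    return params


def extract_total(payload):
    """Read the match count from a decoded Search API response."""
    return int(payload.get("totalResults", 0))


params = build_count_params(
    "YOUR_API_KEY",
    theme="art",
    rights="http://creativecommons.org/licenses/by-sa/4.0/",
)
print(params["query"])
```

The resulting dictionary could be passed to session.get(BASE_URL, params=params, timeout=30) as in the excerpt above, with response.json() handed to extract_total.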

Author

Hello @TimidRobot, I think I get the logic now. Please review it and advise, then I can start working on sources.md.

I have updated the code.

Member

@TimidRobot TimidRobot left a comment

Please resolve all comments before requesting a new review

Comment on lines 105 to 193
def simplify_legal_tool(legal_tool):
    """Simplify and standardize Creative Commons or license URLs."""
    if (
        legal_tool
        and isinstance(legal_tool, str)
        and legal_tool.startswith("http")
    ):
        parts = legal_tool.strip("/").split("/")
        last_parts = parts[-2:]
        if last_parts:
            joined = " ".join(part.upper() for part in last_parts if part)
            if "creativecommons.org" in legal_tool:
                return f"CC {joined}"
            else:
                return joined
        else:
            return "Unknown"
    return legal_tool
Member

Please add documentation (docstring and/or comments) to this function. Add examples of input and output.

I suspect it does not handle both non-ported (no jurisdiction) and ported (with jurisdiction) license URLs.

non-ported example:

http://creativecommons.org/licenses/by-nc-nd/2.0/

ported example:

http://creativecommons.org/licenses/by-nc-nd/3.0/at/
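
For illustration: taking only the last two path segments, as the excerpt does, would appear to reduce the ported example to "CC 3.0 AT" and lose the license name. The sketch below is one possible way to handle both URL shapes by reading path segments by position. The function name simplify_cc_url and the output labels are assumptions, not the PR's code, and public domain tools (for example publicdomain/zero) would likely need their own mapping.

```python
from urllib.parse import urlparse


def simplify_cc_url(url):
    """Turn a Creative Commons license URL into a short label (illustrative).

    Examples:
        http://creativecommons.org/licenses/by-nc-nd/2.0/    -> CC BY-NC-ND 2.0
        http://creativecommons.org/licenses/by-nc-nd/3.0/at/ -> CC BY-NC-ND 3.0 AT
    """
    parsed = urlparse(url.strip())
    if not parsed.netloc.endswith("creativecommons.org"):
        return url  # not a CC URL: leave untouched
    # Expected path: /licenses/<unit>/<version>/[<jurisdiction>/]
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) < 3:
        return "Unknown"
    unit, version = parts[1], parts[2]
    label = f"CC {unit.upper()} {version}"
    if len(parts) > 3:
        # A fourth segment is the jurisdiction of a ported license.
        label += f" {parts[3].upper()}"
    return label


print(simplify_cc_url("http://creativecommons.org/licenses/by-nc-nd/2.0/"))
print(simplify_cc_url("http://creativecommons.org/licenses/by-nc-nd/3.0/at/"))
```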

Comment on lines 132 to 133
providers = get_facet_list(session, "DATA_PROVIDER")
rights_list = get_facet_list(session, "RIGHTS")
Member

Please fetch this data in the main function and pass the variables.

Comment on lines 174 to 175
providers = get_facet_list(session, "DATA_PROVIDER")
rights_list = get_facet_list(session, "RIGHTS")
Member

Please fetch this data in the main function and pass the variables.

Comment on lines 176 to 184
themes = [
    "art",
    "fashion",
    "music",
    "industrial",
    "sport",
    "photography",
    "archaeology",
]
Member

Please add the other 6 themes and order the list.
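
As a sketch of what the completed list might look like, the constant below uses the thirteen theme values from the Search API documentation excerpt quoted earlier in this conversation, in alphabetical order. The name THEMES is an assumption for illustration.

```python
# Theme values taken from the Search API documentation excerpt quoted in
# this conversation; kept in alphabetical order.
THEMES = [
    "archaeology",
    "art",
    "fashion",
    "industrial",
    "manuscript",
    "map",
    "migration",
    "music",
    "nature",
    "newspaper",
    "photography",
    "sport",
    "ww1",
]

print(len(THEMES))
```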

            else:
                return joined
        else:
            return "Unknown"
Member

The script needs to also handle:

  • http://rightsstatements.org/vocab/CNE/1.0/
  • http://rightsstatements.org/vocab/InC-EDU/1.0/
  • http://rightsstatements.org/vocab/InC-OW-EU/1.0/
  • http://rightsstatements.org/vocab/InC/1.0/
  • http://rightsstatements.org/vocab/NoC-NC/1.0/
  • http://rightsstatements.org/vocab/NoC-OKLR/1.0/
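
One possible way to handle these URLs, sketched under the assumption that they all follow the /vocab/<identifier>/<version>/ shape shown in the list above. The function name simplify_rights_statement and the choice to keep the identifier's original casing are illustrative, not taken from the PR.

```python
from urllib.parse import urlparse


def simplify_rights_statement(url):
    """Turn a rightsstatements.org URL into '<identifier> <version>' (illustrative)."""
    parsed = urlparse(url.strip())
    parts = [part for part in parsed.path.split("/") if part]
    is_rights_statement = (
        parsed.netloc.endswith("rightsstatements.org")
        and len(parts) >= 3
        and parts[0] == "vocab"
    )
    if is_rights_statement:
        # Keep the identifier's casing (for example InC-EDU) as published.
        return f"{parts[1]} {parts[2]}"
    return url  # anything else is returned unchanged


for example in (
    "http://rightsstatements.org/vocab/CNE/1.0/",
    "http://rightsstatements.org/vocab/InC-OW-EU/1.0/",
):
    print(simplify_rights_statement(example))
```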

facet_values = [
    f["label"] for f in data.get("facets", [])[0].get("fields", [])
]
return facet_values
Member

First, great job on figuring out how to do this!!

Second, this returns exactly 50 results for both DATA_PROVIDER and RIGHTS. I am confident both have more. Please loop through the pagination.

Author

Hello @TimidRobot, I managed to fix the other errors, but I'm still trying to figure out the pagination: I'm still getting 50 entries for both, even after adding cursor pagination according to their docs.

Member

@Joyakis Please update the pull request with your changes.

Author

@TimidRobot Hello I have now

Member

The script continues to return exactly 50 results for both DATA_PROVIDER and RIGHTS. This is incorrect. There are 4203 data providers and 64 rights.

https://europeana.atlassian.net/wiki/spaces/EF/pages/2385739812/Search+API+Documentation#Offset-and-limit-for-Facets:

Parameter | Datatype | Description
f.[FACET_NAME].facet.limit | Number | Number of values an individual facet should contain. Set a limit of "0" to not return anything for that facet. By default, the limit of values of an individual facet is 50. This can be overridden by setting a custom limit e.g. via &f.DEFAULT.facet.limit=100.

You are using an invalid parameter for facet limit.
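
A minimal sketch of paging through facet values with the documented f.[FACET_NAME].facet.limit parameter. The companion f.[FACET_NAME].facet.offset parameter is an assumption based on the title of the linked documentation section, as are rows=0 and the "facets" profile. The names facet_params and collect_facet_labels are hypothetical, and fetch_json stands in for whatever performs the HTTP request and returns the decoded JSON.

```python
def facet_params(api_key, facet, limit, offset):
    """Build parameters for one page of facet values (hypothetical helper)."""
    return {
        "wskey": api_key,
        "query": "*",
        "rows": 0,  # assumption: only facet values are needed, not records
        "profile": "facets",
        "facet": facet,
        f"f.{facet}.facet.limit": limit,
        f"f.{facet}.facet.offset": offset,  # assumption, see lead-in
    }


def collect_facet_labels(fetch_json, api_key, facet, page_size=1000):
    """Collect every label of one facet by requesting pages until one is short."""
    labels = []
    offset = 0
    while True:
        data = fetch_json(facet_params(api_key, facet, page_size, offset))
        facets = data.get("facets") or []
        # Guard against a missing or empty facets list instead of indexing [0].
        fields = facets[0].get("fields", []) if facets else []
        labels.extend(field["label"] for field in fields)
        if len(fields) < page_size:
            return labels
        offset += len(fields)


# Offline demonstration with a stand-in for the HTTP call.
ALL_LABELS = [f"provider-{i}" for i in range(125)]


def fake_fetch(params):
    limit = params["f.DATA_PROVIDER.facet.limit"]
    offset = params["f.DATA_PROVIDER.facet.offset"]
    page = ALL_LABELS[offset:offset + limit]
    fields = [{"label": label, "count": 1} for label in page]
    return {"facets": [{"name": "DATA_PROVIDER", "fields": fields}]}


result = collect_facet_labels(fake_fetch, "KEY", "DATA_PROVIDER", page_size=50)
print(len(result))
```

In the real script, fetch_json could wrap session.get(BASE_URL, params=params, timeout=30) followed by raise_for_status() and json().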

Author

@Joyakis Joyakis Oct 24, 2025

The script continues to return exactly 50 results for both DATA_PROVIDER and RIGHTS. This is incorrect. There are 4203 data providers and 64 rights.

https://europeana.atlassian.net/wiki/spaces/EF/pages/2385739812/Search+API+Documentation#Offset-and-limit-for-Facets:

Parameter | Datatype | Description
f.[FACET_NAME].facet.limit | Number | Number of values an individual facet should contain. Set a limit of "0" to not return anything for that facet. By default, the limit of values of an individual facet is 50. This can be overridden by setting a custom limit e.g. via &f.DEFAULT.facet.limit=100.

You are using an invalid parameter for facet limit.

Yeah, as I mentioned previously, I was still getting the same 50 providers and rights. Thank you for pointing out what the issue might be. Let me have a look and respond in due course.

facet_values = [
    f["label"] for f in data.get("facets", [])[0].get("fields", [])
]
return facet_values
Member

It would be helpful to log the length of facet_values, using facet_field in the message.
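
A small sketch of such a log message, assuming a module-level LOGGER like the one used in the earlier excerpt. The helper name log_facet_count and the message wording are illustrative, and the project's own logging setup may differ.

```python
import logging

LOGGER = logging.getLogger(__name__)


def log_facet_count(facet_field, facet_values):
    """Log how many values were fetched for a facet and return the message."""
    message = f"Fetched {len(facet_values)} values for facet '{facet_field}'"
    LOGGER.info(message)
    return message


print(log_facet_count("RIGHTS", ["a", "b"]))
```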

@TimidRobot
Member

Please focus on resolving conversations before you add new features.

Please remove custom logging (print statements) and follow logging conventions in other scripts.

@Joyakis
Author

Joyakis commented Oct 24, 2025

Please focus on resolving conversations before you add new features.

Please remove custom logging (print statements) and follow logging conventions in other scripts.

Hello @TimidRobot The script now gets all providers and rights.

Thank you for pointing out where the problem was.

I have also removed the unnecessary print statements.


Labels

None yet

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

Add Europeana as a data source

3 participants