Add Europeana integration #200
741d35d to 8f80f86
scripts/1-fetch/europeana_fetch.py
Outdated
            return []

        # Try different queries to get diverse content
        queries = ["art", "history", "science", "music", "photography"]
What is your source for these categories?
Hello @TimidRobot. I chose the query categories based on possible searches on Europeana. They are similar to the theme parameter, which already has predefined options. The query can be customised depending on the data to be retrieved.
Are these categories documented anywhere? Are there categories not represented here?
Hello @TimidRobot
I updated the script so that, instead of using a generic list of possible searches, it now uses a curated list of themes listed directly on the Europeana website (e.g., Art, Fashion, Music, Sport, Photography, Archaeology).
The script now first searches for everything and then filters by the themes.
You can see the full list of themes here: themes
Thanks for your guidance. I'd be happy to adjust if you prefer a different approach.
The more authoritative source of themes is https://europeana.atlassian.net/wiki/spaces/EF/pages/2385739812/Search+API+Documentation#Request (in the Search API Request Parameter accordion). Excerpt:
Parameter: theme
Datatype: String
Description: Restrict the query over one of the Europeana Thematic Collections. The possible values are: archaeology, art, fashion, industrial, manuscript, map, migration, music, nature, newspaper, photography, sport, ww1.
Yes, it is. Thank you for the guide. I guess that will do unless any more clarification is needed. @TimidRobot
I would create two data files: 1) without themes 2) with all themes. That should give you an indication of whether all entries have a theme or not.
@TimidRobot Brilliant! I've done just that.
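A minimal sketch of the two-file comparison suggested in this thread, assuming count-only requests are acceptable. The theme values come from the documentation excerpt quoted earlier; `build_count_params` and `untagged_lower_bound` are hypothetical helper names, and `rows=0` is an assumption about how to request only `totalResults`.

```python
# Theme values copied from the Search API documentation excerpt in this thread.
THEMES = sorted(
    [
        "archaeology", "art", "fashion", "industrial", "manuscript", "map",
        "migration", "music", "nature", "newspaper", "photography", "sport",
        "ww1",
    ]
)


def build_count_params(api_key, theme=None):
    """Return query parameters for a count-only request (hypothetical helper)."""
    params = {
        "wskey": api_key,
        "query": "*",
        "rows": 0,  # assumption: only totalResults is needed, not the items
    }
    if theme is not None:
        params["theme"] = theme
    return params


def untagged_lower_bound(total_all, totals_by_theme):
    """Lower bound on records with no theme.

    Records may belong to several themes, so the per-theme sum can
    overcount; the difference is therefore only a lower bound.
    """
    return max(total_all - sum(totals_by_theme.values()), 0)
```

If the bound is greater than zero, at least that many records carry no theme, which answers the question of whether every entry has one.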
@Joyakis I resolved the conflicts with the |
b05fe07 to 34671b3
@TimidRobot Done!
scripts/1-fetch/europeana_fetch.py
Outdated
        for theme in themes:
            params = {
                "wskey": EUROPEANA_API_KEY,
                "rows": min(items_per_query, 20),
                "profile": "rich",
                "query": "*",
                "theme": theme,
            }

            try:
                LOGGER.info(
                    f"Fetching {params['rows']} records for theme: '{theme}'"
                )
                with session.get(BASE_URL, params=params, timeout=30) as response:
                    response.raise_for_status()
                    results = response.json()
                    items = results.get("items", [])

                    # Tag each item with the theme used for easy tracking
                    for item in items:
                        item["theme_used"] = theme
You're not looping through the pages, so the data is incomplete. However, I don't think you need to loop through the pages. You should be able to use totalResults from the first page. This will be significantly faster than evaluating each and every record in the database.
If you're going to use totalResults then you need to do all of your selection in the search queries.
You are already looping through theme. You also need to look into looping through provider_aggregation_edm_dataProvider. To do so, you'll need to get a complete list of data providers.
You can also search for different licenses using the RIGHTS search field (ex. query=RIGHTS:("http://creativecommons.org/licenses/by-sa/4.0/")). See https://europeana.atlassian.net/wiki/spaces/EF/pages/2385739812/Search+API+Documentation#Result-Fields
I find Europeana's APIs to be fairly confusing. To successfully add this data source you'll need to spend a lot of time in the documentation.
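As a rough illustration of doing all selection in the search query and reading `totalResults` from the first page, the sketch below builds one count-only parameter set per (theme, rights) pair. The `RIGHTS:("...")` query form is taken from the comment above; the helper names and the use of `rows=0` are assumptions, not the PR's code.

```python
from itertools import product


def rights_query(rights_url):
    """Build a query that matches one rights URL via the RIGHTS field."""
    return f'RIGHTS:("{rights_url}")'


def count_requests(api_key, themes, rights_urls):
    """Yield one count-only parameter set per (theme, rights) pair."""
    for theme, rights_url in product(themes, rights_urls):
        yield {
            "wskey": api_key,
            "query": rights_query(rights_url),
            "theme": theme,
            "rows": 0,  # assumption: no items needed, only totalResults
        }


def total_results(response_json):
    """Read the record count reported in a decoded search response."""
    return int(response_json.get("totalResults", 0))
```

Each parameter set would be passed to something like `session.get(BASE_URL, params=params)`, and `total_results(response.json())` would then give the count without paging through records.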
Hello @TimidRobot, I think I get the logic now. Please review it and advise, then I can start working on sources.md.
I have updated the code.
Please resolve all comments before requesting a new review
    def simplify_legal_tool(legal_tool):
        """Simplify and standardize Creative Commons or license URLs."""
        if (
            legal_tool
            and isinstance(legal_tool, str)
            and legal_tool.startswith("http")
        ):
            parts = legal_tool.strip("/").split("/")
            last_parts = parts[-2:]
            if last_parts:
                joined = " ".join(part.upper() for part in last_parts if part)
                if "creativecommons.org" in legal_tool:
                    return f"CC {joined}"
                else:
                    return joined
            else:
                return "Unknown"
        return legal_tool
Please add documentation (docstring and/or comments) to this function. Add examples of input and output.
I suspect it does not handle both non-ported (no jurisdiction) and ported (with jurisdiction) license URLs.
non-ported example:
http://creativecommons.org/licenses/by-nc-nd/2.0/
ported example:
http://creativecommons.org/licenses/by-nc-nd/3.0/at/
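For reference, taking only the last two path segments turns the ported example into "CC 3.0 AT", which drops the license code. One possible way to handle both forms is to parse the URL path by position, sketched below. `simplify_cc_url` is a hypothetical name, and the output label format is an assumption rather than a project convention.

```python
from urllib.parse import urlparse


def simplify_cc_url(url):
    """Turn a Creative Commons license URL into a short label.

    Examples (input -> output):
        http://creativecommons.org/licenses/by-nc-nd/2.0/    -> CC BY-NC-ND 2.0
        http://creativecommons.org/licenses/by-nc-nd/3.0/at/ -> CC BY-NC-ND 3.0 AT
    """
    parsed = urlparse(url)
    segments = parsed.path.strip("/").split("/")
    # Expected path: licenses/<code>/<version>[/<jurisdiction>]
    if (
        "creativecommons.org" not in parsed.netloc
        or len(segments) < 3
        or segments[0] != "licenses"
    ):
        return "Unknown"
    label = ["CC", segments[1].upper(), segments[2]]
    if len(segments) > 3:
        # Ported licenses carry a jurisdiction code after the version
        label.append(segments[3].upper())
    return " ".join(label)
```

Public domain tools under `/publicdomain/` are deliberately left as "Unknown" here; they would need their own branch.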
scripts/1-fetch/europeana_fetch.py
Outdated
        providers = get_facet_list(session, "DATA_PROVIDER")
        rights_list = get_facet_list(session, "RIGHTS")
Please fetch this data in the main function and pass the variables.
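A small sketch of the requested structure, where the facet lists are fetched once in `main` and passed down. The `get_facet_list` below is a stand-in that returns canned values so the example runs offline, and `count_combinations` is a hypothetical consumer; neither is the PR's actual code.

```python
# Canned data so the sketch runs without network access (illustrative only).
CANNED_FACETS = {
    "DATA_PROVIDER": ["Provider A", "Provider B"],
    "RIGHTS": ["http://creativecommons.org/licenses/by-sa/4.0/"],
}


def get_facet_list(session, facet_field):
    """Stand-in for the PR's helper; returns canned values here."""
    return list(CANNED_FACETS[facet_field])


def count_combinations(providers, rights_list):
    """Hypothetical consumer that receives the lists as arguments."""
    return len(providers) * len(rights_list)


def main():
    session = None  # a requests.Session in the real script
    # Fetch once in main, then pass the values to whatever needs them
    providers = get_facet_list(session, "DATA_PROVIDER")
    rights_list = get_facet_list(session, "RIGHTS")
    return count_combinations(providers, rights_list)
```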
scripts/1-fetch/europeana_fetch.py
Outdated
        themes = [
            "art",
            "fashion",
            "music",
            "industrial",
            "sport",
            "photography",
            "archaeology",
        ]
Please add the other 7 themes and order the list
scripts/1-fetch/europeana_fetch.py
Outdated
                else:
                    return joined
            else:
                return "Unknown"
The script needs to also handle:
http://rightsstatements.org/vocab/CNE/1.0/
http://rightsstatements.org/vocab/InC-EDU/1.0/
http://rightsstatements.org/vocab/InC-OW-EU/1.0/
http://rightsstatements.org/vocab/InC/1.0/
http://rightsstatements.org/vocab/NoC-NC/1.0/
http://rightsstatements.org/vocab/NoC-OKLR/1.0/
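One way these URLs might be handled is to match the `/vocab/<code>/<version>/` path shape explicitly instead of upper-casing the last two segments. This is only a sketch: `simplify_rights_statement` is a hypothetical name and the "RS" prefix is an invented label, not an established convention.

```python
from urllib.parse import urlparse


def simplify_rights_statement(url):
    """Turn a rightsstatements.org URL into a short label.

    Example: http://rightsstatements.org/vocab/InC-EDU/1.0/ -> RS InC-EDU 1.0
    """
    parsed = urlparse(url)
    segments = parsed.path.strip("/").split("/")
    # Expected path: vocab/<code>/<version>
    if (
        parsed.netloc != "rightsstatements.org"
        or len(segments) != 3
        or segments[0] != "vocab"
    ):
        return "Unknown"
    code, version = segments[1], segments[2]
    # Keep the code's original casing (InC, NoC, ...) since it is meaningful
    return f"RS {code} {version}"
```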
scripts/1-fetch/europeana_fetch.py
Outdated
        facet_values = [
            f["label"] for f in data.get("facets", [])[0].get("fields", [])
        ]
        return facet_values
First, great job on figuring out how to do this!!
Second, this returns exactly 50 results for both DATA_PROVIDER and RIGHTS. I am confident both have more. Please loop through the pagination.
Hello @TimidRobot, I managed to fix the other errors, but I'm still trying to figure out the pagination: I still get 50 entries for both, even after adding cursor pagination according to their docs.
@Joyakis Please update the pull request with your changes.
@TimidRobot Hello I have now
The script continues to return exactly 50 results for both DATA_PROVIDER and RIGHTS. This is incorrect. There are 4203 data providers and 64 rights.
Parameter: f.[FACET_NAME].facet.limit
Datatype: Number
Description: Number of values an individual facet should contain. Set a limit of "0" to not return anything for that facet. By default, the limit of values of an individual facet is 50. This can be overridden by setting a custom limit e.g. via &f.DEFAULT.facet.limit=100.
You are using an invalid parameter for facet limit.
Yeah, as I mentioned previously, I was still getting the same 50 providers and rights. Thank you for pointing out what the issue might be. Let me have a look and respond in due course.
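To make the facet limit parameter concrete, here is a hedged sketch of how the request parameters and label extraction could look. The `f.[FACET_NAME].facet.limit` name comes from the documentation excerpt above; using `profile=facets`, `facet=<name>`, and `rows=0` together is an assumption about the Search API, and both function names are hypothetical.

```python
def facet_params(api_key, facet_field, limit):
    """Build parameters that ask for up to `limit` values of one facet."""
    return {
        "wskey": api_key,
        "query": "*",
        "rows": 0,  # assumption: facet values only, no items
        "profile": "facets",  # assumption: profile that enables facets
        "facet": facet_field,
        # Per-facet limit, as named in the documentation excerpt
        f"f.{facet_field}.facet.limit": limit,
    }


def facet_labels(data):
    """Extract facet labels from a decoded response, tolerating no facets."""
    facets = data.get("facets") or []
    if not facets:
        return []
    return [field["label"] for field in facets[0].get("fields", [])]
```

With a limit above the known totals (4203 data providers, 64 rights), a single request per facet may be enough, though that depends on any server-side cap.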
scripts/1-fetch/europeana_fetch.py
Outdated
        facet_values = [
            f["label"] for f in data.get("facets", [])[0].get("fields", [])
        ]
        return facet_values
It would be helpful to log the length of facet_values, using facet_field in the message.
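A possible shape for that log line, assuming the standard library `logging` module rather than the project's own logger setup. `log_facet_count` is a hypothetical helper.

```python
import logging

LOGGER = logging.getLogger(__name__)


def log_facet_count(facet_field, facet_values):
    """Log how many values were fetched for a facet and return the message."""
    message = f"Fetched {len(facet_values)} values for facet {facet_field}"
    LOGGER.info(message)
    return message
```

A count of exactly 50 in this message would immediately flag that the default facet limit is still in effect.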
Please focus on resolving conversations before you add new features. Please remove custom logging ( |
Hello @TimidRobot, the script now gets all providers and rights. Thank you for pointing out where the problem was. I have also removed the unnecessary print statements.
80045f1 to cd0ce9e
This reverts commit 3190e3c.
…h pre-commit hooks
cd0ce9e to 87c3c1e
Fixes
Description
Add Europeana API integration for metrics collection
This PR adds a new script, europeana_fetch.py, that fetches and aggregates data from the Europeana Search API.
The script collects high-level statistics about cultural heritage content available through Europeana, focusing on data provider distribution and content types rather than fragile license parsing.
Technical details
Script Location:
scripts/1-fetch/europeana_fetch.py
data/2025Q4/1-fetch/europeana_1_count.csv
Tests
python scripts/1-fetch/europeana_fetch.py --enable-save
data/2025Q4/1-fetch/europeana_1_count.csv
Checklist
Update index.md).
main or master).
visible errors.
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."