Caution
Because it requires a paid subscription (Flickr Pro), Flickr is not currently a viable data source.
Problem
I've been exploring the codebase and noticed that Flickr data collection is still in pre-automation/flickr/ while other sources like Google Custom Search and GitHub have been automated. The current Flickr scripts appear to be manual and haven't been integrated with the automated pipeline yet.
Description
Looking at pre-automation/flickr/photos.py, it currently:
- Searches for photos with different Creative Commons licenses (IDs: 1,2,3,4,5,6,9,10)
- Gets 500 photos per license
- Saves everything to a JSON file
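For reference, a minimal sketch of that flow using the flickrapi package listed in the Pipfile (reading API credentials from environment variables and the exact output shape are my assumptions, not the actual photos.py code):

```python
# Minimal sketch of the per-license search the pre-automation script performs.
# Credential handling and output layout are assumptions, not the real photos.py.
import json
import os

import flickrapi

LICENSE_IDS = ["1", "2", "3", "4", "5", "6", "9", "10"]  # CC license IDs used by Flickr

flickr = flickrapi.FlickrAPI(
    os.environ["FLICKR_API_KEY"],
    os.environ["FLICKR_API_SECRET"],
    format="parsed-json",
)

results = {}
for license_id in LICENSE_IDS:
    # One page of up to 500 photos per license, as described above
    response = flickr.photos.search(license=license_id, per_page=500)
    results[license_id] = response["photos"]["photo"]

with open("flickr_photos.json", "w") as file_obj:
    json.dump(results, file_obj, indent=2)
```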
I noticed some differences compared to the automated scripts:
- Output: Flickr uses JSON, but GCS/GitHub use CSV
- Arguments: The automated scripts have --enable-save, --enable-git, etc.
- Setup: Automated scripts use shared.setup() instead of the old quantify.setup()
- Integration: They integrate with git automation and quarterly folders
The Pipfile already mentions flickrapi and has a planned script path: flickr_fetched = "./scripts/1-fetch/flickr_fetched.py"
Implementation Plan
Following the guidance to plan each phase before implementation, and after studying the 2025Q3 report structure:
Research Findings
The GCS report is more detailed than I expected: it doesn't just count licenses, it also breaks the data down by country and language and groups licenses into categories. The charts follow a consistent layout, with bar charts on the left and pie charts on the right, and use the same color palette throughout.
Studying this made clear why careful planning matters: with a dataset as large as Flickr's, it would be easy to collect a lot of data that never contributes to a useful report.
Fetch Phase
Build new automation (using pre-automation as reference only):
- Create scripts/1-fetch/flickr_fetch.py following the same pattern as gcs_fetch.py
- Use the same setup method, CSV output, and command-line options
- Focus on collecting data that will actually be useful for the reporting phase
- Get license types (1,2,3,4,5,6,9,10) plus any metadata that supports analysis
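As a starting point, a hypothetical skeleton for the new fetch script could look like the sketch below; the shared.setup() integration, exact argument names, and CSV columns would all need to be confirmed against gcs_fetch.py before implementation:

```python
# Hypothetical skeleton for scripts/1-fetch/flickr_fetch.py, mirroring the
# argument and CSV conventions described above. Argument names, column names,
# and output paths are assumptions to confirm against gcs_fetch.py.
import argparse
import csv
import os

import flickrapi

LICENSE_IDS = ["1", "2", "3", "4", "5", "6", "9", "10"]


def parse_arguments():
    parser = argparse.ArgumentParser(description="Fetch Flickr CC license data")
    parser.add_argument("--enable-save", action="store_true", help="save fetched data to CSV")
    parser.add_argument("--enable-git", action="store_true", help="commit results via git automation")
    return parser.parse_args()


def fetch_license_counts(flickr):
    # Ask for one photo per license and read the aggregate total, rather than
    # paging through every photo, since the reports only need counts.
    counts = {}
    for license_id in LICENSE_IDS:
        response = flickr.photos.search(license=license_id, per_page=1)
        counts[license_id] = int(response["photos"]["total"])
    return counts


def save_csv(counts, path):
    with open(path, "w", newline="") as file_obj:
        writer = csv.writer(file_obj)
        writer.writerow(["LICENSE_ID", "COUNT"])
        for license_id, count in counts.items():
            writer.writerow([license_id, count])


if __name__ == "__main__":
    args = parse_arguments()
    flickr = flickrapi.FlickrAPI(
        os.environ["FLICKR_API_KEY"],
        os.environ["FLICKR_API_SECRET"],
        format="parsed-json",
    )
    counts = fetch_license_counts(flickr)
    if args.enable_save:
        save_csv(counts, "flickr_fetched.csv")
```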
Process Phase
Transform data to support reporting:
- Map Flickr license IDs to actual CC license names and versions using the Flickr API reference
- Group licenses the same way as GCS (Latest/Prior/Retired categories)
- Process any geographic or language data to match the report structure
- Make sure everything aligns with what will actually be shown in charts
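A rough sketch of the mapping and grouping step follows; the ID-to-name mapping should be verified against flickr.photos.licenses.getInfo, and the Latest/Prior assignment is an assumption to confirm against how the GCS processing scripts categorize legal tools:

```python
# Sketch of the license mapping and grouping step. Verify the ID-to-name
# mapping against flickr.photos.licenses.getInfo; the status grouping below
# is an assumption to confirm against the GCS processing logic.
FLICKR_LICENSE_NAMES = {
    "1": "CC BY-NC-SA 2.0",
    "2": "CC BY-NC 2.0",
    "3": "CC BY-NC-ND 2.0",
    "4": "CC BY 2.0",
    "5": "CC BY-SA 2.0",
    "6": "CC BY-ND 2.0",
    "9": "CC0 1.0",
    "10": "Public Domain Mark 1.0",
}


def license_status(name):
    # CC0 and the Public Domain Mark are current ("Latest") tools; the 2.0
    # ports offered by Flickr are prior versions of the licenses.
    if name.startswith(("CC0", "Public Domain Mark")) or "4.0" in name:
        return "Latest"
    return "Prior"
```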
Report Phase
Create visualizations that add value:
- License distribution analysis (similar to GCS "Products totals")
- Status breakdown (Latest/Prior/Retired)
- Geographic or language analysis if the data supports it
- Focus on analyses that make sense for visual content
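For example, a chart in the bar-plus-pie layout described above could look roughly like this (the input path, column names, and styling are placeholders, not the report pipeline's actual conventions):

```python
# Illustrative sketch of a license-distribution chart in the bar-plus-pie
# layout the GCS report uses. Paths and column names are hypothetical.
import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv("flickr_processed.csv")  # assumed columns: LICENSE, COUNT

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(12, 5))
data.plot.barh(x="LICENSE", y="COUNT", ax=ax_bar, legend=False)
ax_bar.set_xlabel("Number of photos")
ax_pie.pie(data["COUNT"], labels=data["LICENSE"], autopct="%1.1f%%")
ax_pie.set_title("Share by license")
fig.suptitle("Flickr: photos per CC legal tool")
fig.tight_layout()
fig.savefig("flickr_license_distribution.png")
```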
Next Steps
I'd like to start by understanding the license mapping and what kinds of reports would be most useful. Then build the fetch script to collect only the data that will actually be used in reporting.
This way I can avoid doing a lot of fetch work that won't end up being helpful for the final reports.
Alternatives
- Build Flickr automation following a different pattern than GCS/GitHub
- Focus on other data sources first and return to Flickr later
- Start with a smaller subset of licenses to test the approach
Additional Context
- I'm still learning the codebase patterns, but I can see the structure used by other automated sources
- The scripts use Creative Commons license IDs that Flickr recognizes
- Currently limited to 500 photos per license, with output in JSON format
- Could potentially be modernized to follow the pattern used by other automated sources in scripts/1-fetch/