[Feature] Automate Flickr Data Fetching #164

@SaurabhCodesAI

Caution

Because it requires a paid subscription (Flickr Pro), Flickr is not currently a viable data source.

Problem

I've been exploring the codebase and noticed that Flickr data collection is still in pre-automation/flickr/ while other sources like Google Custom Search and GitHub have been automated. The current Flickr scripts appear to be manual and haven't been integrated with the automated pipeline yet.

Description

Looking at pre-automation/flickr/photos.py, it currently:

  • Searches for photos with different Creative Commons licenses (IDs: 1,2,3,4,5,6,9,10)
  • Gets 500 photos per license
  • Saves everything to a JSON file

I noticed some differences compared to the automated scripts:

  • Output: Flickr uses JSON, but GCS/GitHub use CSV
  • Arguments: The automated scripts have --enable-save, --enable-git etc.
  • Setup: Automated scripts use shared.setup() instead of the old quantify.setup()
  • Integration: They integrate with git automation and quarterly folders
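
To make the argument difference concrete, here is a minimal sketch of how a Flickr script could accept the two flags named above. Only `--enable-save` and `--enable-git` come from this issue; the function name `parse_arguments`, the help text, and the defaults are my assumptions, and the real automated scripts may define more options.

```python
import argparse


def parse_arguments(argv=None):
    """Parse the two flags named in this issue; both default to off."""
    parser = argparse.ArgumentParser(description="Fetch Flickr license data")
    parser.add_argument(
        "--enable-save",
        action="store_true",
        help="Write fetched data to disk (help text is a guess)",
    )
    parser.add_argument(
        "--enable-git",
        action="store_true",
        help="Run git actions on saved data (help text is a guess)",
    )
    return parser.parse_args(argv)


args = parse_arguments(["--enable-save"])
print(args.enable_save, args.enable_git)  # True False
```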

The Pipfile already mentions flickrapi and has a planned script path: flickr_fetched = "./scripts/1-fetch/flickr_fetched.py"

Implementation Plan

Following the guidance to plan each phase before implementation, and after studying the 2025Q3 report structure:

Research Findings

The GCS report is far more detailed than I expected. It doesn't just count licenses; it also breaks the data down by country and language and groups licenses into categories. The charts all share one layout (bars on the left, pie chart on the right) and use the same colors throughout.

Looking at this made me understand why careful planning matters: with such a large Flickr dataset, I could easily collect a lot of data that doesn't help produce useful reports.

Fetch Phase

Build new automation (using pre-automation as reference only):

  • Create scripts/1-fetch/flickr_fetch.py following the same pattern as gcs_fetch.py
  • Use the same setup method (shared.setup()), the same command-line options, and CSV output
  • Focus on collecting data that will actually be useful for the reporting phase
  • Get license types (1,2,3,4,5,6,9,10) plus any metadata that supports analysis
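
A rough sketch of what the fetch step could look like, assuming the search call is passed in as a function so the logic can be tried without an API key. The column names, the helper names, and the assumed response shape (`{"photos": {"total": ...}}`, which is how I understand Flickr's JSON search responses) are all assumptions to check against the real API and against gcs_fetch.py.

```python
import csv
import io

LICENSE_IDS = [1, 2, 3, 4, 5, 6, 9, 10]  # IDs listed in this issue
FIELDNAMES = ["LICENSE_ID", "COUNT"]  # hypothetical column names


def collect_license_counts(search, license_ids=LICENSE_IDS):
    """Return one row per license ID using the injected search callable."""
    rows = []
    for license_id in license_ids:
        # Assumed response shape: {"photos": {"total": <int or str>}}
        response = search(license=license_id, per_page=1)
        total = int(response["photos"]["total"])
        rows.append({"LICENSE_ID": license_id, "COUNT": total})
    return rows


def write_csv(rows, file_obj):
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(
        file_obj, fieldnames=FIELDNAMES, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


def fake_search(license, per_page):
    """Stand-in for the real API call; returns made-up totals."""
    return {"photos": {"total": str(license * 100)}}


buffer = io.StringIO()
write_csv(collect_license_counts(fake_search, [4, 9]), buffer)
print(buffer.getvalue())
```

In a real script, `search` would presumably be the `photos.search` method of a `flickrapi.FlickrAPI` client created with `format="parsed-json"`, but I have not verified that against the current library version.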

Process Phase

Transform data to support reporting:

  • Map Flickr license IDs to actual CC license names and versions using the Flickr API reference
  • Group licenses the same way as GCS (Latest/Prior/Retired categories)
  • Process any geographic or language data to match the report structure
  • Make sure everything aligns with what will actually be shown in charts
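
As a starting point for the license mapping, here is a sketch of a lookup table for the eight IDs in this issue. The ID-to-license pairs reflect my reading of the `flickr.photos.licenses.getInfo` documentation and should be verified against the live API. The status rule (newest version of a tool is "Latest", anything older is "Prior") is my assumption about how the GCS grouping works, and none of these IDs would land in "Retired" under it.

```python
# Flickr license ID -> (tool, version).
# Assumption: taken from my reading of flickr.photos.licenses.getInfo;
# verify before relying on it.
FLICKR_LICENSES = {
    1: ("CC BY-NC-SA", "2.0"),
    2: ("CC BY-NC", "2.0"),
    3: ("CC BY-NC-ND", "2.0"),
    4: ("CC BY", "2.0"),
    5: ("CC BY-SA", "2.0"),
    6: ("CC BY-ND", "2.0"),
    9: ("CC0", "1.0"),
    10: ("PDM", "1.0"),
}

# Assumption: the newest version of each tool counts as "Latest".
LATEST_VERSION = {"CC0": "1.0", "PDM": "1.0"}
DEFAULT_LATEST = "4.0"  # newest version of the CC BY family of licenses


def describe_license(license_id):
    """Return (display name, status) for a Flickr license ID."""
    tool, version = FLICKR_LICENSES[license_id]
    latest = LATEST_VERSION.get(tool, DEFAULT_LATEST)
    status = "Latest" if version == latest else "Prior"
    return f"{tool} {version}", status


print(describe_license(4))  # ('CC BY 2.0', 'Prior')
print(describe_license(9))  # ('CC0 1.0', 'Latest')
```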

Report Phase

Create visualizations that add value:

  • License distribution analysis (similar to GCS "Products totals")
  • Status breakdown (Latest/Prior/Retired)
  • Geographic or language analysis if the data supports it
  • Focus on analyses that make sense for visual content
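
For the status breakdown, the numbers behind a bar or pie chart could be computed roughly like this. It is only a sketch: the column names (`TOOL`, `STATUS`, `COUNT`) are hypothetical, and the sample counts are made up for illustration, not real Flickr data.

```python
from collections import defaultdict


def status_shares(rows):
    """Return each status's share of the total count, in percent."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["STATUS"]] += row["COUNT"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # nothing to report
    return {
        status: round(100 * count / grand_total, 1)
        for status, count in totals.items()
    }


# Made-up counts, for illustration only
sample = [
    {"TOOL": "CC BY 2.0", "STATUS": "Prior", "COUNT": 300},
    {"TOOL": "CC BY-SA 2.0", "STATUS": "Prior", "COUNT": 450},
    {"TOOL": "CC0 1.0", "STATUS": "Latest", "COUNT": 250},
]
print(status_shares(sample))  # {'Prior': 75.0, 'Latest': 25.0}
```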

Next Steps

I'd like to start by understanding the license mapping and what kinds of reports would be most useful. Then build the fetch script to collect only the data that will actually be used in reporting.

This way I can avoid doing a lot of fetch work that won't end up being helpful for the final reports.

Alternatives

  • Build Flickr automation following a different pattern than GCS/GitHub
  • Focus on other data sources first and return to Flickr later
  • Start with a smaller subset of licenses to test the approach

Additional Context

  • I'm still learning the codebase patterns, but I can see the structure used by other automated sources
  • The scripts use Creative Commons license IDs that Flickr recognizes
  • The existing script is limited to 500 photos per license and writes its output as JSON
  • It could be modernized to follow the pattern used by the other automated sources in scripts/1-fetch/

Metadata

Assignees

No one assigned

    Labels

    ✨ goal: improvement (Improvement to an existing feature)
    💻 aspect: code (Concerns the software code in the repository)
    🚧 status: blocked (Blocked & therefore, not ready for work)
    🟩 priority: low (Low priority and doesn't need to be rushed)

    Projects

    Status

    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
