This repository was archived by the owner on Apr 22, 2026. It is now read-only.

Reduce Repository Size: Separate Code from Data (24GB → <15GB) #252

@pesap

Description


Problem

The ReEDS repository is 24GB, causing operational failures and slowing development.

Current state:

  • Working directory: 11GB (load scenario files)
  • Git history + LFS cache: 13GB
  • GitHub Actions runners have ~14GB available disk space
  • CI/CD jobs download 11GB of data, leaving insufficient space for builds and tests
  • This forces sequential test execution and causes workflow failures
  • Developers must clone entire 24GB history to contribute code changes

Impact on developers:

  • Clone times: 10+ minutes
  • Onboarding friction for new contributors
  • Unnecessary bandwidth usage

Root cause:
Large scientific data files (load scenarios) should not be stored in a code repository.
Code repositories are optimized for frequent updates, branching, and distributed development.
Data repositories require different management: archival, preservation, DOI registration, and long-term access.

Proposed Solution

Move large data files to Zenodo (NREL's existing data archival platform) and create a fresh code-only GitHub repository.

Why Zenodo?

  • Free permanent storage backed by CERN
  • Each dataset gets persistent DOI (citable in publications)
  • 10+ year preservation guarantee
  • Integrates with GitHub, publications, and funding agencies
  • Supports data versioning independent of software releases

What gets archived (9GB):

  • EER v2024 scenarios: 5GB (3 files: EER_100by2050, EER_IRAlow, EER_Baseline_AEO2023)
  • EER v2023 scenarios: 2.8GB (older versions, kept for historical validation)
  • Legacy study scenarios: 1.5GB (2018-2022 studies: Clean2035, EP variants)

What stays in repository:

  • Source code (Python, Julia, R)
  • Tests and validation
  • Configuration and CI/CD scripts
  • Documentation
  • Small test datasets (<50MB)

Expected Outcomes

| Metric | Current | After Migration |
| --- | --- | --- |
| Repository size | 24GB | 1-2GB |
| Clone time | 10+ minutes | 30-40 seconds |
| Push/pull operations | Slow | 5-10x faster |
| CI/CD parallelization | Limited (disk space) | Enabled |
| Data accessibility | Mixed with code | Persistent DOI |
| Data preservation | Tied to code repo | 10+ year guarantee |

Decision: Choose One Approach

Option A: Zenodo + Fresh Repository (Recommended)

Best for: Long-term sustainability, FAIR data compliance, organizational clarity

  • Create new GitHub repo with code only
  • Upload scenarios to Zenodo with DOIs
  • Archive current repo as read-only reference
  • CI/CD downloads datasets from Zenodo as needed
  • Developers access data via DATASETS.yaml configuration file with DOI references

Pros:

  • Cleanest separation of concerns
  • Sustainable pattern for future data management
  • Complies with funding agency (DOE, NSF) data archival requirements
  • Data updates independent of code releases
  • Datasets become citable in publications

Cons:

  • Requires initial effort to migrate to Zenodo
  • Developers must download data separately during setup
  • CI/CD scripts need updating for data access

Option B: Compress Data in Repository (uint16 Encoding)

Best for: Minimal disruption, keeping data with code

  • Compress load scenarios using uint16 + scalar multiplier
  • Reduces file sizes by 75% (0.76 MW/unit precision = <1% error)
  • Transparent decompression on data load
  • No code changes needed for analysis
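The encoding described above can be sketched as follows, assuming NumPy arrays of hourly MW values. The function names are illustrative, not existing ReEDS code; the numbers show why a ~50 GW peak yields roughly the 0.76 MW/step precision cited.

```python
import numpy as np

def encode_uint16(load_mw: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize a non-negative float load profile to uint16 codes plus a scalar multiplier."""
    scale = float(load_mw.max()) / np.iinfo(np.uint16).max  # MW per quantization step
    encoded = np.round(load_mw / scale).astype(np.uint16)
    return encoded, scale

def decode_uint16(encoded: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate MW values; rounding error is at most scale/2 per hour."""
    return encoded.astype(np.float64) * scale

# Synthetic hourly profile with a ~50 GW peak, one year of hours:
load = np.random.default_rng(0).uniform(20_000, 50_000, size=8760)
enc, scale = encode_uint16(load)
err = np.abs(decode_uint16(enc, scale) - load)
```

Storing uint16 codes (2 bytes) in place of float64 values (8 bytes) is the source of the 75% size reduction, and the worst-case absolute error is half a quantization step.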

Size reduction:

  • Current scenarios: 5GB → 1.25GB
  • Legacy scenarios: 2.8GB → 0.7GB
  • Total: 9GB → 2.25GB
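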

Pros:

  • No repository migration needed
  • No external dependencies (Zenodo account, network calls)
  • Data stays with code for reproducibility
  • Minimal analysis impact (<1% error for power systems)

Cons:

  • Still leaves repo at ~15GB (hits target but doesn't improve sustainability)
  • Data not independently preserved or citable
  • Doesn't align with FAIR data principles
  • Doesn't solve long-term growth concerns

Option C: Paid Storage (GitHub Enterprise)

Best for: Teams that prioritize simplicity over architecture and can absorb a recurring cost

  • Subscribe to GitHub Enterprise with additional storage
  • Unlimited storage and bandwidth
  • No data migration needed

Cost: $21+ per user per month

Pros:

  • Minimal operational effort
  • Familiar GitHub interface

Cons:

  • Ongoing cost per user (scales with team size)
  • Doesn't address fundamental architectural problem
  • No long-term data preservation guarantee
  • Data not independently citable or FAIR-compliant
  • Repository will continue growing

Implementation Paths

If Option A (Zenodo + Fresh Repository):

  1. Create new GitHub repository with code only
  2. Upload scenarios to Zenodo with metadata
  3. Obtain DOI for each dataset
  4. Update README with Zenodo download links
  5. Update CI/CD to fetch datasets from Zenodo
  6. Archive current repo as read-only

If Option B (uint16 Compression):

  1. Create compression utility (float64/float32 → uint16)
  2. Validate <1% error across all regions/hours
  3. Compress all load scenarios
  4. Update data loading code for transparent decompression
  5. Run test suite with compressed data
  6. Deploy compressed files
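Step 4 (transparent decompression) might look like the sketch below, assuming scenarios are stored as `.npz` archives with hypothetical keys `load` and `scale`; callers always receive float MW values regardless of whether the file on disk is compressed.

```python
import io
import numpy as np

def save_compressed(buf, load_mw: np.ndarray) -> None:
    """Write a load profile as uint16 codes plus the scalar multiplier."""
    scale = float(load_mw.max()) / np.iinfo(np.uint16).max
    codes = np.round(load_mw / scale).astype(np.uint16)
    np.savez_compressed(buf, load=codes, scale=scale)

def load_scenario(buf) -> np.ndarray:
    """Transparently decode: analysis code never sees the uint16 representation."""
    with np.load(buf) as npz:
        if npz["load"].dtype == np.uint16:
            return npz["load"].astype(np.float64) * float(npz["scale"])
        return npz["load"].astype(np.float64)  # legacy uncompressed files pass through

# Round-trip through an in-memory buffer:
original = np.linspace(1_000.0, 45_000.0, 8760)
buf = io.BytesIO()
save_compressed(buf, original)
buf.seek(0)
restored = load_scenario(buf)
```

Keeping the dtype check inside the loader is what makes step 4 "no code changes needed for analysis": downstream code calls the same loader either way.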

If Option C (Paid Storage):

  1. Evaluate GitHub Enterprise storage plans
  2. Compare costs vs. budget constraints
  3. Review bandwidth and collaboration features
  4. Estimate long-term cost trajectory
  5. Purchase plan and enable

Data Update Workflow

Once data is separated from code, developers and researchers can:

  • Update data independently: Upload new scenarios to Zenodo without code release
  • Specify data versions: Reference datasets by DOI in configuration file (DATASETS.yaml)
  • Version comparison: Easily test different data versions without cloning code branches
  • CI/CD flexibility: Download only needed datasets for each test job, avoiding disk space issues
  • Reproducibility: Exact data version traceable through DOI record
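To make the pinned-version workflow concrete, here is one way a setup script might verify that downloaded bytes match an entry from DATASETS.yaml. The field names (`name`, `doi`, `sha256`) and the DOI value are hypothetical, shown only to illustrate checksum pinning.

```python
import hashlib

# Hypothetical pinned entry, mirroring what a DATASETS.yaml record might contain:
PINNED = {
    "name": "EER_100by2050",
    "doi": "10.5281/zenodo.1234567",  # placeholder DOI
    "sha256": hashlib.sha256(b"example dataset bytes").hexdigest(),
}

def verify_download(data: bytes, pinned: dict) -> None:
    """Fail loudly if downloaded bytes don't match the pinned checksum."""
    digest = hashlib.sha256(data).hexdigest()
    if digest != pinned["sha256"]:
        raise RuntimeError(
            f"Checksum mismatch for {pinned['name']} ({pinned['doi']}): "
            f"expected {pinned['sha256']}, got {digest}"
        )

verify_download(b"example dataset bytes", PINNED)  # matching bytes pass silently
```

Pairing the DOI with a checksum gives CI a hard guarantee that a test run used exactly the archived data version, not just one with the same filename.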

Questions for Team Discussion

  1. Architecture preference: Do we prioritize code/data separation for long-term sustainability (Option A), or minimize migration effort (Option B/C)?
  2. Funding compliance: Do our grants require FAIR data principles and long-term preservation (DOE/NSF)? This favors Option A.
  3. Data governance: Should data be independently managed and updated? Option A enables this.
  4. CI/CD impact: Can we accept Zenodo downloads in workflows, or must data be local? Option B keeps data local.
  5. Timeline: When should this be implemented?

Notes on Terminology

  • LFS (Large File Storage): Git extension that tracks large files efficiently by committing small pointer files to the repository while storing the actual file contents in a separate object store
  • DOI (Digital Object Identifier): Persistent, citable identifier for datasets (e.g., 10.5281/zenodo.XXXXXXX)
  • FAIR (Findable, Accessible, Interoperable, Reusable): Data management framework required by funding agencies
  • Zenodo: CERN-backed open science repository; NREL's institutional data archival platform
  • Git LFS cache: Local storage of large file objects; currently 13GB in .git/lfs/
