Problem
The ReEDS repository is 24GB, causing operational failures and slowing development.
Current state:
- Working directory: 11GB (mostly load scenario files)
- Git history + LFS cache: 13GB
- GitHub Actions runners have ~14GB available disk space
- CI/CD jobs download 11GB of data, leaving insufficient space for builds and tests
- This forces sequential test execution and causes workflow failures
- Developers must clone entire 24GB history to contribute code changes
Impact on developers:
- Clone times: 10+ minutes
- Onboarding friction for new contributors
- Unnecessary bandwidth usage
Root cause:
Large scientific data files (load scenarios) should not be stored in a code repository.
Code repositories are optimized for frequent updates, branching, and distributed development.
Data repositories require different management: archival, preservation, DOI registration, and long-term access.
Proposed Solution
Move large data files to Zenodo (NREL's existing data archival platform) and create a fresh code-only GitHub repository.
Why Zenodo?
- Free permanent storage backed by CERN
- Each dataset receives a persistent DOI (citable in publications)
- 10+ year preservation guarantee
- Integrates with GitHub, publications, and funding agencies
- Supports data versioning independent of software releases
What gets archived (9GB):
- EER v2024 scenarios: 5GB (3 files: EER_100by2050, EER_IRAlow, EER_Baseline_AEO2023)
- EER v2023 scenarios: 2.8GB (older versions, kept for historical validation)
- Legacy study scenarios: 1.5GB (2018-2022 studies: Clean2035, EP variants)
What stays in repository:
- Source code (Python, Julia, R)
- Tests and validation
- Configuration and CI/CD scripts
- Documentation
- Small test datasets (<50MB)
Expected Outcomes
| Metric | Current | After Migration |
| --- | --- | --- |
| Repository size | 24GB | 1-2GB |
| Clone time | 10+ minutes | 30-40 seconds |
| Push/pull operations | Slow | 5-10x faster |
| CI/CD parallelization | Limited (disk space) | Enabled |
| Data accessibility | Mixed with code | Persistent DOI |
| Data preservation | Tied to code repo | 10+ year guarantee |
Decision: Choose One Approach
Option A: Zenodo + Fresh Repository (Recommended)
Best for: Long-term sustainability, FAIR data compliance, organizational clarity
- Create new GitHub repo with code only
- Upload scenarios to Zenodo with DOIs
- Archive current repo as read-only reference
- CI/CD downloads datasets from Zenodo as needed
- Developers access data via a DATASETS.yaml configuration file with DOI references (a minimal sketch follows this list)
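A minimal sketch of how this could look, assuming a DATASETS.yaml schema that has not been designed yet; the dataset key, record ID, filenames, and function name below are illustrative placeholders, and the download URL pattern is Zenodo's standard public-record file link (to be confirmed once the records exist):

```python
# Hypothetical sketch: resolve one DATASETS.yaml entry to local files, downloading
# from Zenodo on first use. The YAML schema, dataset key, record ID, and filename
# are placeholders, not a finalized design.
import pathlib
import urllib.request

import yaml  # PyYAML

EXAMPLE_DATASETS_YAML = """
eer_v2024:
  doi: 10.5281/zenodo.XXXXXXX      # placeholder DOI
  zenodo_record: "XXXXXXX"         # numeric record ID behind the DOI
  files:
    - EER_100by2050.csv.gz
"""

def fetch_dataset(name: str, cache_dir: str = ".data_cache") -> list[pathlib.Path]:
    """Return local paths for a dataset, downloading any missing files from Zenodo."""
    entry = yaml.safe_load(EXAMPLE_DATASETS_YAML)[name]
    cache = pathlib.Path(cache_dir) / name
    cache.mkdir(parents=True, exist_ok=True)
    paths = []
    for filename in entry["files"]:
        target = cache / filename
        if not target.exists():
            url = f"https://zenodo.org/records/{entry['zenodo_record']}/files/{filename}?download=1"
            urllib.request.urlretrieve(url, str(target))
        paths.append(target)
    return paths
```

Analysis code and CI would then request datasets by name (e.g. fetch_dataset("eer_v2024")) instead of assuming the files are present in the working tree.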
Pros:
- Cleanest separation of concerns
- Sustainable pattern for future data management
- Complies with funding agency (DOE, NSF) data archival requirements
- Data updates independent of code releases
- Datasets become citable in publications
Cons:
- Requires initial effort to migrate to Zenodo
- Developers must download data separately during setup
- CI/CD scripts need updating for data access
Option B: Compress Data in Repository (uint16 Encoding)
Best for: Minimal disruption, keeping data with code
- Compress load scenarios using uint16 values plus a scalar multiplier
- Reduces file sizes by 75% (quantization step of ≈0.76 MW, <1% error)
- Transparent decompression on data load
- No code changes needed for analysis (see the encoding sketch below)
Size reduction:
- EER v2024 scenarios: 5GB → 1.25GB
- EER v2023 scenarios: 2.8GB → 0.7GB
- Legacy study scenarios: 1.5GB → ~0.4GB
- Total: ~9GB → ~2.3GB
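For concreteness, a minimal sketch of the encoding under these assumptions: one scale factor per file, non-negative hourly MW values, and illustrative function names rather than the actual ReEDS I/O layer.

```python
# Sketch of the proposed uint16 + scalar-multiplier encoding.
import numpy as np

def encode_uint16(load_mw: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize float MW values to uint16 codes plus a single scale factor."""
    scale = float(load_mw.max()) / np.iinfo(np.uint16).max  # ~0.76 MW/unit for ~50 GW peaks
    codes = np.round(load_mw / scale).astype(np.uint16)     # 2 bytes vs. 8 for float64 (75% smaller)
    return codes, scale

def decode_uint16(codes: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate MW values; maximum absolute error is scale/2."""
    return codes.astype(np.float32) * scale

# Round-trip check on synthetic data (real scenarios would be read from file):
load = np.random.uniform(1_000, 50_000, size=8760)
codes, scale = encode_uint16(load)
recovered = decode_uint16(codes, scale)
print(f"scale = {scale:.3f} MW/unit, max abs error = {np.abs(load - recovered).max():.2f} MW")
```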
Pros:
- No repository migration needed
- No external dependencies (Zenodo account, network calls)
- Data stays with code for reproducibility
- Minimal analysis impact (<1% error for power systems)
Cons:
- Still leaves the repository at ~15GB including history (the working directory shrinks enough to meet the immediate CI target, but sustainability doesn't improve)
- Data not independently preserved or citable
- Doesn't align with FAIR data principles
- Doesn't solve long-term growth concerns
Option C: Paid Storage (GitHub Enterprise)
Best for: Teams with budget to prioritize simplicity over architectural change
- Subscribe to GitHub Enterprise with additional storage
- Unlimited storage and bandwidth
- No data migration needed
Cost: $21+ per user per month
Pros:
- Minimal operational effort
- Familiar GitHub interface
Cons:
- Ongoing cost per user (scales with team size)
- Doesn't address fundamental architectural problem
- No long-term data preservation guarantee
- Data not independently citable or FAIR-compliant
- Repository will continue growing
Implementation Paths
If Option A (Zenodo + Fresh Repository):
- Create new GitHub repository with code only
- Upload scenarios to Zenodo with metadata
- Obtain DOI for each dataset
- Update README with Zenodo download links
- Update CI/CD to fetch datasets from Zenodo (see the download sketch below)
- Archive current repo as read-only
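For the CI/CD step, a job could resolve a record's file list through Zenodo's public REST API and stream only the files that job needs. A hedged sketch, assuming the placeholder record ID and filename below and the documented /api/records endpoint; the exact response fields should be verified once the records are published:

```python
# Sketch of a CI step that pulls one file from a Zenodo record before tests run.
# RECORD_ID and FILE_NAME are placeholders until the datasets are actually published.
import requests

RECORD_ID = "XXXXXXX"             # numeric part of 10.5281/zenodo.XXXXXXX
FILE_NAME = "EER_100by2050.csv.gz"

def download_from_zenodo(record_id: str, file_name: str, dest: str) -> None:
    """Look up the record's file links via the Zenodo API and stream one file to disk."""
    record = requests.get(f"https://zenodo.org/api/records/{record_id}", timeout=30)
    record.raise_for_status()
    links = {f["key"]: f["links"]["self"] for f in record.json()["files"]}
    with requests.get(links[file_name], stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                out.write(chunk)

if __name__ == "__main__":
    download_from_zenodo(RECORD_ID, FILE_NAME, FILE_NAME)
```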
If Option B (uint16 Compression):
- Create compression utility (float64/float32 → uint16)
- Validate <1% error across all regions/hours (see the validation sketch below)
- Compress all load scenarios
- Update data loading code for transparent decompression
- Run test suite with compressed data
- Deploy compressed files
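The validation step can be a straightforward round trip over every region and hour. A standalone sketch, using synthetic data in place of a real scenario and the 1% tolerance stated in this proposal; if small-load regions exceed the tolerance under a single global scale factor, a per-region scale factor would be a possible refinement:

```python
# Sketch of the acceptance check: round-trip a scenario through the uint16 encoding
# and flag any region whose worst hourly error exceeds 1%. Standalone version; a real
# check would import the compression utility written in the first step.
import numpy as np

def max_relative_error_per_region(load_mw: np.ndarray) -> np.ndarray:
    """load_mw: regions x hours array in MW. Returns the worst relative error per region."""
    scale = load_mw.max() / np.iinfo(np.uint16).max           # single scalar multiplier
    recovered = np.round(load_mw / scale).astype(np.uint16) * scale
    denom = np.where(load_mw > 0, load_mw, np.inf)            # skip exact zeros
    return (np.abs(recovered - load_mw) / denom).max(axis=1)

# Synthetic stand-in for a load scenario (regions x hours):
scenario = np.random.uniform(50, 40_000, size=(100, 8760))
errors = max_relative_error_per_region(scenario)
print(f"{int((errors > 0.01).sum())} of {len(errors)} regions exceed the 1% tolerance")
```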
If Option C (Paid Storage):
- Evaluate GitHub Enterprise storage plans
- Compare costs vs. budget constraints
- Review bandwidth and collaboration features
- Estimate long-term cost trajectory
- Purchase plan and enable
Data Update Workflow
Once data is separated from code, developers and researchers can:
- Update data independently: Upload new scenarios to Zenodo without code release
- Specify data versions: Reference datasets by DOI in the configuration file (DATASETS.yaml)
- Version comparison: Easily test different data versions without cloning code branches
- CI/CD flexibility: Download only needed datasets for each test job, avoiding disk space issues
- Reproducibility: Exact data version traceable through DOI record
Questions for Team Discussion
- Architecture preference: Do we prioritize code/data separation for long-term sustainability (Option A), or minimize migration effort (Option B/C)?
- Funding compliance: Do our grants require FAIR data principles and long-term preservation (DOE/NSF)? This favors Option A.
- Data governance: Should data be independently managed and updated? Option A enables this.
- CI/CD impact: Can we accept Zenodo downloads in workflows, or must data be local? Option B keeps data local.
- Timeline: When should this be implemented?
Notes on Terminology
- LFS (Large File Storage): Git extension that tracks large files by committing lightweight pointer files instead of storing the full file contents in the .git object database
- DOI (Digital Object Identifier): Persistent, citable identifier for datasets (e.g., 10.5281/zenodo.XXXXXXX)
- FAIR (Findable, Accessible, Interoperable, Reusable): Data management framework required by funding agencies
- Zenodo: CERN-backed open science repository; NREL's institutional data archival platform
- Git LFS cache: Local storage of large file objects; currently 13GB in .git/lfs/