This repository was archived by the owner on Apr 22, 2026. It is now read-only.

Reduce Repository Size: Separate Code from Data (24GB → <15GB) #252

@pesap

Description


Problem

The ReEDS repository is 24GB, causing operational failures and slowing development.

Current state:

  • Working directory: 11GB (load scenario files)
  • Git history + LFS cache: 13GB
  • GitHub Actions runners have ~14GB available disk space
  • CI/CD jobs download 11GB of data, leaving insufficient space for builds and tests
  • This forces sequential test execution and causes workflow failures
  • Developers must clone entire 24GB history to contribute code changes

Impact on developers:

  • Clone times: 10+ minutes
  • Onboarding friction for new contributors
  • Unnecessary bandwidth usage

Root cause:
Large scientific data files (load scenarios) should not be stored in a code repository.
Code repositories are optimized for frequent updates, branching, and distributed development.
Data repositories require different management: archival, preservation, DOI registration, and long-term access.

Proposed Solution

Move large data files to Zenodo (NREL's existing data archival platform) and create a fresh code-only GitHub repository.

Why Zenodo?

  • Free permanent storage backed by CERN
  • Each dataset gets persistent DOI (citable in publications)
  • 10+ year preservation guarantee
  • Integrates with GitHub, publications, and funding agencies
  • Supports data versioning independent of software releases

What gets archived (9GB):

  • EER v2024 scenarios: 5GB (3 files: EER_100by2050, EER_IRAlow, EER_Baseline_AEO2023)
  • EER v2023 scenarios: 2.8GB (older versions, kept for historical validation)
  • Legacy study scenarios: 1.5GB (2018-2022 studies: Clean2035, EP variants)

What stays in repository:

  • Source code (Python, Julia, R)
  • Tests and validation
  • Configuration and CI/CD scripts
  • Documentation
  • Small test datasets (<50MB)

Expected Outcomes

| Metric | Current | After Migration |
| --- | --- | --- |
| Repository size | 24GB | 1-2GB |
| Clone time | 10+ minutes | 30-40 seconds |
| Push/pull operations | Slow | 5-10x faster |
| CI/CD parallelization | Limited (disk space) | Enabled |
| Data accessibility | Mixed with code | Persistent DOI |
| Data preservation | Tied to code repo | 10+ year guarantee |

Decision: Choose One Approach

Option A: Zenodo + Fresh Repository (Recommended)

Best for: Long-term sustainability, FAIR data compliance, organizational clarity

  • Create new GitHub repo with code only
  • Upload scenarios to Zenodo with DOIs
  • Archive current repo as read-only reference
  • CI/CD downloads datasets from Zenodo as needed
  • Developers access data via DATASETS.yaml configuration file with DOI references

Pros:

  • Cleanest separation of concerns
  • Sustainable pattern for future data management
  • Complies with funding agency (DOE, NSF) data archival requirements
  • Data updates independent of code releases
  • Datasets become citable in publications

Cons:

  • Requires initial effort to migrate to Zenodo
  • Developers must download data separately during setup
  • CI/CD scripts need updating for data access

Option B: Compress Data in Repository (uint16 Encoding)

Best for: Minimal disruption, keeping data with code

  • Compress load scenarios using uint16 + scalar multiplier
  • Reduces file sizes by 75% (0.76 MW/unit precision = <1% error)
  • Transparent decompression on data load
  • No code changes needed for analysis
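The encoding described above can be sketched as follows, assuming NumPy arrays of hourly MW values. The function names are illustrative, not existing ReEDS code; the numbers show why a ~50 GW peak yields roughly the 0.76 MW/step precision cited.

```python
import numpy as np

def encode_uint16(load_mw: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize a non-negative float load profile to uint16 codes plus a scalar multiplier."""
    scale = float(load_mw.max()) / np.iinfo(np.uint16).max  # MW per quantization step
    encoded = np.round(load_mw / scale).astype(np.uint16)
    return encoded, scale

def decode_uint16(encoded: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate MW values; rounding error is at most scale/2 per hour."""
    return encoded.astype(np.float64) * scale

# Synthetic hourly profile with a ~50 GW peak, one year of hours:
load = np.random.default_rng(0).uniform(20_000, 50_000, size=8760)
enc, scale = encode_uint16(load)
err = np.abs(decode_uint16(enc, scale) - load)
```

Storing uint16 codes (2 bytes) in place of float64 values (8 bytes) is the source of the 75% size reduction, and the worst-case absolute error is half a quantization step.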

Size reduction:

  • Current scenarios: 5GB → 1.25GB
  • Legacy scenarios: 2.8GB → 0.7GB
  • Total: 9GB → 2.25GB
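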

Pros:

  • No repository migration needed
  • No external dependencies (Zenodo account, network calls)
  • Data stays with code for reproducibility
  • Minimal analysis impact (<1% error for power systems)

Cons:

  • Still leaves repo at ~15GB (hits target but doesn't improve sustainability)
  • Data not independently preserved or citable
  • Doesn't align with FAIR data principles
  • Doesn't solve long-term growth concerns

Option C: Paid Storage (GitHub Enterprise)

Best for: Teams that prioritize simplicity over architecture and can absorb a recurring cost

  • Subscribe to GitHub Enterprise with additional storage
  • Unlimited storage and bandwidth
  • No data migration needed

Cost: $21+ per user per month

Pros:

  • Minimal operational effort
  • Familiar GitHub interface

Cons:

  • Ongoing cost per user (scales with team size)
  • Doesn't address fundamental architectural problem
  • No long-term data preservation guarantee
  • Data not independently citable or FAIR-compliant
  • Repository will continue growing

Implementation Paths

If Option A (Zenodo + Fresh Repository):

  1. Create new GitHub repository with code only
  2. Upload scenarios to Zenodo with metadata
  3. Obtain DOI for each dataset
  4. Update README with Zenodo download links
  5. Update CI/CD to fetch datasets from Zenodo
  6. Archive current repo as read-only

If Option B (uint16 Compression):

  1. Create compression utility (float64/float32 → uint16)
  2. Validate <1% error across all regions/hours
  3. Compress all load scenarios
  4. Update data loading code for transparent decompression
  5. Run test suite with compressed data
  6. Deploy compressed files
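Step 4 (transparent decompression) might look like the sketch below, assuming scenarios are stored as `.npz` archives with hypothetical keys `load` and `scale`; callers always receive float MW values regardless of whether the file on disk is compressed.

```python
import io
import numpy as np

def save_compressed(buf, load_mw: np.ndarray) -> None:
    """Write a load profile as uint16 codes plus the scalar multiplier."""
    scale = float(load_mw.max()) / np.iinfo(np.uint16).max
    codes = np.round(load_mw / scale).astype(np.uint16)
    np.savez_compressed(buf, load=codes, scale=scale)

def load_scenario(buf) -> np.ndarray:
    """Transparently decode: analysis code never sees the uint16 representation."""
    with np.load(buf) as npz:
        if npz["load"].dtype == np.uint16:
            return npz["load"].astype(np.float64) * float(npz["scale"])
        return npz["load"].astype(np.float64)  # legacy uncompressed files pass through

# Round-trip through an in-memory buffer:
original = np.linspace(1_000.0, 45_000.0, 8760)
buf = io.BytesIO()
save_compressed(buf, original)
buf.seek(0)
restored = load_scenario(buf)
```

Keeping the dtype check inside the loader is what makes step 4 "no code changes needed for analysis": downstream code calls the same loader either way.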

If Option C (Paid Storage):

  1. Evaluate GitHub Enterprise storage plans
  2. Compare costs vs. budget constraints
  3. Review bandwidth and collaboration features
  4. Estimate long-term cost trajectory
  5. Purchase plan and enable

Data Update Workflow

Once data is separated from code, developers and researchers can:

  • Update data independently: Upload new scenarios to Zenodo without code release
  • Specify data versions: Reference datasets by DOI in configuration file (DATASETS.yaml)
  • Version comparison: Easily test different data versions without cloning code branches
  • CI/CD flexibility: Download only needed datasets for each test job, avoiding disk space issues
  • Reproducibility: Exact data version traceable through DOI record
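To make the pinned-version workflow concrete, here is one way a setup script might verify that downloaded bytes match an entry from DATASETS.yaml. The field names (`name`, `doi`, `sha256`) and the DOI value are hypothetical, shown only to illustrate checksum pinning.

```python
import hashlib

# Hypothetical pinned entry, mirroring what a DATASETS.yaml record might contain:
PINNED = {
    "name": "EER_100by2050",
    "doi": "10.5281/zenodo.1234567",  # placeholder DOI
    "sha256": hashlib.sha256(b"example dataset bytes").hexdigest(),
}

def verify_download(data: bytes, pinned: dict) -> None:
    """Fail loudly if downloaded bytes don't match the pinned checksum."""
    digest = hashlib.sha256(data).hexdigest()
    if digest != pinned["sha256"]:
        raise RuntimeError(
            f"Checksum mismatch for {pinned['name']} ({pinned['doi']}): "
            f"expected {pinned['sha256']}, got {digest}"
        )

verify_download(b"example dataset bytes", PINNED)  # matching bytes pass silently
```

Pairing the DOI with a checksum gives CI a hard guarantee that a test run used exactly the archived data version, not just one with the same filename.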

Questions for Team Discussion

  1. Architecture preference: Do we prioritize code/data separation for long-term sustainability (Option A), or minimize migration effort (Option B/C)?
  2. Funding compliance: Do our grants require FAIR data principles and long-term preservation (DOE/NSF)? This favors Option A.
  3. Data governance: Should data be independently managed and updated? Option A enables this.
  4. CI/CD impact: Can we accept Zenodo downloads in workflows, or must data be local? Option B keeps data local.
  5. Timeline: When should this be implemented?

Notes on Terminology

  • LFS (Large File Storage): Git extension that tracks large files efficiently by committing small pointer files to the repository while storing the actual file contents in a separate object store
  • DOI (Digital Object Identifier): Persistent, citable identifier for datasets (e.g., 10.5281/zenodo.XXXXXXX)
  • FAIR (Findable, Accessible, Interoperable, Reusable): Data management framework required by funding agencies
  • Zenodo: CERN-backed open science repository; NREL's institutional data archival platform
  • Git LFS cache: Local storage of large file objects; currently 13GB in .git/lfs/
