Scalability: proof bundle storage strategy for growing certifications #3

@astefano

Summary

Each Z3 proof bundle is ~22 MB raw (45 functions, pmemlog). This is fine for a few projects, but will hit GitHub's repo size limits at scale. We need a storage strategy before scaling up.

Current numbers (pmemlog, 45 functions)

| Artifact | Raw size | Compressed (gzip) |
|---|---|---|
| smt_queries/ (45 .smt2 files) | 12 MB | ~1.5 MB |
| z3_proofs/ (45 .proof files) | 11 MB | ~1.4 MB |
| proofs.json | 64 KB | negligible |
| Total per bundle | ~22 MB | ~2.9 MB |

For comparison, 10 certifications of results + specs are only 1.2 MB total, or about 0.12 MB per certification. A single proof bundle is roughly two orders of magnitude larger than the results + specs for one certification.

Scaling projections

| Scenario | Raw git growth | Time to 5 GB soft limit |
|---|---|---|
| 1 project, 1 cert/week | ~1.1 GB/year | ~4.5 years |
| 5 projects, monthly | ~1.3 GB/year | ~3.8 years |
| 20 projects, monthly | ~5.3 GB/year | < 1 year |
| 20 projects, weekly | ~23 GB/year | ~3 months |
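The projections can be recomputed with a few lines of arithmetic. This is a minimal sketch, assuming 22 MB raw per bundle, 52 or 12 certifications per year, and decimal units (1 GB = 1000 MB). The function names are illustrative and not part of the project. Results differ slightly from the table wherever the table rounds intermediate values.

```python
BUNDLE_MB = 22.0      # raw size of one proof bundle (pmemlog, 45 functions)
SOFT_LIMIT_GB = 5.0   # GitHub soft limit used in the projections


def yearly_growth_gb(projects: int, certs_per_year: int,
                     bundle_mb: float = BUNDLE_MB) -> float:
    """Raw growth per year in GB, assuming 1 GB = 1000 MB."""
    return projects * certs_per_year * bundle_mb / 1000


def years_to_limit(projects: int, certs_per_year: int,
                   limit_gb: float = SOFT_LIMIT_GB,
                   bundle_mb: float = BUNDLE_MB) -> float:
    """Years until the repository reaches limit_gb at a constant rate."""
    return limit_gb / yearly_growth_gb(projects, certs_per_year, bundle_mb)


if __name__ == "__main__":
    scenarios = [
        ("1 project, 1 cert/week", 1, 52),
        ("5 projects, monthly", 5, 12),
        ("20 projects, monthly", 20, 12),
        ("20 projects, weekly", 20, 52),
    ]
    for label, projects, certs in scenarios:
        growth = yearly_growth_gb(projects, certs)
        years = years_to_limit(projects, certs)
        print(f"{label}: {growth:.1f} GB/year, {years:.2f} years to limit")
```

Passing `bundle_mb=2.9` gives the same projection for the gzip-compressed size, which is probably closer to what the repository would actually store.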

Git compresses well (22 MB → 2.9 MB with gzip), so actual repository growth should be slower than the raw figures above. But git never forgets: every bundle stays in the history forever.

GitHub limits

  • Repo size soft limit: 5 GB (GitHub warns)
  • Repo size hard limit: 100 GB (GitHub may restrict pushes)
  • Single file limit: 100 MB (we're fine — largest is 2.2 MB)
  • Git LFS: 1 GB free storage, 1 GB bandwidth/month

Alternatives to evaluate

Option A: Git LFS (easiest migration)

  • Move .smt2 and .proof files to Git LFS
  • Git repo stays small; LFS stores blobs externally
  • Pros: minimal code changes, same workflow
  • Cons: still paying for storage ($5/month per 50 GB), LFS has bandwidth limits

Option B: Keep only latest, archive to external storage

  • Store only proofs/latest/ in git (overwrite each time — no history accumulation)
  • Archive timestamped bundles to IPFS, S3, or GitHub Releases
  • Store the bundle's content hash in history.json for verification
  • Pros: git stays lean forever; external storage is cheap
  • Cons: needs upload/fetch code for external store
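The verification half of Option B might look like the sketch below. It assumes SHA-256 as the content hash and treats history.json as a JSON list of entries. The field names (`bundle`, `sha256`, `location`) are hypothetical, since this issue does not specify the history.json schema. Upload and fetch are left out because they depend on the external store chosen.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large bundles are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_bundle(history_path: Path, bundle_path: Path, location: str) -> dict:
    """Append an entry for an archived bundle (hypothetical schema)."""
    entry = {
        "bundle": bundle_path.name,
        "sha256": sha256_file(bundle_path),
        "location": location,  # e.g. an S3 key, release URL, or IPFS CID
    }
    history = json.loads(history_path.read_text()) if history_path.exists() else []
    history.append(entry)
    history_path.write_text(json.dumps(history, indent=2) + "\n")
    return entry


def verify_bundle(bundle_path: Path, entry: dict) -> bool:
    """Check a fetched bundle against the hash recorded in history.json."""
    return sha256_file(bundle_path) == entry["sha256"]
```

With this in place, git only needs to carry one small JSON entry per certification, and anyone who fetches an archived bundle can confirm it matches what was certified.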

Option C: IPFS (most aligned with decentralization)

  • Upload each proof bundle to IPFS, get a CID
  • Store only the CID in history.json (alongside the Merkle hash on-chain)
  • Anyone can pin and retrieve the bundle via CID
  • Pros: content-addressable (CID = hash of content), decentralized, persistent as long as at least one node keeps the bundle pinned
  • Cons: needs pinning service (Pinata, web3.storage) for reliability

Option D: GitHub Releases as artifact storage

  • Attach compressed proof bundles (.tar.gz) as release assets
  • history.json links to the release asset URL
  • GitHub Releases have a 2 GB per-file limit and no cap on total release size
  • Pros: free, integrated with GitHub, no new infrastructure
  • Cons: not content-addressable, relies on GitHub availability
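Because release assets are not content-addressable, it helps if packing the same bundle twice yields byte-identical archives, so a recorded hash can be reproduced later. The sketch below shows one way to do that with Python's standard `tarfile` and `gzip` modules. The function name and the metadata normalization are assumptions, not existing project code.

```python
import gzip
import tarfile
from pathlib import Path


def pack_bundle(bundle_dir: Path, out_path: Path) -> Path:
    """Write bundle_dir to a reproducible .tar.gz at out_path.

    out_path should live outside bundle_dir so the archive does not
    try to include itself.
    """
    def normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
        # Strip metadata that varies between machines and runs.
        info.mtime = 0
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        info.mode = 0o644
        return info

    files = sorted(p for p in bundle_dir.rglob("*") if p.is_file())
    with open(out_path, "wb") as raw:
        # mtime=0 and an empty filename keep the gzip header constant.
        with gzip.GzipFile(filename="", fileobj=raw, mode="wb", mtime=0) as gz:
            with tarfile.open(fileobj=gz, mode="w") as tar:
                for path in files:
                    arcname = path.relative_to(bundle_dir).as_posix()
                    tar.add(path, arcname=arcname, filter=normalize)
    return out_path
```

The resulting file can be attached as a release asset, with its hash stored next to the asset URL in history.json.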

Option E: Compress + deduplicate in-repo

  • Store bundles as .tar.gz instead of expanded directories
  • Deduplicate shared SMT preambles (most .smt2 files share ~80% of content)
  • Could reduce per-bundle size from 22 MB to ~3 MB
  • Pros: no external dependencies
  • Cons: still grows linearly, just slower
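Preamble deduplication could be sketched as follows, assuming the shared content is a common leading run of lines across all .smt2 files in a bundle. The issue only states that about 80% of the content is shared, so a prefix-only approach is an assumption and the function names are illustrative.

```python
from itertools import takewhile


def split_shared_preamble(files: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Split files into (shared leading lines, per-file remainder)."""
    line_lists = {name: text.splitlines(keepends=True)
                  for name, text in files.items()}
    # Walk the files in lockstep and stop at the first differing line.
    shared = [
        group[0]
        for group in takewhile(lambda g: len(set(g)) == 1,
                               zip(*line_lists.values()))
    ]
    n = len(shared)
    bodies = {name: "".join(lines[n:]) for name, lines in line_lists.items()}
    return "".join(shared), bodies


def restore(preamble: str, body: str) -> str:
    """Rebuild the original file content from preamble and body."""
    return preamble + body
```

The preamble would be stored once per bundle and each query file reduced to its body. `restore` gives back the exact original text, so any hash computed over the expanded files stays valid.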

Recommended approach

  • Short-term (now → 10 projects): Keep current approach. Sizes are manageable.
  • Medium-term (10+ projects): Option B + D — keep only latest in git, attach compressed bundles to GitHub Releases, store the release URL + hash in history.json.
  • Long-term: Option C (IPFS) — proof bundle CID alongside on-chain Merkle hash for full decentralization alignment.
