bcgovpond is an opinionated R package for managing immutable research data using a
data-pond pattern: append-only raw files, explicit metadata, and stable logical
pointers (“views”) that decouple analysis code from physical file names.
This package was built for real research workflows, not for abstract elegance. It assumes that people:
- dump CSV/XLSX files into folders,
- forget what version they used six months ago,
- want reproducibility without constant babysitting,
- and mostly work in R.
If that sounds familiar, this package is for you.
Most research projects fail at one (or more) of the following:
- Raw data silently changes
- Files are overwritten without record
- Analysis scripts hard-code file paths
- “Final” datasets cannot be reconstructed
- Metadata lives in people’s heads
bcgovpond addresses these problems by enforcing a few simple rules:
- Raw data is immutable
- Every file has metadata
- Analysis code never points directly to raw files
- Logical names (“views”) can be updated, but history is preserved
Most users will only ever call create_bcgov_pond_project() (once per project), ingest_pond() (when new data arrives), and read_view() (to access the data).
Install directly from GitHub using pak:
install.packages("pak")
pak::pak("bcgov/bcgovpond")
On Windows, you may be prompted to install Rtools. This is optional for this package, but recommended if you plan to use pak more broadly.
The pond is where canonical raw data lives.
- Files are moved into the pond
- Files are never edited in place
- New versions are added, not overwritten
- Files may optionally be made immutable at the filesystem level
Think of this as cold storage for raw inputs.
Every file in the pond has a corresponding YAML metadata file describing:
- source
- contents
- structure
- relevant identifiers (as available)
Metadata files are tracked in git.
Raw data files usually are not.
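As a rough illustration, a metadata file might record something like the following. The field names and layout here are a sketch only; the actual schema is generated by the ingestion functions.

# data_index/meta/2021_census_industry.yml (illustrative sketch, not the generated schema)
source: where the file came from (agency, program, extract ID)
contents: what the file contains, in a sentence or two
structure: how the data is laid out (e.g. one row per region and industry)
identifiers: key columns or codes, where available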
A view is a small YAML pointer that maps a stable logical name to a specific physical file.
Analysis scripts load data via views, not raw file paths.
When a new version of a dataset arrives:
- the old file stays in the pond
- the view is updated to point to the new file
- old analyses can still be reproduced
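For example, a view file might look something like this. The field names are illustrative; the actual files are created and maintained by ingest_pond().

# data_index/views/census_industry.yml (illustrative sketch)
# When a new extract arrives, only the file entry changes; the old file stays in the pond.
name: census_industry
file: data_store/data_pond/2021_census_industry.xlsx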
New files land here first.
bcgovpond ingestion functions:
- inspect the file
- generate metadata
- move it into the pond
- update or create the appropriate view
This keeps ingestion boring and repeatable.
data_store/
├── add_to_pond/ # incoming raw files
├── data_pond/ # immutable canonical raw data (not tracked in git)
└── data_parquet/ # derived parquet / Arrow outputs (not tracked)
data_index/
├── meta/ # YAML metadata (tracked in git)
└── views/ # logical pointers (tracked in git)
You commit data_index/, not the raw or derived data in data_store/.
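In practice this usually means a .gitignore along these lines (illustrative; create_bcgov_pond_project() may already set this up for you):

data_store/data_pond/
data_store/data_parquet/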
For a new analysis project, initialize the standard data-pond structure once:
create_bcgov_pond_project()
This creates the required directories:
- data_store/ (raw and derived data, not tracked in git)
- data_index/ (metadata and views, tracked in git)
After this initial setup, most users will only need
ingest_pond() and read_view().
- Drop new CSV or XLSX files into data_store/add_to_pond/
- Run:
ingest_pond()
- Load data in analysis code using views:
tb <- read_view("census_industry")
That’s it.
No file paths. No version handling in analysis code.
Raw file names must follow this pattern:
specific-info_general-info.ext
- The first underscore is meaningful
- specific-info changes over time (year, extract ID, version)
- general-info identifies the dataset concept
Examples:
- 2021_census_industry.xlsx
- RTRA3605542_agenaics.csv
Do not overwrite files already in the pond.
New versions must always have new filenames.
If you need to see which physical file a view currently points to:
resolve_current("census_industry")
This is useful for:
- debugging unexpected results
- auditing data provenance
- confirming which raw file is active
Most analysis code should not need this.
Parquet files are treated as derived artifacts, not primary data.
- They exist for performance and convenience
- They may be deleted and regenerated at any time
- They are not authoritative
- Views should never treat Parquet as the source of truth
Parquet is a cache, not a source of truth.
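Because Parquet outputs are disposable, it is always safe to rebuild them from the canonical view. A minimal sketch, assuming the arrow package and a hypothetical output path (bcgovpond may manage data_parquet/ for you):

library(arrow)

# Re-read the canonical raw data through its view, then rewrite the derived cache.
tb <- read_view("census_industry")
write_parquet(tb, "data_store/data_parquet/census_industry.parquet")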
Raw data in data_store/data_pond/ should be treated as read-only.
On Linux, this can be enforced at the filesystem level. On Windows, this relies on user discipline.
Either way: never edit or overwrite files in the pond.
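On Linux (or macOS), one way to enforce this manually is to strip write permission from pond files after ingestion. This is a sketch using base R, not a bcgovpond feature:

# Make every file already in the pond read-only at the filesystem level.
pond_files <- list.files("data_store/data_pond", recursive = TRUE, full.names = TRUE)
Sys.chmod(pond_files, mode = "0444")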
bcgovpond is intentionally not:
- a general-purpose data lake framework
- a CRAN-polished, pure-function package
- a tidyverse-style abstraction layer
- a database replacement
- a Parquet-first system
It touches the filesystem and has side effects.
That is the point.
- Safety over elegance
- Reproducibility over convenience
- Filesystem semantics are real APIs
- Boring > clever
- Humans make mistakes; systems should assume that
If these assumptions bother you, this package will bother you.
Reproducibility relies on Git + data_index/ + renv, not copying data folders.
Initialize renv once:
renv::init()
Track in git:
- renv.lock
- renv/activate.R
- all analysis code
- data_index/
To reproduce a past analysis:
- Check out the desired git commit
- Ensure the corresponding raw files exist in data_store/data_pond/
- Restore packages:
renv::restore()
- Run the analysis scripts
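In R, the tail end of that sequence looks roughly like this (the view name is just an example):

renv::restore()                      # reinstall the package versions recorded in renv.lock
tb <- read_view("census_industry")   # the view resolves to the same raw file it did at that commit
# ...then re-run the analysis scripts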
This package is:
- stable enough for daily use
- intentionally opinionated
- evolving slowly and conservatively
APIs may change, but the conceptual model will not.
bcgovpond is intentionally opinionated.
It is designed to prevent common reproducibility failures in applied research — such as overwritten raw data, hard-coded file paths, and undocumented “current” datasets — by enforcing a small number of non-negotiable rules: immutable raw data, explicit metadata, and stable logical views.
These constraints are deliberate. They favor auditability and long-term trust over flexibility or automation.
For a full explanation of the design choices and trade-offs, see: vignette("design-philosophy")
If you are looking for:
- maximum flexibility,
- silent overwrites,
- or “just load the latest file” shortcuts,
this package will feel annoying.
That annoyance is doing useful work.
