Merged
15 changes: 15 additions & 0 deletions .prettierignore
@@ -0,0 +1,15 @@
# Prettier ignore patterns

# Node modules
node_modules/

# Build outputs
dist/
build/
coverage/

# Example files with intentional YAML syntax (multiple config examples in one file)
examples/column-selection-config.yaml

# External packages
ext/
191 changes: 135 additions & 56 deletions Readme.md
@@ -17,11 +17,12 @@ See [System Overview](docs/SYSTEM-OVERVIEW.md) for how they work together, or ju

## Plugin Features

- **Multi-Table Support**: Sync multiple BigQuery tables simultaneously with independent settings
- **Horizontal Scalability**: Linear throughput increase with cluster size
- **No Coordination**: Each node independently determines its workload
- **Failure Recovery**: Local checkpoints enable independent node recovery
- **Adaptive Polling**: Batch sizes adjust based on sync lag
- **Continuous Validation**: Automatic data completeness checks across all tables
- **Native Replication**: Leverages Harper's clustering for data distribution ([docs](https://docs.harperdb.io/docs/developers/replication))
- **Generic Storage**: Stores complete BigQuery records without schema constraints

@@ -34,14 +35,39 @@ See [System Overview](docs/SYSTEM-OVERVIEW.md) for how they work together, or ju
- **Easy Testing**: Perfect for validating the BigQuery plugin with realistic workloads
- **Shared Config**: Uses the same `config.yaml` as the plugin - no duplicate setup

### Running the Synthesizer

The maritime synthesizer generates test data *to* BigQuery, which the plugin then syncs *from* BigQuery to Harper.

**Prerequisites:**

1. GCP project with BigQuery enabled
2. Service account key with BigQuery write permissions
3. Update `config.yaml` with your BigQuery credentials

**Quick Start:**

```bash
# Install dependencies (if not already done)
npm install

# Generate test data - auto-detects mode from config.yaml
npx maritime-data-synthesizer initialize realistic
```

**Available Commands:**

- `initialize <scenario>` - Generate test data (scenarios: small, realistic, stress)
  - `small`: 100 positions, 10 events, 20 metadata (~1 hour of data)
  - `realistic`: 10k positions, 500 events, 100 metadata (~24 hours)
  - `stress`: 100k positions, 5k events, 1k metadata (~7 days)
- `start` - Continuous generation with rolling window (single-table mode only)
- `stats` - View BigQuery table statistics
- `clear` - Clear all data (keeps schema)
- `reset N` - Delete and reload with N days of data

**Note:** Multi-table mode (the current default config) supports the `initialize` command. For continuous generation with `start`, use the single-table config format.

**Documentation:**

- **[5-Minute Quick Start](docs/QUICKSTART.md)** - Get generating data immediately
@@ -90,6 +116,61 @@ Each node:

## Configuration

### Multi-Table Support

The plugin supports syncing **multiple BigQuery tables** simultaneously, each with independent sync settings:

```yaml
bigquery:
  projectId: your-project
  credentials: service-account-key.json
  location: US

  tables:
    - id: vessel_positions
      dataset: maritime_tracking
      table: vessel_positions
      timestampColumn: timestamp
      columns: [timestamp, mmsi, latitude, longitude, speed_knots]
      targetTable: VesselPositions
      sync:
        initialBatchSize: 10000
        catchupBatchSize: 1000
        steadyBatchSize: 500

    - id: port_events
      dataset: maritime_tracking
      table: port_events
      timestampColumn: event_time
      columns: ['*'] # Fetch all columns
      targetTable: PortEvents
      sync:
        initialBatchSize: 5000
        catchupBatchSize: 500
        steadyBatchSize: 100
```
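The three `sync` batch sizes correspond to sync phases. As a rough sketch of how a sync loop might choose between them based on checkpoint lag (the threshold, function name, and checkpoint flag below are illustrative assumptions, not the plugin's actual implementation):

```javascript
// Illustrative sketch only -- the threshold and names are assumptions,
// not the plugin's actual code.
const syncConfig = {
  initialBatchSize: 10000, // first full backfill
  catchupBatchSize: 1000,  // behind, but past the initial load
  steadyBatchSize: 500,    // keeping pace with new data
};

// Pick a batch size from how far the checkpoint lags the source (in ms).
function chooseBatchSize(lagMs, hasCheckpoint, config) {
  if (!hasCheckpoint) return config.initialBatchSize; // never synced: backfill
  if (lagMs > 5 * 60 * 1000) return config.catchupBatchSize; // far behind
  return config.steadyBatchSize; // near real time
}

console.log(chooseBatchSize(0, false, syncConfig));             // 10000
console.log(chooseBatchSize(10 * 60 * 1000, true, syncConfig)); // 1000
console.log(chooseBatchSize(1000, true, syncConfig));           // 500
```

Larger batches amortize BigQuery query cost during backfill; smaller steady-state batches keep latency low once the table has caught up.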

**Key Features:**

- Each table syncs to a separate Harper table
- Independent batch sizes and sync rates per table
- Different timestamp column names supported
- Isolated checkpoints - one table failure doesn't affect others
- Per-table validation and monitoring
- Backward compatible with single-table configuration
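For comparison, a single-table configuration might look like the sketch below. This assumes the multi-table fields apply directly under `bigquery:`; consult the shipped `config.yaml` for the authoritative format.

```yaml
# Hypothetical single-table sketch -- field placement is an assumption
# inferred from the multi-table format above, not a documented schema.
bigquery:
  projectId: your-project
  credentials: service-account-key.json
  location: US
  dataset: maritime_tracking
  table: vessel_positions
  timestampColumn: timestamp
  targetTable: VesselPositions
  sync:
    initialBatchSize: 10000
    catchupBatchSize: 1000
    steadyBatchSize: 500
```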

**Important Constraint:**
Each BigQuery table MUST sync to a **different** Harper table. Syncing multiple BigQuery tables to the same Harper table is not supported and will cause:

- Record ID collisions and data overwrites
- Validation failures (can only validate one source)
- Checkpoint confusion (different sync states)
- Schema conflicts (mixed field sets)

If you need to combine data from multiple BigQuery tables, sync them to separate Harper tables and join at query time.
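As an illustration of joining at query time, here is a sketch in application code. The table and field names are the examples from this README; the in-memory arrays stand in for result sets already fetched from Harper, so no Harper query API is implied.

```javascript
// Sketch: combine two separately synced tables at query time.
// VesselPositions and PortEvents are the example tables from this README;
// these arrays stand in for rows fetched from Harper.
const vesselPositions = [
  { mmsi: 123, latitude: 52.1, longitude: 4.3 },
  { mmsi: 456, latitude: 51.9, longitude: 4.1 },
];
const portEvents = [
  { vessel_mmsi: 123, event_type: 'arrival' },
  { vessel_mmsi: 123, event_type: 'departure' },
];

// Join events to positions on the shared vessel identifier.
function joinOnMmsi(positions, events) {
  const byMmsi = new Map(positions.map((p) => [p.mmsi, p]));
  return events
    .filter((e) => byMmsi.has(e.vessel_mmsi))
    .map((e) => ({ ...e, position: byMmsi.get(e.vessel_mmsi) }));
}

const joined = joinOnMmsi(vesselPositions, portEvents);
console.log(joined.length); // 2
```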

See `config.multi-table.yaml` for a complete example.
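The per-table validation mentioned above can be pictured as comparing row counts up to the last checkpoint, table by table. A rough sketch (the count sources and all names here are assumptions, not the plugin's validation code):

```javascript
// Rough sketch of a per-table completeness check: compare how many rows the
// source reports up to the checkpoint with how many the target has stored.
// All names are illustrative assumptions, not the plugin's actual code.
function validateTable(tableId, sourceCountAtCheckpoint, targetCount) {
  const missing = sourceCountAtCheckpoint - targetCount;
  return {
    tableId,
    complete: missing === 0,
    missing: Math.max(missing, 0),
  };
}

// Each table is validated in isolation, so one lagging table
// does not mark the others as failed.
const results = [
  validateTable('vessel_positions', 10000, 10000),
  validateTable('port_events', 500, 498),
];
console.log(
  results.map((r) => `${r.tableId}: ${r.complete ? 'ok' : r.missing + ' missing'}`)
);
```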

### Data Storage

BigQuery records are stored as-is at the top level:
@@ -282,88 +363,86 @@ Learn more about [Harper's storage architecture](https://docs.harperdb.io/docs/r

## Roadmap

### Crawl (v1.0 - Complete)

**Status:** ✅ Shipped

Single-threaded, single-table ingestion:

- ✅ Modulo-based partitioning for distributed workload
- ✅ One BigQuery table ingestion
- ✅ Adaptive batch sizing (phase-based: initial/catchup/steady)
- ✅ Checkpoint-based recovery per node
- ✅ Durable thread identity (survives restarts)
- ✅ Monitoring via REST API (`/SyncControl`)

### 🚶 Walk (v2.0 - Complete)

**Status:** ✅ Shipped

Multi-table ingestion with column selection:

- ✅ **Multiple BigQuery tables** - Ingest from multiple tables simultaneously
- ✅ **Column selection** - Choose specific columns per table (reduce data transfer)
- ✅ **Per-table configuration** - Different batch sizes, intervals, and strategies per table
- ✅ **Multi-table validation** - Independent validation per table
- ✅ **Unified monitoring** - Single dashboard for all table ingestions

**Use Cases:**

- Ingest multiple related datasets (e.g., vessels, ports, weather)
- Reduce costs by selecting only needed columns
- Different sync strategies per data type (real-time vs batch)

**Configuration Example:**

```yaml
bigquery:
  projectId: your-project
  credentials: service-account-key.json
  location: US

  tables:
    - id: vessel_positions
      dataset: maritime_tracking
      table: vessel_positions
      columns: [mmsi, timestamp, latitude, longitude, speed_knots]
      targetTable: VesselPositions
      sync:
        initialBatchSize: 10000
        catchupBatchSize: 1000
        steadyBatchSize: 500

    - id: port_events
      dataset: maritime_tracking
      table: port_events
      columns: [port_id, vessel_mmsi, event_type, timestamp]
      targetTable: PortEvents
      sync:
        initialBatchSize: 5000
        catchupBatchSize: 500
        steadyBatchSize: 100
```

### 🏃 Run (v3.0 - Planned)

**Status:** 📋 Future

Multi-threaded, multi-instance Harper cluster support:

- [ ] **Multi-threaded ingestion** - Multiple worker threads per node
- [ ] **Full cluster distribution** - Automatic workload distribution across all Harper nodes
- [ ] **Dynamic rebalancing** - Handle node additions/removals without manual intervention
- [ ] **Improved monitoring** - Cluster-wide health dashboard
- [ ] **Thread-level checkpointing** - Fine-grained recovery per worker thread

**Benefits:**

- Linear scaling across cluster nodes
- Better resource utilization per node
- Automatic failover and rebalancing

**Note:** The code already supports multiple worker threads per instance via `server.workerIndex`. Each thread gets a durable identity (`hostname-workerIndex`) that persists across restarts, enabling checkpoint-based recovery.

---

**Get Started:** Deploy on [Harper Fabric](https://fabric.harper.fast) - free tier available, no credit card required.