@irjudson (Member)
Summary

This PR adds column selection support to reduce BigQuery scanning costs and introduces multi-table orchestrator capabilities for the maritime data synthesizer.

Key Features

1. Column Selection (Cost Optimization)

  • Query cost reduction: Select only needed columns instead of SELECT *
  • Backward compatible: Defaults to ['*'] when not specified
  • Validation: Ensures timestamp column is always included
  • Examples provided: Minimal tracking (80% savings), Movement analysis (65% savings)

2. Multi-Table Support

  • Multi-table orchestrator: Generate related datasets (vessel positions, port events, vessel metadata)
  • Independent sync engines: Each table has its own checkpoint and sync configuration
  • Dynamic table access: Runtime table selection via tables[tableName]
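A minimal sketch of the per-table engine layout described above (names are illustrative, not the actual implementation): each configured table gets its own sync engine, looked up at runtime by table id.

```javascript
// Hypothetical sketch: one independent sync engine per configured table,
// addressable at runtime as engines[tableName].
function buildEngines(tables, createEngine) {
	const engines = {};
	for (const t of tables) {
		// each engine carries its own checkpoint and sync configuration
		engines[t.id] = createEngine(t);
	}
	return engines;
}
```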

3. CI/CD Infrastructure (Merged from main)

  • ESLint + Prettier: Code quality with @harperdb/code-guidelines
  • Pre-commit hooks: Automatic lint + test validation with Husky
  • GitHub Actions: Lint, test (Node 20/22), and format checks on all PRs

Configuration Formats

Both formats are fully supported with automatic normalization:

Legacy single-table:

bigquery:
  projectId: my-project
  dataset: maritime
  table: vessel_positions
  timestampColumn: timestamp
  columns: ['timestamp', 'mmsi', 'latitude', 'longitude']  # Optional

Multi-table:

bigquery:
  projectId: my-project
  credentials: /path/to/key.json
  location: US
  tables:
    - id: vessel_positions
      dataset: maritime
      table: positions
      timestampColumn: timestamp
      targetTable: VesselPositions
      columns: ['timestamp', 'mmsi', 'latitude', 'longitude']
    - id: port_events
      dataset: maritime
      table: events
      timestampColumn: event_time
      targetTable: PortEvents
      columns: ['*']
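The normalization between the two formats can be sketched like this (an assumed shape; the real `normalizeConfig()` may differ): a legacy single-table config is wrapped into the `tables[]` array form so downstream code only ever handles one format.

```javascript
// Sketch: wrap a legacy single-table bigquery config into the multi-table
// shape. Field names follow the examples above; defaults are assumptions.
function normalizeConfig(bigquery) {
	if (Array.isArray(bigquery.tables)) return bigquery; // already multi-table
	const { projectId, credentials, location, ...tableFields } = bigquery;
	return {
		projectId,
		credentials,
		location,
		// columns defaults to ['*']; an explicit columns field overrides it
		tables: [{ id: tableFields.table, columns: ['*'], ...tableFields }],
	};
}
```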

Test Coverage

  • 33 tests, all passing
  • 7 tests for legacy single-table format (automatic normalization)
  • 4 tests for native multi-table format
  • Ensures no regressions when using either configuration style

Changes

  • Added QueryBuilder for column-aware SQL generation
  • Updated BigQueryClient to support column selection
  • Added normalizeConfig() for backward compatibility
  • Implemented multi-table orchestrator with 3 generators
  • Added comprehensive documentation and examples
  • Merged CI/CD infrastructure from main branch
  • Fixed all lint/format issues with pre-commit hooks

Testing

npm test        # All 33 tests passing
npm run lint    # 0 errors, 0 warnings
npm run format:check  # All files properly formatted

Documentation

  • Updated README with column selection examples
  • Added COLUMN-SELECTION.md with detailed usage
  • Documented cost savings for different column configurations
  • Added multi-table orchestrator documentation

🤖 Generated with Claude Code

irjudson and others added 13 commits November 10, 2025 14:06
This commit implements Phase 1 of column selection for BigQuery sync,
along with significant code quality improvements and comprehensive testing.

## Features Added

### Column Selection
- Add optional 'columns' field to config.yaml
- Support for selecting specific columns from BigQuery tables
- Defaults to SELECT * for backward compatibility
- Validates that timestampColumn is included in column list
- Reduces BigQuery data transfer costs and improves performance
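The core of column selection can be sketched as follows (illustrative only; the actual QueryBuilder class may be structured differently). Selecting explicit columns instead of `*` is what reduces bytes scanned, and the timestamp-column check mirrors the validation described above.

```javascript
// Sketch: column-aware sync query construction with timestampColumn
// validation. @checkpoint is a named query parameter (assumed convention).
function buildSyncQuery({ dataset, table, timestampColumn, columns = ['*'] }) {
	if (!columns.includes('*') && !columns.includes(timestampColumn)) {
		throw new Error(`columns must include timestampColumn "${timestampColumn}"`);
	}
	const select = columns.join(', ');
	return (
		`SELECT ${select} FROM \`${dataset}.${table}\` ` +
		`WHERE ${timestampColumn} > @checkpoint ORDER BY ${timestampColumn}`
	);
}
```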

### Code Quality Improvements
- Extract SQL query construction to new QueryBuilder class
- Extract type conversion to separate type-converter module
- Add centralized validation module (validators.js)
- Add comprehensive JSDoc documentation throughout
- Simplify and clarify type conversion logic

## Files Changed

### New Files
- src/query-builder.js - SQL query construction with column selection
- src/type-converter.js - Simplified BigQuery type conversion
- src/validators.js - Centralized configuration validation
- test/query-builder.test.js - 21 unit tests for query builder
- test/type-converter.test.js - 28 unit tests for type converter
- test/integration/column-selection.test.js - E2E integration tests
- docs/MULTI-TABLE-ROADMAP.md - Future architecture roadmap

### Modified Files
- config.yaml - Add columns field documentation
- src/bigquery-client.js - Use QueryBuilder, add JSDoc
- src/config-loader.js - Extract and validate columns, add JSDoc
- src/sync-engine.js - Use type-converter utility
- test/config-loader.test.js - Add tests for column selection
- docs/QUICKSTART.md - Document column selection feature
- docs/SYSTEM-OVERVIEW.md - Update with new architecture

## Testing
- All existing tests pass (no regressions)
- 60+ new unit tests added
- Integration test framework for E2E validation
- Test coverage for column selection, validation, and type conversion

## Benefits
- Reduced BigQuery costs (fetch only needed columns)
- Improved sync performance
- Better code organization and testability
- Comprehensive documentation
- Future-ready for multi-table support (Phase 2)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Completes v1.0 feature set with comprehensive multi-table data generation
support. The orchestrator generates test data for all 3 tables (vessel_positions,
port_events, vessel_metadata) with consistent MMSI identifiers.

Core Changes:
- Created VesselPositionsGenerator wrapper (ext/maritime-data-synthesizer/generators/)
- Integrated MultiTableOrchestrator into CLI (bin/cli.js)
- Enhanced getSynthesizerConfig() to detect multi-table mode (src/config-loader.js)
- Updated default config.yaml to multi-table format with column selection
- CLI auto-detects single-table vs multi-table mode from configuration
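The mode auto-detection above amounts to a one-line check (an assumption about the implementation, not the actual `getSynthesizerConfig()` code): the presence of a `bigquery.tables` array selects multi-table mode.

```javascript
// Sketch: multi-table mode is selected when the config carries a
// non-empty bigquery.tables array; anything else is legacy single-table.
function isMultiTableMode(config) {
	return Array.isArray(config?.bigquery?.tables) && config.bigquery.tables.length > 0;
}
```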

CLI Usage:
  # Multi-table mode (when config has bigquery.tables array):
  npx maritime-data-synthesizer initialize small      # 100 positions, 10 events, 20 metadata
  npx maritime-data-synthesizer initialize realistic  # 10k positions, 500 events, 100 metadata
  npx maritime-data-synthesizer initialize stress     # 100k positions, 5k events, 1k metadata

  # Single-table mode (when config has legacy format):
  npx maritime-data-synthesizer start                 # Continuous generation

Test Coverage:
- Added 19 new tests (66 total, all passing)
- test/config-loader.test.js: Multi-table detection (3 tests)
- test/vessel-positions-generator.test.js: Generator wrapper (11 tests)
- test/integration/orchestrator.test.js: Full orchestrator (19 tests)

Config Updates:
- config.yaml: Now uses multi-table format by default
- config.multi-table.yaml: Example showing all 3 tables
- Added column selection to all table configurations

Documentation:
- Updated README with multi-table examples and constraints
- Added design document (docs/plans/2025-11-12-multi-table-tdd-design.md)

All 66 tests passing. Ready for production use.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…structions

- Walk (v2.0): Multi-table with column selection (COMPLETE)
- Run (v3.0): Multi-threaded ingestion (PLANNED)
- Added clear synthesizer running instructions with prerequisites
- Documented all CLI commands with scenario descriptions
- Clarified multi-table vs single-table mode behavior

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
The synthesizer DOES generate all 3 tables when bigquery.tables is present.
Updated the comment to reflect that the multi-table orchestrator is used in this
mode, and that the top-level dataset/table fields are only defaults for
single-table mode.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
CRITICAL FIX: BigQuery streaming inserts are not available in free tier
and have significant limitations. Changed orchestrator to use load job
API with NEWLINE_DELIMITED_JSON files, matching the single-table
synthesizer implementation.

Changes:
- Added fs, os, path imports for temp file handling
- Replaced table.insert() with table.load() in insertRecords()
- Write records to temp NDJSON file before loading
- Clean up temp files after successful load
- Maintains batching for large datasets (10k records per batch)

This fix enables the orchestrator to work with:
- BigQuery free tier accounts
- All BigQuery pricing tiers without streaming insert costs
- Existing table schemas (autodetect: false)

Tested: All 66 tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Currently tables must be pre-defined in schema.graphql. In the future,
we could dynamically create Harper tables based on BigQuery schema at
runtime using the Operations API.

This would enable:
- Automatic table creation from BigQuery metadata
- No manual schema.graphql maintenance for new tables
- Schema evolution support

Reference: https://docs.harperdb.io/docs/developers/operations-api

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
The verify method was hardcoding 'timestamp' for all tables, causing
verification failures for port_events (event_time) and vessel_metadata
(last_updated).

Changes:
- Added tableConfigs mapping: table name -> timestamp column
- vessel_metadata: last_updated
- port_events: event_time
- vessel_positions: timestamp
- Dynamic query construction using correct timestamp column
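The mapping above can be sketched as a small lookup feeding the verification query (query shape is an assumption; only the table-to-column mapping comes from the commit):

```javascript
// Sketch: per-table timestamp columns, replacing the hardcoded 'timestamp'.
const tableConfigs = {
	vessel_positions: 'timestamp',
	port_events: 'event_time',
	vessel_metadata: 'last_updated',
};

function buildVerifyQuery(dataset, tableName) {
	const tsCol = tableConfigs[tableName];
	return `SELECT COUNT(*) AS n, MAX(${tsCol}) AS latest FROM \`${dataset}.${tableName}\``;
}
```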

Fixes verification step after data generation completes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Schema Changes:
- Cleaned up extra blank lines in schema file

Resources Changes:
- Renamed BigQueryData to VesselMetadata (matches table name)
- Added VesselPositions resource class
- Added PortEvents resource class
- All 3 resource classes support dynamic attribute searching
- Simplified validation endpoint (removed null check, letting errors propagate)

CLI Changes:
- Added setTimeout with process.exit(0) after initialization
- Allows the CLI to exit cleanly instead of hanging after completion
- The BigQuery client keeps the event loop alive, so this forces a clean exit

All resource classes extend their respective table types and provide
consistent get() and search() implementations with debug logging.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add comprehensive documentation to design doc explaining:
- Current implementation uses load job API for free tier support
- Streaming insert API not available in BigQuery free tier
- TODO: Add opt-in streaming insert for production deployments
- Performance tradeoffs clearly documented
- Code examples for both approaches

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
1. Documentation Consolidation:
   - Merged ROLLING-WINDOW.md into maritime-synthesizer.md
   - Deleted redundant maritime-data-synthesizer.md (duplicate)
   - Archived MULTI-TABLE-ROADMAP.md to internal/ (planning doc, now implemented)

2. File Naming Standardization:
   - Renamed all docs to lowercase for consistency:
     * MARITIME-SYNTHESIZER-README.md → maritime-synthesizer.md
     * QUICKSTART.md → quickstart.md
     * SECURITY.md → security.md
     * SYSTEM-OVERVIEW.md → system-overview.md
     * All internal docs renamed to lowercase

3. Added TODO for Multi-Table Rolling Window:
   - Added comprehensive TODO in bin/cli.js documenting that
     multi-table orchestrator currently only supports one-time
     'initialize' command
   - References single-table MaritimeDataSynthesizer as implementation
     example for rolling window/backfill/cleanup features

4. Development Attribution:
   - Added Development section to README crediting Claude Code

Result:
- Non-redundant documentation
- Consistent naming (lowercase)
- Clear about multi-table limitations
- Rolling window docs integrated into main synthesizer guide
Integrated CI/CD infrastructure from main branch:
- Added ESLint and Prettier with @harperdb/code-guidelines
- Added GitHub Actions workflow (lint, test, format checks)
- Added Husky pre-commit hooks
- Fixed unused variables and loose equality operators
- Auto-formatted all files with Prettier (tabs instead of spaces)

Preserved multi-table implementation from feature/column-selection.
Removed documentation files that were cleaned up in feature branch.

Note: Some config-loader tests need updating for multi-table format changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Updated config-loader tests to verify backward compatibility with both
configuration formats:

Legacy single-table format (7 tests):
- Validates automatic normalization to multi-table format
- Tests config.bigquery.tables[0] structure after normalization
- Verifies column validation and defaults

Multi-table format (4 tests):
- Tests native multi-table configuration handling
- Validates multiple tables with different configurations
- Tests column validation per table
- Verifies location defaults

All 33 tests passing - ensures no regressions when using either format.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
The examples/column-selection-config.yaml file contains multiple example
configurations with duplicate 'bigquery:' keys, which is intentional for
documentation purposes but causes Prettier's YAML parser to fail.

Added .prettierignore to exclude:
- Example files with multiple config snippets
- External packages (ext/)
- Standard ignores (node_modules, dist, coverage)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@irjudson merged commit 041e93b into main on November 13, 2025 (4 checks passed)
@irjudson deleted the feature/column-selection branch on November 13, 2025 at 13:45