Add column selection and multi-table support #2
Merged
This commit implements Phase 1 of column selection for BigQuery sync, along with significant code quality improvements and comprehensive testing.

## Features Added

### Column Selection
- Add optional 'columns' field to config.yaml
- Support for selecting specific columns from BigQuery tables
- Defaults to SELECT * for backward compatibility
- Validates that timestampColumn is included in column list
- Reduces BigQuery data transfer costs and improves performance

### Code Quality Improvements
- Extract SQL query construction to new QueryBuilder class
- Extract type conversion to separate type-converter module
- Add centralized validation module (validators.js)
- Add comprehensive JSDoc documentation throughout
- Simplify and clarify type conversion logic

## Files Changed

### New Files
- src/query-builder.js - SQL query construction with column selection
- src/type-converter.js - Simplified BigQuery type conversion
- src/validators.js - Centralized configuration validation
- test/query-builder.test.js - 21 unit tests for query builder
- test/type-converter.test.js - 28 unit tests for type converter
- test/integration/column-selection.test.js - E2E integration tests
- docs/MULTI-TABLE-ROADMAP.md - Future architecture roadmap

### Modified Files
- config.yaml - Add columns field documentation
- src/bigquery-client.js - Use QueryBuilder, add JSDoc
- src/config-loader.js - Extract and validate columns, add JSDoc
- src/sync-engine.js - Use type-converter utility
- test/config-loader.test.js - Add tests for column selection
- docs/QUICKSTART.md - Document column selection feature
- docs/SYSTEM-OVERVIEW.md - Update with new architecture

## Testing
- All existing tests pass (no regressions)
- 60+ new unit tests added
- Integration test framework for E2E validation
- Test coverage for column selection, validation, and type conversion

## Benefits
- Reduced BigQuery costs (fetch only needed columns)
- Improved sync performance
- Better code organization and testability
- Comprehensive documentation
- Future-ready for multi-table support (Phase 2)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
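The column-selection rules listed in this commit (default to SELECT *, require timestampColumn in the column list) can be sketched as follows. This is a minimal illustration, not the PR's actual QueryBuilder: the function names `buildSelectList` and `buildQuery`, the option names, and the shape of the final query are all assumptions.

```javascript
// Hypothetical sketch of column selection; names and query shape are
// illustrative and not taken from src/query-builder.js.

// Wrap an identifier in backticks, as BigQuery standard SQL expects.
const quote = (name) => '`' + name + '`';

function buildSelectList(columns, timestampColumn) {
  // No columns configured, or an explicit ['*'], falls back to SELECT *.
  const selectsAll =
    !columns || columns.length === 0 || (columns.length === 1 && columns[0] === '*');
  if (selectsAll) return '*';

  // The sync relies on the timestamp column, so it must be selected.
  if (!columns.includes(timestampColumn)) {
    throw new Error(`timestampColumn '${timestampColumn}' must be included in columns`);
  }
  return columns.map(quote).join(', ');
}

function buildQuery({ dataset, table, columns, timestampColumn }) {
  const selectList = buildSelectList(columns, timestampColumn);
  return `SELECT ${selectList} FROM ${quote(`${dataset}.${table}`)} ORDER BY ${quote(timestampColumn)}`;
}

// Example usage with made-up dataset and column names.
console.log(
  buildQuery({
    dataset: 'maritime',
    table: 'vessel_positions',
    columns: ['mmsi', 'timestamp'],
    timestampColumn: 'timestamp',
  })
);
```

Keeping the validation next to the query construction means a misconfigured column list fails before any query is sent, which is the behavior the commit describes for config loading.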
Completes v1.0 feature set with comprehensive multi-table data generation support. The orchestrator generates test data for all 3 tables (vessel_positions, port_events, vessel_metadata) with consistent MMSI identifiers.

Core Changes:
- Created VesselPositionsGenerator wrapper (ext/maritime-data-synthesizer/generators/)
- Integrated MultiTableOrchestrator into CLI (bin/cli.js)
- Enhanced getSynthesizerConfig() to detect multi-table mode (src/config-loader.js)
- Updated default config.yaml to multi-table format with column selection
- CLI auto-detects single-table vs multi-table mode from configuration

CLI Usage:

    # Multi-table mode (when config has bigquery.tables array):
    npx maritime-data-synthesizer initialize small      # 100 positions, 10 events, 20 metadata
    npx maritime-data-synthesizer initialize realistic  # 10k positions, 500 events, 100 metadata
    npx maritime-data-synthesizer initialize stress     # 100k positions, 5k events, 1k metadata

    # Single-table mode (when config has legacy format):
    npx maritime-data-synthesizer start                 # Continuous generation

Test Coverage:
- Added 19 new tests (66 total, all passing)
- test/config-loader.test.js: Multi-table detection (3 tests)
- test/vessel-positions-generator.test.js: Generator wrapper (11 tests)
- test/integration/orchestrator.test.js: Full orchestrator (19 tests)

Config Updates:
- config.yaml: Now uses multi-table format by default
- config.multi-table.yaml: Example showing all 3 tables
- Added column selection to all table configurations

Documentation:
- Updated README with multi-table examples and constraints
- Added design document (docs/plans/2025-11-12-multi-table-tdd-design.md)

All 66 tests passing. Ready for production use.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
…structions
- Walk (v2.0): Multi-table with column selection (COMPLETE)
- Run (v3.0): Multi-threaded ingestion (PLANNED)
- Added clear synthesizer running instructions with prerequisites
- Documented all CLI commands with scenario descriptions
- Clarified multi-table vs single-table mode behavior

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
The synthesizer DOES generate all 3 tables when bigquery.tables is present. Updated comment to reflect that multi-table orchestrator is used in this mode, and dataset/table fields are only defaults for single-table mode. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
CRITICAL FIX: BigQuery streaming inserts are not available in free tier and have significant limitations. Changed orchestrator to use load job API with NEWLINE_DELIMITED_JSON files, matching the single-table synthesizer implementation.

Changes:
- Added fs, os, path imports for temp file handling
- Replaced table.insert() with table.load() in insertRecords()
- Write records to temp NDJSON file before loading
- Clean up temp files after successful load
- Maintains batching for large datasets (10k records per batch)

This fix enables the orchestrator to work with:
- BigQuery free tier accounts
- All BigQuery pricing tiers without streaming insert costs
- Existing table schemas (autodetect: false)

Tested: All 66 tests passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
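The load-job approach described in this commit can be sketched roughly as below. This is an assumption-laden illustration rather than the orchestrator's real insertRecords(): the function names `toNdjson` and `loadRecords` and the temp-file naming are hypothetical, and `table` is assumed to be a `@google-cloud/bigquery` Table object (or anything exposing a compatible `load(path, metadata)` method).

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Serialize records as newline-delimited JSON: one JSON object per line.
function toNdjson(records) {
  return records.map((record) => JSON.stringify(record)).join('\n') + '\n';
}

// Hypothetical helper: load records in batches through load jobs instead of
// streaming inserts. `table` is assumed to expose load(path, metadata).
async function loadRecords(table, records, batchSize = 10000) {
  for (let start = 0; start < records.length; start += batchSize) {
    const batch = records.slice(start, start + batchSize);
    const tmpFile = path.join(os.tmpdir(), `bq-load-${process.pid}-${start}.ndjson`);
    fs.writeFileSync(tmpFile, toNdjson(batch));
    try {
      await table.load(tmpFile, {
        sourceFormat: 'NEWLINE_DELIMITED_JSON',
        autodetect: false, // rely on the existing table schema
        writeDisposition: 'WRITE_APPEND',
      });
    } finally {
      // This sketch removes the temp file whether or not the load succeeded.
      fs.rmSync(tmpFile, { force: true });
    }
  }
}
```

Note one deliberate difference from the commit text: the commit cleans up temp files after a successful load, while this sketch uses `finally` so a failed load does not leave files behind. Either choice is defensible; keeping the file on failure can help with debugging.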
Currently tables must be pre-defined in schema.graphql. In the future, we could dynamically create Harper tables based on BigQuery schema at runtime using the Operations API.

This would enable:
- Automatic table creation from BigQuery metadata
- No manual schema.graphql maintenance for new tables
- Schema evolution support

Reference: https://docs.harperdb.io/docs/developers/operations-api

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
The verify method was hardcoding 'timestamp' for all tables, causing verification failures for port_events (event_time) and vessel_metadata (last_updated).

Changes:
- Added tableConfigs mapping: table name -> timestamp column
  - vessel_metadata: last_updated
  - port_events: event_time
  - vessel_positions: timestamp
- Dynamic query construction using correct timestamp column

Fixes verification step after data generation completes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
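A per-table timestamp mapping like the one described here might look like the following sketch. The three table-to-column pairs come from the commit message; the function name `buildVerifyQuery`, the dataset name in the example, and the exact shape of the verification query are assumptions.

```javascript
// Table name -> timestamp column, as listed in the commit message.
const TIMESTAMP_COLUMNS = new Map([
  ['vessel_positions', 'timestamp'],
  ['port_events', 'event_time'],
  ['vessel_metadata', 'last_updated'],
]);

// Hypothetical verification query: row count plus oldest and newest
// timestamps, using the column that belongs to the given table.
function buildVerifyQuery(dataset, tableName) {
  const column = TIMESTAMP_COLUMNS.get(tableName);
  if (!column) {
    throw new Error(`No timestamp column configured for table '${tableName}'`);
  }
  const col = '`' + column + '`';
  const ref = '`' + dataset + '.' + tableName + '`';
  return `SELECT COUNT(*) AS total, MIN(${col}) AS oldest, MAX(${col}) AS newest FROM ${ref}`;
}

// Example usage with a made-up dataset name.
console.log(buildVerifyQuery('maritime', 'port_events'));
```

Throwing on an unknown table, instead of silently falling back to 'timestamp', avoids reintroducing the hardcoded default that caused the original verification failures.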
Schema Changes:
- Cleaned up extra blank lines in schema file

Resources Changes:
- Renamed BigQueryData to VesselMetadata (matches table name)
- Added VesselPositions resource class
- Added PortEvents resource class
- All 3 resource classes support dynamic attribute searching
- Simplified validation endpoint (removed null check, let error throw)

CLI Changes:
- Added setTimeout with process.exit(0) after initialization
- Allows CLI to exit cleanly instead of hanging after completion
- BigQuery client keeps event loop alive, this forces clean exit

All resource classes extend their respective table types and provide consistent get() and search() implementations with debug logging.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Add comprehensive documentation to design doc explaining:
- Current implementation uses load job API for free tier support
- Streaming insert API not available in BigQuery free tier
- TODO: Add opt-in streaming insert for production deployments
- Performance tradeoffs clearly documented
- Code examples for both approaches

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
1. Documentation Consolidation:
   - Merged ROLLING-WINDOW.md into maritime-synthesizer.md
   - Deleted redundant maritime-data-synthesizer.md (duplicate)
   - Archived MULTI-TABLE-ROADMAP.md to internal/ (planning doc, now implemented)

2. File Naming Standardization:
   - Renamed all docs to lowercase for consistency:
     * MARITIME-SYNTHESIZER-README.md → maritime-synthesizer.md
     * QUICKSTART.md → quickstart.md
     * SECURITY.md → security.md
     * SYSTEM-OVERVIEW.md → system-overview.md
     * All internal docs renamed to lowercase

3. Added TODO for Multi-Table Rolling Window:
   - Added comprehensive TODO in bin/cli.js documenting that multi-table orchestrator currently only supports one-time 'initialize' command
   - References single-table MaritimeDataSynthesizer as implementation example for rolling window/backfill/cleanup features

4. Development Attribution:
   - Added Development section to README crediting Claude Code

Result:
- Non-redundant documentation
- Consistent naming (lowercase)
- Clear about multi-table limitations
- Rolling window docs integrated into main synthesizer guide
Integrated CI/CD infrastructure from main branch:
- Added ESLint and Prettier with @harperdb/code-guidelines
- Added GitHub Actions workflow (lint, test, format checks)
- Added Husky pre-commit hooks
- Fixed unused variables and loose equality operators
- Auto-formatted all files with Prettier (tabs vs spaces)

Preserved multi-table implementation from feature/column-selection. Removed documentation files that were cleaned up in feature branch.

Note: Some config-loader tests need updating for multi-table format changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Updated config-loader tests to verify backward compatibility with both configuration formats:

Legacy single-table format (7 tests):
- Validates automatic normalization to multi-table format
- Tests config.bigquery.tables[0] structure after normalization
- Verifies column validation and defaults

Multi-table format (4 tests):
- Tests native multi-table configuration handling
- Validates multiple tables with different configurations
- Tests column validation per table
- Verifies location defaults

All 33 tests passing - ensures no regressions when using either format.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
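The normalization these tests exercise can be illustrated with a small sketch. It is not the code in src/config-loader.js: the function name `normalizeBigQueryConfig` is hypothetical, the field names (dataset, table, timestampColumn, columns, location) are inferred from the commit messages, and the 'US' location default is an assumption since the actual default value is not stated.

```javascript
// Hypothetical sketch: accept either the legacy single-table shape or the
// multi-table shape and always return { ...shared, location, tables: [...] }.
function normalizeBigQueryConfig(bigquery) {
  const { dataset, table, timestampColumn, columns, tables, ...shared } = bigquery;

  // Legacy format: lift the top-level table fields into a one-element array.
  const rawTables = Array.isArray(tables)
    ? tables
    : [{ dataset, table, timestampColumn, columns }];

  const normalizedTables = rawTables.map((t) => {
    // Columns default to ['*'] when not specified.
    const cols = Array.isArray(t.columns) && t.columns.length > 0 ? t.columns : ['*'];
    const selectsAll = cols.length === 1 && cols[0] === '*';
    if (!selectsAll && !cols.includes(t.timestampColumn)) {
      throw new Error(
        `Table '${t.table}': timestampColumn '${t.timestampColumn}' must be in columns`
      );
    }
    return { ...t, columns: cols };
  });

  return {
    ...shared,
    location: shared.location || 'US', // assumed default, not confirmed by the source
    tables: normalizedTables,
  };
}

// Example usage: a legacy config ends up with a single entry in tables.
const normalized = normalizeBigQueryConfig({
  projectId: 'example-project',
  dataset: 'maritime',
  table: 'vessel_positions',
  timestampColumn: 'timestamp',
});
console.log(normalized.tables[0].columns);
```

Normalizing once at load time lets the rest of the code read `config.bigquery.tables` without branching on which format the user wrote.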
The examples/column-selection-config.yaml file contains multiple example configurations with duplicate 'bigquery:' keys, which is intentional for documentation purposes but causes Prettier's YAML parser to fail.

Added .prettierignore to exclude:
- Example files with multiple config snippets
- External packages (ext/)
- Standard ignores (node_modules, dist, coverage)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Summary
This PR adds column selection support to reduce BigQuery scanning costs and introduces multi-table orchestrator capabilities for the maritime data synthesizer.
Key Features
1. Column Selection (Cost Optimization)
- Defaults to SELECT * (['*']) when not specified
2. Multi-Table Support
- tables[tableName]
3. CI/CD Infrastructure (Merged from main)
Configuration Formats
Both formats are fully supported with automatic normalization:
Legacy single-table:
Multi-table:
Test Coverage
Changes
Testing
Documentation
🤖 Generated with Claude Code