Fix validation service and update API documentation #19
Merged
Conversation
- Single record ('sync-control') for cluster-wide state
- Version field for race condition handling
- Supports start/stop/validate commands
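For orientation, here is a minimal sketch of what the single control record might look like; the field names are assumptions inferred from the commits in this PR, not the actual schema.

```js
// Hypothetical shape of the single cluster-wide control record.
// Field names are inferred from the commit messages in this PR.
const controlRecord = {
  id: 'sync-control',                    // fixed ID: every worker reads the same record
  command: 'start',                      // one of 'start' | 'stop' | 'validate'
  version: 1,                            // bumped on every write; used for deduplication
  commandedAt: new Date().toISOString(), // stored as an ISO string (see a later commit)
};
```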
- Constructor accepts an array of SyncEngine instances
- Tracks processing state and version
- Basic getStatus() implementation
- Loads existing state or initializes to 'stop'
- Sets up table subscription
- Stubs for processCommand and subscription loop
- Iterates over AsyncIterable subscription
- Version-based deduplication
- Auto-restart on failure with 5s delay (see the sketch below)
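A minimal sketch of such a loop, assuming Harper's tables global exposes a subscribe() AsyncIterable and a get() lookup as the commits describe; method and property names here are illustrative.

```js
// Sketch of the subscription loop: re-read the control record on each
// change event, skip versions already processed, and restart after 5s
// if the subscription throws.
async startSubscriptionLoop() {
  for (;;) {
    try {
      for await (const event of tables.SyncControlState.subscribe()) {
        const record = await tables.SyncControlState.get('sync-control');
        // Version-based deduplication: ignore commands we have already handled.
        if (!record || record.version <= this.lastProcessedVersion) continue;
        this.lastProcessedVersion = record.version;
        await this.processCommand(record);
      }
    } catch (err) {
      logger.error(`Subscription loop failed: ${err?.message}; restarting in 5s`);
      await new Promise((resolve) => setTimeout(resolve, 5000));
    }
  }
}
```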
- Switch statement for start/stop/validate
- Prevents concurrent processing
- Error handling per command
- startAllEngines: parallel start with failure tracking (sketched below)
- stopAllEngines: parallel stop, clears failures
- runValidation: delegates to global validator
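One plausible shape for the parallel start, assuming each SyncEngine exposes a start() method; Promise.allSettled keeps one failing engine from blocking the others.

```js
// Sketch: start all engines in parallel and remember which ones failed.
async startAllEngines() {
  const results = await Promise.allSettled(this.engines.map((engine) => engine.start()));
  this.failedEngines = results
    .map((result, i) => (result.status === 'rejected' ? this.engines[i] : null))
    .filter(Boolean);
}
```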
- Add globals and tables to global declarations
- Make error extraction safer with optional chaining
- Import and initialize after syncEngines are created
- Store in globals for access by the SyncControl resource
- Logs initialization for observability
- GET returns global state + worker-specific status
- POST updates the SyncControlState table instead of direct control
- Version-based coordination across all workers/nodes
- Bumped version to 2.0.0
- Add example JSON response showing global + worker state (illustrated below)
- Clarify that control commands are now cluster-wide
- Document nodeId format and table-level status
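A hedged illustration of what the documented GET response might look like; the exact field names are assumptions based on the commits above, not the shipped example.

```json
{
  "global": {
    "command": "start",
    "version": 4,
    "commandedAt": "2025-01-15T12:00:00.000Z"
  },
  "worker": {
    "nodeId": "harper-0-1",
    "processing": false
  }
}
```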
- 4 test suites: Single Node, Restart Recovery, Multi-Worker, Data Queries
- 17 requests with automated test assertions
- Tests version increments, cluster-wide coordination, worker state
- Includes environment variable tracking for version numbers
- Prevent TypeError when GET is called during startup
- Return initializing status if controlManager is not ready
- Log a warning if POST is called before initialization completes

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Package Changes:
- Update package name to @harperdb/bigquery-ingestor
- Update description to emphasize data ingestion focus
- Update package-lock.json with new package name

Schema Changes:
- Rename schema/harper-bigquery-sync.graphql to bigquery-ingestor.graphql
- Update config.yaml and config.multi-table.yaml to reference the new schema file

Documentation Updates:
- Update all GitHub issue links in ROADMAP.md
- Update installation instructions in README.md
- Update references in CONTRIBUTING.md and all docs/ files

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Set default credentials (HDB_ADMIN/password) at the collection level
- All requests inherit auth from the parent collection
- Updated the collection description with auth instructions
- Users can change credentials in one place

Fixes authentication errors when running requests out of the box.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Bug: discoverCluster was trying to access node.hostname from server.nodes
array, but these properties don't exist. This caused "undefined-0" node IDs
and "Current node not found in cluster" errors.
Fix: Generate worker IDs using server.workerCount and server.hostname,
creating IDs like "harper-0-0", "harper-0-1" for multi-threaded workers.
Changes:
- Use server.workerCount to enumerate workers on current node
- Generate worker IDs as ${hostname}-${workerIndex}
- Remove incorrect server.nodes iteration
- Update debug logging to show worker count instead of nodes object
This properly handles multi-threaded Harper instances (e.g., --threads 2).
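A minimal sketch of the corrected enumeration, using only the server.hostname and server.workerCount globals named in this commit:

```js
// Enumerate worker IDs for the current node, e.g. "harper-0-0", "harper-0-1".
function enumerateLocalWorkers(server) {
  const ids = [];
  for (let workerIndex = 0; workerIndex < server.workerCount; workerIndex++) {
    ids.push(`${server.hostname}-${workerIndex}`);
  }
  return ids;
}
```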
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
- Detect multi-node clusters via the server.nodes array
- Enumerate all nodes and their workers
- Add debug logging to discover the node object structure
- Try multiple common property names (hostname, host, id, name)
- Fall back to single-node mode if server.nodes is unavailable

This should handle both:
- Multi-node clusters (3 nodes with 1 worker each)
- Single-node multi-threaded (1 node with 3 workers)

Debug logs will show the actual node object structure for a proper fix.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Critical Fix: server.nodes only contains OTHER cluster nodes, not the current node. This caused "Current node not found in cluster" errors.

Changes:
- Add the current node (server.hostname) to the cluster list first
- Then add other nodes from the server.nodes array
- Use the node.name property (Harper's cluster node format)
- Apply workerCount to all nodes in the cluster

This properly handles a 3-node cluster with 2 workers each:
- harper-0-0, harper-0-1 (current node)
- harper-1-0, harper-1-1 (from server.nodes)
- harper-2-0, harper-2-1 (from server.nodes)

Total: 6 workers across 3 nodes

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
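A sketch of the corrected discovery under the assumptions in this commit (server.nodes holds only the other nodes, each exposing a name property):

```js
// Build the full worker list: current node first, then peers from server.nodes.
function discoverCluster(server) {
  const hosts = [server.hostname, ...(server.nodes ?? []).map((node) => node.name)];
  return hosts.flatMap((host) =>
    Array.from({ length: server.workerCount }, (_, i) => `${host}-${i}`)
  );
}
```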
Change: Use toISOString() when writing commandedAt to ensure proper serialization to the database. This ensures Harper can properly store and retrieve the timestamp value.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Bug: the subscription was calling this.tables.SyncControlState.get(), but SyncControlManager doesn't have a tables property, causing:
TypeError: Cannot read properties of undefined (reading 'SyncControlState')

Fix: Use the global tables object declared at the top of the file instead of this.tables
Fixes: ReferenceError: globals is not defined in runValidation method
Bug: JSON.stringify fails on objects with circular references, like timers.
Error: Converting circular structure to JSON
Fix: Remove JSON.stringify from debug logs; just log the key name instead
Fixes: ReferenceError: logger is not defined when running CLI tools

The globals module is imported by CLI tools that run outside Harper's runtime, where logger is not available. Added typeof checks before using logger so the code works in both contexts.
Fixes: logger is not defined when running CLI tools

The config-loader module is imported by CLI tools that run outside Harper's runtime. Added a safe wrapper object that checks logger availability before calling it, allowing the code to work in both the Harper runtime and standalone CLI contexts. A sketch follows below.
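A minimal sketch of such a wrapper; assuming logger is a global that exists only inside Harper's runtime, typeof guards make the module safe to import from standalone CLI tools.

```js
// Safe logger: use Harper's logger when present, fall back to console otherwise.
const safeLogger = {
  info: (...args) =>
    typeof logger !== 'undefined' ? logger.info(...args) : console.log(...args),
  error: (...args) =>
    typeof logger !== 'undefined' ? logger.error(...args) : console.error(...args),
};
```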
Changed the initialize command to generate data starting 30 days ago instead of from the current time. This ensures the sync engine finds data when it starts, since it queries for timestamps > 1970. The data spans from (now - 30 days) to (now - 30 days + scenario duration):
- small: 1 hour window
- realistic: 24 hour window
- stress: 7 day window
Added clearAllTables() and deleteAllTables() methods to MultiTableOrchestrator:
- clear: Deletes all data from tables but preserves schema
- clean: Deletes tables entirely (schema and data)

Updated the CLI to support these commands in multi-table mode. Both commands operate on all three tables: vessel_positions, port_events, and vessel_metadata. Updated the help message with better examples showing scenario usage.
BigQuery's free tier does not support DML queries (DELETE statements). Updated the implementation to work around this limitation:
- clearAllTables(): Now deletes and recreates tables to clear data while preserving schema (free tier compatible)
- truncateTables(): Same approach - delete and recreate tables
- cleanupOldData(): Added error handling to detect free-tier DML errors and skip cleanup with a helpful message

Users in the free tier can still use all commands:
- clear: Works by recreating tables
- clean: Already worked (just deletes tables)
- start: Works, but automatic cleanup is skipped (manual clear needed)

This allows the tool to work fully in BigQuery's free tier without requiring billing to be enabled.
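A sketch of the delete-and-recreate approach, assuming the @google-cloud/bigquery client; dataset and table names are illustrative.

```js
const { BigQuery } = require('@google-cloud/bigquery');

// Clear a table without DML: capture its schema, drop it, recreate it empty.
async function clearTable(datasetId, tableId) {
  const dataset = new BigQuery().dataset(datasetId);
  const [metadata] = await dataset.table(tableId).getMetadata();
  await dataset.table(tableId).delete();                           // drops data and schema
  await dataset.createTable(tableId, { schema: metadata.schema }); // empty table, same schema
}
```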
Harper requires ID fields to be strings (or arrays of strings), but the sync engine was generating numeric IDs. This caused validation errors:
'Value X in property id must be a string'

Changed ID generation to convert the numeric hash to a string before storing records. This maintains deterministic ID generation while satisfying Harper's type requirements.
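The fix itself amounts to a one-line conversion; hashRecord here is a hypothetical stand-in for the engine's deterministic hash function.

```js
// Same deterministic hash, but stored as a string to satisfy Harper's ID type.
const id = String(hashRecord(record));
```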
- Log the full SQL query at INFO level
- Log all query parameters (nodeId, clusterSize, lastTimestamp, batchSize)
- Add a warning with params when a query returns 0 results
- This will help identify why queries return 0 despite data existing
- Log the raw BigQuery record structure (keys and a sample)
- Log the converted record structure after type conversion
- Log the mapped record structure before writing to Harper
- This will help identify where BigQuery fields are being lost
…rtitioning

PROBLEM: All generated timestamps had .000 microseconds because JavaScript Date only has millisecond precision. This caused MOD(UNIX_MICROS(timestamp), 6) to produce only even values (0, 2, 4), leaving nodes 1, 3, 5 with zero records.

SOLUTION: Added a toISOStringWithMicros() helper that:
- Preserves fractional milliseconds from timestamp generation
- Adds random microseconds (0-999) for additional distribution
- Formats as ISO 8601 with 6-digit microsecond precision

IMPACT: Data now distributes evenly across all 6 cluster nodes.

Modified:
- src/generator.js: vessel_positions timestamp
- tools/maritime-data-synthesizer/generators/port-events-generator.js: event_time
- tools/maritime-data-synthesizer/generators/vessel-metadata-generator.js: last_updated
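A simplified sketch of such a helper: the random sub-millisecond digits exist purely to spread UNIX_MICROS values across nodes. The real helper also carries fractional milliseconds forward from the generator, which this sketch omits.

```js
// Append random microseconds (0-999) to get 6-digit fractional seconds,
// e.g. "2025-01-15T12:00:00.123456Z".
function toISOStringWithMicros(date) {
  const micros = Math.floor(Math.random() * 1000);
  return date.toISOString().replace('Z', String(micros).padStart(3, '0') + 'Z');
}
```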
When regenerating BigQuery data or restarting sync, records with the same deterministic IDs may already exist in Harper. Using put instead of create prevents 'Record already exists' errors and allows graceful updates.
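In terms of Harper's table API, the change amounts to swapping the write call; the table name here is an assumption for illustration.

```js
// put() upserts by ID, so re-synced records update in place instead of
// failing with 'Record already exists'.
async function writeRecord(mappedRecord) {
  await tables.VesselPositions.put(mappedRecord); // previously create(), which throws on duplicates
}
```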
- Fix validation.js to use the server global instead of harperCluster for cluster discovery
- Update the Postman collection with correct REST endpoints and data verification tests
- Add comprehensive validation documentation to the README
- Document distributed sync control commands (start, stop, validate)
- Document REST API endpoints for accessing synced data
- Remove temporary diagnostic scripts
Summary
This PR fixes the validation service's cluster discovery and updates the API documentation to cover the distributed control and validation features.
Changes
Bug Fixes
- Updated `validation.js` to use the `server` global instead of `harperCluster` for cluster topology discovery, matching the approach used in `sync-engine.js`
- Fixes `ReferenceError: harperCluster is not defined`
- Validation results are recorded in the `SyncAudit` table
Documentation Updates
Postman Collection
- Updated requests to use the correct REST endpoints (with the `/bigquery-ingestor` prefix)
Cleanup
Testing
Impact