
Conversation

@irjudson
Member

Summary

This PR fixes cluster discovery in the validation service and expands the API documentation to cover the distributed control and validation features.

Changes

Bug Fixes

  • Validation service cluster discovery: Fixed validation.js to use the server global instead of the undefined harperCluster for cluster topology discovery, matching the approach used in sync-engine.js (see the sketch below)
    • Previously threw ReferenceError: harperCluster is not defined
    • Validation now runs successfully and writes its results to the SyncAudit table
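
A minimal sketch of the shape of the fix, in JavaScript. The exact expression in validation.js is not shown in this PR; the property names (server.hostname, server.nodes, node.name) are taken from the cluster-discovery commits later in this thread.

```js
// Before: validation.js referenced an undefined `harperCluster` global
// for topology discovery, which threw
// "ReferenceError: harperCluster is not defined" at runtime.

// After: read topology from the `server` global that Harper provides,
// the same source sync-engine.js uses. Per the later commits in this
// PR, server.nodes lists only the OTHER cluster members, so the
// current node (server.hostname) is added explicitly.
const clusterNodes = [server.hostname, ...(server.nodes ?? []).map((n) => n.name)];
```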

Documentation Updates

  • Distributed sync control: Documented cluster-wide start, stop, and validate commands
  • Data validation: Added comprehensive documentation of validation checks (progress, smoke test, spot check)
  • REST API endpoints: Documented endpoints for accessing synced data tables, with authentication examples (see the example after this list)
  • Postman collection: Added documentation for included test collection
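
The documented access pattern, as a hedged example. Only the table name and the credentials appear in this PR; the host, the port (9926 is Harper's usual REST port), the limit parameter, and the use of Node 18+ global fetch are assumptions for illustration.

```js
// Hypothetical request against a synced table's REST endpoint.
const res = await fetch('http://localhost:9926/PortEvents/?limit=10', {
  headers: {
    // Basic auth with the collection's credentials (see below).
    Authorization: 'Basic ' + Buffer.from('admin:HarperRocks!').toString('base64'),
  },
});
console.log(await res.json());
```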

Postman Collection

  • Updated collection with correct REST endpoint paths (removed incorrect /bigquery-ingestor prefix)
  • Added data verification tests for PortEvents, VesselMetadata, VesselPositions
  • Fixed authentication to use correct credentials (admin / HarperRocks!)
  • Added validation result queries via SyncAudit endpoint

Cleanup

  • Removed temporary diagnostic scripts (check-bq-data.js, check-node-distribution.js, etc.)

Testing

  • ✅ Validation command executes successfully across cluster
  • ✅ Audit records written to SyncAudit table with detailed check results
  • ✅ All unit tests passing
  • ✅ Postman collection verified with live endpoints

Impact

  • Validation feature now functional for monitoring data integrity
  • Improved documentation for operators using distributed sync control
  • Clean repository without temporary diagnostic scripts

irjudson and others added 30 commits December 16, 2025 10:30

- Single record ('sync-control') for cluster-wide state
- Version field for race condition handling
- Supports start/stop/validate commands

- Constructor accepts array of SyncEngine instances
- Tracks processing state and version
- Basic getStatus() implementation

- Loads existing state or initializes to 'stop'
- Sets up table subscription
- Stubs for processCommand and subscription loop

- Iterates over AsyncIterable subscription
- Version-based deduplication
- Auto-restart on failure with 5s delay

- Switch statement for start/stop/validate
- Prevents concurrent processing
- Error handling per command

- startAllEngines: parallel start with failure tracking
- stopAllEngines: parallel stop, clears failures
- runValidation: delegates to global validator

- Add globals and tables to global declarations
- Make error extraction safer with optional chaining

- Import and initialize after syncEngines created
- Store in globals for access by SyncControl resource
- Logs initialization for observability

- GET returns global state + worker-specific status
- POST updates SyncControlState table instead of direct control
- Version-based coordination across all workers/nodes
- Bumped version to 2.0.0

- Add example JSON response showing global + worker state
- Clarify that control commands are now cluster-wide
- Document nodeId format and table-level status

- 4 test suites: Single Node, Restart Recovery, Multi-Worker, Data Queries
- 17 requests with automated test assertions
- Tests version increments, cluster-wide coordination, worker state
- Includes environment variable tracking for version numbers

- Prevent TypeError when GET called during startup
- Return initializing status if controlManager not ready
- Log warning if POST called before initialization complete
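
Condensed into one sketch, the control loop those commits describe might look like the following. The subscribe() call shape and the event/record field names are assumptions inferred from the commit notes; `tables` and `logger` are Harper-provided globals.

```js
// Watches the single 'sync-control' record and applies commands once,
// cluster-wide, using the version field for deduplication.
async function runSubscriptionLoop(manager) {
  for (;;) {
    try {
      // Harper table subscriptions are iterated as AsyncIterables.
      for await (const event of tables.SyncControlState.subscribe()) {
        const state = event.value;
        // Version-based deduplication: ignore already-processed commands.
        if (!state || state.version <= manager.version) continue;
        if (manager.processing) continue; // prevent concurrent processing
        manager.processing = true;
        try {
          switch (state.command) {
            case 'start':    await manager.startAllEngines(); break;
            case 'stop':     await manager.stopAllEngines();  break;
            case 'validate': await manager.runValidation();   break;
          }
          manager.version = state.version;
        } catch (err) {
          // Per-command error handling: one bad command doesn't kill the loop.
          logger.error(`sync-control command failed: ${err?.message}`);
        } finally {
          manager.processing = false;
        }
      }
    } catch (err) {
      // Auto-restart the subscription on failure after a 5s delay.
      logger.warn(`subscription loop stopped, restarting in 5s: ${err?.message}`);
      await new Promise((resolve) => setTimeout(resolve, 5000));
    }
  }
}
```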

Package Changes:
- Update package name to @harperdb/bigquery-ingestor
- Update description to emphasize data ingestion focus
- Update package-lock.json with new package name

Schema Changes:
- Rename schema/harper-bigquery-sync.graphql to bigquery-ingestor.graphql
- Update config.yaml and config.multi-table.yaml to reference new schema file

Documentation Updates:
- Update all GitHub issue links in ROADMAP.md
- Update installation instructions in README.md
- Update references in CONTRIBUTING.md and all docs/ files

- Set default credentials (HDB_ADMIN/password) at collection level
- All requests inherit auth from parent collection
- Updated collection description with auth instructions
- Users can change credentials in one place

Fixes authentication errors when running requests out of the box.

Bug: discoverCluster was trying to access node.hostname on entries of the
server.nodes array, but that property doesn't exist there. This caused
"undefined-0" node IDs and "Current node not found in cluster" errors.

Fix: Generate worker IDs using server.workerCount and server.hostname,
creating IDs like "harper-0-0", "harper-0-1" for multi-threaded workers.

Changes:
- Use server.workerCount to enumerate workers on current node
- Generate worker IDs as ${hostname}-${workerIndex}
- Remove incorrect server.nodes iteration
- Update debug logging to show worker count instead of nodes object

This properly handles multi-threaded Harper instances (e.g., --threads 2).

- Detect multi-node clusters via server.nodes array
- Enumerate all nodes and their workers
- Add debug logging to discover node object structure
- Try multiple common property names (hostname, host, id, name)
- Fallback to single-node mode if server.nodes unavailable

This should handle both:
- Multi-node clusters (3 nodes with 1 worker each)
- Single-node multi-threaded (1 node with 3 workers)

Debug logs will show actual node object structure for proper fix.

Critical Fix: server.nodes only contains OTHER cluster nodes, not the
current node. This caused "Current node not found in cluster" errors.

Changes:
- Add current node (server.hostname) to cluster list first
- Then add other nodes from server.nodes array
- Use node.name property (Harper's cluster node format)
- Apply workerCount to all nodes in cluster

This properly handles 3-node cluster with 2 workers each:
- harper-0-0, harper-0-1 (current node)
- harper-1-0, harper-1-1 (from server.nodes)
- harper-2-0, harper-2-1 (from server.nodes)
Total: 6 workers across 3 nodes
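
Putting the three discovery fixes together, the final logic might look like this sketch; the property names are the ones these commits cite.

```js
// Enumerate every worker in the cluster as `${hostname}-${workerIndex}`.
function discoverCluster() {
  const workerCount = server.workerCount ?? 1;
  // server.nodes holds only the OTHER members, so the current node
  // (server.hostname) must be added first.
  const hostnames = [server.hostname, ...(server.nodes ?? []).map((n) => n.name)];
  // Apply the same workerCount to every node, per the commit above.
  return hostnames.flatMap((host) =>
    Array.from({ length: workerCount }, (_, i) => `${host}-${i}`)
  );
}
```

On the 3-node, 2-worker cluster above this yields exactly the six IDs harper-0-0, harper-0-1, harper-1-0, harper-1-1, harper-2-0, and harper-2-1.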

Change: Use toISOString() when writing commandedAt to ensure proper
serialization to the database.

This ensures Harper can properly store and retrieve the timestamp value.

Bug: subscription was calling this.tables.SyncControlState.get() but
SyncControlManager doesn't have a tables property, causing:
TypeError: Cannot read properties of undefined (reading 'SyncControlState')

Fix: Use the global tables object declared at the top of the file instead of this.tables.

Fixes: ReferenceError: globals is not defined in runValidation method

Bug: JSON.stringify fails on objects with circular references like timers
Error: Converting circular structure to JSON

Fix: Remove JSON.stringify from debug logs and log just the key name instead.

Fixes: ReferenceError: logger is not defined when running CLI tools

The globals module is imported by CLI tools that run outside Harper's
runtime where logger is not available. Add typeof checks before using
logger to allow the code to work in both contexts.

Fixes: logger is not defined when running CLI tools

The config-loader module is imported by CLI tools that run outside
Harper's runtime. Added a safe wrapper object that checks logger
availability before calling it, allowing the code to work in both
Harper runtime and standalone CLI contexts.
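
A minimal sketch of that wrapper, assuming `logger` exists as a global only inside Harper's runtime (the console fallback is an assumption; the commit only requires the availability check):

```js
// Safe logger: forwards to Harper's `logger` when it exists, falls
// back to console in standalone CLI runs.
const safeLogger = Object.fromEntries(
  ['debug', 'info', 'warn', 'error'].map((level) => [
    level,
    (...args) =>
      typeof logger !== 'undefined' ? logger[level](...args) : console[level](...args),
  ])
);
```
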
Changed the initialize command to generate data starting 30 days ago
instead of starting from the current time. This ensures the sync engine
finds data when it starts, since its first query asks for timestamps
newer than the 1970 epoch.

The data will span from (now - 30 days) to (now - 30 days + scenario duration):
- small: 1 hour window
- realistic: 24 hour window
- stress: 7 day window
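
As a sketch, the window computation with those scenario durations (the helper name is hypothetical):

```js
const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// Scenario durations from the commit message above.
const SCENARIO_DURATION_MS = { small: HOUR_MS, realistic: 24 * HOUR_MS, stress: 7 * DAY_MS };

// Backdate generation so the sync engine's first query (timestamps
// newer than the 1970 epoch) immediately finds data.
function generationWindow(scenario, now = Date.now()) {
  const start = now - 30 * DAY_MS;
  return { start, end: start + SCENARIO_DURATION_MS[scenario] };
}
```
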
Added clearAllTables() and deleteAllTables() methods to MultiTableOrchestrator:
- clear: Deletes all data from tables but preserves schema
- clean: Deletes tables entirely (schema and data)

Updated CLI to support these commands in multi-table mode.
Both commands operate on all three tables: vessel_positions, port_events,
and vessel_metadata.

Updated help message with better examples showing scenario usage.

BigQuery free tier does not support DML queries (DELETE statements).
Updated implementation to work around this limitation:

- clearAllTables(): Now deletes and recreates tables to clear data
  while preserving schema (free tier compatible)

- truncateTables(): Same approach - delete and recreate tables

- cleanupOldData(): Added error handling to detect free tier DML
  errors and skip cleanup with a helpful message

Users on the free tier can still use all commands:
- clear: Works by recreating tables
- clean: Already worked (just deletes tables)
- start: Works, but automatic cleanup is skipped (manual clear needed)

This allows the tool to work fully in BigQuery's free tier without
requiring billing to be enabled.
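
A sketch of the free-tier-safe clear using the @google-cloud/bigquery client; how the orchestrator obtains each table's schema is an assumption here.

```js
const { BigQuery } = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

// Clear a table without DML: drop it, then recreate it with the same
// schema. Same end state as DELETE FROM, but allowed on the free tier.
async function clearTable(datasetId, tableId, schema) {
  const dataset = bigquery.dataset(datasetId);
  try {
    await dataset.table(tableId).delete();
  } catch (err) {
    if (err.code !== 404) throw err; // a missing table is fine to "clear"
  }
  await dataset.createTable(tableId, { schema });
}
```
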
Harper requires ID fields to be strings (or arrays of strings), but the
sync engine was generating numeric IDs. This caused validation errors:
'Value X in property id must be a string'

Changed ID generation to convert the numeric hash to a string before
storing records. This maintains deterministic ID generation while
satisfying Harper's type requirements.
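
The fix amounts to a type change at the end of ID generation; hashRecord() below is a hypothetical stand-in for the engine's existing deterministic hash.

```js
// Hypothetical stand-in for the engine's deterministic numeric hash.
function hashRecord(record) {
  let h = 0;
  for (const ch of JSON.stringify(record)) h = (h * 31 + ch.codePointAt(0)) >>> 0;
  return h;
}

function makeId(record) {
  return String(hashRecord(record)); // Harper requires string IDs, not numbers
}
```
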
- Log full SQL query at INFO level
- Log all query parameters (nodeId, clusterSize, lastTimestamp, batchSize)
- Add warning with params when query returns 0 results
- This will help identify why queries return 0 despite data existing

- Log raw BigQuery record structure (keys and sample)
- Log converted record structure after type conversion
- Log mapped record structure before writing to Harper
- This will help identify where BigQuery fields are being lost

…rtitioning

PROBLEM: All generated timestamps had .000 microseconds because JavaScript
Date only has millisecond precision. This caused MOD(UNIX_MICROS(timestamp), 6)
to only produce even values (0, 2, 4), leaving nodes 1, 3, 5 with zero records.

SOLUTION: Added toISOStringWithMicros() helper that:
- Preserves fractional milliseconds from timestamp generation
- Adds random microseconds (0-999) for additional distribution
- Formats as ISO 8601 with 6-digit microsecond precision

IMPACT: Data will now distribute evenly across all 6 cluster nodes.

Modified:
- src/generator.js: vessel_positions timestamp
- tools/maritime-data-synthesizer/generators/port-events-generator.js: event_time
- tools/maritime-data-synthesizer/generators/vessel-metadata-generator.js: last_updated
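
A sketch of the described helper (the shipped implementation may differ); `tsMs` is a millisecond timestamp that may carry a fractional part from the generator.

```js
function toISOStringWithMicros(tsMs) {
  const wholeMs = Math.floor(tsMs);
  // Preserve fractional milliseconds, then mix in random microseconds
  // (0-999) so MOD(UNIX_MICROS(timestamp), 6) hits odd values too.
  const fracMicros = Math.round((tsMs - wholeMs) * 1000);
  const micros = (fracMicros + Math.floor(Math.random() * 1000)) % 1000;
  // toISOString() gives millisecond precision; append three more digits
  // for a 6-digit fractional second, e.g. 2025-12-16T10:30:00.123456Z.
  return new Date(wholeMs).toISOString().replace('Z', String(micros).padStart(3, '0') + 'Z');
}
```
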
When regenerating BigQuery data or restarting sync, records with the same
deterministic IDs may already exist in Harper. Using put instead of create
prevents 'Record already exists' errors and allows graceful updates.
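
In Harper's component API the change is essentially a one-word swap (the table name is one this PR verifies):

```js
// create() rejects when a record with the same deterministic ID exists;
// put() inserts or overwrites, making re-runs of the sync idempotent.
await tables.VesselPositions.put(mappedRecord);
```
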
- Fix validation.js to use server global instead of harperCluster for cluster discovery
- Update Postman collection with correct REST endpoints and data verification tests
- Add comprehensive validation documentation to README
- Document distributed sync control commands (start, stop, validate)
- Document REST API endpoints for accessing synced data
- Remove temporary diagnostic scripts
irjudson merged commit 671d9f0 into main on Dec 22, 2025 (4 checks passed).