
Conversation

@irjudson
Member

Summary

This PR fixes cluster discovery in the validation service and expands the API documentation to cover the distributed control and validation features.

Changes

Bug Fixes

  • Validation service cluster discovery: Fixed validation.js to use the server global instead of the undefined harperCluster for cluster topology discovery, matching the approach used in sync-engine.js (see the sketch below)
    • Previously threw ReferenceError: harperCluster is not defined
    • Validation now runs successfully and writes its results to the SyncAudit table
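
A minimal sketch of the shape of the fix, in JavaScript. The exact expression in validation.js is not shown in this PR; the property names (server.hostname, server.nodes, node.name) are taken from the cluster-discovery commits later in this thread.

```js
// Before: validation.js referenced an undefined `harperCluster` global
// for topology discovery, which threw
// "ReferenceError: harperCluster is not defined" at runtime.

// After: read topology from the `server` global that Harper provides,
// the same source sync-engine.js uses. Per the later commits in this
// PR, server.nodes lists only the OTHER cluster members, so the
// current node (server.hostname) is added explicitly.
const clusterNodes = [server.hostname, ...(server.nodes ?? []).map((n) => n.name)];
```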

Documentation Updates

  • Distributed sync control: Documented cluster-wide start, stop, and validate commands
  • Data validation: Added comprehensive documentation of validation checks (progress, smoke test, spot check)
  • REST API endpoints: Documented endpoints for accessing synced data tables, with authentication examples (see the example after this list)
  • Postman collection: Added documentation for included test collection
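
The documented access pattern, as a hedged example. Only the table name and the credentials appear in this PR; the host, the port (9926 is Harper's usual REST port), the limit parameter, and the use of Node 18+ global fetch are assumptions for illustration.

```js
// Hypothetical request against a synced table's REST endpoint.
const res = await fetch('http://localhost:9926/PortEvents/?limit=10', {
  headers: {
    // Basic auth with the collection's credentials (see below).
    Authorization: 'Basic ' + Buffer.from('admin:HarperRocks!').toString('base64'),
  },
});
console.log(await res.json());
```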

Postman Collection

  • Updated collection with correct REST endpoint paths (removed incorrect /bigquery-ingestor prefix)
  • Added data verification tests for PortEvents, VesselMetadata, VesselPositions
  • Fixed authentication to use correct credentials (admin / HarperRocks!)
  • Added validation result queries via SyncAudit endpoint

Cleanup

  • Removed temporary diagnostic scripts (check-bq-data.js, check-node-distribution.js, etc.)

Testing

  • ✅ Validation command executes successfully across cluster
  • ✅ Audit records written to SyncAudit table with detailed check results
  • ✅ All unit tests passing
  • ✅ Postman collection verified with live endpoints

Impact

  • Validation feature now functional for monitoring data integrity
  • Improved documentation for operators using distributed sync control
  • Clean repository without temporary diagnostic scripts

irjudson and others added 30 commits December 16, 2025 10:30

- Single record ('sync-control') for cluster-wide state
- Version field for race condition handling
- Supports start/stop/validate commands

- Constructor accepts array of SyncEngine instances
- Tracks processing state and version
- Basic getStatus() implementation

- Loads existing state or initializes to 'stop'
- Sets up table subscription
- Stubs for processCommand and subscription loop

- Iterates over AsyncIterable subscription
- Version-based deduplication
- Auto-restart on failure with 5s delay

- Switch statement for start/stop/validate
- Prevents concurrent processing
- Error handling per command

- startAllEngines: parallel start with failure tracking
- stopAllEngines: parallel stop, clears failures
- runValidation: delegates to global validator

- Add globals and tables to global declarations
- Make error extraction safer with optional chaining

- Import and initialize after syncEngines created
- Store in globals for access by SyncControl resource
- Logs initialization for observability

- GET returns global state + worker-specific status
- POST updates SyncControlState table instead of direct control
- Version-based coordination across all workers/nodes
- Bumped version to 2.0.0

- Add example JSON response showing global + worker state
- Clarify that control commands are now cluster-wide
- Document nodeId format and table-level status

- 4 test suites: Single Node, Restart Recovery, Multi-Worker, Data Queries
- 17 requests with automated test assertions
- Tests version increments, cluster-wide coordination, worker state
- Includes environment variable tracking for version numbers

- Prevent TypeError when GET called during startup
- Return initializing status if controlManager not ready
- Log warning if POST called before initialization complete
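
Condensed into one sketch, the control loop those commits describe might look like the following. The subscribe() call shape and the event/record field names are assumptions inferred from the commit notes; `tables` and `logger` are Harper-provided globals.

```js
// Watches the single 'sync-control' record and applies commands once,
// cluster-wide, using the version field for deduplication.
async function runSubscriptionLoop(manager) {
  for (;;) {
    try {
      // Harper table subscriptions are iterated as AsyncIterables.
      for await (const event of tables.SyncControlState.subscribe()) {
        const state = event.value;
        // Version-based deduplication: ignore already-processed commands.
        if (!state || state.version <= manager.version) continue;
        if (manager.processing) continue; // prevent concurrent processing
        manager.processing = true;
        try {
          switch (state.command) {
            case 'start':    await manager.startAllEngines(); break;
            case 'stop':     await manager.stopAllEngines();  break;
            case 'validate': await manager.runValidation();   break;
          }
          manager.version = state.version;
        } catch (err) {
          // Per-command error handling: one bad command doesn't kill the loop.
          logger.error(`sync-control command failed: ${err?.message}`);
        } finally {
          manager.processing = false;
        }
      }
    } catch (err) {
      // Auto-restart the subscription on failure after a 5s delay.
      logger.warn(`subscription loop stopped, restarting in 5s: ${err?.message}`);
      await new Promise((resolve) => setTimeout(resolve, 5000));
    }
  }
}
```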

Package Changes:
- Update package name to @harperdb/bigquery-ingestor
- Update description to emphasize data ingestion focus
- Update package-lock.json with new package name

Schema Changes:
- Rename schema/harper-bigquery-sync.graphql to bigquery-ingestor.graphql
- Update config.yaml and config.multi-table.yaml to reference new schema file

Documentation Updates:
- Update all GitHub issue links in ROADMAP.md
- Update installation instructions in README.md
- Update references in CONTRIBUTING.md and all docs/ files

- Set default credentials (HDB_ADMIN/password) at collection level
- All requests inherit auth from parent collection
- Updated collection description with auth instructions
- Users can change credentials in one place

Fixes authentication errors when running requests out of the box.

Bug: discoverCluster was trying to access node.hostname on entries of the
server.nodes array, but that property doesn't exist there. This caused
"undefined-0" node IDs and "Current node not found in cluster" errors.

Fix: Generate worker IDs using server.workerCount and server.hostname,
creating IDs like "harper-0-0", "harper-0-1" for multi-threaded workers.

Changes:
- Use server.workerCount to enumerate workers on current node
- Generate worker IDs as ${hostname}-${workerIndex}
- Remove incorrect server.nodes iteration
- Update debug logging to show worker count instead of nodes object

This properly handles multi-threaded Harper instances (e.g., --threads 2).

- Detect multi-node clusters via server.nodes array
- Enumerate all nodes and their workers
- Add debug logging to discover node object structure
- Try multiple common property names (hostname, host, id, name)
- Fallback to single-node mode if server.nodes unavailable

This should handle both:
- Multi-node clusters (3 nodes with 1 worker each)
- Single-node multi-threaded (1 node with 3 workers)

Debug logs will show actual node object structure for proper fix.

Critical Fix: server.nodes only contains OTHER cluster nodes, not the
current node. This caused "Current node not found in cluster" errors.

Changes:
- Add current node (server.hostname) to cluster list first
- Then add other nodes from server.nodes array
- Use node.name property (Harper's cluster node format)
- Apply workerCount to all nodes in cluster

This properly handles 3-node cluster with 2 workers each:
- harper-0-0, harper-0-1 (current node)
- harper-1-0, harper-1-1 (from server.nodes)
- harper-2-0, harper-2-1 (from server.nodes)
Total: 6 workers across 3 nodes
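
Putting the three discovery fixes together, the final logic might look like this sketch; the property names are the ones these commits cite.

```js
// Enumerate every worker in the cluster as `${hostname}-${workerIndex}`.
function discoverCluster() {
  const workerCount = server.workerCount ?? 1;
  // server.nodes holds only the OTHER members, so the current node
  // (server.hostname) must be added first.
  const hostnames = [server.hostname, ...(server.nodes ?? []).map((n) => n.name)];
  // Apply the same workerCount to every node, per the commit above.
  return hostnames.flatMap((host) =>
    Array.from({ length: workerCount }, (_, i) => `${host}-${i}`)
  );
}
```

On the 3-node, 2-worker cluster above this yields exactly the six IDs harper-0-0, harper-0-1, harper-1-0, harper-1-1, harper-2-0, and harper-2-1.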

Change: Use toISOString() when writing commandedAt to ensure proper
serialization to the database.

This ensures Harper can properly store and retrieve the timestamp value.

Bug: subscription was calling this.tables.SyncControlState.get() but
SyncControlManager doesn't have a tables property, causing:
TypeError: Cannot read properties of undefined (reading 'SyncControlState')

Fix: Use the global tables object declared at the top of the file instead of this.tables.

Fixes: ReferenceError: globals is not defined in runValidation method

Bug: JSON.stringify fails on objects with circular references like timers
Error: Converting circular structure to JSON

Fix: Remove JSON.stringify from debug logs and log just the key name instead.

Fixes: ReferenceError: logger is not defined when running CLI tools

The globals module is imported by CLI tools that run outside Harper's
runtime where logger is not available. Add typeof checks before using
logger to allow the code to work in both contexts.

Fixes: logger is not defined when running CLI tools

The config-loader module is imported by CLI tools that run outside
Harper's runtime. Added a safe wrapper object that checks logger
availability before calling it, allowing the code to work in both
Harper runtime and standalone CLI contexts.
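
A minimal sketch of that wrapper, assuming `logger` exists as a global only inside Harper's runtime (the console fallback is an assumption; the commit only requires the availability check):

```js
// Safe logger: forwards to Harper's `logger` when it exists, falls
// back to console in standalone CLI runs.
const safeLogger = Object.fromEntries(
  ['debug', 'info', 'warn', 'error'].map((level) => [
    level,
    (...args) =>
      typeof logger !== 'undefined' ? logger[level](...args) : console[level](...args),
  ])
);
```
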
Changed the initialize command to generate data starting 30 days ago
instead of starting from the current time. This ensures the sync engine
finds data when it starts, since its first query asks for timestamps
newer than the 1970 epoch.

The data will span from (now - 30 days) to (now - 30 days + scenario duration):
- small: 1 hour window
- realistic: 24 hour window
- stress: 7 day window
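
As a sketch, the window computation with those scenario durations (the helper name is hypothetical):

```js
const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// Scenario durations from the commit message above.
const SCENARIO_DURATION_MS = { small: HOUR_MS, realistic: 24 * HOUR_MS, stress: 7 * DAY_MS };

// Backdate generation so the sync engine's first query (timestamps
// newer than the 1970 epoch) immediately finds data.
function generationWindow(scenario, now = Date.now()) {
  const start = now - 30 * DAY_MS;
  return { start, end: start + SCENARIO_DURATION_MS[scenario] };
}
```
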
Added clearAllTables() and deleteAllTables() methods to MultiTableOrchestrator:
- clear: Deletes all data from tables but preserves schema
- clean: Deletes tables entirely (schema and data)

Updated CLI to support these commands in multi-table mode.
Both commands operate on all three tables: vessel_positions, port_events,
and vessel_metadata.

Updated help message with better examples showing scenario usage.

BigQuery free tier does not support DML queries (DELETE statements).
Updated implementation to work around this limitation:

- clearAllTables(): Now deletes and recreates tables to clear data
  while preserving schema (free tier compatible)

- truncateTables(): Same approach - delete and recreate tables

- cleanupOldData(): Added error handling to detect free tier DML
  errors and skip cleanup with a helpful message

Users on the free tier can still use all commands:
- clear: Works by recreating tables
- clean: Already worked (just deletes tables)
- start: Works, but automatic cleanup is skipped (manual clear needed)

This allows the tool to work fully in BigQuery's free tier without
requiring billing to be enabled.
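
A sketch of the free-tier-safe clear using the @google-cloud/bigquery client; how the orchestrator obtains each table's schema is an assumption here.

```js
const { BigQuery } = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

// Clear a table without DML: drop it, then recreate it with the same
// schema. Same end state as DELETE FROM, but allowed on the free tier.
async function clearTable(datasetId, tableId, schema) {
  const dataset = bigquery.dataset(datasetId);
  try {
    await dataset.table(tableId).delete();
  } catch (err) {
    if (err.code !== 404) throw err; // a missing table is fine to "clear"
  }
  await dataset.createTable(tableId, { schema });
}
```
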
Harper requires ID fields to be strings (or arrays of strings), but the
sync engine was generating numeric IDs. This caused validation errors:
'Value X in property id must be a string'

Changed ID generation to convert the numeric hash to a string before
storing records. This maintains deterministic ID generation while
satisfying Harper's type requirements.
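
The fix amounts to a type change at the end of ID generation; hashRecord() below is a hypothetical stand-in for the engine's existing deterministic hash.

```js
// Hypothetical stand-in for the engine's deterministic numeric hash.
function hashRecord(record) {
  let h = 0;
  for (const ch of JSON.stringify(record)) h = (h * 31 + ch.codePointAt(0)) >>> 0;
  return h;
}

function makeId(record) {
  return String(hashRecord(record)); // Harper requires string IDs, not numbers
}
```
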
- Log full SQL query at INFO level
- Log all query parameters (nodeId, clusterSize, lastTimestamp, batchSize)
- Add warning with params when query returns 0 results
- This will help identify why queries return 0 despite data existing

- Log raw BigQuery record structure (keys and sample)
- Log converted record structure after type conversion
- Log mapped record structure before writing to Harper
- This will help identify where BigQuery fields are being lost

…rtitioning

PROBLEM: All generated timestamps had .000 microseconds because JavaScript
Date only has millisecond precision. This caused MOD(UNIX_MICROS(timestamp), 6)
to only produce even values (0, 2, 4), leaving nodes 1, 3, 5 with zero records.

SOLUTION: Added toISOStringWithMicros() helper that:
- Preserves fractional milliseconds from timestamp generation
- Adds random microseconds (0-999) for additional distribution
- Formats as ISO 8601 with 6-digit microsecond precision

IMPACT: Data will now distribute evenly across all 6 cluster nodes.

Modified:
- src/generator.js: vessel_positions timestamp
- tools/maritime-data-synthesizer/generators/port-events-generator.js: event_time
- tools/maritime-data-synthesizer/generators/vessel-metadata-generator.js: last_updated
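
A sketch of the described helper (the shipped implementation may differ); `tsMs` is a millisecond timestamp that may carry a fractional part from the generator.

```js
function toISOStringWithMicros(tsMs) {
  const wholeMs = Math.floor(tsMs);
  // Preserve fractional milliseconds, then mix in random microseconds
  // (0-999) so MOD(UNIX_MICROS(timestamp), 6) hits odd values too.
  const fracMicros = Math.round((tsMs - wholeMs) * 1000);
  const micros = (fracMicros + Math.floor(Math.random() * 1000)) % 1000;
  // toISOString() gives millisecond precision; append three more digits
  // for a 6-digit fractional second, e.g. 2025-12-16T10:30:00.123456Z.
  return new Date(wholeMs).toISOString().replace('Z', String(micros).padStart(3, '0') + 'Z');
}
```
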
When regenerating BigQuery data or restarting sync, records with the same
deterministic IDs may already exist in Harper. Using put instead of create
prevents 'Record already exists' errors and allows graceful updates.
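
In Harper's component API the change is essentially a one-word swap (the table name is one this PR verifies):

```js
// create() rejects when a record with the same deterministic ID exists;
// put() inserts or overwrites, making re-runs of the sync idempotent.
await tables.VesselPositions.put(mappedRecord);
```
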
- Fix validation.js to use server global instead of harperCluster for cluster discovery
- Update Postman collection with correct REST endpoints and data verification tests
- Add comprehensive validation documentation to README
- Document distributed sync control commands (start, stop, validate)
- Document REST API endpoints for accessing synced data
- Remove temporary diagnostic scripts
irjudson merged commit 671d9f0 into main on Dec 22, 2025 (4 checks passed).