@irjudson
Member

Summary

Implements automatic Harper table creation via the Operations API, eliminating manual schema.graphql definitions for data tables. Tables are created dynamically at startup, driven by BigQuery schema introspection.

Closes #7

Key Discovery

The Harper Operations API is fully schemaless: a table created with only a hash_attribute accepts any fields on insert and indexes ALL of them automatically, without pre-definition. This made the implementation far simpler than the original design.

What's Included

Core Components

  • OperationsClient - HTTP client for Harper Operations API (describe_table, create_table)
  • SchemaManager - Orchestrates table creation and schema introspection
  • TypeMapper - Maps BigQuery types to Harper types (for documentation)
  • IndexStrategy - Documents indexing strategy (Harper auto-indexes everything)
  • Integer ID generation - Deterministic SHA256-based IDs for fast indexing

Integration

  • Modified src/index.js to initialize SchemaManager before SyncEngines
  • Ensures tables exist before syncing begins
  • Graceful degradation if Operations API unavailable (falls back to schema.graphql)
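
The fallback behavior above could be sketched roughly as follows. This is a minimal illustration, not the code in src/index.js: `initializeTables`, the `action: 'fallback'` marker, and the logger shape are hypothetical, and only `ensureTable` is a name taken from this PR.

```javascript
// Hypothetical startup helper: try dynamic creation for each table and,
// if the Operations API call fails, log a warning and carry on so that
// tables defined in schema.graphql can still be used.
async function initializeTables(schemaManager, tables, logger = console) {
  const results = [];
  for (const table of tables) {
    try {
      results.push(await schemaManager.ensureTable(table));
    } catch (err) {
      logger.warn(
        `Operations API unavailable for ${table}: ${err.message}; relying on schema.graphql`
      );
      results.push({ table, action: 'fallback' });
    }
  }
  return results;
}
```

Catching per table rather than around the whole loop means one failing table does not prevent the others from being created.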

Configuration

operations:
  host: localhost
  port: 9925
  username: admin
  password: password

How It Works

  1. On startup, SchemaManager.ensureTable() checks whether the Harper table exists
  2. If not, it creates the table with create_table, which requires only a hash_attribute
  3. SyncEngine inserts BigQuery data with deterministic integer IDs
  4. Harper automatically indexes ALL inserted fields (no schema pre-definition)
  5. New fields in BigQuery are automatically handled on next insert
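
Steps 1 and 2 amount to a check-then-create flow, which might look something like the sketch below. The method names `describeTable` and `createTable` come from this PR, but the argument shapes, the return objects, and the "already exists" message match are assumptions made for illustration.

```javascript
// Sketch of check-then-create. `client` is any object exposing
// describeTable() (resolves to null when the table is missing) and
// createTable(). Argument order here is an assumption.
async function ensureTable(client, { schema, table, hashAttribute = 'id' }) {
  const existing = await client.describeTable(schema, table);
  if (existing) {
    return { table, action: 'none' };
  }
  try {
    await client.createTable(schema, table, hashAttribute);
    return { table, action: 'created' };
  } catch (err) {
    // Another worker may have created the table between the check and the
    // create call; treating "already exists" as success keeps this idempotent.
    if (/already exists/i.test(String(err && err.message))) {
      return { table, action: 'none' };
    }
    throw err;
  }
}
```

No column list is passed anywhere: under the schemaless behavior described above, fields only appear once records are inserted.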

Performance Optimization

Integer Primary Keys:

  • Deterministic SHA256-based integer IDs from record data
  • Keeps primary keys as integers (no strings, GUIDs, or objects) for fast indexing
  • IDs are positive 53-bit safe integers
  • Same input always produces same ID (idempotent)
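
One way to derive such an ID is sketched below. It is an illustration under stated assumptions rather than the code in src/sync-engine.js: the function name `generateIntegerId`, the key-sorting step, and the choice to keep the top 53 bits of the first 8 digest bytes are all hypothetical.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: derive a positive, 53-bit-safe integer from a record.
// Top-level keys are sorted so property order does not change the result.
// Nested objects are serialized as-is, so their key order still matters.
function generateIntegerId(record) {
  const canonical = JSON.stringify(
    Object.keys(record).sort().map((key) => [key, record[key]])
  );
  const digest = createHash('sha256').update(canonical).digest();

  // Read the first 8 bytes as an unsigned 64-bit value and drop the low
  // 11 bits, leaving at most 53 bits (<= Number.MAX_SAFE_INTEGER).
  const id = Number(digest.readBigUInt64BE(0) >> 11n);

  // Map the (extremely unlikely) zero to 1 so the ID is strictly positive.
  return id === 0 ? 1 : id;
}

const a = generateIntegerId({ vessel: 'A1', ts: '2025-11-13T15:11:00Z' });
const b = generateIntegerId({ ts: '2025-11-13T15:11:00Z', vessel: 'A1' });
console.log(a === b, Number.isSafeInteger(a));
```

Truncating SHA256 to 53 bits trades collision resistance for integer keys: by the birthday bound, collisions become likely at roughly 10^8 distinct records, which is worth keeping in mind for very large tables.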

Benefits

✅ Zero manual schema definitions for data tables
✅ Automatic schema evolution (Harper handles new fields)
✅ Fast integer indexing for optimal performance
✅ Thread-safe and idempotent table creation
✅ Simple implementation (no polling, migrations, or distributed locking)
✅ Graceful fallback to schema.graphql if API unavailable

Testing

  • 97 tests passing (91 existing + 6 new integer ID tests)
  • New test files:
    • test/schema-manager.test.js
    • test/integer-id-generation.test.js
    • test/operations-client.test.js
    • test/type-mapper.test.js
    • test/index-strategy.test.js
  • Integration scripts for live testing in examples/

Changes

18 files changed, 1,734 insertions:

  • New: src/operations-client.js, src/schema-manager.js, src/type-mapper.js, src/index-strategy.js
  • Modified: src/index.js, src/sync-engine.js, src/config-loader.js, config.yaml
  • Tests: 6 new test files with comprehensive coverage
  • Docs: Updated design document and added implementation summary

Documentation

  • Updated docs/plans/2025-11-13-dynamic-table-creation-design.md with implementation notes
  • Added docs/internal/dynamic-table-creation-summary.md with complete feature summary
  • Documented the schemaless discovery and its implications

Migration Path

Existing deployments:

  • System tables remain in schema.graphql (SyncCheckpoint, SyncAudit)
  • Data tables can be removed from schema.graphql
  • Add Operations API credentials to config.yaml
  • Restart - tables created automatically

New deployments:

  • Only system tables in schema.graphql
  • Configure Operations API credentials
  • All data tables created dynamically

What We Didn't Build (Not Needed)

The original design included these components, but Harper's schemaless behavior makes them unnecessary:

  • ❌ SchemaLeaderElection polling (no periodic checks needed)
  • ❌ Schema migration logic (just insert, Harper handles it)
  • ❌ addColumns() operation (not supported by API)
  • ❌ Distributed locking for schema checks (no concurrent schema operations to coordinate)
  • ❌ Complex type change handling (everything schemaless)

Ready to merge! All tests passing, feature complete, fully documented.

irjudson and others added 8 commits November 13, 2025 15:11

Complete design for Issue #7 covering:
- Thread-safe table creation with check-then-act pattern
- Rich type mapping for all BigQuery types
- Smart indexing based on BigQuery metadata
- Automatic schema migration (additive only)
- Adaptive polling with exponential backoff
- Comprehensive error handling and circuit breaker
- Integration with existing codebase
- Testing strategy for concurrency and correctness

Changed from independent per-node polling to leader election pattern:
- SchemaLeaderElection uses SchemaLock table for coordination
- Only one node (leader) checks schemas at any time
- Dramatically reduces BigQuery and Harper API calls
- Eliminates race conditions during schema migrations
- Automatic failover via lock TTL (10min)
- Still uses adaptive backoff (5min to 30min)
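
The lock TTL and backoff bounds in this commit can be expressed as two small pure functions. This is only a sketch of the decision logic: the lock record shape (`holder`, `acquiredAt`), the function names, and the doubling factor are assumptions, and a real implementation would also need an atomic write to the SchemaLock table.

```javascript
const LOCK_TTL_MS = 10 * 60 * 1000; // failover after 10 minutes
const MIN_POLL_MS = 5 * 60 * 1000;  // backoff floor: 5 minutes
const MAX_POLL_MS = 30 * 60 * 1000; // backoff ceiling: 30 minutes

// A node may take leadership if there is no lock, if it already holds
// the lock, or if the current holder's lock has outlived the TTL.
function canTakeLeadership(lock, nodeId, now = Date.now()) {
  if (!lock) return true;
  if (lock.holder === nodeId) return true;
  return now - lock.acquiredAt >= LOCK_TTL_MS;
}

// Reset to the floor when a schema change was seen; otherwise double the
// interval (assumed factor) up to the ceiling.
function nextPollInterval(currentMs, schemaChanged) {
  if (schemaChanged) return MIN_POLL_MS;
  return Math.min(Math.max(currentMs, MIN_POLL_MS) * 2, MAX_POLL_MS);
}
```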

Benefits:
- N nodes polling → 1 node polling (N-1 fewer pollers, so roughly N times fewer API calls)
- No conflicts when adding columns
- Clean serialized schema evolution
- Simple and efficient

Implements core components for Issue #7 (dynamic table creation):

Components added:
- TypeMapper: Converts BigQuery types to Harper GraphQL types
- IndexStrategy: Determines which columns should be indexed
- OperationsClient: Wrapper for Harper Operations API
- SchemaManager: Orchestrates table creation and migrations
- SchemaLeaderElection: Distributed locking for schema checks
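
As an illustration of what the TypeMapper component does, a lookup-based mapping might look like the sketch below. The specific pairs are assumptions chosen for the example, not the mapping shipped in src/type-mapper.js; the `type` and `mode` properties follow the field objects BigQuery returns in table metadata.

```javascript
// Illustrative mapping only; the real table in src/type-mapper.js may differ.
const BIGQUERY_TO_HARPER = {
  STRING: 'String',
  INT64: 'Long',
  INTEGER: 'Long',
  FLOAT64: 'Float',
  FLOAT: 'Float',
  BOOL: 'Boolean',
  BOOLEAN: 'Boolean',
  TIMESTAMP: 'Date',
  DATETIME: 'Date',
  DATE: 'Date',
  BYTES: 'Bytes',
};

// Map one BigQuery field ({ name, type, mode }) to a GraphQL type string.
// Unknown types fall back to 'Any'; REPEATED fields become list types.
function mapField(field) {
  const base = BIGQUERY_TO_HARPER[String(field.type).toUpperCase()] ?? 'Any';
  return field.mode === 'REPEATED' ? `[${base}]` : base;
}
```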

Schema changes:
- Add SchemaLock table for distributed coordination

Test coverage:
- 60 new unit tests covering all components
- All 151 tests passing (91 existing + 60 new)
- TDD approach: tests written first, then implementation

Design:
- Thread-safe with check-then-act pattern
- Idempotent operations (already exists = success)
- No destructive changes (only additive)
- Versioned columns for type changes (e.g., column_v2)
- Adaptive polling with leader election

Next steps:
- Integrate with Harper Operations API
- Implement ensureTable in SchemaManager
- Wire into handleApplication lifecycle

Complete the main orchestration logic for dynamic table creation:

Implementation:
- ensureTable() fetches BigQuery schema via getMetadata()
- Checks if Harper table exists via Operations API
- Determines migration needs (create, migrate, or none)
- Creates new tables with attributes and indexes
- Adds new attributes to existing tables
- Returns detailed result with action taken

Test coverage:
- 3 new integration tests for ensureTable flow
- Test table creation from scratch
- Test schema migration (adding columns)
- Test no-op when schemas match
- All 63 tests passing (60 unit + 3 integration)

Design:
- Uses mocked OperationsClient for testing
- Orchestrates TypeMapper, IndexStrategy, and OperationsClient
- Thread-safe through check-then-act in Operations API
- Ready for actual API integration

Next steps:
- Implement HTTP methods in OperationsClient
- Wire into application lifecycle
- Test with real Harper Operations API

Added full HTTP implementation for interacting with Harper Operations API:

HTTP Infrastructure:
- makeRequest() method with POST request handling
- Basic Authentication support for secured endpoints
- Error handling for both HTTP errors and API errors
- Proper JSON serialization/deserialization

API Methods:
- describeTable(): Fetch table schema, returns null if not found
- createTable(): Create tables with attributes and indexes
- addAttributes(): Add columns via ALTER operations

Features:
- Idempotent operations (handles "already exists" errors gracefully)
- Per-attribute error handling in addAttributes()
- Automatic index creation after table creation
- Proper error propagation with status codes
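
A POST helper with Basic Authentication, plus a describeTable() that returns null for a missing table, could be sketched as below. The method names come from this commit; everything else is an assumption for illustration, including plain http, the root path, the `{ error }` response shape, the `statusCode` property, and the "not found" matching. `fetchImpl` is injectable so the sketch can be exercised without a live Harper instance.

```javascript
// Sketch of a POST helper for the Operations API (Node 18+ for global fetch).
async function makeRequest(config, body, fetchImpl = fetch) {
  const { host, port, username, password } = config;
  const token = Buffer.from(`${username}:${password}`).toString('base64');

  const response = await fetchImpl(`http://${host}:${port}/`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Basic ${token}`,
    },
    body: JSON.stringify(body),
  });

  // Parse defensively: a non-JSON body is surfaced as an error message.
  const text = await response.text();
  let payload = null;
  try {
    payload = text ? JSON.parse(text) : null;
  } catch {
    payload = { error: text };
  }

  // Treat both HTTP failures and API-level error payloads as errors.
  if (!response.ok || (payload && payload.error)) {
    const error = new Error((payload && payload.error) || `HTTP ${response.status}`);
    error.statusCode = response.status;
    throw error;
  }
  return payload;
}

// Resolves to the table description, or null if the table does not exist.
async function describeTable(config, schema, table, fetchImpl = fetch) {
  try {
    return await makeRequest(config, { operation: 'describe_table', schema, table }, fetchImpl);
  } catch (err) {
    if (err.statusCode === 404 || /does not exist|not found/i.test(err.message)) {
      return null;
    }
    throw err;
  }
}
```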

Testing:
- All 63 component tests passing
- All 91 main integration tests passing
- Lint clean (fixed useless-catch error)
- Prettier formatted

Ready for integration with actual Harper Operations API endpoint.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Major architectural pivot after integration testing with local Harper instance:

Discovery:
- Harper Operations API does NOT support ALTER or create_attribute_index operations
- Harper automatically indexes ALL fields in schemaless tables
- Tables created with just hash_attribute accept ANY fields on INSERT
- All inserted fields are automatically indexed and queryable
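
In request terms, the discovery means the create call names no columns at all and the insert call can carry whatever fields the source rows have. The sketch below builds the two request bodies; the helper names and sample values are hypothetical, and the payload keys follow the Operations API operations referenced in this PR (create_table with hash_attribute, then insert).

```javascript
// Build a create_table request: only the hash attribute is declared.
function buildCreateTable(schema, table, hashAttribute = 'id') {
  return { operation: 'create_table', schema, table, hash_attribute: hashAttribute };
}

// Build an insert request: records may carry any fields, including ones
// the table has never seen before.
function buildInsert(schema, table, records) {
  return { operation: 'insert', schema, table, records };
}

const create = buildCreateTable('data', 'vessel_positions');
const insert = buildInsert('data', 'vessel_positions', [
  { id: 1234567890, latitude: 47.6, longitude: -122.3, speed_knots: 12.5 },
]);
console.log(JSON.stringify(create));
console.log(JSON.stringify(insert));
```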

Implementation changes:
- Simplified OperationsClient.createTable() to only require hash_attribute
- Removed complex schema migration logic from SchemaManager
- Harper handles schema evolution automatically during insert
- No need to pre-define columns or indexes

Performance optimization:
- Generate deterministic integer IDs from SHA256 hash of record data
- Ensures fast indexing (no strings/GUIDs/objects as primary keys)
- IDs are positive 53-bit safe integers for JavaScript compatibility
- Same input always produces same ID for idempotency

Test coverage:
- Added comprehensive integer ID generation tests (6 new tests)
- Updated schema manager tests to match simplified API
- All 97 tests passing (91 main + 6 new ID tests)
- Added debug-operations-api.js investigation script

Files modified:
- src/operations-client.js: Simplified to match Harper's actual API
- src/schema-manager.js: Removed migration logic, leverages auto-indexing
- src/sync-engine.js: Added deterministic integer ID generation
- test/schema-manager.test.js: Updated for simplified API
- test/integer-id-generation.test.js: New comprehensive test suite
- examples/debug-operations-api.js: New investigation/testing script

Completes integration of dynamic table creation feature:

Integration changes:
- Modified src/index.js to initialize SchemaManager before SyncEngines
- SchemaManager ensures Harper tables exist before syncing begins
- Tables created with integer primary key for fast indexing
- Harper automatically indexes all fields during insert (schemaless)
- Graceful degradation: falls back to schema.graphql if Operations API unavailable

Configuration:
- Added operations section to config.yaml for Operations API credentials
- Updated config-loader.js to preserve operations config
- Prioritizes config.yaml over environment variables for flexibility
- Default config: localhost:9925, admin/password
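
The precedence described above (config.yaml first, then environment variables, then defaults) might be resolved along these lines. The function name and the environment variable names are hypothetical; only the defaults (localhost:9925, admin/password) come from this commit.

```javascript
// Resolve Operations API settings: values from the parsed config.yaml win,
// then environment variables (names assumed here), then built-in defaults.
function resolveOperationsConfig(fileConfig = {}, env = process.env) {
  const fromFile = fileConfig.operations || {};
  return {
    host: fromFile.host ?? env.HARPER_OPERATIONS_HOST ?? 'localhost',
    port: Number(fromFile.port ?? env.HARPER_OPERATIONS_PORT ?? 9925),
    username: fromFile.username ?? env.HARPER_OPERATIONS_USERNAME ?? 'admin',
    password: fromFile.password ?? env.HARPER_OPERATIONS_PASSWORD ?? 'password',
  };
}
```

Using `??` rather than `||` means an explicitly configured empty string is respected instead of silently falling through to the next source.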

Logging:
- Clear info messages when tables are created vs already exist
- Helpful error messages if Operations API is unavailable
- Indicates how many fields expected from BigQuery schema

All 91 tests passing. Ready for integration testing with live Harper instance.

Added comprehensive documentation of the dynamic table creation feature:

New documentation:
- docs/internal/dynamic-table-creation-summary.md: Complete feature summary
  - What was built vs what was originally planned
  - Key architectural discovery: Harper is fully schemaless
  - Integration details, configuration, testing
  - Performance optimizations (integer IDs)
  - Migration path for existing deployments

Updated documentation:
- docs/plans/2025-11-13-dynamic-table-creation-design.md:
  - Added "Implementation Notes" section at top
  - Documents what we actually built (much simpler than planned)
  - Lists components that weren't needed due to schemaless discovery
  - Preserved original design for reference

Key insights documented:
- Harper Operations API is fully schemaless (no ALTER needed)
- Tables auto-index ALL fields without pre-definition
- Implementation 90% simpler than originally designed
- No schema polling, migrations, or distributed locking required

All 97 tests passing. Feature ready for merge.

irjudson force-pushed the feature/dynamic-table-creation branch from 153659c to a59f4c6 on November 14, 2025 15:19
irjudson merged commit 8cc7cb5 into main on Nov 14, 2025
4 checks passed
irjudson deleted the feature/dynamic-table-creation branch on November 14, 2025 15:22

Development

Successfully merging this pull request may close these issues.

Dynamic Harper table creation via Operations API
