Dynamic table creation via Harper Operations API #13

irjudson · 2025-11-14T15:15:05Z

Summary

Implements automatic Harper table creation via Operations API, eliminating manual schema.graphql definitions for data tables. Tables are created dynamically based on BigQuery schema introspection.

Closes #7

Key Discovery

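Tables created through the Harper Operations API are fully schemaless: a table created with only a hash_attribute accepts any fields on insert and indexes all of them automatically, with no pre-definition. This dramatically simplified the implementation compared to the original design.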

What's Included

Core Components

OperationsClient - HTTP client for Harper Operations API (describe_table, create_table)
SchemaManager - Orchestrates table creation and schema introspection
TypeMapper - Maps BigQuery types to Harper types (for documentation)
IndexStrategy - Documents indexing strategy (Harper auto-indexes everything)
Integer ID generation - Deterministic SHA256-based IDs for fast indexing
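
To make the OperationsClient role concrete, here is a minimal sketch of how the two operations could be expressed as HTTP requests. The function names, the plain-HTTP URL, and the `schema` field are assumptions for illustration, not the code in src/operations-client.js. Only POST, Basic Authentication, `describe_table`, `create_table`, and `hash_attribute` come from this PR.

```javascript
// Hypothetical request builders (names are illustrative, not from the PR).
// Assumes the Operations API accepts JSON POST bodies with Basic Authentication.
function buildOperationsRequest(config, body) {
  const credentials = Buffer.from(`${config.username}:${config.password}`).toString('base64');
  return {
    // Assumption: plain HTTP on the configured host and port.
    url: `http://${config.host}:${config.port}/`,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Basic ${credentials}`,
      },
      body: JSON.stringify(body),
    },
  };
}

// Body for checking whether a table exists. The `schema` field is an assumption.
const describeTableBody = (schema, table) => ({
  operation: 'describe_table',
  schema,
  table,
});

// Body for creating a table; only the hash attribute is declared up front.
const createTableBody = (schema, table, hashAttribute) => ({
  operation: 'create_table',
  schema,
  table,
  hash_attribute: hashAttribute,
});
```

On Node 18 or later, the returned pair can be passed directly to the global `fetch(url, options)`.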

Integration

Modified src/index.js to initialize SchemaManager before SyncEngines
Ensures tables exist before syncing begins
Graceful degradation if Operations API unavailable (falls back to schema.graphql)
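
The startup ordering and fallback described above might look roughly like the sketch below. `initializeTables`, its arguments, and the result shape are hypothetical; the actual wiring lives in src/index.js and may differ.

```javascript
// Hypothetical startup helper: ensure every data table exists before sync starts.
// `schemaManager` is any object exposing an async ensureTable(name) method.
async function initializeTables(schemaManager, tableNames, logger = console) {
  const results = [];
  for (const name of tableNames) {
    try {
      const result = await schemaManager.ensureTable(name);
      results.push({ table: name, ok: true, action: result.action });
    } catch (error) {
      // Graceful degradation: log and continue, so that a table defined in
      // schema.graphql can still be used when the Operations API is unreachable.
      logger.warn(
        `Operations API unavailable for "${name}", falling back to schema.graphql: ${error.message}`
      );
      results.push({ table: name, ok: false });
    }
  }
  return results;
}
```

A failure for one table does not stop the others from being checked, and the caller can start the SyncEngines after the loop finishes.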

Configuration

operations:
  host: localhost
  port: 9925
  username: admin
  password: password

How It Works

On startup, SchemaManager.ensureTable() checks if Harper table exists
If not, creates table with create_table (only requires hash_attribute)
SyncEngine inserts BigQuery data with deterministic integer IDs
Harper automatically indexes ALL inserted fields (no schema pre-definition)
New fields in BigQuery are automatically handled on next insert
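
The first two steps above can be sketched as a check-then-create function. This is an illustration under stated assumptions, not the SchemaManager code: the `client` interface, the default hash attribute name, and the returned object shape are hypothetical. The commit history does state that describeTable() returns null when the table is missing and that "already exists" errors are treated as success.

```javascript
// Hypothetical sketch of the check-then-create flow.
// `client` is any object exposing describeTable(name) and createTable(name, hashAttribute).
async function ensureTable(client, tableName, hashAttribute = 'id') {
  // Step 1: check whether the table already exists (null means "not found").
  const existing = await client.describeTable(tableName);
  if (existing !== null) {
    return { table: tableName, action: 'none' };
  }

  // Step 2: create it with only a hash attribute; fields arrive later on insert.
  try {
    await client.createTable(tableName, hashAttribute);
    return { table: tableName, action: 'created' };
  } catch (error) {
    // Another worker may have created the table between the check and the create.
    // Treating "already exists" as success keeps the operation idempotent.
    if (/already exists/i.test(String(error.message))) {
      return { table: tableName, action: 'none' };
    }
    throw error;
  }
}
```

Because creation is idempotent, several workers can run this concurrently at startup without a lock.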

Performance Optimization

Integer Primary Keys:

Deterministic SHA256-based integer IDs from record data
Ensures fast indexing (no strings/GUIDs/objects as PKs)
IDs are positive 53-bit safe integers
Same input always produces same ID (idempotent)
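
One way to derive such an ID is shown below as a minimal sketch. The function name, the choice to hash the whole record, and the key-sorting step are assumptions; the version in src/sync-engine.js may select fields or serialize differently.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: derive a deterministic, positive, 53-bit-safe integer ID
// from a record's contents.
function generateIntegerId(record) {
  // Sort top-level keys so that property order does not change the hash.
  // Nested objects keep their own key order in this sketch.
  const canonical = JSON.stringify(
    Object.keys(record)
      .sort()
      .map((key) => [key, record[key]])
  );
  const digest = createHash('sha256').update(canonical).digest();

  // Read the first 8 bytes as an unsigned 64-bit integer and keep the low 53 bits,
  // which always fits within Number.MAX_SAFE_INTEGER.
  const first64 = digest.readBigUInt64BE(0);
  const id = Number(first64 & ((1n << 53n) - 1n));

  // Zero is astronomically unlikely, but map it to 1 so the ID stays positive.
  return id === 0 ? 1 : id;
}
```

Truncating to 53 bits trades collision resistance for JavaScript number safety: by the birthday bound, collisions become likely once a table holds on the order of 10^8 distinct records, which is worth keeping in mind for very large tables.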

Benefits

✅ Zero manual schema definitions for data tables
✅ Automatic schema evolution (Harper handles new fields)
✅ Fast integer indexing for optimal performance
✅ Thread-safe and idempotent table creation
✅ Simple implementation (no polling, migrations, or distributed locking)
✅ Graceful fallback to schema.graphql if API unavailable

Testing

97 tests passing (91 existing + 6 new integer ID tests)
New test files:
- test/schema-manager.test.js
- test/integer-id-generation.test.js
- test/operations-client.test.js
- test/type-mapper.test.js
- test/index-strategy.test.js
Integration scripts for live testing in examples/

Changes

18 files changed, 1,734 insertions:

New: src/operations-client.js, src/schema-manager.js, src/type-mapper.js, src/index-strategy.js
Modified: src/index.js, src/sync-engine.js, src/config-loader.js, config.yaml
Tests: 6 new test files with comprehensive coverage
Docs: Updated design document and added implementation summary

Documentation

Updated docs/plans/2025-11-13-dynamic-table-creation-design.md with implementation notes
Added docs/internal/dynamic-table-creation-summary.md with complete feature summary
Documented the schemaless discovery and its implications

Migration Path

Existing deployments:

System tables remain in schema.graphql (SyncCheckpoint, SyncAudit)
Data tables can be removed from schema.graphql
Add Operations API credentials to config.yaml
Restart - tables created automatically

New deployments:

Only system tables in schema.graphql
Configure Operations API credentials
All data tables created dynamically

What We Didn't Build (Not Needed)

The original design included these components, but Harper's schemaless tables make them unnecessary:

❌ SchemaLeaderElection polling (no periodic checks needed)
❌ Schema migration logic (just insert, Harper handles it)
❌ addColumns() operation (not supported by API)
❌ Distributed locking for schema checks (no concurrent operations)
❌ Complex type change handling (everything schemaless)

Ready to merge! All tests passing, feature complete, fully documented.

Complete design for Issue #7 covering:
- Thread-safe table creation with check-then-act pattern
- Rich type mapping for all BigQuery types
- Smart indexing based on BigQuery metadata
- Automatic schema migration (additive only)
- Adaptive polling with exponential backoff
- Comprehensive error handling and circuit breaker
- Integration with existing codebase
- Testing strategy for concurrency and correctness

Changed from independent per-node polling to leader election pattern:
- SchemaLeaderElection uses SchemaLock table for coordination
- Only one node (leader) checks schemas at any time
- Dramatically reduces BigQuery and Harper API calls
- Eliminates race conditions during schema migrations
- Automatic failover via lock TTL (10min)
- Still uses adaptive backoff (5min to 30min)

Benefits:
- N nodes polling → 1 node polling (N-1 fewer API calls)
- No conflicts when adding columns
- Clean serialized schema evolution
- Simple and efficient

Implements core components for Issue #7 (dynamic table creation):

Components added:
- TypeMapper: Converts BigQuery types to Harper GraphQL types
- IndexStrategy: Determines which columns should be indexed
- OperationsClient: Wrapper for Harper Operations API
- SchemaManager: Orchestrates table creation and migrations
- SchemaLeaderElection: Distributed locking for schema checks

Schema changes:
- Add SchemaLock table for distributed coordination

Test coverage:
- 60 new unit tests covering all components
- All 151 tests passing (91 existing + 60 new)
- TDD approach: tests written first, then implementation

Design:
- Thread-safe with check-then-act pattern
- Idempotent operations (already exists = success)
- No destructive changes (only additive)
- Versioned columns for type changes (e.g., column_v2)
- Adaptive polling with leader election

Next steps:
- Integrate with Harper Operations API
- Implement ensureTable in SchemaManager
- Wire into handleApplication lifecycle

Complete the main orchestration logic for dynamic table creation:

Implementation:
- ensureTable() fetches BigQuery schema via getMetadata()
- Checks if Harper table exists via Operations API
- Determines migration needs (create, migrate, or none)
- Creates new tables with attributes and indexes
- Adds new attributes to existing tables
- Returns detailed result with action taken

Test coverage:
- 3 new integration tests for ensureTable flow
- Test table creation from scratch
- Test schema migration (adding columns)
- Test no-op when schemas match
- All 63 tests passing (60 unit + 3 integration)

Design:
- Uses mocked OperationsClient for testing
- Orchestrates TypeMapper, IndexStrategy, and OperationsClient
- Thread-safe through check-then-act in Operations API
- Ready for actual API integration

Next steps:
- Implement HTTP methods in OperationsClient
- Wire into application lifecycle
- Test with real Harper Operations API

Added full HTTP implementation for interacting with Harper Operations API:

HTTP Infrastructure:
- makeRequest() method with POST request handling
- Basic Authentication support for secured endpoints
- Error handling for both HTTP errors and API errors
- Proper JSON serialization/deserialization

API Methods:
- describeTable(): Fetch table schema, returns null if not found
- createTable(): Create tables with attributes and indexes
- addAttributes(): Add columns via ALTER operations

Features:
- Idempotent operations (handles "already exists" errors gracefully)
- Per-attribute error handling in addAttributes()
- Automatic index creation after table creation
- Proper error propagation with status codes

Testing:
- All 63 component tests passing
- All 91 main integration tests passing
- Lint clean (fixed useless-catch error)
- Prettier formatted

Ready for integration with actual Harper Operations API endpoint.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Major architectural pivot after integration testing with local Harper instance:

Discovery:
- Harper Operations API does NOT support ALTER or create_attribute_index operations
- Harper automatically indexes ALL fields in schemaless tables
- Tables created with just hash_attribute accept ANY fields on INSERT
- All inserted fields are automatically indexed and queryable

Implementation changes:
- Simplified OperationsClient.createTable() to only require hash_attribute
- Removed complex schema migration logic from SchemaManager
- Harper handles schema evolution automatically during insert
- No need to pre-define columns or indexes

Performance optimization:
- Generate deterministic integer IDs from SHA256 hash of record data
- Ensures fast indexing (no strings/GUIDs/objects as primary keys)
- IDs are positive 53-bit safe integers for JavaScript compatibility
- Same input always produces same ID for idempotency

Test coverage:
- Added comprehensive integer ID generation tests (6 new tests)
- Updated schema manager tests to match simplified API
- All 97 tests passing (91 main + 6 new ID tests)
- Added debug-operations-api.js investigation script

Files modified:
- src/operations-client.js: Simplified to match Harper's actual API
- src/schema-manager.js: Removed migration logic, leverages auto-indexing
- src/sync-engine.js: Added deterministic integer ID generation
- test/schema-manager.test.js: Updated for simplified API
- test/integer-id-generation.test.js: New comprehensive test suite
- examples/debug-operations-api.js: New investigation/testing script

Completes integration of dynamic table creation feature:

Integration changes:
- Modified src/index.js to initialize SchemaManager before SyncEngines
- SchemaManager ensures Harper tables exist before syncing begins
- Tables created with integer primary key for fast indexing
- Harper automatically indexes all fields during insert (schemaless)
- Graceful degradation: falls back to schema.graphql if Operations API unavailable

Configuration:
- Added operations section to config.yaml for Operations API credentials
- Updated config-loader.js to preserve operations config
- Prioritizes config.yaml over environment variables for flexibility
- Default config: localhost:9925, admin/password

Logging:
- Clear info messages when tables are created vs already exist
- Helpful error messages if Operations API is unavailable
- Indicates how many fields expected from BigQuery schema

All 91 tests passing. Ready for integration testing with live Harper instance.

Added comprehensive documentation of the dynamic table creation feature:

New documentation:
- docs/internal/dynamic-table-creation-summary.md: Complete feature summary
  - What was built vs what was originally planned
  - Key architectural discovery: Harper is fully schemaless
  - Integration details, configuration, testing
  - Performance optimizations (integer IDs)
  - Migration path for existing deployments

Updated documentation:
- docs/plans/2025-11-13-dynamic-table-creation-design.md:
  - Added "Implementation Notes" section at top
  - Documents what we actually built (much simpler than planned)
  - Lists components that weren't needed due to schemaless discovery
  - Preserved original design for reference

Key insights documented:
- Harper Operations API is fully schemaless (no ALTER needed)
- Tables auto-index ALL fields without pre-definition
- Implementation 90% simpler than originally designed
- No schema polling, migrations, or distributed locking required

All 97 tests passing. Feature ready for merge.

irjudson and others added 8 commits November 13, 2025 15:11

irjudson force-pushed the feature/dynamic-table-creation branch from 153659c to a59f4c6 Compare November 14, 2025 15:19

irjudson merged commit 8cc7cb5 into main Nov 14, 2025
4 checks passed

irjudson deleted the feature/dynamic-table-creation branch November 14, 2025 15:22
