Skip to content

Latest commit

 

History

History
384 lines (282 loc) · 12.7 KB

File metadata and controls

384 lines (282 loc) · 12.7 KB

Hash Verification & Data Integrity

Table of Contents

Overview

gitsqlite includes automatic hash verification using SHA-256 to ensure SQL files have not been modified between operations. This prevents corruption and data integrity issues that could occur from manual edits or file corruption.

Hash verification status by operation:

  • clean - Generates hash and appends to SQL output
  • smudge - Verifies hash before restoring to binary (optional or enforced)
  • diff - No hash (diff output is for display/comparison only)

How It Works

During Clean Operation (Binary → SQL)

When gitsqlite converts a binary SQLite database to SQL format:

  1. The SQL content is streamed to the output
  2. A SHA-256 hash is computed on-the-fly as the SQL is written
  3. After all SQL content is written, a hash comment is appended as the last line:
    -- gitsqlite-hash: sha256:a1b2c3d4e5f6...

This happens for:

  • Main data output (stdout)
  • Schema files (when using -schema or -schema-file flags)

Note: The diff command does not include hash verification since it's only used for viewing/comparing, not for round-tripping back to binary.

During Smudge Operation (SQL → Binary)

When gitsqlite converts SQL format back to binary SQLite:

  1. The entire SQL input is read
  2. The last line is checked for the hash comment
  3. The hash is extracted and the content (without hash line) is hashed
  4. If the computed hash matches the stored hash, the content is verified
  5. The SQL (without hash line) is restored to binary SQLite
  6. Behavior depends on enforcement mode (see Enforcement Modes)

This verification happens for:

  • Main data input (stdin)
  • Schema files (when using -schema or -schema-file flags)

Enforcement Modes

Optional Validation (Default)

By default, hash validation is optional:

  • Hash is validated if present
  • Warnings are logged if hash is invalid or missing
  • Operation continues even with validation failures
# Default behavior - validates but doesn't fail
gitsqlite smudge < database.sql > database.db

# Enable logging to see validation status
gitsqlite -log smudge < database.sql > database.db
# Check log file for hash validation warnings

Use case: Development and testing environments where flexibility is needed.

Enforced Validation

With the -verify-hash flag, validation is enforced:

  • Hash must be present and valid
  • Operation fails if hash is invalid or missing
  • No database is created from invalid SQL
# Enforced mode - fails if hash is invalid/missing
gitsqlite -verify-hash smudge < database.sql > database.db

Use case: Production environments or when working with critical data.

Git Configuration

Configure your Git filter to use enforced validation:

# Enforce hash verification in Git workflow
git config filter.gitsqlite.smudge "gitsqlite -verify-hash smudge"
git config filter.gitsqlite.clean "gitsqlite clean"

Benefits

1. Prevents Manual Editing

The hash ensures that SQL files generated by gitsqlite are not manually edited in an editor, which could lead to:

  • Syntax errors that break smudge operations
  • Data corruption
  • Loss of binary database integrity
  • Inconsistent data types or constraints

2. Detects File Corruption

If a SQL file is corrupted during:

  • File system operations
  • Git operations
  • File transfers
  • Storage issues

The hash will catch the corruption before it's restored to binary format.

3. Data Integrity Assurance

The hash provides assurance that the SQL content has not been tampered with, intentionally or accidentally. In enforced mode, this prevents corrupted data from entering your database.

Hash Format

The hash line is always the last line of the SQL file and follows this format:

-- gitsqlite-hash: sha256:<64-character-hex-digest>

Example:

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO users VALUES(1,'Alice');
INSERT INTO users VALUES(2,'Bob');
COMMIT;
-- gitsqlite-hash: sha256:a7f3e8c2d1b6904523f1e0c8a9d4b5c6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2

Error Messages

Enforced Mode Errors

When using -verify-hash, you may see these errors:

Missing Hash:

data hash verification failed: missing gitsqlite hash signature (expected last line to start with '-- gitsqlite-hash: sha256:')

Hash Mismatch:

data hash verification failed: hash verification failed: expected a7f3e8c2d1..., got b8e4f9d3e2... (file may have been modified)

Schema File Hash Errors:

schema hash verification failed: hash verification failed: expected a7f3e8c2d1..., got b8e4f9d3e2... (file may have been modified)

Optional Mode Warnings

When not using -verify-hash, similar messages appear as warnings in the log file but don't prevent the operation from completing.

Testing & Validation

Manual Testing

To manually test hash verification:

# Create a test database
sqlite3 test.db <<EOF
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO users VALUES(1,'Alice');
INSERT INTO users VALUES(2,'Bob');
EOF

# Convert to SQL with hash
gitsqlite clean < test.db > test.sql

# View the hash at the end
tail -1 test.sql

# Verify it works (enforced mode)
gitsqlite -verify-hash smudge < test.sql > test2.db

# Test hash verification failure - edit the SQL file
sed -i "s/Alice/Charlie/" test.sql

# This should fail with hash verification error (in enforced mode)
gitsqlite -verify-hash smudge < test.sql > test3.db

# In optional mode, this succeeds with a warning
gitsqlite -log smudge < test.sql > test4.db
# Check log file for warning message

Testing Optional vs Enforced Modes

# Create valid SQL file
sqlite3 test.db "CREATE TABLE test(id INTEGER); INSERT INTO test VALUES(1);"
gitsqlite clean < test.db > valid.sql

# Create invalid SQL file (corrupt hash)
cat > invalid.sql << 'EOF'
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE test(id INTEGER);
INSERT INTO test VALUES(1);
COMMIT;
-- gitsqlite-hash: sha256:0000000000000000000000000000000000000000000000000000000000000000
EOF

# Test 1: Valid hash with enforcement - should succeed
gitsqlite -verify-hash smudge < valid.sql > test1.db && echo "✓ Success"

# Test 2: Invalid hash with enforcement - should fail
gitsqlite -verify-hash smudge < invalid.sql > test2.db 2>&1 || echo "✓ Correctly failed"

# Test 3: Invalid hash without enforcement - should succeed with warning
gitsqlite -log smudge < invalid.sql > test3.db && echo "✓ Success (warning logged)"

# Test 4: Missing hash with enforcement - should fail
head -n -1 valid.sql > no_hash.sql
gitsqlite -verify-hash smudge < no_hash.sql > test4.db 2>&1 || echo "✓ Correctly failed"

# Test 5: Missing hash without enforcement - should succeed with warning
gitsqlite smudge < no_hash.sql > test5.db && echo "✓ Success (warning logged)"

Integration Testing

Test the complete Git workflow:

# Create test repository
mkdir /tmp/hash-verification-test && cd /tmp/hash-verification-test
git init
git config user.name "Test"
git config user.email "test@example.com"

# Configure filters (with or without enforcement)
echo '*.db filter=gitsqlite' > .gitattributes

# Without enforcement (default)
git config filter.gitsqlite.clean "gitsqlite clean"
git config filter.gitsqlite.smudge "gitsqlite smudge"

# OR with enforcement (recommended)
# git config filter.gitsqlite.smudge "gitsqlite -verify-hash smudge"

# Create and commit database
sqlite3 app.db "CREATE TABLE logs(id INTEGER, msg TEXT); INSERT INTO logs VALUES(1,'start');"
git add .gitattributes app.db
git commit -m "Initial database"

# VERIFY: Git stored SQL with hash
git show HEAD:app.db | tail -1
# Expected: -- gitsqlite-hash: sha256:<hash>

# Test checkout (smudge with verification)
rm app.db
git checkout HEAD -- app.db
sqlite3 app.db "SELECT * FROM logs;"
# Expected: 1|start

# Cleanup
cd - && rm -rf /tmp/hash-verification-test

Unit Tests

Run the hash package unit tests:

go test -v ./internal/hash/

Expected tests:

  • TestHashWriter - Basic functionality
  • TestHashWriterDeterministic - Consistency check
  • TestVerifyAndStripHash - Successful verification
  • TestVerifyAndStripHashInvalidHash - Wrong hash detection
  • TestVerifyAndStripHashMissingHash - Missing hash detection
  • TestVerifyAndStripHashModifiedContent - Tampering detection
  • TestVerifyAndStripHashEmptyInput - Edge case
  • TestRoundTrip - End-to-end verification

Compatibility

Backward Compatibility

⚠️ Important: SQL files generated by older versions of gitsqlite (without hash) will work differently depending on mode:

  • Optional mode (default): Files without hash work but generate warnings
  • Enforced mode (-verify-hash): Files without hash will fail

To migrate existing repositories to enforced mode:

  1. Check out the repository with files in SQL format
  2. Re-run clean operation to regenerate SQL with hashes:
    git add -f *.db
    git commit -m "Regenerate SQLite dumps with hash verification"
  3. Update Git filter configuration to use -verify-hash flag

Future Versions

The hash format is designed to be extensible. If a different hash algorithm is needed in the future, the prefix format allows for versioning:

  • Current: -- gitsqlite-hash: sha256:...
  • Future: -- gitsqlite-hash: sha512:... or -- gitsqlite-hash-v2: ...

FAQ

Q: What if I want to manually edit the SQL?
A: Don't edit the SQL files directly. Instead, use SQLite tools to edit the binary database, then re-run clean to regenerate the SQL with a valid hash.

Q: Can I merge SQL files manually?
A: Merging SQL files manually is risky (see README caveats). If you must merge, after merging you'll need to restore to binary, then clean again to generate a valid hash.

Q: Does the hash slow down operations?
A: The performance impact is minimal - hash computation happens while streaming data. Only the smudge operation needs to read the full content into memory.

Q: What if I'm using schema/data separation?
A: Both the schema file and data file will have their own independent hashes. Both must verify successfully during smudge (in enforced mode).

Q: Can I verify a hash manually?
A: Yes, you can use standard tools:

# Extract content without hash line
head -n -1 file.sql | sha256sum

# Compare with hash in last line
tail -1 file.sql

Q: Should I use enforced mode or optional mode?
A: Use optional mode (default) for development/testing. Use enforced mode (-verify-hash) for production or critical data where you want to guarantee integrity.

Q: Can I disable hash generation during clean?
A: No, hashes are always generated during clean. However, verification during smudge is optional by default and only enforced when using -verify-hash.

Security Considerations

Hash Algorithm

  • SHA-256 is cryptographically secure and collision-resistant
  • Hashes are computed on the SQL content (excluding the hash line itself)
  • 64-character hexadecimal encoding provides 256 bits of security

Threat Model

  • Hash verification prevents accidental modifications and corruption
  • Not designed to prevent malicious attacks where attacker has write access
  • An attacker with file system access could modify both content and hash
  • For tamper-evidence against malicious actors, use Git's commit signing (GPG)

Best Practices

  1. Use enforced mode (-verify-hash) for production databases
  2. Combine with Git security: Enable signed commits and protected branches
  3. Log validation results: Use -log flag to track validation status
  4. Review warnings: In optional mode, periodically review logs for validation failures
  5. Test before deployment: Validate hash verification in your CI/CD pipeline

Technical Details

  • Algorithm: SHA-256 from Go's crypto/sha256 package
  • Encoding: Lowercase hexadecimal (64 characters)
  • Input: All SQL content excluding the hash line itself
  • Performance: O(n) with streaming computation during clean
  • Memory: Smudge requires full file in memory for verification

Implementation

Key files:

  • internal/hash/hash.go - Core hash implementation
  • internal/hash/hash_test.go - Comprehensive unit tests
  • internal/filters/clean.go - Hash generation during clean
  • internal/filters/smudge.go - Hash verification during smudge
  • internal/filters/diff.go - No hash (diff is for viewing only)