Commit 12f4e05

vlasky and Wilbert Harriman committed
Add cosine distance support for binary quantized vectors
Implements distance_cosine_bit() to calculate cosine similarity for bit vectors using popcount operations. Previously, cosine distance would error on binary vectors. Uses optimized u64 popcount when dimensions are divisible by 64, otherwise falls back to u8 hamming table lookup.

Merged from upstream PR asg017#212 by wilbertharriman.

Co-Authored-By: Wilbert Harriman <[email protected]>
1 parent 89d2de1 commit 12f4e05
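
The popcount formulation in the commit message follows from treating each bit as a 0/1 vector component. The short derivation below is a reading of that description, not text from the commit; `popcount` denotes the number of set bits and `∧` bitwise AND.

```latex
% For a, b in {0,1}^n, a_i b_i = 1 exactly when both bits are set, and a_i^2 = a_i.
\[
a \cdot b = \sum_{i=1}^{n} a_i b_i = \operatorname{popcount}(a \wedge b),
\qquad
\lVert a \rVert^2 = \sum_{i=1}^{n} a_i^2 = \operatorname{popcount}(a)
\]
\[
d_{\cos}(a, b) = 1 - \frac{\operatorname{popcount}(a \wedge b)}
                          {\sqrt{\operatorname{popcount}(a)}\,\sqrt{\operatorname{popcount}(b)}}
\]
% Worked example: a = 0xff (8 bits set), b = 0x01 (1 bit set), a AND b = 0x01.
\[
d_{\cos}(\mathtt{0xff}, \mathtt{0x01}) = 1 - \frac{1}{\sqrt{8}\,\sqrt{1}} \approx 0.6464
\]
```

The distance is undefined when either vector has no set bits, since the denominator is then zero.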

3 files changed: +243 -17 lines changed

CLAUDE.md

Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+`sqlite-vec` is a lightweight, fast vector search SQLite extension written in pure C with no dependencies. It's a pre-v1 project (current: v0.1.7-alpha.2) that provides vector similarity search capabilities for SQLite databases across all platforms where SQLite runs.
+
+Key features:
+- Supports float, int8, and binary vector types via `vec0` virtual tables
+- Pure C implementation with optional SIMD optimizations (AVX on x86_64, NEON on ARM)
+- Multi-language bindings (Python, Node.js, Ruby, Go, Rust)
+- Runs anywhere: Linux/MacOS/Windows, WASM, embedded devices
+
+## Building and Testing
+
+### Build Commands
+
+Run `./scripts/vendor.sh` first to download vendored dependencies (sqlite3.c, shell.c).
+
+**Core builds:**
+- `make loadable` - Build `dist/vec0.{so,dylib,dll}` loadable extension
+- `make static` - Build `dist/libsqlite_vec0.a` static library and `dist/sqlite-vec.h` header
+- `make cli` - Build `dist/sqlite3` CLI with sqlite-vec statically linked
+- `make all` - Build all three targets above
+- `make wasm` - Build WASM version (requires emcc)
+
+**Platform-specific compiler:**
+- Set `CC=` to use a different compiler (default: gcc)
+- Set `AR=` to use a different archiver (default: ar)
+
+**SIMD control:**
+- SIMD is auto-enabled on Darwin x86_64 (AVX) and Darwin arm64 (NEON)
+- Set `OMIT_SIMD=1` to disable SIMD optimizations
+
+### Testing
+
+**Python tests (primary test suite):**
+```bash
+# Setup test environment with uv
+uv sync --directory tests
+
+# Run all Python tests
+make test-loadable python=./tests/.venv/bin/python
+
+# Run specific test
+./tests/.venv/bin/python -m pytest tests/test-loadable.py::test_name -vv -s -x
+
+# Update snapshots
+make test-loadable-snapshot-update
+
+# Watch mode
+make test-loadable-watch
+```
+
+**Other tests:**
+- `make test` - Run basic SQL tests via `test.sql`
+- `make test-unit` - Compile and run C unit tests
+- `sqlite3 :memory: '.read test.sql'` - Quick smoke test
+
+**Test structure:**
+- `tests/test-loadable.py` - Main comprehensive test suite
+- `tests/test-metadata.py` - Metadata column tests
+- `tests/test-auxiliary.py` - Auxiliary column tests
+- `tests/test-partition-keys.py` - Partition key tests
+- `tests/conftest.py` - pytest fixtures (loads extension from `dist/vec0`)
+
+### Code Quality
+
+- `make format` - Format C code with clang-format and Python with black
+- `make lint` - Check formatting without modifying files
+
+## Architecture
+
+### Core Implementation (sqlite-vec.c)
+
+The entire extension is in a single `sqlite-vec.c` file (~9000 lines). It implements a `vec0` virtual table module using SQLite's virtual table API.
+
+**Key concepts:**
+
+1. **vec0 virtual table**: Declared with `CREATE VIRTUAL TABLE x USING vec0(vector_column TYPE[N], ...)`
+   - Vector column: Must specify type (float, int8, bit) and dimensions
+   - Metadata columns: Additional indexed columns for filtering
+   - Auxiliary columns: Non-indexed columns for associated data
+   - Partition keys: Special columns for pre-filtering via `partition_key=column_name`
+   - Chunk size: Configurable via `chunk_size=N` (default varies by type)
+
+2. **Shadow tables**: vec0 creates multiple hidden tables to store data:
+   - `xyz_chunks` - Chunk metadata (size, validity bitmaps, rowids)
+   - `xyz_rowids` - Rowid mapping to chunks
+   - `xyz_vector_chunksNN` - Actual vector data for column NN
+   - `xyz_auxiliary` - Auxiliary column values
+   - `xyz_metadatachunksNN` / `xyz_metadatatextNN` - Metadata storage
+
+3. **Query plans**: Determined in xBestIndex, encoded in idxStr:
+   - `VEC0_QUERY_PLAN_FULLSCAN` - Full table scan
+   - `VEC0_QUERY_PLAN_POINT` - Single rowid lookup
+   - `VEC0_QUERY_PLAN_KNN` - K-nearest neighbors vector search
+
+See ARCHITECTURE.md for detailed idxStr encoding and shadow table schemas.
+
+### Language Bindings
+
+All bindings wrap the core C extension:
+
+- **Python** (`bindings/python/`): Minimal wrapper with helper functions in `extra_init.py` for vector serialization
+- **Go** (`bindings/go/`): Uses ncruces/go-sqlite3 pure Go implementation
+- **Rust** (`bindings/rust/`): Static linking via build.rs, exports `sqlite3_vec_init()`
+
+### Documentation Site
+
+Built with VitePress (Vue-based static site generator):
+- `npm --prefix site run dev` - Development server
+- `npm --prefix site run build` - Production build
+- Source: `site/` directory
+- Deployed via GitHub Actions (`.github/workflows/site.yaml`)
+
+## Development Workflow
+
+### Making Changes
+
+1. Edit `sqlite-vec.c` for core functionality
+2. Update `sqlite-vec.h.tmpl` if public API changes (regenerated via `make sqlite-vec.h`)
+3. Add tests to `tests/test-loadable.py` or other test files
+4. Run `make format` before committing
+5. Verify with `make test-loadable`
+
+### Release Process
+
+1. Update `VERSION` file (format: `X.Y.Z` or `X.Y.Z-alpha.N`)
+2. Run `./scripts/publish-release.sh` - This:
+   - Commits version changes
+   - Creates git tag
+   - Pushes to origin
+   - Creates GitHub release (pre-release if alpha/beta)
+
+CI/CD (`.github/workflows/release.yaml`) then builds and publishes:
+- Platform-specific extensions (Linux, macOS, Windows, Android, WASM)
+- Language-specific packages (PyPI, npm, crates.io, RubyGems)
+
+### Working with Tests
+
+**Python test fixtures:**
+- `@pytest.fixture() db()` in conftest.py provides SQLite connection with extension loaded
+- Tests use `db.execute()` for queries
+- Snapshot testing available for regression tests
+
+**Common test patterns:**
+```python
+def test_example(db):
+    db.execute("CREATE VIRTUAL TABLE v USING vec0(embedding float[3])")
+    db.execute("INSERT INTO v(rowid, embedding) VALUES (1, '[1,2,3]')")
+    result = db.execute("SELECT distance FROM v WHERE embedding MATCH '[1,2,3]'").fetchone()
+```
+
+### SIMD Optimizations
+
+SIMD is conditionally compiled based on platform:
+- `SQLITE_VEC_ENABLE_AVX` - x86_64 AVX instructions
+- `SQLITE_VEC_ENABLE_NEON` - ARM NEON instructions
+
+Code uses preprocessor directives to select implementations. Distance calculations have both scalar and SIMD variants.
+
+## Important Notes
+
+- This is pre-v1 software - breaking changes are expected
+- The single-file architecture means recompiling for any change
+- Tests must run from repository root (assumes `dist/vec0` exists)
+- All bindings depend on the core C extension being built first
+- Vector format: JSON arrays `'[1,2,3]'` or raw bytes via helper functions
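
The final note in CLAUDE.md mentions two vector input formats. As a rough illustration, the sketch below builds both for the same vector using only the Python standard library. It assumes the raw form is a packed array of native-endian 32-bit floats, which this diff does not confirm. The helper name `to_float32_blob` is hypothetical and is not the binding's own helper.

```python
import json
import struct


def to_float32_blob(values):
    """Pack numbers as consecutive 32-bit floats (assumed raw layout, 4 bytes each)."""
    return struct.pack(f"{len(values)}f", *values)


vector = [1.0, 2.0, 3.0]
as_json = json.dumps(vector)       # text form: a JSON array string
as_blob = to_float32_blob(vector)  # binary form: 4 bytes per element
```

Either value could then be bound as a query parameter. The blob avoids JSON parsing, at the cost of being opaque when inspected by hand.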

sqlite-vec.c

Lines changed: 54 additions & 17 deletions
@@ -460,6 +460,58 @@ static double distance_l1_f32(const void *a, const void *b, const void *d) {
   return l1_f32(a, b, d);
 }
 
+// https://github.com/facebookresearch/faiss/blob/77e2e79cd0a680adc343b9840dd865da724c579e/faiss/utils/hamming_distance/common.h#L34
+static u8 hamdist_table[256] = {
+    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4,
+    2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4,
+    2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6,
+    4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5,
+    3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6,
+    4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8};
+
+static f32 distance_cosine_bit_u64(u64 *a, u64 *b, size_t n) {
+  f32 dot = 0;
+  f32 aMag = 0;
+  f32 bMag = 0;
+
+  for (size_t i = 0; i < n; i++) {
+    dot += __builtin_popcountl(a[i] & b[i]);
+    aMag += __builtin_popcountl(a[i]);
+    bMag += __builtin_popcountl(b[i]);
+  }
+
+  return 1 - (dot / (sqrt(aMag) * sqrt(bMag)));
+}
+
+static f32 distance_cosine_bit_u8(u8 *a, u8 *b, size_t n) {
+  f32 dot = 0;
+  f32 aMag = 0;
+  f32 bMag = 0;
+
+  for (size_t i = 0; i < n; i++) {
+    dot += hamdist_table[a[i] & b[i]];
+    aMag += hamdist_table[a[i]];
+    bMag += hamdist_table[b[i]];
+  }
+
+  return 1 - (dot / (sqrt(aMag) * sqrt(bMag)));
+}
+
+static f32 distance_cosine_bit(const void *pA, const void *pB,
+                               const void *pD) {
+  size_t dim = *((size_t *)pD);
+
+  if ((dim % 64) == 0) {
+    return distance_cosine_bit_u64((u64 *)pA, (u64 *)pB, dim / 8 / CHAR_BIT);
+  }
+  return distance_cosine_bit_u8((u8 *)pA, (u8 *)pB, dim / CHAR_BIT);
+}
+
 static f32 distance_cosine_float(const void *pVect1v, const void *pVect2v,
                                  const void *qty_ptr) {
   f32 *pVect1 = (f32 *)pVect1v;
@@ -497,20 +549,6 @@ static f32 distance_cosine_int8(const void *pA, const void *pB,
   return 1 - (dot / (sqrt(aMag) * sqrt(bMag)));
 }
 
-// https://github.com/facebookresearch/faiss/blob/77e2e79cd0a680adc343b9840dd865da724c579e/faiss/utils/hamming_distance/common.h#L34
-static u8 hamdist_table[256] = {
-    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 1, 2, 2, 3, 2, 3, 3, 4,
-    2, 3, 3, 4, 3, 4, 4, 5, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
-    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 1, 2, 2, 3, 2, 3, 3, 4,
-    2, 3, 3, 4, 3, 4, 4, 5, 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
-    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6,
-    4, 5, 5, 6, 5, 6, 6, 7, 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
-    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 2, 3, 3, 4, 3, 4, 4, 5,
-    3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
-    2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 3, 4, 4, 5, 4, 5, 5, 6,
-    4, 5, 5, 6, 5, 6, 6, 7, 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
-    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8};
-
 static f32 distance_hamming_u8(u8 *a, u8 *b, size_t n) {
   int same = 0;
   for (unsigned long i = 0; i < n; i++) {
@@ -1167,9 +1205,8 @@ static void vec_distance_cosine(sqlite3_context *context, int argc,
 
   switch (elementType) {
   case SQLITE_VEC_ELEMENT_TYPE_BIT: {
-    sqlite3_result_error(
-        context, "Cannot calculate cosine distance between two bitvectors.",
-        -1);
+    f32 result = distance_cosine_bit(a, b, &dimensions);
+    sqlite3_result_double(context, result);
     goto finish;
   }
   case SQLITE_VEC_ELEMENT_TYPE_FLOAT32: {
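
The dispatch in `distance_cosine_bit()` relies on the two paths agreeing. A vector of `dim` bits occupies `dim / 8` bytes, or `dim / 64` 64-bit words when `dim` is a multiple of 64, which is what `dim / 8 / CHAR_BIT` evaluates to when `CHAR_BIT` is 8. The Python sketch below mirrors that structure to check the equivalence. It is an illustration only, not a port of the C code: all function names are hypothetical, and all-zero input is not handled (it raises `ZeroDivisionError` here).

```python
import math
import struct


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def _distance(dot: int, a_mag: int, b_mag: int) -> float:
    return 1.0 - dot / (math.sqrt(a_mag) * math.sqrt(b_mag))


def cosine_bit_u8(a: bytes, b: bytes) -> float:
    """Byte-at-a-time path, analogous to the table-lookup fallback."""
    dot = sum(popcount(x & y) for x, y in zip(a, b))
    return _distance(dot, sum(map(popcount, a)), sum(map(popcount, b)))


def cosine_bit_u64(a: bytes, b: bytes, n_words: int) -> float:
    """Word-at-a-time path; byte order is irrelevant to bit counts."""
    wa = struct.unpack(f"<{n_words}Q", a)
    wb = struct.unpack(f"<{n_words}Q", b)
    dot = sum(popcount(x & y) for x, y in zip(wa, wb))
    return _distance(dot, sum(map(popcount, wa)), sum(map(popcount, wb)))


def cosine_bit(a: bytes, b: bytes, dim: int) -> float:
    """Pick the word path when dim is a multiple of 64, else the byte path."""
    if dim % 64 == 0:
        return cosine_bit_u64(a[: dim // 8], b[: dim // 8], dim // 64)
    return cosine_bit_u8(a[: dim // 8], b[: dim // 8])
```

Because AND and popcount act on each bit independently, grouping the same bits into bytes or into words changes only how many loop iterations are needed, not the totals.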

tests/test-loadable.py

Lines changed: 19 additions & 0 deletions
@@ -423,6 +423,25 @@ def check(a, b, dtype=np.float32):
     check([1, 2, 3], [-9, -8, -7], dtype=np.int8)
     assert vec_distance_cosine("[1.1, 1.0]", "[1.2, 1.2]") == 0.001131898257881403
 
+    vec_distance_cosine_bit = lambda *args: db.execute(
+        "select vec_distance_cosine(vec_bit(?), vec_bit(?))", args
+    ).fetchone()[0]
+    assert isclose(
+        vec_distance_cosine_bit(b"\xff", b"\x01"),
+        npy_cosine([1,1,1,1,1,1,1,1], [0,0,0,0,0,0,0,1]),
+        abs_tol=1e-6
+    )
+    assert isclose(
+        vec_distance_cosine_bit(b"\xab", b"\xab"),
+        npy_cosine([1,0,1,0,1,0,1,1], [1,0,1,0,1,0,1,1]),
+        abs_tol=1e-6
+    )
+    # test 64-bit
+    assert isclose(
+        vec_distance_cosine_bit(b"\xaa" * 8, b"\xff" * 8),
+        npy_cosine([1,0] * 32, [1] * 64),
+        abs_tol=1e-6
+    )
 
 def test_vec_distance_hamming():
     vec_distance_hamming = lambda *args: db.execute(
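
The new assertions compare the SQL result against `npy_cosine` applied to explicit 0/1 lists. `npy_cosine` is defined elsewhere in the test file and does not appear in this diff, so the sketch below uses a hypothetical pure-Python stand-in, `reference_cosine`, together with a hypothetical `unpack_bits` helper, to show how the byte literals correspond to the bit lists in the test.

```python
import math


def unpack_bits(data: bytes) -> list:
    """Expand bytes into a list of 0/1 values, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def reference_cosine(a, b) -> float:
    """Cosine distance 1 - (a.b) / (|a| |b|) over plain numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# 0xab is 1010 1011 in binary, matching the list used in the second assertion.
bits_ab = unpack_bits(b"\xab")
identical = reference_cosine(bits_ab, bits_ab)
```

Only the counts of set bits enter the result, so listing bits least-significant first would give the same distances as long as both operands use the same order.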
