Description
This was originally reported against sqlite-vec (asg017/sqlite-vec#245), but the problem has been shown to be in sqlite-lembed, triggered by certain types of input.
I wanted to integrate this work into my static website building tool, AkashaCMS. Its test directory contains a number of input files of various kinds. The other bug report concerned the code I wrote in AkashaCMS, so I decided to create a standalone test program that makes it easier to reason about what's going on.
The test demonstrates that the SELECT lembed(?, ?) command fails, with a core dump, on certain file content.
If you want to inspect the test files, go here: https://github.com/akashacms/akasharender/tree/0.9/test and look in the documents directory.
I made a list of the test files using this command:
find documents -type f -print | grep -v index.html.md | grep -v .placeholder | grep -v .json | grep -v .jpg >docfiles.txt
Next, I put together the following using the snippets of code that are in AkashaCMS:
import { promises as fsp } from 'node:fs';
import path from 'node:path';
import * as sqlite_regex from "sqlite-regex";
import * as sqlite_vec from 'sqlite-vec';
import * as sqlite_lembed from 'sqlite-lembed';
import { AsyncDatabase } from 'promised-sqlite3';
const sqdb = await AsyncDatabase.open('test-lembed.db');
sqdb.inner.loadExtension(sqlite_regex.getLoadablePath());
sqlite_lembed.load(sqdb.inner);
sqlite_vec.load(sqdb.inner);
const fulltest = false;
const lembedModelFile = '/home/david/Projects/akasharender/akasharender/test/all-MiniLM-L6-v2.e4ce9877.q8_0.gguf';
const lembedModelName = 'all-MiniLM-L6-v2';
// const lembedModelFile = '/home/david/Projects/akasharender/akasharender/test/nomic-embed-text-v1.5.Q8_0.gguf';
// const lembedModelName = 'nomic-embed-text-v1.5';
await sqdb.run(`
    INSERT INTO temp.lembed_models(name, model)
    select ?, lembed_model_from_file(?);
`, [
    lembedModelName,
    lembedModelFile
]);
await sqdb.run('PRAGMA journal_mode=WAL;');
sqdb.inner.on('error', err => {
    console.error(err);
});
await sqdb.run(`
    CREATE TABLE IF NOT EXISTS documents (
        path TEXT,
        body TEXT
    );
`);
await sqdb.run(`
    CREATE VIRTUAL TABLE IF NOT EXISTS vec_documents USING vec0(
        vpath TEXT,
        body_embeddings FLOAT[384]
    );
`);
const doclist = await fsp.readFile('./docfiles.txt', 'utf-8');
const docs = doclist.split('\n');
for (const doc of docs) {
    console.log(doc);
    // Skip blank list entries and the files known to cause the core dump.
    if (doc.endsWith('asciidoctor-handlebars.html.adoc')
     || doc.endsWith('asciidoctor-nunjucks.html.adoc')
     || doc.endsWith('asciidoctor.html.adoc')
     || doc.endsWith('asciidoctor-liquid.html.adoc')
     || doc.endsWith('style.css.less')
     || doc.endsWith('select-elements.html.md')
     || doc === '') {
        console.log(`... skipping ${doc}`);
        continue;
    }
    const DOCTXT = await fsp.readFile(doc, 'utf-8');
    if (fulltest === true) {
        // Bind the file name (doc), not the imported node:path module.
        await sqdb.run(`
            INSERT INTO documents ( path, body ) VALUES (?, ?);
        `, [ doc, DOCTXT ]);
        // console.log(`after INSERT INTO documents`);
        await sqdb.run(`
            INSERT INTO vec_documents ( vpath, body_embeddings ) VALUES (
                ?, lembed(?, ?)
            );
        `, [ doc, lembedModelName, DOCTXT ]);
    } else {
        // Exercise only the embedding call.
        await sqdb.run(`
            SELECT lembed(?, ?);
        `, [ lembedModelName, DOCTXT ]);
    }
    // console.log(`after INSERT INTO vec_documents`);
}
await sqdb.close();
The 'promised-sqlite3' driver is a Promise-based wrapper around node-sqlite3.
The code sets up the tables, reads in the file list, and then, for each file, reads the contents and runs the SQL command(s) under test.
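The file list that the program reads was produced by the find and grep pipeline that writes docfiles.txt. As a rough sketch only, the same exclusions could be expressed in JavaScript. The helper name `isListedFile` and most of the sample paths are hypothetical. Note one difference: grep treats each `.` in those patterns as "any character", while `includes` matches the literal substring, so this version is slightly stricter.

```javascript
// Substrings excluded by the `grep -v` stages of the pipeline.
const EXCLUDED = ['index.html.md', '.placeholder', '.json', '.jpg'];

// Hypothetical helper: keep a path only if it contains none of the
// excluded substrings.
function isListedFile(filePath) {
    return !EXCLUDED.some((part) => filePath.includes(part));
}

// Sample paths for illustration only.
const sample = [
    'documents/index.html.md',
    'documents/style.css.less',
    'documents/img/photo.jpg',
];
console.log(sample.filter(isListedFile)); // [ 'documents/style.css.less' ]
```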
With fulltest set to true, it performs the full INSERT INTO commands. Otherwise, it runs only SELECT lembed(?, ?).
The if statement that skips entries in docs is there so that I can skip the files which caused the core dump (it also skips blank entries in the list). I developed the list one file at a time, and each of the files named caused the core dump repeatedly.
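The skip condition could also be kept as data, which makes it easier to add or remove a file while narrowing things down. This is a minimal sketch, not the code in AkashaCMS. The names `CRASHING_SUFFIXES` and `shouldSkip` are hypothetical, and the suffixes are the ones from the test program above.

```javascript
// File name suffixes that triggered the core dump in the test program.
const CRASHING_SUFFIXES = [
    'asciidoctor-handlebars.html.adoc',
    'asciidoctor-nunjucks.html.adoc',
    'asciidoctor.html.adoc',
    'asciidoctor-liquid.html.adoc',
    'style.css.less',
    'select-elements.html.md',
];

// Hypothetical helper: true for blank list entries and for files whose
// name ends with one of the known-crashing suffixes.
function shouldSkip(doc) {
    return doc === '' || CRASHING_SUFFIXES.some((suffix) => doc.endsWith(suffix));
}

console.log(shouldSkip('documents/style.css.less'));   // true
console.log(shouldSkip('documents/example.html.md'));  // false (hypothetical file name)
```

Inside the loop, the long `if` would then reduce to `if (shouldSkip(doc)) { ...; continue; }`.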
Each of those files caused the core dump with both values of fulltest, that is, with both the full INSERT INTO commands and the SELECT-only test.
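Because a core dump takes down the whole Node process, finding the failing files one at a time is slow. One possible way to automate it, sketched here under assumptions, is to run each file in its own child process and look at how that process ended. The script name `embed-one.mjs` is hypothetical (a script that loads the extensions and runs SELECT lembed(?, ?) for the single file given as its argument), and the assumption that the crash shows up as a terminating signal such as SIGSEGV or SIGABRT has not been verified here.

```javascript
// Classify how a child process ended. `result` has the shape returned by
// child_process.spawnSync: { status, signal, error }.
function classifyExit(result) {
    if (result.error) return 'spawn-error';                 // process could not be started
    if (result.signal) return `crashed (${result.signal})`; // killed by a signal
    return result.status === 0 ? 'ok' : `failed (exit ${result.status})`;
}

// Run one document through a separate Node process so that a crash in the
// native extension only takes down that child.
// './embed-one.mjs' is a hypothetical single-file test script.
async function checkDocument(doc, script = './embed-one.mjs') {
    // Dynamic import keeps this sketch usable from both ESM and CommonJS.
    const { spawnSync } = await import('node:child_process');
    const result = spawnSync(process.execPath, [script, doc], { encoding: 'utf-8' });
    return classifyExit(result);
}
```

Looping over the entries of docfiles.txt and calling `await checkDocument(doc)` for each would yield the list of crashing files in a single run.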
For the files that cause the core dump:
asciidoctor*.adoc -- These files are in AsciiDoctor format.
select-elements.html.md -- This file is solely HTML elements.
style.css.less -- This is a CSS file which is actually in LESSCSS format (see https://lesscss.org/).