Create a language model based on a body of text and get high-quality predictions (next word, next phrase, next pixel, etc.).
npm i next-token-prediction
Put this /training/ directory in the root of your project.
Now you just need to create your app's index.js file and run it. Your model will start training on the .txt files located in /training/documents/. After training is complete it will run these 4 queries:
const { Language: LM } = require('next-token-prediction');

const MyLanguageModel = async () => {
  const agent = await LM({
    bootstrap: true
  });

  // Predict the next word
  agent.getTokenPrediction('what');

  // Predict the next 5 words
  agent.getTokenSequencePrediction('what is', 5);

  // Complete the phrase
  agent.complete('hopefully');

  // Get a top k sample of completion predictions
  agent.getCompletions('The sun');
};

MyLanguageModel();
Because training data was committed to this repo, you can optionally skip training, and just use the bootstrapped training data, like this:
const { dirname } = require('path');
const __root = dirname(require.main.filename);

const { Language: LM } = require('next-token-prediction');
const OpenSourceBooksDataset = require(`${__root}/training/datasets/OpenSourceBooks`);

const MyLanguageModel = async () => {
  const agent = await LM({
    dataset: OpenSourceBooksDataset
  });

  // Complete the phrase
  agent.complete('hopefully');
};

MyLanguageModel();
Or, train on your own provided text files:
const { dirname } = require('path');
const __root = dirname(require.main.filename);

const { Language: LM } = require('next-token-prediction');

// The function must be async because it awaits LM() below
const MyLanguageModel = async () => {
  // The following .txt files should exist in a `/training/documents/`
  // directory in the root of your project
  const agent = await LM({
    files: [
      'marie-antoinette',
      'pride-and-prejudice',
      'to-kill-a-mockingbird',
      'basic-algebra',
      'a-history-of-war',
      'introduction-to-c-programming'
    ]
  });

  // Complete the phrase
  agent.complete('hopefully');
};

MyLanguageModel();
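Since training reads .txt files from /training/documents/, it can help to confirm they are present before starting. The following is a minimal sketch, not part of the library: `missingDocuments` is a hypothetical helper, and it assumes each name in `files` corresponds to a `<name>.txt` file in that directory.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper (not part of next-token-prediction).
// Assumes each entry in `names` maps to `<name>.txt` inside
// `<projectRoot>/training/documents/`.
function missingDocuments(projectRoot, names) {
  const documentsDir = path.join(projectRoot, 'training', 'documents');

  // Keep only the names whose .txt file cannot be found
  return names.filter(
    (name) => !fs.existsSync(path.join(documentsDir, `${name}.txt`))
  );
}

// Usage: warn about absent files before creating the model
const missing = missingDocuments(process.cwd(), [
  'marie-antoinette',
  'pride-and-prejudice'
]);

if (missing.length > 0) {
  console.warn(`Missing training documents: ${missing.join(', ')}`);
}
```

Running a check like this first can make a misplaced or misnamed document easier to spot than a failure partway through training.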
Note
By default, next-token prediction does not use vector search. To enable it, set VARIANCE=1 (or any value higher than 0) in a .env file. This changes the prediction from returning the next likeliest token (n-gram search) to returning the most similar token (vector search), e.g. "The quick brown fox jumped..." (n-gram prediction) vs. "The quick brown fox juked..." (vector similarity). Note that vector search is considerably slower and more resource-intensive.
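To illustrate what "next likeliest token" means in n-gram search, here is a toy bigram model written from scratch. It is a sketch of the general technique only, not the library's implementation; `trainBigrams` and `rankNextTokens` are hypothetical names, and the tokenizer is a deliberately crude regular expression.

```javascript
// Toy bigram model: count which token follows which, then rank the
// followers by frequency. Illustrative only; not the library's internals.
function trainBigrams(text) {
  // Crude tokenizer: lowercase words and apostrophes only
  const tokens = text.toLowerCase().match(/[a-z']+/g) || [];
  const counts = new Map();

  for (let i = 0; i < tokens.length - 1; i++) {
    const current = tokens[i];
    const next = tokens[i + 1];
    if (!counts.has(current)) counts.set(current, new Map());
    const followers = counts.get(current);
    followers.set(next, (followers.get(next) || 0) + 1);
  }

  return counts;
}

// Return up to k followers of `token`, most frequent first
function rankNextTokens(counts, token, k = 3) {
  const followers = counts.get(token.toLowerCase());
  if (!followers) return [];

  return [...followers.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, k)
    .map(([candidate]) => candidate);
}

const bigrams = trainBigrams(
  'The quick brown fox jumped. The quick brown fox jumped again. The quick red fox ran.'
);

console.log(rankNextTokens(bigrams, 'fox', 1)); // [ 'jumped' ]
console.log(rankNextTokens(bigrams, 'quick')); // [ 'brown', 'red' ]
```

Taking the first ranked follower corresponds to predicting the single likeliest next token, while keeping the first k gives a top-k list of candidates.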
When running the n-gram training using the built-in training method, vector embeddings (144-dimensional) are also created for each token pair to capture context and semantics (e.g. the token Jordan has different values in the fragment Michael Jordan than it does in the fragment Syria, Jordan). The goal of vector search is to optionally enable paraphrasing, slang and profanity filtering, and more.
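The idea that the same token gets different vectors in different contexts can be sketched with cosine similarity, a common way to compare embeddings. Everything below is an assumption made for illustration: the 3-dimensional vectors are invented (the library's embeddings are 144-dimensional), `cosineSimilarity` and `mostSimilar` are hypothetical helpers, and the library's actual similarity measure is not stated here.

```javascript
// Cosine similarity between two equal-length numeric vectors
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Invented toy vectors, keyed by token pair. Real embeddings would be
// produced by training; these values exist only to show the mechanics.
const toyEmbeddings = {
  'Michael Jordan': [0.9, 0.1, 0.0],
  'Syria, Jordan': [0.1, 0.9, 0.1],
  'Chicago Bulls': [0.8, 0.2, 0.1],
  'Damascus, Syria': [0.0, 0.8, 0.2]
};

// Find the other key whose vector is closest to the vector for `key`
function mostSimilar(key, embeddings) {
  let best = null;
  let bestScore = -Infinity;
  for (const [other, vector] of Object.entries(embeddings)) {
    if (other === key) continue;
    const score = cosineSimilarity(embeddings[key], vector);
    if (score > bestScore) {
      best = other;
      bestScore = score;
    }
  }
  return best;
}

console.log(mostSimilar('Michael Jordan', toyEmbeddings)); // Chicago Bulls
console.log(mostSimilar('Syria, Jordan', toyEmbeddings)); // Damascus, Syria
```

In this toy data the two fragments containing Jordan land near different neighbors, which is the kind of context sensitivity the embeddings are meant to capture.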
Run tests
npm test
[Video: readline-completion.mp4]
[Video: readline-completion-verbose.mp4]
With more training data you can get more suggestions, eventually hitting a tipping point where it can complete anything.
[Video: autocomplete.mp4]
Watch the 3Blue1Brown video on YouTube.
- Provide a high-quality text prediction library for:
  - autocomplete
  - autocorrect
  - spell checking
  - search/lookup
  - summarizing
  - paraphrasing
- Create pixel and audio transformers for other prediction formats
- Demystify LLMs & simplify methodologies
- Make a high-quality, free/open chat-focused LLM in JavaScript, and an equally sophisticated image-focused diffusion model. Working on this here.