V2Index and TokensDFA extension: Compilation and mask time improvements #194
V2Index and TokensDFA
A. TokensDFA: Description
This new version of Index includes a TokensDFA object.
This TokensDFA can be seen as an extension of the DFA, in that it leverages the DFA's optimizations to reduce the computational complexity of constructing the token transition table.
The trade-off is to spend time upstream of the transition-table construction in order to gain advantages during the construction itself.
The regex world is a childish world: only 256 different values to manage, each of them one byte in size.
The token world has no limit on the number of distinct values and no limit on their size. Dante described it as "Malebolge".
The structure of the TokensDFA is very similar to the current index. The difference lies in the initialization.
A series of six optimizations has been implemented:
1. Reduce Vocabulary size
A static analysis of the regex builds the list of 'dead bytes'.
'Dead bytes' are bytes that are not allowed anywhere in the regex.
This lets us quickly discard every token that contains at least one dead byte.
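A minimal sketch of this filtering step, assuming the dead-byte set has already been produced by the static analysis (the `filter_vocabulary`, `dead_bytes` and `vocabulary` names are illustrative, not the actual implementation):

```rust
use std::collections::HashSet;

/// Illustrative only: drop every token that contains at least one dead byte.
/// `dead_bytes` is assumed to come from the static analysis of the regex.
fn filter_vocabulary<'a>(
    vocabulary: &'a [(u32, Vec<u8>)], // (token id, token bytes)
    dead_bytes: &HashSet<u8>,
) -> Vec<&'a (u32, Vec<u8>)> {
    vocabulary
        .iter()
        .filter(|(_, bytes)| bytes.iter().all(|b| !dead_bytes.contains(b)))
        .collect()
}

fn main() {
    // For a regex like "^[a-z]+$", every byte outside b'a'..=b'z' is dead.
    let dead_bytes: HashSet<u8> = (0u8..=255).filter(|b| !(b'a'..=b'z').contains(b)).collect();
    let vocabulary = vec![
        (0, b"man".to_vec()),
        (1, b"Man".to_vec()), // 'M' is a dead byte
        (2, b"m4n".to_vec()), // '4' is a dead byte
    ];
    let kept = filter_vocabulary(&vocabulary, &dead_bytes);
    assert_eq!(kept.len(), 1); // only "man" survives
}
```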
Before going further, one very important thing to know about a DFA is that, when it compiles, it tries to group bytes into classes.
Bytes in the same class have the same effect on the regex's graph.
For example, in a regex like "^[a-z]$", all the chars from 'a' to 'z' get the same class because they trigger the same behavior.
So there are 2 states and only one transition.
Conversely, with the regex
"^a[a-z]$"
the char 'a' will have a different class than the chars 'b' to 'z', because only 'a' is allowed as a transition from state 0. So two classes are needed: one for 'a' and one for [b-z].
This allows the DFA to drastically reduce the number of transitions by using classes as transition values.
We will use and abuse these classes.
2. Token Classification
We take the ByteClasses of the DFA and construct the class of each token by concatenating the classes of each of its bytes.
In other words, if the range of bytes [a-z] has the class [a], the token 'man' will have the class [a][a][a], like all tokens of 3 letters.
So we group all the tokens behind their classes, which allows us to consider only the classes when constructing the transition table.
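A minimal sketch of this grouping, with a hand-written byte-class table standing in for the one the DFA of "^a[a-z]$" would give (the `byte_class` table and the toy vocabulary are illustrative assumptions):

```rust
use std::collections::HashMap;

fn main() {
    // Hypothetical byte -> class table for "^a[a-z]$":
    // class 0 for 'a', class 1 for 'b'..='z', class 255 for everything else.
    let mut byte_class = [255u8; 256];
    byte_class[b'a' as usize] = 0;
    for b in b'b'..=b'z' {
        byte_class[b as usize] = 1;
    }

    // (token id, token bytes)
    let vocabulary: Vec<(u32, Vec<u8>)> =
        vec![(0, b"bed".to_vec()), (1, b"cut".to_vec()), (2, b"art".to_vec())];

    // Group token ids behind their class signature (one class per byte).
    let mut classes: HashMap<Vec<u8>, Vec<u32>> = HashMap::new();
    for (id, bytes) in &vocabulary {
        let signature: Vec<u8> = bytes.iter().map(|&b| byte_class[b as usize]).collect();
        classes.entry(signature).or_default().push(*id);
    }

    // "bed" and "cut" share the signature [1, 1, 1]; "art" is [0, 1, 1].
    // The transition table only has to consider these two signatures.
    assert_eq!(classes[&vec![1u8, 1, 1]], vec![0, 1]);
    assert_eq!(classes[&vec![0u8, 1, 1]], vec![2]);
}
```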
3. Prefix-Based Graph
After grouping tokens by their regex byte classes, we construct directed prefix-based graphs to efficiently represent token hierarchies and optimize pattern matching traversal.
By traversing the DFA transition table with each prefix-based graph, we can quickly discard entire sections of tokens as soon as one of their prefixes reaches a dead state.
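A minimal sketch of the idea, with a toy trie over class signatures and a hypothetical `next_state` function standing in for the real DFA transition table (every name here is illustrative):

```rust
use std::collections::HashMap;

/// Toy trie over byte-class signatures: each path spells the signature of
/// the tokens stored in `token_ids` at the node where the path ends.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u8, TrieNode>,
    token_ids: Vec<u32>,
}

impl TrieNode {
    fn insert(&mut self, signature: &[u8], token_id: u32) {
        let mut node = self;
        for &class in signature {
            node = node.children.entry(class).or_default();
        }
        node.token_ids.push(token_id);
    }
}

const DEAD: u32 = u32::MAX; // hypothetical dead-state marker

/// Walk the trie and the DFA together. As soon as a prefix reaches a dead
/// state, the whole subtree (every token sharing that prefix) is discarded.
fn collect_allowed(
    node: &TrieNode,
    state: u32,
    next_state: &impl Fn(u32, u8) -> u32, // stand-in for the DFA transition table
    allowed: &mut Vec<u32>,
) {
    allowed.extend_from_slice(&node.token_ids);
    for (&class, child) in &node.children {
        let next = next_state(state, class);
        if next != DEAD {
            collect_allowed(child, next, next_state, allowed);
        }
    }
}

fn main() {
    let mut root = TrieNode::default();
    root.insert(&[1, 1, 1], 0); // e.g. "bed"
    root.insert(&[1, 1], 1);    // e.g. "be"
    root.insert(&[0, 1, 1], 2); // e.g. "art"

    // Toy transition function: from state 0 only class 0 ('a') is alive,
    // afterwards class 1 ([b-z]) loops.
    let next_state = |state: u32, class: u8| match (state, class) {
        (0, 0) => 1,
        (s, 1) if s >= 1 => s,
        _ => DEAD,
    };

    let mut allowed = Vec::new();
    collect_allowed(&root, 0, &next_state, &mut allowed);
    // Tokens 0 and 1 share the prefix [1], which dies at state 0, so their
    // whole subtree is pruned in one step; only token 2 survives.
    assert_eq!(allowed, vec![2]);
}
```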
4. Good old Parallelization
The previous optimization produces a set of graphs with no intersection, which unlocks the possibility of walking the DFA in parallel, with one thread per graph.
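A minimal sketch of that parallel walk, assuming the rayon crate; the graph type and the per-graph traversal are simplified stand-ins for the previous step, not the real implementation:

```rust
use rayon::prelude::*;

/// Stand-in for one prefix-based graph: here, just the class signatures it covers.
struct PrefixGraph {
    signatures: Vec<Vec<u8>>,
}

/// Stand-in for the per-graph DFA traversal of the previous step.
fn walk_graph(graph: &PrefixGraph) -> usize {
    graph.signatures.iter().filter(|s| !s.is_empty()).count()
}

fn main() {
    let graphs = vec![
        PrefixGraph { signatures: vec![vec![0, 1], vec![0, 1, 1]] },
        PrefixGraph { signatures: vec![vec![1, 1]] },
    ];

    // The graphs share no tokens, so each one is walked on its own thread
    // and the per-graph results are merged afterwards.
    let results: Vec<usize> = graphs.par_iter().map(walk_graph).collect();
    assert_eq!(results, vec![2, 1]);
}
```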
5. Ultima Optima: Mute Literals and Coalescence
At this stage of optimization, the compilation times were already pretty good for the sample regex benchmarks.
But they were weak for JSON structures:
After investigation, it turns out that the problem comes from the literals!
Literals are the worst nightmare for a DFA (and, by extension, for the TokensDFA).
It's easy to understand why. If we reconsider our last regex
"^a[a-z]$"
the char 'a' is a literal. With classification, the char 'a' does not get the same class as the other letters.
By extension, every token of a given size that contains the letter 'a' will not have the same class as the other tokens of exactly the same size.
If we take two classes 'a' -> [a] and 'b-z' -> [b], the words "hand", "five" and "ante" respectively have the classes '[b][a][b][b]', '[b][b][b][b]' and '[a][b][b][b]'. This drastically increases the size of the alphabet, the number of transitions and the number of reachable states.
And the big issue is that there are a lot of literals in JSON structures (every attribute key at least, every symbol {, ", }, etc.).
The best example is the 'HTTPS' regex:
(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?
Here, 'https' is a literal, but so are 'http', 'h', 't' and 'p'. It is a huge stab at the previous optimizations.
Now, if we replace the 'https' deterministic sequence with two 'ghost' symbols (one for 'http', the other for 's', because 's' is optional with '?'):
(∟1(∟2)?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w \.-]*)*\/?
Yes, it's a huge improvement. Again, literals are the worst nightmare of regexes.
So, at the beginning, we add another static analysis of the regex to extract every literal (or 'deterministic sequence') made of alphanumeric chars.
For each of them, we find the best combination of tokens to express it. This is where coalescence takes place.
If we extract the literal 'filename', we can express it with the tokens 'file', 'name', 'f', 'i', 'l', 'e', 'n', 'a', 'm', 'e'.
Then we find the smallest combination, here the tokens 'file' and 'name'. For these tokens, we create two 'ghost' symbols.
'Ghost' tokens are built from chars that have a small probability of appearing in the regex and zero probability of being a prefix of a real token.
So every 'ghost' token begins with the char "\x1C", which is the File Separator (very rare), concatenated with the iteration index.
In our example, 'file' will be [28, 49] (byte values for "\x1C1") and 'name' will be [28, 50] (byte values for "\x1C2").
We assign the 'ghost' tokens the same ids as their respective real tokens, and we create a new regex with the ghost-token combinations in place of the literals.
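A minimal sketch of the coalescence and ghost-symbol step, under simplified assumptions (a greedy longest-match decomposition instead of a true smallest-combination search; every name is illustrative):

```rust
use std::collections::HashMap;

const GHOST_PREFIX: u8 = 0x1C; // File Separator: very unlikely to appear in a regex

/// Greedy longest-match decomposition of a literal into vocabulary tokens.
/// (The real step searches for the smallest combination; greedy longest-match
/// is a simplification that already works for 'filename'.)
fn decompose<'a>(literal: &'a str, vocabulary: &HashMap<&str, u32>) -> Vec<(&'a str, u32)> {
    let mut parts = Vec::new();
    let mut rest = literal;
    while !rest.is_empty() {
        // Take the longest vocabulary token that prefixes the remainder.
        let (piece, id) = (1..=rest.len())
            .rev()
            .find_map(|len| vocabulary.get(&rest[..len]).map(|&id| (&rest[..len], id)))
            .expect("assumed: every single char is itself a token");
        parts.push((piece, id));
        rest = &rest[piece.len()..];
    }
    parts
}

fn main() {
    let vocabulary: HashMap<&str, u32> = [
        ("file", 10), ("name", 11), ("f", 1), ("i", 2), ("l", 3),
        ("e", 4), ("n", 5), ("a", 6), ("m", 7),
    ]
    .into_iter()
    .collect();

    let parts = decompose("filename", &vocabulary);
    assert_eq!(parts, vec![("file", 10), ("name", 11)]);

    // One ghost symbol per piece: "\x1C" followed by the iteration index,
    // mapped to the *same* token id as the real token it replaces.
    let ghosts: Vec<(Vec<u8>, u32)> = parts
        .iter()
        .enumerate()
        .map(|(i, (_, id))| (vec![GHOST_PREFIX, b'1' + i as u8], *id))
        .collect();
    assert_eq!(ghosts[0], (vec![28, 49], 10)); // "\x1C1" -> id of "file"
    assert_eq!(ghosts[1], (vec![28, 50], 11)); // "\x1C2" -> id of "name"
    // The literal "filename" in the regex is then rewritten as "\x1C1\x1C2".
}
```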
6. Minimize Transitions Table
We use the same structure as the CompressIndex here: https://github.com/agourdel/outlines-core/tree/opt/new-index-struct
to reduce the average index size after compilation and increase the performance of serving the allowed tokens.
When we reduce, we replace the ghost tokens with the real tokens.
Bitset masks of allowed tokens are already initialized for every state.
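A minimal sketch of what such a per-state bitset mask looks like (u64 words, one bit per token id; this is illustrative, not the exact CompressIndex layout):

```rust
/// Illustrative bitset mask: one bit per token id, packed into u64 words.
fn build_mask(allowed_token_ids: &[u32], vocab_size: usize) -> Vec<u64> {
    let mut mask = vec![0u64; (vocab_size + 63) / 64];
    for &id in allowed_token_ids {
        mask[id as usize / 64] |= 1u64 << (id % 64);
    }
    mask
}

fn is_allowed(mask: &[u64], token_id: u32) -> bool {
    (mask[token_id as usize / 64] >> (token_id % 64)) & 1 == 1
}

fn main() {
    // Built once per state at compilation time, so serving the allowed tokens
    // at inference time is a plain lookup instead of a DFA walk.
    let mask = build_mask(&[3, 64, 70], 128);
    assert!(is_allowed(&mask, 64));
    assert!(!is_allowed(&mask, 5));
}
```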
B. Compilation Benchmarks (From Rust)
C. Memory Size Benchmarks (From Rust)
D. Average Time to Inner Mask (From Python)
Using a mask reference as parameter
E. Ready-To-Use
With this branch, the V2Index is directly integrated into the Index Python class without any breaking changes.
It's ready to use.
The 'get_tokens()' and 'advance()' functions can be used as in the previous version,
or they can be called with a reference to a mask (much faster).
TODO