-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trimming large value fields in spans #23
Conversation
a873879
to
a80d6e0
Compare
per slack discussion, can we make this change in the preprocessor rather than indexer? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test?
Also could make sense to log a warning when we are truncating a field.
@@ -46,11 +47,19 @@ public static Trace.KeyValue makeTraceKV(String key, Object value, Schema.Schema | |||
switch (type) { | |||
case KEYWORD -> { | |||
tagBuilder.setFieldType(Schema.SchemaFieldType.KEYWORD); | |||
tagBuilder.setVStr(value.toString()); | |||
if (value.toString().length() > MAX_TERM_LENGTH) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can this be a function?
also, probably need this logic in BINARY
too? or does the indexer pass those through without issue on length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw only for type KEYWORD
. Not for any other type.
Working on adding test case. I have deployed to production and do not see the errors on the indexer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It only applies to KEYWORD and TEXT types. BINARY isn't parsed so its length doesn't trigger this.
KEYWORDs are treated as single terms, so they must be less than the max term size. TEXT can have multiple terms in it, so they can be longer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The indexwriter [addDocument](https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#addDocument-java.lang.Iterable-)
functionality checks for each term in the document to be of size MAX_TERM_LENGTH
or less. So it is worth to check TEXT
, KEYWORD
and BINARY
and trim where applicable.
astra/src/main/java/com/slack/astra/logstore/DocumentBuilder.java
Outdated
Show resolved
Hide resolved
@@ -46,11 +47,19 @@ public static Trace.KeyValue makeTraceKV(String key, Object value, Schema.Schema | |||
switch (type) { | |||
case KEYWORD -> { | |||
tagBuilder.setFieldType(Schema.SchemaFieldType.KEYWORD); | |||
tagBuilder.setVStr(value.toString()); | |||
if (value.toString().length() > MAX_TERM_LENGTH) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It only applies to KEYWORD and TEXT types. BINARY isn't parsed so its length doesn't trigger this.
KEYWORDs are treated as single terms, so they must be less than the max term size. TEXT can have multiple terms in it, so they can be longer.
/* helper function to set tag builder value based of schema field type */ | ||
private static void setTagBuilderValue( | ||
Trace.KeyValue.Builder tagBuilder, Object value, Schema.SchemaFieldType type) { | ||
if (type == Schema.SchemaFieldType.KEYWORD || type == Schema.SchemaFieldType.TEXT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should just apply for KEYWORD and not TEXT. TEXT can be longer since it may be composed of multiple terms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if the length of the TEXT
is more than MAX_TERM_LENGTH
when adding the document, it can throw IllegalArgumentException
ref
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that each term in the document can be no longer than MAX_TERM_LENGTH in bytes, otherwise an IllegalArgumentException will be thrown.
Terms are tokenized parts of a field. A keyword field has one term which is the whole field. A text field may have many terms and so can be longer than max term length. A binary field has no terms because it isn't tokenized or indexed. Binary field sizes as stored fields are limited by the IndexWriter.MAX_STORED_STRING_LENGTH
which is roughly Integer.MAX_VALUE / 3
.
private static void setTagBuilderValue( | ||
Trace.KeyValue.Builder tagBuilder, Object value, Schema.SchemaFieldType type) { | ||
if (type == Schema.SchemaFieldType.KEYWORD || type == Schema.SchemaFieldType.TEXT) { | ||
if (value.toString().length() > MAX_TERM_LENGTH) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of doing this here, you could add it as another stanza beside the skip conditions for ignore_above
, which does something similar, and can be used to configure OS to do this https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html. There's two places:
- convertKVtoProto
- convertKVtoProtoDefault
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignore_above
will skip indexing the field completely. We want to trim down the field to MAX_TERM_LENGTH
and still index how much ever information we can preserve.
assertThat(actualValue.getKey()).isEqualTo(expectedValue.getKey()); | ||
assertThat(actualValue.getVBinary()) | ||
.isEqualTo(expectedValue.getVBinary().substring(0, SpanFormatter.MAX_TERM_LENGTH)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in addition to checking that we are enforcing max length, can we also call the indexer function in this test that was spitting out the error and ensure it succeeds?
the back and forth to string makes it hard to follow if this solves the problem on the indexer.
public class SpanFormatterTest { | ||
|
||
@Test | ||
public void testMakeTraceKV() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: probably makes sense to have a meaningful test name and could make sense to break up into multiple tests per field?
e.g. instead of comments like this, can be in the test case name:
shouldNotTruncateKeywordValueWhenUnderMaxLength
|
||
// schema type: KEYWORD, key: error, value: 1234...9999 (greater than 32766) | ||
String errorMsg = | ||
IntStream.range(1, 10000).boxed().map(String::valueOf).collect(Collectors.joining("")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we assert that errorMessage is too large before the truncate logic?
Closing this PR after internal team discussion. The PR is not required if types are handle correctly. |
Summary
When any of the field size is greater than 32766, indexwriter drops the entire span and throws an exception (
IllegalArgumentException
). This can specifically happen in case of values of type Binary, String, keyword and Text. Adding an extra layer of check in preprocessor while converting the document to Span.Requirements