
Trimming large value fields in spans #23

Closed
wants to merge 10 commits

Conversation

zarna1parekh
Collaborator

@zarna1parekh zarna1parekh commented Feb 6, 2025

Summary

When any field value is larger than 32766 bytes, the IndexWriter drops the entire span and throws an IllegalArgumentException. This can specifically happen for values of type Binary, String, Keyword, and Text. This PR adds an extra layer of checking in the preprocessor while converting the document to a Span.
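
A minimal sketch of the kind of check described above (the class name FieldTrimmer, the logging, and trimming by char count are illustrative assumptions, not the exact code in this PR; the 32766 limit mirrors Lucene's single-term byte limit):

// Sketch only: illustrates the trimming idea in this PR, not the exact implementation.
public final class FieldTrimmer {
  // Lucene's IndexWriter rejects any single indexed term longer than 32766 bytes.
  static final int MAX_TERM_LENGTH = 32766;

  // Caveat: this trims by char count; a byte-accurate check would measure UTF-8 length.
  static String truncateIfNeeded(String key, String value) {
    if (value.length() > MAX_TERM_LENGTH) {
      System.err.printf(
          "Truncating field '%s' from %d to %d chars%n", key, value.length(), MAX_TERM_LENGTH);
      return value.substring(0, MAX_TERM_LENGTH);
    }
    return value;
  }
}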

Requirements

@zarna1parekh zarna1parekh force-pushed the zparekh/large_span_fields branch from a873879 to a80d6e0 on February 6, 2025 22:16
@zarna1parekh zarna1parekh changed the title from Zparekh/large span fields to Skipping large value fields in spans on Feb 6, 2025
@autata
Collaborator

autata commented Feb 7, 2025

Per the Slack discussion, can we make this change in the preprocessor rather than the indexer?

Collaborator

@autata autata left a comment


Can you add a test?

Also, it could make sense to log a warning when we are truncating a field.

@@ -46,11 +47,19 @@ public static Trace.KeyValue makeTraceKV(String key, Object value, Schema.Schema
switch (type) {
case KEYWORD -> {
tagBuilder.setFieldType(Schema.SchemaFieldType.KEYWORD);
tagBuilder.setVStr(value.toString());
if (value.toString().length() > MAX_TERM_LENGTH) {
Collaborator

nit: can this be a function?

Also, we probably need this logic for BINARY too? Or does the indexer pass those through without issue regardless of length?

Collaborator Author

I saw this only for type KEYWORD, not for any other type.

I'm working on adding a test case. I have deployed the change to production and do not see the errors on the indexer.

Collaborator Author

Updated the PR


It only applies to KEYWORD and TEXT types. BINARY isn't parsed so its length doesn't trigger this.
KEYWORDs are treated as single terms, so they must be less than the max term size. TEXT can have multiple terms in it, so they can be longer.

Collaborator Author

The IndexWriter [addDocument](https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#addDocument-java.lang.Iterable-) functionality checks that each term in the document is of size MAX_TERM_LENGTH or less, so it is worth checking TEXT, KEYWORD, and BINARY and trimming where applicable.
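
For reference, that Lucene behavior is easy to reproduce in isolation: a StringField (one untokenized term, analogous to KEYWORD) longer than IndexWriter.MAX_TERM_LENGTH makes addDocument throw, while a TextField is tokenized into smaller terms and indexes fine. A self-contained sketch, assuming a recent Lucene version and Java 11+ (not code from this repo):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class TermLengthDemo {
  public static void main(String[] args) throws Exception {
    // One value just over the single-term limit (32766 bytes for ASCII chars).
    String huge = "x".repeat(IndexWriter.MAX_TERM_LENGTH + 1);
    try (IndexWriter writer =
        new IndexWriter(new ByteBuffersDirectory(), new IndexWriterConfig(new StandardAnalyzer()))) {
      Document textDoc = new Document();
      textDoc.add(new TextField("text", huge, Field.Store.NO)); // tokenized into small terms
      writer.addDocument(textDoc); // succeeds

      Document keywordDoc = new Document();
      keywordDoc.add(new StringField("keyword", huge, Field.Store.NO)); // one immense term
      try {
        writer.addDocument(keywordDoc);
      } catch (IllegalArgumentException e) {
        System.out.println("Rejected as expected: " + e.getMessage());
      }
    }
  }
}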

/* helper function to set tag builder value based on schema field type */
private static void setTagBuilderValue(
Trace.KeyValue.Builder tagBuilder, Object value, Schema.SchemaFieldType type) {
if (type == Schema.SchemaFieldType.KEYWORD || type == Schema.SchemaFieldType.TEXT) {


I think this should just apply to KEYWORD and not TEXT. TEXT can be longer since it may be composed of multiple terms.

Collaborator Author

But if the length of the TEXT is more than MAX_TERM_LENGTH when adding the document, it can throw an IllegalArgumentException (ref).


Note that each term in the document can be no longer than MAX_TERM_LENGTH in bytes, otherwise an IllegalArgumentException will be thrown.

Terms are tokenized parts of a field. A keyword field has one term, which is the whole field. A text field may have many terms and so can be longer than the max term length. A binary field has no terms because it isn't tokenized or indexed. Binary field sizes as stored fields are limited by IndexWriter.MAX_STORED_STRING_LENGTH, which is roughly Integer.MAX_VALUE / 3.

private static void setTagBuilderValue(
Trace.KeyValue.Builder tagBuilder, Object value, Schema.SchemaFieldType type) {
if (type == Schema.SchemaFieldType.KEYWORD || type == Schema.SchemaFieldType.TEXT) {
if (value.toString().length() > MAX_TERM_LENGTH) {


Instead of doing this here, you could add it as another stanza beside the skip conditions for ignore_above, which does something similar and can be used to configure OS to do this: https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html. There are two places:

  1. convertKVtoProto
  2. convertKVtoProtoDefault

Collaborator Author

ignore_above will skip indexing the field completely. We want to trim the field down to MAX_TERM_LENGTH and still index as much of the information as we can preserve.
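
For readers following along, the truncated diff hunks in this thread roughly correspond to a helper like the one below (a sketch reconstructed from the visible diff lines, assuming the repo's Trace.KeyValue.Builder and Schema.SchemaFieldType types; the final condition set and behavior in the PR may differ):

/* Helper to set the tag builder value based on schema field type, trimming
   string values that would exceed Lucene's single-term limit.
   Reconstructed sketch; not necessarily the exact code in this PR. */
private static void setTagBuilderValue(
    Trace.KeyValue.Builder tagBuilder, Object value, Schema.SchemaFieldType type) {
  if (type == Schema.SchemaFieldType.KEYWORD || type == Schema.SchemaFieldType.TEXT) {
    String str = value.toString();
    if (str.length() > MAX_TERM_LENGTH) {
      str = str.substring(0, MAX_TERM_LENGTH); // trim rather than drop the whole span
    }
    tagBuilder.setVStr(str);
  }
  // Other field types (BINARY, numeric, ...) keep their original handling in the switch.
}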

@zarna1parekh zarna1parekh changed the title from Skipping large value fields in spans to Trimming large value fields in spans on Feb 10, 2025
assertThat(actualValue.getKey()).isEqualTo(expectedValue.getKey());
assertThat(actualValue.getVBinary())
.isEqualTo(expectedValue.getVBinary().substring(0, SpanFormatter.MAX_TERM_LENGTH));
}
Collaborator

In addition to checking that we are enforcing the max length, can we also call the indexer function that was spitting out the error in this test, and ensure it succeeds?

The back-and-forth to string makes it hard to follow whether this solves the problem on the indexer.

public class SpanFormatterTest {

@Test
public void testMakeTraceKV() {
Collaborator

nit: it probably makes sense to have a meaningful test name, and it could make sense to break this up into multiple tests per field.

e.g. instead of comments like this, the intent can go in the test case name:
shouldNotTruncateKeywordValueWhenUnderMaxLength


// schema type: KEYWORD, key: error, value: 1234...9999 (greater than 32766)
String errorMsg =
IntStream.range(1, 10000).boxed().map(String::valueOf).collect(Collectors.joining(""));
Collaborator

can we assert that errorMessage is too large before the truncate logic?
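
Something along these lines would capture that (a fragment meant to sit inside the existing test; AssertJ style to match the snippets above, and the getVStr accessor is assumed for the KEYWORD case):

// Sanity check: the raw value really exceeds the limit before any trimming happens.
assertThat(errorMsg.length()).isGreaterThan(SpanFormatter.MAX_TERM_LENGTH);

// After building the KeyValue, verify the stored value was trimmed to the limit.
assertThat(actualValue.getVStr().length()).isEqualTo(SpanFormatter.MAX_TERM_LENGTH);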

@zarna1parekh
Collaborator Author

Closing this PR after internal team discussion. The PR is not required if types are handled correctly.

@zarna1parekh zarna1parekh deleted the zparekh/large_span_fields branch February 12, 2025 18:17