
Trimming large value fields in spans #23

Closed
wants to merge 10 commits

Conversation

zarna1parekh
Collaborator

@zarna1parekh zarna1parekh commented Feb 6, 2025

Summary

When any field value is larger than 32766 bytes, the IndexWriter drops the entire span and throws an IllegalArgumentException. This can specifically happen for values of type Binary, String, Keyword, and Text. This PR adds an extra layer of checking in the preprocessor while converting the document to a Span.
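
A minimal sketch of the kind of check described above (the class name FieldTrimmer, the logging, and trimming by char count are illustrative assumptions, not the exact code in this PR; the 32766 limit mirrors Lucene's single-term byte limit):

// Sketch only: illustrates the trimming idea in this PR, not the exact implementation.
public final class FieldTrimmer {
  // Lucene's IndexWriter rejects any single indexed term longer than 32766 bytes.
  static final int MAX_TERM_LENGTH = 32766;

  // Caveat: this trims by char count; a byte-accurate check would measure UTF-8 length.
  static String truncateIfNeeded(String key, String value) {
    if (value.length() > MAX_TERM_LENGTH) {
      System.err.printf(
          "Truncating field '%s' from %d to %d chars%n", key, value.length(), MAX_TERM_LENGTH);
      return value.substring(0, MAX_TERM_LENGTH);
    }
    return value;
  }
}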

Requirements

@zarna1parekh zarna1parekh force-pushed the zparekh/large_span_fields branch from a873879 to a80d6e0 on February 6, 2025 22:16
@zarna1parekh zarna1parekh changed the title from Zparekh/large span fields to Skipping large value fields in spans on Feb 6, 2025
@autata
Collaborator

autata commented Feb 7, 2025

Per the Slack discussion, can we make this change in the preprocessor rather than the indexer?

Collaborator

@autata autata left a comment


Can you add a test?

Also, it could make sense to log a warning when we are truncating a field.

@@ -46,11 +47,19 @@ public static Trace.KeyValue makeTraceKV(String key, Object value, Schema.Schema
switch (type) {
case KEYWORD -> {
tagBuilder.setFieldType(Schema.SchemaFieldType.KEYWORD);
tagBuilder.setVStr(value.toString());
if (value.toString().length() > MAX_TERM_LENGTH) {
Collaborator

nit: can this be a function?

Also, we probably need this logic for BINARY too? Or does the indexer pass those through without issue regardless of length?

Collaborator Author

I saw this only for type KEYWORD, not for any other type.

I'm working on adding a test case. I have deployed the change to production and do not see the errors on the indexer.

Collaborator Author

Updated the PR


It only applies to KEYWORD and TEXT types. BINARY isn't parsed so its length doesn't trigger this.
KEYWORDs are treated as single terms, so they must be less than the max term size. TEXT can have multiple terms in it, so they can be longer.

Collaborator Author

The IndexWriter [addDocument](https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#addDocument-java.lang.Iterable-) functionality checks that each term in the document is of size MAX_TERM_LENGTH or less, so it is worth checking TEXT, KEYWORD, and BINARY and trimming where applicable.
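
For reference, that Lucene behavior is easy to reproduce in isolation: a StringField (one untokenized term, analogous to KEYWORD) longer than IndexWriter.MAX_TERM_LENGTH makes addDocument throw, while a TextField is tokenized into smaller terms and indexes fine. A self-contained sketch, assuming a recent Lucene version and Java 11+ (not code from this repo):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class TermLengthDemo {
  public static void main(String[] args) throws Exception {
    // One value just over the single-term limit (32766 bytes for ASCII chars).
    String huge = "x".repeat(IndexWriter.MAX_TERM_LENGTH + 1);
    try (IndexWriter writer =
        new IndexWriter(new ByteBuffersDirectory(), new IndexWriterConfig(new StandardAnalyzer()))) {
      Document textDoc = new Document();
      textDoc.add(new TextField("text", huge, Field.Store.NO)); // tokenized into small terms
      writer.addDocument(textDoc); // succeeds

      Document keywordDoc = new Document();
      keywordDoc.add(new StringField("keyword", huge, Field.Store.NO)); // one immense term
      try {
        writer.addDocument(keywordDoc);
      } catch (IllegalArgumentException e) {
        System.out.println("Rejected as expected: " + e.getMessage());
      }
    }
  }
}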

/* helper function to set tag builder value based on schema field type */
private static void setTagBuilderValue(
Trace.KeyValue.Builder tagBuilder, Object value, Schema.SchemaFieldType type) {
if (type == Schema.SchemaFieldType.KEYWORD || type == Schema.SchemaFieldType.TEXT) {


I think this should just apply to KEYWORD and not TEXT. TEXT can be longer since it may be composed of multiple terms.

Collaborator Author

But if the length of the TEXT is more than MAX_TERM_LENGTH when adding the document, it can throw an IllegalArgumentException (ref).


Note that each term in the document can be no longer than MAX_TERM_LENGTH in bytes, otherwise an IllegalArgumentException will be thrown.

Terms are tokenized parts of a field. A keyword field has one term, which is the whole field. A text field may have many terms and so can be longer than the max term length. A binary field has no terms because it isn't tokenized or indexed. Binary field sizes as stored fields are limited by IndexWriter.MAX_STORED_STRING_LENGTH, which is roughly Integer.MAX_VALUE / 3.

private static void setTagBuilderValue(
Trace.KeyValue.Builder tagBuilder, Object value, Schema.SchemaFieldType type) {
if (type == Schema.SchemaFieldType.KEYWORD || type == Schema.SchemaFieldType.TEXT) {
if (value.toString().length() > MAX_TERM_LENGTH) {


Instead of doing this here, you could add it as another stanza beside the skip conditions for ignore_above, which does something similar and can be used to configure OS to do this: https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html. There are two places:

  1. convertKVtoProto
  2. convertKVtoProtoDefault

Collaborator Author

ignore_above will skip indexing the field completely. We want to trim the field down to MAX_TERM_LENGTH and still index as much of the information as we can preserve.
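
For readers following along, the truncated diff hunks in this thread roughly correspond to a helper like the one below (a sketch reconstructed from the visible diff lines, assuming the repo's Trace.KeyValue.Builder and Schema.SchemaFieldType types; the final condition set and behavior in the PR may differ):

/* Helper to set the tag builder value based on schema field type, trimming
   string values that would exceed Lucene's single-term limit.
   Reconstructed sketch; not necessarily the exact code in this PR. */
private static void setTagBuilderValue(
    Trace.KeyValue.Builder tagBuilder, Object value, Schema.SchemaFieldType type) {
  if (type == Schema.SchemaFieldType.KEYWORD || type == Schema.SchemaFieldType.TEXT) {
    String str = value.toString();
    if (str.length() > MAX_TERM_LENGTH) {
      str = str.substring(0, MAX_TERM_LENGTH); // trim rather than drop the whole span
    }
    tagBuilder.setVStr(str);
  }
  // Other field types (BINARY, numeric, ...) keep their original handling in the switch.
}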

@zarna1parekh zarna1parekh changed the title from Skipping large value fields in spans to Trimming large value fields in spans on Feb 10, 2025
assertThat(actualValue.getKey()).isEqualTo(expectedValue.getKey());
assertThat(actualValue.getVBinary())
.isEqualTo(expectedValue.getVBinary().substring(0, SpanFormatter.MAX_TERM_LENGTH));
}
Collaborator

In addition to checking that we are enforcing the max length, can we also call the indexer function that was spitting out the error in this test, and ensure it succeeds?

The back-and-forth to string makes it hard to follow whether this solves the problem on the indexer.

public class SpanFormatterTest {

@Test
public void testMakeTraceKV() {
Collaborator

nit: it probably makes sense to have a meaningful test name, and it could make sense to break this up into multiple tests per field.

e.g. instead of comments like this, the intent can go in the test case name:
shouldNotTruncateKeywordValueWhenUnderMaxLength


// schema type: KEYWORD, key: error, value: 1234...9999 (greater than 32766)
String errorMsg =
IntStream.range(1, 10000).boxed().map(String::valueOf).collect(Collectors.joining(""));
Collaborator

can we assert that errorMessage is too large before the truncate logic?
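
Something along these lines would capture that (a fragment meant to sit inside the existing test; AssertJ style to match the snippets above, and the getVStr accessor is assumed for the KEYWORD case):

// Sanity check: the raw value really exceeds the limit before any trimming happens.
assertThat(errorMsg.length()).isGreaterThan(SpanFormatter.MAX_TERM_LENGTH);

// After building the KeyValue, verify the stored value was trimmed to the limit.
assertThat(actualValue.getVStr().length()).isEqualTo(SpanFormatter.MAX_TERM_LENGTH);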

@zarna1parekh
Collaborator Author

Closing this PR after internal team discussion. The PR is not required if types are handled correctly.

@zarna1parekh zarna1parekh deleted the zparekh/large_span_fields branch February 12, 2025 18:17