
[SPARK-50017][SS] Support Avro encoding for TransformWithState operator #48401


Closed
wants to merge 10 commits into from

Conversation

Contributor

@ericm-db ericm-db commented Oct 9, 2024

What changes were proposed in this pull request?

Currently, we use UnsafeRow's internal byte representation to store state for stateful streaming operators in the StateStore. This PR introduces Avro serialization and deserialization capabilities in the RocksDBStateEncoder so that we can instead use Avro encoding to store state. For now this is enabled for the TransformWithState operator via a SQLConf, and it covers all functionality supported by TWS.
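
The feature is gated behind a SQL configuration. A minimal sketch of opting in, assuming the conf key added for this feature is `spark.sql.streaming.stateStore.encodingFormat` (check SQLConf for the authoritative name and values):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative session setup; any transformWithState query started afterwards
// would store its state rows Avro-encoded in RocksDB instead of raw UnsafeRow bytes.
val spark = SparkSession.builder().master("local[2]").appName("avro-state-demo").getOrCreate()
spark.conf.set("spark.sql.streaming.stateStore.encodingFormat", "avro")
```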

Why are the changes needed?

UnsafeRow is an inherently unstable format that makes no guarantees of backward compatibility. Therefore, if the format changes between Spark releases, this could cause StateStore corruption. Avro is more stable, and it inherently enables schema evolution.
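
To illustrate the schema-evolution point, here is a small, self-contained sketch (plain Avro APIs, not the PR's code) showing that bytes written with an old writer schema remain readable under a new reader schema that adds a defaulted field:

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.SchemaBuilder
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object AvroEvolutionSketch {
  def main(args: Array[String]): Unit = {
    // Old schema used when the state row was written.
    val writerSchema = SchemaBuilder.record("StateValue").fields()
      .requiredLong("count")
      .endRecord()
    // New schema adds a field with a default, so old rows stay readable.
    val readerSchema = SchemaBuilder.record("StateValue").fields()
      .requiredLong("count")
      .name("lastUpdated").`type`().longType().longDefault(0L)
      .endRecord()

    // Encode one record with the writer schema.
    val record = new GenericData.Record(writerSchema)
    record.put("count", java.lang.Long.valueOf(42L))
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](writerSchema).write(record, encoder)
    encoder.flush()

    // Decode with the reader schema; the missing field takes its default.
    val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
    val decoded = new GenericDatumReader[GenericRecord](writerSchema, readerSchema).read(null, decoder)
    println(decoded) // {"count": 42, "lastUpdated": 0}
  }
}
```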

Does this PR introduce any user-facing change?

No

How was this patch tested?

Amended existing unit tests and added new ones.

Was this patch authored or co-authored using generative AI tooling?

No

@ericm-db ericm-db changed the title [WIP] Avrfo [WIP] Avro Oct 9, 2024
@ericm-db ericm-db changed the title [WIP] Avro [SPARK-50017] Support Avro encoding for TransformWithState operator - ValueState Oct 17, 2024
@ericm-db ericm-db force-pushed the avro branch 2 times, most recently from ec1e07a to 1aca8f4 Compare October 18, 2024 18:47
Member

why do we need to move this file?

Contributor Author

Because it's used in AvroOptions

Member

Have we considered introducing a deprecated class under org.apache.spark.sql.avro that retains all the existing public methods, while moving their implementations into sql/core?

Contributor Author

Sure, we can do this.

@ericm-db ericm-db closed this Oct 22, 2024
@ericm-db ericm-db reopened this Oct 24, 2024
@ericm-db ericm-db changed the title [SPARK-50017] Support Avro encoding for TransformWithState operator - ValueState [SPARK-50017] Support Avro encoding for TransformWithState operator - ValueState, ListState Oct 24, 2024

@deprecated("Use org.apache.spark.sql.core.avro.SchemaConverters instead", "4.0.0")
@Evolving
object DeprecatedSchemaConverters {
Member

Let's keep the name SchemaConverters and not include Deprecated in the object name.
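
A hedged sketch of what that could look like: a deprecated object keeps the old name and package and simply forwards to the implementation moved into sql/core (the target package is the one named in the deprecation message above; the forwarded signature mirrors the existing public `toAvroType` and is shown only as an illustration):

```scala
package org.apache.spark.sql.avro

import org.apache.avro.Schema
import org.apache.spark.annotation.Evolving
import org.apache.spark.sql.types.DataType

@deprecated("Use org.apache.spark.sql.core.avro.SchemaConverters instead", "4.0.0")
@Evolving
object SchemaConverters {
  // Forward to the relocated implementation so existing callers keep compiling.
  def toAvroType(
      catalystType: DataType,
      nullable: Boolean = false,
      recordName: String = "topLevelRecord",
      nameSpace: String = ""): Schema =
    org.apache.spark.sql.core.avro.SchemaConverters.toAvroType(
      catalystType, nullable, recordName, nameSpace)
}
```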

Contributor

@brkyvz brkyvz left a comment

The changes are SOOO much cleaner now, thank you. It can get even cleaner though:

  1. I feel like you can add a Serde interface for the StateEncoder code changes. That should simplify the code even further.
  2. Any reason we didn't just extend the suites with a different SQLConf to test out the different encoding types? I feel that would remove a ton of code changes as well (see the sketch after this list).
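
A hedged sketch of point (2): re-run an existing suite with the encoding flipped via the conf, instead of rewriting tests. Suite names below are hypothetical, and the conf key is assumed to be the one added for this feature:

```scala
import org.apache.spark.SparkConf

// Re-runs every test in the (hypothetical) base suite with Avro state encoding enabled.
class MyOperatorAvroEncodingSuite extends MyOperatorSuite {
  override protected def sparkConf: SparkConf =
    super.sparkConf.set("spark.sql.streaming.stateStore.encodingFormat", "avro")
}
```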

@@ -563,13 +684,233 @@ class RangeKeyScanStateEncoder(
writer.getRow()
}

def encodePrefixKeyForRangeScan(
Contributor

Can you add a scaladoc please?

out.toByteArray
}

def decodePrefixKeyForRangeScan(
Contributor

ditto on scaladoc please

virtualColFamilyId: Option[Short] = None)
extends RocksDBKeyStateEncoderBase(useColumnFamilies, virtualColFamilyId) {
virtualColFamilyId: Option[Short] = None,
avroEnc: Option[AvroEncoder] = None)
Contributor

Instead of avroEnc, I would honestly introduce another interface:

trait Serde {

  def encodeToBytes(...)

  def decodeToUnsafeRow(...)
  
  def encodePrefixKeyForRangeScan(...)

  def decodePrefixKeyForRangeScan(...)
}

and move the logic in there so that you don't have to keep on doing avroEnc.isDefined for these

The logic seems pretty similar except for the input data. The AvroStateSerde or whatever you want to name it would have the private lazy val remainingKeyAvroType = SchemaConverters.toAvroType(remainingKeySchema)
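
A hedged, trimmed sketch of that Serde idea (method names follow the outline above; bodies are illustrative, not the PR's code). Each encoder would hold one implementation and call it directly, instead of branching on `avroEnc.isDefined`:

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.types.StructType

trait StateSerde {
  def encodeToBytes(row: UnsafeRow): Array[Byte]
  def decodeToUnsafeRow(bytes: Array[Byte], numFields: Int): UnsafeRow
}

/** Pass-through serde for the existing UnsafeRow format. */
class UnsafeRowStateSerde extends StateSerde {
  override def encodeToBytes(row: UnsafeRow): Array[Byte] = row.getBytes
  override def decodeToUnsafeRow(bytes: Array[Byte], numFields: Int): UnsafeRow = {
    val row = new UnsafeRow(numFields)
    row.pointTo(bytes, bytes.length)
    row
  }
}

/** Avro serde; owns the converted Avro schema the comment above mentions. */
class AvroStateSerde(schema: StructType) extends StateSerde {
  private lazy val avroType: Schema = SchemaConverters.toAvroType(schema)
  // A real implementation would route through Avro serializers/deserializers;
  // left unimplemented here because those details live in the state encoder.
  override def encodeToBytes(row: UnsafeRow): Array[Byte] = ???
  override def decodeToUnsafeRow(bytes: Array[Byte], numFields: Int): UnsafeRow = ???
}
```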

Contributor Author

Spoke offline - it doesn't look like this simplifies things an awful lot - can be a follow-up.

virtualColFamilyId: Option[Short] = None)
extends RocksDBKeyStateEncoderBase(useColumnFamilies, virtualColFamilyId) {
virtualColFamilyId: Option[Short] = None,
avroEnc: Option[AvroEncoder] = None)
Contributor

ditto on the Serde.

Some(newColFamilyId), avroEnc), RocksDBStateEncoder.getValueEncoder(valueSchema,
useMultipleValuesPerKey, avroEnc)))
}
private def getAvroSerializer(schema: StructType): AvroSerializer = {
Contributor

nit: line before the method please

@@ -74,10 +75,71 @@ private[sql] class RocksDBStateStoreProvider
isInternal: Boolean = false): Unit = {
verifyColFamilyCreationOrDeletion("create_col_family", colFamilyName, isInternal)
val newColFamilyId = rocksDB.createColFamilyIfAbsent(colFamilyName)
// Create cache key using store ID to avoid collisions
val avroEncCacheKey = s"${stateStoreId.operatorId}_" +
Contributor

Do we have the stream runId (maybe it's available in the HadoopConf)? We should add runId, otherwise there could be collisions
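
A hedged sketch of what including the run id could look like, assuming the streaming run id is published into the Hadoop configuration (Spark's state store code reads it from there); the key layout itself is illustrative:

```scala
import org.apache.hadoop.conf.Configuration

object AvroEncoderCacheKey {
  // Key under which the streaming run id is assumed to be stored in the Hadoop conf.
  private val RunIdKey = "sql.streaming.runId"

  def build(
      hadoopConf: Configuration,
      operatorId: Long,
      partitionId: Int,
      colFamilyName: String): String = {
    // Fall back for non-streaming contexts such as unit tests.
    val runId = Option(hadoopConf.get(RunIdKey)).getOrElse("no-run-id")
    s"${runId}_${operatorId}_${partitionId}_$colFamilyName"
  }
}
```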

Comment on lines 41 to 42
// Avro encoder that is used by the RocksDBStateStoreProvider and RocksDBStateEncoder
// in order to serialize from UnsafeRow to a byte array of Avro encoding.
Contributor

Can you please turn this into a proper scaladoc?

/**
 * ...
 */

TestWithBothChangelogCheckpointingEnabledAndDisabled ) { colFamiliesEnabled =>
val testSchema: StructType = StructType(
Seq(
StructField("ordering-1", LongType, false),
Contributor

oh, why'd you have to change these? If these are not supported by Avro, do we have any check anywhere to disallow the usage of the Avro encoder?

Contributor Author

@ericm-db ericm-db Nov 21, 2024

Avro code would just throw an error, saying that there are invalid characters in the field name
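
A small hedged example of that failure mode: Avro names may only contain letters, digits, and underscores (and must not start with a digit), so converting a schema containing a field like `ordering-1` fails inside the Avro library before any state is written:

```scala
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object AvroFieldNameCheck {
  def main(args: Array[String]): Unit = {
    val schema = StructType(Seq(StructField("ordering-1", LongType, nullable = false)))
    try {
      SchemaConverters.toAvroType(schema)
    } catch {
      // Avro rejects the illegal field name (e.g. SchemaParseException extends this).
      case e: org.apache.avro.AvroRuntimeException => println(s"Rejected: ${e.getMessage}")
    }
  }
}
```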

Comment on lines 131 to 132
def testWithEncodingTypes(testName: String, testTags: Tag*)
(testBody: => Any): Unit = {
Contributor

one parameter per line like below please
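
A hedged sketch of the helper with one parameter per line, also showing the intent of running the body once per encoding format (`test` and `withSQLConf` come from the surrounding suite; the conf key and values are assumptions):

```scala
def testWithEncodingTypes(
    testName: String,
    testTags: Tag*)
    (testBody: => Any): Unit = {
  Seq("unsaferow", "avro").foreach { encoding =>
    test(s"$testName (encoding = $encoding)", testTags: _*) {
      withSQLConf("spark.sql.streaming.stateStore.encodingFormat" -> encoding) {
        testBody
      }
    }
  }
}
```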

@brkyvz
Contributor

brkyvz commented Nov 21, 2024

Oh, forgot - we need to add the stream run id to the Avro encoder cache key, otherwise we may risk some unintended re-use of Avro encoders. We should also limit the size of that cache and add expiry to it.
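
A hedged sketch of such a bounded, expiring cache using Guava directly; the size and expiry values and the `AvroEncoder` placeholder are illustrative, not the PR's final choices:

```scala
import java.util.concurrent.{Callable, TimeUnit}
import com.google.common.cache.CacheBuilder

final class AvroEncoder // placeholder for the serializer/deserializer bundle cached per column family

object AvroEncoderCache {
  private val cache = CacheBuilder.newBuilder()
    .maximumSize(1000)                    // cap the number of cached encoders
    .expireAfterAccess(1, TimeUnit.HOURS) // let entries from finished runs expire
    .build[String, AvroEncoder]()

  def getOrCreate(key: String)(create: => AvroEncoder): AvroEncoder =
    cache.get(key, new Callable[AvroEncoder] {
      override def call(): AvroEncoder = create
    })
}
```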

Contributor

@brkyvz brkyvz left a comment

LGTM!

@HeartSaVioR
Contributor

@brkyvz asked me to help merge this - while I could do that (or at least take a quick pass through the code change from my side before doing so), I'd like to make sure that @gengliangwang is OK with this change.

@gengliangwang Do you have any further outstanding comment, or was splitting the PR your only concern? I'm going to merge this if you have no outstanding comment by tomorrow.

@HeartSaVioR
Contributor

No update. I'll proceed.

Thanks! Merging to master. (DISCLAIMER: I'm merging on behalf of @brkyvz )

@HeartSaVioR HeartSaVioR changed the title [SPARK-50017] Support Avro encoding for TransformWithState operator [SPARK-50017][SS] Support Avro encoding for TransformWithState operator Nov 26, 2024
@brkyvz
Contributor

brkyvz commented Nov 26, 2024 via email

@LuciferYang
Contributor

LuciferYang commented Nov 27, 2024

After this PR was merged, the Maven daily test started to fail.


We can use the following steps to confirm that this PR caused similar test failures and to reproduce the issue:

  • Before this PR:
git reset --hard f7122137006e941393c8be619fb51b3b713a24cb // before this one: [SPARK-50415][BUILD] Upgrade `zstd-jni` to 1.5.6-8
build/mvn clean install -DskipTests -pl sql/core -am
build/mvn test -pl sql/core -DwildcardSuites=none -Dtest=test.org.apache.spark.sql.JavaDatasetSuite

[INFO] Running test.org.apache.spark.sql.JavaDatasetSuite
00:29:17.513 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

[INFO] Tests run: 47, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.54 s -- in test.org.apache.spark.sql.JavaDatasetSuite
[INFO] 
[INFO] Results:
[INFO] 
[INFO] Tests run: 47, Failures: 0, Errors: 0, Skipped: 0
  • After this PR:
git reset --hard 69d433bcfd5a2d69f3cd7f8c4e310a3b5854fc74 // [SPARK-50387][SS] Update condition for timer expiry and relevant test
build/mvn clean install -DskipTests -pl sql/core -am
build/mvn test -pl sql/core -DwildcardSuites=none -Dtest=test.org.apache.spark.sql.JavaDatasetSuite

[INFO] Running test.org.apache.spark.sql.JavaDatasetSuite
00:40:09.702 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

00:40:16.384 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 in stage 4.0 (TID 3)
java.lang.NoSuchMethodError: 'void org.apache.spark.util.NonFateSharingCache.<init>(com.google.common.cache.Cache)'
	at org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider$.<clinit>(RocksDBStateStoreProvider.scala:623)
	at org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider.init(RocksDBStateStoreProvider.scala:393)
	at org.apache.spark.sql.execution.streaming.state.StateStoreProvider$.createAndInit(StateStore.scala:499)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.initNewStateStoreAndProcessData(TransformWithStateExec.scala:636)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.$anonfun$doExecute$1(TransformWithStateExec.scala:571)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.$anonfun$doExecute$1$adapted(TransformWithStateExec.scala:549)
	at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinHelper$StateStoreAwareZipPartitionsRDD.compute(StreamingSymmetricHashJoinHelper.scala:295)
00:40:16.384 ERROR org.apache.spark.executor.Executor: Exception in task 3.0 in stage 4.0 (TID 4)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider$
	at org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider.init(RocksDBStateStoreProvider.scala:393)
	at org.apache.spark.sql.execution.streaming.state.StateStoreProvider$.createAndInit(StateStore.scala:499)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.initNewStateStoreAndProcessData(TransformWithStateExec.scala:636)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.$anonfun$doExecute$1(TransformWithStateExec.scala:571)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.$anonfun$doExecute$1$adapted(TransformWithStateExec.scala:549)
	at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinHelper$StateStoreAwareZipPartitionsRDD.compute(StreamingSymmetricHashJoinHelper.scala:295)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
00:40:16.393 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 (TID 3) (localhost executor driver): java.lang.NoSuchMethodError: 'void org.apache.spark.util.NonFateSharingCache.<init>(com.google.common.cache.Cache)'
	at org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider$.<clinit>(RocksDBStateStoreProvider.scala:623)
	at org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider.init(RocksDBStateStoreProvider.scala:393)
	at org.apache.spark.sql.execution.streaming.state.StateStoreProvider$.createAndInit(Stat...

00:40:16.394 ERROR org.apache.spark.scheduler.TaskSetManager: Task 1 in stage 4.0 failed 1 times; aborting job

00:40:16.394 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 3.0 in stage 4.0 (TID 4) (localhost executor driver): java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider$
	at org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider.init(RocksDBStateStoreProvider.scala:393)
	at org.apache.spark.sql.execution.streaming.state.StateStoreProvider$.createAndInit(StateStore.scala:499)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.initNewStateStoreAndP...

00:40:16.399 ERROR org.apache.spark.executor.Executor: Exception in task 4.0 in stage 4.0 (TID 5)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider$
	at org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider.init(RocksDBStateStoreProvider.scala:393)
	at org.apache.spark.sql.execution.streaming.state.StateStoreProvider$.createAndInit(StateStore.scala:499)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.initNewStateStoreAndProcessData(TransformWithStateExec.scala:636)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.$anonfun$doExecute$1(TransformWithStateExec.scala:571)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.$anonfun$doExecute$1$adapted(TransformWithStateExec.scala:549)
	at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinHelper$StateStoreAwareZipPartitionsRDD.compute(StreamingSymmetricHashJoinHelper.scala:295)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:374)
00:40:20.421 ERROR org.apache.spark.executor.Executor: Exception in task 3.0 in stage 2.0 (TID 3)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider$
	at org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider.init(RocksDBStateStoreProvider.scala:393)
	at org.apache.spark.sql.execution.streaming.state.StateStoreProvider$.createAndInit(StateStore.scala:499)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.initNewStateStoreAndProcessData(TransformWithStateExec.scala:636)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.$anonfun$doExecute$5(TransformWithStateExec.scala:597)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.$anonfun$doExecute$5$adapted(TransformWithStateExec.scala:596)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:918)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:918)
00:40:20.421 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 in stage 2.0 (TID 2)
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider$
	at org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider.init(RocksDBStateStoreProvider.scala:393)
	at org.apache.spark.sql.execution.streaming.state.StateStoreProvider$.createAndInit(StateStore.scala:499)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.initNewStateStoreAndProcessData(TransformWithStateExec.scala:636)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.$anonfun$doExecute$5(TransformWithStateExec.scala:597)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.$anonfun$doExecute$5$adapted(TransformWithStateExec.scala:596)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:918)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:918)
00:40:20.422 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 3.0 in stage 2.0 (TID 3) (localhost executor driver): java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider$
	at org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider.init(RocksDBStateStoreProvider.scala:393)
	at org.apache.spark.sql.execution.streaming.state.StateStoreProvider$.createAndInit(StateStore.scala:499)
	at org.apache.spark.sql.execution.streaming.TransformWithStateExec.initNewStateStoreAndP...

00:40:20.422 ERROR org.apache.spark.scheduler.TaskSetManager: Task 3 in stage 2.0 failed 1 times; aborting job

00:40:20.423 ERROR org.apache.spark.scheduler.TaskSchedulerImpl: Exception in statusUpdate
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.scheduler.TaskResultGetter$$Lambda$4365/0x000000700221fb70@74f74897 rejected from java.util.concurrent.ThreadPoolExecutor@36d4df95[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 4]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2065)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:833)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1365)
	at org.apache.spark.scheduler.TaskResultGetter.enqueueFailedTask(TaskResultGetter.scala:140)
	at org.apache.spark.scheduler.TaskSchedulerImpl.liftedTree2$1(TaskSchedulerImpl.scala:813)
	at org.apache.spark.scheduler.TaskSchedulerImpl.statusUpdate(TaskSchedulerImpl.scala:786)
	at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:73)
[ERROR] Tests run: 47, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 11.15 s <<< FAILURE! -- in test.org.apache.spark.sql.JavaDatasetSuite

@ericm-db Could you help fix the issue mentioned above?
also cc @HeartSaVioR @dongjoon-hyun

@ericm-db
Contributor Author

@LuciferYang I am looking into this; I think I have to add a dependency in the sql/core pom.xml.
cc @HeartSaVioR @dongjoon-hyun

HeartSaVioR pushed a commit that referenced this pull request Nov 28, 2024
… in RocksDBStateStoreProvider

### What changes were proposed in this pull request?

There are Maven errors introduced by the Guava dependency in `sql/core`, since we use a Guava cache to store the Avro encoders, as outlined in this comment: #48401 (comment)
This PR introduces a new constructor for NonFateSharingCache and uses it in the RocksDBStateStoreProvider.
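
A hedged, simplified stand-in (not Spark's actual `NonFateSharingCache`) showing the shape of the change: the wrapper accepts a pre-built Guava `Cache`, so a caller such as RocksDBStateStoreProvider can configure bounds and expiry itself; the real class adds load-failure isolation on top of this:

```scala
import java.util.concurrent.{Callable, TimeUnit}
import com.google.common.cache.{Cache, CacheBuilder}

class PrebuiltCacheWrapper[K <: AnyRef, V <: AnyRef](cache: Cache[K, V]) {
  def get(key: K, loader: Callable[V]): V = cache.get(key, loader)
}

object PrebuiltCacheWrapper {
  // Mirrors how a state store provider might build its encoder cache.
  def bounded[K <: AnyRef, V <: AnyRef](maxSize: Long, ttlMinutes: Long): PrebuiltCacheWrapper[K, V] =
    new PrebuiltCacheWrapper(
      CacheBuilder.newBuilder()
        .maximumSize(maxSize)
        .expireAfterAccess(ttlMinutes, TimeUnit.MINUTES)
        .build[K, V]())
}
```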

### Why are the changes needed?

To resolve maven build errors, so that the Avro change here: #48401 does not get reverted.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests are sufficient, and the Maven build works on a devbox:
```
[INFO] Tests run: 47, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.64 s -- in test.org.apache.spark.sql.JavaDatasetSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 47, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO]
[INFO] --- surefire:3.2.5:test (test)  spark-sql_2.13 ---
[INFO] Skipping execution of surefire because it has already been run for this configuration
[INFO]
[INFO] --- scalatest:2.2.0:test (test)  spark-sql_2.13 ---
[INFO] ScalaTest report directory: /home/eric.marnadi/spark/sql/core/target/surefire-reports
WARNING: Using incubator modules: jdk.incubator.vector
Discovery starting.
Discovery completed in 2 seconds, 737 milliseconds.
Run starting. Expected test count is: 0
DiscoverySuite:
Run completed in 2 seconds, 765 milliseconds.
Total number of tests run: 0
Suites: completed 1, aborted 0
Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
No tests were executed.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  03:15 min
[INFO] Finished at: 2024-11-28T01:10:36Z
[INFO] ------------------------------------------------------------------------
```
### Was this patch authored or co-authored using generative AI tooling?

No

Closes #48996 from ericm-db/chm.

Authored-by: Eric Marnadi <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
HeartSaVioR pushed a commit that referenced this pull request Feb 21, 2025
…g to use the correct number of version bytes

### What changes were proposed in this pull request?

There are currently two bugs:
- The NoPrefixKeyStateEncoder adds an extra version byte to each row when UnsafeRow encoding is used: #47107
- Rows written with Avro encoding do not include a version byte: #48401

**Neither of these bugs has been released: they are only triggered when multiple column families are used, which only transformWithState relies on, and transformWithState is going to be released with Spark 4.0.0.**

This change fixes both of these bugs.
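
A hedged illustration (not the actual patch) of the version-byte convention the fix aligns with: state rows carry a single leading version byte, so an Avro-encoded value must prepend that byte on write and skip it on read:

```scala
object VersionByteSketch {
  val StateEncodingVersion: Byte = 0 // assumed version value, for illustration only

  // Prepend the version byte to the Avro-encoded payload before storing it.
  def addVersionByte(avroBytes: Array[Byte]): Array[Byte] = {
    val out = new Array[Byte](avroBytes.length + 1)
    out(0) = StateEncodingVersion
    System.arraycopy(avroBytes, 0, out, 1, avroBytes.length)
    out
  }

  // Drop the leading version byte before handing the payload to the Avro decoder.
  def stripVersionByte(stored: Array[Byte]): Array[Byte] =
    java.util.Arrays.copyOfRange(stored, 1, stored.length)
}
```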

### Why are the changes needed?

These changes are needed in order to conform with the expected state row encoding format.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #49996 from ericm-db/SPARK-51249.

Lead-authored-by: Eric Marnadi <[email protected]>
Co-authored-by: Eric Marnadi <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
HeartSaVioR pushed a commit that referenced this pull request Feb 21, 2025
…g to use the correct number of version bytes

(cherry picked from commit 42ab97a)
Signed-off-by: Jungtaek Lim <[email protected]>
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request Mar 6, 2025
…g to use the correct number of version bytes
