[SPARK-50017][SS] Support Avro encoding for TransformWithState operator #48401
Conversation
Force-pushed from ec1e07a to 1aca8f4.
connector/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroDataSourceV2.scala
why do we need to move this file?
Because it's used in AvroOptions.
Have we considered introducing a deprecated class under org.apache.spark.sql.avro that retains all the existing public methods, while moving their implementations into sql/core?
Sure, we can do this.
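For illustration, a shim along those lines might look like the sketch below. The target package comes from the deprecation message in the diff further down; the forwarded method is just an example, not necessarily what the PR kept public:

```scala
package org.apache.spark.sql.avro

import org.apache.avro.Schema
import org.apache.spark.sql.types.DataType

// Sketch only: the old public object keeps its package and name and
// forwards every call to the implementation that moved into sql/core.
@deprecated("Use org.apache.spark.sql.core.avro.SchemaConverters instead", "4.0.0")
object SchemaConverters {
  def toAvroType(catalystType: DataType, nullable: Boolean = false): Schema =
    org.apache.spark.sql.core.avro.SchemaConverters.toAvroType(catalystType, nullable)
}
```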
connector/avro/src/main/scala/org/apache/spark/sql/avro/DeprecatedSchemaConverters.scala (outdated, resolved)
connector/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroPartitionReaderFactory.scala (outdated, resolved)
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated, resolved)
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ListStateImpl.scala (outdated, resolved)
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StateTypesEncoderUtils.scala (outdated, resolved)
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ValueStateImplWithTTL.scala (outdated, resolved)
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala (outdated, resolved)
@deprecated("Use org.apache.spark.sql.core.avro.SchemaConverters instead", "4.0.0")
@Evolving
object DeprecatedSchemaConverters {
Let's keep the name SchemaConverters and not have Deprecated in the object name.
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala (outdated, resolved)
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala (4 outdated threads, resolved)
The changes are SOOO much cleaner now, thank you. It can get even cleaner though:
- I feel like you can add a Serde interface for the StateEncoder code changes. That should simplify the code even further
- Any reason we just didn't extend the suites with a different SQLConf to test out the different encoding type? I feel that would remove a ton of code changes as well
@@ -563,13 +684,233 @@ class RangeKeyScanStateEncoder(
    writer.getRow()
  }

  def encodePrefixKeyForRangeScan(
Can you add a scaladoc please?
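For reference, a scaladoc of the kind being requested might read like this; the wording is a sketch, not what the PR finally merged:

```scala
/**
 * Encodes the prefix-key columns of the given row into a byte array whose
 * unsigned lexicographic ordering matches the logical ordering of the
 * columns, so that RocksDB range scans iterate keys in the expected order.
 */
```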
    out.toByteArray
  }

  def decodePrefixKeyForRangeScan(
ditto on scaladoc please
-    virtualColFamilyId: Option[Short] = None)
+    virtualColFamilyId: Option[Short] = None,
+    avroEnc: Option[AvroEncoder] = None)
   extends RocksDBKeyStateEncoderBase(useColumnFamilies, virtualColFamilyId) {
Instead of avroEnc, I would honestly introduce another interface:
trait Serde {
def encodeToBytes(...)
def decodeToUnsafeRow(...)
def encodePrefixKeyForRangeScan(...)
def decodePrefixKeyForRangeScan(...)
}
and move the logic in there so that you don't have to keep on doing avroEnc.isDefined for these.
The logic seems pretty similar except for the input data. The AvroStateSerde or whatever you want to name it would have the private lazy val remainingKeyAvroType = SchemaConverters.toAvroType(remainingKeySchema)
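A minimal sketch of that split, with every name hypothetical (as the next comment notes, the PR did not end up adopting it):

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.types.StructType

// Hypothetical names throughout: each implementation owns its encoding
// state, so call sites never branch on avroEnc.isDefined.
trait StateSerde {
  def encodeToBytes(row: UnsafeRow): Array[Byte]
  def decodeToUnsafeRow(bytes: Array[Byte]): UnsafeRow
}

class AvroStateSerde(remainingKeySchema: StructType) extends StateSerde {
  // Derived once per instance, as the comment above proposes.
  private lazy val remainingKeyAvroType: Schema =
    SchemaConverters.toAvroType(remainingKeySchema)

  override def encodeToBytes(row: UnsafeRow): Array[Byte] = ???      // Avro write path
  override def decodeToUnsafeRow(bytes: Array[Byte]): UnsafeRow = ??? // Avro read path
}
```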
Spoke offline - it doesn't look like this simplifies things an awful lot - can be a follow-up.
-    virtualColFamilyId: Option[Short] = None)
+    virtualColFamilyId: Option[Short] = None,
+    avroEnc: Option[AvroEncoder] = None)
   extends RocksDBKeyStateEncoderBase(useColumnFamilies, virtualColFamilyId) {
ditto on the Serde.
        Some(newColFamilyId), avroEnc), RocksDBStateEncoder.getValueEncoder(valueSchema,
          useMultipleValuesPerKey, avroEnc)))
  }
  private def getAvroSerializer(schema: StructType): AvroSerializer = {
nit: blank line before the method please
@@ -74,10 +75,71 @@ private[sql] class RocksDBStateStoreProvider
      isInternal: Boolean = false): Unit = {
    verifyColFamilyCreationOrDeletion("create_col_family", colFamilyName, isInternal)
    val newColFamilyId = rocksDB.createColFamilyIfAbsent(colFamilyName)
    // Create cache key using store ID to avoid collisions
    val avroEncCacheKey = s"${stateStoreId.operatorId}_" +
Do we have the stream runId (maybe it's available in the HadoopConf)? We should add runId, otherwise there could be collisions
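A hedged sketch of a collision-free key, assuming the run id really is propagated through the Hadoop conf as the reviewer suspects (Spark publishes it under StreamExecution.RUN_ID_KEY); the other names extend the diff above:

```scala
// Sketch only: include the stream run id so encoders cached by a previous
// run of the same query can never be reused by mistake.
val runId = hadoopConf.get(StreamExecution.RUN_ID_KEY)
val avroEncCacheKey = s"${runId}_${stateStoreId.operatorId}_" +
  s"${stateStoreId.partitionId}_$colFamilyName"
```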
  // Avro encoder that is used by the RocksDBStateStoreProvider and RocksDBStateEncoder
  // in order to serialize from UnsafeRow to a byte array of Avro encoding.
Can you please turn this into a proper scaladoc?
/**
* ...
*/
  TestWithBothChangelogCheckpointingEnabledAndDisabled) { colFamiliesEnabled =>
    val testSchema: StructType = StructType(
      Seq(
        StructField("ordering-1", LongType, false),
oh, why'd you have to change these? If these are not supported by Avro, do we have any check anywhere to disallow the usage of the Avro encoder?
Avro code would just throw an error, saying that there are invalid characters in the field name.
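To illustrate the failure mode (this is Avro's own validation, not a check added by the PR): Avro names must match [A-Za-z_][A-Za-z0-9_]*, so building a schema field from a column named "ordering-1" throws:

```scala
import org.apache.avro.Schema

// Throws org.apache.avro.SchemaParseException: Illegal character in: ordering-1
val field = new Schema.Field("ordering-1", Schema.create(Schema.Type.LONG), null)
```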
  def testWithEncodingTypes(testName: String, testTags: Tag*)
      (testBody: => Any): Unit = {
one parameter per line like below please
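That is, something like the following sketch of the requested formatting (body stubbed out):

```scala
import org.scalatest.Tag

// One parameter per line, as requested.
def testWithEncodingTypes(
    testName: String,
    testTags: Tag*)
    (testBody: => Any): Unit = ???
```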
...e/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreSuite.scala (outdated, resolved)
oh forgot - we need to add the stream run id to the Avro encoder cache key, otherwise we may risk some unintended re-use of avro encoders. we should limit the size of that cache and add expiry to it
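A hedged sketch of the bounded, expiring cache being asked for, using the Guava builder; the size and TTL here are illustrative, and AvroEncoder is the PR's encoder bundle type:

```scala
import java.util.concurrent.TimeUnit
import com.google.common.cache.{Cache, CacheBuilder}

// Cap the number of cached encoders and evict entries that go untouched,
// so stale encoders from finished runs do not accumulate.
val avroEncoderCache: Cache[String, AvroEncoder] = CacheBuilder.newBuilder()
  .maximumSize(1000)
  .expireAfterAccess(1, TimeUnit.HOURS)
  .build[String, AvroEncoder]()
```

As the follow-up commit message further down notes, PR #48996 ended up routing this through Spark's NonFateSharingCache because the direct Guava dependency broke the Maven build in sql/core.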
LGTM!
...rc/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreProvider.scala (resolved)
...n/scala/org/apache/spark/sql/execution/streaming/state/StateSchemaCompatibilityChecker.scala (outdated, resolved)
@brkyvz asked me to help merge this - while I could do so (or just have a quick look through the code change from my side before doing that), I'd like to make sure that @gengliangwang is OK with this change. @gengliangwang Do you have any further outstanding comments, or was splitting the PR your only concern? I'm going to merge this if you have no outstanding comments by tomorrow.
No update. I'll proceed. Thanks! Merging to master. (DISCLAIMER: I'm merging on behalf of @brkyvz)
Thank you!
Closed #48401 via 331d0bf.
After this PR was merged, the Maven daily test started to fail.
We can use the following method to confirm that this PR caused similar test failures and to reproduce the issue:
@ericm-db Could you help fix the issue mentioned above?
@LuciferYang I'm looking into this; I think I have to add a dependency in the sql/core pom.xml.
… in RocksDBStateStoreProvider

### What changes were proposed in this pull request?
There are maven errors introduced by the guava dependency in `sql/core`, as we use the Guava cache to store the Avro encoders, outlined in this comment: #48401 (comment)

Introduced a new constructor for the NonFateSharingCache and used this with the RocksDBStateStoreProvider.

### Why are the changes needed?
To resolve maven build errors, so that the Avro change here: #48401 does not get reverted.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests are sufficient and the maven build works on devbox:

```
[INFO] Tests run: 47, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.64 s -- in test.org.apache.spark.sql.JavaDatasetSuite
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 47, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] --- surefire:3.2.5:test (test) spark-sql_2.13 ---
[INFO] Skipping execution of surefire because it has already been run for this configuration
[INFO]
[INFO] --- scalatest:2.2.0:test (test) spark-sql_2.13 ---
[INFO] ScalaTest report directory: /home/eric.marnadi/spark/sql/core/target/surefire-reports
WARNING: Using incubator modules: jdk.incubator.vector
Discovery starting.
Discovery completed in 2 seconds, 737 milliseconds.
Run starting. Expected test count is: 0
DiscoverySuite:
Run completed in 2 seconds, 765 milliseconds.
Total number of tests run: 0
Suites: completed 1, aborted 0
Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
No tests were executed.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:15 min
[INFO] Finished at: 2024-11-28T01:10:36Z
[INFO] ------------------------------------------------------------------------
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #48996 from ericm-db/chm.

Authored-by: Eric Marnadi <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
…g to use the correct number of version bytes

### What changes were proposed in this pull request?
There are currently two bugs:
- The NoPrefixKeyStateEncoder adds an extra version byte to each row when UnsafeRow encoding is used: #47107
- Rows written with Avro encoding do not include a version byte: #48401

**Neither of these bugs has been released, since they are only triggered with multiple column families, which only transformWithState uses, and that operator ships with Spark 4.0.0.**

This change fixes both of these bugs.

### Why are the changes needed?
These changes are needed in order to conform with the expected state row encoding format.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #49996 from ericm-db/SPARK-51249.

Lead-authored-by: Eric Marnadi <[email protected]>
Co-authored-by: Eric Marnadi <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
The same fix was cherry-picked from commit 42ab97a to the release branch.
What changes were proposed in this pull request?
Currently, we use the internal byte representation to store state for stateful streaming operators in the StateStore. This PR introduces Avro serialization and deserialization capabilities in the RocksDBStateEncoder so that we can use Avro encoding to store state instead. This is currently enabled for the TransformWithState operator via a SQLConf, and it supports all functionality supported by TWS.
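Assuming the conf key that eventually landed in Spark 4.0, opting a query into Avro state encoding would look something like the sketch below; treat the key as illustrative of the SQLConf switch described above:

```scala
// Hedged usage sketch: switch transformWithState state rows from
// UnsafeRow to Avro encoding before starting the query.
spark.conf.set("spark.sql.streaming.stateStore.encodingFormat", "avro")
```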
Why are the changes needed?
UnsafeRow is an inherently unstable format that makes no guarantee of backwards compatibility; if the format changes between Spark releases, this could corrupt the StateStore. Avro is more stable, and it inherently enables schema evolution.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Amended existing unit tests and added new ones.
Was this patch authored or co-authored using generative AI tooling?
No