-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-50017] Support Avro encoding for TransformWithState operator #48401
base: master
Are you sure you want to change the base?
Conversation
ec1e07a
to
1aca8f4
Compare
connector/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroDataSourceV2.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to move this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it's used in AvroOptions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we considered introducing a deprecated class under org.apache.spark.sql.avro that retains all the existing public methods, while moving their implementations into sql/core?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we can do this.
connector/avro/src/main/scala/org/apache/spark/sql/avro/DeprecatedSchemaConverters.scala
Outdated
Show resolved
Hide resolved
connector/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroPartitionReaderFactory.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ListStateImpl.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StateTypesEncoderUtils.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ValueStateImplWithTTL.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala
Outdated
Show resolved
Hide resolved
|
||
@deprecated("Use org.apache.spark.sql.core.avro.SchemaConverters instead", "4.0.0") | ||
@Evolving | ||
object DeprecatedSchemaConverters { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the name SchemaConverters
and don't have Deprecated
in the object name
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/package.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStore.scala
Outdated
Show resolved
Hide resolved
@@ -563,13 +684,113 @@ class RangeKeyScanStateEncoder( | |||
writer.getRow() | |||
} | |||
|
|||
def encodePrefixKeyForRangeScan( | |||
row: UnsafeRow, avroType: Schema): Array[Byte] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets move each arg to new line ?
...core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateEncoder.scala
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MapStateImplWithTTL.scala
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MapStateImplWithTTL.scala
Outdated
Show resolved
Hide resolved
.../main/scala/org/apache/spark/sql/execution/streaming/StateStoreColumnFamilySchemaUtils.scala
Outdated
Show resolved
Hide resolved
.../main/scala/org/apache/spark/sql/execution/streaming/StateStoreColumnFamilySchemaUtils.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala
Outdated
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/TransformWithStateExec.scala
Outdated
Show resolved
Hide resolved
...e/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreSuite.scala
Outdated
Show resolved
Hide resolved
...core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateEncoder.scala
Show resolved
Hide resolved
...core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateEncoder.scala
Outdated
Show resolved
Hide resolved
...e/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateStoreSuite.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/StateStoreSuite.scala
Show resolved
Hide resolved
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala
Show resolved
Hide resolved
.../main/scala/org/apache/spark/sql/execution/streaming/StateStoreColumnFamilySchemaUtils.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/streaming/TransformWithListStateSuite.scala
Show resolved
Hide resolved
@@ -104,7 +106,10 @@ case class TransformWithStateExec( | |||
* @return a new instance of the driver processor handle | |||
*/ | |||
private def getDriverProcessorHandle(): DriverStatefulProcessorHandleImpl = { | |||
val driverProcessorHandle = new DriverStatefulProcessorHandleImpl(timeMode, keyEncoder) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra newline ?
StructType(schema.fields.zipWithIndex.flatMap { case (field, idx) => | ||
if ((ordinals.isEmpty || ordinalSet.contains(idx)) && isFixedSize(field.dataType)) { | ||
// For each numeric field, create two fields: | ||
// 1. Byte marker for null, positive, or negative values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets add a note to mention that ByteType
in StructType
doesn't work as expected here
def encodePrefixKeyForRangeScan( | ||
row: UnsafeRow, | ||
avroType: Schema | ||
): Array[Byte] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: lets confirm the style here
@@ -91,7 +91,8 @@ case class CkptIdCollectingStateStoreWrapper(innerStore: StateStore) extends Sta | |||
valueSchema: StructType, | |||
keyStateEncoderSpec: KeyStateEncoderSpec, | |||
useMultipleValuesPerKey: Boolean = false, | |||
isInternal: Boolean = false): Unit = { | |||
isInternal: Boolean = false, | |||
avroEnc: Option[AvroEncoder]): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: lets use default args here as well ?
StructField("key2", StringType, false), | ||
StructField("ordering-2", IntegerType, false), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a test to verify the behavior if -
is used within the state var names since its not supported in Avro ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm with pending nits
What changes were proposed in this pull request?
Currently, we use the internal byte representation to store state for stateful streaming operators in the StateStore. This PR introduces Avro serialization and deserialization capabilities in the RocksDBStateEncoder so that we can instead use Avro encoding to store state. This is currently enabled for the TransformWithState operator via SQLConf to support all functionality supported by TWS
Why are the changes needed?
UnsafeRow is an inherently unstable format that makes no guarantees of being backwards-compatible. Therefore, if the format changes between Spark releases, this could cause StateStore corruptions. Avro is more stable, and inherently enables schema evolution.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Amended and added to unit tests
Was this patch authored or co-authored using generative AI tooling?
No