
[SPARK-50017] Support Avro encoding for TransformWithState operator #48401

Open · wants to merge 38 commits into base: master

Conversation

ericm-db (Contributor) commented Oct 9, 2024

What changes were proposed in this pull request?

Currently, stateful streaming operators store state in the StateStore using Spark's internal byte representation (UnsafeRow). This PR introduces Avro serialization and deserialization capabilities in the RocksDBStateEncoder so that state can instead be stored using Avro encoding. This is currently enabled for the TransformWithState operator via a SQLConf, and supports all functionality supported by TWS.

Why are the changes needed?

UnsafeRow is an inherently unstable format that makes no guarantee of backwards compatibility. If the format changes between Spark releases, existing state in the StateStore could be corrupted. Avro is a more stable format and natively supports schema evolution.
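To illustrate the distinction (a minimal Python sketch, not Spark code — the record layout and `resolve` helper are purely hypothetical): a fixed binary layout breaks when the schema changes, while a schema-aware format like Avro resolves the writer's schema against the reader's, filling newly added fields from declared defaults.

```python
import struct

# Fixed binary layout (UnsafeRow-like): the reader must know the exact
# byte layout the writer used. A hypothetical v1 record holds one int64.
v1_bytes = struct.pack("<q", 42)

# If v2 adds a second int64 field, old v1 bytes no longer fit the v2 layout.
try:
    struct.unpack("<qq", v1_bytes)
except struct.error:
    pass  # decoding old data with the new layout simply fails

# A schema-aware format such as Avro instead resolves the writer's schema
# against the reader's schema, filling newly added fields from defaults.
def resolve(record, reader_defaults):
    """Toy analogue of Avro schema resolution (illustrative only)."""
    return {field: record.get(field, default)
            for field, default in reader_defaults.items()}

old_record = {"count": 42}                      # written with the v1 schema
reader_defaults = {"count": 0, "last_seen": 0}  # v2 schema with defaults
evolved = resolve(old_record, reader_defaults)
assert evolved == {"count": 42, "last_seen": 0}
```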

Does this PR introduce any user-facing change?

No

How was this patch tested?

Amended and added to unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@ericm-db ericm-db changed the title [WIP] Avrfo [WIP] Avro Oct 9, 2024
@ericm-db ericm-db changed the title [WIP] Avro [SPARK-50017] Support Avro encoding for TransformWithState operator - ValueState Oct 17, 2024
Member:
why do we need to move this file?

Contributor Author:

Because it's used in AvroOptions

Member:

Have we considered introducing a deprecated class under org.apache.spark.sql.avro that retains all the existing public methods, while moving their implementations into sql/core?

Contributor Author:

Sure, we can do this.

sql/core/pom.xml (review comments outdated, resolved)
@ericm-db ericm-db changed the title [SPARK-50017] Support Avro encoding for TransformWithState operator - ValueState [SPARK-50017] Support Avro encoding for TransformWithState operator - ValueState, ListState Oct 24, 2024

@deprecated("Use org.apache.spark.sql.core.avro.SchemaConverters instead", "4.0.0")
@Evolving
object DeprecatedSchemaConverters {
Member:

Let's keep the name SchemaConverters and don't have Deprecated in the object name
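The forwarding pattern discussed in this thread — keep the old public entry point in place, deprecate it, and delegate to the relocated implementation — can be sketched as follows (a hypothetical Python analogue of the Scala pattern; the class names and placeholder conversion are illustrative, not the PR's actual code):

```python
import warnings

# Stands in for the implementation moved into sql/core.
class RelocatedSchemaConverters:
    @staticmethod
    def to_sql_type(avro_schema):
        return ("sql", avro_schema)  # placeholder conversion

# Old public entry point kept at its original location: same API,
# emits a deprecation warning, delegates to the relocated implementation.
class SchemaConverters:
    @staticmethod
    def to_sql_type(avro_schema):
        warnings.warn(
            "SchemaConverters has moved; use the relocated class instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return RelocatedSchemaConverters.to_sql_type(avro_schema)
```

This keeps existing callers compiling against the old package while the real logic lives in one place.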

@@ -563,13 +684,113 @@ class RangeKeyScanStateEncoder(
writer.getRow()
}

def encodePrefixKeyForRangeScan(
row: UnsafeRow, avroType: Schema): Array[Byte] = {
Contributor:

lets move each arg to new line ?

@@ -104,7 +106,10 @@ case class TransformWithStateExec(
* @return a new instance of the driver processor handle
*/
private def getDriverProcessorHandle(): DriverStatefulProcessorHandleImpl = {
val driverProcessorHandle = new DriverStatefulProcessorHandleImpl(timeMode, keyEncoder)

Contributor:

nit: extra newline ?

StructType(schema.fields.zipWithIndex.flatMap { case (field, idx) =>
if ((ordinals.isEmpty || ordinalSet.contains(idx)) && isFixedSize(field.dataType)) {
// For each numeric field, create two fields:
// 1. Byte marker for null, positive, or negative values
Contributor:

lets add a note to mention that ByteType in StructType doesn't work as expected here
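The idea behind the marker bytes can be sketched in a simplified Python analogue (not the PR's Scala code — the actual encoder uses distinct markers for null, negative, and positive values, while this sketch collapses sign handling into a single sign-bit flip): the goal is that unsigned lexicographic comparison of the encoded bytes, as RocksDB performs during a range scan, matches the logical ordering of the values.

```python
import struct

NULL_MARKER = 0
NOT_NULL_MARKER = 1

def encode_orderable(value):
    """Encode an optional int64 so unsigned lexicographic byte order
    matches the logical order (nulls first, then numeric order)."""
    if value is None:
        return bytes([NULL_MARKER])
    raw = bytearray(struct.pack(">q", value))  # big-endian two's complement
    raw[0] ^= 0x80  # flip the sign bit: signed order becomes byte order
    return bytes([NOT_NULL_MARKER]) + bytes(raw)

values = [None, -100, -1, 0, 7, 100]
encoded = [encode_orderable(v) for v in values]
assert encoded == sorted(encoded)  # byte comparison preserves logical order
```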

def encodePrefixKeyForRangeScan(
row: UnsafeRow,
avroType: Schema
): Array[Byte] = {
Contributor:

nit: lets confirm the style here

@@ -91,7 +91,8 @@ case class CkptIdCollectingStateStoreWrapper(innerStore: StateStore) extends Sta
valueSchema: StructType,
keyStateEncoderSpec: KeyStateEncoderSpec,
useMultipleValuesPerKey: Boolean = false,
isInternal: Boolean = false): Unit = {
isInternal: Boolean = false,
avroEnc: Option[AvroEncoder]): Unit = {
Contributor:

nit: lets use default args here as well ?

StructField("key2", StringType, false),
StructField("ordering-2", IntegerType, false),
Contributor:

can we add a test to verify the behavior if - is used within the state var names since its not supported in Avro ?
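For context on why `-` is problematic: the Avro specification requires names to match `[A-Za-z_][A-Za-z0-9_]*`, so a dash is never a legal Avro name character. A minimal check (a standalone sketch, not the PR's validation code):

```python
import re

# Avro specification: a name starts with a letter or underscore and
# contains only letters, digits, and underscores. Dashes are not allowed.
AVRO_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_valid_avro_name(name: str) -> bool:
    return AVRO_NAME_RE.fullmatch(name) is not None

assert is_valid_avro_name("ordering_2")
assert not is_valid_avro_name("ordering-2")  # the case flagged above
assert not is_valid_avro_name("2ordering")   # may not start with a digit
```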

@anishshri-db (Contributor) left a comment:

lgtm with pending nits

4 participants