[SPARK-46679][SQL]: Handling of generic parameter with bounds while creating encoders #48158

ahshahid · 2024-09-19T04:02:20Z

What changes were proposed in this pull request?

If a bean has generic type parameters with bounds (e.g. T <: SomeClass) as the types of its getters/setters, an appropriate encoder is created depending on the nature of the bound: Java Serializable, KryoSerializable, or a UDT type. If the bound is one of the first two, the data is represented as BinaryType; if the bound is a UDT type, the schema and behaviour follow the UDT type.

The following points were considered while fixing the issue:

Since the concrete class of the generic parameter is not available, it is not possible to create an instance of the class (during deserialization) if the bound represents any type other than the three mentioned above.
Because a UDT class can appear anywhere in the bound's hierarchy, all the superclasses of the bound (including the bound itself) are considered and checked. When creating the encoder, the order of preference is the UDT type, followed by the Java serializer or the Kryo serializer, whichever appears first.
The majority of the code change in JavaTypeInference is a boolean check that ignores any data type match when considering a bound, except for UDT and type variable (type variable is included because of cases such as T <: S where S <: Serializable).
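The reason a type variable has to be followed when looking at a bound can be illustrated with plain Java reflection. The block below is a minimal sketch, not the code in this PR: `BoundResolution`, `Holder`, `resolveBounds`, and `boundsOfGetter` are hypothetical names, and parameterized bounds such as `Comparable<T>` are deliberately skipped to keep it short.

```java
import java.io.Serializable;
import java.lang.reflect.Type;
import java.lang.reflect.TypeVariable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from Spark.
class BoundResolution {

    // T is bounded by S, and S is bounded by Serializable (T <: S, S <: Serializable).
    static class Holder<S extends Serializable, T extends S> {
        private T value;
        public T getValue() { return value; }
        public void setValue(T value) { this.value = value; }
    }

    // Follows type-variable bounds until concrete classes are reached.
    // Other kinds of bound (e.g. parameterized types) are ignored in this sketch.
    static List<Class<?>> resolveBounds(Type type) {
        List<Class<?>> result = new ArrayList<>();
        if (type instanceof Class<?>) {
            result.add((Class<?>) type);
        } else if (type instanceof TypeVariable<?>) {
            for (Type bound : ((TypeVariable<?>) type).getBounds()) {
                result.addAll(resolveBounds(bound));
            }
        }
        return result;
    }

    // Resolves the bounds of a getter's generic return type.
    static List<Class<?>> boundsOfGetter(Class<?> beanClass, String getterName) {
        try {
            return resolveBounds(beanClass.getMethod(getterName).getGenericReturnType());
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such getter: " + getterName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(boundsOfGetter(Holder.class, "getValue"));
    }
}
```

Here the getter's generic return type is the type variable `T`, whose only bound is another type variable `S`; only by following `S` does the walk reach `Serializable`.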

The following boundary cases are also considered:

If the generic bean is of type

```scala
class Bean[T <: UDTClass] {
  @BeanProperty var udt: T = _
}
```

then the UDTEncoder will be created for the field.

But if the bean is of type

```scala
class Bean[T <: UDTDerivedClass] {
  @BeanProperty var udt: T = _
}
```

where UDTDerivedClass <: UDTClass, then a Java serialization encoder will be created, even though the class hierarchy of UDTDerivedClass contains UDTClass. The reason is that the concrete instance created by the UDT type would be a UDTClass, which is not assignable to UDTDerivedClass.

Similarly, a non-generic bean class that has UDTDerivedClass as a bean property will also use the Java serialization encoder (a test was added for this). The reason for using the Java serialization encoder is the same as in the generic case.
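The boundary cases described above can be sketched as a small decision function. This is only an illustration under stated assumptions: the `HypotheticalUdt` annotation stands in for however a UDT class is actually recognised, and `EncoderChoiceSketch`, `EncoderKind`, and `chooseEncoder` are made-up names rather than anything from JavaTypeInference.

```java
import java.io.Serializable;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical illustration only; none of these names come from Spark.
class EncoderChoiceSketch {

    // Assumed stand-in for a UDT marker. It is not @Inherited, so a subclass
    // of an annotated class does not carry the annotation itself.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface HypotheticalUdt {}

    enum EncoderKind { UDT, JAVA_SERIALIZATION, UNSUPPORTED }

    @HypotheticalUdt
    static class UdtClass implements Serializable {}

    // Derived from the UDT class, but the UDT would materialise a UdtClass,
    // which cannot be assigned to a field of type UdtDerivedClass.
    static class UdtDerivedClass extends UdtClass {}

    static EncoderKind chooseEncoder(Class<?> bound) {
        // Use the UDT path only when the bound itself is the UDT class.
        if (bound.getDeclaredAnnotation(HypotheticalUdt.class) != null) {
            return EncoderKind.UDT;
        }
        // Otherwise fall back to serialization when the bound allows it.
        if (Serializable.class.isAssignableFrom(bound)) {
            return EncoderKind.JAVA_SERIALIZATION;
        }
        return EncoderKind.UNSUPPORTED;
    }

    public static void main(String[] args) {
        System.out.println(chooseEncoder(UdtClass.class));
        System.out.println(chooseEncoder(UdtDerivedClass.class));
    }
}
```

The derived class lands on the serialization path because it inherits `Serializable` from its parent while not being the UDT class itself, which mirrors the assignability argument given in the description.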

Why are the changes needed?

To fix a regression in Spark 3.3 onwards, where a bean having a generic type as a return value throws EncoderNotFoundException.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added bug tests.

Was this patch authored or co-authored using generative AI tooling?

No

hvanhovell · 2024-09-23T15:31:02Z

sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala

+      typeVariables: Map[TypeVariable[_], Type] = Map.empty,
+      forGenericBound: Boolean = false): AgnosticEncoder[_] = t match {
+
+    case c: Class[_] if !forGenericBound && c == java.lang.Boolean.TYPE => PrimitiveBooleanEncoder


For readability purposes I would create a branch at the beginning where you handle `case tv: TypeVariable[_] if forGenericBound =>`. This way the rest of the code is less impacted.

Thanks, I will do that. I was wondering how to do it; this neat idea had not occurred to me.

I have sort of done it now.

hvanhovell · 2024-09-23T15:34:07Z

sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala

+      // should not consider class as bean, as otherwise it will be treated as empty schema
+      // and loose the data on deser.
+      if (properties.isEmpty && seenTypeSet.nonEmpty) {
+        if (classOf[KryoSerializable].isAssignableFrom(c)) {


This will be an issue for Connect. While the API supports Kryo, Connect can't support Kryo in its current form. Either we have to detect whether we are in Connect mode, or we have to just fall back to Java serialization.

Sure, I was not aware of that. I will write a test for it and check.
I will be checking in the code with some refactoring.

@hvanhovell Added a check for the Connect client while creating encoders, so that the Kryo-based encoder is not used. A thread-local is used to detect this, instead of an API change.
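A thread-local flag of the kind mentioned in this reply might look roughly like the following. This is a sketch under assumptions, not the code that was pushed: `ConnectModeSketch`, `withConnectClient`, and `serializerFor` are hypothetical names, and the serializer choice is reduced to a string for illustration.

```java
import java.util.function.Supplier;

// Hypothetical illustration only; none of these names come from Spark.
class ConnectModeSketch {

    // Per-thread flag, false unless encoder creation runs on behalf of a Connect client.
    private static final ThreadLocal<Boolean> IN_CONNECT_CLIENT =
        ThreadLocal.withInitial(() -> false);

    // Runs the body with the flag set and restores the previous value afterwards.
    static <T> T withConnectClient(Supplier<T> body) {
        boolean previous = IN_CONNECT_CLIENT.get();
        IN_CONNECT_CLIENT.set(true);
        try {
            return body.get();
        } finally {
            IN_CONNECT_CLIENT.set(previous);
        }
    }

    // Picks Kryo only when the type supports it and we are not in Connect mode.
    static String serializerFor(boolean kryoCapable) {
        if (kryoCapable && !IN_CONNECT_CLIENT.get()) {
            return "kryo";
        }
        return "java";
    }

    public static void main(String[] args) {
        System.out.println(serializerFor(true));
        System.out.println(withConnectClient(() -> serializerFor(true)));
    }
}
```

Restoring the previous value in `finally` keeps the flag from leaking to later work on the same thread, which matters when threads are pooled.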

hvanhovell · 2024-09-23T15:45:07Z

sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala

  }

+  private def getAllSuperClasses(typee: Type): Array[Class[_]] = Option(typee)


This sort of begs for a queue and a loop. I am reasonably sure that is more readable...

will flatten the recursion

@hvanhovell Done. Kindly check.
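For reference, a queue-and-loop traversal of a class hierarchy could be written along these lines. It is a sketch of the general technique only: the real `getAllSuperClasses` takes a `Type` and returns an array, whereas the hypothetical `allSuperTypes` below takes a `Class` and returns a list.

```java
import java.io.Serializable;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Spark.
class SuperTypeWalk {

    static class Base implements Serializable {}
    static class Derived extends Base {}

    // Breadth-first walk over the starting class, its superclasses, and its interfaces.
    static List<Class<?>> allSuperTypes(Class<?> start) {
        Set<Class<?>> seen = new LinkedHashSet<>();   // keeps discovery order, drops repeats
        Deque<Class<?>> queue = new ArrayDeque<>();
        if (start != null) {
            queue.add(start);
        }
        while (!queue.isEmpty()) {
            Class<?> current = queue.poll();
            if (!seen.add(current)) {
                continue;                             // already visited via another path
            }
            Class<?> parent = current.getSuperclass();
            if (parent != null) {
                queue.add(parent);
            }
            for (Class<?> iface : current.getInterfaces()) {
                queue.add(iface);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println(allSuperTypes(Derived.class));
    }
}
```

Because discovery order is preserved, a caller that prefers whichever marker "appears first" in the hierarchy can simply scan the returned list from the front.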

hvanhovell · 2024-09-23T15:47:20Z

sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala

  def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
    val beanInfo = Introspector.getBeanInfo(beanClass)
    beanInfo.getPropertyDescriptors
      .filterNot(_.getName == "class")
      .filterNot(_.getName == "declaringClass")
      .filter(_.getReadMethod != null)
  }
+
+  object UseSerializationEncoder {
+    def unapply(th: Throwable): Option[Class[_] => AgnosticEncoder[_]] = th match {


Can we use the actual superclasses here instead of going through the error message? I'd also prefer if you unify this with the other java serialization code.

Unifying it with the other Java serialization code is something I intend to do in my refactoring.

I had thought about using the Serializable encoders as part of the match/case statement, but it appears tricky, as the idea is that the Serializable encoders are to be used as a last resort. The way the current logic works (if I am not wrong), it collects all the interfaces/classes via the CommonUtils method, and it is possible, I think, that the Serializable interface may show up early, as the introspection data is converted to a map in the bean class:

val parentClassesTypeMap = JavaTypeUtils.getTypeArguments(c, classOf[Object]).asScala.toMap

I am hesitant at this point to further complicate the code, unless you all think that it's worthwhile to do it in this PR.

ashahid added 8 commits August 23, 2024 15:37

CDPD-73233. Fixing couple of test failures

6ba024d

SPARK-46679. generic type parameter field of type Serializable throwi…

43da6b8

…ng exception in creating encoder

Merge branch 'master' into SPARK-46679

6cf43ae

SPARK-46679. Handling of generic parameter with bounds

fcbfa98

SPARK-46679. Handling of generic parameter with bounds. added more tests

5a37ffc

SPARK-46679. Handling of generic parameter with bounds. added more tests

73866d8

Merge branch 'master' into SPARK-46679

e3be277

github-actions bot added the SQL label Sep 19, 2024

ashahid added 11 commits September 18, 2024 21:05

SPARK-46679. Handling of generic parameter with bounds. remove redund…

31fdb62

…ant class

SPARK-46679. Handling of generic parameter with bounds. fixed scala s…

df69ed3

…tyle errors

Merge branch 'master' into SPARK-46679

ca2b4c8

SPARK-46679. Handling of generic parameter with bounds. formatting ch…

8aa5553

…anges

SPARK-46679. Handling of generic parameter with bounds. formatting ch…

8b37a34

…anges

SPARK-46679. Handling of generic parameter with bounds. formatting ch…

7a32ab8

…anges

SPARK-46679. Handling of generic parameter with bounds. formatting ch…

4ba4d61

…anges

SPARK-46679. Handling of generic parameter with bounds. formatting ch…

de946f4

…anges

SPARK-46679. Handling of generic parameter with bounds. formatting ch…

5aa0186

…anges

SPARK-46679. Handling of generic parameter with bounds. formatting ch…

e818f45

…anges. added tests. cleanup of tests

SPARK-46679. Handling of generic parameter with bounds. formatting ch…

ad5bb72

…anges. added tests. cleanup of tests

ahshahid changed the title ~~[WIP][SPARK-46679][SQL]: Handling of generic parameter with bounds while creating encoders~~ [SPARK-46679][SQL]: Handling of generic parameter with bounds while creating encoders Sep 20, 2024

SPARK-46679. Added bug test for SPARK-49727

2100681

ahshahid force-pushed the SPARK-46679 branch from 03eff84 to 2100681 Compare September 20, 2024 07:14

hvanhovell reviewed Sep 23, 2024

ashahid added 3 commits September 23, 2024 12:26

SPARK-46679: Refactored the code as per review feedback. Added more t…

de5a1ea

…ests for edge cases. Flattened the recursive call

SPARK-46679: Refactored the code as per review feedback. Added more t…

6529cec

…ests for edge cases. Flattened the recursive call

SPARK-46679: Added check to skip using KryoSerializable encoder for C…

09489f7

…lient Connect, as per review feedback

github-actions bot added the CONNECT label Sep 23, 2024

ashahid added 3 commits September 23, 2024 18:08

SPARK-46679: refactoring

72081cb

SPARK-46679: refactoring

96ae01f

SPARK-46679: refactoring

3876006

ahshahid closed this Sep 25, 2024

ahshahid deleted the SPARK-46679 branch September 25, 2024 20:54

ahshahid mentioned this pull request Sep 25, 2024

[SPARK-49789][SQL] Handling of generic parameter with bounds while creating encoders #48252

Open
