[SPARK-46679][SQL]: Handling of generic parameter with bounds while creating encoders #48158

Closed
wants to merge 26 commits into from

Conversation

@ahshahid ahshahid commented Sep 19, 2024

What changes were proposed in this pull request?

If a bean has getters/setters whose types are generic type parameters with bounds (e.g. `T <: SomeClass`), an appropriate encoder is now created depending on the nature of the bound: Java Serializable, KryoSerializable, or a UDT. If the bound is one of the first two, the data is represented as BinaryType; if the bound is a UDT, the schema and behaviour follow the UDT.

The following points were considered while fixing the issue:

  1. Since the concrete class of the generic parameter is not available, it is not possible to create an instance of the class (during deserialization) if the bound represents any type other than the three mentioned above.
  2. Because a UDT class can appear anywhere in the bound's hierarchy, all the superclasses of the bound (including the bound itself) are checked. When creating the encoder, the preference is a UDT, followed by JavaSerializer or KryoSerializer, whichever appears first.
  3. The majority of the code change in JavaTypeInference is a boolean check that ignores any data type match when considering a bound, except for UDT and type variable (type variable is included because of cases such as T <: S where S <: Serializable).
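The preference order in point 2 can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: `UdtMarker`, `KryoMarker`, `Choice`, `hierarchy`, and `choose` are hypothetical names, and the marker traits only stand in for "the class is a UDT" and "the class is KryoSerializable".

```scala
object BoundEncoderChoice {
  // Hypothetical markers for illustration only; they stand in for "is a UDT"
  // and "is KryoSerializable" and are not Spark or Kryo types.
  trait UdtMarker
  trait KryoMarker

  sealed trait Choice
  case object UseUdt extends Choice
  case object UseJavaSerialization extends Choice
  case object UseKryo extends Choice

  // The bound itself, followed by its superclasses and interfaces (depth-first).
  // Duplicates are possible and harmless for the lookups below.
  def hierarchy(c: Class[_]): List[Class[_]] = {
    val parents: List[Class[_]] =
      Option[Class[_]](c.getSuperclass).toList ++ c.getInterfaces.toList
    c :: parents.flatMap(hierarchy)
  }

  // A UDT anywhere in the hierarchy wins; otherwise the first of
  // java.io.Serializable or the Kryo marker encountered decides.
  // None means no encoder can be derived from the bound alone.
  def choose(bound: Class[_]): Option[Choice] = {
    val all = hierarchy(bound)
    if (all.contains(classOf[UdtMarker])) Some(UseUdt)
    else all.collectFirst[Choice] {
      case c if c == classOf[java.io.Serializable] => UseJavaSerialization
      case c if c == classOf[KryoMarker] => UseKryo
    }
  }
}
```

Under these assumptions, `BoundEncoderChoice.choose(classOf[String])` selects Java serialization, while `BoundEncoderChoice.choose(classOf[Object])` returns `None`, which corresponds to the situation in point 1 where no instance can be created from the bound.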

The following boundary cases are handled:

  1. If the generic bean is of type

         Bean[T <: UDTClass] {
           @BeanProperty var udt: T = _
         }

     then a UDTEncoder is created for the field. But if the bean is of type

         Bean[T <: UDTDerivedClass] {
           @BeanProperty var udt: T = _
         }

     where UDTDerivedClass <: UDTClass, then a Java serialization encoder is created, even though the class hierarchy of UDTDerivedClass contains UDTClass. The reason is that the concrete instance created by the UDT would be of type UDTClass, which is not assignable to UDTDerivedClass.

Similarly, a non-generic bean class that has a UDTDerivedClass bean property also uses the Java serialization encoder (a test was added for this). The reason is the same as for the generic case.
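The assignability argument behind this boundary case can be shown with plain reflection. The sketch below is illustrative only: `UDTClass` and `UDTDerivedClass` are empty hypothetical classes mirroring the names used above, and `canStore` is a helper invented for this example.

```scala
object UdtAssignability {
  // Hypothetical classes standing in for the ones named in the description.
  class UDTClass
  class UDTDerivedClass extends UDTClass

  // A deserialized value can be stored in a bean property only if the
  // property's declared type is assignable from the class actually produced.
  def canStore(propertyType: Class[_], producedType: Class[_]): Boolean =
    propertyType.isAssignableFrom(producedType)
}
```

Here `canStore(classOf[UDTClass], classOf[UDTClass])` is true, but `canStore(classOf[UDTDerivedClass], classOf[UDTClass])` is false: an instance of the base class cannot be assigned to a property declared with the derived type, which is why the UDT path is not used for `UDTDerivedClass`.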

Why are the changes needed?

To fix a regression present from Spark 3.3 onwards, where a bean with a generic type as a getter's return type throws EncoderNotFoundException.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added bug tests.

Was this patch authored or co-authored using generative AI tooling?

No

ashahid added 8 commits August 23, 2024 15:37
…e/SPARK-46679

CDPD-58844. Upgrade janino to 3.1.10

Change-Id: I8744bb020e5fedcc0e9e4bc08c556c98a80406ba

Sync workflow files | Triggered by Kitchen/RE-github-workflows

Sync workflow files | Triggered by Kitchen/RE-github-workflows

CDPD-58844. Upgrade janino to 3.1.10

Change-Id: I8744bb020e5fedcc0e9e4bc08c556c98a80406ba

Sync workflow files | Triggered by Kitchen/RE-github-workflows

Sync workflow files | Triggered by Kitchen/RE-github-workflows
@github-actions github-actions bot added the SQL label Sep 19, 2024
@ahshahid ahshahid changed the title [WIP][SPARK-46679][SQL]: Handling of generic parameter with bounds while creating encoders [SPARK-46679][SQL]: Handling of generic parameter with bounds while creating encoders Sep 20, 2024
typeVariables: Map[TypeVariable[_], Type] = Map.empty,
forGenericBound: Boolean = false): AgnosticEncoder[_] = t match {

case c: Class[_] if !forGenericBound && c == java.lang.Boolean.TYPE => PrimitiveBooleanEncoder
Contributor

For readability, I would create a branch at the beginning where you handle `case tv: TypeVariable[_] if forGenericBound =>`. That way the rest of the code is less impacted.

Author

Thanks, I will do that. I was wondering how to approach it; this neat idea did not occur to me.

Author

I have more or less done this now.

// should not consider class as bean, as otherwise it will be treated as empty schema
// and loose the data on deser.
if (properties.isEmpty && seenTypeSet.nonEmpty) {
if (classOf[KryoSerializable].isAssignableFrom(c)) {
Contributor

This will be an issue for Connect. While the API supports Kryo, Connect can't support Kryo in its current form. Either we have to detect whether we are in Connect mode, or we have to just fall back to Java serialization.

Author

Sure, I was not aware of that. I will write a test for it and check.
I will be checking in the code with some refactoring.

Author

@hvanhovell Added a check for the Connect client while creating encoders, so that the Kryo-based encoder is not used there. It uses a thread-local to detect this, instead of an API change.
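One way such a thread-local check could look is sketched below. This is an assumption-laden illustration, not the code in this PR: `EncoderMode`, `withConnectClient`, and `serializerFor` are hypothetical names, and the string results merely label which serializer would be picked.

```scala
object EncoderMode {
  // Hypothetical per-thread flag: true while encoders are being created
  // on behalf of a Connect client.
  private val connectClient: ThreadLocal[Boolean] = new ThreadLocal[Boolean] {
    override def initialValue(): Boolean = false
  }

  // Runs `body` with the flag set, restoring the previous value afterwards,
  // even if `body` throws.
  def withConnectClient[T](body: => T): T = {
    val previous = connectClient.get()
    connectClient.set(true)
    try body finally connectClient.set(previous)
  }

  // Kryo is chosen only outside Connect mode; otherwise fall back to
  // Java serialization.
  def serializerFor(isKryoSerializable: Boolean): String =
    if (isKryoSerializable && !connectClient.get()) "kryo" else "java"
}
```

With this shape, `EncoderMode.serializerFor(true)` picks Kryo in the normal case, while `EncoderMode.withConnectClient(EncoderMode.serializerFor(true))` falls back to Java serialization without any change to method signatures. The trade-off is that the mode becomes implicit state tied to the calling thread.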

}

private def getAllSuperClasses(typee: Type): Array[Class[_]] = Option(typee)
Contributor

This sort of begs for a queue and a loop. I am reasonably sure that would be more readable...

Author

I will flatten the recursion.

Author

@hvanhovell Done. Kindly check.
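For reference, a queue-and-loop traversal of a class hierarchy might look like the sketch below. It is not the implementation from this PR: it takes a `Class[_]` rather than a `Type` and returns a `List` for simplicity, and `SuperClasses.all` is a hypothetical name.

```scala
import scala.collection.mutable

object SuperClasses {
  // Breadth-first walk over the class, its superclasses and its interfaces.
  // The LinkedHashSet keeps first-seen order and prevents revisiting
  // interfaces that are reachable through several paths.
  def all(start: Class[_]): List[Class[_]] = {
    val seen = mutable.LinkedHashSet.empty[Class[_]]
    val queue = mutable.Queue[Class[_]](start)
    while (queue.nonEmpty) {
      val c = queue.dequeue()
      if (seen.add(c)) {
        // getSuperclass is null for Object, interfaces and primitives.
        Option[Class[_]](c.getSuperclass).foreach(s => queue.enqueue(s))
        c.getInterfaces.foreach(i => queue.enqueue(i))
      }
    }
    seen.toList
  }
}
```

For example, `SuperClasses.all(classOf[java.lang.Integer])` starts with `Integer` and also contains `Number`, `Object`, `java.io.Serializable`, and `Comparable`, each exactly once.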

def getJavaBeanReadableProperties(beanClass: Class[_]): Array[PropertyDescriptor] = {
val beanInfo = Introspector.getBeanInfo(beanClass)
beanInfo.getPropertyDescriptors
.filterNot(_.getName == "class")
.filterNot(_.getName == "declaringClass")
.filter(_.getReadMethod != null)
}

object UseSerializationEncoder {
def unapply(th: Throwable): Option[Class[_] => AgnosticEncoder[_]] = th match {
Contributor

Can we use the actual superclasses here instead of going through the error message? I'd also prefer if you unify this with the other java serialization code.

Author

Unifying it with the other Java serialization code is something I intend to do in my refactoring.

Author

I had thought about using the Serializable encoders as part of the match/case statement, but it appears tricky, since the idea is that the Serializable encoders are to be used as a last resort. The way the current logic works (if I am not wrong), it collects all the interfaces/classes via the CommonUtils method, and I think it is possible that the Serializable interface shows up early, because the introspection data is converted to a map in the bean-class handling:

    val parentClassesTypeMap =
      JavaTypeUtils.getTypeArguments(c, classOf[Object]).asScala.toMap

I am hesitant at this point to further complicate the code, unless you all think that it's worthwhile to do it in this PR.

ashahid added 3 commits September 23, 2024 12:26
…ests for edge cases. Flattened the recursive call
…ests for edge cases. Flattened the recursive call
2 participants