
[SPARK-49789][SQL] Handling of generic parameter with bounds while creating encoders #48252

Open

wants to merge 34 commits into base: master
Conversation

ahshahid

What changes were proposed in this pull request?

If a bean has getters/setters whose types are generic with bounds (e.g. T <: SomeClass), then, depending on the nature of the bound, an appropriate encoder is created when the bound is java Serializable, KryoSerializable, or a UDT type. If the bound is of the first two types, the data is represented as a BinaryType, while if the bound is a UDT type, the schema/behaviour follows the UDT type.

The following things are considered while fixing the issue:

- Since the concrete class type of the generic parameter is not available, it is not possible to create an instance of the class (during deserialization) if the bound represents any type other than the 3 mentioned above.
- Because a UDT class can appear anywhere in the bound's hierarchy, all the superclasses of the bound (including the bound itself) are considered and checked. To create the encoder, the preference is the UDT type, followed by JavaSerializer or KryoSerializer, whichever shows first.
- The majority of the code change in JavaTypeInference is a boolean check to ignore any data type match when considering a bound, except UDT and TypeVariable (the type variable is included because of chains such as T <: S where S <: Serializable).
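The bound-hierarchy walk described above can be sketched with plain Java reflection. This is a minimal, self-contained illustration, not Spark's actual implementation; the names `Bean`, `Base`, and `boundHierarchy` are hypothetical.

```java
import java.io.Serializable;
import java.lang.reflect.Method;
import java.lang.reflect.TypeVariable;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

public class BoundInspector {

    // A serializable bound class (stand-in for "SomeClass" in the text).
    static class Base implements Serializable {}

    // A bean whose getter returns a generic type variable with a bound.
    static class Bean<T extends Base> {
        private T value;
        public T getValue() { return value; }
        public void setValue(T value) { this.value = value; }
    }

    /** Collect the bound's class plus all of its superclasses and interfaces. */
    static Set<Class<?>> boundHierarchy(Class<?> bound) {
        Set<Class<?>> seen = new LinkedHashSet<>();
        Deque<Class<?>> queue = new ArrayDeque<>();
        queue.add(bound);
        while (!queue.isEmpty()) {
            Class<?> c = queue.poll();
            if (c == null || !seen.add(c)) continue;       // skip nulls and revisits
            if (c.getSuperclass() != null) queue.add(c.getSuperclass());
            for (Class<?> i : c.getInterfaces()) queue.add(i);
        }
        return seen;
    }

    public static void main(String[] args) throws Exception {
        Method getter = Bean.class.getMethod("getValue");
        // The getter's generic return type is the type variable T...
        TypeVariable<?> tv = (TypeVariable<?>) getter.getGenericReturnType();
        // ...whose upper bound is the class we walk for UDT/Serializable candidates.
        Class<?> upperBound = (Class<?>) tv.getBounds()[0];
        Set<Class<?>> hierarchy = boundHierarchy(upperBound);
        System.out.println("bound=" + upperBound.getSimpleName()
            + " serializable=" + hierarchy.contains(Serializable.class));
    }
}
```

The walk reaches `Serializable` via the bound's interface list, which is what lets the fallback encoders kick in when no UDT appears anywhere in the hierarchy.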
The following boundary cases are considered:

If the generic bean is of the type

```scala
class Bean[T <: UDTClass] {
  @BeanProperty var udt: T = _
}
```

then the UDTEncoder will be created for the field.

But if the bean is of the type

```scala
class Bean[T <: UDTDerivedClass] {
  @BeanProperty var udt: T = _
}
```

where UDTDerivedClass <: UDTClass, then a JavaSerializable encoder will be created, even though the class hierarchy of UDTDerivedClass contains UDTClass. The reason is that the concrete instance created by the UDT type would be of UDTClass, which is not assignable to UDTDerivedClass.

Similarly, a non-generic bean class having UDTDerivedClass as a bean property will also use the Java serialization encoder (a test is added for this). The reason for the JavaSerializationEncoder is the same as in the generic case.
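The assignability argument behind these boundary cases can be checked with a minimal, self-contained Java sketch; `UDTClass` and `UDTDerivedClass` here are plain stand-in classes, not Spark's real UDT machinery.

```java
public class AssignabilityCheck {
    // Stand-ins for the classes in the PR description.
    static class UDTClass {}
    static class UDTDerivedClass extends UDTClass {}

    public static void main(String[] args) {
        // A UDT deserializer for UDTClass produces instances of UDTClass,
        // so a field declared as the derived type cannot hold them.
        System.out.println(
            UDTClass.class.isAssignableFrom(UDTDerivedClass.class));  // true: derived -> base is fine
        System.out.println(
            UDTDerivedClass.class.isAssignableFrom(UDTClass.class));  // false: base -> derived is not
    }
}
```

The second check failing is exactly why the UDT found higher up in the hierarchy cannot be used, and the code falls back to Java serialization instead.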

Why are the changes needed?

To fix a regression in Spark 3.3 onwards, where a bean having a generic type as a return value throws EncoderNotFoundException.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added bug tests.

Was this patch authored or co-authored using generative AI tooling?

No

ashahid added 26 commits August 23, 2024 15:37
…e/SPARK-46679

CDPD-58844. Upgrade janino to 3.1.10

Change-Id: I8744bb020e5fedcc0e9e4bc08c556c98a80406ba

Sync workflow files | Triggered by Kitchen/RE-github-workflows
…ests for edge cases. Flattened the recursive call
@ahshahid
Author

@hvanhovell
I have filed a new bug SPARK-49789 and linked this PR to it.
Previously this PR was created against bug SPARK-46679

It was brought to my notice by @squito that they are 2 different issues.
SPARK-46679 is a bug due to missing code in the handling of a parameterized bean, while this bug is for the case of a generic bean which is not parameterized.
For a parameterized bean, there is further scope of introspection.

I am also adding the review comments from the previous PR (which I will be closing).

hvanhovell 2 days ago
For readability purposes I would create a branch at the beginning where you handle `case tv: TypeVariable[_] if forGenericBound =>`. This way the rest of the code is less impacted.

done

hvanhovell 2 days ago
This will be an issue for Connect. While the API supports Kryo, Connect can't support Kryo in its current form. Either we have to detect whether we are in Connect mode, or we have to just fall back to Java serialization.

done

hvanhovell 2 days ago
This sort of begs for a queue and a loop. I am reasonably sure that is more readable...

done

hvanhovell 2 days ago
Can we use the actual superclasses here instead of going through the error message? I'd also prefer if you unify this with the other java serialization code.

Unifying it with the other java serialization code is something which I intend to do in my refactoring.

Author
@ahshahid ahshahid 2 days ago
I had thought about using the Serializable encoders as part of the match/case statement, but it appears tricky: the idea is that the Serializable encoders are to be used as a last resort. The way the current logic works (if I am not wrong), it collects all the interfaces/classes via the CommonUtils method, and it is possible, I think, that the Serializable interface may show up early, as the introspection data is converted to a map in the Bean class:

```scala
val parentClassesTypeMap =
  JavaTypeUtils.getTypeArguments(c, classOf[Object]).asScala.toMap
```

I am hesitant at this point to further complicate the code, unless you all think that it's worthwhile to do in this PR.
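One way around discovery order is to make the preference explicit: scan everything that was collected, but rank UDT above the serialization fallbacks. This is a hypothetical sketch, where `UdtMarker` stands in for Spark's UserDefinedType check and `chooseEncoder` is an illustrative name, not code from this PR.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

public class EncoderPreference {
    // Hypothetical marker standing in for the UDT check.
    interface UdtMarker {}

    /** Choose by fixed preference (UDT first), not by discovery order. */
    static String chooseEncoder(List<Class<?>> discovered) {
        if (discovered.stream().anyMatch(UdtMarker.class::isAssignableFrom)) {
            return "udt";
        }
        if (discovered.stream().anyMatch(Serializable.class::isAssignableFrom)) {
            return "java-serialization";
        }
        return "unsupported";
    }

    public static void main(String[] args) {
        // Even if Serializable is discovered before the UDT marker, UDT wins.
        List<Class<?>> discovered = Arrays.asList(Serializable.class, UdtMarker.class);
        System.out.println(chooseEncoder(discovered)); // prints "udt"
    }
}
```

With a ranking like this, the unordered map produced by the introspection step no longer matters, because the match is no longer first-come-first-served.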

…ders for generic type and parameterized type
@HyukjinKwon HyukjinKwon changed the title [SPARK-49789][SQL]: Handling of generic parameter with bounds while creating encoders [SPARK-49789][SQL] Handling of generic parameter with bounds while creating encoders Sep 25, 2024