Skip to content

Conversation

crystalxyz
Copy link
Contributor

@crystalxyz crystalxyz commented Aug 28, 2025

Which issue does this PR close?

Rationale for this change

As described in the issue, the original version of Eq in Foreign UDF/UDAF/UDWF does comparison based on pointer hash, which fails to recognize the same UDFs if their pointers differ. This feature fixes this by exposing hash method in the FFI interface so that Eq will compare the actual hash values of udfs (as well as their signatures and aliases).

What changes are included in this PR?

For FFI UDF, UDAF and UDWF, I have made the following changes:

  • The hash function is exposed in FFI module
  • In Foreign_ module, the PartialEq trait will do comparison on the results of the hash functions
  • In Foreign_ module, the Hash trait will now use the result of hash function

Are these changes tested?

Yes. The added unit tests for UDF, UDAF and UDWF failed before this feature, but passed now.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the ffi Changes to the ffi crate label Aug 28, 2025
@crystalxyz crystalxyz changed the title fix: Fix Eq by adding hash to FFI udf/udaf/udwf fix: Expose hash to FFI udf/udaf/udwf to fix their Eq Aug 29, 2025
@crystalxyz crystalxyz marked this pull request as ready for review August 29, 2025 02:59
@timsaucer
Copy link
Contributor

I'm wondering if we can save ourselves some effort here and do this:

pub struct FFI_ScalarUDF {
    ...
    /// Internal hash function result
    pub hash_value: u64,
    ...
}

impl From<Arc<ScalarUDF>> for FFI_ScalarUDF {
    fn from(udf: Arc<ScalarUDF>) -> Self {
        let name = udf.name().into();
        let aliases = udf.aliases().iter().map(|a| a.to_owned().into()).collect();
        let volatility = udf.signature().volatility.into();
        let short_circuits = udf.short_circuits();

        let mut state = DefaultHasher::new();
        udf.hash(&mut state);
        let hash_value = state.finish();

        let private_data = Box::new(ScalarUDFPrivateData { udf });

        Self {
            name,
            aliases,
            volatility,
            short_circuits,
            invoke_with_args: invoke_with_args_fn_wrapper,
            return_type: return_type_fn_wrapper,
            return_field_from_args: return_field_from_args_fn_wrapper,
            coerce_types: coerce_types_fn_wrapper,
            hash_value,
            clone: clone_fn_wrapper,
            release: release_fn_wrapper,
            private_data: Box::into_raw(private_data) as *mut c_void,
        }
    }
}

impl PartialEq for ForeignScalarUDF {
    fn eq(&self, other: &Self) -> bool {
        let Self {
            name,
            aliases,
            udf,
            signature,
        } = self;
        name == &other.name
            && aliases == &other.aliases
            && signature == &other.signature
            && udf.hash_value == other.udf.hash_value
    }
}

impl Hash for ForeignScalarUDF {
    fn hash<H: Hasher>(&self, state: &mut H) {
        let Self {
            name,
            aliases,
            udf,
            signature,
        } = self;
        name.hash(state);
        aliases.hash(state);
        // This appears to be a hash of the hash value, but if you review how
        // u64 is hashed, it is just pushing the byte values into state.
        udf.hash_value.hash(state);
        signature.hash(state);
    }
}

And further, I wonder if we even need fn hash() to include the name, aliases, and signature at all. Those all come from the FFI calls, so wouldn't they be included in the internal call to hash?

It seems like we have an opportunity here to have a simpler path, but I'm not 100% confident I haven't overlooked some need to call hash() function every time. I don't think we expect these functions to have a state that is changing and thus hash is updating.

@crystalxyz
Copy link
Contributor Author

Thanks for your feedback! Will take a look and think more carefully about the cases

Copy link
Contributor

@timsaucer timsaucer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! Thank you for diving in!

@findepi
Copy link
Member

findepi commented Sep 4, 2025

This feature fixes this by exposing hash method in the FFI interface so that Eq will depend on the actual hash values of udfs.

It may sounds like being dependent on hash not colliding.
It's not the case, right?

@crystalxyz
Copy link
Contributor Author

crystalxyz commented Sep 4, 2025

It may sounds like being dependent on hash not colliding. It's not the case, right?

@findepi Sorry for the confusion in wording and I just fixed the feature description. For Eq, we do compare signatures and aliases in addition to hash values.

@findepi
Copy link
Member

findepi commented Sep 4, 2025

If two different functions return same hash (e.g. 42), will Eq return incorrect result in such case?

@timsaucer
Copy link
Contributor

If two different functions return same hash (e.g. 42), will Eq return incorrect result in such case?

I must be missing something - how is that any different than any of the UDFs that aren't going through ffi? This is calling the same function during initialization.

@findepi
Copy link
Member

findepi commented Sep 5, 2025

For normal functions, the Eq is not based on the Hash and is not susceptible to hash collisions.
The hash can be pre-calculated and if it's different, then Eq can short-circuit before checking other fields.
However, checking pre-calculated hash alone -- or some fields and the pre-calculated hash -- is not a correct Eq implementation.

@findepi
Copy link
Member

findepi commented Sep 5, 2025

You can think of this the following way. If I replace hash function with a function that always returns the same thing (e.g. 42), is my code maybe slower, but still correct? If yes, we're good.

However, looking at impl PartialEq for ForeignAggregateUDF, I don't think this is the case. So I don't think we're good.

@timsaucer timsaucer self-requested a review September 5, 2025 10:48
@timsaucer
Copy link
Contributor

Ok, I think @findepi makes a fair point. We can probably just revert the last commits and I'll re-review. Sorry for the extra work.

@alamb alamb requested a review from findepi September 5, 2025 13:26
@crystalxyz
Copy link
Contributor Author

crystalxyz commented Sep 5, 2025

Sorry I might be missing some context here. I understand that hash values can only be used to avoid more expensive comparisons if they are unequal. But if hash values are equal, it doesn't guarantee anything about equality, so how should we check equality of ForeignUDFs? I'm not sure if having hash() as a function or hash_value as a constant makes a difference here either.

@findepi
Copy link
Member

findepi commented Sep 8, 2025

@timsaucer what if we simply don't do #17087 ?

@findepi findepi marked this pull request as draft September 8, 2025 13:23
@crystalxyz
Copy link
Contributor Author

crystalxyz commented Sep 8, 2025

Yeah, I guess it's better to have Eq to return false negative (return false for equal udfs) than false positive (return true for unequal udfs), although it would be good if we can return the result more accurately.

@findepi
Copy link
Member

findepi commented Sep 9, 2025

Yeah, I guess it's better to have Eq to return false negative (return false for equal udfs) than false positive (return true for unequal udfs),

Agreed!

although it would be good if we can return the result more accurately.

Agreed too.
Longer answer under the issue: #17087 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ffi Changes to the ffi crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable udf hashing across FFI boundary
3 participants