Fix ANALYZE when Hive partition has non-canonical value #24973

Open
wants to merge 1 commit into
base: master
from

Conversation

denodo-research-labs
Contributor

@denodo-research-labs denodo-research-labs commented Apr 24, 2025

Description

Extracted from trinodb/trino#15995

In Hive, a writer process may write a partition value as a non-canonical string,
e.g. month=02, even though the column is registered in Hive as an integer.

When updating the table or running ANALYZE, however, the statistics computation in Presto produces the canonical value 2
for the partition in the example above, which results in the following error: All computed statistics must be used.

Motivation and Context

While performing ANALYZE on the following partitioned dataset:
store_sales/d_year=2025/d_month=01/d_day=10/d_hour=00

the following exception occurs:

com.google.common.base.VerifyException: All computed statistics must be used
at com.google.common.base.Verify.verify(Verify.java:126)
at com.facebook.presto.hive.HiveMetadata.finishStatisticsCollection(HiveMetadata.java:1552)
at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorMetadata.finishStatisticsCollection(ClassLoaderSafeConnectorMetadata.java:210)
at com.facebook.presto.metadata.MetadataManager.finishStatisticsCollection(MetadataManager.java:761)

This PR addresses the above-mentioned issue by parsing the partition values into Presto values, so that the computed statistics are matched to their partitions instead of being ignored.
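The mismatch can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the class `PartitionValueCanonicalization` and its helper methods are hypothetical, and it assumes that computed statistics are matched to metastore partitions by their list of partition values and that all partition columns are integers.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only; names do not come from the Presto code base.
public class PartitionValueCanonicalization
{
    // Assumption: the partition column is an integer type, so "01" and "1" denote the same value.
    public static String canonicalizeInteger(String rawValue)
    {
        return Long.toString(Long.parseLong(rawValue));
    }

    public static List<String> canonicalize(List<String> rawValues)
    {
        return rawValues.stream()
                .map(PartitionValueCanonicalization::canonicalizeInteger)
                .collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        // Values as written in the partition name d_year=2025/d_month=01/d_day=10/d_hour=00
        List<String> metastoreValues = Arrays.asList("2025", "01", "10", "00");

        // Values as an engine would render them after evaluating the integer columns
        List<String> computedValues = Arrays.asList("2025", "1", "10", "0");
        Map<List<String>, String> computedStatistics =
                Collections.singletonMap(computedValues, "statistics placeholder");

        // Raw lookup misses, so the computed entry would stay unused
        System.out.println(computedStatistics.containsKey(metastoreValues)); // prints "false"

        // Lookup after canonicalizing the metastore values finds the entry
        System.out.println(computedStatistics.containsKey(canonicalize(metastoreValues))); // prints "true"
    }
}
```

In this sketch an unused entry corresponds to the situation in which the `All computed statistics must be used` check fails; canonicalizing both sides before the lookup is one way to make every computed entry match a partition.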

Test Plan

Added test method testAnalyzePartitionedTableWithNonCanonicalValues

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Hive Connector Changes
* Fix incorrectly ignoring computed table statistics in `ANALYZE`


@denodo-research-labs denodo-research-labs requested a review from a team as a code owner April 24, 2025 11:45
@denodo-research-labs denodo-research-labs force-pushed the analyze_hive_partitions branch 2 times, most recently from b377fc7 to ac05613 Compare April 24, 2025 15:40