
speedup date_trunc (~7x faster) in some cases #16859


Merged: 4 commits into apache:main, Jul 28, 2025

Conversation

waynexia
Member

@waynexia waynexia commented Jul 22, 2025

Which issue does this PR close?

Rationale for this change

Follows the comment #14593 (comment) to implement an array version of date_trunc.

While implementing this, I found a lot of code handling calendar logic that is irrelevant for small granularities (less than or equal to "hour"), so I removed it from this path as well. Benchmarks show this simplification alone brings a ~4x speedup.

I also updated the function documentation to include millisecond and microsecond in the supported granularity list.

What changes are included in this PR?

A faster implementation of date_trunc for granularities from "microsecond" to "hour".

Before:

date_trunc_minute_1000  time:   [16.600 µs 16.643 µs 16.684 µs]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) low mild

After:

date_trunc_minute_1000  time:   [2.3474 µs 2.3519 µs 2.3579 µs]
                        change: [-85.946% -85.909% -85.870%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
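The core of the fast path is that truncating to a fine granularity needs no calendar arithmetic at all, only per-element integer math. A minimal sketch (hypothetical function name, not the PR's actual code), assuming second-resolution timestamps and a granularity expressed in the same unit:

```rust
// Sketch of the fast path: truncate each timestamp down to a multiple
// of `unit` (e.g. unit = 60 truncates second-resolution values to the
// minute). No calendar logic is needed for these granularities.
fn trunc_fine(timestamps: &[i64], unit: i64) -> Vec<i64> {
    timestamps
        .iter()
        // rem_euclid (not %) keeps pre-epoch timestamps correct:
        // -61 % 60 == -1, but (-61).rem_euclid(60) == 59,
        // so -61 truncates down to -120, not up to -60.
        .map(|ts| ts - ts.rem_euclid(unit))
        .collect()
}
```

In the real kernel the unit would be derived from the array's TimeUnit and the requested granularity; the point is that the whole operation is a single map over the values, with no per-element date decomposition.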

Are these changes tested?

Covered by the existing unit tests.

Are there any user-facing changes?

@github-actions github-actions bot added the functions Changes to functions implementation label Jul 22, 2025
Signed-off-by: Ruihang Xia <[email protected]>
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jul 22, 2025
Comment on lines 193 to 197
granularity.as_str(),
"second" | "minute" | "millisecond" | "microsecond"
) || (parsed_tz.is_none() && granularity.as_str() == "hour")
{
let result = general_date_trunc_array_fine_granularity(
Member

Does this assume zone offsets are multiples of a whole hour?

(for example, Asia/Kathmandu is +05:45)

Member Author

The parsed_tz.is_none() check ensures this array is not associated with any timezone (hence the default UTC) and is safe to do in this way.

Added a test case f50fb3f for this case (TIL)

Member

@findepi findepi Jul 24, 2025

The parsed_tz.is_none() check

it's there for hour, but not for minute.
Why have different zone-related conditions for hour and minute?

Member Author

Because (AFAIK) no timezone has a non-whole-minute offset like +04:32:10, we can truncate all timestamps at minute granularity regardless of their timezone.
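This reasoning can be checked directly: minute truncation commutes with any offset that is itself a whole number of minutes, so shifting into the zone first changes nothing. A small illustration (hypothetical helpers on second-resolution timestamps, not the PR's code):

```rust
// Truncate a second-resolution UTC timestamp to the minute.
fn trunc_minute(ts_secs: i64) -> i64 {
    ts_secs - ts_secs.rem_euclid(60)
}

// Zone-aware truncation: shift to local time, truncate, shift back.
// When the offset is a whole number of minutes this is identical to
// truncating in UTC, which is why the fast path may ignore the
// timezone for "minute" and smaller granularities.
fn trunc_minute_in_zone(ts_secs: i64, offset_secs: i64) -> i64 {
    trunc_minute(ts_secs + offset_secs) - offset_secs
}
```

With a historical LMT offset such as Kathmandu's +05:41:16 the two functions disagree, which is exactly the edge case discussed below.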

Member

You're right. For minute and smaller granularities the new code path is always fine.
For hour it's also fine for zone-less data.
  • for zone-less data (timestamp without time zone), day granularity can also be applied with just divide and mul, just like hour. Or am I blind again?
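For zone-less data that observation holds: in the Unix timestamp model every day is exactly 86,400 seconds, so day truncation is the same divide-and-multiply arithmetic. A sketch under that assumption (hypothetical helper, not the PR's code):

```rust
// Day truncation for zone-less, second-resolution timestamps:
// every Unix day is exactly 86,400 seconds in this model, so no
// calendar logic is required. rem_euclid handles pre-epoch values.
fn trunc_day(ts_secs: i64) -> i64 {
    ts_secs - ts_secs.rem_euclid(86_400)
}
```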

Member

SELECT date_trunc('minute', arrow_cast('1910-01-01T00:00:00Z', 'Timestamp(Second, Some("Asia/Kathmandu"))'));
SELECT date_trunc('minute', arrow_cast(v, 'Timestamp(Second, Some("Asia/Kathmandu"))')) FROM (VALUES ('1910-01-01T00:00:00Z')) t(v);
+-----------------------------------------------------------------------------------------------------------------------+
| date_trunc(Utf8("minute"),arrow_cast(Utf8("1910-01-01T00:00:00Z"),Utf8("Timestamp(Second, Some("Asia/Kathmandu"))"))) |
+-----------------------------------------------------------------------------------------------------------------------+
| 1910-01-01T05:41:16+05:41                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.065 seconds.

+----------------------------------------------------------------------------------------------+
| date_trunc(Utf8("minute"),arrow_cast(t.v,Utf8("Timestamp(Second, Some("Asia/Kathmandu"))"))) |
+----------------------------------------------------------------------------------------------+
| 1910-01-01T05:41:16+05:41                                                                    |
+----------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.012 seconds.

this doesn't look correct to me
however, this is the same result as on main, so it's not a regression.
Is it because the existing date_trunc code does similar logic to the new optimized code path?

Contributor

It doesn't look right to me either. First, it's not truncating to the minute, and second, I have my doubts that the TZ is correct with that time. I'd have to think about it a bit more, but I think a ticket should be filed to have it looked at some more.

Member

  • Not sure whether chrono respects that

It looks like it does:

use chrono::{MappedLocalTime, Timelike};

let timestamp = chrono::DateTime::parse_from_rfc3339("1910-01-01T00:00:00Z").unwrap();
dbg!(&timestamp);
let tz = chrono_tz::Asia::Kathmandu;
// let tz = arrow::array::timezone::Tz::from_str("Asia/Kathmandu").unwrap();
dbg!(&tz);
let zoned_date_time = timestamp.with_timezone(&tz);
dbg!(&zoned_date_time);
let local_date_time = zoned_date_time.naive_local();
dbg!(&local_date_time);
let truncated_local = local_date_time.with_second(0).unwrap();
dbg!(&truncated_local);
let mapped_back = match truncated_local.and_local_timezone(tz) {
    MappedLocalTime::Single(mapped) => mapped,
    _ => panic!(),
};
dbg!(&mapped_back);
let timestamp = mapped_back.to_utc();
dbg!(&timestamp);
dbg!(timestamp.timestamp());
[tests/chrono.rs:7:1] &timestamp = 1910-01-01T00:00:00+00:00
[tests/chrono.rs:10:1] &tz = Asia/Kathmandu
[tests/chrono.rs:12:1] &zoned_date_time = 1910-01-01T05:41:16LMT
[tests/chrono.rs:14:1] &local_date_time = 1910-01-01T05:41:16
[tests/chrono.rs:16:1] &truncated_local = 1910-01-01T05:41:00
[tests/chrono.rs:21:1] &mapped_back = 1910-01-01T05:41:00LMT
[tests/chrono.rs:23:1] &timestamp = 1909-12-31T23:59:44Z
[tests/chrono.rs:24:1] timestamp.timestamp() = -1893456016

However, either we or arrow assumes that offsets are whole minutes. This is visible in the output above (https://github.com/apache/datafusion/pull/16859/files#r2229281161); maybe that's the CLI output rendering layer. It makes reasoning about this harder.

SELECT cast(date_trunc('minute', arrow_cast(v, 'Timestamp(Second, Some("Asia/Kathmandu"))')) as bigint) FROM (VALUES ('1910-01-01T00:00:00Z')) t(v);
  • main: -1893456000_000000000
  • PR: -1893456000_000000000

while, per chrono above, the correct result seems to be -1893456016_000000000

Member Author

Thank you for your very detailed check and explanation. I found another example with an extra search (that website is great)...

@alamb alamb added the performance Make DataFusion faster label Jul 24, 2025
@alamb
Contributor

alamb commented Jul 24, 2025

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubuntu SMP Wed May 28 02:40:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing date-trunc-fast (f50fb3f) to a0ce581 diff
BENCH_NAME=date_trunc
BENCH_COMMAND=cargo bench --bench date_trunc
BENCH_FILTER=
BENCH_BRANCH_NAME=date-trunc-fast
Results will be posted here when complete

Contributor

@alamb alamb left a comment

Thank you @waynexia -- this makes sense to me

Thank you @findepi for the good review question

@alamb
Contributor

alamb commented Jul 24, 2025

🤖: Benchmark completed

Details

group                     date-trunc-fast                        main
-----                     ---------------                        ----
date_trunc_minute_1000    1.00      4.7±0.01µs        ? ?/sec    6.17     28.7±0.03µs        ? ?/sec

Member

@findepi findepi left a comment

Let's add a code comment

// fast path for fine granularities
if matches!(
granularity.as_str(),
"second" | "minute" | "millisecond" | "microsecond"
Member

"minute" is correct for modern zones, but not for historical dates.
The old code apparently has some issues with them too, though (https://github.com/apache/datafusion/pull/16859/files#r2229547803).
Let's add a code comment about our conscious ignorance of historical zone offsets here.

Member Author

Added some comments in 346ae59

if matches!(
granularity.as_str(),
"second" | "minute" | "millisecond" | "microsecond"
) || (parsed_tz.is_none() && granularity.as_str() == "hour")
Member

Let's unlock the optimization for "day" in the zoneless case.

Member Author

Makes sense 👍 346ae59

@waynexia waynexia merged commit 764d547 into apache:main Jul 28, 2025
29 checks passed
@waynexia waynexia deleted the date-trunc-fast branch July 28, 2025 21:42
@waynexia
Member Author

Appreciate your detailed review! Learned a lot about timezones 🫨

Standing-Man pushed a commit to Standing-Man/datafusion that referenced this pull request Aug 4, 2025
// fast path for fine granularities
if matches!(
granularity.as_str(),
// For morden timezones, it's correct to truncate "minute" in this way.
Member

morden -> modern?

Member Author

fix them all 😎 #17135

@waynexia waynexia mentioned this pull request Aug 12, 2025
Labels
  • documentation: Improvements or additions to documentation
  • functions: Changes to functions implementation
  • performance: Make DataFusion faster