speedup date_trunc (~7x faster) in some cases #16859
Conversation
Signed-off-by: Ruihang Xia <[email protected]>
```rust
    granularity.as_str(),
    "second" | "minute" | "millisecond" | "microsecond"
) || (parsed_tz.is_none() && granularity.as_str() == "hour")
{
    let result = general_date_trunc_array_fine_granularity(
```
Does this assume zone offsets are multiples of a whole hour?
(for example, Asia/Kathmandu is +05:45)
The parsed_tz.is_none() check ensures this array is not associated with any timezone (hence the default UTC), so it is safe to do it this way.
Added a test case f50fb3f for this case (TIL
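The zone-less hour fast path being discussed can be sketched with plain integer arithmetic. This is a hypothetical standalone sketch, not the PR's actual general_date_trunc_array_fine_granularity helper: when no timezone is attached, UTC hour boundaries sit at fixed multiples of 3600 seconds from the epoch, so truncation is a floor division.

```rust
// Hypothetical sketch of the zone-less "hour" fast path (not the PR's
// actual helper). div_euclid floors toward negative infinity, so
// pre-1970 (negative) timestamps also truncate downward correctly.
fn trunc_hour_secs(ts: i64) -> i64 {
    ts.div_euclid(3600) * 3600
}

fn main() {
    assert_eq!(trunc_hour_secs(3_661), 3_600); // 01:01:01 -> 01:00:00
    assert_eq!(trunc_hour_secs(-1), -3_600);   // 1969-12-31T23:59:59Z -> 23:00:00Z
    println!("ok");
}
```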
The parsed_tz.is_none() check is there for hour, but not for minute.
Why have different zone-related conditions for hour and minute?
Because (AFAIK) no timezone has an offset with non-integer minutes, like +04:32:10, so we can truncate all timestamps at the minute level regardless of their timezone.
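That claim can be checked with a small arithmetic sketch (hypothetical code; the offset values are assumptions taken from Kathmandu's modern +05:45 and historical +05:41:16 offsets): when a zone's UTC offset is a whole number of minutes, truncating the raw epoch value to the minute agrees with converting to local time, truncating, and converting back.

```rust
// Hypothetical sketch: minute truncation on the raw epoch value.
fn trunc_minute(ts: i64) -> i64 {
    ts.div_euclid(60) * 60
}

// The "correct" route: shift into local time, truncate, shift back.
fn trunc_minute_via_local(ts: i64, offset_secs: i64) -> i64 {
    let local = ts + offset_secs;
    local.div_euclid(60) * 60 - offset_secs
}

fn main() {
    let ts = 1_700_000_123_i64;
    // Modern Kathmandu is +05:45: a whole number of minutes, so both agree.
    assert_eq!(trunc_minute(ts), trunc_minute_via_local(ts, 5 * 3600 + 45 * 60));
    // Historical LMT +05:41:16 is NOT whole minutes: the two paths diverge.
    assert_ne!(trunc_minute(ts), trunc_minute_via_local(ts, 5 * 3600 + 41 * 60 + 16));
    println!("ok");
}
```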
You're right. For minute and smaller granularities the new code path is always fine¹. For hour it's also fine, for zone-less data.
- (¹) In the past (early 20th century and earlier), named zones often had offset changes that were not multiples of minutes; at least this is what https://www.timeanddate.com/time/zone/nepal/kathmandu shows me. Not sure whether chrono respects that.
- For zone-less data (timestamp without time zone), day granularity can also be applied with just a divide and a multiply, just like hour. Or am I blind again?
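The suggested zone-less day path would look the same as the hour one, just with an 86 400-second divisor (a hypothetical sketch, not the PR's code):

```rust
// Hypothetical sketch: for "timestamp without time zone" data, day
// truncation is also a single divide-and-multiply, since zone-less
// days are fixed 86_400-second intervals from the epoch boundary.
fn trunc_day_secs(ts: i64) -> i64 {
    ts.div_euclid(86_400) * 86_400
}

fn main() {
    assert_eq!(trunc_day_secs(90_000), 86_400); // 1970-01-02T01:00 -> day start
    assert_eq!(trunc_day_secs(-1), -86_400);    // 1969-12-31 day start
    println!("ok");
}
```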
```sql
SELECT date_trunc('minute', arrow_cast('1910-01-01T00:00:00Z', 'Timestamp(Second, Some("Asia/Kathmandu"))'));
SELECT date_trunc('minute', arrow_cast(v, 'Timestamp(Second, Some("Asia/Kathmandu"))')) FROM (VALUES ('1910-01-01T00:00:00Z')) t(v);
```

```text
+-----------------------------------------------------------------------------------------------------------------------+
| date_trunc(Utf8("minute"),arrow_cast(Utf8("1910-01-01T00:00:00Z"),Utf8("Timestamp(Second, Some("Asia/Kathmandu"))"))) |
+-----------------------------------------------------------------------------------------------------------------------+
| 1910-01-01T05:41:16+05:41                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.065 seconds.

+----------------------------------------------------------------------------------------------+
| date_trunc(Utf8("minute"),arrow_cast(t.v,Utf8("Timestamp(Second, Some("Asia/Kathmandu"))"))) |
+----------------------------------------------------------------------------------------------+
| 1910-01-01T05:41:16+05:41                                                                    |
+----------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.012 seconds.
```
This doesn't look correct to me. However, this is the same result as on main, so it's not a regression.
Is it because the existing date_trunc code does similar logic to the new optimized code path?
It doesn't look right to me either. First, it's not truncating to the minute, and second, I have my doubts that the TZ is correct with that time. I'd have to think about that a bit more, but I think a ticket should be filed to have it looked at some more.
- Not sure whether chrono respects that

It looks like it does:
```rust
use chrono::{MappedLocalTime, Timelike};

let timestamp = chrono::DateTime::parse_from_rfc3339("1910-01-01T00:00:00Z").unwrap();
dbg!(&timestamp);
let tz = chrono_tz::Asia::Kathmandu;
// let tz = arrow::array::timezone::Tz::from_str("Asia/Kathmandu").unwrap();
dbg!(&tz);
let zoned_date_time = timestamp.with_timezone(&tz);
dbg!(&zoned_date_time);
let local_date_time = zoned_date_time.naive_local();
dbg!(&local_date_time);
let truncated_local = local_date_time.with_second(0).unwrap();
dbg!(&truncated_local);
let mapped_back = match truncated_local.and_local_timezone(tz) {
    MappedLocalTime::Single(mapped) => mapped,
    _ => panic!(),
};
dbg!(&mapped_back);
let timestamp = mapped_back.to_utc();
dbg!(&timestamp);
dbg!(timestamp.timestamp());
```
```text
[tests/chrono.rs:7:1] &timestamp = 1910-01-01T00:00:00+00:00
[tests/chrono.rs:10:1] &tz = Asia/Kathmandu
[tests/chrono.rs:12:1] &zoned_date_time = 1910-01-01T05:41:16LMT
[tests/chrono.rs:14:1] &local_date_time = 1910-01-01T05:41:16
[tests/chrono.rs:16:1] &truncated_local = 1910-01-01T05:41:00
[tests/chrono.rs:21:1] &mapped_back = 1910-01-01T05:41:00LMT
[tests/chrono.rs:23:1] &timestamp = 1909-12-31T23:59:44Z
[tests/chrono.rs:24:1] timestamp.timestamp() = -1893456016
```
However, either we or arrow assume that offsets are whole minutes. This is visible in the output above (https://github.com/apache/datafusion/pull/16859/files#r2229281161). Maybe that happens in the CLI output rendering layer; it makes reasoning about this harder.

```sql
SELECT cast(date_trunc('minute', arrow_cast(v, 'Timestamp(Second, Some("Asia/Kathmandu"))')) as bigint) FROM (VALUES ('1910-01-01T00:00:00Z')) t(v);
```

- main: -1893456000_000000000
- PR: -1893456000_000000000

while per chrono above, the correct result seems to be -1893456016_000000000
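The expected -1893456016 can be reproduced with plain integer arithmetic, independent of chrono (a hypothetical sketch; the +05:41:16 LMT offset is the value shown in the chrono output above):

```rust
// Hypothetical check of the chrono result: truncate to the *local*
// minute under a historical LMT offset of +05:41:16 (20_476 s), then
// map back to UTC epoch seconds.
fn trunc_minute_with_offset(ts: i64, offset_secs: i64) -> i64 {
    let local = ts + offset_secs;            // 1910-01-01T05:41:16 local
    local.div_euclid(60) * 60 - offset_secs  // drop local seconds, back to UTC
}

fn main() {
    let ts = -1_893_456_000_i64;              // 1910-01-01T00:00:00Z
    let offset = 5 * 3600 + 41 * 60 + 16;     // +05:41:16 LMT
    // Matches chrono's 1909-12-31T23:59:44Z.
    assert_eq!(trunc_minute_with_offset(ts, offset), -1_893_456_016);
    println!("ok");
}
```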
Signed-off-by: Ruihang Xia <[email protected]>
🤖: Benchmark completed
Let's add a code comment
```rust
// fast path for fine granularities
if matches!(
    granularity.as_str(),
    "second" | "minute" | "millisecond" | "microsecond"
```
minute is correct for modern zones, but not for historical dates. The old code apparently has some issues with them too, though (https://github.com/apache/datafusion/pull/16859/files#r2229547803).
Let's add a code comment about our conscious ignorance of historical zone offsets here.
Added some comments in 346ae59
```rust
if matches!(
    granularity.as_str(),
    "second" | "minute" | "millisecond" | "microsecond"
) || (parsed_tz.is_none() && granularity.as_str() == "hour")
```
Let's unlock the optimization for "day" in the zone-less case.
Makes sense 👍 346ae59
Signed-off-by: Ruihang Xia <[email protected]>
Appreciate your detailed review! Learned a lot about timezones 🫨
```rust
// fast path for fine granularities
if matches!(
    granularity.as_str(),
    // For morden timezones, it's correct to truncate "minute" in this way.
```
morden -> modern?
Fixed them all 😎 #17135
Which issue does this PR close?
- date_trunc (~20% time reduction) #14593

Rationale for this change
Follows the comment #14593 (comment) to implement an array version of date_trunc. While implementing, I found that there is a lot of code handling boring calendar things, which is totally unrelated to the small granularities (less than or equal to "hour"), so I removed it as well. Benchmarks show this simplification brings a 4x performance improvement.
I also updated the function documentation, by the way, to include millisecond and microsecond in the supported granularity list.

What changes are included in this PR?
A faster implementation of date_trunc with granularities from "microsecond" to "hour".

Are these changes tested?
Using existing unit tests.

Are there any user-facing changes?