Support calendar-based chunking #9119
Codecov Report
Coverage diff for #9119 against main:
Coverage: 82.45% -> 82.55% (+0.10%)
Files: 245 -> 247 (+2)
Lines: 48050 -> 48685 (+635)
Branches: 12244 -> 12455 (+211)
Hits: 39618 -> 40194 (+576)
Misses: 3586 -> 3641 (+55)
Partials: 4846 -> 4850 (+4)
@fabriziomello, @natalya-aksman: please review this pull request.
Going to work on tests for continuous aggregates and other things.
Is it the same thing as time_bucket, i.e., does it generate the same chunk intervals if configured with the same bucket width?
Yes, it is similar. But there are also differences, including (but probably not limited to):
Btw, when comparing to time_bucket(), I also found a potential issue with time_bucket(): #9136
Add support for specifying an origin point for aligning chunk boundaries. This allows chunks to be aligned to a specific reference point instead of the Unix epoch (or zero for integer dimensions).

Changes:
- Add interval_origin column to dimension catalog table
- Add origin parameter to create_hypertable(), set_chunk_time_interval(), set_partitioning_interval(), and add_dimension() SQL functions
- Add time_origin and integer_origin columns to dimensions view
- Modify chunk boundary calculation to use origin when specified
- Add type validation for origin (integer vs timestamp compatibility)
- Add chunk_origin test for origin parameter functionality
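To illustrate what aligning to an origin means for fixed-size intervals, here is a minimal Python sketch. It is not the PR's C implementation; the function name `aligned_chunk_range` and the half-open `[start, end)` convention are assumptions made for illustration.

```python
def aligned_chunk_range(value: int, interval: int, origin: int = 0) -> tuple[int, int]:
    """Return the half-open [start, end) range of the chunk containing value.

    Hypothetical helper: boundaries sit at origin + n * interval for integer n.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    # Floor division keeps values below the origin aligned as well.
    n = (value - origin) // interval
    start = origin + n * interval
    return start, start + interval


# With the default origin (zero), boundaries are multiples of the interval.
print(aligned_chunk_range(25, 10))            # -> (20, 30)
# A custom origin shifts every boundary by the same offset.
print(aligned_chunk_range(25, 10, origin=3))  # -> (23, 33)
# Values before the origin still fall into an aligned chunk.
print(aligned_chunk_range(-1, 10, origin=3))  # -> (-7, 3)
```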
Hypertables can now be configured to create chunks that align with the start and end of days, weeks, months, and years in the local time zone. This includes days and months that vary in length due to daylight saving time or the month of the year.

This "calendar-based chunking" is achieved by anchoring chunk ranges at a user-configurable "origin" and calculating chunk ranges in the local time zone, in units of, e.g., variable-length days and months. Therefore, a day-sized chunk can sometimes be 23 or 25 hours long if it covers a daylight saving time change.

Currently, calendar-based chunking is guarded by a GUC and turned off by default to preserve the existing behavior. To use calendar-based chunking, the GUC must be turned on and the hypertable configured with an Interval-type chunk interval. Existing hypertables are not affected by the GUC setting.

The default origin is set to '2001-01-01 00:00' because that is the start of a new year and a Monday, so it also aligns with the start of a week (ISO). This means that chunk intervals set to `1 week` will lead to chunks that start on a Monday and end on a Sunday.

The origin-based approach was chosen because of the flexibility it gives; setting a different origin allows, e.g., daily chunks to start at noon instead of midnight. It also makes supporting chunk intervals of multiple months easy, as opposed to a truncation-based approach (e.g., date_trunc()), which only works with singular days, weeks, or months.

Implementation-wise, a challenge of the origin-based approach is to calculate a chunk range for a point in the future from the given origin. Since, e.g., a `1 month` interval varies in size depending on which month it is, simple fixed-size interval arithmetic cannot be used to calculate the Nth chunk range from the origin. Instead, the approach taken is to break down the calculations into full month, day, and sub-day units. But this only works for intervals that are non-fractional units of months, days, etc. As a fallback for arbitrary intervals, the range for a particular chunk is calculated from the origin by iteratively adding intervals until the desired point in time is covered. This iterative approach is optimized by (under-)estimating the number of intervals to the desired point, and then iterating from there.

Since the iterative approach works for all types of intervals, the question is whether this approach is good enough for all cases, so that the "broken-down" calculations are not needed. However, for this change, both approaches exist together, although this decision can be revisited in a future change.

Changes include:
- Add `interval` column to dimension catalog for INTERVAL type storage
- Add `partition_origin` parameter to `by_range()` for origin specification
- Create chunk_range.c/h for calendar-based interval calculations
- Add GUC `timescaledb.enable_calendar_chunking` (default off)
- Update dimension handling to support both fixed and calendar intervals
- Update SQL API functions to support calendar intervals
- Make caggs inherit calendar chunking from main hypertable
- Add calendar_chunking regression test
- Add calendar_chunking_integration test
A hypertable is either using legacy or calendar-based chunking as determined at hypertable creation time. This setting is sticky (stored in metadata) and it doesn't change even if the calendar chunking GUC is turned on. To allow users to switch to calendar-based chunking (and back), add a parameter `calendar_chunking=>true` to `set_chunk_time_interval()` and `set_partitioning_interval()`.
Expand test coverage for calendar-based chunking to include:
- timestamp (without timezone): daily, weekly, yearly chunks and custom origins
- date: daily, weekly, yearly, quarterly chunks and custom origins

Also add comprehensive integer origin tests for all integer types:
- smallint, int, bigint using both old API (create_hypertable with chunk_time_origin) and new API (by_range with partition_origin)
Hypertables can now be configured to create chunks that align with the start and end of days, weeks, months, and years in the local time zone. This includes days and months that vary in length due to daylight saving time or the month of the year.
This "calendar-based chunking" is achieved by anchoring chunk ranges at a user-configurable "origin" and calculating chunk ranges in the local time zone, in units of, e.g., variable-length days and months. Therefore, a day-sized chunk can sometimes be 23 or 25 hours long if it covers a daylight saving time change.
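The 23- and 25-hour days can be checked with a small Python sketch. It only illustrates the calendar effect and is unrelated to the PR's C code; the helper name `local_day_length` and the choice of `Europe/Stockholm` are assumptions for the example, and `zoneinfo` needs the IANA time zone database (from the system or the `tzdata` package).

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo


def local_day_length(year: int, month: int, day: int, tz_name: str) -> timedelta:
    """Elapsed time between local midnight and the next local midnight."""
    tz = ZoneInfo(tz_name)
    start = datetime(year, month, day, tzinfo=tz)
    # Adding a timedelta to an aware datetime moves the wall clock,
    # so this is the next local midnight regardless of DST.
    end = start + timedelta(days=1)
    # Subtract in UTC: subtracting two datetimes that share a tzinfo
    # ignores their UTC offsets and would always report 24 hours.
    return end.astimezone(timezone.utc) - start.astimezone(timezone.utc)


print(local_day_length(2024, 3, 31, "Europe/Stockholm"))   # spring forward: 23:00:00
print(local_day_length(2024, 10, 27, "Europe/Stockholm"))  # fall back: 1 day, 1:00:00
print(local_day_length(2024, 6, 15, "Europe/Stockholm"))   # ordinary day: 1 day, 0:00:00
```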
Currently, calendar-based chunking is guarded by a GUC, and turned off by default to preserve the existing behavior. To use calendar-based chunking, the GUC must be turned on and the hypertable configured with an Interval-type chunk interval. Existing hypertables are not affected by the GUC setting.
The default origin is set to '2001-01-01 00:00' because that is the start of a new year and a Monday, so it also aligns with the start of a week (ISO). This means that chunk intervals set to `1 week` will lead to chunks that start on a Monday and end on a Sunday.
The origin-based approach was chosen because of the flexibility it gives; setting a different origin allows, e.g., daily chunks to start at noon instead of midnight. It also makes supporting chunk intervals of multiple months easy, as opposed to a truncation-based approach (e.g., date_trunc()), which only works with singular days, weeks, or months.
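A short Python sketch can confirm the weekday claim and show the effect of a noon origin. It uses naive timestamps and fixed-width intervals only, so it ignores time zones and daylight saving time; `chunk_start` is a hypothetical helper, not a TimescaleDB function.

```python
from datetime import datetime, timedelta

DEFAULT_ORIGIN = datetime(2001, 1, 1)  # default origin quoted in the description


def chunk_start(ts: datetime, width: timedelta, origin: datetime = DEFAULT_ORIGIN) -> datetime:
    """Start of the fixed-width chunk containing ts, counted from origin."""
    n = (ts - origin) // width  # timedelta // timedelta floors to an int
    return origin + n * width


# The default origin is a Monday (weekday() == 0), so weekly chunks start on Mondays.
print(DEFAULT_ORIGIN.weekday())                                         # -> 0
print(chunk_start(datetime(2025, 3, 19, 15, 30), timedelta(weeks=1)))   # -> 2025-03-17 00:00:00

# An origin at noon makes daily chunks run from noon to noon.
noon_origin = datetime(2001, 1, 1, 12)
print(chunk_start(datetime(2025, 3, 19, 9, 0), timedelta(days=1), noon_origin))  # -> 2025-03-18 12:00:00
```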
Implementation-wise, a challenge of the origin-based approach is to calculate a chunk range for a point in the future from the given origin. Since, e.g., a `1 month` interval varies in size depending on which month it is, simple fixed-size interval arithmetic cannot be used to calculate the Nth chunk range from the origin. Instead, the approach taken is to break down the calculations into full month, day, and sub-day units. But this only works for intervals that are non-fractional units of months, days, etc. As a fallback for arbitrary intervals, the range for a particular chunk is calculated from the origin by iteratively adding intervals until the desired point in time is covered. This iterative approach is optimized by (under-)estimating the number of intervals to the desired point, and then iterating from there.
Since the iterative approach works for all types of intervals, the question is whether this approach is good enough for all cases, so that the "broken-down" calculations are not needed. However, for this change, both approaches exist together, although this decision can be revisited in a future change.
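The iterative fallback with an under-estimated starting point might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the code in chunk_range.c: the names `boundary` and `chunk_range` are hypothetical, intervals are given as whole months plus whole days, time zones are ignored, and the Nth boundary is computed as origin + N * interval (with the day of month clamped to the target month's length) rather than by repeated addition.

```python
import calendar
from datetime import datetime, timedelta

ORIGIN = datetime(2001, 1, 1)  # default origin quoted in the description


def boundary(origin: datetime, months: int, days: int, n: int) -> datetime:
    """Return origin + n * (months, days), clamping the day to the month's length."""
    total = origin.year * 12 + (origin.month - 1) + n * months
    year, month0 = divmod(total, 12)
    day = min(origin.day, calendar.monthrange(year, month0 + 1)[1])
    return origin.replace(year=year, month=month0 + 1, day=day) + timedelta(days=n * days)


def chunk_range(ts: datetime, months: int, days: int = 0, origin: datetime = ORIGIN):
    """Return the half-open [start, end) chunk range that covers ts."""
    if months < 0 or days < 0 or months + days == 0:
        raise ValueError("interval must be positive")
    # A month is never longer than 31 days, so dividing by the longest possible
    # interval cannot overshoot the true chunk index when ts >= origin.
    longest = timedelta(days=31 * months + days)
    n = (ts - origin) // longest
    while boundary(origin, months, days, n) > ts:       # only needed when ts < origin
        n -= 1
    while boundary(origin, months, days, n + 1) <= ts:  # iterate up to the target chunk
        n += 1
    return boundary(origin, months, days, n), boundary(origin, months, days, n + 1)


start, end = chunk_range(datetime(2025, 2, 15), months=1)
print(start, end, (end - start).days)  # -> 2025-02-01 00:00:00 2025-03-01 00:00:00 28
print(chunk_range(datetime(2025, 2, 15), months=3))  # quarter starting 2025-01-01
```

For a `1 month` interval and a timestamp in February 2025, the estimate starts at chunk index 284 and the loop advances five steps to the true index 289, instead of walking all 289 intervals from the origin.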
Changes include:
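- Add `interval` column to dimension catalog for INTERVAL type storage
- Add `partition_origin` parameter to `by_range()` for origin specification
- Create chunk_range.c/h for calendar-based interval calculations
- Add GUC `timescaledb.enable_calendar_chunking` (default off)
- Update dimension handling to support both fixed and calendar intervals
- Update SQL API functions to support calendar intervals
- Make caggs inherit calendar chunking from main hypertable
- Add calendar_chunking regression test
- Add calendar_chunking_integration test
The PR is divided into two commits. The first commit introduces the origin parameter: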
Add support for specifying an origin point for aligning chunk boundaries. This allows chunks to be aligned to a specific reference point instead of the Unix epoch (or zero for integer dimensions).
Changes:
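- Add interval_origin column to dimension catalog table
- Add origin parameter to create_hypertable(), set_chunk_time_interval(), set_partitioning_interval(), and add_dimension() SQL functions
- Add time_origin and integer_origin columns to dimensions view
- Modify chunk boundary calculation to use origin when specified
- Add type validation for origin (integer vs timestamp compatibility)
- Add chunk_origin test for origin parameter functionality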
Disable-check: commit-count
Closes: #1500
Example usage: