[Enhancement]: Long wait time and high IO read when compressing data copied month by month #7225

Open
xinyifly opened this issue Aug 31, 2024 · 0 comments
Labels: enhancement (an enhancement to an existing feature for functionality)

xinyifly commented Aug 31, 2024

What type of enhancement is this?

Performance

What subsystems and features will be improved?

Compression

What does the enhancement do?

I'm migrating a histories table to a hypertable to make use of TimescaleDB's compression.

The data starts from 2024-01-01 and is too large to migrate in one shot, so I'm migrating it month by month.

Here is what I did to create an empty hypertable:

SELECT create_hypertable('histories', by_range('datetime'));
SELECT add_dimension('histories', by_hash('house_id', 4));
CREATE UNIQUE INDEX index_histories_on_house_id_and_datetime ON public.histories USING btree (house_id, datetime DESC);

ALTER TABLE histories SET (timescaledb.compress, timescaledb.compress_segmentby = 'house_id');
SELECT add_compression_policy('histories', INTERVAL '7 days');
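
As a sanity check, the resulting dimensions, the segmentby setting, and the compression policy's job id can be read back from the informational views. This is just a minimal sketch, assuming the stock timescaledb_information schema of recent versions:

-- Confirm the range dimension on "datetime" and the hash dimension on "house_id"
SELECT dimension_number, column_name, dimension_type, num_partitions
FROM timescaledb_information.dimensions
WHERE hypertable_name = 'histories';

-- Confirm the segmentby column used for compression
SELECT attname, segmentby_column_index
FROM timescaledb_information.compression_settings
WHERE hypertable_name = 'histories';

-- Find the job id of the compression policy (used later with run_job)
SELECT job_id, proc_name, schedule_interval, config
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_compression' AND hypertable_name = 'histories';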

Now migrate the first month from the production DB instance to the TimescaleDB instance:

-- In production db instance
\copy (SELECT * FROM histories WHERE NOT datetime < '2024-01-01' AND datetime < '2024-02-01') TO 'histories-2024-01.csv' WITH CSV HEADER

-- In timescaledb instance
\copy histories FROM 'histories-2024-01.csv' WITH CSV HEADER

-- COPY 90892167 (Rows)
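
Before compressing, the chunks created by the copy and their time ranges can be listed; a small sketch using the standard timescaledb_information.chunks view:

-- List the chunks created by the copy, with their time ranges and compression state
SELECT chunk_name, range_start, range_end, is_compressed
FROM timescaledb_information.chunks
WHERE hypertable_name = 'histories'
ORDER BY range_start;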

Then manually trigger compression:

CALL run_job(1000);

So far, it works amazingly well! It only took around 30~60 minutes to compress roughly 50 GB down to 2 GB, and I could almost immediately see disk usage decreasing in df -h.
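
The before/after sizes quoted above are read from df -h; roughly the same numbers can also be read from the compression stats function (a sketch, assuming the stock hypertable_compression_stats API):

-- Compare on-disk size before and after compression
SELECT total_chunks,
       number_compressed_chunks,
       pg_size_pretty(before_compression_total_bytes) AS before_size,
       pg_size_pretty(after_compression_total_bytes)  AS after_size
FROM hypertable_compression_stats('histories');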

Repeating this for the second month didn't show much difference:

\copy (SELECT * FROM histories WHERE NOT datetime < '2024-02-01' AND datetime < '2024-03-01') TO 'histories-2024-02.csv' WITH CSV HEADER

\copy histories FROM 'histories-2024-02.csv' WITH CSV HEADER

CALL run_job(1000);

From the 3rd month on, the performance issue begins.

For the observed 4th month, it takes over 3 hours before any difference shows up in df -h, with high disk IO read the whole time, and (I think) another hour to actually execute the compression.

[Screenshot 2024-08-31 at 18:41:20, ECS Console: disk read BPS graph. The compression procedure runs from the tooltip to the point where Disk BPS reaches zero.]

Reading at 130 MB/s for 3 hours comes to around 1371 GB of data, while my whole uncompressed CSV dataset adds up to only about 200 GB.

I want to know where this process is blocking and why it needs such a high amount of IO read.
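
In the meantime, while run_job is executing, which chunks are still pending and what the session is waiting on can be watched with something like the following (a rough sketch; it shows per-chunk status and wait events, not the root cause):

-- Which chunks are already compressed and which are still pending?
SELECT chunk_name, compression_status
FROM chunk_compression_stats('histories')
ORDER BY chunk_name;

-- What is the session running CALL run_job(1000) currently waiting on?
SELECT pid, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE query ILIKE '%run_job%'
  AND pid <> pg_backend_pid();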

\dx
                                                List of installed extensions
    Name     | Version |   Schema   |                                      Description
-------------+---------+------------+---------------------------------------------------------------------------------------
 plpgsql     | 1.0     | pg_catalog | PL/pgSQL procedural language
 timescaledb | 2.13.0  | public     | Enables scalable inserts and complex queries for time-series data (Community Edition)
(2 rows)

SELECT version();
                                   version
------------------------------------------------------------------------------
 PostgreSQL 15.7 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 12.3.0, 64-bit
(1 row)

Implementation challenges

No response

xinyifly added the enhancement label Aug 31, 2024