[Enhancement]: Long wait time and high IO read when compressing data copied month by month #7225

Open
xinyifly opened this issue Aug 31, 2024 · 0 comments
Labels: enhancement (an enhancement to an existing feature for functionality)

xinyifly commented Aug 31, 2024

What type of enhancement is this?

Performance

What subsystems and features will be improved?

Compression

What does the enhancement do?

I'm migrating a histories table to a hypertable to make use of TimescaleDB's compression.

The data starts from 2024-01-01 and is too large to migrate in one shot, so I'm migrating it month by month.

Here is what I did to create an empty hypertable:

SELECT create_hypertable('histories', by_range('datetime'));
SELECT add_dimension('histories', by_hash('house_id', 4));
CREATE UNIQUE INDEX index_histories_on_house_id_and_datetime ON public.histories USING btree (house_id, datetime DESC);

ALTER TABLE histories SET (timescaledb.compress, timescaledb.compress_segmentby = 'house_id');
SELECT add_compression_policy('histories', INTERVAL '7 days');
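
As a sanity check, the resulting dimensions, the segmentby setting, and the compression policy's job id can be read back from the informational views. This is just a minimal sketch, assuming the stock timescaledb_information schema of recent versions:

-- Confirm the range dimension on "datetime" and the hash dimension on "house_id"
SELECT dimension_number, column_name, dimension_type, num_partitions
FROM timescaledb_information.dimensions
WHERE hypertable_name = 'histories';

-- Confirm the segmentby column used for compression
SELECT attname, segmentby_column_index
FROM timescaledb_information.compression_settings
WHERE hypertable_name = 'histories';

-- Find the job id of the compression policy (used later with run_job)
SELECT job_id, proc_name, schedule_interval, config
FROM timescaledb_information.jobs
WHERE proc_name = 'policy_compression' AND hypertable_name = 'histories';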

Now migrate the first month from the production DB instance to the TimescaleDB instance:

-- In production db instance
\copy (SELECT * FROM histories WHERE NOT datetime < '2024-01-01' AND datetime < '2024-02-01') TO 'histories-2024-01.csv' WITH CSV HEADER

-- In timescaledb instance
\copy histories FROM 'histories-2024-01.csv' WITH CSV HEADER

-- COPY 90892167 (Rows)
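
Before compressing, the chunks created by the copy and their time ranges can be listed; a small sketch using the standard timescaledb_information.chunks view:

-- List the chunks created by the copy, with their time ranges and compression state
SELECT chunk_name, range_start, range_end, is_compressed
FROM timescaledb_information.chunks
WHERE hypertable_name = 'histories'
ORDER BY range_start;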

Then manually trigger compression:

CALL run_job(1000);

So far, it works amazingly well! It only took around 30~60 minutes to compress roughly 50 GB down to 2 GB, and I could almost immediately see disk usage decreasing in df -h.
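
The before/after sizes quoted above are read from df -h; roughly the same numbers can also be read from the compression stats function (a sketch, assuming the stock hypertable_compression_stats API):

-- Compare on-disk size before and after compression
SELECT total_chunks,
       number_compressed_chunks,
       pg_size_pretty(before_compression_total_bytes) AS before_size,
       pg_size_pretty(after_compression_total_bytes)  AS after_size
FROM hypertable_compression_stats('histories');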

Repeating this for the second month didn't show much difference:

\copy (SELECT * FROM histories WHERE NOT datetime < '2024-02-01' AND datetime < '2024-03-01') TO 'histories-2024-02.csv' WITH CSV HEADER

\copy histories FROM 'histories-2024-02.csv' WITH CSV HEADER

CALL run_job(1000);

From the 3rd month on, the performance issue begins.

For the observed 4th month, it takes over 3 hours before any difference shows up in df -h, with high disk IO read the whole time, and (I think) another hour to actually execute the compression.

[Screenshot 2024-08-31 at 18:41:20, ECS Console: disk read BPS graph. The compression procedure runs from the tooltip to the point where Disk BPS reaches zero.]

Reading at 130 MB/s for 3 hours comes to around 1371 GB of data, while my whole uncompressed CSV dataset adds up to only about 200 GB.

I want to know where this process is blocking and why it needs such a high amount of IO read.
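
In the meantime, while run_job is executing, which chunks are still pending and what the session is waiting on can be watched with something like the following (a rough sketch; it shows per-chunk status and wait events, not the root cause):

-- Which chunks are already compressed and which are still pending?
SELECT chunk_name, compression_status
FROM chunk_compression_stats('histories')
ORDER BY chunk_name;

-- What is the session running CALL run_job(1000) currently waiting on?
SELECT pid, state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE query ILIKE '%run_job%'
  AND pid <> pg_backend_pid();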

\dx
                                                List of installed extensions
    Name     | Version |   Schema   |                                      Description
-------------+---------+------------+---------------------------------------------------------------------------------------
 plpgsql     | 1.0     | pg_catalog | PL/pgSQL procedural language
 timescaledb | 2.13.0  | public     | Enables scalable inserts and complex queries for time-series data (Community Edition)
(2 rows)

SELECT version();
                                   version
------------------------------------------------------------------------------
 PostgreSQL 15.7 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 12.3.0, 64-bit
(1 row)

Implementation challenges

No response

xinyifly added the enhancement label Aug 31, 2024