s3_endpoint setting is getting lost for parquet_describe #207

Closed
2 tasks done
wasd171 opened this issue Feb 4, 2025 · 1 comment · Fixed by #209
Labels
bug Something isn't working good first issue Good for newcomers priority-high High priority issue user-request This issue was directly requested by a user

wasd171 commented Feb 4, 2025

What happens?

The s3_endpoint setting is not persisted between connections for parquet_describe.

To Reproduce

I am using an S3 bucket in the eu-central-2 region. Let's set up the database:

postgres=# CREATE DATABASE test TEMPLATE template0;
CREATE DATABASE
postgres=# \c test
You are now connected to database "test" as user "postgres".
test=# CREATE EXTENSION pg_analytics;
CREATE EXTENSION
test=# CREATE FOREIGN DATA WRAPPER parquet_wrapper HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator;
CREATE FOREIGN DATA WRAPPER
test=# CREATE SERVER parquet_server FOREIGN DATA WRAPPER parquet_wrapper;
CREATE SERVER
test=# CREATE USER MAPPING FOR public
SERVER parquet_server OPTIONS (
  TYPE 'S3',
  PROVIDER 'CREDENTIAL_CHAIN',
  CHAIN 'env',
  REGION 'eu-central-2',
  ENDPOINT 's3.eu-central-2.amazonaws.com'
);
CREATE USER MAPPING
test=# CREATE FOREIGN TABLE demo ()
SERVER parquet_server
OPTIONS (files 's3://###/#.parquet');
CREATE FOREIGN TABLE
test=# SELECT COUNT(*) FROM demo;
  count
---------
 1006448
(1 row)

test=# SELECT * FROM parquet_describe('s3://###/#.parquet');
   column_name   | column_type | null | key | default | extra
-----------------+-------------+------+-----+---------+-------
 title           | VARCHAR     | YES  |     |         |
 link            | VARCHAR     | YES  |     |         |
 activity_status | VARCHAR     | YES  |     |         |
 profile_id      | VARCHAR     | YES  |     |         |
 id              | BIGINT      | YES  |     |         |
(5 rows)

Looks good! Let's disconnect, connect again, and re-run parquet_describe:

test=# SELECT * FROM parquet_describe('s3://###/#.parquet');
ERROR:  HTTP Error: HTTP GET error on 'https://###.s3.amazonaws.com/#.parquet' (HTTP 400)
test=# SELECT COUNT(*) FROM demo;
  count
---------
 1006448
(1 row)

test=# SELECT * FROM parquet_describe('s3://###/#.parquet');
   column_name   | column_type | null | key | default | extra
-----------------+-------------+------+-----+---------+-------
 title           | VARCHAR     | YES  |     |         |
 link            | VARCHAR     | YES  |     |         |
 activity_status | VARCHAR     | YES  |     |         |
 profile_id      | VARCHAR     | YES  |     |         |
 id              | BIGINT      | YES  |     |         |
(5 rows)

As you can see, s3.amazonaws.com is initially used as the endpoint instead of the provided s3.eu-central-2.amazonaws.com. Running a SELECT against the foreign table works, though, and fixes subsequent calls to parquet_describe within the same connection. Another workaround is to set s3_endpoint explicitly:

test=# SELECT duckdb_execute($$SET s3_endpoint = 's3.eu-central-2.amazonaws.com'$$);
 duckdb_execute
----------------

(1 row)

test=# SELECT * FROM parquet_describe('s3://###/#.parquet');
   column_name   | column_type | null | key | default | extra
-----------------+-------------+------+-----+---------+-------
 title           | VARCHAR     | YES  |     |         |
 link            | VARCHAR     | YES  |     |         |
 activity_status | VARCHAR     | YES  |     |         |
 profile_id      | VARCHAR     | YES  |     |         |
 id              | BIGINT      | YES  |     |         |
(5 rows)
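
To illustrate why the first call fails, here is a minimal Python sketch of how a virtual-hosted-style S3 URL is assembled from a bucket, a key, and an endpoint. The helper name `s3_vhost_url`, the bucket `my-bucket`, and the key `data.parquet` are hypothetical and only for illustration; this is not the extension's or DuckDB's actual code. The assumption is that, when no endpoint has been configured on the connection, the global default `s3.amazonaws.com` is used, which matches the host in the error message above.

```python
# Hypothetical helper: builds a virtual-hosted-style S3 URL.
# Not taken from pg_analytics or DuckDB; for illustration only.
def s3_vhost_url(bucket: str, key: str, endpoint: str = "s3.amazonaws.com") -> str:
    # The bucket name becomes a subdomain of the endpoint host.
    return f"https://{bucket}.{endpoint}/{key}"


# Without an endpoint, the global default host is used
# (the shape of the URL seen in the HTTP 400 error).
default_url = s3_vhost_url("my-bucket", "data.parquet")

# With the regional endpoint from the user mapping, the request
# is addressed to the eu-central-2 host instead.
regional_url = s3_vhost_url(
    "my-bucket", "data.parquet", endpoint="s3.eu-central-2.amazonaws.com"
)

print(default_url)
print(regional_url)
```

Under that assumption, setting s3_endpoint (explicitly, or as a side effect of querying the foreign table) changes only the host part of the request, which is why both workarounds make parquet_describe succeed.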

OS:

Docker container based on ghcr.io/cloudnative-pg/postgresql:17.2-33-bookworm, arm64

ParadeDB Version:

[email protected]

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB pg_analytics Extension

Full Name:

Konstantin Nesterov

Affiliation:

CareerLunch AG

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have
@wasd171 wasd171 added the bug Something isn't working label Feb 4, 2025
@philippemnoel philippemnoel added good first issue Good for newcomers priority-high High priority issue user-request This issue was directly requested by a user labels Feb 4, 2025
philippemnoel (Collaborator) commented

Thanks for reporting, fixed in #209
