Replace trino sql with postgresql on site SQL for on premise #5813

Conversation
Summary of Changes

Hello @ydayagi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly refactors the data access and processing mechanisms to support PostgreSQL as a primary data store for on-premise deployments. It introduces a flexible database accessor pattern, allowing the system to seamlessly adapt to either Trino or PostgreSQL environments. The changes primarily focus on abstracting database interactions, updating data ingestion pipelines for Parquet files, and providing conditional feature flag management, all aimed at enhancing deployment flexibility and maintainability for different environments.
Code Review
This pull request introduces a significant architectural change by abstracting database access to support both Trino and PostgreSQL, enabling on-premise deployments to use Postgres instead of a Trino/Hive/Parquet stack. The changes are well-structured, using an abstract base class and factory pattern to switch between database backends based on the ONPREM setting.
My review focuses on the correctness and security of the new implementation. I've identified a few critical issues:
- A potential SQL injection vulnerability in the PostgreSQL accessor.
- A bug in the Trino accessor that would cause SQL syntax errors.
- An incomplete (stubbed) implementation for data deletion in Postgres, which could lead to data inconsistencies.
- A date-handling bug in partition creation logic that would fail at the end of the year.
I've provided detailed comments and suggestions for each of these points. Addressing them will be crucial for the stability and security of the new on-premise data architecture.
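As a back-of-the-envelope illustration of the base-class-plus-factory arrangement described above: the names `ReportDBAccessor` and `get_report_db_accessor` appear in this PR's diffs, but the method bodies below are assumptions about their shape, not the actual implementation.

```python
# Illustrative sketch only: the class/factory names mirror identifiers seen in
# this PR's diffs, but the method bodies are assumptions about their shape.
from abc import ABC, abstractmethod

from django.conf import settings


class ReportDBAccessor(ABC):
    """Generates backend-specific SQL behind a common interface."""

    @abstractmethod
    def get_schema_check_sql(self, schema_name: str) -> str:
        ...


class TrinoReportDBAccessor(ReportDBAccessor):
    def get_schema_check_sql(self, schema_name):
        # Trino also exposes information_schema, but SHOW SCHEMAS is idiomatic.
        return f"SHOW SCHEMAS LIKE '{schema_name}'"


class PostgresReportDBAccessor(ReportDBAccessor):
    def get_schema_check_sql(self, schema_name):
        return f"SELECT 1 FROM information_schema.schemata WHERE schema_name = '{schema_name}'"


def get_report_db_accessor() -> ReportDBAccessor:
    # The factory keys off the deployment flavor, per the review summary above.
    return PostgresReportDBAccessor() if settings.ONPREM else TrinoReportDBAccessor()
```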
```python
def get_schema_check_sql(self, schema_name: str):
    """Return the SQL to check if a schema exists."""
    return f"SELECT 1 FROM information_schema.schemata WHERE schema_name = '{schema_name}'"

def get_table_check_sql(self, table_name: str, schema_name: str):
    return f"SELECT 1 FROM information_schema.tables WHERE table_name = '{table_name}' AND table_schema = '{schema_name}'"
```
These methods construct SQL queries using f-strings with values that could potentially be controlled by external factors, which is a SQL injection risk. While `schema_name` and `table_name` might currently come from trusted sources, it is a security best practice to always use parameterized queries.

A proper fix would involve a broader refactoring:
- The `ReportDBAccessor` interface (e.g., `get_schema_check_sql`) should be changed to return a tuple of `(sql_template, parameters)`.
- The calling method (`_execute_trino_sql` in `report_parquet_processor_base.py`) should be updated to accept these parameters and pass them to `cursor.execute()`.

This would make the database interaction safer and more robust against injection attacks.
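A hedged sketch of that refactoring; the `%s` placeholder style assumes a DB-API driver such as psycopg2 on the Postgres side:

```python
# Sketch of the parameterized variant the reviewer suggests; placeholder
# style (%s) assumes a DB-API driver such as psycopg2.
def get_schema_check_sql(self, schema_name: str):
    """Return (sql_template, params) instead of an interpolated string."""
    return (
        "SELECT 1 FROM information_schema.schemata WHERE schema_name = %s",
        (schema_name,),
    )

# The caller would unpack and pass the parameters through to the driver:
#   sql, params = accessor.get_schema_check_sql(schema_name)
#   cursor.execute(sql, params)
```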
```python
partition_values_lower = [f"'{self._provider_uuid}'", f"'{self._year}'", f"'{self._month}'"]
partition_values_upper = [f"'{self._provider_uuid}'", f"'{self._year}'", f"'{int(self._month)+1:02d}'"]
sql = get_report_db_accessor().get_partition_create_sql(self._schema_name, self._table_name, self._partition_name, partition_values_lower, partition_values_upper)
```
The calculation for `partition_values_upper` does not correctly handle the year-end rollover. When `self._month` is `'12'`, `int(self._month)+1` will become 13, which is an invalid month and will cause an error when creating partitions for December data.

You should use date arithmetic, for example with `dateutil.relativedelta`, to correctly calculate the start of the next month. You will also need to ensure `datetime` is imported in this file (`import datetime`).
Suggested change:

```python
start_date = datetime.date(int(self._year), int(self._month), 1)
next_month_date = start_date + relativedelta(months=1)
partition_values_lower = [f"'{self._provider_uuid}'", f"'{start_date.year}'", f"'{start_date.strftime('%m')}'"]
partition_values_upper = [f"'{self._provider_uuid}'", f"'{next_month_date.year}'", f"'{next_month_date.strftime('%m')}'"]
sql = get_report_db_accessor().get_partition_create_sql(self._schema_name, self._table_name, self._partition_name, partition_values_lower, partition_values_upper)
```
lcouzens left a comment:
So the initial plan for on-prem was OCP data only to start with. Is there a need to update all these cloud provider parts too (AWS, Azure, GCP)?
I'm concerned we are trying to change too much in one go rather than in incremental steps. I'd rather see OCP work exclusively first before we try anything beyond that. It would also help the review process, since it would be more focused.
In addition to that, were you able to run the full IQE integration tests on this locally and get passes? That's the first step I suggest we take: get the tests running and see what results we get compared to main.
```
@@ -0,0 +1,14 @@
SELECT
```
We have no plans today to do subs integration on-prem, so there shouldn't be any need to add Postgres versions of these files. (Everything under subs can be skipped.)
It was easier to convert everything than to pick the 100% necessary parts. I think the tests and CI do not have a clear distinction for OCP only.
@ydayagi testing the image with the on-prem chart raises an error in the cost-onprem-celery-worker-ocp deployment: the ONPREM env variable is set to True for this deployment.
Edit: The issue was solved by rebuilding the image. It seems the image was outdated and didn't contain the latest state of this PR.
```
@@ -0,0 +1,154 @@
CREATE TABLE IF NOT EXISTS {{schema | sqlsafe}}.managed_aws_openshift_daily_temp
```
So technically the SQL directory is also postgres_sql. Instead of postgres_sql as the directory name, can we change it to on_site, since this is SQL that will be running outside of the SaaS moving forward?
OK, though the change is about Postgres instead of Trino; we do not have to tie it to on-site/on-prem.

Hive and Django were never used together, and Django is unaware of Hive. The Django DB API is only used for Postgres. After my changes, the use of Hive and Postgres via Trino is replaced by using Django. Even if you see the word "hive" in a log, it is not really Hive; it is just a log message that was not changed.
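A minimal sketch of what "using Django" means here, assuming the backend-generated SQL is executed through Django's DB-API connection (which points at Postgres) rather than through a Trino cursor; `run_sql` is an illustrative name, not the PR's actual helper:

```python
# Minimal sketch, not the PR's actual helper: run backend-generated SQL
# through Django's default connection, which is Postgres on-prem.
from django.db import connection


def run_sql(sql, params=None):
    with connection.cursor() as cursor:
        cursor.execute(sql, params)
        # Only SELECT-style statements populate cursor.description.
        return cursor.fetchall() if cursor.description else None
```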
It was easier to convert everything; I think the tests (unit, IQE) do not have a 100% OCP-only set. In addition to that, I was told we are going to do OCP on cloud as well; otherwise, what's the point in cost mgmt? And if it is not going to be used, then there is no need to review or test it, but reverting it would be a great effort.
Sorry about that. I had a few fixes missing in the image.
FLPATH-2957 (https://issues.redhat.com/browse/FLPATH-2957)
- check_table_exists(): use get_table_check_sql() instead of SHOW TABLES (a sketch of this change follows below)
- find_expired_partitions(): use get_expired_data_ocp_sql() instead of the $partitions metadata query with date_parse()
- drop_expired_partitions(): use get_delete_by_month_sql() instead of a hardcoded DELETE with the hive. prefix
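A hedged sketch of the first bullet, reusing the accessor names from this commit message; the method body and return-value handling are assumptions:

```python
# Assumed shape of the refactored existence check. get_table_check_sql() is
# the accessor method named above; _execute_trino_sql is the executor
# mentioned earlier in this review thread.
def check_table_exists(self):
    sql = get_report_db_accessor().get_table_check_sql(self._table_name, self._schema_name)
    rows = self._execute_trino_sql(sql)
    return bool(rows)
```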
When ONPREM is True, use MockUnleashClient instead of real UnleashClient to avoid dependency on Unleash service.
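A sketch of how that selection might look; MockUnleashClient's constructor and the exact settings names (UNLEASH_URL, the app name) are assumptions:

```python
# Hedged sketch: pick the feature-flag client based on the ONPREM setting.
from django.conf import settings
from UnleashClient import UnleashClient


def build_unleash_client():
    if settings.ONPREM:
        # No Unleash service on-prem; the mock answers is_enabled() locally.
        return MockUnleashClient()  # assumed stub introduced in this PR
    return UnleashClient(url=settings.UNLEASH_URL, app_name="koku")
```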
Codecov Report

❌ Patch coverage is 60.6%.
❌ Your patch check has failed because the patch coverage (60.6%) is below the target coverage (90.0%). You can increase the patch coverage or adjust the target coverage.

```
@@           Coverage Diff            @@
##            main    #5813     +/-   ##
========================================
- Coverage    94.3%    93.7%    -0.6%
========================================
  Files         345      348       +3
  Lines       29740    30155     +415
  Branches     3239     3277      +38
========================================
+ Hits        28054    28260     +206
- Misses       1098     1298     +200
- Partials      588      597       +9
```
Jira Ticket
FLPATH-2822
Description
Replacing Trino+Hive+Parquet with Postgres for on-prem deployment
Testing
Release Notes