feat(go/adbc/driver/snowflake): improve GetObjects performance and semantics #2254
Conversation
CC @davidhcoe Can you take a look at this and confirm it fixes your issues, with the exception of the All and Columns depths?
```go
if before[len(before)-1] != '\\' {
	b.WriteByte('\\')
}
```
I suppose this is to handle pre-escaped characters? But what if the escape is itself escaped? (Or is that not allowed?)
I guess not from our own spec 😅, we specify escapes aren't supported at all
> I guess not from our own spec 😅, we specify escapes aren't supported at all
Yeah, I pointed this out in #1508
Maybe it's time I open a branch for 1.2.0...
> Maybe it's time I open a branch for 1.2.0...
I've started some work on that in conjunction with multiple results sets. https://github.com/CurtHagenlocher/arrow-adbc/tree/MoreResults
Yeah, this is intended to handle pre-escaped characters. The logic is taken from Snowflake's JDBC driver; I figured handling one level of escaping was sufficient given our current spec. 😄
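The one-level escaping being discussed can be sketched as a small helper. `escapeForLike` below is a hypothetical stand-in for the driver's actual escaping routine, shown only to illustrate the `before[len(before)-1] != '\\'` idea: wildcards get a backslash unless one is already there, so pre-escaped input is escaped exactly once.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeForLike is an illustrative sketch (not the driver's code): single
// quotes are doubled for the SQL string literal, and a backslash is inserted
// before the LIKE wildcards _ and % only when the character is not already
// preceded by a backslash, so pre-escaped input passes through unchanged.
func escapeForLike(s string) string {
	var b strings.Builder
	var prev rune
	for _, c := range s {
		switch c {
		case '\'':
			b.WriteString("''") // double single quotes inside the literal
		case '_', '%':
			if prev != '\\' {
				b.WriteByte('\\') // escape the wildcard unless already escaped
			}
			b.WriteRune(c)
		default:
			b.WriteRune(c)
		}
		prev = c
	}
	return b.String()
}

func main() {
	fmt.Println(escapeForLike("MY_TABLE"))  // MY\_TABLE
	fmt.Println(escapeForLike(`MY\_TABLE`)) // MY\_TABLE (unchanged)
	fmt.Println(escapeForLike("it's"))      // it''s
}
```

As the comments above note, a doubly-escaped input (`\\_`) is not handled here either, matching the one-level assumption.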
> Maybe it's time I open a branch for 1.2.0...
>
> I've started some work on that in conjunction with multiple results sets. https://github.com/CurtHagenlocher/arrow-adbc/tree/MoreResults
Oh this is great!
```go
gQueryIDs.Go(func() error {
	return conn.Raw(func(driverConn any) (err error) {
		query := "SHOW TERSE /* ADBC:getObjectsDBSchemas */ DATABASES"
		if catalog != nil && len(*catalog) > 0 && *catalog != "%" && *catalog != ".*" {
			query += " LIKE '" + escapeSingleQuoteForLike(*catalog) + "'"
		}
		query += " IN ACCOUNT"

		terseDbQueryID, err = getQueryID(gQueryIDsCtx, query, driverConn)
		return
	})
```
Some of these are repeated across cases. Could we extract them out of the switch-case to avoid the duplication?
Thanks @zeroshade! Any rough performance numbers?
@joellubi Most of the performance actually came from the improved handling of the channels rather than the query switch itself. The way the channels were being handled caused bottlenecks: we weren't using buffered channels, and the record reader was being passed through a channel instead of just being used directly. Switching up the management of the channels led to about a 25% improvement in performance by removing the blocking. My tests showed a drop from ~5s to ~3.5s for a large GetObjects scenario. About 2/3 of the time is the raw Snowflake execution, which for the ADBC account takes a total of around 2-3 seconds depending on the query.
Ah cool, the record reader handling is much cleaner now. Not sure why I did it that way originally. Good catch on increasing the buffer size for the channel. I did think that could be a bottleneck, which is why I didn't make it unbuffered, but I didn't think it would be so significant. I also couldn't think of a value to use that didn't feel somewhat arbitrary. Maybe it's worth making it configurable?
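The buffering effect under discussion can be shown with a minimal producer/consumer sketch (the function name and sizes are made up for illustration; the driver's actual code differs). With a buffer of zero every send blocks until the consumer is ready, forcing the producer and consumer into lock-step; a buffered channel lets the producer run ahead, which is where the removed blocking came from.

```go
package main

import "fmt"

// sumViaChannel illustrates the buffering trade-off: buffer == 0 makes every
// send block until a receive is ready, while buffer > 0 lets the producer
// run ahead of the consumer. The result is identical either way; only the
// amount of blocking differs.
func sumViaChannel(buffer, n int) int {
	ch := make(chan int, buffer)
	go func() {
		for i := 0; i < n; i++ {
			ch <- i // blocks only when the buffer is full
		}
		close(ch)
	}()
	total := 0
	for v := range ch {
		total += v
	}
	return total
}

func main() {
	fmt.Println(sumViaChannel(0, 10)) // unbuffered: 45
	fmt.Println(sumViaChannel(8, 10)) // buffered: 45, with far less lock-step blocking
}
```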
@joellubi I'll switch it over.
```go
return conn.Raw(func(driverConn any) (err error) {
	query := "SHOW TERSE /* ADBC:getObjectsCatalogs */ DATABASES"
	if catalog != nil && len(*catalog) > 0 && *catalog != "%" && *catalog != ".*" {
		query += " LIKE '" + escapeSingleQuoteForLike(*catalog) + "'"
```
I believe this will be a case-sensitive search (LIKE), and will it also treat names with underscores as wildcards?
The `LIKE` keyword in the `SHOW` commands is actually case-insensitive according to the docs (https://docs.snowflake.com/en/sql-reference/sql/show-tables). But it does treat underscores like a `LIKE` comparison, though we do say in the docs that the arguments for "catalog" and such are treated as patterns if they include wildcards like `_` and `%`.
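Those pattern semantics can be sketched by translating a LIKE pattern into a regular expression (purely illustrative; the driver does no such translation): `_` matches exactly one character, `%` matches any run, and the `SHOW ... LIKE` comparison is case-insensitive.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// likeToRegexp is an illustrative sketch of SHOW ... LIKE semantics:
// `_` matches exactly one character, `%` matches any run of characters,
// and matching is case-insensitive. Everything else is matched literally.
func likeToRegexp(pattern string) *regexp.Regexp {
	var b strings.Builder
	b.WriteString("(?i)^")
	for _, c := range pattern {
		switch c {
		case '_':
			b.WriteByte('.') // exactly one character
		case '%':
			b.WriteString(".*") // any run of characters
		default:
			b.WriteString(regexp.QuoteMeta(string(c)))
		}
	}
	b.WriteString("$")
	return regexp.MustCompile(b.String())
}

func main() {
	fmt.Println(likeToRegexp("MY_DB").MatchString("my0db"))      // true: _ is a wildcard
	fmt.Println(likeToRegexp("MY%").MatchString("MY_OTHER_DB"))  // true
	fmt.Println(likeToRegexp("MY_DB").MatchString("MYDB"))       // false: _ must consume one char
}
```

This is why a catalog name like `MY_DB`, passed through unescaped, can match `MYXDB` as well, the behavior the review comment above is pointing at.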
Finally got the unit tests and validation tests passing for this. Can I get one more review pass please?
```diff
@@ -2180,15 +2180,15 @@ void StatementTest::TestSqlBind() {
   ASSERT_THAT(
       AdbcStatementSetSqlQuery(
-          &statement, "SELECT * FROM bindtest ORDER BY \"col1\" ASC NULLS FIRST", &error),
+          &statement, "SELECT * FROM bindtest ORDER BY col1 ASC NULLS FIRST", &error),
```
Do we perhaps need a quirk for escaping column names?
(I also wouldn't be opposed to trying to make these tests more data-driven...I should go find time to sketch it out)
It's more about consistency. Our `CREATE TABLE` query earlier in this function doesn't quote the column names, so our SELECT statement needs to also not quote the names. Almost everywhere else we quote the columns; we just need to be consistent.

That said, I agree it would be awesome for these tests to be more data-driven.
Fixes #2171
Improves the channel handling and query building for metadata conversion to Arrow for better performance.
For all cases except when retrieving Column metadata, we'll now utilize `SHOW` queries and build the patterns into those queries. This allows those `GetObjects` calls with appropriate depths to be called without having to specify a current database or schema.
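The depth behavior the description claims can be sketched as a simple dispatch (a hypothetical illustration; the depth names mirror ADBC's object depths, but this function is not part of the driver): every depth short of Columns is served by `SHOW` commands alone, so no current database/schema is needed on the session.

```go
package main

import "fmt"

// usesShowQueries is an illustrative sketch of the dispatch described above:
// the Catalogs, DBSchemas, and Tables depths can be answered entirely with
// SHOW commands, while Columns (and therefore All) still needs per-column
// metadata that SHOW alone doesn't provide.
func usesShowQueries(depth string) bool {
	switch depth {
	case "catalogs", "dbSchemas", "tables":
		return true
	default: // "columns" and "all"
		return false
	}
}

func main() {
	for _, d := range []string{"catalogs", "dbSchemas", "tables", "columns"} {
		fmt.Printf("%s: SHOW-only=%v\n", d, usesShowQueries(d))
	}
}
```

This also matches the review thread at the top: the remaining open question was the All and Columns depths, which are exactly the cases not covered by the `SHOW`-only path.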