[v0/v1 migration] Implement CheckVariableGroupExistence to support StatVarGroup existence checks in Spanner#1892
Conversation
…ce checks in Spanner datasource
There was a problem hiding this comment.
Code Review
This pull request introduces the CheckVariableGroupExistence method to the Spanner client to support existence checks for variable groups and topics. The Observation data source has been updated to split incoming variables into standard statistical variables and groups/topics, routing them to the appropriate existence check logic. Feedback suggests avoiding the hardcoded string offset used for prefix stripping, ensuring that the import filter logic correctly handles requests with mixed entity types rather than assuming uniformity based on the first entity, and restoring the warning logs for malformed database rows to maintain observability.
keyurva
left a comment
There was a problem hiding this comment.
Thanks Nick! This was a deep change and would've taken a lot of figuring out and getting things in order. Much appreciated!
| if isExistenceRequest { | ||
| rows, err := sds.client.CheckVariableExistence(ctx, variables, entities) | ||
| if len(variables) == 0 { | ||
| rows, err := sds.client.CheckVariableExistence(ctx, variables, entities) |
There was a problem hiding this comment.
Can we create helpers like handleEntityOnlyExistence, sortNodesByType, etc. for better readability?
| continue | ||
| } | ||
| obs = append(obs, &Observation{ | ||
| VariableMeasured: row[0], |
There was a problem hiding this comment.
For my understanding: This is interesting. So the current implementation returns SVs, SVGs and topics under VariableMeasured?
|
|
||
| // CheckVariableGroupExistence checks for the existence of observations for the given variable groups/topics and entities. | ||
| // Returns a slice of rows, where each row contains [variable, entity] that has at least one observation. | ||
| func (sc *spannerDatabaseClient) CheckVariableGroupExistence(ctx context.Context, variables []string, entities []string) ([][]string, error) { |
There was a problem hiding this comment.
variables here are actually groups or topics, correct? If so, should we name them groupsOrTopics or something similar?
| // We are doing this because a query on root involve a massive join involving | ||
| // a large number of variables (causing a slow query and risk of timeout). | ||
| stmtRoots := spanner.Statement{ | ||
| SQL: `SELECT DISTINCT subject_id |
There was a problem hiding this comment.
Our pattern is to define statements in statements.go. Can we do that for these statements as well?
| stmtRoots := spanner.Statement{ | ||
| SQL: `SELECT DISTINCT subject_id | ||
| FROM Edge | ||
| WHERE predicate = 'specializationOf' |
There was a problem hiding this comment.
Can you describe how topics are handled because they won't have any specializationOf predicates.
|
|
||
| // CheckVariableGroupExistence checks for the existence of observations for the given variable groups/topics and entities. | ||
| // Returns a slice of rows, where each row contains [variable, entity] that has at least one observation. | ||
| func (sc *spannerDatabaseClient) CheckVariableGroupExistence(ctx context.Context, variables []string, entities []string) ([][]string, error) { |
There was a problem hiding this comment.
To manage the size of this method, can we create helpers - e.g. identifyRoots, handleRoots, handleNonRoots, etc.?
| params := map[string]interface{}{} | ||
| for _, p := range keys { | ||
| ents := entitiesByPredicate[p] | ||
| filters = append(filters, fmt.Sprintf(`(predicate = '%s' AND object_id IN UNNEST(@%s))`, p, p)) |
There was a problem hiding this comment.
I believe (not certain) these filter statements are declared in statements.go as well. Let's declare them there if that's the case for other filters as well.
| } | ||
|
|
||
| sourceEdgesStmt := spanner.Statement{ | ||
| SQL: `SELECT DISTINCT subject_id, object_id |
There was a problem hiding this comment.
Can this be declared in statements.go as well?
| // Handle non-roots | ||
| if len(nonRoots) > 0 { | ||
| existenceQueryStmt := spanner.Statement{ | ||
| SQL: `SELECT DISTINCT e.object_id AS variable, o.import_name |
There was a problem hiding this comment.
Similar statements.go comment as others.
|
|
||
| // CheckVariableGroupExistence checks for the existence of observations for the given variable groups/topics and entities. | ||
| // Returns a slice of rows, where each row contains [variable, entity] that has at least one observation. | ||
| func (sc *spannerDatabaseClient) CheckVariableGroupExistence(ctx context.Context, variables []string, entities []string) ([][]string, error) { |
There was a problem hiding this comment.
nit: From offline chat, we learned that variables are actually SVGs or topics and entities are sources.
And the outputs are SVGs / topics and import names.
Let's rename variables / add comments wherever appropriate to make this clear.
Issue
b/505048261
Description
Legacy observation existence checks were able to resolve checks for SVGs in addition to stat vars. This functionality was used to power the "Data source" and "Dataset" filters in the Stat Var Explorer. These checks used the BT cache, which is no longer available in the Spanner path and resulted in this functionality being broken.
This PR brings this functionality to Spanner by bringing existence check support to SVGs, specifically for imports / datasets.
Notes
Mirroring legacy code, we use a substring slice (
importName = subjectID[8:]) to extract the raw import name from thesubject_idfound in theEdgetable. We need to do this because theEdgetable stores source nodes with a prefix (dc/base/AustraliaStatistics) while observations stores the name itselfAustraliaStatisticsin itsimport_namecolumn. We are doing this in Go here (to avoid a substring in a join in Spanner). As a separate consideration from this PR, we could look at whether this shouldbe resolved ad the schema or data ingestion level so that the join can be performed on exact matches without string manipulation (this method does seem brittle).Separately, this PR implements the SVG level existence checks, but only for imports and datasets (but not entities such as places). This is the only thing needed for the functionality as used here; however we could consider expanding this in another PR as a follow-up.
Testing
In the stat var explorer, when on master but with Spanner on, the "Data source" filter is empty.
When on this branch, the "Data source" filter is populated and should match production. Selecting "Data sources" should then both filter the SVGs and populate the "Dataset" list, with outputs that match production.