@Tpt
Contributor

@Tpt Tpt commented Dec 12, 2025

Makes it clear that inserting SAMPLE is something the user can do, not something a SPARQL implementation should do automatically

Close #165


@Tpt Tpt requested review from afs, hartig, kasei and rubensworks December 12, 2025 17:13
@Tpt Tpt self-assigned this Dec 12, 2025
Co-authored-by: Olaf Hartig <[email protected]>
@kasei
Contributor

kasei commented Dec 13, 2025

Is this section even true? I'm trying to square the language in this note (which exists in similar form in the 1.1 spec) with the translation in 18.3.4.1 Grouping and Aggregation which includes:

For each (X AS Var) in SELECT, each HAVING(X), and each ORDER BY X in Q
  For each unaggregated variable V in X
      Replace V with SAMPLE(V)

I read that as supporting STR(?x) as a valid expression for X.

I think the language in this section is probably true of the algebra, but I understood that to be the entire point of the translation logic that wraps un-grouped and un-aggregated variables in SAMPLE. Am I missing something?

I also think this note is a bit confusing, because it's not clear if it is describing a restriction on the example just presented (in which the presence of grouping on STR(?x) does not imply that you can project that expression without the extra AS clause) or if it's describing a general rule (which as I mention above, I'm now unsure about).
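
To make the rewriting step quoted above concrete, here is a minimal Python sketch of "Replace V with SAMPLE(V)" applied to an expression such as STR(?x). The `Var` and `Call` classes, the `AGGREGATES` set, and the helper names are hypothetical stand-ins invented for this illustration. They are not taken from the spec or from any implementation, and the sketch only models the literal reading of that one line of the translation.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Call:
    func: str
    args: Tuple["Expr", ...]


Expr = Union[Var, Call, str]

# Assumed set of aggregate names for this illustration.
AGGREGATES = {"COUNT", "SUM", "MIN", "MAX", "AVG", "SAMPLE", "GROUP_CONCAT"}


def wrap_unaggregated(expr):
    """Replace every variable that is outside an aggregate with SAMPLE(variable)."""
    if isinstance(expr, Var):
        return Call("SAMPLE", (expr,))
    if isinstance(expr, Call):
        if expr.func in AGGREGATES:
            return expr  # variables inside an aggregate are left untouched
        return Call(expr.func, tuple(wrap_unaggregated(a) for a in expr.args))
    return expr  # constants pass through unchanged


def show(expr):
    """Render the expression tree in SPARQL-like syntax."""
    if isinstance(expr, Var):
        return "?" + expr.name
    if isinstance(expr, Call):
        return f"{expr.func}({', '.join(show(a) for a in expr.args)})"
    return str(expr)


print(show(wrap_unaggregated(Call("STR", (Var("x"),)))))    # STR(SAMPLE(?x))
print(show(wrap_unaggregated(Call("COUNT", (Var("o"),)))))  # COUNT(?o)
```

Under this reading, STR(?x) is a legal X and simply becomes STR(SAMPLE(?x)), which is the behaviour the note appears to rule out.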

@Tpt Tpt changed the title from "Clarify a note on unagreggated variables" to "Clarify that unagreggated variables in SELECT raise an error" Dec 13, 2025
@Tpt Tpt requested a review from hartig December 13, 2025 11:08
@Tpt Tpt added the labels spec:substantive (change in the spec affecting its normative content, class 3) and spec:bug (change fixing a bug in the specification, class 3), and removed spec:substantive, Dec 13, 2025
@Tpt
Contributor Author

Tpt commented Dec 13, 2025

@kasei This is a great point! Thank you! I overlooked it even though it was in the issue motivating the MR (#165).

I think the SPARQL 1.1 specification is inconsistent and the 18.3.4.1 Grouping and Aggregation section contradicts 11.4 Aggregate Projection Restrictions. The test suite seems to take the 11.4 approach of not inserting SAMPLE but raising an error (tests agg08, agg09, agg10 and agg11). I edited the MR to go in this direction, but this is something that should be discussed by the WG.

Current quick survey:

  • error: Jena, Blazegraph, Oxigraph, QLever and Communica
  • insertion of SAMPLE: Virtuoso and RDFLib

@kasei
Contributor

kasei commented Dec 14, 2025

@kasei This is a great point! Thank you! I overlooked it even though it was in the issue motivating the MR (#165).

I think the SPARQL 1.1 specification is inconsistent and the 18.3.4.1 Grouping and Aggregation section contradicts 11.4 Aggregate Projection Restrictions. The test suite seems to take the 11.4 approach of not inserting SAMPLE but raising an error (tests agg08, agg09, agg10 and agg11). I edited the MR to go in this direction, but this is something that should be discussed by the WG.

I don't think agg08 shows this. It has grouping on the expression (?O1 + ?O2) and then tries to use that same expression in projection, so it isn't about a "simple expressions consisting of just a variable". agg09, agg10, and agg11 similarly don't seem to be about using such a grouped "simple expression" in a non-simple projection.

Current quick survey:

  • error: Jena, Blazegraph, Oxigraph, QLever and Communica
  • insertion of SAMPLE: Virtuoso and RDFLib

That's interesting. Ignoring current implementations, my feeling here is the insertion case makes the most intuitive sense. Both aggregated values and grouped values should be available for use in any (non-aggregation) select expressions.

@Tpt
Contributor Author

Tpt commented Dec 14, 2025

I don't think agg08 shows this. It has grouping on the expression (?O1 + ?O2) and then tries to use that same expression in projection, so it isn't about a "simple expressions consisting of just a variable".

Yes, it is indeed not about a single variable. My bad.

agg09, agg10, and agg11 similarly don't seem to be about using such a grouped "simple expression" in a non-simple projection.

It seems to me we are not focusing on the same cases. I am focusing on "simple expressions consisting of just a variable" in the projection. I still think agg09 (and agg10) are relevant. Taking agg09:

SELECT ?P (COUNT(?O) AS ?C)
WHERE { ?S ?P ?O } GROUP BY ?S

There are two possible outcomes:

  1. We apply section 11.4 Aggregate Projection Restrictions:
     In a query level which uses aggregates, only expressions consisting of aggregates and constants may be projected, with one exception. When GROUP BY is given with one or more simple expressions consisting of just a variable, those variables may be projected from the level.
     Then this query is invalid because `?P` in the `SELECT` is not a grouped value.
  2. We apply the 18.3.4.1 Grouping and Aggregation algorithm and the line
     Replace V with SAMPLE(V)
     Then the query becomes equivalent to
     SELECT (SAMPLE(?P) AS ?P) (COUNT(?O) AS ?C)
     WHERE { ?S ?P ?O } GROUP BY ?S

and we get results.

Hence my impression that the spec is inconsistent.

My MR takes the first approach and changes the algorithm, but I don't feel strongly about it. I just think we should be consistent and make a choice here.
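
The two readings of agg09 can be contrasted with a small Python model. This is only a sketch under stated assumptions: the `TRIPLES` data is made up, `eval_strict` and `eval_with_sample` are hypothetical helper names, and neither function is how any of the surveyed engines actually works. Because SAMPLE may return any value from the group, the second function reports the set of admissible ?P values per group instead of picking one.

```python
from collections import defaultdict

# Hypothetical data standing in for the matches of { ?S ?P ?O }.
TRIPLES = [
    ("s1", "p1", "a"),
    ("s1", "p2", "b"),
    ("s2", "p1", "c"),
]


def eval_strict(projected, group_keys):
    """Reading 1 (11.4): a plain projected variable must be a GROUP BY variable."""
    for var in projected:
        if var not in group_keys:
            raise ValueError(f"?{var} is projected but not grouped")


def eval_with_sample(triples):
    """Reading 2 (translation): ?P is treated as SAMPLE(?P), grouped by ?S."""
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[s].append((p, o))
    rows = []
    for s in sorted(groups):
        admissible_p = {p for p, _ in groups[s]}  # any of these may be sampled
        rows.append((admissible_p, len(groups[s])))  # second item is COUNT(?O)
    return rows


try:
    eval_strict(projected=("P",), group_keys=("S",))
except ValueError as err:
    print("reading 1:", err)

for admissible_p, count in eval_with_sample(TRIPLES):
    print("reading 2: ?P in", sorted(admissible_p), "?C =", count)
```

With this data, reading 1 rejects the query, while reading 2 returns two rows, the first of which may carry either p1 or p2 for ?P.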

@hartig
Contributor

hartig commented Dec 15, 2025

In light of the discussion, looking again at the following sentence from 11.4 Aggregate Projection Restrictions [SPARQL 1.1]:

Other expressions, not using GROUP BY variables, or aggregates may have non-deterministic values projected from their groups using the SAMPLE aggregate.

(which was the sentence that you wanted to improve with this PR), maybe this sentence was indeed meant to say that SAMPLE will be injected implicitly rather than saying that SAMPLE would have to be used explicitly. If that was indeed the meaning, then the sentence is consistent with the

Replace V with SAMPLE(V)

of the translation algorithm in 18.2.4.1 Grouping and Aggregation [SPARQL 1.1]. However, I would still consider this inconsistent with the following first paragraph of 11.4 Aggregate Projection Restrictions [SPARQL 1.1]:

In a query level which uses aggregates, only expressions consisting of aggregates and constants may be projected, with one exception. When GROUP BY is given with one or more simple expressions consisting of just a variable, those variables may be projected from the level.

This one says, indirectly but clearly, that it is not allowed to project a variable unless the variable is a GROUP BY variable.

So, in any case, there is an inconsistency in the spec.

Regarding the two possible outcomes, I would find the first one (throwing an error, as done by Jena, Blazegraph, Oxigraph, QLever and Communica) to be more intuitive. The PR in its current form (with commit 3f3e302 included) captures this option.

Yet, I agree that this is something to be discussed, at least in the TF if not the whole WG, because no matter which of the two options we pick, it is a change for some implementations.

@afs
Contributor

afs commented Dec 15, 2025

Not completely sure but I think SAMPLE is inserted to put in an aggregate.
Then the next line ("For each aggregate") only has to consider aggregates.

I think the language in this section is probably true of the algebra, but I understood that to be the entire point of the translation logic that wraps un-grouped and un-aggregated variables in SAMPLE. Am I missing something?

agg08/agg09/agg10 are negative syntax tests so, yes, they don't get to the algebra.


Does anyone have an example query where it would hit that SAMPLE injection clause?

@TallTed
Member

TallTed commented Dec 17, 2025

It's puzzles like this that make me wish for Invited Experts who participated in the previous WG cycle(s) for the specs being worked on in the current WG. We can guess at what was intended, but that always runs the risk of guessing wrong, and, more so, of overlooking some significant issue that led to the current spec not choosing path x (though we would hope they would have explicitly ruled it out).

Given the lack of clarity in the existing spec, I don't think we should decide based on a poll of the current implementations for what they've done. Even in the time crunch, I think this calls for some exploration of "what might happen if", good and bad, for each option.

@afs
Contributor

afs commented Dec 17, 2025

Invited Experts

I was there (member organisation, not IE).

After 12+ years, it is prudent to not make absolute claims.

@TallTed
Member

TallTed commented Dec 17, 2025

(I meant, as IEs in this WG, presuming that they were no longer with member organizations. I don't really care what their status was in the past group(s), nor would be in this group, as long as they participated!)

@hartig
Contributor

hartig commented Dec 18, 2025

First off, I agree with @TallTed that it would be better to have the original editors (or, more precisely, the original author(s) of this part of the spec) around here to help us understand what exactly the intention of such ambiguous sentences was.

Yet, as they do not seem to be here, we have to interpret the spec text as given. Doing so, I observe that we (several of the new editors) seem to agree that there are different ways to interpret the given text. Additionally, as illustrated by @Tpt's quick test, vendors also came to different interpretations.

Therefore, as one of the new editors, I would like to fix the apparent ambiguity. And, without the previous author(s) of this part of the spec being here to help us interpret their text, and given that implementations already diverge on this aspect of the spec, I would say it is okay to tighten this aspect of the spec by picking the option that we think is most reasonable.

@kasei
Contributor

kasei commented Dec 18, 2025

I'm afraid I've somewhat lost track of the exact issue we're discussing here, and worry that some of this may have been caused by my initial confusion about the proposed text. I think the test case examples pointed to earlier by @Tpt (agg08, agg09, agg10 and agg11 tests) all align with my expectations, but I didn't think any of those showed the problem that we were discussing (which I believed was about using GROUP BY on a simple variable, and then using that variable in a more complex, non-aggregate function in projection). Does that align or conflict with others' understanding of the discussion?

Beyond that confusion, I am somewhat alarmed at the appended commit here that raises a new error during translation. I believe it makes some very simple patterns invalid:

For each (X AS Var) in SELECT, each HAVING(X), and each ORDER BY X in Q
  For each unaggregated variable V in X
      Raise an error

For example, for a query with GROUP BY ?x, this means you can't project (?x AS ?y), can't ORDER BY ?x, and can't use HAVING (?x) (and more complex and possibly more meaningful variations). That seems very problematic to me.
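
The concern can be sketched in Python by comparing a literal reading of the new "Raise an error" rule with a variant that exempts group keys. Both checkers and the `is_accepted` helper are hypothetical names for this illustration, and whether "unaggregated variable" was meant to include group keys is exactly the open question, so the first function only models the literal reading described above.

```python
def check_naive(vars_outside_aggregates):
    """Literal reading: any variable outside an aggregate is an error."""
    if vars_outside_aggregates:
        names = ", ".join("?" + v for v in sorted(vars_outside_aggregates))
        raise ValueError("unaggregated variable(s): " + names)


def check_exempting_group_keys(vars_outside_aggregates, group_keys):
    """Variant: only variables that are not group keys are an error."""
    offending = set(vars_outside_aggregates) - set(group_keys)
    if offending:
        names = ", ".join("?" + v for v in sorted(offending))
        raise ValueError("not a group key: " + names)


def is_accepted(check, *args):
    """Return True when the check passes, False when it raises ValueError."""
    try:
        check(*args)
        return True
    except ValueError:
        return False


# GROUP BY ?x with SELECT (?x AS ?y): ?x is the only variable outside an aggregate.
print(is_accepted(check_naive, {"x"}))                        # False
print(is_accepted(check_exempting_group_keys, {"x"}, {"x"}))  # True
# GROUP BY ?x with SELECT (?v AS ?y): ?v is not a group key.
print(is_accepted(check_exempting_group_keys, {"v"}, {"x"}))  # False
```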

@afs
Contributor

afs commented Dec 18, 2025

agg08 (reformatted, simplified) is a negative syntax test.

PREFIX : <http://www.example.org>

SELECT ((?o1 + ?o2) AS ?o12)
WHERE { ?s :p ?o1; :q ?o2 }
GROUP BY (?o1 + ?o2)
ORDER BY ?o12

The user can get the effect by assigning the expression, which fixes the value as GROUP happens, before the various aspects of the SELECT clause. This is the point of agg08b (reformatted, simplified) -- "grouping by expression, done correctly"

PREFIX : <http://www.example.org>
SELECT (?X AS ?o12)
WHERE { ?s :p ?o1; :q ?o2 }
GROUP BY (?o1 + ?o2 AS ?X)
ORDER BY ?o12

SELECT ?x WHERE { ?x :p 123 } GROUP BY ?x

the query is covered by the text:

For each variable V appearing outside of an aggregate
  Ai := Aggregation(V, Sample, {}, Grp)
  E := E append (V, aggi)
  i := i + 1
  End

Greg's example is one of three cases:

## Group key
SELECT (?x AS ?y) WHERE { ?x :p ?v } GROUP BY ?x

and also consider:

## Non-group key
SELECT (?v AS ?y) WHERE { ?x :p ?v } GROUP BY ?x
## New variable
SELECT (?z AS ?y) WHERE { ?x :p ?v } GROUP BY ?x

As Olaf quoted (11.4 Aggregate Projection Restrictions):

In a query level which uses aggregates, only expressions consisting of aggregates and constants may be projected, with one exception. When GROUP BY is given with one or more simple expressions consisting of just a variable, those variables may be projected from the level.

Do we agree that

  • (?x AS ?y) is intended to work?
  • (?v AS ?y) is ?v as an expression and is intended to be a syntax error? (by 11.4)
  • (?z AS ?y) is ?z as an expression and is intended to be a syntax error? (by 11.4)

In the text:

For each (X AS Var) in SELECT, each HAVING(X), and each ORDER BY X in Q
  For each unaggregated variable V in X
      Replace V with SAMPLE(V)
      End

"unaggregated variable" is not mentioned anywhere else.

X isn't described, but

Replace V with SAMPLE(V)
...
For each aggregate R(args ; scalarvals) now in X

"in X" suggests an expression, but X as an arbitrary expression does not work without a further condition.

If X is used twice, it could be different values from SAMPLE.

Note that the Sample function is not required to be deterministic for a given input. The only restriction is that the output value must be present in the input sequence.

In (?v + ?v) AS ?w, which is rewritten to (SAMPLE(?v) + SAMPLE(?v)) AS ?w, ?w is not necessarily 2*?v.
And SAMPLE(?v) < SAMPLE(?v) is bizarre.

We can make that an erratum if it is not already covered.
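
A minimal Python sketch of this point, assuming each SAMPLE occurrence is evaluated independently (the quoted note allows that, though an implementation need not behave this way). Rather than picking at random, it enumerates every value the rewritten expression could take. The function names and the example group are hypothetical.

```python
from itertools import product


def possible_values_of_sum(group_values):
    """Every value (SAMPLE(?v) + SAMPLE(?v)) can take with independent picks."""
    return {a + b for a, b in product(group_values, repeat=2)}


def doubled_values(group_values):
    """Every value 2*?v can take for a single ?v from the group."""
    return {2 * v for v in group_values}


group = [1, 2]  # hypothetical ?v values within one group
print(sorted(possible_values_of_sum(group)))  # [2, 3, 4]
print(sorted(doubled_values(group)))          # [2, 4]
```

The value 3 is reachable only when the two SAMPLE calls pick different members of the group, which is the case where ?w is not 2*?v.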

One charitable reading of "unaggregated variable" is as a variable used as GROUP BY (scoping hides variables
in the group that are not part of grouping).

The SELECT clause mixes up several steps, in particular the aggregation step and post-aggregation usage.
(aside: https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/pipe-syntax)

The opposite to 11.4 is that ?v in (?v AS ?y) could be understood as a new variable (scoping hides the inner ?v).
The ?z in (?z AS ?y) could also be understood as a new variable, but again not per 11.4.

Overall, I think that we already have a principle that, after grouping, at a given query level, the set of variables allowed outside aggregates is only the variables used directly for group keys values (GROUP BY ?x , and ?E in GROUP BY (?v+1 AS ?E)) i.e. in-scope, and any variables introduced left-to-right by SELECT AS. (11.4 needs to be revised to cover the "introduce and then use" case.)


Summary:

  • Define "unaggregated variable" as a variable used as a group key, including AS ?var. Rename.
  • Revise 11.4 to cover both introduced variables and "non-aggregated" ones - the use of SAMPLE in the definition makes them aggregated after translation.
  • Add a new item to the grammar rules for 11.4 stating that variables not in scope after GROUP BY cannot be used in expressions in SELECT (expr AS ?var) when grouping/aggregation is in effect.
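
The principle summarized above (after grouping, only group-key variables and variables introduced left to right by SELECT AS may appear outside aggregates) could be checked along these lines. This is a sketch under assumptions, not proposed spec text: each SELECT item is reduced by hand to a pair of the variables it uses outside aggregates and the variable it introduces, and `check_select` and `accepted` are hypothetical names.

```python
def check_select(select_items, group_key_vars):
    """Check SELECT items left to right.

    select_items: list of (vars_outside_aggregates, introduced_var) pairs,
    where introduced_var is the AS variable, or None for a plain variable.
    """
    in_scope = set(group_key_vars)
    for used, introduced in select_items:
        out_of_scope = set(used) - in_scope
        if out_of_scope:
            names = ", ".join("?" + v for v in sorted(out_of_scope))
            raise ValueError(f"not in scope after GROUP BY: {names}")
        if introduced is not None:
            in_scope.add(introduced)  # visible to later SELECT items
    return in_scope


def accepted(select_items, group_key_vars):
    """Return True when the SELECT items pass the scope check."""
    try:
        check_select(select_items, group_key_vars)
        return True
    except ValueError:
        return False


# All queries below use GROUP BY ?x.
print(accepted([({"x"}, "y")], {"x"}))  # (?x AS ?y): True
print(accepted([({"v"}, "y")], {"x"}))  # (?v AS ?y): False
print(accepted([({"z"}, "y")], {"x"}))  # (?z AS ?y): False
# (COUNT(?v) AS ?c) (?c + 1 AS ?d): ?c is introduced first, then used.
print(accepted([(set(), "c"), ({"c"}, "d")], {"x"}))  # True
```

For GROUP BY (?v+1 AS ?E), the group-key set passed in would be {"E"}, so ?E is usable while ?v is not.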

@afs
Contributor

afs commented Dec 18, 2025

It looks like there is a further simplification:

For each variable V appearing outside of an aggregate
  Ai := Aggregation(V, Sample, {}, Grp)
  E := E append (V, aggi)
  i := i + 1
  End

can be changed to

   E := E append (V, group key variable))

and in 18.3.4.4 SELECT Expressions

For each pair (var, expr) in E
   X := Extend(X, var, expr)
   End

still applies.

@Tpt
Contributor Author

Tpt commented Dec 18, 2025

@afs Thank you for your detailed reply! If you want, feel free to take over the MR to apply the changes.

@TallTed
Member

TallTed commented Dec 18, 2025

I think —

E := E append (V, group key variable))

— has an extra )


Labels

spec:bug Change fixing a bug in the specification (class 3) –see also spec:substantive

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Clarify use of SAMPLE

6 participants