[ES|QL] MMR Command - MMROperator and Execution #141324

Open
markjhoy wants to merge 37 commits into elastic:main from markjhoy:markjhoy/esql_mmr_command_operator_execution
Contributor

@markjhoy markjhoy commented Jan 27, 2026

Adds the operator and execution flow for the MMR command, along with the relevant CSV tests.

Also performs a light refactoring to add a CompleteInputCollectorOperator base class for operators that must receive and process the full set of input pages before emitting any output.
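
The "collect all input, then emit" pattern described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the class names `CollectAllThenEmit` and `KeepFirstN` are hypothetical, a "page" is modeled as a plain `List`, and none of this is the actual CompleteInputCollectorOperator code from the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for an operator that must see all input first.
// A "page" is just a List<T> here; the real compute engine uses its own page types.
abstract class CollectAllThenEmit<T> {
    private final List<List<T>> buffered = new ArrayList<>();
    private List<T> output;   // stays null until finish() has run
    private boolean emitted;

    /** Accepts one input page; nothing is emitted yet. */
    final void addInput(List<T> page) {
        if (output != null) {
            throw new IllegalStateException("input received after finish()");
        }
        buffered.add(page);
    }

    /** Signals end of input; only now does the subclass process everything. */
    final void finish() {
        if (output == null) {
            output = process(buffered);
            buffered.clear();
        }
    }

    /** Returns the result exactly once after finish(); otherwise returns null. */
    final List<T> getOutput() {
        if (output == null || emitted) {
            return null;
        }
        emitted = true;
        return output;
    }

    /** Subclass hook that receives the complete input in arrival order. */
    protected abstract List<T> process(List<List<T>> allPages);
}

// Toy subclass: keeps the first n values across all buffered pages.
final class KeepFirstN extends CollectAllThenEmit<Integer> {
    private final int n;

    KeepFirstN(int n) {
        this.n = n;
    }

    @Override
    protected List<Integer> process(List<List<Integer>> allPages) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> page : allPages) {
            for (Integer value : page) {
                if (out.size() < n) {
                    out.add(value);
                }
            }
        }
        return out;
    }
}
```

In this sketch the base class owns buffering and the emit-once bookkeeping, so a subclass only has to implement `process`.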

import static org.elasticsearch.xpack.esql.core.type.DataType.INTEGER;

public class MMR extends UnaryPlan implements TelemetryAware, ExecutesOn.Coordinator, PostAnalysisVerificationAware {
public static final NamedWriteableRegistry.Entry ENTRY = new NamedWriteableRegistry.Entry(LogicalPlan.class, "Mmr", MMR::new);
Contributor Author

Although serialization is not strictly necessary here, since this should only run on the coordinator, certain parent plans (e.g. LIMIT) may try to serialize their child plans, and without this entry (and the writeTo(), etc. implementations) that will throw. In theory, allowing this to de/serialize should not hurt anything.

Contributor

I don't think we need serialization; if any serialization is happening, that is a separate bug we need to dig into.

Contributor Author

markjhoy commented Feb 2, 2026

Note: I'm seeing a weird issue with the output of the dense vectors, where the values are being rounded. For example, one of the CSV tests fails with a mismatch error because of these results:

Actual:
    keyword_field:keyword | text_body:text | text_vector:dense_vector
    test1                 | first text     | [0.4, 0.2, 0.4]
    test3                 | third text     | [0.4, 0.1, 0.3]
    test6                 | sixth text     | [0.099999994, 0.099999994, 0.099999994]
Expected:
    keyword_field:keyword | text_body:text | text_vector:dense_vector
    test1                 | first text     | [0.4, 0.2, 0.4]
    test3                 | third text     | [0.4, 0.1, 0.3]
    test6                 | sixth text     | [0.1, 0.1, 0.1]

The input vectors appear to be rounded on input. This shouldn't affect the functionality of the operator here, but we need to look into it further.
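
As a side note on the mismatch above: 0.099999994 is the 32-bit float immediately below 0.1f, so the actual and expected values differ by a single representable step. The sketch below only demonstrates that relationship and a tolerance-based comparison. It does not identify where the value changes in the pipeline, and the `FloatUlpDemo` class and `nearlyEqual` helper are hypothetical names, not part of the PR or the test framework.

```java
// Hypothetical helper illustrating a one-ULP difference between two floats.
final class FloatUlpDemo {
    /**
     * True when a and b are at most maxUlps representable float steps apart.
     * Assumes finite inputs with the same sign, which holds for this example.
     */
    static boolean nearlyEqual(float a, float b, int maxUlps) {
        return Math.abs(Float.floatToIntBits(a) - Float.floatToIntBits(b)) <= maxUlps;
    }

    public static void main(String[] args) {
        float expected = 0.1f;
        float actual = Math.nextDown(expected); // the float immediately below 0.1f
        System.out.println(actual == 0.099999994f);           // same float as in the test output
        System.out.println(expected == actual);               // exact comparison fails
        System.out.println(nearlyEqual(expected, actual, 1)); // one-step tolerance passes
    }
}
```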

markjhoy added the >non-issue, Team:Search Relevance, and :Search Relevance/ES|QL labels Feb 3, 2026
markjhoy marked this pull request as ready for review February 3, 2026 23:31
markjhoy self-assigned this Feb 3, 2026
markjhoy requested review from a team and ioanatia February 3, 2026 23:31
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

Member

@carlosdelest carlosdelest left a comment

Looks really good! 💯

Some questions about exception usage and releasing pages.


@Override
protected Page createPage(int positionOffset, int length) {
length = Integer.min(length, remaining());
Member

I think it would be useful to (rarely) add random nulls and check that nothing breaks.

FROM mmr_text_vector_keyword
| SORT keyword_field
| LIMIT 10
| MMR [0.1, 0.2, 0.3]::dense_vector ON text_vector LIMIT 3 WITH { "lambda": 0.1 }
Member

We could do implicit casting here! Check how it's done now for functions - we may expand that to commands, or just for this specific instance.

I realize that users will probably not pass a vector literal directly as an argument, but rather the result of some function / expression. Just mentioning it here in case you think it would be useful. If so, we can do that in a separate PR.
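
For readers unfamiliar with the command in the query above, here is a rough sketch of the textbook maximal marginal relevance (MMR) selection that the query vector, `LIMIT`, and `lambda` parameters correspond to. It assumes cosine similarity and the standard greedy formulation; the `MmrSketch` class is hypothetical and is not the MMROperator implementation from this PR, which may differ in similarity function and details.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of greedy MMR selection; not the PR's MMROperator.
final class MmrSketch {
    /** Cosine similarity; returns 0 when either vector has zero length. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the indices of up to `limit` candidates, in selection order. */
    static List<Integer> select(float[] query, List<float[]> candidates, int limit, double lambda) {
        List<Integer> selected = new ArrayList<>();
        boolean[] used = new boolean[candidates.size()];
        while (selected.size() < limit && selected.size() < candidates.size()) {
            int best = -1;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < candidates.size(); i++) {
                if (used[i]) {
                    continue;
                }
                double relevance = cosine(query, candidates.get(i));
                // Redundancy is the highest similarity to anything already selected;
                // with nothing selected yet there is no penalty.
                double redundancy = selected.isEmpty() ? 0 : Double.NEGATIVE_INFINITY;
                for (int s : selected) {
                    redundancy = Math.max(redundancy, cosine(candidates.get(i), candidates.get(s)));
                }
                double score = lambda * relevance - (1 - lambda) * redundancy;
                if (score > bestScore) {
                    bestScore = score;
                    best = i;
                }
            }
            used[best] = true;
            selected.add(best);
        }
        return selected;
    }
}
```

Under this formulation, a low `lambda` such as the 0.1 used in the query weights diversity heavily, so near-duplicates of an already selected row are penalized; `lambda` close to 1 ranks almost purely by similarity to the query vector.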

}

@Override
protected void onClose() {
Member

Do we need to override? Is this not handled on the superclass?

Contributor Author

It's an abstract method of the superclass, called after the superclass has performed its own close() work.

Member

Yep - I mean, I think the exact same thing is being done on the superclass before invoking onClose()?

Integer scoreChannel = null;
for (Attribute input : mmr.inputSet()) {
if (input instanceof MetadataAttribute && MetadataAttribute.SCORE.equals(input.name())) {
scoreChannel = source.layout.get(input.id()).channel();
Contributor

Do we need to pass in the scoreChannel?
When I first saw this, I thought we might use the score in the MMR formula, but that's not the case 😌
It seems like we only need this for RankDoc? I wonder what the side effect would be of always using a constant score of 1.0f when we create the RankDoc objects 🤔. Can we do that and simplify the code here a bit, so we don't need to pass _score here?

return new MMROperator.Status(emitNanos, pagesReceived, pagesProcessed, rowsReceived, rowsEmitted);
}

public record Status(long emitNanos, int pagesReceived, int pagesProcessed, long rowsReceived, long rowsEmitted)
Contributor

I wonder if we can abstract the Status into CompleteInputCollectorOperator too 🤔.
Other abstract operators seem to do this:

Labels

>non-issue, :Search Relevance/ES|QL, Team:Search Relevance, v9.4.0

4 participants