SOLR-14673: Add bin/solr stream CLI #2479

Open: wants to merge 35 commits into base: main

epugh
Contributor

@epugh epugh commented May 25, 2024

https://issues.apache.org/jira/browse/SOLR-14673

Description

Bring in code that @joel-bernstein wrote, but using the SolrCLI infrastructure. The original code is a patch in the associated JIRA.

Solution

Another CLI client ;-)

Tests

Copied over the basic tests from the patch. I still need to write an integration-style test, ideally one that exercises basic auth.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@github-actions bot added the documentation label May 28, 2024
@gerlowskija
Contributor

gerlowskija commented Jun 3, 2024

A few high-level questions/concerns:

  1. bin/solr already has an "api" tool, which can be used to invoke streaming expressions e.g. bin/solr api -get "$SOLR_URL/techproducts/stream?expr=search(techproducts)". I'm all for syntactic-sugar, but I wonder whether this is worth the maintenance cost if the main thing that it "buys" us is saving people from having to provide the full API path as the "api" tool requires?

  2. If I'm reading the PR correctly, it looks like one other capability of the proposed bin/solr stream tool is that it can evaluate streams "locally" in some cases i.e. without a full running Solr. Which is pretty cool - you could imagine a real super-user doing some pretty involved ETL that builds off of an expression like: update(techproducts, unique(cat(...))).

    But I'd worry about some of the documentation challenges surrounding this. For instance, how would a user know which expressions can be run locally, and which require a Solr to execute on? For expressions that have a mix of both locally and remotely-executed clauses, is there any way for a user to know which clauses are executed where?

    To clarify - I think the upside here is pretty cool, I'm just worried that upside is hard to realize without some extensive work on the documentation end to make it usable by folks in practice.
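
For reference, the `api` invocation in point 1 is an HTTP GET against the collection's `/stream` handler, with the expression passed in the `expr` parameter. A minimal Python sketch of building that URL follows. The base URL `http://localhost:8983/solr` and the helper name `stream_url` are assumptions for illustration and are not part of this PR.

```python
from urllib.parse import quote, urlencode


def stream_url(base_url: str, collection: str, expr: str) -> str:
    """Build the URL of a collection's /stream handler with `expr` percent-encoded."""
    # quote_via=quote encodes spaces as %20 and parentheses as %28/%29.
    query = urlencode({"expr": expr}, quote_via=quote)
    return f"{base_url.rstrip('/')}/{collection}/stream?{query}"


# Assumed base URL; substitute the value of $SOLR_URL for a real deployment.
url = stream_url("http://localhost:8983/solr", "techproducts", "search(techproducts)")
print(url)
```

Percent-encoding the expression is the part that the raw `api` call leaves to the user, which is one of the conveniences a dedicated tool could take over.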

@epugh
Contributor Author

epugh commented Jun 5, 2024

Thanks for sharing the feedback @gerlowskija! I think the tool only has value if your second point holds, that it can run a streaming expression locally. What your first comment highlights then falls out easily. Otherwise it really is a thin wrapper around (and a duplication of) the bin/solr api call, especially since it adds no special value in formatting tuples, error handling, etc.

I do believe the second part is the really cool thing, that I can run a streaming expression locally and use it to process some data.

We clearly need some way of specifying where the processing is happening, in the cluster or locally. I was trying to think if we have any other places in Solr where we define "Where am I doing work" that might provide a name for a parameter. bin/solr stream --environment cluster BLAH ? The search() expression has a qt parameter.. bin/solr stream -qt=/stream BLAH ?

Reading through the docs more, we have parallel(), which refers to workers. Maybe the command should be something like bin/solr stream --workers=local BLAH, which would run on your laptop, and if you don't specify --workers then it runs on the cluster via /stream?

I have found that lots of streaming expressions don't require a Solr connection, especially during development. I'm just iterating on the logic, and I'm starting and ending with tuples. It's only later, when I get the mappings etc. working, that I move to adding in my search() or update() clauses.

Also, as far as docs go, we have a LONG way to go in streaming expressions. It's the best documented code, with all the how-tos and guides, but I also find a million expressions that exist but don't show up in our reference docs ;-).

I went with the plural name --workers solr, and then you pass in a collection. However, I could imagine that this becomes --workers my_collection,films,worker_collection on your local Solr... Not quite sure what passing in more than one means, however...
@epugh epugh marked this pull request as ready for review June 20, 2024 13:39
@epugh
Contributor Author

epugh commented Jul 23, 2024

Okay, I think this is ready for review! I've added some docs.. I especially liked being able to cat some local data right into a Solr collection!

cat example/exampledocs/books.csv | bin/solr stream -e local 'update(gettingstarted,parseCSV(stdin()))'

In my local playing, it's been nice to be able to write a complex streaming expression in a file and just run it from the command line....
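
As a rough illustration of what the `parseCSV(stdin())` step in the command above does conceptually, the Python sketch below turns CSV lines into one record ("tuple") per row, keyed by the header. This is only an analogy under stated assumptions: it is not Solr's implementation, the function name `parse_csv_tuples` is hypothetical, and the sample rows are made up rather than taken from `books.csv`.

```python
import csv
import io
from typing import Dict, Iterable, Iterator


def parse_csv_tuples(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV data row, keyed by the header row."""
    for row in csv.DictReader(lines):
        yield dict(row)


# Hypothetical sample input standing in for data piped through stdin.
sample = io.StringIO("id,name\n1,First book\n2,Second book\n")
tuples = list(parse_csv_tuples(sample))
print(tuples)
```

Passing `sys.stdin` instead of `sample` would read piped data, mirroring the `cat ... |` usage above.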

@epugh
Contributor Author

epugh commented Aug 10, 2024

@gerlowskija since you provided some early review, do you think the docs I've added etc are enough that I can merge this in?

Contributor

@gerlowskija gerlowskija left a comment

Thanks for those added docs @epugh . They're a huge help, and I suggested a few tweaks that might help even more.

One remaining question I have: wdyt about marking the tool syntax as "experimental" in some way? Seeing all the hard work you've put into improving syntax on the other tools, and considering that we might not notice some rough edges in this tool's syntax until it's out in the wild a bit, it might be prudent to give this script the equivalent of "@lucene.experimental" so that we wouldn't need to worry about backcompat if we want to make any future tweaks?

The Stream tool allows you to run a xref:streaming-expressions.adoc[] and see the results from the command line.
It is very similar to the xref:stream-screen.adoc[], but is part of the `bin/solr` CLI.

To run it, open a window and enter:
Contributor

[0] "window" -> "terminal" or maybe "shell"?

// under the License.

The Stream tool allows you to run a xref:streaming-expressions.adoc[] and see the results from the command line.
It is very similar to the xref:stream-screen.adoc[], but is part of the `bin/solr` CLI.
Contributor

[0] Might be worth mentioning here the other differentiator - that this executes some streams "locally"?

A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM|
----

TIP: Notice how we used the pipe character (|) as the delimiter? It required a backslash for escaping it so it wouldn't be treated as a pipe with in the shell script.
Contributor

[0] "with in" -> "within"

TIP: Notice how we used the pipe character (|) as the delimiter? It required a backslash for escaping it so it wouldn't be treated as a pipe with in the shell script.

You can also specify a file with the suffix `.expr` containing your streaming expression.
This is useful for longer expressions or if you having command line parsing issues with your expression.
Contributor

[0] "or if you having" -> "or if you have"

Might also be worth clarifying that the file approach primarily helps with shell character-escaping issues, and not parsing/syntax issues generally.
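
To make the shell-escaping point concrete, here is a small Python sketch that uses the standard library's `shlex.quote` to show which parts of a command line need quoting before a shell sees them. The helper name `stream_command` is hypothetical, and the `-e local` usage simply mirrors the example earlier in this thread; the tool's final syntax may differ.

```python
import shlex


def stream_command(expr: str, execution: str = "solr") -> str:
    """Return a shell-safe command line for running a streaming expression."""
    args = ["bin/solr", "stream", "-e", execution, expr]
    # shlex.quote leaves plain words alone and single-quotes anything
    # containing characters the shell would otherwise interpret.
    return " ".join(shlex.quote(arg) for arg in args)


cmd = stream_command("update(gettingstarted,parseCSV(stdin()))", execution="local")
print(cmd)

# A bare pipe character also has to be quoted or escaped to survive the shell.
print(shlex.quote("|"))
```

Putting the expression in a file sidesteps this quoting step entirely, but it does nothing for mistakes in the expression syntax itself.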

A-DATA V-Series 1GB 184-Pin DDR SDRAM Unbuffered DDR 400 (PC 3200) System Memory - OEM
----

The `--help` (or simply `-h`) option will output information on its usage (i.e., `bin/solr stream --help)`.
Contributor

[Q] Should this sentence go in the "Using the ..." section below, since that section essentially pastes the "bin/solr stream -h" output in its entirety?

-u,--credentials <credentials> Credentials in the format username:password. Example: --credentials solr:SolrRocks
-url,--solr-url <HOST> Base Solr URL, which can be used to determine the zk-host if that's not known;
defaults to: http://localhost:8983.
-e,--execution <CONTEXT> Execution context is either 'local' or 'solr'. Default is 'solr'
Contributor

@gerlowskija gerlowskija Sep 16, 2024

[0] I don't love this terminology, but I don't have anything better in mind (yet). "Local" to me could be misconstrued by folks running 'bin/solr' on a box that also happens to have Solr running.

Contributor Author

Thoughts on:

Execution context is either 'local' (i.e., the CLI process) or 'solr'.


Caveats:

* You don't get to use any of the parallelization support that is available when you run the expression on the cluster.
Contributor

[Q] Is this only a limitation if --execution=local is specified?

Hello World
----

This also works with a `.expr` files.
Contributor

[0] "also works with a .expr files." -> "also works when using .expr files."

)
----

Running this expression will read in the local file and send the first two lines to the collection `gettingstarted`.
Contributor

[0] Might be worth hitting this point a little more strongly: even "local" processing is likely to reach out to a remote host.

Maybe something like:

All streaming expressions are processed "locally" if that execution mode is selected.
However, "local" processing does not imply a networking sandbox.
Many streaming expressions, such as search and update, will make network requests to remote Solr nodes if configured to do so, even in "local" execution mode.

2 participants