Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-17450 StatusTool with pure Java code #2712

Merged
merged 36 commits into from
Oct 20, 2024

Conversation

janhoy
Copy link
Contributor

@janhoy janhoy commented Sep 13, 2024

https://issues.apache.org/jira/browse/SOLR-17450

This moves the meat of bin/solr status command to StatusTool, removing some ps | grep java logic in both scripts.

Instead, we use the Java 9 ProcessHandle API to interact with list of operating system processes, so we can find running instances of Solr.

The new SolrProcessMgr class is made standalone since it can also be used elsewhere to interact with solr processes...

I updated test_status.bats, but did not keep the old logic of allowing no arguments. Instead we expose botn --solr-url and -p so you can ask for status for another Solr. If you supply no args, it behaves like before, scanning for all running solrs. I added these to bats.

The output is probably not 100% same as it used to.

Windows script changes not tested at all...

Copy link
Contributor

@epugh epugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loving it...

@epugh
Copy link
Contributor

epugh commented Sep 14, 2024

We can put a call out to the community for testing on Windows... @Willdotwhite would you be interested in taking this PR for a spin on Windows?

@Willdotwhite
Copy link
Contributor

We can put a call out to the community for testing on Windows... @Willdotwhite would you be interested in taking this PR for a spin on Windows?

I'd be happy to, although I'm away from my Windows machine until next weekend at the earliest I'm afraid! Is there any particular time pressure on this?

@epugh
Copy link
Contributor

epugh commented Sep 14, 2024

We can put a call out to the community for testing on Windows... @Willdotwhite would you be interested in taking this PR for a spin on Windows?

I'd be happy to, although I'm away from my Windows machine until next weekend at the earliest I'm afraid! Is there any particular time pressure on this?

It's open source and volunteer driven ;-). So any time is a great time!

I do appreciate you and @rahulgoswami stepping up to do the Windows verification work. It makes me a lot more confident about being able to evolve the CLI.

@janhoy
Copy link
Contributor Author

janhoy commented Sep 14, 2024

I removed the RuntimePermission for manageProcess for runtime since I suspect it is only needed for tests, where we actually do spawn processes... Will revert that commit if the test fails...

@janhoy janhoy marked this pull request as ready for review September 16, 2024 07:30
@janhoy
Copy link
Contributor Author

janhoy commented Sep 16, 2024

There's a reproducible failure since BasicAuthIntegrationTest actually uses StatusTool and expects it to output pure JSON with the --solr-url option, and for it to also handle basic auth credentials correctly from basicauth sysProp. It does not..

@janhoy
Copy link
Contributor Author

janhoy commented Sep 16, 2024

So the reason is that in BasicAuthIntegrationTest it calls the tool directly with a url to query the status endpoint of the local Solr being tested. However, our test suite does not actually spawn processes, it just spins up a test jetty in a thread, so SolrProcessManager does not detect it. And we rely on info from running processes in that code path now. So either I'll change BasicAuth code to some other means of querying or will adapt the --solr-url code path of StatusTool to not rely on running process map...

@janhoy
Copy link
Contributor Author

janhoy commented Oct 16, 2024

So I did a few changes and found a few bugs:

The status tool with no args now defaults to inspecting the PID files in solr.pid.dir instead of listing all OS processes. This felt safer in case more Solr nodes run on the same host. Now only those in the PID_DIR will be listed.

Then I found that the resolving of pid dir was flawed. If solr.pid.dir was set, it worked, but the fallback to {solr.install.dir}/bin did not work due to a bug, so it fell back to hardcoded /tmp which did not work on WIndows. Now it will most likely use {solr.install.dir}/bin because it is set in shell script. But anyway we fallback to java.io.tmpdir.

It still did not work on Windows since I assumed that the PID file on windows was a real PID file. But it isn't. It is a PORT file, and its contents is the port number. The start.cmd script uses some script magic to list processes and parse PID each time. Don't ask me why.

So now we look up process by port if on windows and by pid if on Linux. Tests also updated. Running some tests on win.

@janhoy
Copy link
Contributor Author

janhoy commented Oct 16, 2024

Did more testing. The output when starting Solr on Win was broken and printed the json status text on Win, while on Linux it still printed Started Solr server on port 3333 (pid=6838). Happy searching!. So for bin/solr we still print that text in the shell script, but on Windows, we delegate it to StatusTool.

@epugh
Copy link
Contributor

epugh commented Oct 17, 2024

I love how instead of making StatusTool help output more complex, you added the "stealth" idea that honestly we may use in other places!

Remove duplicate printout "Found 1 Solr nodes:"
@janhoy
Copy link
Contributor Author

janhoy commented Oct 18, 2024

Did another pass at cleaning up some convoluted code, and remove some duplicate printout. Here I have copied the command response from some local tests. Will probably commit soon if you think it looks good:

bin/solr status -h
usage: bin/solr status [-h] [-p <PORT>] [-s <HOST>] [--short] [--verbose]

List of options:
 -h,--help              Print this message.
 -p,--port <PORT>       Port on localhost to check status for
 -s,--solr-url <HOST>   Base Solr URL, which can be used to determine the zk-host if that's not known; defaults to:
                        http://localhost:8983.
    --short             Short format. Prints one URL per line for running instances
    --verbose           Enable verbose command output.

Please see the Reference Guide for more tools documentation:
https://solr.apache.org/guide/solr/latest/deployment-guide/solr-control-script-reference.html
bin/solr status --short
http://localhost:9999/solr
http://localhost:8888/solr
bin/solr status
Looking for running processes...

Found 2 Solr nodes:

Solr process 16318 running on port 9999
{
  "solr_home":"/Users/janhoy/git/solr/solr/packaging/build/dev/server/solr",
  "version":"10.0.0-SNAPSHOT 58f7d877726a0274665c13ea5c89bf57f14c0f2b [snapshot build, details omitted]",
  "startTime":"Fri Oct 18 12:38:37 CEST 2024",
  "uptime":"0 days, 0 hours, 20 minutes, 27 seconds",
  "memory":"209.9 MB (%41) of 512 MB",
  "cloud":{
    "ZooKeeper":"127.0.0.1:10999",
    "liveNodes":"1",
    "collections":"0"}}


Solr process 16472 running on port 8888
{
  "solr_home":"/Users/janhoy/git/solr/solr/packaging/build/dev/server/solr",
  "version":"10.0.0-SNAPSHOT 58f7d877726a0274665c13ea5c89bf57f14c0f2b [snapshot build, details omitted]",
  "startTime":"Fri Oct 18 12:38:45 CEST 2024",
  "uptime":"0 days, 0 hours, 20 minutes, 20 seconds",
  "memory":"82.7 MB (%16.1) of 512 MB",
  "cloud":{
    "ZooKeeper":"127.0.0.1:9888",
    "liveNodes":"1",
    "collections":"0"}}
bin/solr status -p 8888

Solr process 16472 running on port 8888
{
  "solr_home":"/Users/janhoy/git/solr/solr/packaging/build/dev/server/solr",
  "version":"10.0.0-SNAPSHOT 58f7d877726a0274665c13ea5c89bf57f14c0f2b [snapshot build, details omitted]",
  "startTime":"Fri Oct 18 12:38:45 CEST 2024",
  "uptime":"0 days, 0 hours, 20 minutes, 37 seconds",
  "memory":"83.2 MB (%16.2) of 512 MB",
  "cloud":{
    "ZooKeeper":"127.0.0.1:9888",
    "liveNodes":"1",
    "collections":"0"}}
bin/solr status -p 1111
Could not find a running Solr on port 1111
bin/solr status --solr-url http://localhost:8888/solr
{
  "solr_home":"/Users/janhoy/git/solr/solr/packaging/build/dev/server/solr",
  "version":"10.0.0-SNAPSHOT 58f7d877726a0274665c13ea5c89bf57f14c0f2b [snapshot build, details omitted]",
  "startTime":"Fri Oct 18 12:38:45 CEST 2024",
  "uptime":"0 days, 0 hours, 21 minutes, 6 seconds",
  "memory":"84.2 MB (%16.4) of 512 MB",
  "cloud":{
    "ZooKeeper":"127.0.0.1:9888",
    "liveNodes":"1",
    "collections":"0"}}
bin/solr status --solr-url http://localhost:1111/solr
Solr at http://localhost:1111/solr not online.
bin/solr status --solr-url http://localhost:8888/solr --max-wait-secs 1
Waiting up to 1 seconds to see Solr running on port 8888
Started Solr server on port 8888. Happy searching!
bin/solr status --solr-url http://localhost:1111/solr --max-wait-secs 1
Waiting up to 1 seconds to see Solr running on port 1111
Solr at http://localhost:1111/solr did not come online within 1 seconds!

With no processes running:

bin/solr status
Looking for running processes...

Found 0 Solr nodes:

No Solr nodes are running.

@janhoy
Copy link
Contributor Author

janhoy commented Oct 18, 2024

Note that until now, the only supported syntax has been bin/solr status for user-facing, and bin/solr status --solr-url <url> --max-wait-secs <secs> for scripts. So that's the only we need to worry about back compat in output.

@janhoy
Copy link
Contributor Author

janhoy commented Oct 18, 2024

I compared with outputs from v9.7:

solr status with no running nodes:


+Looking for running processes...
+
+Found 0 Solr nodes:
+
No Solr nodes are running.

I.e. this PR adds two extra lines of output. Should I remove them?

solr status with two nodes running:

Same two extra lines as above. In addition 9.7 printed this extraline:

WARNING: URLs provided to this tool needn't include Solr's context-root (e.g. "/solr"). Such URLs are deprecated and support for them will be removed in a future release. Correcting from [http://localhost:8888/solr/] to [http://localhost:8888/].

solr start

Output almost identical between bin/solr script output and the output from statusTool, used on Windows.

Started Solr server on port 8888 (pid=1720). Happy searching!
Started Solr server on port 8888. Happy searching!

Difference is that StatusTool does not print the (pid=xxxx) information. Since this is invoked with a --solr-url it could be a remote server. But when combined with --max-wait-secs it kind of implies that it is local, so we COULD lookup port from URL and then find the PID from SolrProcessManager. The only trouble here is that the list of runing processes in SolrProcessManager is ininitalized on startup, so not really able to detect a new process that has suddenly appeared. So there could be a race condition. We could add an init() to the SPM of course to poll for new processes. Or we could just accept this difference.

@janhoy janhoy merged commit 2391f49 into apache:main Oct 20, 2024
5 checks passed
@janhoy janhoy deleted the SOLR-17450-status-tool-java branch October 20, 2024 22:20
janhoy added a commit to janhoy/solr that referenced this pull request Oct 20, 2024
Co-authored-by: Christos Malliaridis <[email protected]>

(cherry picked from commit 2391f49)
@malliaridis
Copy link
Contributor

malliaridis commented Oct 26, 2024

@janhoy After the merge of this PR I have multiple tests that fail on Windows with access denied ("java.io.FilePermission" "<<ALL FILES>>" "execute").

I'd like to see the security manager being removed in main, but the issues will still be present in branch_9x, so we should address them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants