
Singleton spawn missing child IO #10691


Closed
jjhursey opened this issue Aug 19, 2022 · 7 comments · Fixed by #10695

Comments

@jjhursey
Member

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

main at 31d719d

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Standard build from github source.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

shell$ git submodule status
 813d8ba6bc938fc87ed5316fa47a02571cf3b03a 3rd-party/openpmix (v1.1.3-3589-g813d8ba6)
+d0f3e280c5612908f0daf3f097659db22baf31b1 3rd-party/prrte (v3.0.0rc1-12-gd0f3e280)

Please describe the system on which you are running

  • Operating system/version: RHEL 8.4
  • Computer hardware: ppc64le
  • Network type: shared memory (single node)

Details of the problem

shell$ mpicc simple_spawn_multiple.c -o simple_spawn_multiple
shell$ mpirun -np 1 ./simple_spawn_multiple ./simple_spawn_multiple
Hello from a Child (B)
Hello from a Child (B)
Hello from a Child (A)
Spawning Multiple './simple_spawn_multiple' ... OK
shell$ ./simple_spawn_multiple ./simple_spawn_multiple
Spawning Multiple './simple_spawn_multiple' ... OK

I expect that the second run (singleton spawn) would have the same output as the first run (non-singleton spawn). However, it looks like the child output is suppressed.

@rhc54
Contributor

rhc54 commented Aug 19, 2022

I suspect the reason is that the singleton isn't declaring itself to be an IOF "sink" - i.e., that its IOF streams should be output instead of passed along. I'll take a peek - I suspect it's something that should happen in PMIx when we detect we are operating without a server.

@rhc54
Contributor

rhc54 commented Aug 19, 2022

Just FYI: if you build OMPI with external PRRTE, you get a link from mpirun to prterun - but you don't get any link created for prte. Thus, singleton comm_spawn fails as it cannot find that executable (though it doesn't tell you that).

Very unexpected behavior - took me quite a while to break it down (by digging thru the OMPI source) and figure out what was going on.

@rhc54
Contributor

rhc54 commented Aug 19, 2022

Note that if I build with embedded versions, then prte does show up in $prefix/bin and singleton comm_spawn works fine. Rather inconsistent behavior, and confusing.
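
For anyone hitting the same silent failure, a quick way to see which launcher executables an install actually provides is sketched below. It only assumes the three names mentioned above (mpirun, prterun, prte); the helper name `check_launchers` is made up for illustration and is not part of Open MPI or PRRTE.

```shell
# Report which launcher executables exist in an install's bin directory.
# check_launchers is a hypothetical helper, not an Open MPI or PRRTE tool.
# Example call: check_launchers /path/to/prefix/bin
check_launchers() {
    bindir=$1
    for exe in mpirun prterun prte; do
        # -x follows symlinks, so a link such as mpirun -> prterun counts
        # as found only if its target exists and is executable.
        if [ -x "$bindir/$exe" ]; then
            echo "$exe: found"
        else
            echo "$exe: MISSING"
        fi
    done
}
```

If prte is reported as MISSING while mpirun and prterun are found, the install matches the external-PRRTE situation described above.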

@rhc54
Contributor

rhc54 commented Aug 19, 2022

Hmmm...I couldn't get this to run at all. First, I had to update both PMIx and PRRTE to head of master branches (to resolve the keepalive option issue), and then I had to edit the ompi_rte.c to fix the local_peers issue (I gather that PR has not yet been committed). I then found that the cmd line being built in dpm.c was incorrectly being parsed as having an application, which PRRTE will reject - the '&' at the end was identified as not a defined option, and therefore being designated as an application, causing prte to abort.

Once I got that fixed, I could finally reproduce the problem. Just not clear how you were able to get it to run with OMPI in its current head-of-main state.
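
To illustrate the '&' parsing failure described above, here is a toy sketch of that kind of classification. It is not PRRTE's parser: the option names are placeholders invented for the example, and the only behavior it models is "the first token that is not a defined option is treated as the application".

```shell
# Toy classifier, NOT the real prte command-line parser.
# The option names are placeholders, not actual prte flags.
first_app_token() {
    for tok in "$@"; do
        case $tok in
            --hypothetical-opt-a|--hypothetical-opt-b)
                ;;  # defined option: keep scanning
            *)
                # Anything else is taken to be the application name.
                echo "$tok"
                return 0
                ;;
        esac
    done
    return 1  # only options were given, so no application was found
}
```

With this sketch, `first_app_token --hypothetical-opt-a '&'` prints `&`, because a literal '&' that reaches the parser as an argument (rather than being interpreted by a shell) is not a defined option. The same call without the trailing '&' finds no application at all.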

@rhc54
Contributor

rhc54 commented Aug 19, 2022

Once you update the PMIx submodule pointer, #10695 will fix this issue.

@rhc54
Contributor

rhc54 commented Aug 19, 2022

FWIW: this is what I get out of your test program once the updates are all applied:

$ ./simp ./simp
Hello from a Child (B)
Hello from a Child (B)
Hello from a Child (A)
Spawning Multiple './simp' ... OK

@jjhursey
Member Author

I confirmed this is fixed by PR #10695 in this comment

Thanks for the fix!
