WeeklyTelcon_20210713
- DID NOT RECORD?
- Austen Lauria (IBM)
- Brendan Cunningham (Cornelis Networks)
- Brian Barrett (AWS)
- Geoffrey Paulsen (IBM)
- Harumi Kuno (HPE)
- Hessam Mirsadeghi (NVIDIA)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Joseph Schuchart (HLRS)
- Matthew Dosanjh (Sandia)
- Michael Heinz (Cornelis Networks)
- Naughton III, Thomas (ORNL)
- Sam Gutierrez (LANL)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (NVIDIA)
- Aurelien Bouteiller (UTK)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- Christoph Niethammer (HLRS)
- David Bernholdt (ORNL)
- Edgar Gabriel (UH)
- Erik Zeiske (HPE)
- Geoffroy Vallee (ARM)
- George Bosilca (UTK)
- Josh Hursey (IBM)
- Joshua Ladd (NVIDIA)
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Nathan Hjelm (Google)
- Noah Evans (Sandia)
- Raghu Raja
- Ralph Castain (Intel)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Todd Kordenbrock (Sandia)
- Tomislav Janjusic (NVIDIA)
- William Zhang (AWS)
- Xin Zhao (NVIDIA)
- No schedule for v4.0.7
- Cisco would like v4.0.7 someday.
 
- PR9094 - external32 - Do we want it in v4.0?
- No
 
- PR9088 - long long - Do we want it in v4.1?
- Yes
 
- We need both 9094 and 9088 on v4.0.x to fix the bug reported.
- Quality of what this is and what's needed.
 
- v4.0.6 shipped last week. Looking good.
- Mpool PR, waiting for review and to go into master first.
- Howard is testing.
 
- 8919 - NVIDIA cannot link. Some users may have already hit this.
- Tomislav will try to find someone to look at it.
 
- Schedule: Planning on late August (no reason for August) for accumulated bugfixes.
- Fix huge page allocator waiting on Howard's testing.
- Long Long one
- 8867 - show help if libz is missing, Jeff's looking at.
- Documentation - Issue 7668 - lots of things need to change here.
- Can use help
- Jeff is done with the first pass of docs, and is slowly folding in docs
- Still stuff that needs to be revamped or written.
- Still all one docs.
 
- Harumi - Even if others can't write well
- Docs that should go into PRRTE
- Some infrastructure with Sphinx - can be started as well.
- Decent handle on infrastructure.
 
- Docs could also start in PMIx/PRRTE so we can slurp them in.
 
- PMIx / PRRTE plan to release in next few weeks
- Need to do a v5.0 rc as soon as PRRTE v2 ships. - Need feedback if we've missed an important one.
 
- PMIx Tools support is still not functional. Opened tickets in PRRTE. - Not a common case for most users.
- This also impacts the MPIR shim.
- PRRTE v2 will probably ship with broken tool support.
 
 
- Is the driving force for PRRTE v2.0 OMPI? - So we'd be indirectly/directly responsible for PRRTE shipping with broken tool support?
- Ralph would like to retire, and really wants to finish PRRTE v2.0 before he retires.
- Or just fix it in PRRTE v2.0?
- Is broken tool support a blocker for PRRTE v2.0?
- Don't ship OMPI v5.0 with broken Tools support.
 
 
- Are there any objections to delaying - Either we resource this
 
- https://github.com/openpmix/pmix-tests/issues/88#issuecomment-861006665 - Current state of PMIx tool support.
- We'd like to get Tool support in CI, but need it to be working to enable the CI.
 
- https://github.com/openpmix/prrte/issues/978#issuecomment-856205950 - Blocking issue for Open MPI
- Brian
 
- PR 9014 - new blocker. - Fix should just be a couple of lines of code... hard to decide what we want.
- Ralph, Jeff and Brian started talking.
- Simplest solution was to have our own
 
- Need people working on v5.0 stuff.
- Need some configury changes in before we RC.
- Issue 8850, 8990 and more
- Brian will file 3-ish issues - One is configure pmix
 
- Dynamic Windows fix in for UCX.
- Any update on debugger support?
- Need some documentation that Open MPI v5.0 supports PMIx based debuggers, and that if
- UCC coll component updating to be set as the default when UCX is selected. PR 8969 - Intent is that this will eventually replace hcoll.
- Quality
 
- Solid progress happening, on Read the docs.
- These docs would be on the readthedocs.io site, or on our site?
- Haven't thought either way yet.
- No strong opinion yet.
 
- Geoff is going to help
- Issue 8884 - ROMIO detects CUDA differently. - Gilles proposed a quick fix for now.
 
- https://github.com/open-mpi/ompi/wiki/Meeting-2021-07
- Find link to Web-ex HERE.
 
- July 22nd (2pm Central)
- July 29th (10-12 Central)
- Now released.
- Virtual Face to face.
- Persistent Collectives - So nice to get MPIX_ rename into v5.0
- Don't think this was planned for v5.0
- Don't know if anyone asked them this.  - Might not matter to them
- Virtual face to face -
 
 
- A bunch of stuff in pipeline. Then details.
- Plan to open Sessions pull request. - Big, almost all in OMPI.
- Some of it is more impacted by clang format changes.
- New functions.
- Considerably more functions can be called before MPI_Init/Finalize
- Don't want to do sessions in v5.0
- Hessam Mirsadeghi is interested in trying MPI_Sessions.
- Interested in a timeline of a release that will contain MPI_Sessions.
 
- Sessions working group meets every Monday at noon Central time.
- https://github.com/mpiwg-sessions/sessions-issues/wiki
- Several of the tools tests are busted on master.
- Sessions branch fixes some of these.
- Initialize tools after finalize MPI
 
 
- Update:
- Did some cleanup of refactoring.
- Topology might NOT change with Sessions relative to what's currently in master
- Extra topology work that wasn't accepted by MPI v4.0 standard.
- Question on how we do mca versioning
 
 
- We don't KNOW that OMPI v6.0 may not be an ABI break
- Would be NICE to get MPIX symbols into a separate library. - What's left in MPIX after persistent collectives?
- Short Float,
- Pcall_req - persistent collective
- Affinity
 
- If they're NOT built by default, it's not too high of a priority.
- Should just be some code-shuffling.
- On the surface shouldn't be too much.
- If they use wrapper compilers, or official mechanism
- Top level library, since app -> MPI and app -> MPIX lib.
- libmpi_x library can then be versioned differently.
 
 
- Don't change to build MPIX by default.
- Open an issue to track all of our MPI 4.0 items - MPI Forum will want, certainly before Supercomputing.
 
- Do we want an MPI 4.0 Design meeting in place of a Tuesday meeting? - In person meeting is off the table for many of us. We might want an out of sequence meeting.
- Let's doodle something a couple of weeks out.
- Doodle and send it out
- trivial wiki page in style of other in person wiki.
 
- Two days of 2 hour blocks - wiki *
- Who owns our open-SQL? - No one?
- What value is the viewer using to generate the ORG data?
- Looking for field in the perl client
- It's just the username.  It's nothing simple.
- Something about how the CherryPy server is stuffing stuff into the database.
 
 
- Thought it was in the ini file, but isn't.
 
- Concerned that we don't have an owner.
- Back in the day, we used MTT because there was nothing else.
- But perhaps there's something else now?
 
 
- A lot of segfaults in UCX 1sided in IBM
- Howard Pritchard: Does someone at NVIDIA have a good set of tests for GPU? - Can ask around.
- The only tests are the OSU MPI ones, which have support for CUDA and ROCm tests.
- Good enough for sanity.
- No support for Intel low level stuff now.
 
- PyTorch - machine learning framework - resembles an actual application.
- Has different backends, collectives reduction tool NCCL, but also has a CUDA backend for single/multiple nodes.
 
 
- ECP - worried we're going to get so far behind MPICH because all 3 major exascale systems are using essentially the same technology and their vendors use MPICH. They're racing ahead with integrating GPU offloaded code with MPICH. Just a heads up. - A thread on the GPU can trigger something to happen in MPI.
- CUDA_Async Not sure of
 
- Jeff will send out committer list to remove people from list. - Trivial to re-add someone, so err on the side of kicking folks out.
 
- No discussion
- No update
- No discussion.