Scatter is nonblocking when called on a communicator other than GroupWorld #2456

Open
shannong opened this issue Feb 19, 2025 · 0 comments

I'm implementing a custom alltoall that relies on multiple scatter steps and custom communicators. When scatter is called on one of the custom communicators, it appears to be non-blocking, which results in an error that the EventQueue is empty. The simulation does finish and report latencies, though.

Adding a scatter on any communicator(s) that include the ranks not in the first communicator corrects the issue.

Link to example code: https://github.com/shannong/sst-elements/blob/multilevel-hierarchical/src/sst/elements/ember/mpi/motifs/emberalltoall.cc#L189

If the else block at lines 198-201 is removed, the event queue error is printed in the output.

Example simulation:

import sst
from sst.merlin.base import * 
from sst.merlin.endpoint import *
from sst.merlin.interface import *
from sst.merlin.topology import *

from sst.ember import *

if __name__ == "__main__":

    PlatformDefinition.setCurrentPlatform("firefly-defaults")

    sst.setStatisticLoadLevel(15)
    sst.enableAllStatisticsForAllComponents()
    sst.setStatisticOutput("sst.statOutputConsole")
    

    ### set up the topology
    topo = topoDragonFly()
    topo.hosts_per_router = 4
    topo.routers_per_group = 32
    topo.intergroup_links = 4
    topo.num_groups = 2
    topo.algorithm = ["minimal", "ugal"]

    group_size = topo.hosts_per_router * topo.routers_per_group

    # Set up the routers
    router = hr_router()
    router.link_bw = "25GB/s"
    router.flit_size = "8B"
    router.xbar_bw = "30GB/s"
    router.input_latency = "20ns"
    router.output_latency = "20ns"
    router.input_buf_size = "256kB"
    router.output_buf_size = "256kB"
    router.num_vns = 2
    router.xbar_arb = "merlin.xbar_arb_lru"

    topo.router = router
    topo.link_latency = "20ns"      

    networkif = ReorderLinkControl()
    networkif.link_bw = "25GB/s"
    networkif.input_buf_size = "256kB" 
    networkif.output_buf_size = "256kB"

    ep = EmberMPIJob(0, topo.getNumNodes(), numCores=1)
    ep.network_interface = networkif
    ep.addMotif("Init")
    ep.addMotif("Alltoall") # look at different sizes here (< 500 bytes, 500 < n < 8k, > 8k) 
    ep.addMotif("Fini")
    ep.nic.nic2host_lat="100ns"
    
    system = System()
    system.setTopology(topo, 1)
    system.allocateNodes(ep, "linear")

    system.build()

    sst.setStatisticLoadLevel(16)
    sst.enableAllStatisticsForAllComponents()
    sst.setStatisticOutput("sst.statOutputCSV")
    sst.setStatisticOutputOptions({
        "filepath" : "/users/skinkead/carc-scratch/frontier/hierarchical/hierarchical1-2-frontier.csv",
        "separator" : ", "
    })
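
To make the rank coverage behind the workaround concrete, here is a minimal sketch in plain Python (no SST needed). It assumes the job has hosts_per_router * routers_per_group * num_groups ranks and that the custom communicators are formed by splitting ranks per dragonfly group (color = rank // group_size). Both are assumptions for illustration; the actual split is in the linked emberalltoall.cc. The helper `split_by_group` is hypothetical and not part of Ember.

```python
# Illustrative only: mirrors the topology numbers from the config above.
hosts_per_router = 4
routers_per_group = 32
num_groups = 2

group_size = hosts_per_router * routers_per_group  # same formula as in the config
num_ranks = group_size * num_groups                # assumed total rank count


def split_by_group(num_ranks, group_size):
    """Hypothetical helper: map color -> ranks, with color = rank // group_size."""
    comms = {}
    for rank in range(num_ranks):
        comms.setdefault(rank // group_size, []).append(rank)
    return comms


comms = split_by_group(num_ranks, group_size)
first = set(comms[0])

# Ranks that a scatter on the first communicator alone would never touch.
uncovered = [r for r in range(num_ranks) if r not in first]

print(f"{num_ranks} ranks, {len(comms)} communicators of {group_size}")
print(f"ranks outside the first communicator: {uncovered[0]}..{uncovered[-1]}")
```

Under these assumptions, a scatter issued only on the first communicator involves ranks 0..127, and the workaround amounts to also issuing a scatter on a communicator that covers ranks 128..255.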