
Commit b2e5d0e

Removed progress method and its indicator from SB API. (#701)
Removes progress() and supportsProgTh() from the SB API; plugins can still use them internally. Because ucx and gds_mt already use multiple threads to deliver performance, the idea of a shared progress thread has become less useful. A shared progress thread can also cause contention between backends, whereas we want as much isolation between modules as possible. The team therefore decided to remove it from the SB API. Plugins can still use a progress method internally, as the current ucx implementation does; it is just no longer a mandatory method in the SB API.
1 parent 287b6e2 commit b2e5d0e
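
As a rough illustration of what the commit message describes, the sketch below shows a backend whose progress routine is private and is driven from its own status and notification calls instead of being exposed through the SB API. Everything here (SketchBackend, XferStatus, postXfer, progressInternal, the step-counting "transfers") is hypothetical and is not the NIXL or UCX API; only the method names checkXfer and getNotifs are borrowed from the guide for readability.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical status codes; not the real NIXL status type.
enum class XferStatus { InProgress, Done };

// Hypothetical backend with no public progress() method.
// Progress is driven internally by the status and notification calls.
class SketchBackend {
public:
    // Queue a fake transfer that needs `steps` progress calls to complete.
    std::size_t postXfer(int steps) {
        remaining_.push_back(steps);
        return remaining_.size() - 1;
    }

    // Plays the role of checkXfer: progress first, then report status.
    XferStatus checkXfer(std::size_t handle) {
        progressInternal();
        return remaining_.at(handle) == 0 ? XferStatus::Done
                                          : XferStatus::InProgress;
    }

    // Plays the role of getNotifs: progress first, then drain notifications.
    std::vector<std::string> getNotifs() {
        progressInternal();
        std::vector<std::string> out(notifs_.begin(), notifs_.end());
        notifs_.clear();
        return out;
    }

private:
    // Plugin-private progress: advance each pending transfer by one step and
    // record a notification when a transfer reaches zero remaining steps.
    void progressInternal() {
        for (std::size_t i = 0; i < remaining_.size(); ++i) {
            if (remaining_[i] > 0 && --remaining_[i] == 0) {
                notifs_.push_back("xfer " + std::to_string(i) + " done");
            }
        }
    }

    std::vector<int> remaining_;      // progress steps left per transfer
    std::deque<std::string> notifs_;  // notifications not yet drained
};

// Usage: the caller only polls status and notifications.
inline void demoInternalProgress() {
    SketchBackend backend;
    const std::size_t h = backend.postXfer(2);
    assert(backend.checkXfer(h) == XferStatus::InProgress);
    assert(backend.checkXfer(h) == XferStatus::Done);
    assert(backend.getNotifs().size() == 1);
}
```

A real plugin could equally run the private routine on its own thread; the point of the sketch is only that the agent never needs to know a progress method exists.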

21 files changed: +72 -134 lines changed

docs/BackendGuide.md

Lines changed: 6 additions & 14 deletions
@@ -51,12 +51,10 @@ The key/value parameters are a map of strings to byte arrays that are passed fro
 * supportsLocal(): Indicates if the backend supports transfers within a node
 * supportsRemote(): Indicates if the backend supports transfers across nodes
 * supportsNotif(): Indicates if the backend supports notifications
-* supportsProgressThread(): Indicates if the backend supports progress() method. That method should call the underlying procedure of progressing transfers for this backend.
 * getSupportedMems(): Indicates memory types supported by the backend

-Based on the first 4 methods (supports*), the required methods to be implemented change. For instance, UCX backend implements all as it supports all scenarios, while GDS backend only has supportsLocal, detailed more in Example implementations. Note that a network backend should have supportsRemote and supportsNotif to be set to true, and preferably supportsLocal also to true, so another backend doesnt need to be involved for local transfers. For a storage backend, it should have supportsLocal and supportsNotif is optional. supportsProgressThread is optional for both cases. Additionally, a backend that supportsRemote must also support supportNotifs.
+Based on the first 3 methods (supports*), the required methods to be implemented change. For instance, UCX backend implements all as it supports all scenarios, while GDS backend only has supportsLocal, detailed more in Example implementations. Note that a network backend must have supportsRemote and supportsNotif to be set to true, and preferably supportsLocal also to true, so another backend doesn't need to be involved for local transfers. For a storage backend, it should have supportsLocal and supportsNotif is optional.

-Note that supportProgressThread is an indicator whether a backend has implemented the progress() method, but does not imply how the progress thread is implemented. During creation of a backend, the provided init params indicate how the progress thread is intended to be used. For instance, if the enablement of progress thread is set to false, while a backend cannot work without a separate progress thread, the backend creation would fail. This flag is useful for the NIXL agent if we want to provide some agent level guarantees, such as minimum time between calls to progress for backends, or if a central progress method is implemented (for future proofing, not currently implemented).
 ### Connection Management:

 * connect(): Initiates connection to a remote agent.
@@ -106,12 +104,6 @@ Finally, note that a call to releaseXferReq should not block and be asynchronous

 Note that getNotif does not know which agent it should look for to receive the notification. So there should be a method to extract the agent name from the notification received, corresponding to a transfer. genNotif generates a notification which is not bound to any transfers, and does not provide any ordering guarantees. If a backend does not set supportsNotifications, these two methods are not needed.

-### Progress Thread:
-
-* progress(): Makes progress on transfers and notifications.
-
-If a backend requires a progress call, such as UCX, to proceed with the transfers, for both check of transfer status or received notification, they can implement a progress thread, and a frequency of waking up that thread will be passed during backend creation. In addition, each time a user calls to check a transfer status, or check received notifications, this method is called, enabling progress if a progress thread is not implemented.
-
 ## Descriptor List Abstraction

 A key underlying abstraction for NIXL library is a descriptor list, that is made of a memory space (host/GPU/block/File/Obj-Store) and a list of descriptors. There are 2 types of descriptors used for the SB API.
@@ -142,9 +134,9 @@ The plugin manager maintains API versioning of these above APIs. This can allow

 ## Comparing two plugins as an example

-NIXL UCX plugin provides networking across different nodes, while GDS plugin provides storage access. Moreover, UCX plugin sets all of the “supports” flags, while GDS only has the supportsLocal flag set. The reason being UCX requires a progress thread and provides notifications, and can do transfers within an Agent, for instance from GPU to CPU, and across Agents. Therefore, it should implement all of the methods mentioned previously.
+NIXL UCX plugin provides networking across different nodes, while GDS plugin provides storage access. UCX plugin sets all of the “supports” flags, while GDS only has the supportsLocal flag set. The reason being UCX is a network plugin that should support inter-agent communication and notifications, and it also supports intra-agent transfers, for instance from GPU to CPU.

-However, for NIXL storage backends, there is no need to run a NIXL agent on a remote storage node. Instead, a distributed storage client on the local agent talks to the remote distributed storage, and therefore from NIXL agent point of view for all storage, whether local or remote, it has to talk to this local storage client. In other words, all the transfers are loopback to the agent itself. For the current use case, there is no need for notifications within the same agent, or a progress thread either.
+However, for NIXL storage backends, there is no need to run a NIXL agent on a remote storage node. Instead, a distributed storage client on the local agent talks to the remote distributed storage, and therefore from NIXL agent point of view for all storage, whether local or remote, it has to talk to this local storage client. In other words, all the transfers are loopback to the agent itself. For the current use case, there is no need for notifications within the same agent.

 Moreover, the GDS plugin does not require a local connection to itself, so it returns SUCCESS for connect and disconnect, and for loadLocal simply returns back the input pointer as its output. The only 6 remaining methods that it has to implement are:

@@ -213,20 +205,20 @@ Note that inside a transfer, a backend might provide methods for network resilie

 ### Get transfer status:

-The agent will call the backend specific transfer handle that is stored within the agent transfer handle, and check the status of the transfer. This is achieved through a call to **checkXfer** in the SB API. Internal to the backend, they can call the **progress** method in SB API, if that’s necessary to get the latest status of the transfers. If the agent is run in progress thread mode, the agent will call that periodically, and therefore reduce the load on this internal call.
+The agent will call the backend specific transfer handle that is stored within the agent transfer handle, and check the status of the transfer. This is achieved through a call to **checkXfer** in the SB API. Internal to the backend, they can call their internal progress method, if that’s necessary to get the latest status of the transfers.

 ### Invalidate transfer request:

 The agent will call the **releaseReqH** from the SB API on the backend specific transfer handle to release it, and potentially abort the transfer if in progress and the backend has the capability. Then the agent will release the other resources within the agent level transfer handle to fully release it.

 ### Get notifications:

-The agent will iterate over all the backends that support notification, and call their **getNotifs** from the SB API, which will return a list of notifications received from each remote node between the previous call to this method and this time. Then the agent will merge the results from all such backends, and append them to the map that the user has provided. Similar to get transfer status, Internal to the backend, they can call the **progress** method in SB API, if that’s necessary to get the latest notifications received from the transfers initiated by the other agents towards them. If the agent is run in progress thread mode, the agent will call that periodically, and therefore reduce the load on this internal call.
+The agent will iterate over all the backends that support notification, and call their **getNotifs** from the SB API, which will return a list of notifications received from each remote node between the previous call to this method and this time. Then the agent will merge the results from all such backends, and append them to the map that the user has provided. Similar to get transfer status, Internal to the backend, they can call their internal progress method, if that’s necessary to get the latest notifications received from the transfers initiated by the other agents towards them.

 ### Generate notification:

 If a backend is provided by the user, the agent will call **genNotif** from the SB API of that backend engine. Otherwise, it will look for a backend that is available locally and remotely and also supports notifications. If more than one candidate is found, it will choose the first one, or use a preference list.

 ### Destructor:

-When an agent is getting destroyed at the end of the application, it will deregister all the remaining memories that were not deregistered by the application (bad practice, but agent takes care of it). Then for each of the backends it will call their **destructor** from the SB API, and finally do the rest of internal clean up.
+When an agent is getting destroyed at the end of the application, it will deregister all the remaining memories that were not deregistered by the application (bad practice, but agent takes care of it). Then for each of the backends it will call their **destructor** from the SB API, and finally do the rest of internal clean up.

src/api/cpp/backend/backend_engine.h

Lines changed: 0 additions & 11 deletions
@@ -112,9 +112,6 @@ class nixlBackendEngine {
     // pure virtual, and return errors, as parent shouldn't call if supportsNotif is false.
     virtual bool supportsNotif() const = 0;

-    // Determines if a backend supports progress thread.
-    virtual bool supportsProgTh() const = 0;
-
     virtual nixl_mem_list_t getSupportedMems() const = 0; // TODO: Return by const-reference and mark noexcept?


@@ -209,14 +206,6 @@ class nixlBackendEngine {
     }


-    // *** Needs to be implemented if supportsProgTh() is true *** //
-
-    // Force backend engine worker to progress.
-    virtual int
-    progress() {
-        return 0;
-    }
-
     // *** Optional virtual methods that are good to be implemented in any backend *** //

     // Query information about a list of memory/storage

src/core/agent_data.h

Lines changed: 0 additions & 1 deletion
@@ -130,7 +130,6 @@ class nixlBackendH {
     bool supportsRemote () const { return engine->supportsRemote(); }
     bool supportsLocal () const { return engine->supportsLocal (); }
     bool supportsNotif () const { return engine->supportsNotif (); }
-    bool supportsProgTh () const { return engine->supportsProgTh(); }

     friend class nixlAgentData;
     friend class nixlAgent;

src/plugins/cuda_gds/gds_backend.h

Lines changed: 0 additions & 3 deletions
@@ -121,9 +121,6 @@ class nixlGdsEngine : public nixlBackendEngine {
     bool supportsLocal() const {
         return true;
     }
-    bool supportsProgTh() const {
-        return false;
-    }

     nixl_mem_list_t getSupportedMems() const {
         nixl_mem_list_t mems;

src/plugins/gds_mt/gds_mt_backend.h

Lines changed: 0 additions & 4 deletions
@@ -52,10 +52,6 @@ class nixlGdsMtEngine : public nixlBackendEngine {
     supportsLocal() const override {
         return true;
     }
-    bool
-    supportsProgTh() const override {
-        return false;
-    }

     nixl_mem_list_t
     getSupportedMems() const override {

src/plugins/gpunetio/gpunetio_backend.h

Lines changed: 0 additions & 3 deletions
@@ -46,9 +46,6 @@ class nixlDocaEngine : public nixlBackendEngine {
     bool supportsNotif() const {
         return true;
     }
-    bool supportsProgTh() const {
-        return false;
-    }

     nixl_mem_list_t
     getSupportedMems() const;

src/plugins/hf3fs/hf3fs_backend.h

Lines changed: 0 additions & 3 deletions
@@ -138,9 +138,6 @@ class nixlHf3fsEngine : public nixlBackendEngine {
     bool supportsLocal () const {
         return true;
     }
-    bool supportsProgTh () const {
-        return false;
-    }

     nixl_mem_list_t getSupportedMems () const {
         nixl_mem_list_t mems;

src/plugins/mooncake/mooncake_backend.h

Lines changed: 0 additions & 5 deletions
@@ -53,11 +53,6 @@ class nixlMooncakeEngine : public nixlBackendEngine {
         return true;
     }

-    bool
-    supportsProgTh() const {
-        return false;
-    }
-
     nixl_mem_list_t
     getSupportedMems() const;

src/plugins/obj/obj_backend.h

Lines changed: 0 additions & 5 deletions
@@ -46,11 +46,6 @@ class nixlObjEngine : public nixlBackendEngine {
         return false;
     }

-    bool
-    supportsProgTh() const override {
-        return false;
-    }
-
     nixl_mem_list_t
     getSupportedMems() const override {
         return {OBJ_SEG, DRAM_SEG};

src/plugins/posix/posix_backend.h

Lines changed: 0 additions & 4 deletions
@@ -81,10 +81,6 @@ class nixlPosixEngine : public nixlBackendEngine {
         return false;
     }

-    bool supportsProgTh() const override {
-        return false;
-    }
-
     nixl_mem_list_t getSupportedMems() const override {
         return {FILE_SEG, DRAM_SEG};
     }
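
With supportsProgTh() gone, three capability flags remain. The updated docs/BackendGuide.md text in this commit states that a network backend must set supportsRemote and supportsNotif, and that supportsNotif is optional for a storage backend that only sets supportsLocal. A minimal sketch of that rule as a consistency check follows; BackendCaps and validateCaps are hypothetical names used for illustration and do not exist in NIXL.

```cpp
#include <cassert>
#include <string>

// Hypothetical plain struct mirroring the three remaining supports* flags.
struct BackendCaps {
    bool supportsLocal;
    bool supportsRemote;
    bool supportsNotif;
};

// Returns an empty string when the flags are consistent with the guide's
// network-backend rule, otherwise a short description of the violation.
inline std::string validateCaps(const BackendCaps &caps) {
    // A backend doing remote (network) transfers must also provide notifications.
    if (caps.supportsRemote && !caps.supportsNotif) {
        return "supportsRemote requires supportsNotif";
    }
    return "";
}

// Usage: flag combinations described in the guide for UCX and GDS.
inline void demoCaps() {
    const BackendCaps ucxLike{true, true, true};    // all flags set
    const BackendCaps gdsLike{true, false, false};  // supportsLocal only
    assert(validateCaps(ucxLike).empty());
    assert(validateCaps(gdsLike).empty());
}
```

An agent could run a check of this kind once at backend creation time, so that an inconsistent plugin fails early instead of at the first remote transfer.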
