-
Notifications
You must be signed in to change notification settings - Fork 176
/
Copy pathREADME
198 lines (164 loc) · 7.59 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
Linux kernel driver for Elastic Fabric Adapter (EFA)
====================================================
Overview
========
Elastic Fabric Adapter (EFA) is a network device that provides reliable userspace
communication and kernel bypass capabilities, targeting more consistent latency
and higher throughput than traditional TCP-based communication.
EFA is first implemented in AWS EC2 instances, and is optimized to cloud-scale
network infrastructure.
EFA brings the scalability, flexibility, and elasticity of cloud to
tightly-coupled applications like HPC and Machine Learning Training, that
would benefit from the lower and more consistent latency and higher throughput.
Applications would use rdma-core (https://github.com/linux-rdma/rdma-core) as the
userspace library to use EFA.
EFA supports datagram send/receive operations and can support RDMA read/write
operations on some of the devices. EFA supports unreliable datagrams (UD) as
well as a new Scalable (unordered) Reliable Datagram protocol (SRD). SRD provides
support for reliable datagrams and more complete error handling than typically
seen with other Reliable Datagram (RD) implementations, but, unlike RD, it does
not support ordering or segmentation.
EFA depends on having ib_core and ib user verbs compiled with the kernel.
User verbs are supported via a dedicated userspace EFA provider in rdma-core,
and kernel verbs are supported through the standard ib_verbs interface combined
with some additional EFA extensions that must be included separately.
Driver distribution
===================
In addition to this repository, EFA driver can be found in upstream Linux kernel
and in various Linux distributions' kernel trees, e.g. Amazon Linux, Ubuntu and
RHEL. As a general approach, features and bug fixes that are relevant and suitable
for mainline kernel, are being updated in both repositories around the same time.
In addition to receiving bug fixes from stable kernel trees, some Linux
distributions may backport EFA driver from more advanced kernel trees into their
kernels. In Amazon Linux case, kernels are expected to have up-to-date EFA driver
shortly after it is released in this repository.
Driver compilation
==================
For list of supported kernels and distributions, please refer to:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-amis
Prerequisites:
Kernel must be compiled with CONFIG_INFINIBAND_USER_ACCESS in Kconfig.
sudo yum update
sudo yum install gcc
sudo yum install kernel-devel-$(uname -r)
Compilation:
Run:
mkdir build
cd build
cmake ..
make
efa.ko is created inside the src/ folder.
EFA supports PCIe peer-to-peer memory access between devices.
Currently supported peer-to-peer architectures are:
* GPUDirect RDMA (GDR)
* NeuronLink RDMA
Peer-to-peer support can be disabled by running
cmake with the following parameter:
cmake -DENABLE_P2P=0 ..
For more information regarding GPUDirect RDMA, visit:
https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
Kernel verbs support can be disabled by running
cmake with the following parameter:
cmake -DENABLE_KVERBS=0 ..
To build EFA RPMs run `make` in the rpm/ folder. Your environment will need to
be setup to build RPMs. The EFA RPM will install the EFA kernel driver source,
setup DKMS in order to build the driver when the kernel is updated, and update
the configuration files to load EFA and its dependencies at boot time.
Driver installation
===================
Loading driver
--------------
modprobe ib_core
modprobe ib_uverbs
insmod efa.ko
For automatic driver start upon the OS boot
sudo vi /etc/modules-load.d/efa.conf
insert "efa" to the file
copy the efa.ko to /lib/modules/$(uname -r)/
sudo depmod
If previous driver was loaded from initramfs - it will have to be
updated as well (i.e. dracut)
Restart the OS (sudo reboot and reconnect)
Supported PCI vendor ID/device IDs
==================================
1d0f:efa0 - EFA used in EC2 virtualized and bare-metal instances.
1d0f:efa1 - EFA used in EC2 virtualized and bare-metal instances.
1d0f:efa2 - EFA used in EC2 virtualized and bare-metal instances.
EFA Source Code Directory Structure (under src/)
================================================
efa_main.c, efa.h - Main Linux kernel driver.
efa_verbs.c - Control verbs implementations.
efa_data_verbs.c - Data path verbs implementations.
efa_com.[ch], efa_com_cmd.[ch] - Management communication layer.
This layer is responsible for the
handling all the management (admin)
communication between the device and the
driver.
efa_common_defs.h - Common definitions for efa_com layer.
efa_admin_defs.h, efa_admin_cmd_defs.h - Definition of EFA management interface.
efa_regs_defs.h - Definition of EFA PCI memory-mapped (MMIO) registers.
efa_io_defs.h - Definition of EFA datapath types.
efa_sysfs.[ch] - Sysfs files.
efa_verbs.h - EFA extension to ib_verbs.h intended for ULPs' use.
efa-abi.h - Kernel driver <-> Userspace provider ABI.
efa_p2p.[ch] - Peer-to-peer memory layer implementation.
efa_gdr.c - GPUDirect RDMA implementation.
nv-p2p.h - NVIDIA GDR API.
efa_neuron.c - Neuron implementation.
neuron_p2p.h - Neuron P2P API.
Management Interface
====================
EFA management interface is exposed by means of:
- PCIe Configuration Space
- Device Registers
- Admin Queue (AQ) and Admin Completion Queue (ACQ)
- Asynchronous Event Notification Queue (AENQ)
AQ is used for submitting management commands, and the
results/responses are reported asynchronously through ACQ.
EFA introduces a small set of management commands.
Most of the management operations are framed in a generic get/set feature
command.
The following admin queue commands are supported:
- Create/Modify/Query/Destroy Queue Pair
- Create/Destroy Completion Queue
- Create/Destroy Memory Region
- Create/Destroy Address Handle
- Allocate/Deallocate Protection Domain
- Create/Destroy Event Queue
- Get Statistics
- Get feature
- Set feature
- Query device
Refer to efa_admin_cmds_defs.h for the list of supported get/set feature
properties.
The Asynchronous Event Notification Queue (AENQ) is a unidirectional
queue used by the EFA device to send to the driver events that cannot
be reported using ACQ. AENQ events are subdivided into groups. Each
group may have multiple syndromes, as shown below:
The events are:
Group Syndrome
Keep-Alive - X -
ACQ and AENQ share the same MSI-X vector.
Interrupt Modes
===============
Management interrupt registration is performed when the Linux kernel
probes the adapter, and it is un-registered when the adapter is
removed.
The management interrupt is named:
efa-mgmnt@pci:<PCI domain:bus:slot.function>
Data Path Interface
===================
I/O operations are based on Queue Pairs (QPs) - Send Queues (SQs) and Receive
Queues (RQs). Each queue has a Completion Queue (CQ) associated with it.
The QPs and CQs are implemented as Work/Completion Queue Elements (WQEs/CQEs)
rings in contiguous physical memory.
The EFA supports Low Latency Queue (LLQ) mode for SQs:
In this mode the userspace provider writes the WQEs directly to the EFA device
memory space, while the packet data resides in the host's memory. The device
uses a dedicated PCI device memory BAR, which is mapped with write-combine
capability.
The RQs reside in the host's memory. The EFA device fetches the EFA RX WQEs
directly from host memory.
The user notifies the EFA device of new WQEs by writing to a dedicated PCI
device memory BAR referred as Doorbells BAR which can be mapped to the
userspace provider.