AF_XDP zero-copy makes driver reset #221

Closed
akhota opened this issue May 20, 2022 · 27 comments

Comments

@akhota

akhota commented May 20, 2022

We are trying to compare the performance of an AF_XDP socket with a regular (AF_INET) socket on AWS EC2 instances.
With ena driver version 2.7.1, a driver reset occurs when load is applied in AF_XDP zero-copy mode.
The dmesg command shows the following message:

[Fri May 20 15:22:26 2022] ena 0000:00:06.0: Device reset completed successfully, Driver info: Elastic Network Adapter (ENA) v2.7.1g
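
A quick way to tell whether a run triggered a reset is to count these messages in the kernel log. The following is a minimal sketch and not part of the attached sample program: the helper name count_ena_resets is made up for illustration, and the match string is simply copied from the dmesg line above.

```shell
# count_ena_resets: count ENA reset-completion messages in kernel log text
# read from stdin. The pattern is taken from the dmesg line quoted above.
count_ena_resets() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so the exit status is ignored here.
    grep -c 'ena .*Device reset completed successfully' || true
}

# Example using the line from this report. On a live system one would
# pipe `sudo dmesg` into the function instead.
sample='[Fri May 20 15:22:26 2022] ena 0000:00:06.0: Device reset completed successfully, Driver info: Elastic Network Adapter (ENA) v2.7.1g'
printf '%s\n' "$sample" | count_ena_resets
```

Comparing the count before and after a load test shows whether the test itself caused a reset.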

I have attached a sample program for reproduction and the full output of dmesg.

You can reproduce this behavior by following the steps below:

Server instance

  1. build AF_XDP sample program
$ sudo yum install libbpf libbpf-devel clang llvm gcc bpftool kernel-headers make

$ tar xf process-udp-packet.tar.gz
$ cd process-udp-packet
$ make
  2. configure the NIC

We use eth1 for the test.

$ sudo ethtool -L eth1 combined 1
$ sudo ip link set dev eth1 mtu 3498
  3. run AF_XDP sample program

This program listens on port 13333/udp.

$ sudo ./af_xdp_user -d eth1 --filename ./af_xdp_kern.o -N -z -p

Client instance

  1. install iperf package
$ wget https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/i/iperf-2.1.6-2.el8.x86_64.rpm
$ sudo yum localinstall iperf-2.1.6-2.el8.x86_64.rpm
  2. send UDP packets to the server with iperf, using 128 threads for 180 seconds
$ iperf -c <Server IP address> -p 13333 -u -l 512 -t 180 -P 128

Our environment is:

  • AMI: Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume Type
  • Instance type: c6i.2xlarge
  • ena driver version: 2.7.1g (build from source)
  • eth1 was added to each instance for testing, and both are connected to the same private network.

Both the server and client instances have the same specifications.

Thanks,

@davidarinzon
Contributor

Hi @akhota

Thank you for raising this issue and sharing the logs, we'll look into it and provide feedback.

@akhota
Author

akhota commented Jun 7, 2022

Hi @davidarinzon

We have updated ena driver to 2.7.2 on our instances and retried AF_XDP zero-copy mode, but device reset still occurred.

According to the sar command, the number of packets received by the iperf instance is smaller than the number of packets sent by the AF_XDP server instance. Packets appear to be lost somewhere on the way, and the device reset seems to occur after the packet loss begins.

  • the AF_XDP server instance
$ sar -n DEV 1
(snip)
09:32:41        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
09:32:42           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:32:42         eth0     18.00     11.00      1.16      2.62      0.00      0.00      0.00
09:32:42         eth1  25601.00  25634.00  13850.11  13868.42      0.00      0.00      0.00
(snip)
  • the iperf instance
$ sar -n DEV 1
(snip)
09:32:41        IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
09:32:42           lo      0.00      0.00      0.00      0.00      0.00      0.00      0.00
09:32:42         eth0      4.00      4.00      0.23      0.80      0.00      0.00      0.00
09:32:42         eth1  16835.00  25602.00   8877.43  13851.08      0.00      0.00      0.00
(snip)
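
For reference, the loss implied by the two sar samples above can be worked out as follows. This is a small sketch: loss_percent is a hypothetical helper, and it assumes that the server's eth1 txpck/s and the iperf instance's eth1 rxpck/s describe the same traffic over the same one-second interval.

```shell
# loss_percent TX_PPS RX_PPS
# Prints the percentage of transmitted packets that were not received.
loss_percent() {
    awk -v tx="$1" -v rx="$2" 'BEGIN {
        if (tx <= 0) { print "n/a"; exit }
        printf "%.1f\n", (tx - rx) * 100 / tx
    }'
}

# txpck/s of eth1 on the AF_XDP server, rxpck/s of eth1 on the iperf instance
loss_percent 25634.00 16835.00
```

With these figures, roughly a third of the packets sent by the server do not arrive at the iperf instance.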

Thanks,

@ShayAgros
Contributor

@akhota thanks a lot for testing our AF XDP support implementation.

After doing some additional tests we see that the issue indeed reproduces on our machines as well and we're actively working on root-causing and solving the issue.

Will update this ticket soon with a possible solution, sorry for the inconvenience

ShayAgros added a commit to ShayAgros/amzn-drivers that referenced this issue Jun 17, 2022
Due to several bugs discovered in the feature, it is marked
experimental. The feature can still be used if the driver is compiled
with the TEST_AF_XDP flag set. E.g.
    TEST_AF_XDP=1 make

Please follow amzn#221 issue
for an experimental fix for AF XDP issues.

Signed-off-by: Shay Agroskin <[email protected]>
@akhota
Author

akhota commented Jun 23, 2022

Hi @ShayAgros,
Is there something that we can help you with?
For example, additional testing in our environment.

@ShayAgros
Contributor

Hi,
I think I've identified all the issues with it and I'm currently testing out internally a fix for these issues.
If the current AF XDP issues block your progress, please write to me at [email protected] and I can provide you with a tentative fix for the issues at hand.

Once the testing phase ends I'll post the fix on this thread. We also hope that the next driver release will include a new version of AF XDP support that fixes some of the wrong design assumptions made in this version. I'm sorry for the inconvenience caused by this buggy experience.

@akhota
Author

akhota commented Jun 27, 2022

Hi @ShayAgros,

OK, we are looking forward to the next version.
Thanks,

@ShayAgros
Contributor

Hi,
The AF XDP design would change by the next version (2.8) since some assumptions made with the current design were discovered to be incorrect. We're sorry for the great inconvenience we caused by introducing this incomplete implementation.

If you'd still like to test the AF XDP implementation, you can use the patch
0001-linux-ena-Fix-some-bugs-in-AF-XDP-support.patch.txt

on top of the latest current version (2.7.3) (e.g. using git am ./0001-linux-ena-Fix-some-bugs-in-AF-XDP-support.patch.txt ). I modified the driver version to 2.7.4 to allow distinguishing the modified driver.

By default the driver would compile without native (zero-copy) AF XDP support. To enable it please specify TEST_AF_XDP envar when compiling the driver, e.g. TEST_AF_XDP=1 make.
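
After rebuilding and reloading the module, it can be useful to confirm that the patched build is the one actually bound to the interface. The sketch below assumes that the driver version is visible in the `version:` field of `ethtool -i` output; ena_version is a made-up helper name, and the sample input is illustrative rather than captured from a real instance.

```shell
# ena_version: print the "version:" field from `ethtool -i`-style output
# read from stdin.
ena_version() {
    awk -F': *' '$1 == "version" { print $2; exit }'
}

# Illustrative input only; the exact version string reported by the
# patched driver is an assumption.
printf 'driver: ena\nversion: 2.7.4g\n' | ena_version

# On an instance one would run instead:
#   ethtool -i eth1 | ena_version
```

If the reported version does not start with 2.7.4, the old module is probably still loaded.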

Please note that the AF XDP is currently in testing phase. We tested it thoroughly with this patch, but if still some issues are discovered or if you have a question then feel free to comment on this thread or write me to my email (listed above)

@akhota
Author

akhota commented Jul 1, 2022

Hi @ShayAgros,

Thank you for the patch. We will apply your patch to our instances until the next version is released, and retry the AF XDP zero-copy performance test.

> Please note that the AF XDP is currently in testing phase. We tested it thoroughly with this patch, but if still some issues are discovered or if you have a question then feel free to comment on this thread or write me to my email (listed above)

OK, I understand.

Thanks,

@akhota
Author

akhota commented Jul 13, 2022

Hi,
Now we can develop and test our AF XDP programs in the AWS environment. Thank you very much!

I have updated the ena driver and retried performance test in the AF XDP native zero-copy mode.
Device reset did not occur in version 2.7.4. On the other hand, it seems that the performance of version 2.7.4 native zero-copy mode is lower than version 2.7.2 native copy mode.

The summary of results is as follows:

  • ena version 2.7.4 native zero-copy mode
packet size (bytes)   Tx rate (Gbps)   total pps (Mpps)
                 64             0.45               0.67
                128             0.78               0.66
                256             1.45               0.66
                512             2.81               0.66
               1024             3.26               0.39
               2048             4.75               0.29
               3498             7.05               0.25
  • ena version 2.7.2 native copy mode
packet size (bytes)   Tx rate (Gbps)   total pps (Mpps)
                 64             0.85               1.26
                128             1.50               1.27
                256             2.80               1.27
                512             5.31               1.25
               1024             9.69               1.16
               2048            12.61               0.76
               3498            12.58               0.45

(We used TRex for measurement and packet generation.)
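
To make the gap easier to read, the two result sets can be put side by side. The following is only a sketch over the numbers reported in this comment. The helper names zc_vs_copy and wire_gbps are hypothetical, and wire_gbps assumes that the Tx rate is a layer-1 rate, i.e. that each packet carries 20 extra bytes of preamble and inter-frame gap on the wire.

```shell
# zc_vs_copy: read "size zero_copy_gbps copy_gbps" lines on stdin and print
# zero-copy throughput as a percentage of copy-mode throughput.
zc_vs_copy() {
    awk 'NF == 3 && $3 > 0 { printf "%d %.0f%%\n", $1, $2 * 100 / $3 }'
}

# wire_gbps SIZE_BYTES MPPS
# Estimate the layer-1 rate in Gbps from packet size and packet rate.
# The 20-byte per-packet overhead is an assumption (see above).
wire_gbps() {
    awk -v size="$1" -v mpps="$2" 'BEGIN {
        printf "%.2f\n", mpps * (size + 20) * 8 / 1000
    }'
}

# Columns: packet size, 2.7.4 zero-copy Gbps, 2.7.2 copy Gbps
zc_vs_copy <<'EOF'
64 0.45 0.85
128 0.78 1.50
256 1.45 2.80
512 2.81 5.31
1024 3.26 9.69
2048 4.75 12.61
3498 7.05 12.58
EOF

# Cross-check one row: 512-byte packets at 0.66 Mpps
wire_gbps 512 0.66
```

On these numbers zero-copy reaches roughly 34% to 56% of the copy-mode Tx rate, and the cross-check for the 512-byte row agrees with the reported 2.81 Gbps.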

We suppose the patch 0001-linux-ena-Fix-some-bugs-in-AF-XDP-support.patch.txt prevents device resets, but reduces performance of the AF XDP native zero-copy mode.
Will the next version 2.8 improve performance?

@davidarinzon
Contributor

Hi @akhota

Thank you for performing the checks on the provided patch and summarizing the results.
We will analyze them and provide additional information.

@Li-Xiaoyun

Hi, we were seeing the same device reset issue, but this thread hasn't been updated for a while. Is there any newer AF_XDP patch that performs better than copy mode?

@davidarinzon
Contributor

Hi @Li-Xiaoyun,

(Also answering @akhota)
We're in the process of investigating the AF_XDP performance.

If needed, the patch discussed in this comment was adjusted for 2.8.0 release and is available here.

Thanks

@Li-Xiaoyun

Thanks for the info. @davidarinzon

@Li-Xiaoyun

Li-Xiaoyun commented Oct 19, 2022

Hi @davidarinzon, thanks for the patch.
Just for your reference, I've tested it in our environment. It's a simple setup with one instance running TRex and another running the DPDK AF_XDP PMD. Both instances are c5n.4xlarge. The AF_XDP PMD uses 3 queues in non-busy-polling mode.
Zero-copy mode runs stably with no device resets, but unfortunately its performance is still worse than copy mode.
With a packet size of 1420, copy mode reaches ~15 Gbps while zero-copy reaches ~11 Gbps (2.8.0 + patch).

@nirvana-msu

@Li-Xiaoyun @akhota just wondering if either of you also measured latency (as opposed to throughput) of the zero-copy mode patch vs copy mode?

@davidarinzon have you had any success investigating the mentioned performance issue?

@Li-Xiaoyun

I didn't measure latency.

@pstavirs

Any update on the AF_XDP zero copy changes? Looks like they've not yet been merged

@akiyano
Contributor

akiyano commented Dec 14, 2023

Hi @pstavirs,

You are correct they have not been merged.
We don't currently have a specific timeline that we can share with you.
Once they are released we will update this ticket.

Thanks,
Arthur

@oicnysa

oicnysa commented Mar 12, 2024

Hi, any updates?

@davidarinzon
Contributor

Hi @oicnysa

Thanks for reaching out.
We are working on this, but have no specific timeline to share at the moment.
Please stay tuned for updates.

@oicnysa

oicnysa commented Mar 22, 2024

@akiyano maybe you have some specific branch which we could test?

@moscovium115

@oicnysa I would be interested in that as well! It would be great to test it out with openonload.

@akiyano
Contributor

akiyano commented Mar 23, 2024

@oicnysa and @moscovium115,
It's great to know there is interest in AF_XDP and we are taking this into account.
However we can't currently share anything new.
As @davidarinzon already said, please stay tuned for updates.

@davidarinzon
Contributor

davidarinzon commented Mar 28, 2024

For those who want to experiment with AF_XDP, the original patch posted in this comment was developed on top of 2.8.0 release.
An updated patch on top of the latest driver release (2.12.0) is available here, please apply it when using AF_XDP.
(Please note that the patch has been updated on 04/16/24).

@moscovium115

@davidarinzon Amazing, thank you

@davidarinzon
Contributor

Hi,

Official AF_XDP support was released with 2.13.0g.
You no longer need to use the patches.

@akhota and others, please let us know if you face any issues.

@davidarinzon
Contributor

Resolving this ticket, please re-open it in case you face any new issues with AF_XDP
