How a NIC receives a packet: PC Engines apu4d4 Packet-per-Second Throughput
While playing with the Data Plane Development Kit (DPDK), I wondered what the normal (kernel-based, non-DPDK) packet-per-second (pps) throughput of the PC Engines apu4d4 is, which has 4 Intel i211AT-based 1Gbps LAN ports. This experiment is also helpful for learning a few important concepts regarding how NICs (network interface cards, the i211AT in this post) work when receiving packets.
To reach the line rate (1Gbps) with small packets, e.g. 84 bytes on the wire (the minimum 64-byte frame plus 8 bytes of preamble/SFD and 12 bytes of inter-frame gap), a rate of 1,000,000,000 bits/s / (84 bytes * 8 bits/byte) ≈ 1.488Mpps
is needed. See more about this calculation here. Usually the line rate can only be achieved with medium/large packets, and this is also true for pure network devices like switches, routers and firewalls. That is why they always specify the pps throughput on their datasheets as well. For example, the 1Gbps switch I am using (HP 1920 28 Ports, JG924A) gives 41.7Mpps (64-byte packets); this is roughly 21.35Gbps, which is less than the full line-rate switching capacity given as 56Gbps (28 ports x 2 -duplex- x 1Gbps).
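As a sanity check on this arithmetic, here is a small C snippet (my illustration, not part of the original post) that computes the maximum pps at 1Gbps for a few frame sizes, adding the 20 bytes of preamble/SFD and inter-frame gap that every frame occupies on the wire. For the smallest frame (64 bytes, 84 on the wire) it prints 1.488 Mpps, matching the figure above.

/* Maximum packet rate at 1Gbps line rate for a few frame sizes. */
#include <stdio.h>

int main(void)
{
    const double line_rate_bps = 1e9;               /* 1 Gbit/s */
    const double overhead_bytes = 8 + 12;           /* preamble+SFD (8), IFG (12) */
    const double frame_bytes[] = { 64, 512, 1518 }; /* frame sizes without overhead */

    for (int i = 0; i < 3; i++) {
        double wire_bytes = frame_bytes[i] + overhead_bytes;
        double pps = line_rate_bps / (wire_bytes * 8);
        printf("%4.0f-byte frames: %.3f Mpps\n", frame_bytes[i], pps / 1e6);
    }
    return 0;
}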
This post is basically an application of a Cloudflare blog post titled How to receive a million packets per second. Although I tried other small applications, including ones I wrote, I am using the same sender and receiver applications used in that blog post here as well. The source code of these applications is here. The source includes only a clang build file, and for gcc, you can use this:
udpsender: udpsender.c net.c
gcc -pthread -o udpsender udpsender.c net.c
udpreceiver: udpreceiver1.c net.c
gcc -pthread -o udpreceiver udpreceiver1.c net.c
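For reference, the per-thread send loop of such a sender is conceptually very simple: open a UDP socket, connect it to the target, and send small packets as fast as possible. The sketch below is my illustration, not the original udpsender.c (which is more efficient because it batches packets, as its "send buffer 1024 packets" startup message suggests), but the idea is the same.

/* Minimal single-target UDP sender loop (illustration only, not udpsender.c). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    char payload[32] = {0};                      /* small fixed-size payload */
    struct sockaddr_in dst;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(4321);
    inet_pton(AF_INET, "192.168.6.2", &dst.sin_addr);

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    connect(fd, (struct sockaddr *)&dst, sizeof(dst)); /* fix the destination */

    for (;;)                                     /* send as fast as possible */
        send(fd, payload, sizeof(payload), 0);
}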
I am using Ubuntu 20.04 on two apu4d4 boards (named mars -sender- and venus -receiver-) with their enp4s0 Ethernet interfaces directly connected with a very short Cat6 cable.
1x Sender, 1x Receiver
Here are the configurations:
mars, sender, is at 192.168.6.1:
mete@mars:~/dump/how-to-receive-a-million-packets$ ip -4 addr
...
5: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
inet 192.168.6.1/24 brd 192.168.6.255 scope global enp4s0
valid_lft forever preferred_lft forever
venus, receiver, is at 192.168.6.2:
mete@venus:~/dump/how-to-receive-a-million-packets$ ip -4 addr
...
5: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
inet 192.168.6.2/24 brd 192.168.6.255 scope global enp4s0
valid_lft forever preferred_lft forever
Running the sender on mars:
mete@mars:~/dump/how-to-receive-a-million-packets$ ./udpsender 192.168.6.2:4321
[*] Sending to 192.168.6.2:4321, send buffer 1024 packets
and receiver on venus:
mete@venus:~/dump/how-to-receive-a-million-packets$ ./udpreceiver
[*] Starting udpreceiver on 0.0.0.0:4321, recv buffer 4KiB
0.105M pps 3.195MiB / 26.804Mb
0.105M pps 3.216MiB / 26.982Mb
0.106M pps 3.225MiB / 27.051Mb
0.105M pps 3.217MiB / 26.986Mb
...
so it is around 105Kpps without doing anything special.
Looking at how much CPU they use (output from top -H1), on mars:
%Cpu2 : 0.7 us, 99.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
on venus:
%Cpu3 : 1.1 us, 15.7 sy, 0.0 ni, 22.8 id, 0.0 wa, 0.0 hi, 60.4 si, 0.0 st
My take on this is that there is very little done in user space (us). The sender spends most of the time in kernel space (sy). The receiver spends most of the time in softirq (si) and then in kernel space (sy). Also, the receiver does not use 100% of a core, but the sender does.
2x Sender, 1x Receiver
So I guess at least the sender may benefit from using multiple processes/threads; let's try 2 threads.
mete@mars:~/dump/how-to-receive-a-million-packets$ ./udpsender 192.168.6.2:4321 192.168.6.2:4321
[*] Sending to 192.168.6.2:4321, send buffer 1024 packets
[*] Sending to 192.168.6.2:4321, send buffer 1024 packets
The udpsender application basically launches one thread per target given on the command line (2 here), each sending packets to its target. This can be seen in the top -H1 output:
%Cpu0 : 0.7 us, 99.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us,100.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
PID USER %CPU COMMAND
1618 mete 99.7 udpsender
1619 mete 99.7 udpsender
Both threads are still using 100% of their cores. What about the receiver now:
0.161M pps 4.919MiB / 41.262Mb
0.160M pps 4.895MiB / 41.065Mb
0.163M pps 4.987MiB / 41.832Mb
0.166M pps 5.078MiB / 42.600Mb
0.156M pps 4.773MiB / 40.042Mb
0.159M pps 4.839MiB / 40.589Mb
0.156M pps 4.748MiB / 39.827Mb
there is a bit of variation, but it is around 155-165Kpps now, and the CPU usage is:
%Cpu3 : 0.4 us, 24.6 sy, 0.0 ni, 6.7 id, 0.0 wa, 0.0 hi, 68.3 si, 0.0 st
so it is using a bit more CPU than before (roughly 93% now compared to roughly 77% before, going by the idle column). Probably with more than 2 sender threads, 2 receiver threads will be needed as well.
How does a NIC receive a network packet?
At this point, a little knowledge about how a NIC works is very useful. The basic steps of receiving a packet are:
- NIC receives a packet (e.g. through twisted pair)
- NIC checks if the packet is well-formed (checks FCS) and if the packet is targeted for itself (checks MAC)
- NIC uses DMA to copy the packet (data) to its ring/queue/buffer (in the system memory) -these terms might be used interchangeably-
- NIC sends an interrupt
- Driver (interrupt handler) polls and gets the packet (data) from the same (ring) buffer
- Driver uses the data for further processing (sending it to the kernel networking stack, then maybe sends it to user space etc.)
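To make steps 3-5 a bit more concrete, here is a purely conceptual sketch of the receive descriptor ring and of how the interrupt (or poll) handler drains it. The names and structures are invented for illustration; this is not how the igb driver for the i211 is actually written.

/* Conceptual RX descriptor ring (illustration only). */
#include <stdio.h>

#define RING_SIZE 256                 /* same as the default RX ring size seen later */

struct rx_desc {
    unsigned char *buffer;            /* DMA buffer the NIC writes the frame into */
    int length;                       /* frame length, filled in by the NIC       */
    int done;                         /* set by the NIC when the frame is ready   */
};

static struct rx_desc rx_ring[RING_SIZE];
static int next_to_clean;             /* driver's read position in the ring       */

static void hand_to_network_stack(unsigned char *data, int len)
{
    /* stand-in for passing the packet up to the kernel networking stack */
    printf("received %d bytes starting at %p\n", len, (void *)data);
}

/* Called from the interrupt handler (or a NAPI-style poll loop): consume every
 * completed descriptor, then hand it back to the NIC for reuse. */
void rx_poll(void)
{
    while (rx_ring[next_to_clean].done) {
        struct rx_desc *d = &rx_ring[next_to_clean];
        hand_to_network_stack(d->buffer, d->length);
        d->done = 0;
        next_to_clean = (next_to_clean + 1) % RING_SIZE;
    }
}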
There are a few places where an optimization can take place.
- A single port on a NIC can only receive one packet at a time by the nature of Ethernet protocols. You can do parallel work only by having multiple ports/multiple NICs.
- Offload: As the NIC already checks the FCS and the MAC, it can also do other things in hardware, such as verifying (and calculating during transmit) IP/TCP/UDP checksums; this is called checksum offload.
- Queues and Receive Side Scaling (RSS): NIC can use multiple queues even for a single port, and these queues can be processed by different cores.
- Interrupt Moderation: NIC can decrease or moderate the number of interrupts by grouping multiple packet receive events into a single interrupt.
- There can be various optimizations that can be done in kernel and in user space to speed up packet processing.
- Data Plane Development Kit (DPDK): Both interrupt-based processing and the kernel stack can be eliminated by using polling drivers in user space. This is what DPDK is used for, but it is not the topic of this post.
Offload
The basic offload mechanisms are usually enabled by default; advanced ones might need to be enabled explicitly. This can be checked with ethtool -k enp4s0.
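What checksum offload saves the CPU from doing is essentially the 16-bit ones'-complement (Internet) checksum of RFC 1071 that IP, TCP and UDP use. The function below is a generic sketch of that calculation (my illustration, not code from the post or from the igb driver):

/* RFC 1071 Internet checksum over an arbitrary buffer. */
#include <stdint.h>
#include <stddef.h>

uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {                          /* sum 16-bit words            */
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len)                                   /* odd trailing byte           */
        sum += (uint32_t)p[0] << 8;
    while (sum >> 16)                          /* fold the carries back in    */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;                     /* ones' complement of the sum */
}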
Interrupt Moderation
While receiving packets, if I check the interrupts:
mete@venus:~$ cat /proc/interrupts | grep enp4s0
56: 0 1 0 0 PCI-MSI 2097152-edge enp4s0
57: 0 0 417 266296679 PCI-MSI 2097153-edge enp4s0-rx-0
58: 980 205 8093 7 PCI-MSI 2097154-edge enp4s0-rx-1
59: 2525 410 6187 204 PCI-MSI 2097155-edge enp4s0-tx-0
60: 1877 348 7462 85 PCI-MSI 2097156-edge enp4s0-tx-1
if I monitor this every second with watch -n 1, the interrupt count for enp4s0-rx-0 (the fourth CPU column, CPU3) increases by approx. 16K. This is obviously less than 160Kpps, so there is some kind of interrupt moderation already in place. Interrupt moderation (coalescing) can be done in two ways: either by counting a number of packets or by waiting a specific period of time. It can be checked with:
mete@venus:~$ sudo ethtool -c enp4s0
Coalesce parameters for enp4s0:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
rx-usecs: 3
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0
tx-usecs: 3
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0
rx-usecs-low: 0
rx-frames-low: 0
tx-usecs-low: 0
tx-frames-low: 0
rx-usecs-high: 0
rx-frames-high: 0
tx-usecs-high: 0
tx-frames-high: 0
This is the default setting (I think). So there is a setting of 3 microseconds: anything received within 3 microseconds does not cause an interrupt immediately. Let's change it to 0, effectively disabling interrupt moderation.
mete@venus:~$ sudo ethtool -C enp4s0 rx-usecs 0
after doing this, the receive throughput decreases immediately to approx. 50Kpps:
0.047M pps 1.420MiB / 11.910Mb
0.044M pps 1.331MiB / 11.169Mb
0.041M pps 1.252MiB / 10.500Mb
0.057M pps 1.725MiB / 14.473Mb
0.044M pps 1.348MiB / 11.310Mb
0.041M pps 1.236MiB / 10.371Mb
and checking /proc/interrupts, the number of interrupts now increases at roughly the same rate as the pps figure. So every packet received generates an interrupt. It is possible to change this to values other than 3 (though it seems not every value is accepted), but I could not find anything better than 3, so I set it back to 3. Also, the i211 does not allow setting rx-frames.
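ethtool classically talks to the driver through the SIOCETHTOOL ioctl (newer versions can also use netlink), so the same coalescing parameters can be read from a small program as well. This is a sketch with minimal error handling; the interface name enp4s0 is the one used throughout this post.

/* Read interrupt coalescing settings, roughly what 'ethtool -c enp4s0' shows. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
    struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works for the ioctl */

    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "enp4s0", IFNAMSIZ - 1);
    ifr.ifr_data = (void *)&ec;

    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("rx-usecs: %u rx-frames: %u\n",
               ec.rx_coalesce_usecs, ec.rx_max_coalesced_frames);
    close(fd);
    return 0;
}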
Queues and RSS
The /proc/interrupts output above lists rx-0, rx-1, tx-0 and tx-1 as sources; these are actually the queues. They can be seen with ethtool:
mete@venus:~$ sudo ethtool -l enp4s0
Channel parameters for enp4s0:
Pre-set maximums:
RX: 0
TX: 0
Other: 1
Combined: 2
Current hardware settings:
RX: 0
TX: 0
Other: 1
Combined: 2
The maximum is 2 combined queues, and 2 combined queues is also the current setting. So there is not much I can do in terms of the number of queues, but this opens another topic: how these queues are used. Let's first check if RSS is enabled.
$ sudo ethtool -k enp4s0 | grep hashing
receive-hashing: on
Now let's check how the load (the packets received) is distributed among these queues:
mete@venus:~$ sudo ethtool -x enp4s0
RX flow hash indirection table for enp4s0 with 2 RX ring(s):
0: 0 0 0 0 0 0 0 0
8: 0 0 0 0 0 0 0 0
16: 0 0 0 0 0 0 0 0
24: 0 0 0 0 0 0 0 0
32: 0 0 0 0 0 0 0 0
40: 0 0 0 0 0 0 0 0
48: 0 0 0 0 0 0 0 0
56: 0 0 0 0 0 0 0 0
64: 1 1 1 1 1 1 1 1
72: 1 1 1 1 1 1 1 1
80: 1 1 1 1 1 1 1 1
88: 1 1 1 1 1 1 1 1
96: 1 1 1 1 1 1 1 1
104: 1 1 1 1 1 1 1 1
112: 1 1 1 1 1 1 1 1
120: 1 1 1 1 1 1 1 1
RSS hash key:
Operation not supported
RSS hash function:
toeplitz: on
xor: off
crc32: off
This table is called the indirection table, and it maps a hash value to a queue number. Because there are only 2 rx queues on this NIC, the table contains only 0 and 1, indicating the rx queue. The table has 128 entries (indices 0-127); each row starts at a multiple of 8 and the columns are the eight consecutive entries that follow. So at the moment hash values 0-63 are mapped to rx queue 0 and hash values 64-127 are mapped to rx queue 1. What is this hash? As shown in the output above, there is a line toeplitz: on; this is the Toeplitz hash algorithm -to be honest, this is the first time I have heard of it, but it seems it is widely used for this purpose-. The algorithm is very simple, essentially a matrix multiplication.
Naturally you might say that a hash function's output is usually much bigger than this, and indeed only certain bits (the least significant ones or similar) of the actual hash output are used to index the indirection table. The input to the hash function differs per packet type, and it can be seen with:
mete@venus:~$ sudo ethtool -n enp4s0 rx-flow-hash tcp4
TCP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
L4 bytes 0 & 1 [TCP/UDP src port]
L4 bytes 2 & 3 [TCP/UDP dst port]
mete@venus:~$ sudo ethtool -n enp4s0 rx-flow-hash udp4
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
So the hash for tcp4 (TCP over IPv4) uses the IP source and destination addresses and also the source and destination ports; udp4 uses only the addresses. These can be modified with ethtool -N.
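To make the "matrix multiplication" concrete, below is a sketch of the Toeplitz hash following the generic RSS description (for every set bit of the input, XOR the current 32-bit window of a secret key into the result, sliding the window one bit per input bit), plus the indirection-table lookup. The function names and the use of the low 7 bits to index the 128-entry table are my illustration, not something taken from the i211 datasheet or the igb driver.

/* Toeplitz hash over an input (e.g. for udp4: IP SA | IP DA, 8 bytes). */
#include <stdint.h>
#include <stddef.h>

uint32_t toeplitz_hash(const uint8_t *key, size_t keylen,
                       const uint8_t *data, size_t datalen)
{
    uint32_t hash = 0;
    /* 64-bit shift register: the current 32 key bits live in the upper half */
    uint64_t window = (uint64_t)key[0] << 56 | (uint64_t)key[1] << 48 |
                      (uint64_t)key[2] << 40 | (uint64_t)key[3] << 32;
    size_t next = 4;                       /* next key byte to shift in */

    for (size_t i = 0; i < datalen; i++) {
        if (next < keylen)                 /* refill the low bits of the window */
            window |= (uint64_t)key[next] << 24;
        next++;
        for (int b = 7; b >= 0; b--) {
            if (data[i] & (1u << b))       /* for every set bit of the input... */
                hash ^= (uint32_t)(window >> 32);  /* ...XOR in the key window  */
            window <<= 1;                  /* slide the key one bit to the left */
        }
    }
    return hash;
}

/* The NIC then picks the rx queue with something like this (128 entries,
 * low 7 bits of the hash), which is the table printed by ethtool -x above. */
int rx_queue(uint32_t hash, const uint8_t indirection_table[128])
{
    return indirection_table[hash & 0x7f];
}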
A small detail: with message signaled interrupts (MSI), only the CPU (core) that handles the queue is interrupted. So, as seen in /proc/interrupts, the interrupts for different queues have different interrupt numbers (although there is no physical interrupt line going from the NIC to the CPU; this is handled within the PCIe bus).
So there are two RX queues. I am sending UDP packets, so the hash function uses only the source and destination addresses, but the addresses are the same. Let's check the per-queue statistics (remember there are 2 sender threads and 1 receiver thread):
mete@venus:~$ sudo ethtool -S enp4s0 | grep rx_queue_[01]_packets
rx_queue_0_packets: 3808489796
rx_queue_1_packets: 0
Only one of the queues is used, as expected. So even if I run two receiver threads, one of the queues will stay empty, and it is not going to give much benefit.
2x Sender, 2x Receiver
In this example, I can use both of the queues by using different source or destination addresses, so instead of only one, I give 4 IPs (192.168.6.2, .3, .4 and .5) to the receiver (added with ip addr add 192.168.6.3/24 dev enp4s0 and so on).
mete@venus:~$ ip -4 addr
...
5: enp4s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
inet 192.168.6.2/24 brd 192.168.6.255 scope global enp4s0
valid_lft forever preferred_lft forever
inet 192.168.6.3/24 brd 192.168.6.255 scope global secondary enp4s0
valid_lft forever preferred_lft forever
inet 192.168.6.4/24 brd 192.168.6.255 scope global secondary enp4s0
valid_lft forever preferred_lft forever
inet 192.168.6.5/24 brd 192.168.6.255 scope global secondary enp4s0
valid_lft forever preferred_lft forever
Now starting the receiver with 2 threads:
mete@venus:~/dump/how-to-receive-a-million-packets$ ./udpreceiver 0.0.0.0:4321 2
[*] Starting udpreceiver on 0.0.0.0:4321, recv buffer 4KiB
and starting the sender with two different target IPs (and each target is handled by a different thread):
mete@mars:~/dump/how-to-receive-a-million-packets$ ./udpsender 192.168.6.2:4321 192.168.6.3:4321
[*] Sending to 192.168.6.2:4321, send buffer 1024 packets
[*] Sending to 192.168.6.3:4321, send buffer 1024 packets
Both of these start two threads, which can be checked with top -H1. I see around 170Kpps. However, this is still not what I wanted, because if I check the per-queue statistics again, rx_queue_1 is not receiving anything. This is because of the indirection table: its top half pointed to rx queue 0 and its bottom half to rx queue 1, and it seems the destination IP addresses used here hash into the same half (I also checked other IPs and the result is the same). The solution is to modify the indirection table with an even spread, which can be achieved with this:
mete@venus:~$ sudo ethtool -X enp4s0 equal 2
mete@venus:~$ sudo ethtool -x enp4s0
RX flow hash indirection table for enp4s0 with 2 RX ring(s):
0: 0 1 0 1 0 1 0 1
8: 0 1 0 1 0 1 0 1
16: 0 1 0 1 0 1 0 1
24: 0 1 0 1 0 1 0 1
32: 0 1 0 1 0 1 0 1
40: 0 1 0 1 0 1 0 1
48: 0 1 0 1 0 1 0 1
56: 0 1 0 1 0 1 0 1
64: 0 1 0 1 0 1 0 1
72: 0 1 0 1 0 1 0 1
80: 0 1 0 1 0 1 0 1
88: 0 1 0 1 0 1 0 1
96: 0 1 0 1 0 1 0 1
104: 0 1 0 1 0 1 0 1
112: 0 1 0 1 0 1 0 1
120: 0 1 0 1 0 1 0 1
RSS hash key:
Operation not supported
RSS hash function:
toeplitz: on
xor: off
crc32: off
Now if I run the sender with adjacent IPs (e.g. 192.168.6.2 and 192.168.6.3), the packets go to different queues. The pps I see is still around 160-170Kpps, but now the number of senders can be increased.
4x Sender, 2x Receiver
Let's try 4 sender threads:
mete@mars:~/dump/how-to-receive-a-million-packets$ ./udpsender 192.168.6.2:4321 192.168.6.3:4321 192.168.6.4:4321 192.168.6.5:4321
[*] Sending to 192.168.6.2:4321, send buffer 1024 packets
[*] Sending to 192.168.6.3:4321, send buffer 1024 packets
[*] Sending to 192.168.6.4:4321, send buffer 1024 packets
[*] Sending to 192.168.6.5:4321, send buffer 1024 packets
now the receiver shows:
0.256M pps 7.823MiB / 65.622Mb
0.261M pps 7.976MiB / 66.910Mb
0.273M pps 8.333MiB / 69.900Mb
0.263M pps 8.014MiB / 67.228Mb
0.270M pps 8.250MiB / 69.207Mb
0.258M pps 7.859MiB / 65.926Mb
over 250Kpps. If I check the sender, two of the cores are around 100%, so I think there is still some room for improvement. Increasing this beyond 4 does not help, as there are only 4 cores on the CPU.
Queue Length
I did not check the size (length) of the queues before. This can be checked with:
mete@venus:~$ sudo ethtool -g enp4s0
Ring parameters for enp4s0:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 256
RX Mini: 0
RX Jumbo: 0
TX: 256
I tried changing this (e.g. with ethtool -G enp4s0 rx 4096), but it does not change the result much.
Final Result
Compared to the most straightforward setup (around 105Kpps), by:
- using 4 sender threads
- utilizing both rx queues on the receiver
it is possible to reach more than 250Kpps. In the blog post mentioned at the beginning, the author reaches over 1Mpps, but on a 24-core CPU with a 10G NIC that has 11 rx queues. So 250Kpps looks to me like a good figure for this embedded system.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.