investigate zero-copy socket writes #1299

Open

wjhun opened this issue Sep 28, 2020 · 2 comments

@wjhun (Contributor) commented Sep 28, 2020

From a cursory look it appears that we could potentially implement zero-copy on socket writes by eliminating the TCP_WRITE_FLAG_COPY flag on calls to tcp_write when SO_ZEROCOPY / MSG_ZEROCOPY is specified. User pages not under the domain of the pagecache are, in a sense, pinned already, and pages within the pagecache could be pinned by taking an extra refcount on the pagecache_page. Implementation of socket error queues would also be necessary to allow completion notification to the application.

This could potentially yield a significant performance benefit in cases such as large static page loads when the service supports zero copy (which requires that user buffers remain unmodified until after sent TCP data is acknowledged), but some further exploration might be necessary to verify that in fact the zero copy path - from lwIP through our existing PV nic drivers - will work as expected. Furthermore, note that SO_ZEROCOPY is a hint to the kernel to use zero-copy if available - with a guarantee that completion notifications will be returned - and not a guarantee that copying will be avoided (so a non-compliant driver could result in use of TCP_WRITE_FLAG_COPY with completion notifications).

https://www.kernel.org/doc/html/v4.15/networking/msg_zerocopy.html
https://blogs.oracle.com/linux/zero-copy-networking-in-uek6
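For context, here is a minimal sketch of the application-side API described in the links above (Linux MSG_ZEROCOPY with a completion notification read from the socket error queue). Error handling is omitted and the fallback defines are only for older libc headers:

```c
#include <errno.h>
#include <linux/errqueue.h>     /* struct sock_extended_err, SO_EE_ORIGIN_ZEROCOPY */
#include <string.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60          /* older libc headers may not define these */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* send one buffer with MSG_ZEROCOPY and wait for its completion
 * notification; buf must stay unmodified until the notification arrives */
static void send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
    send(fd, buf, len, MSG_ZEROCOPY);

    char control[100];
    struct msghdr msg = {0};
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);

    /* reads from the error queue are non-blocking; real code would
     * poll() for POLLERR instead of spinning */
    while (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1 && errno == EAGAIN)
        ;

    struct sock_extended_err serr;
    memcpy(&serr, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(serr));
    /* serr.ee_origin == SO_EE_ORIGIN_ZEROCOPY; [ee_info, ee_data] is the
     * range of completed send() calls; SO_EE_CODE_ZEROCOPY_COPIED in
     * ee_code means the kernel fell back to copying */
}
```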

@francescolavra (Member) commented

Did some testing without the TCP_WRITE_FLAG_COPY flag on calls to tcp_write. The zero-copy path from lwIP through the nic driver does work as expected (at least with the virtio net driver), in the sense that data from the user buffer is correctly sent to the nic, but it doesn't result in an overall performance gain; in fact, I saw a slight performance degradation (on the order of a few percentage points), and that is without the socket error queue messaging, which would likely reduce performance further once added.
The reason for the zero-copy path not bringing performance benefits is that the savings from avoiding memory copying are outweighed by the overhead associated with handling an additional buffer in each network packet: when data is copied, each network packet can be sent as a single contiguous buffer, but when the data is not copied, the packet headers need to be allocated in a separate buffer and then chained to the user data buffer. At the nic driver level, the physical address needs to be retrieved for each buffer of a network packet, and this is relatively expensive.
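As a rough sketch of the difference at the lwIP API level (sock_tcp_send is a hypothetical helper, not the actual Nanos socket code):

```c
#include "lwip/tcp.h"

/* illustrative helper: the only change under discussion is the apiflags
 * argument passed to tcp_write() */
static err_t sock_tcp_send(struct tcp_pcb *pcb, const void *ubuf, u16_t len,
                           int zerocopy)
{
    if (zerocopy)
        /* no TCP_WRITE_FLAG_COPY: lwIP only references ubuf, so each
         * outgoing packet becomes a header pbuf chained to a pbuf that
         * points at user memory, and the driver must map two buffers */
        return tcp_write(pcb, ubuf, len, 0);
    /* default path: lwIP copies ubuf into its own pbuf, so headers and
     * payload go out as one contiguous buffer */
    return tcp_write(pcb, ubuf, len, TCP_WRITE_FLAG_COPY);
}
```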

Below is an annotated ftrace plot obtained when streaming TCP data from Nanos to the host machine, without zero-copy:
[annotated ftrace plot: TCP streaming from Nanos to the host, non-zero-copy]
The time spent in runtime_memcpy is non-negligible (24.6k) but accounts for only a small share of the total.

Below is the ftrace plot obtained with zero-copy:
[ftrace plot: TCP streaming from Nanos to the host, zero-copy]
The time spent in runtime_memcpy is considerably reduced (9.5k), but other functions involved in virtual-to-physical address translation (such as kern_pointer_from_pteaddr, physical_from_virtual_locked and table_find) take more time.

When trying zero-copy transmission on the loopback interface (where no physical addresses are needed), I did see a performance increase (on the order of 10%), but in Linux zero-copy does not apply to the loopback interface.

francescolavra added a commit that referenced this issue May 21, 2021
This commit changes a few lwIP configuration options, following the
hints at https://www.nongnu.org/lwip/2_1_x/optimization.html.
The lwip_htons() and lwip_htonl() macros have been defined so that
byte order inversion operations are executed inline instead of as
function calls (in #1299, the lwip_htons() function shows up as
taking a non-negligible share of CPU time).
Instead of the default checksum algorithm #2, algorithm #3 is now
being used, because it is faster on 64-bit platforms (checksum
calculation on 20-byte data is around 35% faster).
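For reference, a sketch of the kind of lwipopts.h settings the commit message describes; the exact definitions in the Nanos tree may differ:

```c
/* lwipopts.h fragment (sketch) */

/* inline byte-order conversions so lwip_htons()/lwip_htonl() are macros
 * rather than function calls (assumes a little-endian target such as x86-64) */
#define lwip_htons(x) __builtin_bswap16(x)
#define lwip_htonl(x) __builtin_bswap32(x)

/* select checksum algorithm 3 instead of the default 2; per the commit
 * message it is faster on 64-bit platforms */
#define LWIP_CHKSUM_ALGORITHM 3
```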
@francescolavra (Member) commented

I did some more testing with current master, and it is still the case that copying user data to kernel buffers when doing a socket send is more efficient (with Nanos running on qemu and sending TCP data to the local host) than the zero-copy approach. Even though retrieving the physical address from a kernel virtual address is now almost free, this doesn't apply to user buffers, so with zero-copy each network packet being sent incurs a physical_from_virtual() call that involves a table lookup.
Looking at the ftrace output (after applying #1551):

function                 non-zero-copy   zero-copy
total time               2M              2.1M
mcache_alloc             139k            141k
mcache_dealloc           161k            184k
heaplock_alloc           116k            111k
heaplock_dealloc         75k             75k
low_level_output         70k             84k
objcache_alloc           81k             76k
objcache_from_object     50k             58k
objcache_dealloc         43k             44k
physical_from_virtual    8k              20k
runtime_memcpy           33k             17k

The reduction of runtime_memcpy time is comparable to the increase in physical_from_virtual time, plus there is a significant increase in the time taken by mcache and objcache alloc/dealloc functions, which stems from the fact that when sending socket data with zero-copy we have to call pbuf_alloc() (allocating from lwip_heap) twice as many times as with the conventional approach.
But there might be other factors at play which cannot be easily seen with ftrace. With ftrace enabled, even zero-copy on the loopback interface is slightly slower than non-zero-copy, whereas without ftrace, zero-copy on the loopback interface is around 10% faster than non-zero-copy. In any case, on the virtio-net interface I see a slight performance degradation with zero-copy (something like 3-5%) both with and without ftrace.
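To make the pbuf_alloc() doubling mentioned above concrete, here is a rough illustration of how a zero-copy packet ends up as a two-pbuf chain. This is not the actual tcp_write()/tcp_output() code path, just the shape of the result, using standard lwIP pbuf API:

```c
#include "lwip/pbuf.h"

/* illustrative only: build one outgoing packet without copying user data */
static struct pbuf *build_zerocopy_packet(void *ubuf, u16_t len)
{
    /* one pbuf allocation for the protocol headers ... */
    struct pbuf *hdr = pbuf_alloc(PBUF_TRANSPORT, 0, PBUF_RAM);
    /* ... plus one reference-only pbuf wrapping the user data */
    struct pbuf *data = pbuf_alloc(PBUF_RAW, len, PBUF_REF);
    if (hdr == NULL || data == NULL) {
        if (hdr != NULL)
            pbuf_free(hdr);
        if (data != NULL)
            pbuf_free(data);
        return NULL;
    }
    data->payload = ubuf;
    pbuf_cat(hdr, data);    /* the driver now walks a 2-buffer chain and
                             * needs a physical address for each link */
    return hdr;
}
```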

@francescolavra francescolavra removed their assignment Feb 23, 2022