investigate zero-copy socket writes #1299

Open

wjhun opened this issue Sep 28, 2020 · 2 comments

@wjhun (Contributor) commented Sep 28, 2020

From a cursory look it appears that we could potentially implement zero-copy on socket writes by eliminating the TCP_WRITE_FLAG_COPY flag on calls to tcp_write when SO_ZEROCOPY / MSG_ZEROCOPY is specified. User pages not under the domain of the pagecache are, in a sense, pinned already, and pages within the pagecache could be pinned by taking an extra refcount on the pagecache_page. Implementation of socket error queues would also be necessary to allow completion notification to the application.

This could potentially yield a significant performance benefit in cases such as large static page loads when the service supports zero copy (which requires that user buffers remain unmodified until after sent TCP data is acknowledged), but some further exploration might be necessary to verify that in fact the zero copy path - from lwIP through our existing PV nic drivers - will work as expected. Furthermore, note that SO_ZEROCOPY is a hint to the kernel to use zero-copy if available - with a guarantee that completion notifications will be returned - and not a guarantee that copying will be avoided (so a non-compliant driver could result in use of TCP_WRITE_FLAG_COPY with completion notifications).

https://www.kernel.org/doc/html/v4.15/networking/msg_zerocopy.html
https://blogs.oracle.com/linux/zero-copy-networking-in-uek6
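For context, here is a minimal sketch of the application-side API described in the links above (Linux MSG_ZEROCOPY with a completion notification read from the socket error queue). Error handling is omitted and the fallback defines are only for older libc headers:

```c
#include <errno.h>
#include <linux/errqueue.h>     /* struct sock_extended_err, SO_EE_ORIGIN_ZEROCOPY */
#include <string.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60          /* older libc headers may not define these */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* send one buffer with MSG_ZEROCOPY and wait for its completion
 * notification; buf must stay unmodified until the notification arrives */
static void send_zerocopy(int fd, const void *buf, size_t len)
{
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
    send(fd, buf, len, MSG_ZEROCOPY);

    char control[100];
    struct msghdr msg = {0};
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);

    /* reads from the error queue are non-blocking; real code would
     * poll() for POLLERR instead of spinning */
    while (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1 && errno == EAGAIN)
        ;

    struct sock_extended_err serr;
    memcpy(&serr, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof(serr));
    /* serr.ee_origin == SO_EE_ORIGIN_ZEROCOPY; [ee_info, ee_data] is the
     * range of completed send() calls; SO_EE_CODE_ZEROCOPY_COPIED in
     * ee_code means the kernel fell back to copying */
}
```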

@francescolavra (Member) commented

Did some testing without the TCP_WRITE_FLAG_COPY flag on calls to tcp_write. The zero-copy path from lwIP through the nic driver does work as expected (at least with the virtio net driver), in the sense that data from the user buffer is correctly sent to the nic, but it doesn't result in an overall performance gain; in fact, I saw a slight performance degradation (on the order of a few percentage points), and that is without the socket error queue messaging, which would likely reduce performance further once added.
The reason for the zero-copy path not bringing performance benefits is that the savings from avoiding memory copying are outweighed by the overhead associated with handling an additional buffer in each network packet: when data is copied, each network packet can be sent as a single contiguous buffer, but when the data is not copied, the packet headers need to be allocated in a separate buffer and then chained to the user data buffer. At the nic driver level, the physical address needs to be retrieved for each buffer of a network packet, and this is relatively expensive.
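As a rough sketch of the difference at the lwIP API level (sock_tcp_send is a hypothetical helper, not the actual Nanos socket code):

```c
#include "lwip/tcp.h"

/* illustrative helper: the only change under discussion is the apiflags
 * argument passed to tcp_write() */
static err_t sock_tcp_send(struct tcp_pcb *pcb, const void *ubuf, u16_t len,
                           int zerocopy)
{
    if (zerocopy)
        /* no TCP_WRITE_FLAG_COPY: lwIP only references ubuf, so each
         * outgoing packet becomes a header pbuf chained to a pbuf that
         * points at user memory, and the driver must map two buffers */
        return tcp_write(pcb, ubuf, len, 0);
    /* default path: lwIP copies ubuf into its own pbuf, so headers and
     * payload go out as one contiguous buffer */
    return tcp_write(pcb, ubuf, len, TCP_WRITE_FLAG_COPY);
}
```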

Below is an annotated ftrace plot obtained when streaming TCP data from Nanos to the host machine, without zero-copy:
[annotated ftrace plot: TCP streaming from Nanos to the host, non-zero-copy]
The time spent in runtime_memcpy is non-negligible (24.6k) but accounts for only a small share of the total.

Below is the ftrace plot obtained with zero-copy:
[ftrace plot: TCP streaming from Nanos to the host, zero-copy]
The time spent in runtime_memcpy is considerably reduced (9.5k), but other functions involved in virtual-to-physical address translation (such as kern_pointer_from_pteaddr, physical_from_virtual_locked and table_find) take more time.

When trying zero-copy transmission on the loopback interface (where no physical addresses are needed), I did see a performance increase (on the order of 10%), but in Linux zero-copy does not apply to the loopback interface.

francescolavra added a commit that referenced this issue May 21, 2021
This commit changes a few lwIP configuration options, following the
hints at https://www.nongnu.org/lwip/2_1_x/optimization.html.
The lwip_htons() and lwip_htonl() macros have been defined so that
byte order inversion operations are executed inline instead of as
function calls (in #1299, the lwip_htons() function shows up as
taking a non-negligible share of CPU time).
Instead of the default checksum algorithm #2, algorithm #3 is now
being used, because it is faster on 64-bit platforms (checksum
calculation on 20-byte data is around 35% faster).
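For reference, a sketch of the kind of lwipopts.h settings the commit message describes; the exact definitions in the Nanos tree may differ:

```c
/* lwipopts.h fragment (sketch) */

/* inline byte-order conversions so lwip_htons()/lwip_htonl() are macros
 * rather than function calls (assumes a little-endian target such as x86-64) */
#define lwip_htons(x) __builtin_bswap16(x)
#define lwip_htonl(x) __builtin_bswap32(x)

/* select checksum algorithm 3 instead of the default 2; per the commit
 * message it is faster on 64-bit platforms */
#define LWIP_CHKSUM_ALGORITHM 3
```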
@francescolavra (Member) commented

I did some more testing with current master, and it is still the case that copying user data to kernel buffers when doing a socket send is more efficient (with Nanos running on qemu and sending TCP data to the local host) than the zero-copy approach. Even though retrieving the physical address from a kernel virtual address is now almost free, this doesn't apply to user buffers, so with zero-copy each network packet being sent incurs a physical_from_virtual() call that involves a table lookup.
Looking at the ftrace output (after applying #1551):

function                 non-zero-copy   zero-copy
total time               2M              2.1M
mcache_alloc             139k            141k
mcache_dealloc           161k            184k
heaplock_alloc           116k            111k
heaplock_dealloc         75k             75k
low_level_output         70k             84k
objcache_alloc           81k             76k
objcache_from_object     50k             58k
objcache_dealloc         43k             44k
physical_from_virtual    8k              20k
runtime_memcpy           33k             17k

The reduction of runtime_memcpy time is comparable to the increase in physical_from_virtual time, plus there is a significant increase in the time taken by mcache and objcache alloc/dealloc functions, which stems from the fact that when sending socket data with zero-copy we have to call pbuf_alloc() (allocating from lwip_heap) twice as many times as with the conventional approach.
But there might be other factors at play which cannot be easily seen with ftrace. With ftrace enabled, even zero-copy on the loopback interface is slightly slower than non-zero-copy, whereas without ftrace, zero-copy on the loopback interface is around 10% faster than non-zero-copy. In any case, on the virtio-net interface I see a slight performance degradation with zero-copy (something like 3-5%) both with and without ftrace.
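To make the pbuf_alloc() doubling mentioned above concrete, here is a rough illustration of how a zero-copy packet ends up as a two-pbuf chain. This is not the actual tcp_write()/tcp_output() code path, just the shape of the result, using standard lwIP pbuf API:

```c
#include "lwip/pbuf.h"

/* illustrative only: build one outgoing packet without copying user data */
static struct pbuf *build_zerocopy_packet(void *ubuf, u16_t len)
{
    /* one pbuf allocation for the protocol headers ... */
    struct pbuf *hdr = pbuf_alloc(PBUF_TRANSPORT, 0, PBUF_RAM);
    /* ... plus one reference-only pbuf wrapping the user data */
    struct pbuf *data = pbuf_alloc(PBUF_RAW, len, PBUF_REF);
    if (hdr == NULL || data == NULL) {
        if (hdr != NULL)
            pbuf_free(hdr);
        if (data != NULL)
            pbuf_free(data);
        return NULL;
    }
    data->payload = ubuf;
    pbuf_cat(hdr, data);    /* the driver now walks a 2-buffer chain and
                             * needs a physical address for each link */
    return hdr;
}
```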

@francescolavra francescolavra removed their assignment Feb 23, 2022