Performance impact? #52

Open
FCrane opened this issue Sep 4, 2015 · 42 comments

Comments

@FCrane

FCrane commented Sep 4, 2015

Hi!

I'm testing the passthru example with "true" and "8" as parameters (also tried 1) and it works fine. However, copying a file over the network that usually runs at about 90 MB/s slows down to 25 MB/s. CPU load of the passthru program is between 20 and 25%.

Does WinDivert really slow down network traffic that much? Can this be improved? Other solutions, like WinpkFilter, have a much smaller impact (e.g. just 5% CPU load, just 10% drop in transfer rate).

Thanks!

@FCrane
Author

FCrane commented Sep 4, 2015

I've now tested this again and indeed the performance impact is huge. On a gigabit LAN, running the passthru example slows the network speed down to about 30%! I'm losing almost 70% of the network speed on a fast machine (Intel i7). The other network packet filter "WinpkFilter" does not show that issue. It lets the traffic pass at full speed with a fraction of the CPU load WinDivert uses...

Is WinDivert really so slow?

@basil00
Owner

basil00 commented Sep 5, 2015

Firstly, which version of WinDivert did you use? Some of the older versions have performance problems that have been fixed.

These are my test results for WinDivert1.2.0-rc:

direct         : inbound=206Mbps outbound=206Mbps
passthru true 4: inbound=205Mbps outbound=178Mbps

200Mbps = ~25MB/s is less than your 90MB/s, and I have not yet tested anything higher.
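The thread mixes megabits per second (Mbps) and megabytes per second (MB/s), so a small conversion sketch may help when comparing the numbers. The function names are made up for illustration, and the only assumption is 8 bits per byte with no protocol overhead.

```c
#include <assert.h>

/* Convert megabits per second to megabytes per second (8 bits per byte). */
double mbps_to_mbytes(double mbps)
{
    assert(mbps >= 0.0);
    return mbps / 8.0;
}

/* Convert megabytes per second to megabits per second. */
double mbytes_to_mbps(double mbytes)
{
    assert(mbytes >= 0.0);
    return mbytes * 8.0;
}
```

With these, 200 Mbps works out to 25 MB/s, while the 90 MB/s reported above corresponds to 720 Mbps, well beyond the link speed tested here.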

There is a performance hit for outbound traffic (206 vs 178 Mbps, roughly 14%). This is something I was aware of, but I have never found the exact cause. A possible culprit is the checksum recalculation, which may also explain some of the CPU usage. Unfortunately, correct checksums are a requirement of the underlying WFP framework as far as I can tell. WinPkFilter is a lower-level NDIS intermediate driver and probably does not need checksum recalculation for a passthru-type example.
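To give a feel for what checksum recalculation costs per packet, here is a minimal sketch of the standard Internet checksum (RFC 1071). It is a generic textbook version, not WinDivert's own code, and the function name `inet_checksum` is only an illustration. Every byte covered by the checksum has to be read once, which is why the work grows with packet size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Internet checksum (RFC 1071) over `len` bytes: add the data up as
 * big-endian 16-bit words, fold the carries back into the low 16 bits,
 * and return the one's complement. The 32-bit accumulator is sufficient
 * for packet-sized buffers (up to 64 KiB). */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    assert(data != NULL || len == 0);

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)                     /* odd trailing byte, padded with zero */
        sum += (uint32_t)data[i] << 8;

    while (sum >> 16)                /* fold carries into the low 16 bits */
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

For an IPv4 header the checksum field is zeroed before the call. Running the same function over a header that already holds a correct checksum returns 0, which is a convenient validity check.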

The other thing is that WinDivert has always been a convenience versus performance trade-off. For best performance, you are better off implementing a specialized filtering driver for your application.

@FCrane
Author

FCrane commented Sep 5, 2015

I'm using WinDivert v1.18. It's a pity that it slows down gigabit networks, because otherwise it seems really great!

Maybe you can test it on a gigabit LAN to see for yourself.

@ghost

ghost commented Sep 22, 2016

Did you ever find anything to improve the performance? I'm currently experiencing a similar performance drop and high CPU usage. In my case my download speed goes from ~6MB/s to 4.5MB/s with 15% CPU (probably depends on the CPU). The application spends most of its time in the WinDivertSendEx method. I already increased both available parameters (WINDIVERT_PARAM_QUEUE_LEN and WINDIVERT_PARAM_QUEUE_TIME) but that does not make a lot of difference I'm afraid.

@FCrane
Author

FCrane commented Sep 22, 2016

Hi!

Sorry, but no. This problem seems to be by design and the developers don’t seem to be interested in fixing it.

Regards!


@u-riaz

u-riaz commented Sep 27, 2016

Hi @FCrane and @basil00 ,
I'm interested in writing a packet filter that will block outbound traffic from almost 162K IPs. For that purpose, instead of diverting all traffic and then filtering, I've come up with a scheme:

  • For TCP, divert only SYN packets (a connection cannot be established if its SYN packet is dropped); for UDP and other protocols, divert all packets (filter: "(outbound and tcp.Syn) or (outbound and (!tcp))"; is this filter correct?).
  • Catch the diverted traffic (with 4-8 threads) and filter it using binary search: if a diverted packet is from a blacklisted IP, block it; otherwise reinject it.
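
The lookup step of this scheme can be sketched with the C standard library alone. This assumes IPv4 addresses held as host-order 32-bit integers; the names `blacklist_sort` and `blacklist_contains` are hypothetical, and the code that extracts the address from a diverted packet is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Order two IPv4 addresses held as host-order 32-bit integers. */
static int cmp_ipv4(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sort the blacklist once at startup so lookups can use binary search. */
void blacklist_sort(uint32_t *list, size_t count)
{
    assert(list != NULL);
    qsort(list, count, sizeof list[0], cmp_ipv4);
}

/* Return 1 if `addr` is in the sorted blacklist, 0 otherwise. */
int blacklist_contains(const uint32_t *list, size_t count, uint32_t addr)
{
    assert(list != NULL);
    return bsearch(&addr, list, count, sizeof list[0], cmp_ipv4) != NULL;
}
```

With roughly 162K entries a lookup needs at most about 18 comparisons. Since the sorted array is read-only after startup, several worker threads can share it without locking.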

Please comment on my scheme: would it work?
Additionally, please let me know the tools and all other things you used to measure the performance drop. Help me in creating the environment to measure performance, I want to test my scheme's throughput.
Thanks!

@basil00
Owner

basil00 commented Sep 28, 2016

Generally you want to divert as little traffic as possible to get the job done. Diverting only SYN packets is a good approach and should have minimal impact, although this will not affect established TCP connections. For UDP, which does not have a SYN equivalent, you'd be stuck with diverting everything or implementing something complex (e.g. updating the filter string to whitelist established UDP flows).

@basil00
Owner

basil00 commented Sep 28, 2016

> This problem seems to be by design and the developers don’t seem to be interested in fixing it.

Nothing can be done until #53 is fixed anyway.

@u-riaz

u-riaz commented Sep 28, 2016

@basil00 ,
Thanks for your reply.
Would you please let me know the tools and all other things you used to measure the performance drop (which you and @FCrane measured/tested and mentioned in the comments above)? Help me in creating the environment to measure performance, I want to test my scheme's throughput.

@basil00
Owner

basil00 commented Oct 1, 2016

For latency use ping, and for throughput any file transfer tool will do (ftp, scp, or one of the http speed testers you can find by googling).

The performance impact of WinDivert is usually minimal unless you are attempting to divert megabytes per second of data through a user application. This is especially true for latency, where the lag introduced by the user application is usually insignificant compared to the normal network lag. One danger for throughput is the WinDivert packet queue getting overwhelmed, resulting in packet loss.

@satnatantas

satnatantas commented Dec 6, 2016

#52 (comment)

It would be helpful to know whether the performance hit stays constant when the transfer is throttled, at varying granularity of throttling.

@basil00
Owner

basil00 commented Oct 16, 2017

The latest WinDivert source code seems to be about 4x faster than older versions 1.1.X and 1.2.Y, at least with my quick-and-dirty testing. This might not be quite gigabit speeds but at least it is a lot closer. The better performance is mainly due to internal driver optimizations such as avoiding copying packets (where possible) and instant injection.

@ghost

ghost commented Oct 17, 2017

Nice @basil00!
4x more throughput or 4x less CPU usage? Or both? 👍

@TechnikEmpire

TechnikEmpire commented Oct 17, 2017

@Areithus In my experience the CPU cost is nearly all in the user application; for the diversion process itself, CPU usage is nil when using overlapped functions and tracking TCP packet flows. Also I believe, given the context of the thread, it's about throughput.

This is exciting news. I've had word @basil00 that the EV cert was granted and is in the mail to someone I'm working with, so we should be able to sign shortly.

@basil00
Owner

basil00 commented Oct 17, 2017

Yes I meant 4x throughput, although it was a very rough test. I was testing 1Gbps speed, and version 1.2.0 choked at about 170Mbps, whereas the new version managed 630Mbps (still not perfect but much better). But this is just one quick test.

> This is exciting news. I've had word @basil00 that the EV cert was granted and is in the mail to someone I'm working with, so we should be able to sign shortly.

Let me know when you are ready and I can assist. My other sponsor signed version 1.3.0; it was a long and painful process, but we gained much experience. From the project's perspective there is no harm in more than one sponsor :)

@satnatantas

What was the CPU and its load when you tested it? Did you test passthru?

@basil00
Owner

basil00 commented Oct 19, 2017

Yes passthru. The test box is an old system, so might do better with more modern CPUs.

@basil00
Owner

basil00 commented Oct 25, 2017

Some benchmarks for passthru true at gigabit speeds:

------------------------------------------------------------------
Direct:

##: 0.80 Gbps down, 0.93 Gbps up

------------------------------------------------------------------
WinDivert-1.2.0-rc (#threads)

#1: 0.06 Gbps down, 0.06 Gbps up
#2: 0.12 Gbps down, 0.11 Gbps up
#3: 0.16 Gbps down, 0.15 Gbps up
#4: 0.19 Gbps down, 0.18 Gbps up

------------------------------------------------------------------
WinDivert-1.3.0 (#threads)

#1: 0.41 Gbps down, 0.49 Gbps up
#2: 0.71 Gbps down, 0.81 Gbps up
#3: 0.76 Gbps down, 0.87 Gbps up
#4: 0.73 Gbps down, 0.83 Gbps up

------------------------------------------------------------------
WinDivert-1.4.0-dev (#threads)

#1: 0.39 Gbps down, 0.46 Gbps up
#2: 0.61 Gbps down, 0.75 Gbps up
#3: 0.77 Gbps down, 0.84 Gbps up
#4: 0.74 Gbps down, 0.77 Gbps up

Notes:

  • WinDivert-1.2.0 and earlier had a performance bug that limited throughput to around 200Mbps. Coincidentally, this was my available bandwidth at the time, so the problem went unnoticed.
  • The performance bug was fixed here. The fix was included in the WinDivert-1.3.0 release. The disadvantage of the fix is that WinDivertSend() will not return an error code if the injection fails (instead the packet will silently disappear).
  • The performance has slightly regressed in WinDivert-1.4.0 (although for 3 threads it is about the same). I will continue to investigate.
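
To put the table in relative terms, each filtered figure can be divided by the direct figure. The helper below is a trivial illustration (the name `relative_throughput` is made up) and uses only numbers already listed above.

```c
#include <assert.h>

/* Fraction of the unfiltered ("direct") throughput that survives filtering. */
double relative_throughput(double filtered_gbps, double direct_gbps)
{
    assert(direct_gbps > 0.0);
    return filtered_gbps / direct_gbps;
}
```

For example, WinDivert-1.3.0 with 3 threads reaches 0.76 / 0.80 = 95% of the direct download rate, whereas WinDivert-1.2.0-rc with 4 threads reaches 0.19 / 0.80, or about 24%.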

@lumogate

There is no MSVC build for WinDivert 1.3.0?

@TechnikEmpire

@kelvinomolumo did you check the releases page?

@ghost

ghost commented Oct 25, 2017

Nice test @basil00, I did some testing here as well (just a little with reading) and can confirm that 1.3.0 is faster than 1.4.0. Not just the throughput but also CPU usage is a little less (about 1-2% less).

@basil00
Owner

basil00 commented Oct 26, 2017

> There is no MSVC build for WinDivert 1.3.0?

No, try to link against the MINGW version.

> can confirm that 1.3.0 is faster than 1.4.0

Version 1.4.0 has a more complicated pipeline, so is probably a bit slower as a result. The details are somewhat technical, but version 1.3.0 queues packets (by deep copying) at DISPATCH_LEVEL, which is not ideal (at DISPATCH_LEVEL the thread is uninterruptible, so nothing else can run until the copying has finished). Version 1.4.0 fixes this by moving the copying and filtering out-of-band, where it runs at PASSIVE_LEVEL (which is interruptible, just like normal user-mode code), but this requires an extra queue internally, so it likely adds some overhead.

A more optimal design (in terms of performance) would be not to use deep copying for queueing packets at all, but rather to keep a reference to the original packet (NET_BUFFER_LIST). However, drivers are not supposed to keep references to NET_BUFFER_LISTs for long, such as while waiting for a user-mode application (as is the case with WinDivert), and Microsoft specifically advises against this.

@ghost

ghost commented Oct 27, 2017

@basil00
I've been reading up a little on this (I think you refer to this specifically: https://msdn.microsoft.com/en-us/library/windows/hardware/ff551206(v=vs.85).aspx and also on some other pages such as https://msdn.microsoft.com/en-us/library/windows/hardware/ff551134(v=vs.85).aspx). Please correct me if I'm wrong though. It seems that if you listen to IRP_MN_QUERY_POWER you can keep the references. Might be worth looking into, it'd be nice to get near gbit speeds with just 2 threads.

@basil00
Owner

basil00 commented Oct 27, 2017

That might be something to look into.

I also remembered that there are other complications to consider. Specifically, while deep copying sounds slow, it also has the benefit of freeing up the original buffer. This means that WSASend can complete immediately rather than blocking until WinDivert dereferences the packet. This can result in better throughput.

@basil00
Owner

basil00 commented Jan 15, 2018

The latest WinDivert-1.4-dev has reverted back to deep copying rather than referencing packets. It appears this mode is actually slightly faster:

#3: 0.80 Gbps down, 0.87 Gbps up

So there is no reason not to continue using this mode for the immediate future. I hope to release version 1.4 shortly.

@lumogate

lumogate commented Jan 15, 2018

@basil00 is WinDivertSend also back to how it worked in v1.3, with no error if injection fails?

@basil00
Owner

basil00 commented Jan 15, 2018

Since version 1.3.0 the WinDivertSend function will return immediately since this is a lot faster. If you prefer to wait for an error code, it is possible to pass the WINDIVERT_FLAG_DEBUG flag to WinDivertOpen and this will emulate the old behavior. Note that, in my experience, Windows often does not return an error if injection fails either way...

@basil00
Owner

basil00 commented Mar 31, 2019

I had evaluated the WinDivert 2.0 performance as part of testing, so it is probably worthwhile to make some quick notes here.

One problem was that I was unable to replicate the previous performance numbers for older versions of WinDivert. It is possible that WinDivert performance took a hit from the Meltdown mitigation, especially since my test box uses older hardware. I was also unable to replicate the top speeds for the unfiltered connection, which may be related, or may have been a temporary network issue.

Nevertheless, we can evaluate the relative performance of WinDivert 2.0, and it essentially matches 1.4.3 using the same parameters (i.e., same thread count), which is in line with expectations.

WinDivert 2.0 also introduces "batch mode" using the WinDivert...Ex() functions. This allows the user application to send/receive multiple packets at once, and significantly reduces the number of kernel/user-mode context switches required. In my experiments, batch mode can significantly improve performance even for single-threaded applications. Using the 2.0 version of passthru, the following configuration (using a single thread and a batch size of 32) can run at "full speed" (~0.83Gbps filtered vs ~0.93Gbps unfiltered):

passthru.exe true 1 32

This suggests that "batch mode" is the most important factor in terms of performance improvement in recent versions of WinDivert.
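
As a rough illustration of why batching helps, the sketch below counts how many receive and send calls a passthru loop needs for a given number of packets. It is a best-case model that assumes every batch is full and that each call costs one user/kernel round trip; the function name `passthru_calls` is hypothetical and no WinDivert API is involved.

```c
#include <assert.h>
#include <stdint.h>

/* Number of receive + send calls needed to pass `packets` packets through
 * user mode when each call carries up to `batch` packets. */
uint64_t passthru_calls(uint64_t packets, uint64_t batch)
{
    assert(batch > 0);
    uint64_t rounds = (packets + batch - 1) / batch;  /* ceil(packets / batch) */
    return 2 * rounds;                                /* one recv + one send per round */
}
```

Under this model, one million packets need 2,000,000 calls with a batch size of 1 but only 62,500 calls with a batch size of 32. Real batches are only as large as the number of packets queued at the time of the call, so the actual saving will be smaller.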

@haohaolee

haohaolee commented May 10, 2019

Hi basil,

I am suffering from the performance issue now. Actually, we focus on the SMB performance (aka network share).
Test env: Two Windows virtual machines (Win10 and Win7) connected with each other under Parallels Desktop

Without using Windivert, the file copying speed can be 150MByte/s (the virtual network adapter is 10Gbps)
With using Windivert, passthru.exe (both 1.4.3 and 2.0.0 rc), the speed I can get is at most 50MByte/s
The best performance with passthru is with 1 or 2 threads:
passthru true 1 (for 1.4.3)
passthru true 1 32 (for 2.0 rc)

Although this was tested on virtual machines, I got similar results with physical machines on a 1Gbps connection.

Could you please shed some light on this? How can I debug this issue?

Thanks

@TechnikEmpire

Just so you know, it's a known issue to some of us that SMB and other Windows services suffer degraded performance. However, I haven't retested using batching.

Personally I exempt such traffic but that may not be an option based on your use case.

@TechnikEmpire

Try disabling throttling in Windows:

https://serverfault.com/questions/4409/windows-networking-performance-smb-cifs

One of the answers shows what reg key to set. Please let us know your results.

@haohaolee

Thanks, I will try it.
Since we use Windivert to do some kind of proxying, we cannot exempt it

@haohaolee

It seems DisableBandwidthThrottling does not have much impact on this behavior.
Now I can get 65MByte/s using WinDivert 2.0 rc (passthru true 1 32)
The direct copy is still 150+MByte/s

My observation is more threads lead to a lower and unstable speed, one thread can get the best and stable speed

@basil00
Owner

basil00 commented May 11, 2019

Also check your CPU usage, e.g., is one core running at 100%? Also, how big is each transfer? What is the latency between the source/destination?
Normally I'd expect a bit better performance for 1Gbps, but performance issues can be very difficult to debug.

> My observation is more threads lead to a lower and unstable speed, one thread can get the best and stable speed

If more threads do not help then it probably means the user application is not the bottleneck.

@haohaolee

Hi basil,
The CPU usage in userland IS NOT 100%, whether I run 1 thread or multiple threads. The test file is 10GiB, and the latency is less than 1ms (two virtual machines).

I am now trying to understand Windivert and find the root cause, could you please share the pdbs (public or private) when you release the drivers in the future? It will help a lot when others try to investigate issues. (Although I can build it myself, it will be convenient anyway)

@TechnikEmpire

For clarity, you want the pdb files for performance profiling, yes?

@haohaolee

haohaolee commented May 13, 2019

Yes, for the current situation. All kinds of debugging and profiling tools need symbols.
But it will be very convenient to provide pdbs with binaries anyway.

@basil00
Owner

basil00 commented May 14, 2019

The pdb files are very large (relative to the rest of the binaries) and most users don't need them, so they are not included. I could put them in a separate package but have never gotten around to it.

@haohaolee

Thanks~ That would be great

@majibow

majibow commented May 23, 2019

If you are using multiple threads then it is likely that the packet reordering is affecting performance.

Note in general packet reordering is allowed but it has drawbacks in performance. Packet reordering should be avoided where possible. The only reason packet reordering was allowed back in the day was to allow different packets to go over different links to the same destination. One takes the scenic route with increased latency and arrives sometime later. That also allowed for stateless comparison and links randomly going offline.

However these days many core routers put all packets related to the same flow down the same path even if there are multiple links load balanced in a round robin... this is precisely to avoid packets being reordered.

There are simple ways to keep packets in order, and one way is hashing the destination address. A really simple hashing scheme is to take the least significant bit (for two threads) or the two least significant bits (for 4 threads), etc., and only allow one particular thread to handle one particular flow. This will give a fairly uniform load for multiple flows towards different destinations.

There are other strategies like using src and dst port numbers or xor'ing the src ip address too for example but you do need to make it as simple as possible to keep the overheads low. Bottom line is don't let packets get reordered for no reason.
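
A minimal sketch of both ideas follows, assuming IPv4 addresses in host byte order (so the low bits come from the last octet) and a power-of-two thread count. The function names are hypothetical, and the extra bit-folding in the second function is an addition that is not described in the comment above.

```c
#include <assert.h>
#include <stdint.h>

/* Simplest scheme: pick the worker from the low bits of the destination
 * address. `nthreads` must be a power of two so the mask selects whole bits. */
unsigned worker_by_dst(uint32_t dst_addr, unsigned nthreads)
{
    assert(nthreads > 0 && (nthreads & (nthreads - 1)) == 0);
    return (unsigned)(dst_addr & (nthreads - 1));
}

/* Variant that XORs both addresses and both ports. XOR is symmetric, so the
 * two directions of a flow land on the same worker. The folding steps mix
 * the high bits into the low bits (an assumption added for illustration). */
unsigned worker_by_flow(uint32_t src_addr, uint32_t dst_addr,
                        uint16_t src_port, uint16_t dst_port,
                        unsigned nthreads)
{
    assert(nthreads > 0 && (nthreads & (nthreads - 1)) == 0);
    uint32_t h = src_addr ^ dst_addr ^ src_port ^ dst_port;
    h ^= h >> 16;
    h ^= h >> 8;
    return (unsigned)(h & (nthreads - 1));
}
```

Either function is cheap enough to run per packet. A dispatcher thread would call it and hand the packet to the chosen worker's queue, so that all packets of one flow are processed, and reinjected, in arrival order.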

@TechnikEmpire

@majibow Thanks for sharing that info here. After reading it, seems quite logical to me. Great comment.

@majibow

majibow commented May 24, 2019

I guess you could get 8 threads easily with 8 WinDivert handles
inbound/outbound, ip/ipv6, tcp/udp... one combination being "inbound and ipv6 and udp"

Unfortunately one thread will be hit a lot more than the others, mostly "inbound and ip and tcp", and it's probably far from uniform.

Since WinDivert already supports >, <, >=, <= operators you could simply do combinations of
"ip and tcp.DstPort<16384" and
"ip and tcp.DstPort>=16384 and tcp.DstPort<32768" and
"ip and tcp.DstPort>=32768 and tcp.DstPort<49152" and
"ip and tcp.DstPort>=49152"

Would be nice if we could get bitwise operations in the filter language (&, |, ^, ~). At minimum, bitwise AND would be super useful, far more efficient, and more uniformly distributed.

"ip and tcp.DstPort & 3 = 0" and
"ip and tcp.DstPort & 3 = 1" and
"ip and tcp.DstPort & 3 = 2" and
"ip and tcp.DstPort & 3 = 3"

Note: 3 = 0x3 = 0b00000011
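
The range-based filters listed earlier in this comment can be generated for any number of handles. The sketch below only builds the filter strings with snprintf; the function name `port_slice_filter` is hypothetical, and opening one handle per filter is left to the caller.

```c
#include <assert.h>
#include <stdio.h>

/* Write the filter for slice `index` of `count` equal TCP destination-port
 * ranges into `buf`. The first slice has no lower bound and the last slice
 * has no upper bound, so the whole 0..65535 range is covered even when
 * `count` does not divide 65536 evenly. Returns the snprintf result. */
int port_slice_filter(char *buf, size_t size, unsigned index, unsigned count)
{
    assert(buf != NULL && count >= 2 && index < count);
    unsigned width = 65536u / count;
    unsigned lo = index * width;
    unsigned hi = lo + width;            /* exclusive upper bound */

    if (index == 0)
        return snprintf(buf, size, "ip and tcp.DstPort<%u", hi);
    if (index == count - 1)
        return snprintf(buf, size, "ip and tcp.DstPort>=%u", lo);
    return snprintf(buf, size,
                    "ip and tcp.DstPort>=%u and tcp.DstPort<%u", lo, hi);
}
```

With count set to 4 and index running from 0 to 3, this reproduces the four range filters written out above.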
