feat engine: add improved WorkStealingTaskQueue #577

Open

egor-bystepdev wants to merge 12 commits into develop

Conversation

egor-bystepdev
Contributor

No description provided.
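
(Since the PR ships no description, a bit of orientation for reviewers: the general idea behind work stealing is that each worker thread owns its own task queue and only reaches into its siblings' queues when its own runs dry, which keeps the hot path contention-free. Below is a minimal, mutex-based sketch of that concept; all names in it are hypothetical, and the actual WorkStealingTaskQueue in this PR is a separate, more sophisticated implementation.)

```cpp
// Illustrative sketch only: a minimal, mutex-based take on the work-stealing
// idea. All names are hypothetical; the real WorkStealingTaskQueue in this PR
// is a different, more elaborate implementation.
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

using Task = std::function<void()>;

class WorkStealingSketch {
 public:
  explicit WorkStealingSketch(std::size_t workers) : queues_(workers) {}

  // A worker pushes new tasks onto the back of its own queue.
  void Push(std::size_t worker, Task task) {
    std::lock_guard lock{queues_[worker].mutex};
    queues_[worker].tasks.push_back(std::move(task));
  }

  // A worker pops from its own queue first; only when that queue is empty
  // does it scan the other workers' queues and steal one task.
  std::optional<Task> Pop(std::size_t worker) {
    {
      std::lock_guard lock{queues_[worker].mutex};
      auto& own = queues_[worker].tasks;
      if (!own.empty()) {
        Task task = std::move(own.back());  // LIFO on the owner's side
        own.pop_back();
        return task;
      }
    }
    for (std::size_t victim = 0; victim < queues_.size(); ++victim) {
      if (victim == worker) continue;
      std::lock_guard lock{queues_[victim].mutex};
      auto& theirs = queues_[victim].tasks;
      if (!theirs.empty()) {
        Task task = std::move(theirs.front());  // steal from the opposite end
        theirs.pop_front();
        return task;
      }
    }
    return std::nullopt;  // nothing to run anywhere
  }

 private:
  struct PerWorkerQueue {
    std::mutex mutex;
    std::deque<Task> tasks;
  };
  std::vector<PerWorkerQueue> queues_;
};
```

The owner popping LIFO while thieves steal FIFO is the classic arrangement: it keeps recently spawned tasks cache-hot for the owner and puts owner and thieves on opposite ends of the deque, minimizing contention.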

@itrofimow
Contributor

I've run some benchmarks with this PR applied, and so far it looks really good. The setup: the benchmark from https://github.com/TechEmpower/FrameworkBenchmarks/tree/master/frameworks/C%2B%2B/userver, on a 64-core VM with 2 NUMA nodes

itrofimow@wsq-test:~/app$ lscpu 
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         40 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  64
  On-line CPU(s) list:   0-63
Vendor ID:               GenuineIntel
  Model name:            Intel Xeon Processor (Icelake)
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           2
    Stepping:            0
    BogoMIPS:            3990.62
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc 
                         cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fa
                         ult invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512b
                         w avx512vl xsaveopt xsavec xgetbv1 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear ar
                         ch_capabilities
Virtualization features: 
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   2 MiB (64 instances)
  L1i:                   2 MiB (64 instances)
  L2:                    128 MiB (32 instances)
  L3:                    32 MiB (2 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-31
  NUMA node1 CPU(s):     32-63
Vulnerabilities:         
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Syscall hardening, KVM SW loop
  Srbds:                 Not affected
  Tsx async abort:       Not affected

With the application configured with 24 worker threads, 7 ev threads and a single timer thread, started as taskset -c 32-63 ./wsq -c userver_configs/static_config.yaml (so the server is pinned to NUMA node1, while the load generator below runs on node0 cores), I'm seeing these results:


work-stealing-task-queue:

itrofimow@wsq-test:~/app$ taskset -c 1-30 twrk -c 256 -t 30 -D 3s -d 300s --pin-cpus http://localhost:8080/plaintext
Running 5m test @ http://localhost:8080/plaintext
  30 threads and 256 connections
  Thread Stats   Avg     Stdev       Max       Min   +/- Stdev
    Latency   222.83us  561.04us  292.95ms   19.00us   97.82%
    Req/Sec    42.87k     1.31k    49.65k    29.94k    73.16%
  383857503 requests in 5.00m, 53.98GB read
Requests/sec: 1279524.66
Transfer/sec:    184.26MB


global-task-queue:

itrofimow@wsq-test:~/app$ taskset -c 1-30 twrk -c 256 -t 30 -D 3s -d 300s --pin-cpus http://localhost:8080/plaintext
Running 5m test @ http://localhost:8080/plaintext
  30 threads and 256 connections
  Thread Stats   Avg     Stdev       Max       Min   +/- Stdev
    Latency   220.95us  220.71us   21.37ms   23.00us   97.75%
    Req/Sec    37.46k   617.45     41.17k    32.47k    70.04%
  335445125 requests in 5.00m, 47.17GB read
Requests/sec: 1118150.09
Transfer/sec:    161.02MB


which is an impressive ~14.4% throughput increase (keep in mind that the application in this benchmark is overtuned and not entirely realistic, but still).
I'm not seeing any measurable difference between global-task-queue and current develop, so from a performance point of view this PR is a gem.
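For reference, the ~14.4% figure follows directly from the two Requests/sec numbers above: 1279524.66 / 1118150.09 ≈ 1.144.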

Note that I haven't done any correctness or latency testing.
