
Radically simplify scheduler and implement new scheduling strategy for short running actions #751

Open

wants to merge 1 commit into master

Conversation

@wz1000 commented Apr 11, 2020

As noted by @mpickering and @pepeiborra in #503, the default behavior of the Shake scheduler incurs a large overhead on ghcide response times on multicore systems.

On @mpickering's advice, I first tried removing all the custom thread scheduling code in Shake and delegating to GHC's native scheduler. This didn't yield any consistent benefit, but also didn't incur any noticeable penalty, either when using ghcide or when compiling GHC with Hadrian.

We then noticed that Shake stores the previous run time for each action, and decided to use this as a guide for scheduling short-running actions in a way that avoids the overhead of context switching and concurrency.

This fork now uses forkOn to schedule actions that we expect to be short (< 0.05s currently) on the zeroth capability. This brings hover response times in ghcide running with multiple threads in line with those observed when running ghcide on a single thread.
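For concreteness, a minimal sketch of the dispatch policy described above, assuming a hypothetical spawnAction helper that receives the recorded previous duration (this is not the actual patch code):

    import Control.Concurrent (ThreadId, forkIO, forkOn)

    -- Threshold below which an action counts as "short" (the 0.05s above).
    shortThreshold :: Double
    shortThreshold = 0.05

    -- Hypothetical helper: pin actions whose previous run time was under the
    -- threshold to capability 0 with forkOn, avoiding thread migration and
    -- context-switch overhead; fork everything else normally with forkIO.
    spawnAction :: Maybe Double -> IO () -> IO ThreadId
    spawnAction (Just prev) act
        | prev < shortThreshold = forkOn 0 act
    spawnAction _ act = forkIO act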

TODO:

  • Figure out the optimum value of the threshold and/or make it configurable
  • Remove the other, now unused, PoolPriority values

See also https://github.com/digital-asset/ghcide/issues/503#issuecomment-605666014

@wz1000 changed the title from "Radically simplify scheduler and" to "Radically simplify scheduler and implement new scheduling strategy for short running actions" on Apr 11, 2020
@wz1000 (Author) commented Apr 11, 2020

The hlint suggestion

.\src\General\Thread.hs:42:12: Suggestion: Replace case with maybe
Found:
  case mcap of
    Nothing -> forkIOWithUnmask
    Just n -> forkOnWithUnmask n
Perhaps:
  maybe forkIOWithUnmask forkOnWithUnmask mcap

won't actually compile because of mainline GHC's lackluster support for impredicativity and the rank-2 types involved.
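For reference, a small sketch of why the case form typechecks while the maybe form doesn't (the signatures are from Control.Concurrent; fork is just an illustrative name):

    {-# LANGUAGE RankNTypes #-}
    import Control.Concurrent

    -- forkIOWithUnmask ::        ((forall a. IO a -> IO a) -> IO ()) -> IO ThreadId
    -- forkOnWithUnmask :: Int -> ((forall a. IO a -> IO a) -> IO ()) -> IO ThreadId

    -- The case expression compiles: each branch is checked directly against
    -- the rank-2 type pushed in from the signature.
    fork :: Maybe Int -> ((forall a. IO a -> IO a) -> IO ()) -> IO ThreadId
    fork mcap = case mcap of
        Nothing -> forkIOWithUnmask
        Just n  -> forkOnWithUnmask n

    -- The hlint version does not: maybe's result type variable would have to
    -- be instantiated at the polymorphic type above, i.e. impredicatively.
    -- fork mcap = maybe forkIOWithUnmask forkOnWithUnmask mcap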

Commit: "… estimates to guide scheduling of short actions on a single core"
@ndmitchell (Owner) commented

Many thanks for the patch, and sorry it's taken so long to get around to a discussion (family all caught COVID, and it's been a slow recovery). I love the idea of improving the scheduler, and I appreciate the work you've put in. Having read the code, a few questions on the conceptual stuff.

Performance

(After reading, this section looks like a wall of questions designed to be off-putting - but that's really not the intention! I'm keen to understand more, so please share what you already know; I don't expect answers to everything - just trying to find a thread to follow.)

You seem to suggest most of the performance benefits come from using forkOn for small things? Is that accurate? If so, why? Is the cause that GHC stops doing something? What does it stop doing? Are there any other ways to stop it doing that? Does reusing already forked jobs help? Does only using forkOn help? Can we generalise these benefits for longer-running tasks, to regain uniformity? What is the percentage of short-lived tasks? Do we know why there are so many? What if a thing estimated to be quick takes a long time - do we end up stalling? What if the user has already pinned something long-running to capability 0? Is it possible to come up with a benchmark that demonstrates this policy has a significant advantage?

What are the currently understood performance benefits in ghcide and Hadrian? The last reported numbers on the thread are 0.2s before vs 0.02s after. That's amazing. Is it still that, or has it got better/worse? Is the Hadrian difference measurable, or in the realm of noise?

Complexity

Simple code would be great, and since the scheduler was written, Haskell has improved and the job the scheduler does has changed radically. A few things I'm worried we might have lost:

  • You don't seem to respect PoolPriority. Things like batch won't work properly unless PoolStart things always take priority over PoolBatch, and more specifically unless PoolBatch items are queued. Things like PoolException make sure that when an exception happens we terminate the build as quickly as possible.
  • With threadsLimit we ensure that no more than N threads are running at one time, but we also prioritise finishing existing threads over starting new ones (see the sketch after this list). If you have 1000 pieces of work to do, and start them all, and they all read a file handle, you end up with 1000 open file handles that GHC switches between. Maybe it's not a big deal because blocking causes that to happen too, but it definitely seems more likely now, and I worry it would be a memory regression.
  • The scheduler deliberately randomises the order of tasks. In some real-world benchmarks that saves up to 30%, by randomly interleaving linkers with compilers and spreading out disk vs CPU usage. I imagine GHC uses a fair scheduler, which might harm these cases.
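For the threadsLimit point above, a minimal sketch of the bounded-worker idea: a fixed set of N long-lived workers pulling jobs from a shared queue (illustrative only, not Shake's actual pool). With this shape, at most N jobs are in flight at once, and a running job always completes before a queued one starts:

    import Control.Concurrent (forkIO)
    import Control.Concurrent.QSem
    import Control.Concurrent.STM
    import Control.Monad (forever, replicateM_)

    -- N workers drain a shared queue, so at most N jobs (and at most N open
    -- file handles) are live at any moment.
    runLimited :: Int -> [IO ()] -> IO ()
    runLimited n jobs = do
        queue <- newTQueueIO
        mapM_ (atomically . writeTQueue queue) jobs
        done <- newQSem 0
        replicateM_ n $ forkIO $ forever $ do
            job <- atomically (readTQueue queue)
            job
            signalQSem done  -- count completed jobs
        replicateM_ (length jobs) (waitQSem done)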

@ndmitchell (Owner) commented May 22, 2020

Given the benchmark:

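    -- time enqueueing 200,000 no-op actions into a 4-thread pool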
    (d, _) <- duration $ runPool False 4 $ \pool ->
        replicateM_ 200000 $ addPool PoolStart pool $ return ()
    print d

Run with -O2 -N4, the existing implementation takes 3.27s. The implementation from this patch takes 0.2s, assuming I estimate everything at the very-quick threshold - an amazing and most welcome speedup. If I just switch to forkOn in the existing implementation, I get to 0.4s.

I'm going to look into optimisations for avoiding spawning threads, as they should be fairly localised, preserve all the benefits, not require estimation, and might get to < 0.4s. With that done, we can revisit the things in this chunk.

@wz1000 (Author) commented May 22, 2020

I think a simple way to keep the original scheduler and retain the performance improvements of this patch would be to use forkOn on a random (unused?) capability instead of forkIO. This would prevent GHC from unnecessarily moving threads between capabilities. Perhaps we should also still reserve a capability for short-running actions, though.
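A hypothetical sketch of that suggestion (forkPinned is an illustrative name, not part of the patch): round-robin new threads over capabilities 1..n-1, keeping capability 0 free for short actions:

    import Control.Concurrent (ThreadId, forkOn, getNumCapabilities)
    import Data.IORef

    -- Pick the next capability round-robin, skipping capability 0, so threads
    -- stay pinned rather than being migrated between capabilities by the RTS.
    forkPinned :: IORef Int -> IO () -> IO ThreadId
    forkPinned counter act = do
        n <- getNumCapabilities
        i <- atomicModifyIORef' counter (\j -> (j + 1, j))
        forkOn (1 + i `mod` max 1 (n - 1)) act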

@ndmitchell (Owner) commented
What makes you believe the issue is all due to GHC moving threads between capabilities? In some microbenchmarks I've written, forkOn performs measurably worse than forkIO. I'm actually wondering if forkOn is slower, and that's what makes the pool work better, since you are more likely to reuse a thread. Either way, any references you have on all this would be great.

@ndmitchell (Owner) commented
How do we find an unused capability? How can we be sure no one else is going to queue up for that capability? If we're wrong in determining that something is short, what are the consequences?

@ndmitchell (Owner) commented
I benchmarked just a simple loop spawning threads. I got weird behaviour, so I shared it on StackOverflow: https://stackoverflow.com/questions/61971292/ghc-forkio-bimodal-performance

Interestingly, if I switch that benchmark to forkOn, performance crashes: it takes vastly longer and, most scarily, quadratically longer. I raised a bug at https://gitlab.haskell.org/ghc/ghc/issues/18221. I'd love to improve the performance of the pool, but I'm super worried about running into O(n^2) behaviour and taking out the whole IDE if things go wrong!

ndmitchell added a commit that referenced this pull request May 23, 2020
@ndmitchell (Owner) commented
Following the remarks added, and the information on the StackOverflow thread:

  • I'm worried about overusing a primitive that is O(n^2). It's like playing with fire: sooner or later you get burnt. But if we understand why it's happening, perhaps we can avoid it carefully, e.g. with some pre-scheduling to avoid spawning too many forkOn threads.
  • I converted Pool from MVar to IORef with atomicModifyIORef (see the sketch below) and it made very little difference. It seems whatever makes one slow also harms the other.
  • It seems that IORef causes the thread switching that is harming performance, or at least that's a plausible theory. PVar fixes it in the StackOverflow benchmarks. I imagine forkOn fixes it too, by pinning the thread. But why do the IORef writes cause switching? Is there more allocation, and does the allocation cause it? There is definitely an allocation difference (7M in use vs 2G in use), but what causes that?
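For reference, the MVar to IORef conversion in the second bullet amounts to swapping the state-update primitive; a minimal sketch (not the actual Pool code):

    import Control.Concurrent.MVar
    import Data.IORef

    -- The same counter bump, expressed both ways.
    bumpMVar :: MVar Int -> IO ()
    bumpMVar var = modifyMVar_ var (\n -> pure (n + 1))

    bumpIORef :: IORef Int -> IO ()
    bumpIORef ref = atomicModifyIORef ref (\n -> (n + 1, ()))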

So it seems there's definitely something to be done, but I'm not sure what, so will wait for more information to appear on the two threads above.

@ndmitchell (Owner) commented
In case it's relevant, the source code to atomicModifyIORef is at https://gitlab.haskell.org/ghc/ghc/-/blob/8916e64e5437c99b82d5610286430328af1d86bc/rts/PrimOps.cmm#L634-702.

@ndmitchell (Owner) commented
More discussion at https://gitlab.haskell.org/ghc/ghc/issues/18224

@ndmitchell (Owner) commented
I wrote up my investigations in a blog post, which might be easier to follow (nothing not already in the thread above): https://neilmitchell.blogspot.com/2020/05/fixing-space-leaks-in-ghcide.html

@pepeiborra (Contributor) commented Jun 27, 2020

The goal of this patch was to reduce the overhead of Shake tasks in multicore systems due to context switches induced by forking trivial tasks (e.g. map lookups) off to a new thread.

The core of the proposal was to use forkOn instead of forkIO, which brings more trouble than it's worth.

At the risk of sounding naive, an alternative would be to not fork at all, and run the trivial computation on the spot if all the dependencies are available. Is something like that possible?
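A minimal sketch of that idea, assuming a hypothetical depsReady check (not Shake's API; as noted in the next comment, a naive version of this deadlocked):

    import Control.Concurrent (forkIO)
    import Control.Monad (void)

    -- If every dependency is already available, run the trivial action on the
    -- calling thread; otherwise fall back to forking as before.
    runMaybeInline :: IO Bool -> IO () -> IO ()
    runMaybeInline depsReady act = do
        ready <- depsReady
        if ready then act else void (forkIO act)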

@wz1000 (Author) commented Jun 27, 2020

I tried a naive approach to not forking short-running actions and it quickly led to deadlocks.

@ndmitchell (Owner) commented
@pepeiborra - a sensible thing to try. I did so in https://github.com/digital-asset/ghcide/issues/503#issuecomment-633642426 - it gave a 25% speedup. Nice, but not amazing, and if you hypothesise wrongly about what will be cheap, you lose potentially huge amounts of parallelism. Given that forkIO seems to have a recognised GHC bug (it can often switch out immediately after it starts running, which is not a good idea), I think that is the best direction to push.

However, I also intend to revisit benchmarking Ghcide and Shake at some point in the future. There were costs going to things like HashMap, which I probably reduced significantly by reducing collisions, so the Shake costs are likely to creep up over time as everything else gets optimised.

@pepeiborra (Contributor) commented Jun 30, 2020

I have been benchmarking ghcide recently and found the Shake overhead to be around 30% of the cost of a ShakeSession reload in ghcide. If the Shake perf improved by 25%, it would translate into a 7.5% improvement in the edit workflow in ghcide. I'll take two of those, thanks!

I agree that improving forkIO is the better fix though.

Alternatively, this could be exposed to the user. I would like to tag my rules as "cheap", "pure", etc., so that Shake has more information on how to schedule them.

@ndmitchell (Owner) commented
If it were a pure win, I'd definitely commit that. But it's 25% faster for runs with small values, and potentially 200% slower if we guess wrongly about how long certain operations take, given their dynamic nature. Tagging rules as cheap in Shake is one way to go, but it's both bothersome and potentially wrong.

Can you share the benchmark? Perhaps as a separate ticket. And if you have it, the profile output? Things have shifted a lot, so it might even be the case that the scheduler is no longer the bottleneck. However, do be aware that the GHC profiler tends to over-attribute things to Shake because of the Monad instance it uses.

@pepeiborra (Contributor) commented Jun 30, 2020

./ghcide-bench  --ghcide ./ghcide --select "completions after edit" --samples 1000 --ghcide-options "+RTS -P"

Where ghcide comes from https://github.com/pepeiborra/ghcide/tree/76d468318052a3f764cebbed1566090ee0421606

ghcide.zip

@pepeiborra (Contributor) commented
My guess/understanding is that the big maybeM cost center is Shake:

COST CENTRE            MODULE                                    SRC                                                          %time %alloc   ticks        bytes
maybeM                 Control.Monad.Extra                       src/Control/Monad/Extra.hs:(70,1)-(73,43)                    33.7   34.1  148729 122175136776

@ndmitchell (Owner) commented
I ran with 100 samples, which gives a startup of 15s, a runtime of 29-35s, and 337M resident.

I built for profiling by inlining Shake into Ghcide with:

ghc -O -prof -fprof-auto -o obj/prof/ghcide -outputdir obj/prof -XBangPatterns -XDeriveFunctor -XDeriveGeneric -XGeneralizedNewtypeDeriving -XLambdaCase -XNamedFieldPuns -XOverloadedStrings -XRecordWildCards -XScopedTypeVariables -XStandaloneDeriving -XTupleSections -XTypeApplications -XViewPatterns -package=ghc -ignore-package=ghc-lib-parser -DGHC_STABLE -Iinclude -idist/build/autogen -isrc -iexe -iC:/Neil/shake/src -iC:/Neil/shake/dist/build/autogen Main -fexternal-interpreter

That shows all the pieces of Shake broken down. If I naively add up all the Development.Shake.* bits, that gives me 19.2%. However, we know that the monad interpretation under profiling is very expensive, but under normal execution is relatively cheap - that accounts for 6.7%, leaving about 12.5% to Shake (plus there will be some cost that comes from the monad).

Given how profiling works, it sometimes misses the cost of spawning threads. I see 2302 threads per test (measured by profiling 10 samples, then profiling 100, and dividing the excess threads by 90). That might cost as much as 0.1s, based on benchmarks. However, from comments from SPJ, it seems like that's a bit of a bug, and we might be able to radically reduce it in future GHCs.

I don't see any low-hanging fruit, unfortunately.
