Maximum CPU time - spread tasks evenly over timeslot #4181

Open
cminnoy opened this issue Feb 13, 2021 · 9 comments · May be fixed by #4207

Comments

@cminnoy

cminnoy commented Feb 13, 2021

Describe the problem
When choosing a value for "Maximum % of CPU time", BOINC schedules the run/sleep cycle in whole seconds.
For example, selecting 25% makes BOINC tasks run for one second and then sleep for three seconds.
In a multi-core environment this is quite detrimental: it hurts performance, causes power spikes, and more.
Also, because all the tasks start at once, the computer is often less responsive for that one second and then perfectly usable for the next three.

Describe the solution you'd like
It would be beneficial for the system to spread the multi-core tasks out in time, using slots.
For example, if we have an 8-core system with 25% usage selected and 8 tasks running:

  • First second: run 2 of the 8 tasks (tasks 1, 2)
  • Second second: run the next 2 tasks (tasks 3, 4)
  • Third second: run the next 2 tasks (tasks 5, 6)
  • Fourth second: run the last 2 tasks (tasks 7, 8)

Second example:
On a 100-core system with a 1% CPU load selected:
In the old situation all 100 cores would run for 1 second, completely bogging down the system, and for the next 99 seconds no BOINC tasks would run at all.
In the new situation, each task would run for 1 second in turn, so only one core is busy at any given moment.
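
A minimal sketch of how this staggering could work (all names below are made up for illustration; this is not the existing throttler code in client/app.cpp):

#include <cstdio>
#include <vector>

struct Task { int id; bool running = false; };

// One scheduling tick per second: with usage_limit_pct = 25 the period is
// 4 ticks, and task i is only allowed to run during tick (i mod 4), so on an
// 8-task system exactly 2 tasks are active at any moment instead of 8-then-none.
void throttle_tick(std::vector<Task>& tasks, double usage_limit_pct, int tick) {
    int period = static_cast<int>(100.0 / usage_limit_pct + 0.5);  // 4 for 25%, 100 for 1%
    int phase = tick % period;
    for (size_t i = 0; i < tasks.size(); ++i) {
        bool should_run = static_cast<int>(i) % period == phase;
        if (should_run != tasks[i].running) {
            tasks[i].running = should_run;
            std::printf("t=%ds: task %d %s\n", tick, tasks[i].id,
                        should_run ? "resumes" : "suspends");
        }
    }
}

int main() {
    std::vector<Task> tasks;
    for (int i = 0; i < 8; ++i) tasks.push_back({i + 1});
    for (int tick = 0; tick < 8; ++tick) throttle_tick(tasks, 25.0, tick);
}

Every task still gets the same average CPU share; only the phase in which it runs differs.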

This would provide a better user experience:

  • Fans will not spin up and down as aggressively, reducing noise
  • Power supplies see a much smoother load
  • Gentler thermal cycling of the CPU (fewer expansion/contraction cycles)
  • The computer is more usable in general
  • It simply makes much more sense

Additional context
If you want your computer to consume a steady amount of power, this is the only way.

Note: it would also be beneficial for the CPU % load to be configurable separately from the GPU % load. Same concept, but the time slots would also take into account the number of GPUs installed in the system.

@cminnoy
Author

cminnoy commented Feb 15, 2021

See function throttler in file client/app.cpp
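
For reference, the current behaviour boils down to a duty cycle like the simplified sketch below (placeholder function names, not the actual throttler code):

#include <chrono>
#include <thread>

void resume_all()  { /* stub: resume every BOINC task at once  */ }
void suspend_all() { /* stub: suspend every BOINC task at once */ }

// All tasks run together for a ~1 s burst, then all sleep together for the
// rest of the period; at 25% that is 1 s on, 3 s off, which is what produces
// the synchronized load and heat spikes described above.
void throttle_loop(double cpu_usage_limit_pct) {
    const double on_s  = 1.0;
    const double off_s = on_s * (100.0 / cpu_usage_limit_pct - 1.0);  // 3.0 at 25%
    for (;;) {
        resume_all();
        std::this_thread::sleep_for(std::chrono::duration<double>(on_s));
        suspend_all();
        std::this_thread::sleep_for(std::chrono::duration<double>(off_s));
    }
}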

@cnergyone

@AenBleidd
Member

@cnergyone, if you have something, please send a PR

@truboxl
Contributor

truboxl commented Feb 20, 2021

How is this different from setting the number of CPUs used, with 100% CPU time?
The "switch between tasks every X minutes" setting seems to cover this...

@cnergyone

Sorry, but I don't understand your point about "switch between tasks every X minutes".
This has nothing to do with switching between tasks/projects.

A. Max CPU time is a value that determines how much time (%) BOINC gets over a time span.
B. Max CPU cores is a value that determines how many cores all running tasks may occupy together.
C. Switching between tasks/projects happens every 60 minutes (but you can set it differently).

Parameter B is very rigid. You don't want this value changed dynamically or even very often.
I have seen VM tasks crash when this parameter was changed while they were running. So set it once and leave it.

Parameter C is about spreading compute time within and between projects. It has zero impact on the performance of BOINC or on CPU usage. Its only intent is to make sure that long-running tasks don't get an advantage over short-running tasks.

Parameter A can be changed whenever you want and will not crash any task, not even VM tasks. It is meant to be set more dynamically, and can be used for temperature or power control. I will add this feature later to boinccmd so the CPU time used can be set on the fly; you can then control BOINC's CPU usage easily from a script.

The difference between the old throttler code and the proposed one:
Old code: put a frog in a pot of water, boil it for 1 second, then let the frog freeze at 0 °C for 4 seconds.
New code: put the frog in a pot at 20 °C for 5 seconds.

The average temperature is the same, but the user experience for the frog is quite different.

@truboxl
Contributor

truboxl commented Feb 22, 2021

I still don't understand why you need to

run 2 tasks and suspend 6 tasks during that one-second timeframe
and then consecutively suspend / resume the other tasks during the following seconds

when you can just set 25% of the CPUs (8 * 0.25 = 2) with 100% CPU time used and achieve the same effect

this is just task hoarding

I also think there's an overhead to doing suspend / resume every now and then, especially if the tasks are not set to stay in memory

edit: if the suggestion is to make the CPU time used more consistent, then fine I suppose, but dismissing the core count setting as useless is not. The crash in VM tasks should be looked into and fixed properly.

@cnergyone

cnergyone commented Feb 22, 2021

  • The preempt is done with REMOVE_NEVER, so the tasks remain in memory. That's also what I see in top (see the sketch after this list).
  • The overhead is really small (milliseconds on most systems). The difference in overhead between the new throttler and the existing throttler is close to zero, and the difference in scheduling overhead imposed on the tasks by the two algorithms is zero.
    If you have proof of extra overhead, please post it here.
  • A task runs for many seconds at a time (for example 50 seconds if you set it to 50%). Tasks are only shifted in time relative to each other, which in the end even improves memory throughput, since fewer concurrent tasks compete for the cache.
  • If you have proof that dynamically changing max_ncpus_pct has exactly the same effect, again please post it here.
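
To illustrate the first point, here is a toy sketch of what "suspend but keep in memory" means on a POSIX system (illustrative types and signals only, not BOINC's actual ACTIVE_TASK code):

#include <signal.h>
#include <sys/types.h>

enum RemovePolicy { REMOVE_NEVER, REMOVE_ALWAYS };  // simplified stand-ins

// With REMOVE_NEVER the child is simply frozen with SIGSTOP: its working set
// stays resident, so resuming it later with SIGCONT costs almost nothing.
// Otherwise the app has to checkpoint and exit, and must be restarted from
// disk later, which is where real suspend/resume overhead would come from.
void preempt_task(pid_t pid, RemovePolicy policy) {
    if (policy == REMOVE_NEVER) {
        kill(pid, SIGSTOP);   // freeze in place, memory stays mapped
    } else {
        kill(pid, SIGTERM);   // stand-in for "tell the app to checkpoint and quit"
    }
}

void resume_task(pid_t pid) {
    kill(pid, SIGCONT);       // only meaningful for the REMOVE_NEVER case
}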

@davidpanderson
Contributor

The purpose of CPU throttling is to let you reduce average CPU temperature,
and to divide the heat evenly among the cores, i.e. to minimize the max temp of the cores.
If you have 4 cores, and you run 4 jobs, the OS will run them on different cores.
If you stagger the jobs so that only 2 run at a time, the OS will (possibly)
run everything on the same 2 cores; they'll get hot and the others will stay cold.
The current implementation is fine as far as I can tell.

@cnergyone

"The purpose of CPU throttling is to let you reduce average CPU temperature,
and to divide the heat evenly among the cores, i.e. to minimize the max temp of the cores."

This is the job of the kernel. If you give it fewer tasks than there are cores, it happily shuffles tasks around among the cores.
The kernel knows best where to put tasks, and will never pin a task to a core for longer than a few hundred milliseconds,
unless it has nowhere else to put it.

"If you have 4 cores, and you run 4 jobs, the OS will run them on different cores.
If you stagger the jobs so that only 2 run at a time, the OS will (possibly)
run everything on the same 2 cores; they'll get hot and the others will stay cold.
The current implementation is fine as far as I can tell."

This will not happen on Linux or Windows. You can check with htop and see.

You can find info on the linux scheduler here: https://www.kernel.org/doc/html/latest/scheduler/index.html

But nothing beats a scientific approach, so let's measure:

8-core Linux system, 16 logical CPUs, 16 CPU tasks running + 1 GPU task, each run takes 10 minutes
Projects active: Universe and World Community Grid

Original algorithm CPU time set to 20%:
Min temp: 62.3C
Max temp: 76.6C
Avg temp: 65.9C

Original algorithm CPU time set to 69%:
Min temp: 63.8C
Max temp: 76.0C
Avg temp: 65.8C

New algorithm CPU time set to 20%:
Min temp: 50.8C
Max temp: 69.8C
Avg temp: 56.7C

New algorithm CPU time set to 69%:
Min temp: 58.8C
Max temp: 64.1C
Avg temp: 62.1C

Conclusion: with the old algorithm the average CPU temperature stays the same between the low and medium settings, and there are temperature spikes of up to 76C. With the new algorithm the average temperature is lower, there is a clearer distinction between the low and medium settings in the average temperatures, and the spikes are smaller, leading to a lower maximum CPU temperature.

Interesting observation:
The new algorithm isn't perfect either, as there is some wobbling in the CPU temperatures.
I will investigate this further in the coming weeks and see where it can still be improved.
