
Not compiled with GPU offload support #4486

Open
oldmanjk opened this issue May 17, 2024 · 14 comments
Assignees: dhiltgen
Labels: bug (Something isn't working), needs more info (More information is needed to assist)

Comments

@oldmanjk

What is the issue?

Trying to use Ollama as usual with the GPU. It worked before the update; now it only uses the CPU.
$ journalctl -u ollama
reveals:
WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1

  1. I do not manually compile ollama. I use the standard install script.
  2. The main README.md contains no mention of BLAS.

OS: Linux
GPU: Nvidia
CPU: Intel
Ollama version: 0.1.38

@oldmanjk added the bug label on May 17, 2024
@oldmanjk
Author

Figured it out. Ollama seems to think the model is too big to fit in VRAM (it isn't; it worked fine before the update). There is a lack of any useful communication about this to the user. As mentioned above, digging into the log actually sends you in the wrong direction.

@jmorganca
Member

Hi @oldmanjk, sorry about this. May I ask which model you are running, and on which GPU?

@mroxso

mroxso commented May 17, 2024

I think I got the same issue.
Running llama2:latest and llama3:latest on my GTX 1660 SUPER.
It worked before; after updating to the latest Ollama, it seems to mostly use the CPU, which is way slower.

// Update:
It turned out another process (a Python process) was occupying my VRAM; I saw this with nvidia-smi.
I restarted my local PC and now it works with the GPU again.

@jukofyork

Does anybody have an idea which code we need to remove to stop it ignoring our num_gpu settings (again, sigh...)?

@jukofyork

It's at the bottom of llm/memory.go:

        //if memoryRequiredPartial > memoryAvailable {
        //      slog.Debug("insufficient VRAM to load any model layers")
        //      return 0, 0, memoryRequiredTotal
        //}
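
For anyone following along, here is a minimal sketch of the effect that guard has on layer estimation. This is hypothetical, simplified code (the names do not match the real llm/memory.go); it only illustrates why the estimate returns zero offloaded layers when it decides even a partial load will not fit:

    package main

    import "fmt"

    // Hypothetical, simplified sketch; NOT the actual llm/memory.go. It only
    // illustrates the guard quoted above: if even a partial load does not fit
    // in the reported free VRAM, zero layers are offloaded and we fall back to CPU.
    func estimateGPULayers(memoryAvailable, memoryRequiredPartial, layerSize, totalLayers uint64) uint64 {
        if memoryRequiredPartial > memoryAvailable {
            return 0 // the commented-out check: give up on the GPU entirely
        }
        // Otherwise offload as many whole layers as the remaining VRAM allows.
        fit := (memoryAvailable - memoryRequiredPartial) / layerSize
        if fit > totalLayers {
            fit = totalLayers
        }
        return fit
    }

    func main() {
        const MiB = uint64(1024 * 1024)
        // Made-up numbers: 10 GiB free, 1 GiB minimum partial load, 500 MiB per layer, 80 layers.
        fmt.Println(estimateGPULayers(10*1024*MiB, 1024*MiB, 500*MiB, 80)) // 18
    }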

@oldmanjk
Author

> Hi @oldmanjk, sorry about this. May I ask which model you are running, and on which GPU?

llama3 on a 1080 Ti

@oldmanjk
Author

> I think I got the same issue. Running llama2:latest and llama3:latest on my GTX 1660 SUPER. It worked before; after updating to the latest Ollama, it seems to mostly use the CPU, which is way slower.
>
> // Update: It turned out another process (a Python process) was occupying my VRAM; I saw this with nvidia-smi. I restarted my local PC and now it works with the GPU again.

Definitely worth keeping an eye on your GPU memory (which I do - I keep a widget in view at all times), but that wasn't the issue for me.

@oldmanjk
Author

> Does anybody have an idea which code we need to remove to stop it ignoring our num_gpu settings (again, sigh...)?

Also weird: if Ollama thinks it can't fit the entire model in VRAM, it doesn't attempt to put any layers in VRAM at all. I actually like this behavior, though, because it makes it obvious something is wrong. Still, more communication to the user would be good.

@uncomfyhalomacro

Got the same issue here on openSUSE Tumbleweed. One thing I noticed is that it uses the GPU for a moment, and then it's gone...

Screencast_20240518_221101.webm

@dhiltgen self-assigned this on May 21, 2024
@dhiltgen
Collaborator

dhiltgen commented May 21, 2024

We've recently introduced ollama ps which will help show how much of the model has loaded into VRAM.

We've fixed a few bugs recently around num_gpu handling in some of our prediction logic, but I'm not sure that addresses your comment @jukofyork. Can you explain what you're trying to do? The goal of our prediction algorithm is to set num_gpu automatically based on the available VRAM. There is a minimum requirement for models and if we can't even allocate that minimal amount, then we will fall back to CPU. If we can satisfy the minimal amount, but not load the full amount, we will partially load on the GPU. Are you trying to set a lower value to preserve more space on the GPU, or did we predict incorrectly and you're trying to specify more layers? If our prediction was right, and you still push higher, we'll likely OOM crash by trying to allocate too many layers on the GPU.

@oldmanjk can you clarify your problem? Perhaps ollama ps output and server log can help us understand what's going on.
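
To make that decision flow concrete, here is a hedged sketch of the interaction described above between the automatic prediction and a user-supplied num_gpu. The names and the -1 sentinel are assumptions for illustration, not Ollama's actual code:

    package main

    import "fmt"

    // Hypothetical sketch of the behaviour described above; not Ollama's real code.
    // userNumGPU = -1 means "predict automatically"; an explicit value overrides it.
    func chooseNumGPU(userNumGPU, predictedLayers, minLayers int) (int, string) {
        if userNumGPU >= 0 {
            // An explicit num_gpu overrides the prediction; pushing it higher than
            // the prediction risks an out-of-memory crash on the GPU.
            return userNumGPU, "user override"
        }
        if predictedLayers < minLayers {
            // Not even the minimum fits: fall back to CPU only.
            return 0, "cpu fallback"
        }
        // Partial or full offload, whatever the prediction allows.
        return predictedLayers, "auto"
    }

    func main() {
        fmt.Println(chooseNumGPU(-1, 47, 1)) // 47 auto (partial offload)
        fmt.Println(chooseNumGPU(48, 47, 1)) // 48 user override (may OOM if the prediction was right)
    }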

@dhiltgen added the needs more info label on May 21, 2024
@oldmanjk
Author

> We've recently introduced ollama ps which will help show how much of the model has loaded into VRAM.
>
> We've fixed a few bugs recently around num_gpu handling in some of our prediction logic, but I'm not sure that addresses your comment @jukofyork. Can you explain what you're trying to do? The goal of our prediction algorithm is to set num_gpu automatically based on the available VRAM. There is a minimum requirement for models and if we can't even allocate that minimal amount, then we will fall back to CPU. If we can satisfy the minimal amount, but not load the full amount, we will partially load on the GPU. Are you trying to set a lower value to preserve more space on the GPU, or did we predict incorrectly and you're trying to specify more layers? If our prediction was right, and you still push higher, we'll likely OOM crash by trying to allocate too many layers on the GPU.
>
> @oldmanjk can you clarify your problem? Perhaps ollama ps output and server log can help us understand what's going on.

I'm not at a terminal atm, but Ollama refuses to load the same size models it used to, and that other backends will (like ooba with llama-cpp-python). Depending on the model/quant, I have to reduce num_gpu by a few layers compared to old Ollama or ooba. When you've carefully optimized your quants like I have, this is the difference between fully offloaded and not. On a repurposed mining rig, this destroys performance. Also, if I don't change the Modelfile (which is a pain on a slow rig), Ollama won't offload anything to the GPU.

@oldmanjk
Author

oldmanjk commented May 22, 2024

Example walkthrough:

  1. Determine that ooba/llama-cpp-python can load my latest Meta-Llama-3-70B-Instruct-Q3_K_L quant with 8K context and 48 layers offloaded to GPU, and run it for over 1K of context (which usually means it will run at full context without crashing).
  2. So now I'll go to the Modelfile and set num_ctx to 8192, num_gpu to 48, num_thread to 32 (to get ollama to use 24 threads (sigh)), and import the gguf into ollama.
  3. About a minute and 37.1 GB of wear and tear on my nvme later, success.
  4. Attempt to call model.
  5. Ollama offloads nothing to the GPU (or to system RAM, for that matter, outside of a few GiB). It's running it straight off the NVMe? Why? I have over 90 GiB of RAM free.
    $ ollama ps
    NAME                                            ID              SIZE    PROCESSOR       UNTIL              
    Meta-Llama-3-70B-Instruct-Q3_K_L-8K:latest      b9345a582769    41 GB   100% CPU        4 minutes from now
    
    ollama_logs.txt attached - note the three locations where I've highlighted falsehoods claimed by Ollama.
    ollama_logs.txt
  6. ollama rm Meta-Llama-3-70B-Instruct-Q3_K_L-8K (autocomplete would be nice)
  7. Go back to the Modelfile, set num_gpu to 47, and import the gguf into ollama again.
  8. Wait another minute or so (on this very-fast machine - on the old mining rig this can take upwards of ten minutes).
  9. Another 37.1 GB of wear and tear on my nvme later, success.
  10. Attempt to call model.
  11. Lucky! This time it works. Sometimes I have to repeat these steps a few times. 22.8 / 24.0 GiB used - that layer should have fit (in fact, we already know it does).

Edit - Now Ollama is using all 32 threads (I probably want it to use 24) and basically 0% GPU. I have no idea what's going on here.
Edit - Removing num_thread produces 20% CPU utilization, whereas before I was seeing 10%. I don't know what's going on here either. Assuming we want all physical cores utilized, it should be 24/32, or 75%.

@dhiltgen
Collaborator

dhiltgen commented May 31, 2024

@oldmanjk the log you attached above seems to show a 2nd attempt where we fell back to the runners/cpu_avx2/ollama_llama_server CPU subprocess, after most likely unsuccessfully running on the GPU. Can you share a complete log so we can see what went wrong?

sudo systemctl stop ollama
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log

@oldmanjk
Author

oldmanjk commented Jun 1, 2024

> @oldmanjk the log you attached above seems to show a 2nd attempt where we fell back to the runners/cpu_avx2/ollama_llama_server CPU subprocess, after most likely unsuccessfully running on the GPU. Can you share a complete log so we can see what went wrong?
>
> sudo systemctl stop ollama
> OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log

requested.log

What is clear from both logs (as I already pointed out in the previous log) is that Ollama is wrong about memory, both total and available. Ollama says my NVIDIA GeForce RTX 4090 (Founders Edition - as standard as it gets) has 23.6 GiB of total memory (obviously wrong) and 23.2 GiB of available memory (also wrong). The true numbers, according to nvidia-smi, are 24564 MiB (24.0 GiB, of course) total and 55 MiB used (24564 MiB - 55 MiB = 24509 MiB = 23.9 GiB available).

So Ollama thinks I have less memory than I do, and therefore refuses to load models it used to load just fine - hence why offloading one or two fewer layers to the GPU makes it work again. I think you have all the information you need from me. You just need to figure out why Ollama is detecting memory incorrectly. If I had to guess, it's probably a classic case of wrong units or conversions thereof (GiB vs GB) - you know, the thing they beat into our heads to be careful about in high school science class. The thing that lost the Mars Climate Orbiter.

Y'all need to slow down, be more careful, and put out good code. This would, paradoxically, give you more time, because you wouldn't have to spend so much time putting out fires. Again, all of this information was already available, so this was an unnecessary waste of my time too. I've attached the requested log anyway.
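
For reference, a quick arithmetic check of the unit conversions discussed above (this only reproduces the conversions; it says nothing about where Ollama's numbers actually come from):

    package main

    import "fmt"

    func main() {
        const bytesPerMiB = 1024.0 * 1024.0
        const bytesPerGiB = 1024.0 * bytesPerMiB
        const bytesPerGB = 1000.0 * 1000.0 * 1000.0

        totalMiB := 24564.0 // total memory reported by nvidia-smi
        usedMiB := 55.0     // used memory reported by nvidia-smi

        fmt.Printf("total: %.2f GiB\n", totalMiB*bytesPerMiB/bytesPerGiB)           // ~23.99 GiB
        fmt.Printf("free:  %.2f GiB\n", (totalMiB-usedMiB)*bytesPerMiB/bytesPerGiB) // ~23.93 GiB
        // A GiB/GB mix-up is roughly a 7% error, i.e. about 1.7 GiB on a 24 GiB card:
        fmt.Printf("24 GB expressed in GiB: %.2f\n", 24.0*bytesPerGB/bytesPerGiB)   // ~22.35 GiB
    }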

Edit - I'm no software dev, but...maybe start here: #4328
If I'm right that that's the problem (that a dev arbitrarily decided to shave off a layer's worth of space as a "buffer", breaking the existing workflows of countless users, with no one notifying the user base or even all the other devs, causing hours of wasted time and confusion)...well, that's pretty bone-headed. The obvious typo in the original comment (the kind one would catch by reviewing one's pull request even once) illustrates my point (about slowing down) pretty spectacularly. Hell, a spell checker would have caught that. If I sound frustrated, it's because I am.
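
And a sketch of the hypothesis in the edit above, with made-up numbers purely for illustration: if the estimator reserves roughly one extra layer's worth of VRAM as a safety buffer, a model that used to fit with 48 offloaded layers now only fits 47.

    package main

    import "fmt"

    func main() {
        // All numbers are hypothetical, chosen only to illustrate the
        // "shave off a layer as a buffer" hypothesis; none are measured.
        const usableMiB = 24000.0  // VRAM the estimator believes it can use
        const overheadMiB = 1500.0 // KV cache, scratch buffers, etc.
        const layerMiB = 468.0     // per-layer size of some quant

        layersThatFit := func(bufferMiB float64) int {
            return int((usableMiB - overheadMiB - bufferMiB) / layerMiB)
        }
        fmt.Println("without extra buffer:", layersThatFit(0))           // 48
        fmt.Println("with a one-layer buffer:", layersThatFit(layerMiB)) // 47
    }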
