
Not compiled with GPU offload support #4486

Open
oldmanjk opened this issue May 17, 2024 · 14 comments
Assignees: dhiltgen
Labels: bug (Something isn't working), needs more info (More information is needed to assist)

Comments

@oldmanjk

What is the issue?

Trying to use Ollama as usual with the GPU. It worked before the update; now it only uses the CPU.
$ journalctl -u ollama
reveals:
WARN [server_params_parse] Not compiled with GPU offload support, --n-gpu-layers option will be ignored. See main README.md for information on enabling GPU BLAS support | n_gpu_layers=-1

  1. I do not manually compile ollama. I use the standard install script.
  2. The main README.md contains no mention of BLAS.

OS: Linux
GPU: Nvidia
CPU: Intel
Ollama version: 0.1.38

@oldmanjk added the bug label on May 17, 2024
@oldmanjk
Author

Figured it out. Ollama seems to think the model is too big to fit in VRAM (it isn't; it worked fine before the update). There is a lack of any useful communication about this to the user. As mentioned above, digging into the log actually sends you in the wrong direction.

@jmorganca
Member

Hi @oldmanjk, sorry about this. May I ask which model you are running, and on which GPU?

@mroxso

mroxso commented May 17, 2024

I think I got the same issue.
Running llama2:latest and llama3:latest on my GTX 1660 SUPER.
It worked before; after updating to the latest Ollama, it seems to mostly use the CPU, which is way slower.

// Update:
It turned out another process (a Python process) was occupying my VRAM; I saw this with nvidia-smi.
I restarted my local PC and now it works with the GPU again.

@jukofyork

Does anybody have an idea which code we need to remove to stop it ignoring our num_gpu settings (again, sigh...)?

@jukofyork

It's at the bottom of llm/memory.go:

        //if memoryRequiredPartial > memoryAvailable {
        //      slog.Debug("insufficient VRAM to load any model layers")
        //      return 0, 0, memoryRequiredTotal
        //}
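
For anyone following along, here is a minimal sketch of the effect that guard has on layer estimation. This is hypothetical, simplified code (the names do not match the real llm/memory.go); it only illustrates why the estimate returns zero offloaded layers when it decides even a partial load will not fit:

    package main

    import "fmt"

    // Hypothetical, simplified sketch; NOT the actual llm/memory.go. It only
    // illustrates the guard quoted above: if even a partial load does not fit
    // in the reported free VRAM, zero layers are offloaded and we fall back to CPU.
    func estimateGPULayers(memoryAvailable, memoryRequiredPartial, layerSize, totalLayers uint64) uint64 {
        if memoryRequiredPartial > memoryAvailable {
            return 0 // the commented-out check: give up on the GPU entirely
        }
        // Otherwise offload as many whole layers as the remaining VRAM allows.
        fit := (memoryAvailable - memoryRequiredPartial) / layerSize
        if fit > totalLayers {
            fit = totalLayers
        }
        return fit
    }

    func main() {
        const MiB = uint64(1024 * 1024)
        // Made-up numbers: 10 GiB free, 1 GiB minimum partial load, 500 MiB per layer, 80 layers.
        fmt.Println(estimateGPULayers(10*1024*MiB, 1024*MiB, 500*MiB, 80)) // 18
    }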

@oldmanjk
Author

> Hi @oldmanjk, sorry about this. May I ask which model you are running, and on which GPU?

llama3 on a 1080 Ti

@oldmanjk
Author

> I think I got the same issue. Running llama2:latest and llama3:latest on my GTX 1660 SUPER. It worked before; after updating to the latest Ollama, it seems to mostly use the CPU, which is way slower.
>
> // Update: It turned out another process (a Python process) was occupying my VRAM; I saw this with nvidia-smi. I restarted my local PC and now it works with the GPU again.

Definitely worth keeping an eye on your GPU memory (which I do - I keep a widget in view at all times), but that wasn't the issue for me.

@oldmanjk
Author

> Does anybody have an idea which code we need to remove to stop it ignoring our num_gpu settings (again, sigh...)?

Also weird: if Ollama thinks it can't fit the entire model in VRAM, it doesn't attempt to put any layers in VRAM at all. I actually like this behavior, though, because it makes it obvious something is wrong. Still, more communication to the user would be good.

@uncomfyhalomacro

Got the same issue here on openSUSE Tumbleweed. One thing I noticed is that it uses the GPU for a moment, and then it's gone...

Screencast_20240518_221101.webm

@dhiltgen self-assigned this on May 21, 2024
@dhiltgen
Collaborator

dhiltgen commented May 21, 2024

We've recently introduced ollama ps which will help show how much of the model has loaded into VRAM.

We've fixed a few bugs recently around num_gpu handling in some of our prediction logic, but I'm not sure that addresses your comment @jukofyork. Can you explain what you're trying to do? The goal of our prediction algorithm is to set num_gpu automatically based on the available VRAM. There is a minimum requirement for models and if we can't even allocate that minimal amount, then we will fall back to CPU. If we can satisfy the minimal amount, but not load the full amount, we will partially load on the GPU. Are you trying to set a lower value to preserve more space on the GPU, or did we predict incorrectly and you're trying to specify more layers? If our prediction was right, and you still push higher, we'll likely OOM crash by trying to allocate too many layers on the GPU.

@oldmanjk can you clarify your problem? Perhaps ollama ps output and server log can help us understand what's going on.
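
To make that decision flow concrete, here is a hedged sketch of the interaction described above between the automatic prediction and a user-supplied num_gpu. The names and the -1 sentinel are assumptions for illustration, not Ollama's actual code:

    package main

    import "fmt"

    // Hypothetical sketch of the behaviour described above; not Ollama's real code.
    // userNumGPU = -1 means "predict automatically"; an explicit value overrides it.
    func chooseNumGPU(userNumGPU, predictedLayers, minLayers int) (int, string) {
        if userNumGPU >= 0 {
            // An explicit num_gpu overrides the prediction; pushing it higher than
            // the prediction risks an out-of-memory crash on the GPU.
            return userNumGPU, "user override"
        }
        if predictedLayers < minLayers {
            // Not even the minimum fits: fall back to CPU only.
            return 0, "cpu fallback"
        }
        // Partial or full offload, whatever the prediction allows.
        return predictedLayers, "auto"
    }

    func main() {
        fmt.Println(chooseNumGPU(-1, 47, 1)) // 47 auto (partial offload)
        fmt.Println(chooseNumGPU(48, 47, 1)) // 48 user override (may OOM if the prediction was right)
    }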

@dhiltgen added the needs more info label on May 21, 2024
@oldmanjk
Author

> We've recently introduced ollama ps which will help show how much of the model has loaded into VRAM.
>
> We've fixed a few bugs recently around num_gpu handling in some of our prediction logic, but I'm not sure that addresses your comment @jukofyork. Can you explain what you're trying to do? The goal of our prediction algorithm is to set num_gpu automatically based on the available VRAM. There is a minimum requirement for models and if we can't even allocate that minimal amount, then we will fall back to CPU. If we can satisfy the minimal amount, but not load the full amount, we will partially load on the GPU. Are you trying to set a lower value to preserve more space on the GPU, or did we predict incorrectly and you're trying to specify more layers? If our prediction was right, and you still push higher, we'll likely OOM crash by trying to allocate too many layers on the GPU.
>
> @oldmanjk can you clarify your problem? Perhaps ollama ps output and server log can help us understand what's going on.

I'm not at a terminal atm, but Ollama refuses to load the same size models it used to, and that other backends will (like ooba with llama-cpp-python). Depending on the model/quant, I have to reduce num_gpu by a few layers compared to old Ollama or ooba. When you've carefully optimized your quants like I have, this is the difference between fully offloaded and not. On a repurposed mining rig, this destroys performance. Also, if I don't change the Modelfile (which is a pain on a slow rig), Ollama won't offload anything to the GPU.

@oldmanjk
Author

oldmanjk commented May 22, 2024

Example walkthrough:

  1. Determine that ooba/llama-cpp-python can load my latest Meta-Llama-3-70B-Instruct-Q3_K_L quant with 8K context and 48 layers offloaded to GPU, and run it for over 1K of context (which usually means it will run at full context without crashing).
  2. So now I'll go to the Modelfile and set num_ctx to 8192, num_gpu to 48, num_thread to 32 (to get ollama to use 24 threads (sigh)), and import the gguf into ollama.
  3. About a minute and 37.1 GB of wear and tear on my nvme later, success.
  4. Attempt to call model.
  5. Ollama offloads nothing to the GPU (or to system RAM, for that matter, outside of a few GiB). It's running it straight off the NVMe? Why? I have over 90 GiB of RAM free.
    $ ollama ps
    NAME                                            ID              SIZE    PROCESSOR       UNTIL              
    Meta-Llama-3-70B-Instruct-Q3_K_L-8K:latest      b9345a582769    41 GB   100% CPU        4 minutes from now
    
    ollama_logs.txt attached - note the three locations where I've highlighted falsehoods claimed by Ollama.
    ollama_logs.txt
  6. ollama rm Meta-Llama-3-70B-Instruct-Q3_K_L-8K (autocomplete would be nice)
  7. Go back to the Modelfile, set num_gpu to 47, and import the gguf into ollama again.
  8. Wait another minute or so (on this very-fast machine - on the old mining rig this can take upwards of ten minutes).
  9. Another 37.1 GB of wear and tear on my nvme later, success.
  10. Attempt to call model.
  11. Lucky! This time it works. Sometimes I have to repeat these steps a few times. 22.8 / 24.0 GiB used - that layer should have fit (in fact, we already know it does).

Edit - Now Ollama is using all 32 threads (I probably want it to use 24) and basically 0% GPU. I have no idea what's going on here.
Edit - Removing num_thread produces 20% CPU utilization, whereas before I was seeing 10%. I don't know what's going on here either. Assuming we want all physical cores utilized, it should be 24/32, or 75%.

@dhiltgen
Collaborator

dhiltgen commented May 31, 2024

@oldmanjk the log you attached above seems to show a 2nd attempt where we fell back to the runners/cpu_avx2/ollama_llama_server CPU subprocess, after most likely unsuccessfully running on the GPU. Can you share a complete log so we can see what went wrong?

sudo systemctl stop ollama
OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log

@oldmanjk
Author

oldmanjk commented Jun 1, 2024

> @oldmanjk the log you attached above seems to show a 2nd attempt where we fell back to the runners/cpu_avx2/ollama_llama_server CPU subprocess, after most likely unsuccessfully running on the GPU. Can you share a complete log so we can see what went wrong?
>
> sudo systemctl stop ollama
> OLLAMA_DEBUG=1 ollama serve 2>&1 | tee server.log

requested.log

What is clear from both logs (as I already pointed out in the previous log) is that Ollama is wrong about memory, both total and available. Ollama says my NVIDIA GeForce RTX 4090 (Founders Edition - as standard as it gets) has 23.6 GiB of total memory (obviously wrong) and 23.2 GiB of available memory (also wrong). The true numbers, according to nvidia-smi, are 24564 MiB (24.0 GiB, of course) total and 55 MiB used (24564 MiB - 55 MiB = 24509 MiB = 23.9 GiB available).

So Ollama thinks I have less memory than I do, and therefore refuses to load models it used to load just fine - hence why offloading one or two fewer layers to the GPU makes it work again. I think you have all the information you need from me. You just need to figure out why Ollama is detecting memory incorrectly. If I had to guess, it's probably a classic case of wrong units or conversions thereof (GiB vs GB) - you know, the thing they beat into our heads to be careful about in high school science class. The thing that lost the Mars Climate Orbiter.

Y'all need to slow down, be more careful, and put out good code. This would, paradoxically, give you more time, because you wouldn't have to spend so much time putting out fires. Again, all of this information was already available, so this was an unnecessary waste of my time too. I've attached the requested log anyway.
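
For reference, a quick arithmetic check of the unit conversions discussed above (this only reproduces the conversions; it says nothing about where Ollama's numbers actually come from):

    package main

    import "fmt"

    func main() {
        const bytesPerMiB = 1024.0 * 1024.0
        const bytesPerGiB = 1024.0 * bytesPerMiB
        const bytesPerGB = 1000.0 * 1000.0 * 1000.0

        totalMiB := 24564.0 // total memory reported by nvidia-smi
        usedMiB := 55.0     // used memory reported by nvidia-smi

        fmt.Printf("total: %.2f GiB\n", totalMiB*bytesPerMiB/bytesPerGiB)           // ~23.99 GiB
        fmt.Printf("free:  %.2f GiB\n", (totalMiB-usedMiB)*bytesPerMiB/bytesPerGiB) // ~23.93 GiB
        // A GiB/GB mix-up is roughly a 7% error, i.e. about 1.7 GiB on a 24 GiB card:
        fmt.Printf("24 GB expressed in GiB: %.2f\n", 24.0*bytesPerGB/bytesPerGiB)   // ~22.35 GiB
    }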

Edit - I'm no software dev, but...maybe start here: #4328
If I'm right that that's the problem (that a dev arbitrarily decided to shave off a layer's worth of space as a "buffer", breaking the existing workflows of countless users, with no one notifying the user base or even all the other devs, causing hours of wasted time and confusion)...well, that's pretty bone-headed. The obvious typo in the original comment (the kind one would catch by reviewing one's pull request even once) illustrates my point (about slowing down) pretty spectacularly. Hell, a spell checker would have caught that. If I sound frustrated, it's because I am.
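
And a sketch of the hypothesis in the edit above, with made-up numbers purely for illustration: if the estimator reserves roughly one extra layer's worth of VRAM as a safety buffer, a model that used to fit with 48 offloaded layers now only fits 47.

    package main

    import "fmt"

    func main() {
        // All numbers are hypothetical, chosen only to illustrate the
        // "shave off a layer as a buffer" hypothesis; none are measured.
        const usableMiB = 24000.0  // VRAM the estimator believes it can use
        const overheadMiB = 1500.0 // KV cache, scratch buffers, etc.
        const layerMiB = 468.0     // per-layer size of some quant

        layersThatFit := func(bufferMiB float64) int {
            return int((usableMiB - overheadMiB - bufferMiB) / layerMiB)
        }
        fmt.Println("without extra buffer:", layersThatFit(0))           // 48
        fmt.Println("with a one-layer buffer:", layersThatFit(layerMiB)) // 47
    }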
