Any way to increase performance? And switch to F32? #4506

Closed
AncientMystic opened this issue May 17, 2024 · 6 comments
Labels
question General questions

Comments

@AncientMystic

I am using a Pascal Tesla P4 8 GB GPU and I am looking for a way to increase performance.

Are there any tweaks/environment variables I can apply, or things I can install such as a particular PyTorch version, that will boost Ollama performance?

I am getting very mixed results: any model bigger than a few GB has a massive performance loss, roughly 2-6x slower generation than a model under 2 GB. Models of 4-5 GB are about 2x slower, 6-8 GB models are about 6x+ slower, and models over roughly 8-11 GB are either so slow they are useless or won't load.

In one recent test, a response of 521 tokens took 20 minutes on an 8 GB model (which is something like 0.4 tokens/s).

It is to be expected to some degree, as I do not have adequate VRAM to run extremely large models, but it would be nice if I could somehow get at least slightly faster results on models, most if not all of which should fit into VRAM.

Also, is there a setting to try F32 instead of F16 precision? Pascal cards seem to have much higher F32 performance, so I figure it is worth a try.

I have already set OLLAMA_NUM_PARALLEL & OLLAMA_MAX_LOADED to 1 to achieve lower VRAM usage (any more tweaks would be very much appreciated).
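For reference, this is roughly how I start the server with those set (assuming OLLAMA_MAX_LOADED_MODELS is the full name of the second variable):

```sh
# Rough sketch: launch the server with both limits set to 1.
# OLLAMA_MAX_LOADED_MODELS is assumed to be the full name of the variable
# referred to above as OLLAMA_MAX_LOADED.
export OLLAMA_NUM_PARALLEL=1
export OLLAMA_MAX_LOADED_MODELS=1
ollama serve
```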

System:
OS: Proxmox
CPU: i7-6700
Mem: 64 GB DDR4 2133 MHz
Main drive: 1 TB NVMe

VM: Ubuntu 22.04.4 LTS
Ollama: 0.1.38
vGPU: GRID-P4-2Q 6GB profile
Mem: 32 GB

@pdevine pdevine added the question General questions label May 18, 2024
@pdevine
Contributor

pdevine commented May 18, 2024

Hey @AncientMystic, the problem is the P4 doesn't have a lot of VRAM on it, so almost certainly a lot of layers are being offloaded to the CPU instead of the GPU. You can verify this with the ollama ps command, which was just introduced in 0.1.38.

Moving from a non-quantized 16-bit model to a non-quantized 32-bit model (most models aren't even published at those sizes) would only exacerbate the issue. You can increase the quantization level (i.e. use a 4-bit or lower quantized model), which will increase performance, but you will sacrifice some accuracy/quality in the results. You could also decrease the context size, which will save more memory.

The short answer is either:

  • choose a more quantized model;
  • decrease the context size (see the Modelfile sketch below); or
  • (unfortunately) upgrade to a beefier GPU
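As a concrete example, a minimal Modelfile along those lines could look like this (the base tag is just a placeholder; start from whichever quantized model you already use):

```
# Modelfile: derive from a quantized base and shrink the context window
# (the FROM tag below is only a placeholder)
FROM llama3:8b-instruct-q4_0
PARAMETER num_ctx 1024
```

Then build and run it with ollama create small-ctx -f Modelfile followed by ollama run small-ctx.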

I hope that helps! I'm going to close out the issue, but feel free to keep commenting.

@pdevine pdevine closed this as completed May 18, 2024
@AncientMystic
Author

Thank you for your response. I am already using models quantized lower than F16; my favourites are usually Q5 for performance vs. quality. But I was referring to Ollama itself, not the models used: the buffers and everything seem to be using F16, and on the P4, for example, half-precision performance is 89 GFLOPS whereas full precision is 5.7 TFLOPS. The card (like many others) just does not perform well at half precision, since most GPUs aren't made with that in mind, so I was hoping F32 would give some sort of benefit. In other tools, AI image generators, etc., I was able to switch from F16 to F32 with this card and it yielded quite a boost in performance. I was hoping there was an environment variable to force full precision in Ollama itself.

@pdevine
Contributor

pdevine commented May 19, 2024

@AncientMystic unfortunately not in Ollama. Ollama uses llama.cpp as the runner for inference, and I'm not aware of any options which will allow you to do that (although it's certainly plausible).

@AncientMystic
Author

AncientMystic commented May 19, 2024

Thank you for your responses, I really appreciate it.

I found that llama.cpp supports --memory-f32, which uses F32 instead of F16 for the KV cache and might be what I am looking for (or at least a step in the right direction). It also doubles the context memory requirement and cached prompt file size and is not recommended, and I think it would probably be a bad choice on newer GPUs with tensor cores, but it might be beneficial on Pascal with such abysmal F16 performance. There are a few llama.cpp options I'd like to try.

Is there a way to use llama.cpp flags with Ollama?
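For what it's worth, this is roughly the kind of invocation I mean when running llama.cpp directly (on an older build that still exposes the flag; the model path is just an example):

```sh
# Hypothetical direct llama.cpp run with the f32 KV-cache flag mentioned above.
# -ngl 99 offloads all layers to the GPU, -c sets the context size.
./main -m ./models/example-7b-q5_k_m.gguf --memory-f32 -ngl 99 -c 2048 -p "Hello"
```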

@AncientMystic
Author

AncientMystic commented May 22, 2024

I was just playing around with LM Studio and Ollama using the same model (7B Q5_K_M).

With Ollama, the best I can seem to get is 8 tokens/s; it averages 4-7 and dips to 1-2 tokens/s at times, with ollama ps reporting only 13-18% of the model on CPU rather than GPU.

LM Studio, using the same model as Ollama and a context of 2048, achieves 19.93 tokens/s (the highest I've seen is 22); with a context of 8192 it achieves 13.13 tokens/s, and 3.42 tokens/s with CPU only, not using the GPU at all 🤔

I have also tried a 13B model, which achieves 1.86 tokens/s in Ollama and 2.64 tokens/s in LM Studio; a much smaller difference than with the smaller model.

I'm wondering what I can do to get closer to the LM Studio performance in Ollama.
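One variable worth ruling out is the context size; pinning Ollama to the same 2048 context through the API options would keep the comparison apples-to-apples, e.g. (the model name here is just a placeholder for whatever is being compared):

```sh
# Pin the context to 2048 for the Ollama run so it matches the LM Studio test.
curl http://localhost:11434/api/generate -d '{
  "model": "mistral:7b-instruct-q5_K_M",
  "prompt": "Hello",
  "options": { "num_ctx": 2048 }
}'
```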

@AncientMystic
Author

@pdevine I just found this llama.cpp fork with a vector kernel for FP32 flash attention without tensor cores, made to run on Pascal GPUs, which seems to accelerate tokens/s by almost 2x:

https://github.com/JohannesGaessler/llama.cpp/tree/cuda-fa-no-tc-11

And here is the pull request detailing results and merging it into llama.cpp:

ggerganov/llama.cpp#7188

Many users still seem to be using Pascal, especially the P40s and P100s, so this might be a great option to boost speeds if it could be added behind an environment variable to enable it, like OLLAMA_FA_FP32 / FA_NOTC or something.
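In the meantime, something like the following is roughly how I would try the fork directly before anything lands in Ollama (build flag names and the -fa / --flash-attn option vary between llama.cpp versions, so treat this as a sketch; the model path is just an example):

```sh
# Rough sketch of building and trying the fork directly.
# LLAMA_CUDA=1 is assumed here; older llama.cpp builds used LLAMA_CUBLAS=1 instead.
git clone -b cuda-fa-no-tc-11 https://github.com/JohannesGaessler/llama.cpp.git
cd llama.cpp
make LLAMA_CUDA=1
./main -m ./models/example-7b-q5_k_m.gguf -ngl 99 -fa -p "Hello"
```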
