
Benching local GGUF model layers allocated to vRAM but no GPU activity #330

Closed
polarathene opened this issue May 19, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@polarathene
Contributor

Describe the bug

I built mistral.rs with the cuda feature. When testing it with mistralrs-bench and a local GGUF, I observed via nvidia-smi that the layers were allocated to vRAM, but GPU activity stayed at 0% after warmup.

Meanwhile, within the same environment (the official llama.cpp Dockerfile, full-cuda variant), the equivalent llama.cpp bench tool ran with the GPU at 100%. I built both projects myself within the same container environment, so something seems off?
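
For context, here is roughly the comparison being described, as a sketch: apart from -r/--repetitions (mentioned later in this thread), the mistralrs-bench subcommand/flags, the build invocation, and all paths/model names are assumptions rather than the exact commands used.

```sh
# Sketch of the comparison (assumed flags and placeholder paths, see note above).

# Build mistral.rs with CUDA support
cargo build --release --features cuda

# Bench a local GGUF with mistralrs-bench (subcommand and flags assumed)
./target/release/mistralrs-bench gguf -m /path/to/model-dir -f model.Q4_K_M.gguf

# Equivalent llama.cpp benchmark, which does run at 100% GPU utilization
./llama-bench -m /path/to/model.Q4_K_M.gguf -ngl 99

# Watch GPU memory and utilization while either benchmark runs
watch -n 0.5 nvidia-smi
```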

More details here: #329 (comment)

I can look at running the Dockerfile from this project, but besides cuDNN there shouldn't be much difference AFAIK. I've not tried other commands or non-GGUF models, but I assume that shouldn't affect this?

Latest commit

v0.1.8: ca9bf7d

Additional context

There is a modification I've applied to be able to load local models without providing an HF token (I don't have an account yet and just wanted to try some projects with models): my workaround was to ignore 401 (Unauthorized) similarly to how 404 is ignored.

AFAIK this shouldn't negatively affect using the GGUF model? Additional files had to be provided despite these not being required by llama.cpp; from what I understand, all the relevant metadata is already available in the GGUF file itself?
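
For reference, the shape of that workaround might look like the following. This is a hypothetical sketch, not the actual mistral.rs code: `fetch_from_hub` and `HubError` are illustrative stand-ins for the real Hub API calls and error type.

```rust
use std::path::PathBuf;

// Hypothetical stand-in for the Hub API error; the real code wraps the
// hf-hub crate's error type instead.
struct HubError {
    status: Option<u16>,
}

// Placeholder for the actual Hub download call.
fn fetch_from_hub(_path: &str) -> Result<PathBuf, HubError> {
    Err(HubError { status: Some(401) })
}

// The workaround: treat 401 (no HF token) the same as 404 (file not found on
// the Hub) and fall back to the local path instead of returning an error.
fn get_file(path: &str) -> Result<PathBuf, HubError> {
    match fetch_from_hub(path) {
        Ok(p) => Ok(p),
        Err(HubError { status: Some(401 | 404) }) => Ok(PathBuf::from(path)),
        Err(e) => Err(e),
    }
}

fn main() {
    // With no HF token, fetch_from_hub returns 401 here; the workaround
    // falls back to the local path instead of failing.
    let path = get_file("/models/model.Q4_K_M.gguf").unwrap();
    println!("using {}", path.display());
}
```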

@polarathene added the bug (Something isn't working) label May 19, 2024
@EricLBuehler
Owner

This seems very strange. I'll do some digging, but my suspicion is that they do device mapping differently. Please see my comment in #329.

> There is a modification I've applied (#326 (comment)) to be able to load local models without providing an HF token (I don't have an account yet and just wanted to try some projects with models): my workaround was to ignore 401 (Unauthorized) similarly to how 404 is ignored.
>
> AFAIK this shouldn't negatively affect using the GGUF model? Additional files had to be provided despite these not being required by llama.cpp; from what I understand, all the relevant metadata is already available in the GGUF file itself?

No, that shouldn't be a problem.

@polarathene
Contributor Author

polarathene commented May 31, 2024

Seems to be using GPU now: #329 (comment)

The test finishes rather quickly though, so it's a bit tricky to monitor the load; if you have a command that would take a little longer, I could give that a go 👍

EDIT: With the advice below to increase -r, I can confirm 100% GPU load.

@EricLBuehler
Owner

You can configure the number of times the test runs with the -r or --repetitions flag; run --help for more info.
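
For example (only -r/--repetitions is taken from this thread; the subcommand, other flags, and paths are placeholder assumptions):

```sh
# Bump repetitions so the benchmark runs long enough to watch nvidia-smi.
mistralrs-bench -r 50 gguf -m /path/to/model-dir -f model.Q4_K_M.gguf
```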
