
30 seconds, sometimes even 1 minute before it can copy the text produced by the AI. #72

Open
amonpaike opened this issue Apr 4, 2024 · 6 comments

Comments

@amonpaike

When the bot completes a response, especially a long one, at least 30 seconds and sometimes even a minute pass before I can copy the text to the clipboard by clicking on it. Scrolling is also very slow. It seems related to memory being released, or something the AI does after it has finished a response, but I don't understand why that would block interacting with the text or even scrolling it, given that oterm has already completed the operation. I was wondering whether the two things could be made independent, or at least whether, once the AI has finished writing, oterm could immediately become available for the user to carry out other operations.

@ggozad
Owner

ggozad commented Apr 5, 2024

I've never had this happen to me, so it might be tricky to debug. Can you give some insight? Does your machine have enough memory to run the models you are running?

@amonpaike
Author

> I've never had this happen to me, so might be tricky to debug. Can you give some insight? Does your machine have enough memory to run the models you are running?

16 GB DDR3, an AVX-only CPU, and an RTX 3060 with 12 GB VRAM. In fact the models respond very fast; for example dolphin starcoder2 has 15B parameters and runs quickly without problems. My guess is that it's "when it unloads the memory" that oterm waits for the process to finish. It only happens in some cases where the responses are very long (even when they were generated very fast).

To add some information: something similar happens when ollama hasn't been used for 5-10 minutes; the first response then takes about 30 seconds (probably because the model is loaded back into GPU VRAM).

@ggozad
Owner

ggozad commented Apr 14, 2024

If you are referring to the delay that happens when you load a new model, then this is normal.
Ollama takes a while to load or switch models. There's nothing really oterm can do to remedy that.
If your situation is that you start chatting with a model, leave it for a while, and then see a delay again, that happens because Ollama releases the model from memory and has to load it again. Again, oterm cannot change that behavior, but if I remember correctly there is a setting in Ollama to keep things in memory for longer.
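The setting referred to here is, I believe, Ollama's keep-alive behavior, which controls how long a model stays loaded after the last request. A minimal sketch of the two common ways to extend it, assuming a reasonably recent Ollama release (the model name is just an example):

```shell
# Server-wide: keep loaded models in memory for 1 hour after the
# last request (set before starting the server).
export OLLAMA_KEEP_ALIVE=1h
ollama serve

# Or per request, via the "keep_alive" field of the API
# ("-1" keeps the model loaded indefinitely):
curl http://localhost:11434/api/generate \
  -d '{"model": "llama2", "prompt": "hi", "keep_alive": "1h"}'
```

Note that this only addresses the load/unload delay, not the post-generation freeze described in this issue.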

Let me know if this is the case, so that I close the ticket.

@amonpaike
Author

amonpaike commented Apr 14, 2024

It's probably related to the fact that the model takes a while to load into VRAM (that's normal and probably unavoidable), and also takes a while to unload from VRAM (also normal, I assume).
What is not normal is that once the text has been fully generated, oterm is unusable for a number of seconds (on my low-end computer, long enough to become obvious and annoying). This happens, and becomes evident, when the AI model has generated a very long text.
The chat should be available for immediate interaction, as I have already said, so that the fully generated text can be copied and scrolling can be used. I believe this should be independent of what the AI model and ollama are doing; for example, scrolling can even be used during the generation phase. Perhaps the thread that generates the text could be separated from other functions such as scrolling, or copying the generated text to the clipboard. I don't know how complicated it is, at the programming level, to make these things more independent of each other.
Haven't you been able to reproduce this slowness yourself, so that you can see what I mean? Perhaps your computer is powerful enough that it never becomes noticeable?
Try it yourself: as soon as a very long text has been generated, try scrolling, or try to copy the text to the clipboard; you will see that oterm does not respond for a certain number of seconds.
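The separation being asked for here is the standard pattern of consuming a stream in a background task so the UI event loop stays free. This is a minimal sketch of the idea only, not oterm's actual code; the function names and the fake token stream are hypothetical stand-ins:

```python
import asyncio

async def stream_tokens():
    # Hypothetical stand-in for Ollama's streaming response.
    for i in range(5):
        await asyncio.sleep(0.01)
        yield f"token{i} "

async def consume(buffer):
    # Consuming the stream in a background task keeps the event
    # loop free, so input handling is never blocked by generation.
    async for tok in stream_tokens():
        buffer.append(tok)

async def main():
    buffer = []
    task = asyncio.create_task(consume(buffer))
    # Stand-in for the UI loop: it keeps iterating (and could keep
    # handling scroll/copy events) while generation is in progress.
    while not task.done():
        await asyncio.sleep(0.005)
    return "".join(buffer)

print(asyncio.run(main()))
```

Even with this structure, a freeze can still occur if, after generation, a single long synchronous operation (e.g. re-rendering or syntax-highlighting the whole response at once) runs on the event loop, which would match the symptom described above.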

@ggozad
Owner

ggozad commented Apr 16, 2024

I am afraid I can't reproduce it. Granted, I have a pretty beefy M2 with 96 GB available.
Is there anyone else that has the same experience?

@lainedfles
Contributor

I've unsuccessfully attempted to reproduce this behavior using KDE Plasma 6 Konsole with a 16 GB VRAM / 64 GB DRAM configuration. Neither fast GPU inference nor slow CPU inference seems to make much of a difference for me. I have, however, noticed that if I click fast enough, it does indeed copy incomplete replies when the mouse click registers in between renders of the Ollama stream output. Could it be related to your terminal emulator software (or are you using a native console)?
