
Stop sequence missed #32

Open
Webifi opened this issue Jul 26, 2023 · 3 comments

Comments

Webifi (Contributor) commented Jul 26, 2023

Requests to api/v2/generate (websocket_api.py) can fail to detect the stop sequence in the generated response and will continue generating well past it. The problem becomes even more apparent when max_new_tokens is greater than 1.

I have a change I can commit that works around this issue: it scans for the stop sequence in the new delta appended to the tail of the previous deltas, then stops and returns a truncated new delta if the stop sequence is found.

If you'd like a PR for this, it will also require merging #31.
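For reference, a minimal sketch of the workaround described above (the function name and structure are illustrative, not the actual websocket_api.py code): keep the tail of the text already sent, append each new delta, and truncate the delta if any stop sequence appears in that combined window, which also catches stop sequences split across deltas.

```python
def truncate_at_stop(prev_tail: str, delta: str, stop_sequences: list[str]):
    """Return (delta_to_send, stopped).

    prev_tail should hold at least len(longest stop sequence) - 1 of the most
    recently sent characters, so a stop sequence split across deltas is caught.
    """
    combined = prev_tail + delta
    for stop in stop_sequences:
        pos = combined.find(stop)
        if pos != -1:
            # Drop everything from the stop sequence onward; characters before
            # the end of prev_tail were already sent, so only the remaining
            # part of the new delta (possibly empty) is returned.
            return combined[len(prev_tail):pos], True
    return delta, False
```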

borzunov (Member) commented Jul 27, 2023

Hi @Webifi,

Thanks for reporting!

Please note that we can't always just truncate the new delta, since everything before the last token has already gone through the transformer and is remembered in its attention caches. That is, you can truncate it for the UI, but the model still sees the conversation history as if the tokens were not truncated. This is why this code requires the extra_stop_sequences to have a length of 1 token each (only the last token is easy to replace).
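To illustrate the length-1 constraint, here is a rough sketch with made-up names (not the actual code): the newest token hasn't been fed back through the servers yet, so the client can still swap it out, while all earlier tokens are already baked into the remote attention caches.

```python
STOP_TOKEN_ID = 2  # e.g. the model's EOS id; illustrative value

def maybe_replace_last_token(generated_ids: list[int], extra_stop_ids: set[int]) -> bool:
    """If the newest token is a 1-token extra stop sequence, swap it for the stop token."""
    if generated_ids and generated_ids[-1] in extra_stop_ids:
        generated_ids[-1] = STOP_TOKEN_ID  # only this token hasn't entered the caches yet
        return True
    return False
```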

A proper fix would require adding an inference_session.truncate(length: int) function to Petals that drops part of the activation/attention caches on both clients and servers. I've been planning to add such a function at some point, but it's not there yet.
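Purely hypothetical, since truncate() doesn't exist yet, but a fix along those lines might let the server roll the session back to just before the overshoot instead of only masking it in the UI:

```python
class HypotheticalInferenceSession:
    """Stand-in for a Petals inference session with the proposed truncate()."""

    def __init__(self):
        self._length = 0  # tokens currently held in the attention caches

    def truncate(self, length: int) -> None:
        # Would tell every server in the chain to drop cached keys/values
        # beyond `length`, and drop the client-side cache tail as well.
        self._length = min(self._length, length)

# After detecting that the last n_overshoot tokens follow the stop sequence:
#   session.truncate(session_length - n_overshoot)
```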

That said, I think this isn't always important for LLM performance, since the LLM sees and understands that the response is finished anyway (we put stop_token as soon as we discover an extra_stop_sequence). So we can merge a simple workaround if it's of any use to you; just keep in mind that it won't be 100% correct.

Webifi (Contributor, Author) commented Jul 27, 2023

I'll just deal with it client side for now, until a proper fix can be implemented.

Webifi (Contributor, Author) commented Jul 27, 2023

@borzunov Could you expand on "remembered in its attention caches" a bit? I'm a little confused on the flow here.

Say the web client requests generation of the following prompt with max_new_tokens of 2:
This is a system prompt###Human: Quack like a duck###Assistant:
And the eventual completion is:
This is a system prompt###Human: Quack like a duck###Assistant: Quack, quack, quack.###Human (we cut off generation here because the new stop-sequence detection fired, and set session.last_token_id to the token for ###)

Now, the next request to the WebSocket API is:
This is a system prompt###Human: Quack like a duck###Assistant: Quack, quack, quack.###Human: Great, now bark like a dog.###Assistant:

How do the attention caches play a role here, if at all?
