
Stop sequence missed #32

Open
Webifi opened this issue Jul 26, 2023 · 3 comments

Comments

Webifi (Contributor) commented Jul 26, 2023

Requests to api/v2/generate (websocket_api.py) can fail to detect the stop sequence in the generated response and will continue generating well past it. The problem becomes even more apparent when max_new_tokens is greater than 1.

I have a change I can commit that works around this issue: it scans for the stop sequence in the new delta appended to the tail of the previous deltas, then stops and returns a truncated new delta if the stop sequence is found.

If you'd like a PR for this, it will also require merging #31.
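For reference, a minimal sketch of the workaround described above (the function name and structure are illustrative, not the actual websocket_api.py code): keep the tail of the text already sent, append each new delta, and truncate the delta if any stop sequence appears in that combined window, which also catches stop sequences split across deltas.

```python
def truncate_at_stop(prev_tail: str, delta: str, stop_sequences: list[str]):
    """Return (delta_to_send, stopped).

    prev_tail should hold at least len(longest stop sequence) - 1 of the most
    recently sent characters, so a stop sequence split across deltas is caught.
    """
    combined = prev_tail + delta
    for stop in stop_sequences:
        pos = combined.find(stop)
        if pos != -1:
            # Drop everything from the stop sequence onward; characters before
            # the end of prev_tail were already sent, so only the remaining
            # part of the new delta (possibly empty) is returned.
            return combined[len(prev_tail):pos], True
    return delta, False
```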

borzunov (Member) commented Jul 27, 2023

Hi @Webifi,

Thanks for reporting!

Please note that we can't always just truncate the new delta, since everything before the last token has already gone through the transformer and is remembered in its attention caches. That is, you can truncate it for the UI, but the model still sees the conversation history as if the tokens were not truncated. This is why this code requires the extra_stop_sequences to have a length of 1 token each (only the last token is easy to replace).
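To illustrate the length-1 constraint, here is a rough sketch with made-up names (not the actual code): the newest token hasn't been fed back through the servers yet, so the client can still swap it out, while all earlier tokens are already baked into the remote attention caches.

```python
STOP_TOKEN_ID = 2  # e.g. the model's EOS id; illustrative value

def maybe_replace_last_token(generated_ids: list[int], extra_stop_ids: set[int]) -> bool:
    """If the newest token is a 1-token extra stop sequence, swap it for the stop token."""
    if generated_ids and generated_ids[-1] in extra_stop_ids:
        generated_ids[-1] = STOP_TOKEN_ID  # only this token hasn't entered the caches yet
        return True
    return False
```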

A proper fix would require adding an inference_session.truncate(length: int) function to Petals that drops part of the activation/attention caches on both clients and servers. I've been planning to add such a function at some point, but it's not there yet.
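Purely hypothetical, since truncate() doesn't exist yet, but a fix along those lines might let the server roll the session back to just before the overshoot instead of only masking it in the UI:

```python
class HypotheticalInferenceSession:
    """Stand-in for a Petals inference session with the proposed truncate()."""

    def __init__(self):
        self._length = 0  # tokens currently held in the attention caches

    def truncate(self, length: int) -> None:
        # Would tell every server in the chain to drop cached keys/values
        # beyond `length`, and drop the client-side cache tail as well.
        self._length = min(self._length, length)

# After detecting that the last n_overshoot tokens follow the stop sequence:
#   session.truncate(session_length - n_overshoot)
```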

That said, I think this isn't always important for LLM performance, since the LLM sees and understands that the response is finished anyway (we put stop_token as soon as we discover an extra_stop_sequence). So we can merge a simple workaround if it's of any use to you; just keep in mind that it won't be 100% correct.

Webifi (Contributor, Author) commented Jul 27, 2023

I'll just deal with it client side for now, until a proper fix can be implemented.

Webifi (Contributor, Author) commented Jul 27, 2023

@borzunov Could you expand on "remembered in its attention caches" a bit? I'm a little confused on the flow here.

Say the web client requests generation of the following prompt with max_new_tokens of 2:
This is a system prompt###Human: Quack like a duck###Assistant:
And the eventual completion is:
This is a system prompt###Human: Quack like a duck###Assistant: Quack, quack, quack.###Human (we cut off generation here because the new stop-sequence detection fired, and set session.last_token_id to the token for ###)

Now, the next request to the WebSocket API is:
This is a system prompt###Human: Quack like a duck###Assistant: Quack, quack, quack.###Human: Great, now bark like a dog.###Assistant:

How do the attention caches play a role here, if at all?
