How to Batch-Encode Paired Input Sentences with Tokenizers: Seeking Clarification #1531

Open
insookim43 opened this issue May 14, 2024 · 0 comments

Hello.

I'm using a tokenizer with a TemplateProcessing post-processor to encode sentence pairs in batches with encode_batch.
One part confuses me: the documentation says the method takes two lists, one for sentences A and one for sentences B.

According to the guide documentation: "To process a batch of sentences pairs, pass two lists to the Tokenizer.encode_batch method: the list of sentences A and the list of sentences B."

Since it says to pass two lists, I read this as [[A1, A2], [B1, B2]] --(encode)-> {A1, B1}, {A2, B2}.

However, the method actually expects a batch of individual pairs, not the pairs split into one list of A sentences and one list of B sentences.
So the input should be [[A1, B1], [A2, B2]] to encode {A1, B1}, {A2, B2}.
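
To make the difference concrete, here is a minimal sketch of what I mean, assuming the behavior I observed (encode_batch takes one list whose items are (A, B) pairs). The helper names `pair_up` and `demo`, the toy vocabulary, and the template strings are my own illustration, not taken from the guide. If the sentences start out as two separate lists, they can be zipped into pairs first:

```python
def pair_up(sentences_a, sentences_b):
    """Turn [A1, A2, ...] and [B1, B2, ...] into [(A1, B1), (A2, B2), ...]."""
    if len(sentences_a) != len(sentences_b):
        raise ValueError("both lists must contain the same number of sentences")
    return list(zip(sentences_a, sentences_b))


def demo():
    # Hypothetical toy setup: a WordLevel model with a hand-written vocabulary,
    # so nothing has to be downloaded. Requires the `tokenizers` package.
    from tokenizers import Tokenizer
    from tokenizers.models import WordLevel
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.processors import TemplateProcessing

    vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2,
             "hello": 3, "there": 4, "how": 5, "are": 6, "you": 7}
    tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B:1 [SEP]:1",
        special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
    )

    sentences_a = ["hello there", "how are you"]
    sentences_b = ["how are you", "hello there"]

    # One list of pairs, not a list of A sentences plus a list of B sentences.
    encodings = tokenizer.encode_batch(pair_up(sentences_a, sentences_b))
    for enc in encodings:
        print(enc.tokens, enc.type_ids)
```

Calling `demo()` prints the tokens and type ids for each pair, which makes it easy to check that A1 is combined with B1 and A2 with B2.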

I've also confirmed that the length of the input list for encode_batch keeps increasing with the number of batches.

Because the guide says to pass the list of sentences A and the list of sentences B, this is where the confusion arises.
If I've misunderstood anything, could you help clarify this point so I can understand it better?
