fix offset #1337

Open
wants to merge 1 commit into
base: master

Conversation

@luochang212 commented May 8, 2024

In section 8.3.4.2 (Sequential Partitioning), offset should be assigned as

offset = random.randint(0, num_steps - 1)

rather than

offset = random.randint(0, num_steps)

Note that random.randint(0, num_steps) can return both 0 and num_steps, because both endpoints are inclusive. For example, when num_steps = 3, offset is assigned random.randint(0, 3), so its possible values are 0, 1, 2, 3.
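A quick way to check the inclusive upper bound is to sample many times and collect the distinct values. This is only a minimal sketch: the helper name `observed_values` and the seeded generator are my own choices for illustration, not part of the book's code.

```python
import random


def observed_values(upper, trials=10_000, seed=0):
    """Return the sorted distinct values produced by randint(0, upper).

    Hypothetical helper for illustration; a seeded generator keeps the
    run reproducible.
    """
    rng = random.Random(seed)
    return sorted({rng.randint(0, upper) for _ in range(trials)})


num_steps = 3
current = observed_values(num_steps)       # mirrors random.randint(0, num_steps)
proposed = observed_values(num_steps - 1)  # mirrors random.randint(0, num_steps - 1)

print(current)
print(proposed)
```

With the current call the value num_steps itself shows up among the samples, while the proposed call stops at num_steps - 1.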

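Here, offset values of 0 and 3 are equivalent: both amount to an offset of 0. To avoid wasting tokens, offset is better assigned as random.randint(0, num_steps - 1), which makes the fullest use of the corpus sequence (otherwise, whenever offset happens to equal num_steps, one sequence is wasted).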
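To make the waste concrete, the sketch below counts how many minibatches each offset would yield. It rests on assumptions: only the num_tokens formula is quoted in this PR, so the steps after it (reshaping into batch_size rows and cutting windows of num_steps) are my reading of the sequential-partitioning function, and the sizes 35, 2, and 5 are just sample values.

```python
def num_minibatches(corpus_len, batch_size, num_steps, offset):
    """Count minibatches for one offset (hypothetical helper).

    Assumes the tokens are reshaped into batch_size rows and each row is
    cut into windows of num_steps, which is not shown in this PR.
    """
    # Formula quoted in the PR: reserve one token for the shifted labels.
    num_tokens = ((corpus_len - offset - 1) // batch_size) * batch_size
    tokens_per_row = num_tokens // batch_size
    return tokens_per_row // num_steps


corpus_len, batch_size, num_steps = 35, 2, 5  # sample values, not from the PR
counts = {
    offset: num_minibatches(corpus_len, batch_size, num_steps, offset)
    for offset in range(num_steps + 1)
}
print(counts)
```

For these particular sizes, every offset from 0 to num_steps - 1 gives three minibatches, while offset = num_steps gives only two. Other corpus lengths may behave differently, so treat this as one example rather than a general proof.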

Note: there is an easily confused point here. The line right after the offset assignment is

num_tokens = ((len(corpus) - offset - 1) // batch_size) * batch_size

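It happens to contain offset - 1. At first I thought the subtraction missing above was being made up for here, but reading on shows that is not the case. The 1 is subtracted because the label (y) sequence is shifted by one position relative to the x sequence, so the subtraction keeps the y labels from running past the right edge of the array. It is not compensating for the previous line (・ω< )★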
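The role of that - 1 can be seen with plain Python lists. This is a rough sketch, assuming the inputs and labels are sliced one position apart as described above; the helper `split_inputs_labels` and its `reserve_label` flag are hypothetical names used only to compare the two variants.

```python
def split_inputs_labels(corpus, batch_size, offset, reserve_label=True):
    """Slice inputs and one-step-shifted labels (hypothetical helper).

    reserve_label=True mirrors the quoted formula with its - 1;
    reserve_label=False drops the - 1 to show what goes wrong.
    """
    usable = len(corpus) - offset - (1 if reserve_label else 0)
    num_tokens = (usable // batch_size) * batch_size
    xs = corpus[offset: offset + num_tokens]
    # Labels start one token later, so they need one extra token at the end.
    ys = corpus[offset + 1: offset + 1 + num_tokens]
    return xs, ys


corpus = list(range(9))  # sample corpus, not from the PR
xs, ys = split_inputs_labels(corpus, batch_size=3, offset=0)
xs_bad, ys_bad = split_inputs_labels(corpus, batch_size=3, offset=0,
                                     reserve_label=False)

print(len(xs), len(ys))
print(len(xs_bad), len(ys_bad))
```

With the - 1 in place, xs and ys have the same length. Without it, the label slice runs out of tokens one position early and ends up shorter than the input slice, because Python slicing silently truncates at the end of the list.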
