About transpose processing in `MultiHeadedAttention` class. #118

Tinghao-NTU · 2023-11-17T09:14:50Z

Below is the forward function of the MultiHeadedAttention class:

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(
            query, key, value, mask=mask, dropout=self.dropout
        )

        # 3) "Concat" using a view and apply a final linear.
        x = (
            x.transpose(1, 2)
            .contiguous()
            .view(nbatches, -1, self.h * self.d_k)
        )
        del query
        del key
        del value
        return self.linears[-1](x)

I noticed that `query`, `key`, and `value` are transposed (`lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)`) after passing through the linear layers. After the attention is computed, `x` is transposed back (`x.transpose(1, 2)`).

May I know why this processing is needed? Can we just use `lin(x).view(nbatches, -1, self.h, self.d_k)` and `x = x.contiguous().view(nbatches, -1, self.h * self.d_k)` instead?

I deleted all the transposes and the result is different, so I am wondering which version is correct: the original one with the transposes, or the one without.
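
For reference, the two variants already differ in the shapes they feed into the attention function. This is a minimal sketch, assuming made-up sizes and a random tensor standing in for the output of `lin(x)` (neither comes from the tutorial):

```python
import torch

# Hypothetical sizes, chosen only for illustration.
nbatches, n_tokens, h, d_k = 2, 5, 4, 8

# Random stand-in for the output of lin(x): (nbatches, n_tokens, h * d_k).
projected = torch.randn(nbatches, n_tokens, h * d_k)

# Original code: split into heads, then move the head axis before the token axis.
with_t = projected.view(nbatches, -1, h, d_k).transpose(1, 2)
# Variant from the question: split into heads only.
without_t = projected.view(nbatches, -1, h, d_k)

print(with_t.shape)     # torch.Size([2, 4, 5, 8])
print(without_t.shape)  # torch.Size([2, 5, 4, 8])

# torch.matmul multiplies over the last two dimensions and batches over the rest.
scores_with = torch.matmul(with_t, with_t.transpose(-2, -1))
scores_without = torch.matmul(without_t, without_t.transpose(-2, -1))

print(scores_with.shape)     # torch.Size([2, 4, 5, 5])
print(scores_without.shape)  # torch.Size([2, 5, 4, 4])
```

With the transpose, the score matrix is token-by-token for each head; without it, the score matrix is head-by-head for each token.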

BillyChen123 · 2023-12-21T06:40:55Z

I have the same confusion with this code.

gitfourteen · 2024-03-22T08:39:01Z

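Note that `-1` stands for the sequence length $N_{token}$ (the number of tokens, or time steps, of each sequence in the batch), and that the attention scores of every head have the same shape, $N_{token} \times N_{token}$. Because `torch.matmul` multiplies over the last two dimensions, `transpose(1, 2)` is needed to move the head dimension in front of the token dimension, giving the shape `(nbatches, h, N_token, d_k)`. Without it, the scores would have shape $h \times h$ for each token instead: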

    import math

    import torch


    def attention(query, key, value, mask=None, dropout=None):
        "Compute 'Scaled Dot Product Attention'"
        d_k = query.size(-1)
        # The matmul acts on the last two dims, so the scores compare
        # whatever axis sits at dim -2 (tokens, after the transpose).
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        p_attn = scores.softmax(dim=-1)
        if dropout is not None:
            p_attn = dropout(p_attn)
        return torch.matmul(p_attn, value), p_attn
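
One way to check which version actually attends across tokens is to perturb a single token and see whether the output at another position changes. This is a minimal sketch under stated assumptions: the sizes are made up, `attend` is a stripped-down copy of `attention()` without mask or dropout, and `run` is a hypothetical helper that skips the linear layers.

```python
import math

import torch


def attend(q, k, v):
    # Same computation as attention() above, minus mask and dropout.
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    return torch.matmul(scores.softmax(dim=-1), v)


torch.manual_seed(0)
nbatches, n_tokens, h, d_k = 1, 5, 4, 8  # hypothetical sizes

x = torch.randn(nbatches, n_tokens, h * d_k)
x_perturbed = x.clone()
x_perturbed[:, -1] += 1.0  # change only the last token


def run(inp, transpose):
    # Hypothetical helper: split into heads, attend, merge heads again.
    t = inp.view(nbatches, -1, h, d_k)
    if transpose:
        t = t.transpose(1, 2)      # (nbatches, h, n_tokens, d_k)
    out = attend(t, t, t)
    if transpose:
        out = out.transpose(1, 2)  # back to (nbatches, n_tokens, h, d_k)
    return out.contiguous().view(nbatches, -1, h * d_k)


# Does the output at token 0 react to a change in the last token?
reacts_with = not torch.allclose(run(x, True)[:, 0], run(x_perturbed, True)[:, 0])
reacts_without = not torch.allclose(run(x, False)[:, 0], run(x_perturbed, False)[:, 0])
print(reacts_with, reacts_without)
```

This should print `True False`: with the transposes, token 0 sees the change in the last token, while without them each token only mixes its own heads and never looks at other positions. So the original version with the transposes is the one that matches attention over the sequence.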
