Is there any reason we used a different LayerNorm implementation? #122

dpheap2222 · 2024-04-03T16:07:11Z

We have a custom-defined LayerNorm:

annotated-transformer/the_annotated_transformer.py

Lines 315 to 327 in debc9fd

    
    class LayerNorm(nn.Module):
        "Construct a layernorm module (See citation for details)."

        def __init__(self, features, eps=1e-6):
            super(LayerNorm, self).__init__()
            self.a_2 = nn.Parameter(torch.ones(features))
            self.b_2 = nn.Parameter(torch.zeros(features))
            self.eps = eps

        def forward(self, x):
            mean = x.mean(-1, keepdim=True)
            std = x.std(-1, keepdim=True)
            return self.a_2 * (x - mean) / (std + self.eps) + self.b_2

From the look of line 326, 'x.std' is called without 'correction=0'. By default, this means 'correction=1', which applies Bessel’s correction. Had we removed this correction, we could easily have implemented this with PyTorch's native LayerNorm class. Is there any reason we opted for the custom route? Thank you.
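To make the difference concrete, here is a minimal pure-Python sketch of the two formulas for a single feature vector. The function names are hypothetical, the learned scale and shift are left at their initial values (ones and zeros) and so omitted, and the second function follows the formula documented for nn.LayerNorm (biased variance, eps inside the square root, default eps=1e-5) rather than calling PyTorch itself. Note that besides Bessel's correction, the two also differ in where eps is added, so switching to 'correction=0' alone would probably not make the outputs match exactly.

```python
import math
import statistics


def custom_layer_norm(x, eps=1e-6):
    """Mirrors the quoted forward(): Bessel-corrected std, eps added to the std."""
    mean = statistics.mean(x)
    std = statistics.stdev(x)  # divides by n - 1, like x.std(-1) with correction=1
    return [(v - mean) / (std + eps) for v in x]


def native_style_layer_norm(x, eps=1e-5):
    """Formula documented for nn.LayerNorm: biased variance, eps inside the sqrt."""
    mean = statistics.mean(x)
    var = statistics.pvariance(x)  # divides by n, like correction=0
    return [(v - mean) / math.sqrt(var + eps) for v in x]


x = [1.0, 2.0, 3.0, 4.0]
custom = custom_layer_norm(x)
native = native_style_layer_norm(x)

# The outputs differ by roughly sqrt((n - 1) / n), the effect of Bessel's correction.
ratio = custom[0] / native[0]
print(ratio, math.sqrt((len(x) - 1) / len(x)))
```

For a feature dimension of 512, the factor sqrt(511/512) is very close to 1, so the two variants produce nearly the same activations, though not identical ones.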
