
feat: adding GeLU algorithm in layers #17

Open
wants to merge 4 commits into main

Conversation

kayabaakihiko13

GeLU

The Gaussian Error Linear Unit (GeLU) stands out as a robust and high-performing neural network activation function. It draws inspiration from an amalgamation of characteristics found in dropout, zoneout, and Rectified Linear Unit (ReLU) functions. GeLU introduces a smooth and continuous non-linearity, which effectively addresses the vanishing gradient problem often encountered in deep neural networks.
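
For reference, the cited paper defines GeLU through the standard Gaussian CDF Φ and gives a tanh-based approximation (in LaTeX):

\mathrm{GELU}(x) = x\,\Phi(x) = \frac{x}{2}\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right)
\approx 0.5\,x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,\bigl(x + 0.044715\,x^{3}\bigr)\right)\right)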

Reference

Gaussian Error Linear Units (GELUs) (Hendrycks & Gimpel, 2016)

@eduardoleao052
Owner

Amazing addition! I'll review and test it, then merge it straight away. Thanks for the contribution!

@eduardoleao052
Owner

Hey, the GeLU was resulting in some NaNs, so I went to the paper you cited and I have a question: wouldn't this be closer to what it describes?

forward(z: Tensor): Tensor {
    // Forward pass for the GeLU nonlinearity
    const erf_coeff = 2 / Math.PI;
    const exp_term = z.mul(z).mul(-1).exp();
    const erf_term = exp_term.mul(erf_coeff).sqrt().add(1);
    const result = z.mul(0.5).mul(erf_term);
    return result;
}

In your code, you added z to erf_term inside of the last parenthesis. What is your opinion?
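
For comparison, here is a minimal scalar sketch of the paper's tanh approximation, written against plain Math functions rather than this repo's Tensor API (the name geluApprox is only for illustration):

function geluApprox(x: number): number {
    // 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))), per the cited paper
    const c = Math.sqrt(2 / Math.PI);
    return 0.5 * x * (1 + Math.tanh(c * (x + 0.044715 * x ** 3)));
}

Since tanh saturates, this form stays finite for large inputs, which may be relevant to the NaNs mentioned above.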

@kayabaakihiko13
Author

> Hey, the GeLU was resulting in some NaNs, so I went to the paper you cited and I have a question: wouldn't this be closer to what it describes?

However, the explanation regarding GeLU is similar to ReLU, with the distinction that GeLU permits the output to include small negative values when the input is less than zero.

For erf_term, I think we should calculate erf_term first, and after that multiply it by z and 0.5.

@medic-code
Contributor

medic-code commented Apr 3, 2024

Just a quick thought, but is it worth adding some unit tests and considering a simple integration test? (You allude to this with the NaNs you got.)

Something to think about more generally for algorithm additions going forward.
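
As a rough sketch of the kind of unit test meant here (Jest-style; geluApprox is a local scalar reference using the paper's tanh approximation, not the repo's actual layer API):

const geluApprox = (x: number): number =>
    0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x ** 3)));

test("GeLU reference values behave as expected", () => {
    expect(geluApprox(0)).toBeCloseTo(0, 5);
    expect(geluApprox(1)).toBeCloseTo(0.8412, 3);
    expect(geluApprox(-1)).toBeCloseTo(-0.1588, 3);
    // large-magnitude inputs should stay finite rather than producing NaN
    expect(Number.isNaN(geluApprox(-50))).toBe(false);
});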

@eduardoleao052
Owner

@medic-code I tested it by adding GeLUs to the integration test, but it does not seem to be working.

@medic-code
Contributor

medic-code commented Apr 5, 2024

Sounds like some unit tests are needed at least to interrogate it, though not necessarily pushing all of them to the develop branch. I'm more in favour of keeping the number of unit tests small where possible (just a preference).

@medic-code
Contributor

medic-code commented Apr 6, 2024

@kayabaakihiko13 could you perhaps add some commits to your PR that test your feature? If you need support with this, just reach out.

Also, is there any chance you can narrow down the any types in the TypeScript you've committed? If you need support with this, we've been working on migrating the other parts of the code to TS, which might be instructive.

@medic-code
Contributor

medic-code commented Apr 9, 2024

@eduardoleao052 just a quick question, but why do we multiply the GeLU-applied tensor by the input tensor? Note that I see we do this for ReLU too.

I've just been getting my head around this feature and layers in general; isn't it the case that we apply GeLU to a tensor and return that modified tensor?

@eduardoleao052
Owner

@medic-code
In the ReLU, what we're doing is creating a mask tensor, containing 0 where the input is negative, and 1 where the input is positive. Then we're multiplying it by the input tensor.

We could just create an output tensor by directly setting the negative entries of the input to zero, but that would mess with the tensor's gradients. By multiplying by a mask instead, the backward pass simply multiplies by that same mask.
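
As a rough illustration of the mask idea on plain number arrays (hypothetical helper names, not the library's actual Tensor methods):

function reluViaMask(input: number[]): { output: number[]; mask: number[] } {
    // mask holds 1 where the input is positive and 0 elsewhere
    const mask = input.map((v) => (v > 0 ? 1 : 0));
    // forward pass: output = input * mask (elementwise)
    const output = input.map((v, i) => v * mask[i]);
    return { output, mask };
}

// backward pass: the same elementwise multiplication routes the gradients,
// i.e. dL/d(input) = dL/d(output) * mask
function reluBackward(gradOutput: number[], mask: number[]): number[] {
    return gradOutput.map((g, i) => g * mask[i]);
}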

@medic-code
Contributor

medic-code commented Apr 10, 2024

I suppose my question is: why do we multiply by the input tensor? Is it not that we apply ReLU and pass the modified tensor to the next layer in the forward pass? I apologise, I'm not an ML expert, so I'm just curious.

https://www.cs.cmu.edu/~./15780/notes/pytorch.html - the simple two-layer network's forward method there seems to suggest what I'm saying.

class ReLU(Module):
    def forward(self, X):
        return torch.maximum(X, torch.tensor(0.))
        
class TwoLayerNN(Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.linear1 = Linear(in_dim, hidden_dim)
        self.linear2 = Linear(hidden_dim, out_dim, init_factor=1.0)
        self.relu = ReLU()

    def forward(self, X):
        return self.linear2(self.relu(self.linear1(X)))

@eduardoleao052
Owner

That implementation is correct; however, it requires a differentiable torch.maximum(a, b) operation (with a backward pass) that compares the input tensor a with a zeros tensor b, returning the elementwise maximum and thus applying ReLU.

In my implementation, I also multiply by a tensor b, which simply has zeros where the input tensor must become zero. I multiply it by the input tensor to make the operation differentiable. This way, when the gradients flow back from the output tensor, they reach the input tensor through a simple multiplication.
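
A quick sanity check, again with plain arrays and illustrative names, that the two formulations agree on the forward pass; the difference is only in how the backward pass falls out:

function reluViaMax(input: number[]): number[] {
    // differentiable-maximum formulation, analogous to torch.maximum(X, 0)
    return input.map((v) => Math.max(v, 0));
}

function reluViaMaskMultiply(input: number[]): number[] {
    // mask formulation: multiply the input by a 0/1 mask
    return input.map((v) => v * (v > 0 ? 1 : 0));
}

const sample = [-2, -0.5, 0, 1.5, 3];
const a = reluViaMax(sample);
const b = reluViaMaskMultiply(sample);
console.log(a.every((v, i) => v === b[i])); // true (note -0 === 0, so the sign of zero does not matter)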
