
Gradient Descent from Scratch

A gradient descent algorithm for classification, implemented from scratch using the NumPy library.

Dependencies

  • NumPy
  • Matplotlib
  • Pandas

Importing Dataset

Datasets are first imported as pandas DataFrames and then converted into NumPy arrays
import numpy as np
import pandas as pd

train_data_frame = pd.read_csv('train_dataset.csv', header=None)
test_data_frame  = pd.read_csv('test_dataset.csv',  header=None)
train_dataset = np.array(train_data_frame)
test_dataset  = np.array(test_data_frame)

Splitting data into input and label

Here in MNIST, column 0 is the label and all other columns are the pixel inputs
train_lable = np.array([train_dataset[:, 0]])
train_data  = np.array(train_dataset[:, 1:785]).T

test_lable  = np.array([test_dataset[:, 0]]).T
test_data   = np.array(test_dataset[:, 1:785])
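
As a quick shape check (a small sketch, not from the repository), the resulting arrays should look like this:
print(train_data.shape)     # (784, number_of_training_samples)
print(train_lable.shape)    # (1, number_of_training_samples)
print(test_data.shape)      # (number_of_test_samples, 784)
print(test_lable.shape)     # (number_of_test_samples, 1)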

It is good practice to shuffle the data first
numpy.random.shuffle() shuffles the array in place; shuffling the whole dataset (label column and pixel columns together) keeps each label aligned with its row

np.random.shuffle(train_dataset)

Parameter Initialization

In this network, weights are initialized randomly and biases are initialized to zero, each stored as a list of NumPy arrays
def __init__(self, size):
    self.biases  = [np.zeros([y, 1]) for y in size[1:]]                               # one bias column per non-input layer
    self.weights = [np.random.randn(y, x)*0.01 for x, y in zip(size[:-1], size[1:])]  # small random weights, shape (out, in)
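
For the [784, 2800, 10] layout used later in this README (assuming the class is named network, as in the Creating Network section below), the parameter shapes can be checked like this:
net = network([784, 2800, 10])
print([b.shape for b in net.biases])    # [(2800, 1), (10, 1)]
print([w.shape for w in net.weights])   # [(2800, 784), (10, 2800)]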

Choosing Hyperparameter

Mini batch size is the number of samples passed through the network at a time when computing the error
Learning rate alpha controls how much the weights and biases are updated during backpropagation
Number of epochs is the number of times the whole dataset is used to train the network
A reasonable starting point is a mini batch size of about 1/10th of the available data; adjust it manually between training runs to find a good value
Alpha should be chosen so that learning is neither too slow nor takes such large steps that the network diverges away from a minimum
The number of epochs should be chosen so that the network does not overfit to noise; example starting values are sketched below
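
The values below are the ones used in the training call later in this README; treat them as a starting point to tune manually, not as optimal settings:
alpha           = 0.01    # learning rate
mini_batch_size = 2000    # samples per mini batch
epochs          = 40      # passes over the whole dataset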

Feed Forward

For layer l = 2, 3, ..., L compute
  • Z[l] = W[l] · A[l-1] + B[l]
  • A[l] = σ(Z[l])
def train_feed_forward(self, size, input, activators, mini_batch_size):
    # cache Z[l] for every layer so backpropagation can reuse it; z[0] holds the input
    self.z = [np.zeros([y, mini_batch_size]) for y in size]
    self.z[0] = input
    for i, (bias, weight) in enumerate(zip(self.biases, self.weights)):
        input = np.dot(weight, input) + bias                   # Z[l] = W[l] · A[l-1] + B[l]
        self.z[i + 1] = input
        input = getattr(activator, activators[i])(input)       # A[l] = σ(Z[l]), activation looked up by name
    return input
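
A minimal usage sketch (assuming my_network and train_data have been created as in the other sections of this README): each column of the output is the network's prediction for one sample of the mini batch.
mini_batch = train_data[:, :2000]    # shape (784, 2000)
output = my_network.train_feed_forward(size=[784, 2800, 10], input=mini_batch, activators=["tanh", "sigmoid"], mini_batch_size=2000)
print(output.shape)                  # (10, 2000)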

Activation Functions

Applying activation functions changes the network from linear to non-linear so that it can fit the outputs more accurately; without them, the network would be no different from linear regression
def sigmoid(z, derivative=False):
    if derivative==True:
        return (activator.sigmoid(z=z, derivative=False) * (1 - activator.sigmoid(z=z, derivative=False)))
    return (1.0 / (1.0 + np.exp(-z)))
def softmax(z, derivative=False):
    if derivative==True:
        return (activator.softmax(z=z, derivative=False) * (1 - activator.softmax(z=z, derivative=False)))
    return (np.exp(z) / np.sum(np.exp(z), axis=0, keepdims=True))   # normalize each column (sample) separately
def tanh(z, derivative=False):
    if derivative==True:
        return (1 - activator.tanh(z=z, derivative=False) ** 2)     # d/dz tanh(z) = 1 - tanh(z)^2
    return (np.tanh(z))
def relu(z, derivative=False):
    if derivative==True:
        return (z > 0).astype(float)                                 # 1 where z > 0, else 0
    return (np.maximum(z, 0))
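
A simple finite-difference check (a sketch, not part of the repository) confirms that a derivative branch matches the slope of the corresponding function:
z = np.linspace(-3, 3, 7).reshape(-1, 1)
eps = 1e-6
numeric  = (activator.tanh(z=z + eps) - activator.tanh(z=z - eps)) / (2 * eps)
analytic = activator.tanh(z=z, derivative=True)
print(np.allclose(numeric, analytic, atol=1e-4))    # True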

Error and Loss Function

For error calculation, the mean squared error is used
Output error δ[L] = ∇A C ⊙ σ′(Z[L])
Mean squared error = (Predicted_value − Expected_value)²
def loss(self, Y, Y_hat, derivative=False):
    if derivative==True:
        return (Y_hat-Y)             # gradient of 1/2 (Y_hat - Y)^2; the constant factor is absorbed into alpha
    return ((Y_hat - Y) ** 2)
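
For instance (illustrative values only), with a one-hot target and a network prediction, the element-wise loss and the gradient passed to backpropagation are:
Y     = np.array([[0.], [1.], [0.]])     # one-hot target, 3 classes for illustration
Y_hat = np.array([[0.2], [0.7], [0.1]])  # network output
print((Y_hat - Y) ** 2)    # [[0.04], [0.09], [0.01]]
print(Y_hat - Y)           # [[ 0.2], [-0.3], [ 0.1]]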

Backpropagation

In a neural network, the output depends on every neuron it passes through
For the output layer, the label gives the expected value directly, so its error is easy to compute
But for the hidden layers there is no such target available
So their errors have to be derived by propagating the output error backwards through the weights

For each l = L−1, L−2, …, 2 compute
δ[l] = ((W[l+1])ᵀ δ[l+1]) ⊙ σ′(Z[l])
# one training step: compute the gradients for the current mini batch
delta_nabla = self.find_nabla(size=size, activators=activators, mini_batch=mini_batch, mini_batch_size=mini_batch_size, y=y, alpha=alpha)

# inside find_nabla: forward pass, then gradient accumulators shaped like the biases and weights
y_hat = self.train_feed_forward(size=size, input=mini_batch, activators=activators, mini_batch_size=mini_batch_size)
delta_nabla_b = [np.zeros([y, 1]) for y in size[1:]]
delta_nabla_w = [np.zeros([y, x]) for x, y in zip(size[:-1], size[1:])]

# output layer error: δ[L] = ∇C ⊙ σ′(Z[L])
delta = self.loss(Y=y, Y_hat=y_hat, derivative=True) * activator.sigmoid(z=y_hat, derivative=True)
delta_nabla_b[-1] += np.sum(delta)
delta_nabla_w[-1] += np.dot(delta, self.z[-2].T)

# hidden layers, with layer_no running backwards inside a loop: δ[l] = ((W[l+1])ᵀ δ[l+1]) ⊙ σ′(Z[l])
delta = np.dot(self.weights[layer_no].T, delta) * activator.sigmoid(z=self.z[layer_no-1], derivative=True)
delta_nabla_b[layer_no-1] += np.sum(delta)
delta_nabla_w[layer_no-1] += np.dot(delta, self.z[layer_no-2].T)

delta_nabla = [delta_nabla_b, delta_nabla_w]

Updating Weights and Biases

∂C/∂W[l]_{jk} = A[l-1]_k · δ[l]_j
∂C/∂B[l]_j = δ[l]_j
self.biases  = [b-((alpha/mini_batch_size)*n_b) for b, n_b in zip(self.biases, delta_nabla[0])]   # B[l] ← B[l] − (α/m) ∂C/∂B[l]
self.weights = [w-((alpha/mini_batch_size)*n_w) for w, n_w in zip(self.weights, delta_nabla[1])]  # W[l] ← W[l] − (α/m) ∂C/∂W[l]

Creating Network

In size_layers, define the number of neurons in every layer of the network, and in activations, the activation function for every layer

The network will still work if layers are added or removed
neuron_layer = {"size_layers": [784, 2800, 10], "activations": ["tanh", "sigmoid"] }
my_network = network(neuron_layer["size_layers"])

Training Network

my_network.grad_descn(size=neuron_layer["size_layers"], expected_value=train_lable, training_data=train_data, activators=neuron_layer["activations"], alpha=0.01, mini_batch_size=2000, epochs=40)

Testing Network

result = my_network.test_feed_forward(size=neuron_layer["size_layers"], input=test_data.T, activators=neuron_layer["activations"])

no_trues = 0

# pick the index of the highest output per test sample and compare it with the label
for i in range(len(test_data)):
    max_ans = result[0, i]
    max_ind = 0
    for j in range(10):
        if(result[j, i]>max_ans):
            max_ind = j
            max_ans = result[j, i]
    if(test_lable[i]==max_ind):
        no_trues+=1
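
Equivalently (a short sketch using the same result and test_lable arrays), the accuracy can be computed in a vectorized way:
predictions = np.argmax(result, axis=0)                   # predicted digit for every test sample
accuracy = np.mean(predictions == test_lable.flatten())   # fraction of correct predictions
print("accuracy:", accuracy)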