DEEP LEARNING SERIES- PART 4

The previous article dealt with neural networks and the backpropagation algorithm. This article covers the mathematical implementation of that algorithm in a feed-forward network (FFN), followed by an important concept called hyperparameter tuning.

In an FFN, we apply backpropagation to find the partial derivative of the loss function with respect to a weight such as w1, so that w1 can be updated.

Backpropagation supplies the gradients, and these determine how much each parameter must change to bring the predicted output closer to the true output. The algorithm that applies these updates, moving each parameter a small step against its gradient, is known as Vanilla Gradient Descent.
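The update of w1 can be sketched in a few lines of Python. This is only a minimal illustration, assuming a single sigmoid neuron with a squared-error loss; the names `train_neuron`, `mean_loss`, `toy_data`, the learning rate and the number of epochs are all made up for the example and are not taken from the network in this series.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(data, lr=0.5, epochs=2000):
    """Fit y_hat = sigmoid(w1 * x + b) with vanilla gradient descent."""
    w1, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w1, grad_b = 0.0, 0.0
        for x, y in data:
            y_hat = sigmoid(w1 * x + b)
            # Chain rule with L = 0.5 * (y_hat - y)^2 and z = w1 * x + b:
            # dL/dw1 = (y_hat - y) * y_hat * (1 - y_hat) * x
            delta = (y_hat - y) * y_hat * (1.0 - y_hat)
            grad_w1 += delta * x
            grad_b += delta
        # Step against the average gradient over all examples.
        w1 -= lr * grad_w1 / n
        b -= lr * grad_b / n
    return w1, b

def mean_loss(data, w1, b):
    return sum(0.5 * (sigmoid(w1 * x + b) - y) ** 2 for x, y in data) / len(data)

# Hypothetical toy data: negative inputs belong to class 0, positive to class 1.
toy_data = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]
w1, b = train_neuron(toy_data)
```

Comparing `mean_loss(toy_data, 0.0, 0.0)` with `mean_loss(toy_data, w1, b)` shows how far the loss has dropped from its starting value.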

The way the input is fed to the network during training is determined by the strategy.

Strategy	Meaning
Stochastic	One example at a time
Batch	The entire input at once
Mini-batch	The input split into small batches
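The three strategies differ only in how many examples go into each parameter update. The sketch below assumes the usual convention (one example for stochastic, a small group for mini-batch, the whole input for batch); the helper `iterate_batches` and the batch size of 4 are hypothetical choices for illustration.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive slices of `samples` holding at most `batch_size` items."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(10))  # stand-in for 10 training examples

stochastic = list(iterate_batches(samples, 1))        # 10 updates per pass
mini_batch = list(iterate_batches(samples, 4))        # 3 updates per pass
batch = list(iterate_batches(samples, len(samples)))  # 1 update per pass
```

In practice the examples are usually shuffled before each pass; that step is left out here to keep the sketch short.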

The sigmoid is one type of activation function. An activation function is the function that transforms the input of a particular neuron into its output. Differentiating the activation function gives the corresponding terms in the gradients.
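For the sigmoid, the derivative has the convenient closed form s(z) * (1 - s(z)). The short sketch below checks that form against a numerical estimate; the helper names and the step size `eps` are chosen for this example only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    # Closed form: the derivative is expressed through the sigmoid itself.
    s = sigmoid(z)
    return s * (1.0 - s)

def numerical_derivative(f, z, eps=1e-6):
    # Central difference, used here only as a sanity check.
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)
```

Because the derivative reuses the value already computed in the forward pass, the gradient term for a sigmoid neuron costs almost nothing extra.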

Two common phenomena are seen when training networks:

Underfitting
Overfitting

If the model is too simple to learn the data, it can underfit. In that case, a more complex model or algorithm must be used.

If the model is too complex for the data, it can overfit: it fits the training data closely but performs worse on unseen data. This shows up as a gap between the training and testing loss curves. The method adopted to counter this is known as regularisation. Overfitting and underfitting can also be visualized by plotting the training and testing accuracies over the iterations. A good fit shows the two curves nearly overlapping at a high accuracy.
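Reading the two curves can be turned into a rough rule of thumb. The sketch below is only an illustration: the function `diagnose_fit` and its thresholds `high_loss` and `max_gap` are hypothetical, and sensible values depend on the loss function and the task.

```python
def diagnose_fit(train_loss, test_loss, high_loss=0.5, max_gap=0.1):
    """Classify a run from the last values of its training and testing loss curves."""
    final_train, final_test = train_loss[-1], test_loss[-1]
    if final_train > high_loss:
        return "underfit"   # the model has not even learned the training data
    if final_test - final_train > max_gap:
        return "overfit"    # training loss is low but testing loss is not
    return "good fit"       # both losses are low and close together
```

For example, a run whose training loss ends at 0.05 while its testing loss ends at 0.6 would be flagged as overfit under these assumed thresholds.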

Regularisation is the procedure used to prevent overfitting. Indirectly, it helps to increase the accuracy of the model on unseen data. It can be done by

Adding noise to the input so that the model cannot memorise individual examples
Finding the optimum number of iterations by early stopping
Normalising the data (rescaling the input, for example to zero mean and unit variance)
Forming subsets of the network and training them using dropout
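Of the methods above, dropout is the easiest to show in code. The sketch below assumes the common "inverted dropout" variant, in which the surviving activations are scaled up so that their expected value is unchanged; the function name, the drop probability and the example activations are made up for illustration.

```python
import random

def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each activation with probability `drop_prob`
    and scale the survivors by 1 / (1 - drop_prob)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the run is repeatable
hidden = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical hidden-layer outputs
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
```

Each call draws a different mask, so every training step effectively trains a different subset of the network. Dropout is applied only during training and is switched off when the model is used for prediction.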

So far we have seen many options for each part of a network, and it is natural to be unsure which combination to use for the best performance. The process of finding that combination is known as hyperparameter tuning. The following items can be selected using this method.

Network architecture- number of layers, number of neurons in each layer
Learning algorithm- Vanilla Gradient Descent, Momentum based GD, Nesterov accelerated gradient, AdaGrad, RMSProp, Adam
Initialisation- Zero, He, Xavier
Activation functions- Sigmoid, Tanh, ReLU, Leaky ReLU, Softmax
Strategy- Batch, Mini-batch, Stochastic
Regularisation- L2 norm, Early stopping, Addition of noise, Normalisation, Dropout

All six categories are essential in building a network and improving its accuracy. Hyperparameter tuning can be done in two ways:

Based on knowledge of the task
Random combination

The first method involves choosing the items based on knowledge of the task to be performed. For example, for a classification task one might choose

Activation function- softmax in the output layer and sigmoid for the rest
Initialisation- Xavier (zero initialisation gives every neuron in a layer identical updates, so it is usually avoided)
Strategy- stochastic
Algorithm- vanilla GD

The second method involves trying random combinations of these items and keeping the combination for which the loss is lowest and the accuracy is highest.
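The random method can be sketched as a simple loop. Everything here is an assumption made for illustration: the contents of `SEARCH_SPACE`, the number of trials, and especially `toy_evaluate`, which only stands in for "train the network with this combination and return its loss".

```python
import random

SEARCH_SPACE = {
    "num_layers": [1, 2, 3],
    "neurons_per_layer": [16, 32, 64],
    "algorithm": ["vanilla_gd", "momentum", "adam"],
    "initialisation": ["xavier", "he"],
    "activation": ["sigmoid", "tanh", "relu"],
    "strategy": ["stochastic", "mini_batch", "batch"],
}

def random_search(evaluate, space, trials, rng):
    """Sample `trials` random combinations and return the one with the lowest loss."""
    best_config, best_loss = None, float("inf")
    for _ in range(trials):
        config = {name: rng.choice(options) for name, options in space.items()}
        loss = evaluate(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss

def toy_evaluate(config):
    """Hypothetical stand-in for training a network and measuring its loss."""
    return abs(config["num_layers"] - 2) + abs(config["neurons_per_layer"] - 32) / 32

best_config, best_loss = random_search(
    toy_evaluate, SEARCH_SPACE, trials=20, rng=random.Random(0)
)
```

In a real run, `toy_evaluate` would be replaced by a function that builds and trains the network and reports the loss on held-out data, which is why each trial is expensive and the number of trials is kept small.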

For well-studied tasks, researchers have often already done the hyperparameter tuning and report the combination that gave them the best accuracy, which makes a useful starting point.

HAPPY READING!!!
