The previous article covered neural networks and the backpropagation algorithm. This article covers the mathematics of applying that algorithm in a feedforward network (FFN), followed by an important concept called hyperparameter tuning.
In this FFN, we apply backpropagation to find the partial derivative of the loss function with respect to w1, which is then used to update w1.
Using backpropagation, the algorithm determines how much each parameter must change so that the predicted output moves closer to the true output. The algorithm that applies these updates is known as vanilla gradient descent.
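As a minimal sketch of this update, assume a hypothetical chain network with one input `x`, one hidden sigmoid neuron with weight `w1`, one output sigmoid neuron with weight `w2`, and a squared-error loss. The network shape, the loss, the learning rate `lr` and all function names are assumptions made for illustration; they are not taken from the network discussed in the article.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    # Hidden activation h and prediction y_hat of the assumed chain network.
    h = sigmoid(w1 * x)
    y_hat = sigmoid(w2 * h)
    return h, y_hat

def loss(y_hat, y):
    # Squared-error loss (an assumption for this sketch).
    return 0.5 * (y_hat - y) ** 2

def gradients(w1, w2, x, y):
    # Backpropagation: apply the chain rule from the loss back to each weight.
    h, y_hat = forward(w1, w2, x)
    delta2 = (y_hat - y) * y_hat * (1.0 - y_hat)  # dL/d(w2 * h)
    dL_dw2 = delta2 * h
    dL_dh = delta2 * w2                           # pass the error back to h
    dL_dw1 = dL_dh * h * (1.0 - h) * x            # dL/dw1
    return dL_dw1, dL_dw2

def train(x, y, w1=0.5, w2=-0.5, lr=0.5, steps=2000):
    # Vanilla gradient descent: step each weight against its gradient.
    for _ in range(steps):
        g1, g2 = gradients(w1, w2, x, y)
        w1 -= lr * g1
        w2 -= lr * g2
    return w1, w2
```

Calling `train(1.0, 0.8)` returns updated weights for which the loss on that single example is lower than at the starting point.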
The strategy determines how the input is read during training.

| Strategy | How the input is read |
|---|---|
| Stochastic | One sample at a time |
| Batch | The entire input at once |
| Mini-batch | The input split into small batches |
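The three strategies differ only in how many samples are read before each parameter update. The sketch below illustrates this with a hypothetical helper `make_batches`; the helper name and the toy data are assumptions for illustration.

```python
def make_batches(samples, batch_size):
    # Split the samples into consecutive groups of at most batch_size items.
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

samples = list(range(10))  # stand-in for ten training examples

stochastic = make_batches(samples, 1)           # one sample per update
mini_batch = make_batches(samples, 4)           # a few samples per update
batch = make_batches(samples, len(samples))     # all samples in a single update
```

With ten samples, one pass over the data gives ten updates in the stochastic case, three in the mini-batch case and one in the batch case. In practice the samples are usually shuffled before being split.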
The sigmoid used here is one type of activation function. An activation function defines how a neuron transforms its input into its output. Differentiating the activation function gives the corresponding terms in the gradients.
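As a small illustration, the sigmoid and its derivative can be written as follows. The function names are chosen for this sketch; the derivative uses the standard identity that the slope of the sigmoid equals `s * (1 - s)`, where `s` is the sigmoid output.

```python
import math

def sigmoid(z):
    # Squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    # Derivative of the sigmoid, expressed through its own output.
    s = sigmoid(z)
    return s * (1.0 - s)
```

This is the factor that appears in a gradient whenever backpropagation passes through a sigmoid neuron.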
Two phenomena are commonly seen when training networks:
- Underfitting
- Overfitting
If the model is too simple to learn the patterns in the data, it can underfit the data. In that case, a more complex model or algorithm must be used.
If the model is too complex for the data, it can overfit: it fits the training data closely but performs poorly on unseen data. This shows up as a gap between the training and testing loss curves. The method used to counter this is known as regularisation. Both overfitting and underfitting can be visualised by plotting the training and testing accuracies over the iterations. In a good fit, the two curves stay close together at a high accuracy.
Regularisation is a set of procedures for preventing overfitting. Indirectly, it helps increase the accuracy of the model on unseen data. It can be done by
- Adding noise to the input, so that the network does not memorise the exact training values.
- Finding the optimum number of iterations by early stopping.
- Normalising the data (rescaling the inputs, typically to zero mean and unit variance).
- Forming random sub-networks of the network and training them using dropout.
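Of the techniques listed above, early stopping is the simplest to sketch in code. The example below assumes that a validation loss has been recorded after every epoch and stops once it has failed to improve for `patience` epochs in a row. The function name, the `patience` parameter and the loss values in the usage note are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    # Return (best_epoch, best_loss), scanning the losses in order and
    # stopping after `patience` consecutive epochs without improvement.
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, current in enumerate(val_losses):
        if current < best_loss:
            best_loss = current
            best_epoch = epoch
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss stopped improving
    return best_epoch, best_loss
```

For the made-up losses `[0.9, 0.7, 0.6, 0.62, 0.65, 0.7, 0.5]` and `patience=3`, the scan stops at the sixth value and reports epoch 2 (loss 0.6) as the best. A larger `patience` would have allowed it to reach the final, lower value, which is the trade-off this parameter controls.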
So far we have seen many options for each part of the network, which raises the question of which combination to use for the best performance. The process of finding this combination is known as hyperparameter tuning. The following items can be selected using this method.
- Network architecture
  - Number of layers
  - Number of neurons in each layer
- Learning algorithm
  - Vanilla gradient descent
  - Momentum-based GD
  - Nesterov accelerated gradient
- Activation functions
  - Leaky ReLU
- Regularisation
  - L2 norm
  - Early stopping
  - Addition of noise
All of these categories are essential in building a network and improving its accuracy. Hyperparameter tuning can be done in two ways:
- Based on knowledge of the task
- Random combination
The first method involves choosing the items based on knowledge of the task to be performed. For example, for a classification task:
- Activation function: softmax in the output layer and sigmoid in the remaining layers
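- Initialisation: Xavier (zero initialisation is usually reserved for the biases, because all-zero weights keep the neurons in a layer identical)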
- Strategy: stochastic
- Algorithm: vanilla GD
The second method tries random combinations of these items and keeps the combination for which the loss is lowest and the accuracy is highest.
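A minimal sketch of such a random search is shown below. The search space, the function names and `toy_evaluate` are all hypothetical. In a real setting `evaluate` would train a network with the given configuration and return its validation loss; here a stand-in formula is used so that the example runs on its own.

```python
import random

# Hypothetical search space built from the kinds of items listed earlier.
SEARCH_SPACE = {
    "num_layers": [1, 2, 3],
    "neurons_per_layer": [8, 16, 32],
    "algorithm": ["vanilla_gd", "momentum_gd", "nesterov"],
    "activation": ["sigmoid", "leaky_relu"],
}

def sample_config(rng):
    # Pick one option at random for every hyperparameter.
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}

def random_search(evaluate, trials=20, seed=0):
    # Try `trials` random configurations and keep the one with the lowest loss.
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(trials):
        config = sample_config(rng)
        current = evaluate(config)
        if current < best_loss:
            best_config, best_loss = config, current
    return best_config, best_loss

def toy_evaluate(config):
    # Stand-in for training: pretends 2 layers of 16 neurons is ideal.
    return abs(config["num_layers"] - 2) + abs(config["neurons_per_layer"] - 16) / 16

best_config, best_loss = random_search(toy_evaluate, trials=50)
```

Fixing the `seed` makes a search repeatable, which helps when comparing runs. More trials improve the chance of finding a good combination at the cost of more training time.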
For many well-studied tasks, researchers have already carried out hyperparameter tuning and reported the combination of items that gave them the best accuracy, which can serve as a starting point.