# DEEP LEARNING SERIES- PART 4

In this FFN we apply backpropagation to find the partial derivative of the loss function with respect to w1, which is then used to update w1.

Hence, using backpropagation, the algorithm determines the updates required in the parameters so that the predicted output matches the true output. The simplest algorithm that performs these updates is known as vanilla gradient descent.
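The update described above can be sketched for a single weight. This is a minimal illustration, not the article's exact network: it assumes one training example, a single-neuron model `y_hat = sigmoid(w1 * x)`, a squared-error loss, and illustrative values for `x`, `y_true`, and the learning rate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y_true = 0.5, 1.0   # one training example (assumed values)
w1, lr = 0.1, 1.0      # initial weight and learning rate (assumed values)

for _ in range(100):
    y_hat = sigmoid(w1 * x)
    # Loss L = (y_hat - y_true)^2; the chain rule gives dL/dw1
    # as dL/dy_hat * dy_hat/dz * dz/dw1.
    grad_w1 = 2 * (y_hat - y_true) * y_hat * (1 - y_hat) * x
    w1 -= lr * grad_w1   # vanilla gradient descent update
```

Each pass computes the gradient by the chain rule (backpropagation) and takes one step against it, so the loss shrinks over the iterations.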

The strategy determines how the input is read during training (the whole batch at once, in mini-batches, or one example at a time).

The sigmoid here is one type of activation function, the function that transforms a neuron's input into its output. Differentiating the activation function gives the corresponding terms in the gradients.
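For the sigmoid specifically, the derivative has a convenient closed form, sigma'(z) = sigma(z)(1 - sigma(z)), which is the factor that shows up in the backprop gradients. A small sketch:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z)) -- the term contributed
    # to the gradient when differentiating through a sigmoid neuron.
    s = sigmoid(z)
    return s * (1.0 - s)
```

At z = 0 the sigmoid outputs 0.5 and its derivative peaks at 0.25; far from zero the derivative shrinks toward zero, which is why deep sigmoid networks can suffer from vanishing gradients.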

There are two common phenomena seen when training networks. They are

1. Underfitting
2. Overfitting

If the model is too simple to learn the data, it underfits. In that case, more complex models and algorithms must be used.

If the model is too complex, it can overfit the data: it memorises the training set instead of generalising. Overfitting and underfitting can be visualised by plotting the training and testing losses (or accuracies) over the iterations; a widening gap between the two curves indicates overfitting, while a good fit is represented by both curves overlapping. The method adopted to counter overfitting is known as regularisation.
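The curve comparison above can be expressed as a simple check. The loss histories and thresholds below are hypothetical, just to show the diverging-curves pattern:

```python
# Hypothetical per-epoch loss histories for illustration.
train_loss = [0.9, 0.5, 0.3, 0.2, 0.15, 0.12]
test_loss  = [0.95, 0.6, 0.4, 0.35, 0.45, 0.60]

gap = test_loss[-1] - train_loss[-1]
if gap > 0.2 and test_loss[-1] > min(test_loss):
    # Curves diverge: test loss rises again while train loss keeps falling.
    diagnosis = "overfitting"
elif train_loss[-1] > 0.5:
    # Even the training loss stays high: the model is too simple.
    diagnosis = "underfitting"
else:
    diagnosis = "good fit"
```

Here the test loss bottoms out mid-training and then climbs while the training loss keeps dropping, the classic overfitting signature.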

Regularisation is the procedure for preventing the overfitting of data. Indirectly, it helps increase the accuracy of the model. It is done by one or more of the following:

1. Adding noise to the input
2. Finding the optimum number of iterations by early stopping
3. Normalising the data (scaling the input towards a standard normal distribution)
4. Forming subsets of the network and training them using dropout
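The last item, dropout, randomly zeroes a fraction of activations during training, so each step effectively trains a different sub-network. A minimal sketch of the common "inverted dropout" variant (the function name and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5):
    # Drop each activation with probability p; scale the survivors
    # by 1/(1 - p) so the expected activation is unchanged.
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

h = np.ones(10)                  # hypothetical hidden-layer activations
h_dropped = dropout(h, p=0.5)    # roughly half are zeroed, rest become 2.0
```

At test time dropout is switched off; the 1/(1 - p) scaling during training is what lets the same weights be used unchanged for inference.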

So far we have seen many examples of many procedures, and confusion may arise at this point about which combination of items to use in the network for maximum optimisation. The process of finding that combination for maximum efficiency is known as hyperparameter tuning. The following items can be selected using this method.

1. Network architecture
   - Number of layers
   - Number of neurons in each layer
2. Learning algorithm
   - Momentum-based GD
   - RMSProp
3. Initialisation
   - Zero
   - He
   - Xavier
4. Activation functions
   - Sigmoid
   - Tanh
   - ReLU
   - Leaky ReLU
   - Softmax
5. Strategy
   - Batch
   - Mini-batch
   - Stochastic
6. Regularisation
   - L2 norm
   - Early stopping
   - Normalisation
   - Drop-out

All six of these categories are essential in building a network and improving its accuracy. Hyperparameter tuning can be done in two ways:

1. Based on knowledge of the task
2. Random combination

The first method involves choosing the items based on knowledge of the task to be performed. For example, if classification is considered then:

• Activation function: softmax in the output layer and sigmoid for the rest
• Initialisation: zero or Xavier
• Strategy: stochastic
• Algorithm: vanilla GD

The second method involves trying random combinations of these items and keeping the best one, i.e. the combination for which the loss function is minimum and the accuracy is high.
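This random-combination method can be sketched as a loop over sampled configurations. Everything here is illustrative: `train_and_evaluate` is a hypothetical stand-in for an actual training run, and the search space just reuses the items listed above.

```python
import random

random.seed(0)

# Search space built from the categories listed above.
search_space = {
    "activation": ["sigmoid", "tanh", "relu"],
    "init": ["zero", "he", "xavier"],
    "strategy": ["batch", "mini-batch", "stochastic"],
    "algorithm": ["vanilla_gd", "momentum_gd", "rmsprop"],
}

def train_and_evaluate(config):
    # Placeholder: a real version would train the network with this
    # config and return its validation loss; here it is a random score.
    return random.random()

best_config, best_loss = None, float("inf")
for _ in range(20):
    # Sample one random combination of items and score it.
    config = {k: random.choice(v) for k, v in search_space.items()}
    loss = train_and_evaluate(config)
    if loss < best_loss:
        best_config, best_loss = config, loss
```

After the loop, `best_config` holds the sampled combination with the lowest (here simulated) validation loss.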

In practice, hyperparameter tuning has often already been done by researchers, who report the best combination of items for maximum accuracy.