In the leaky ReLU, instead of fixing alpha at a small constant, we can treat alpha as a parameter of the network and let the network learn its optimal value during training. This variant is often called the parametric ReLU (PReLU) function. Alternatively, we can set alpha to a random value, and this variant is often called the randomized ReLU (RReLU) function.
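
The three variants described above differ only in where alpha comes from. The following is a minimal sketch in plain Python rather than a definitive implementation. The function names, the default alpha of 0.01, and the sampling bounds of 0.125 and 0.333 are illustrative assumptions.

```python
import random

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: keep non-negative inputs, scale negative inputs by alpha."""
    return x if x >= 0 else alpha * x

def prelu_alpha_grad(x):
    """Gradient of the leaky ReLU output with respect to alpha.

    In the parametric ReLU, alpha is updated with this gradient during
    training instead of being fixed in advance.
    """
    return x if x < 0 else 0.0

def randomized_relu(x, lower=0.125, upper=0.333, rng=random):
    """Randomized ReLU: draw alpha uniformly from [lower, upper] (assumed bounds)."""
    alpha = rng.uniform(lower, upper)
    return leaky_relu(x, alpha)

print(leaky_relu(-2.0, alpha=0.1))  # negative input scaled by a fixed alpha
print(prelu_alpha_grad(-2.0))       # signal used to learn alpha in PReLU
print(randomized_relu(-2.0))        # negative input scaled by a random alpha
```

Positive inputs pass through unchanged in all three cases, so the variants only behave differently for negative inputs.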

The softmax function converts the given inputs into values in the range of 0 to 1 that sum to 1, so they can be interpreted as probabilities. It acts as a generalization of the sigmoid activation function. When we perform a classification task, it is desirable to apply the softmax function in the output layer since it returns the probability of each class being the output.
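
As a rough illustration of the softmax behavior described above, the sketch below computes softmax in plain Python and checks that a two-class softmax reduces to the sigmoid. The function names and the example scores are assumptions chosen for illustration, and subtracting the maximum score is a common numerical-stability step rather than part of the definition.

```python
import math

def softmax(scores):
    """Map a list of raw scores to probabilities that sum to 1."""
    m = max(scores)  # subtract the max so exp() does not overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

probs = softmax([2.0, 1.0, 0.1])  # hypothetical output-layer scores
print(probs)       # one probability per class
print(sum(probs))  # approximately 1.0

# With two classes and the second score fixed at 0,
# the first softmax output matches sigmoid(x).
x = 1.5
print(softmax([x, 0.0])[0], sigmoid(x))
```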

Batch size is often set to a power of 2, such as 32, 64, 128, and so on, based on the memory constraints of the CPU or GPU.

One iteration implies that our network has seen one batch of data points once, whereas one epoch implies that our network has seen all the data points once.
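
The relationship between iterations and epochs can be made concrete with a little arithmetic. This is a small sketch with hypothetical numbers (1,000 data points, a batch size of 32, and 10 epochs), and it assumes the final, smaller batch is still processed rather than dropped.

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every data point once."""
    return math.ceil(num_samples / batch_size)

num_samples = 1000  # hypothetical dataset size
batch_size = 32     # hypothetical batch size
num_epochs = 10     # hypothetical number of epochs

per_epoch = iterations_per_epoch(num_samples, batch_size)
total_iterations = per_epoch * num_epochs

print(per_epoch)         # 32 iterations make up one epoch
print(total_iterations)  # 320 iterations over 10 epochs
```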

The number of neurons in the input layer is set to the number of inputs. The number of neurons in the output layer is set based on the task we perform. For instance, if we are performing regression, then we can set the number of neurons in the output layer to 1. If we are performing classification, then we can set the number of neurons in the output layer to the number of classes present in the dataset, where each neuron emits the probability of the corresponding class being the output.
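
The sizing rule above can be written as a small helper. This is only a sketch: the function name `layer_sizes`, its arguments, and the example feature and class counts are hypothetical.

```python
def layer_sizes(num_features, task, num_classes=None):
    """Return (input_neurons, output_neurons) for a given task."""
    if task == "regression":
        # A single output neuron emits the predicted value.
        return num_features, 1
    if task == "classification":
        if num_classes is None:
            raise ValueError("num_classes is required for classification")
        # One output neuron per class, each emitting a class probability.
        return num_features, num_classes
    raise ValueError("task must be 'regression' or 'classification'")

print(layer_sizes(13, "regression"))           # (13, 1)
print(layer_sizes(784, "classification", 10))  # (784, 10)
```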