The VGG architecture consists of blocks of convolutional layers, each followed by a pooling layer. VGG uses 3 x 3 convolutions and 2 x 2 pooling throughout. It is often referred to as VGG-n, where n is the number of weight layers, excluding the pooling and softmax layers.
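The pattern above can be sketched in code. This is a minimal illustration, assuming PyTorch (the source names no framework, and the channel counts and `vgg_block` helper are illustrative, not from the original): a few 3 x 3 convolutions followed by 2 x 2 max pooling, which halves the spatial size.

```python
import torch
import torch.nn as nn

def vgg_block(in_channels, out_channels, num_convs=2):
    """One VGG-style block: num_convs 3 x 3 convolutions, then 2 x 2 pooling."""
    layers = []
    for _ in range(num_convs):
        # 3 x 3 convolution with padding 1 preserves the spatial size
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_channels = out_channels
    # 2 x 2 max pooling halves height and width
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

block = vgg_block(3, 64)
x = torch.randn(1, 3, 224, 224)
print(block(x).shape)  # torch.Size([1, 64, 112, 112])
```

Stacking several such blocks, each doubling the channel count while halving the spatial resolution, gives the familiar VGG-16/VGG-19 shape.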
Let's suppose we are performing an object detection task. The object can appear anywhere in the image: it may sit in the center, or it may be tucked into a small corner. The size of the object also varies from image to image; in some images the object is large, while in others it is small.
Since objects vary greatly in size and location, it is difficult to identify them using only a single filter of fixed size. So, in the Inception network, we use multiple filters of varying sizes.
The Inception network contains nine inception blocks, stacked one above the other. Within each block, we take the input and perform convolution operations in parallel with three filters of varying sizes: 1 x 1, 3 x 3, and 5 x 5. The outputs of these parallel convolutions are concatenated along the channel dimension, and the result is fed to the next inception block.
We can break down a convolutional layer with a larger filter size into a stack of convolutional layers with smaller filter sizes; this is known as factorized convolution.
For example, a convolutional layer with a 5 x 5 filter can be replaced by two stacked convolutional layers with 3 x 3 filters: together they cover the same 5 x 5 receptive field, but with fewer parameters.
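We can verify the parameter saving directly. The sketch below (assuming PyTorch; the choice of 64 channels is illustrative) compares one 5 x 5 convolution against a stack of two 3 x 3 convolutions with the same input and output channels.

```python
import torch.nn as nn

channels = 64  # illustrative channel count

# A single 5 x 5 convolution
five = nn.Conv2d(channels, channels, kernel_size=5, padding=2, bias=False)

# The factorized version: two stacked 3 x 3 convolutions
three_a = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
three_b = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

p5 = sum(p.numel() for p in five.parameters())        # 5*5*64*64 = 102400
p33 = sum(p.numel() for p in three_a.parameters()) \
    + sum(p.numel() for p in three_b.parameters())    # 2*3*3*64*64 = 73728
print(p5, p33)  # 102400 73728
```

The factorized stack uses 2 x 3 x 3 = 18 weights per channel pair instead of 5 x 5 = 25, a saving of 28%, and the extra nonlinearity between the two 3 x 3 layers adds representational power.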