# Convolutional Neural Networks

This post introduces the convolutional neural network.

# 1. Basic concepts of convolutional neural networks

A Convolutional Neural Network (CNN) is a feedforward neural network that performs convolution operations and has a deep structure; it is one of the representative algorithms of deep learning. A CNN has the capability of representation learning: it can produce shift-invariant classifications of its input according to its hierarchical structure, so it is also called a Shift-Invariant Artificial Neural Network (SIANN).

The convolutional neural network was inspired by Hubel and Wiesel's electrophysiological studies of the visual cortex of cats. Yann LeCun was the first to use CNNs for handwritten digit recognition and has remained a pioneer in the field. In recent years, convolutional neural networks have continued to develop in multiple directions, with breakthroughs in speech recognition, face recognition, general object recognition, motion analysis, natural language processing, and even brain-wave analysis.

What distinguishes a convolutional neural network from an ordinary neural network is that a CNN contains a feature extractor composed of convolution layers and subsampling layers. In a convolution layer, each neuron is connected only to some of the neurons in a local neighborhood of the previous layer. A convolution layer usually contains several feature maps; each feature map consists of neurons arranged in a rectangle, and the neurons of the same feature map share weights. These shared weights are the convolution kernel. A convolution kernel is generally initialized as a matrix of small random values, and during training it learns reasonable weights. The direct benefit of sharing weights (the convolution kernel) is that it reduces the number of connections between layers of the network and lowers the risk of overfitting. Subsampling is also called pooling; there are generally two types, average subsampling (mean pooling) and maximum subsampling (max pooling). Subsampling can be regarded as a special convolution process. Together, convolution and subsampling greatly simplify the model and reduce its number of parameters.

# 2. Principle of convolutional neural network

## 2.1 Neural network

Each unit of the neural network is as follows:

The corresponding formula (a standard sigmoid unit) is:

$$h_{W,b}(x) = f\!\left(\sum_{i=1}^{n} W_i x_i + b\right), \qquad f(z) = \frac{1}{1+e^{-z}}$$

This unit is also called a logistic regression model. When multiple such units are combined in a layered structure, a neural network model is formed. The following figure shows a neural network with one hidden layer.

The corresponding formula is:

$$a^{(2)} = f\!\left(W^{(1)}x + b^{(1)}\right), \qquad h_{W,b}(x) = f\!\left(W^{(2)}a^{(2)} + b^{(2)}\right)$$

This can be extended to 2, 3, 4, 5, … hidden layers.

The training method of a neural network is similar to that of logistic regression. However, because the network has multiple layers, the chain rule is needed to propagate derivatives to the nodes of the hidden layers; that is, gradient descent plus the chain rule, which goes by the technical name of backpropagation.
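As a minimal sketch of gradient descent plus the chain rule, consider a single logistic unit trained on a tiny, hypothetical dataset (the data and learning rate below are illustrative choices, not from the original post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: 4 samples, 3 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = np.array([0.0, 1.0, 1.0, 0.0])

W = np.zeros(3)
b = 0.0
lr = 0.1

for _ in range(1000):
    # Forward pass: h = f(Wx + b)
    h = sigmoid(X @ W + b)
    # Chain rule: d(loss)/dW = d(loss)/dh * dh/dz * dz/dW.
    # For cross-entropy loss with a sigmoid, the error term simplifies to (h - y).
    err = h - y
    W -= lr * (X.T @ err) / len(y)
    b -= lr * err.mean()
```

In a multilayer network, the same error term is propagated backwards through each layer by repeated application of the chain rule, which is all "backpropagation" means.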

## 2.2 Convolutional neural network



The following figure shows the basic structure of the convolutional neural network.

The network consists of three parts. The first part is the input layer. The second part consists of multiple convolution layers and pooling layers. The third part is a fully connected multilayer perceptron classifier.

### 2.2.1 Local Receptive Field

There are two effective tools for reducing the number of parameters of a convolutional neural network. The first is the local receptive field. It is generally believed that human perception of the outside world proceeds from local to global; likewise, the spatial correlations in an image are strong between nearby pixels and weaker between distant pixels. Therefore, each neuron does not need to perceive the entire image: it only needs to sense local information, and that local information is then combined at higher layers to obtain global information. The idea of partial connectivity is also inspired by the structure of the biological visual system, in which neurons in the visual cortex receive information locally (that is, these neurons respond only to stimulation in certain regions).

As shown in the following figure: the left figure shows full connection, and the right figure shows local connection.

In the right figure above, if each neuron is connected to only 10×10 pixels instead of every pixel, the number of weights drops to 1000000×100 parameters, one ten-thousandth of the fully connected value. The 10×10 weights corresponding to the 10×10 pixels are in fact a convolution operation.

### 2.2.2 Weight sharing

There are still too many parameters, however, so a second tool is needed: weight sharing. In the local connection above, each neuron corresponds to 100 parameters, and there are 1,000,000 neurons in total. If the 100 parameters of all 1,000,000 neurons are equal, the number of parameters becomes just 100.
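The parameter counts in the running example can be checked with simple arithmetic (the 1000×1000 image size below is the standard assumption behind the 1,000,000-neuron example):

```python
# Parameter counts for a 1000x1000 image with 1,000,000 hidden neurons.
n_pixels = 1000 * 1000
n_neurons = 1000 * 1000

full = n_pixels * n_neurons     # every neuron sees every pixel
local = n_neurons * (10 * 10)   # each neuron sees only a 10x10 patch
shared = 10 * 10                # all neurons share one 10x10 kernel

print(full, local, shared)  # 1000000000000 100000000 100
print(local / full)         # 0.0001 -- one ten-thousandth
```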

How should weight sharing be understood? The 100 parameters (that is, the convolution operation) can be regarded as a way of extracting features that is independent of location. The underlying principle is that the statistical properties of one part of an image are the same as those of the other parts. This means that features learned in one part can also be used in another part, so the same learned features can be applied at every position in the image.

More intuitively, when a small patch (for example, 8×8) is randomly selected from a large image and some features are learned from that small sample, the features learned from the 8×8 sample can be used as detectors and applied anywhere in the image. In particular, the features learned from the 8×8 sample can be convolved with the original large image, yielding a feature activation value for every position in the large image.

The following figure shows a 3×3 convolution kernel being convolved over a 5×5 image. Each convolution is a mode of feature extraction: like a sieve, it filters out the parts of the image that satisfy its condition (the larger the activation value, the better the condition is satisfied).
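The 3×3-over-5×5 convolution described above can be sketched in a few lines of NumPy (the averaging kernel here is an arbitrary illustrative choice):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation (what deep-learning libraries call convolution)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Slide the kernel over each 3x3 window and sum the products.
            out[i, j] = (image[i:i+kh, j:j+kw] * kernel).sum()
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0  # a 3x3 averaging "sieve"
feature_map = conv2d(image, kernel)
print(feature_map.shape)  # (3, 3) = (5 - 3 + 1, 5 - 3 + 1)
```

Note that the output shrinks from 5×5 to 3×3, matching the general formula (size − kernel + 1) used later in this post.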

### 2.2.3 Multiple-convolution kernel

If there are only 100 parameters, there is only one 10×10 convolution kernel, and feature extraction is obviously insufficient. We can add multiple convolution kernels, for example 32 kernels, to learn 32 features. The case of multiple convolution kernels is shown in the following figure:

In the figure above, different colors indicate different convolution kernels. Each convolution kernel generates another image; for example, two convolution kernels can generate two images, and the two images can be regarded as different channels of one image. The following figure contains a small error: w1 should be w0 and w2 should be w1. They are still referred to as w1 and w2 below.

The following figure shows a convolution operation over four channels, with two convolution kernels generating two output channels. Note that each of the four channels corresponds to one slice of a convolution kernel. Considering only w1: the value at position (i, j) of its output is obtained by adding the convolution results at (i, j) over the four channels and then applying the activation function.

Therefore, when four channels are convolved into two channels as in the figure above, the number of parameters is 4×2×2×2, where 4 is the number of input channels, the first 2 indicates that two output channels are generated, and the final 2×2 is the size of the convolution kernel.
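The multi-channel convolution just described can be sketched as follows; the random inputs and 5×5 spatial size are illustrative assumptions, while the 4-channel / 2-kernel / 2×2 configuration matches the 4×2×2×2 count in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5, 5))     # (in_channels, H, W)
w = rng.normal(size=(2, 4, 2, 2))  # (out_channels, in_channels, kh, kw)

def multi_channel_conv(x, w):
    out_c, in_c, kh, kw = w.shape
    oh, ow = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((out_c, oh, ow))
    for o in range(out_c):
        for i in range(oh):
            for j in range(ow):
                # Sum the per-channel convolution results at (i, j)...
                out[o, i, j] = (x[:, i:i+kh, j:j+kw] * w[o]).sum()
    # ...then apply the activation function (sigmoid here).
    return 1.0 / (1.0 + np.exp(-out))

y = multi_channel_conv(x, w)
print(w.size)   # 32 = 4 * 2 * 2 * 2 parameters
print(y.shape)  # (2, 4, 4): two output channels
```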

### 2.2.4 Down-pooling

After features are obtained by convolution, the next step is to use them for classification. In theory, all the extracted features could be used to train a classifier, such as a softmax classifier, but this poses a computational challenge. For example, for a 96×96 pixel image, suppose we have learned 400 features defined on 8×8 inputs; convolving each feature with the image yields a (96 − 8 + 1) × (96 − 8 + 1) = 7921-dimensional convolution feature. Since there are 400 features, each example yields a 7921 × 400 = 3,168,400-dimensional convolution feature vector. It is impractical to train a classifier on more than 3 million input features, and doing so would easily lead to overfitting.
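The dimension count in this example is straightforward to verify:

```python
# Feature dimensions from the example in the text.
image_size, patch_size, n_features = 96, 8, 400

conv_dim = (image_size - patch_size + 1) ** 2  # size of one feature map
total = conv_dim * n_features                  # full feature vector length

print(conv_dim)  # 7921
print(total)     # 3168400
```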

To solve this problem, recall why convolution features work in the first place: images have a "static" (stationary) property, meaning that a feature useful in one region of an image is likely to be equally useful in another region. Therefore, to describe a large image, a natural idea is to aggregate statistics over the features at different locations; for example, one can compute the average (or maximum) value of a particular feature over a region of the image. These summary statistics not only have a much lower dimension (compared with using all the extracted features) but also improve the results (they are less prone to overfitting). This aggregation operation is called pooling.

There are two types of subsampling: average subsampling (mean pooling) and maximum subsampling (max pooling), as shown in the following figure:

(1) In average subsampling (mean pooling), each weight of the convolution kernel is 0.25, and the kernel slides over the input X with a stride of 2. The effect of mean pooling is to shrink the original image to 1/4 of its size.

(2) In maximum subsampling (max pooling), only one weight of the convolution kernel is 1 and the others are 0; the position of the 1 corresponds to wherever the value of the input X is largest within the region covered by the kernel. The kernel slides over the input X with a stride of 2. The effect of max pooling is to shrink the original image to 1/4 of its size while retaining the strongest response in each 2×2 region.
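Both pooling variants can be sketched with a single reshape trick (the 4×4 input below is an illustrative example):

```python
import numpy as np

def pool2x2(x, mode="max"):
    """Non-overlapping 2x2 pooling with stride 2; shrinks the map to 1/4."""
    h, w = x.shape
    # Group the array into 2x2 blocks, then reduce each block.
    x = x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return x.max(axis=(1, 3))
    return x.mean(axis=(1, 3))  # mean pooling: each weight is effectively 0.25

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 8., 3., 2.],
              [7., 6., 1., 0.]])
print(pool2x2(x, "max"))   # [[4. 8.] [9. 3.]]
print(pool2x2(x, "mean"))  # [[2.5 6.5] [7.5 1.5]]
```

Each output value summarizes one 2×2 block of the input, so the 4×4 map becomes 2×2, one quarter of the original area.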

The above is the basic structure and principle of convolutional neural network.

### 2.2.5 Multi-convolutional layer

In practice, multiple convolution layers are used, followed by fully connected layers for training. The features learned by a single convolution layer are often local; the deeper the layer, the more global the learned features become.

Translated from: https://blog.csdn.net/qq_45360887/article/details/94737562