The Basic Structure of a Convolutional Network

In convolutional neural networks, the states in each layer are arranged according to a spatial grid structure. These spatial relationships are inherited from one layer to the next because each feature value is based on a small local spatial region in the previous layer. It is important to maintain these spatial relationships among the grid cells, because the convolution operation and the transformation to the next layer are critically dependent on them.

Each layer in the convolutional network is a 3-dimensional grid structure, which has a height, width, and depth. The depth of a layer in a convolutional neural network should not be confused with the depth of the network itself. The word “depth” (when used in the context of a single layer) refers to the number of channels in each layer, such as the number of primary color channels (e.g., red, green, and blue) in the input image or the number of feature maps in the hidden layers. The use of the word “depth” to refer to both the number of feature maps in each layer and the number of layers is an unfortunate overloading of terminology in convolutional networks, but we will use the term carefully so that its meaning is clear from the context.

The convolutional neural network functions much like a traditional feed-forward neural network, except that the operations in its layers are spatially organized with sparse (and carefully designed) connections between layers. The three types of layers that are commonly present in a convolutional neural network are convolution, pooling, and ReLU. The ReLU activation is no different from that used in a traditional neural network. In addition, a final set of layers is often fully connected and maps in an application-specific way to a set of output nodes. In the following, we will describe each of the different types of operations and layers, and the typical way in which these layers are interleaved in a convolutional neural network.

The input data to the convolutional neural network is organized into a 2-dimensional grid structure, and the values of the individual grid points are referred to as pixels. Each pixel, therefore, corresponds to a spatial location within the image. However, in order to encode the precise color of the pixel, we need a multidimensional array of values at each grid location. In the RGB color scheme, we have one intensity value for each of the three primary colors: red, green, and blue. Therefore, if the spatial dimensions of an image are 32×32 pixels and the depth is 3 (corresponding to the RGB color channels), then the image is represented by 32 × 32 × 3 values overall. This particular image size is quite common, and also occurs in a popular benchmarking data set known as CIFAR-10.

An example of this organization is shown in Figure (a) below. It is natural to represent the input layer in this 3-dimensional structure because two dimensions are devoted to spatial relationships and a third dimension is devoted to the independent properties along these channels. For example, the intensities of the primary colors are the independent properties in the first layer. In the hidden layers, these independent properties correspond to various types of shapes extracted from local regions of the image. For the purpose of discussion, assume that the input in the $q$th layer is of size $L_q×B_q×d_q$. Here, $L_q$ refers to the height (or length), $B_q$ refers to the width (or breadth), and $d_q$ is the depth. In almost all image-centric applications, the values of $L_q$ and $B_q$ are the same. However, we will work with separate notations for height and width in order to retain generality in presentation.


For the first (input) layer, these values are decided by the nature of the input data and its preprocessing. In the above example, the values are $L_1 = 32$, $B_1 = 32$, and $d_1 = 3$. Later layers have exactly the same 3-dimensional organization, except that each of the $d_q$ 2-dimensional grids of values for a particular input can no longer be considered a grid of raw pixels. Furthermore, the value of $d_q$ is much larger than three for the hidden layers because the number of independent properties of a given local region that are relevant to classification can be quite significant. For $q > 1$, these grids of values are referred to as feature maps or activation maps. These values are analogous to the values in the hidden layers of a feed-forward network.
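As a concrete illustration, the following minimal sketch (in NumPy, with random placeholder values standing in for real pixel intensities) shows how a single CIFAR-10-sized input is laid out as a 32 × 32 × 3 volume:

import numpy as np

# A hypothetical 32 x 32 RGB input image: height L1 = 32, width B1 = 32, depth d1 = 3.
# The entries are random placeholders standing in for pixel intensities in [0, 1].
L1, B1, d1 = 32, 32, 3
image = np.random.rand(L1, B1, d1)

print(image.shape)        # (32, 32, 3)
print(image[10, 20, :])   # red, green, and blue intensities at spatial location (10, 20)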

In the convolutional neural network, the parameters are organized into sets of 3-dimensional structural units, known as filters or kernels. The filter is usually square in terms of its spatial dimensions, which are typically much smaller than those of the layer the filter is applied to. On the other hand, the depth of a filter is always the same as that of the layer to which it is applied. Assume that the dimensions of the filter in the $q$th layer are $F_q × F_q × d_q$. An example of a filter with $F_1 = 5$ and $d_1 = 3$ is shown in Figure (a) above. It is common for the value of $F_q$ to be small and odd. Examples of commonly used values of $F_q$ are 3 and 5, although there are some interesting cases in which it is possible to use $F_q = 1$.

The convolution operation places the filter at each possible position in the image (or hidden layer) so that the filter fully overlaps with the image, and performs a dot product between the $F_q × F_q × d_q$ parameters in the filter and the matching grid in the input volume (of the same size $F_q × F_q × d_q$). The dot product is performed by treating the entries in the relevant 3-dimensional region of the input volume and the filter as vectors of size $F_q × F_q × d_q$, so that the elements in both vectors are ordered based on their corresponding positions in the grid-structured volume. How many possible positions are there for placing the filter? This question is important, because each such position defines a spatial “pixel” (or, more accurately, a feature) in the next layer. In other words, the number of alignments between the filter and image defines the spatial height and width of the next hidden layer. The relative spatial positions of the features in the next layer are defined based on the relative positions of the upper-left corners of the corresponding spatial grids in the previous layer. When performing convolutions in the $q$th layer, one can align the filter at $L_{q+1} = (L_q − F_q + 1)$ positions along the height and $B_{q+1} = (B_q − F_q + 1)$ positions along the width of the image (without having a portion of the filter “sticking out” from the borders of the image). This results in a total of $L_{q+1} × B_{q+1}$ possible dot products, which defines the size of the next hidden layer. In the previous example, the values of $L_2$ and $B_2$ are therefore defined as follows:

$L_2 = 32 − 5 + 1 = 28$

$B_2 = 32 − 5 + 1 = 28$
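To make the counting above concrete, the following minimal sketch (in NumPy, with randomly generated placeholder values) computes the number of valid filter placements and performs the dot product at a single position, exactly as described:

import numpy as np

# Hypothetical first layer (32 x 32 x 3) and a single 5 x 5 x 3 filter with random values.
L_q, B_q, d_q, F_q = 32, 32, 3, 5
layer = np.random.rand(L_q, B_q, d_q)
filt = np.random.rand(F_q, F_q, d_q)

# Number of valid placements: L_{q+1} = L_q - F_q + 1 and B_{q+1} = B_q - F_q + 1.
L_next = L_q - F_q + 1   # 28
B_next = B_q - F_q + 1   # 28

# The feature at spatial position (i, j) of the next layer is a dot product between
# the 5 x 5 x 3 filter and the matching 5 x 5 x 3 region of the input volume.
i, j = 0, 0
window = layer[i:i + F_q, j:j + F_q, :]            # 75 values
feature_ij = np.dot(window.ravel(), filt.ravel())  # a single scalar feature
print(L_next, B_next, feature_ij)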

The next hidden layer of size 28 × 28 is shown in Figure (a) above. However, this hidden layer also has a depth of size $d_2 = 5$. Where does this depth come from? It is achieved by using 5 different filters, each with its own independent set of parameters. Each of these 5 sets of spatially arranged features obtained from the output of a single filter is referred to as a feature map. Clearly, an increased number of feature maps is the result of a larger number of filters (i.e., a larger parameter footprint, which is $F_q^2 \cdot d_q \cdot d_{q+1}$ for the $q$th layer). The number of filters used in each layer controls the capacity of the model because it directly controls the number of parameters. Furthermore, increasing the number of filters in a particular layer increases the number of feature maps (i.e., the depth) of the next layer. It is possible for different layers to have very different numbers of feature maps, depending on the number of filters we use for the convolution operation in the previous layer. For example, the input layer typically has only three color channels, but it is possible for each of the later hidden layers to have depths (i.e., numbers of feature maps) of more than 500. The idea here is that each filter tries to identify a particular type of spatial pattern in a small rectangular region of the image, and therefore a large number of filters is required to capture a broad variety of the possible shapes that are combined to create the final image (unlike the case of the input layer, in which three RGB channels are sufficient). Typically, the later layers tend to have a smaller spatial footprint, but greater depth in terms of the number of feature maps.

For example, the filter shown in Figure (b) above represents a horizontal edge detector on a grayscale image with one channel. As shown in Figure (b), the resulting feature will have high activation at each position where a horizontal edge is seen. A perfectly vertical edge will give zero activation, whereas a slanted edge might give intermediate activation. Therefore, sliding the filter everywhere in the image will already detect several key outlines of the image in a single feature map of the output volume. Multiple filters are used to create an output volume with more than one feature map. For example, a different filter might create a spatial feature map of vertical edge activations.
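As a concrete sketch of how multiple filters create the depth of the next layer (using random placeholder values rather than trained edge detectors), five independent 5 × 5 × 3 filters applied to a 32 × 32 × 3 input produce a 28 × 28 × 5 output volume, with a parameter footprint of $F_q^2 \cdot d_q \cdot d_{q+1} = 5 \cdot 5 \cdot 3 \cdot 5 = 375$:

import numpy as np

L_q, B_q, d_q = 32, 32, 3     # input layer dimensions
F_q, d_next = 5, 5            # filter size and number of filters
L_next, B_next = L_q - F_q + 1, B_q - F_q + 1   # 28 x 28

layer = np.random.rand(L_q, B_q, d_q)
filters = np.random.rand(d_next, F_q, F_q, d_q)  # five independent 5 x 5 x 3 filters
print(filters.size)                              # 375 parameters

# Each filter produces one 28 x 28 feature map; stacking the maps gives depth 5.
next_layer = np.zeros((L_next, B_next, d_next))
for p in range(d_next):
    for i in range(L_next):
        for j in range(B_next):
            window = layer[i:i + F_q, j:j + F_q, :]
            next_layer[i, j, p] = np.sum(window * filters[p])

print(next_layer.shape)   # (28, 28, 5)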

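This operation can be written compactly. If $h^{(q)}_{ijk}$ denotes the value at spatial position $(i, j)$ and channel $k$ of the $q$th layer, and $w^{(p,q)}_{rsk}$ denotes the parameters of the $p$th filter applied to the $q$th layer (a notation introduced here only for reference), then the convolution computes

$h^{(q+1)}_{ijp} = \sum_{r=1}^{F_q} \sum_{s=1}^{F_q} \sum_{k=1}^{d_q} w^{(p,q)}_{rsk} \, h^{(q)}_{i+r-1,\, j+s-1,\, k}, \quad i \in \{1, \ldots, L_{q+1}\}, \; j \in \{1, \ldots, B_{q+1}\}, \; p \in \{1, \ldots, d_{q+1}\}.$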
The expression above seems notationally complex, although the underlying convolution operation is really a simple dot product over the entire volume of the filter, which is repeated over all valid spatial positions $(i, j)$ and filters (indexed by $p$). It is intuitively helpful to understand a convolution operation by placing the filter at each of the 28×28 possible spatial positions in the first layer of Figure (a) and performing a dot product between the vector of 5×5×3=75 values in the filter and the corresponding 75 values in the first-layer volume $H^{(1)}$. Even though the size of the input layer in Figure (a) is 32×32, there are only (32−5+1)×(32−5+1) possible spatial alignments between an input volume of size 32×32 and a filter of size 5×5.


The convolution operation brings to mind the experiments of Hubel and Wiesel, in which the activations in small regions of the visual field were found to activate particular neurons. In the case of convolutional neural networks, this visual field is defined by the filter, which is applied to all locations of the image in order to detect the presence of a shape at each spatial location. Furthermore, the filters in earlier layers tend to detect more primitive shapes, whereas the filters in later layers create more complex compositions of these primitive shapes. This is not particularly surprising because most deep neural networks are good at hierarchical feature engineering.

One property of convolution is that it shows equivariance to translation. In other words, if we shift the pixel values in the input in any direction by one unit and then apply convolution, the corresponding feature values shift along with the input values. This is because of the shared parameters of the filter across the entire convolution. The reason for sharing parameters across the entire convolution is that the presence of a particular shape in any part of the image should be processed in the same way, irrespective of its specific spatial location.
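This equivariance can be checked directly; the following sketch (using SciPy's correlate2d, which performs the dot-product operation described above, on a random single-channel input) shifts the input down by one pixel and verifies that the output features shift correspondingly, away from the borders:

import numpy as np
from scipy.signal import correlate2d

x = np.random.rand(7, 7)   # a hypothetical 7 x 7 single-channel input
w = np.random.rand(3, 3)   # a 3 x 3 filter

out = correlate2d(x, w, mode='valid')               # 5 x 5 feature map
x_shifted = np.roll(x, 1, axis=0)                   # shift the input down by one pixel
out_shifted = correlate2d(x_shifted, w, mode='valid')

# Away from the border (where the wrapped-around row has no influence),
# the output features shift by exactly the same amount as the input.
print(np.allclose(out_shifted[1:, :], out[:-1, :]))   # True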

In the figure above, we have shown an example of an input layer and a filter with depth 1 for simplicity (which does occur in the case of grayscale images with a single color channel). Note that the depth of a layer must exactly match that of its filter/kernel, and the contributions of the dot products over all the feature maps in the corresponding grid region of a particular layer need to be added (in the general case) to create a single output feature value in the next layer. The figure depicts two specific examples of the convolution operation with a layer of size 7×7×1 and a 3×3×1 filter in the bottom row. Furthermore, the entire feature map of the next layer is shown on its upper right-hand side. Examples of two convolution operations are shown, in which the outputs are 16 and 26, respectively. These values are arrived at by using the following multiplication and aggregation operations:

$5 × 1 + 8 × 1 + 1 × 1 + 1 × 2 = 16$

$4 × 1 + 4 × 1 + 4 × 1 + 7 × 2 = 26$

The multiplications with zeros have been omitted in the above aggregation. In the event that the depths of the layer and its corresponding filter are greater than 1, the above operations are performed for each spatial map and then aggregated across the entire depth of the filter.
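Although the exact pixel values of the figure are not reproduced here, the per-position multiply-and-aggregate operation can be sketched as an explicit loop over a hypothetical 7 × 7 input and a 3 × 3 filter (the values below are placeholders, not those of the figure; filter entries equal to zero contribute nothing to a position's sum, just as in the aggregations above):

import numpy as np

# Hypothetical 7 x 7 single-channel layer and 3 x 3 filter (placeholder values).
layer = np.random.randint(0, 10, size=(7, 7)).astype(float)
filt = np.array([[1.0, 0.0, 1.0],
                 [0.0, 2.0, 0.0],
                 [1.0, 0.0, 1.0]])

# Valid convolution: (7 - 3 + 1) x (7 - 3 + 1) = 5 x 5 output feature map.
out = np.zeros((5, 5))
for i in range(5):
    for j in range(5):
        # Multiply the filter entry-wise with the matching 3 x 3 region and sum.
        out[i, j] = np.sum(layer[i:i + 3, j:j + 3] * filt)

print(out.shape)   # (5, 5)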

A convolution in the $q$th layer increases the receptive field of a feature from the $q$th layer to the $(q+1)$th layer. In other words, each feature in the next layer captures a larger spatial region in the input layer. For example, when using a 3 × 3 filter convolution successively in three layers, the activations in the first, second, and third hidden layers capture pixel regions of size 3×3, 5×5, and 7×7, respectively, in the original input image. As we will see later, other types of operations increase the receptive fields further, because they reduce the size of the spatial footprint of the layers. This is a natural consequence of the fact that features in later layers capture complex characteristics of the image over larger spatial regions, and then combine the simpler features from earlier layers.
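The growth of the receptive field under repeated unit-stride 3 × 3 convolutions (without pooling) can be verified with a small sketch, since each such layer adds F − 1 pixels to the receptive field:

# Receptive field (in input pixels) after stacking unit-stride 3 x 3 convolutions.
F = 3
receptive_field = 1
for layer_index in range(1, 4):
    receptive_field += F - 1
    print(layer_index, receptive_field)   # prints 1 3, then 2 5, then 3 7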

When performing the operations from the $q$th layer to the $(q + 1)$th layer, the depth $d_{q+1}$ of the computed layer depends on the number of filters in the $q$th layer, and it is independent of the depth of the $q$th layer or any of its other dimensions. In other words, the depth $d_{q+1}$ of the $(q + 1)$th layer is always equal to the number of filters in the $q$th layer. For example, the depth of the second layer in Figure (a) above is 5, because a total of five filters are used in the first layer for the transformation. However, in order to perform the convolutions in the second layer (to create the third layer), one must now use filters of depth 5 in order to match the new depth of this layer, even though filters of depth 3 were used in the convolutions of the first layer (to create the second layer).
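The depth bookkeeping can be summarized with a small shapes-only sketch (the filter counts below are hypothetical): the depth of a filter must match the depth of the layer it is applied to, and the number of filters determines the depth of the next layer:

# Shapes only: (height, width, depth) for layers and (F, F, depth, number of filters) for filter banks.
layer1_shape = (32, 32, 3)       # input layer: depth 3 (RGB)
filters1_shape = (5, 5, 3, 5)    # five 5 x 5 filters of depth 3 -> next layer has depth 5
layer2_shape = (28, 28, 5)

filters2_shape = (3, 3, 5, 10)   # hypothetical: ten 3 x 3 filters of depth 5 for the second convolution
layer3_shape = (26, 26, 10)      # 28 - 3 + 1 = 26, and depth equals the number of filters used

# The filter depth always matches the depth of the layer to which it is applied.
assert filters1_shape[2] == layer1_shape[2]
assert filters2_shape[2] == layer2_shape[2]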
