Posts

Parameter Sharing and Parameter Tying

A natural form of regularization that reduces the parameter footprint of the model is the sharing of parameters across different connections. Often, this type of parameter sharing is enabled by domain-specific insights. The main insight required to share parameters is that the functions computed at two nodes should be related in some way. This type of insight can be obtained when one has a good idea of how a particular computational node relates to the input data. Examples of such parameter-sharing methods are as follows:

1. Sharing weights in autoencoders: The symmetric weights in the encoder and decoder portions of the autoencoder are often shared. Although an autoencoder will work whether or not the weights are shared, doing so improves the regularization properties of the algorithm. In a single-layer autoencoder with linear activation, weight sharing forces orthogonality among the different hidden components of the weight matrix. This provides the same reduction as singular value ...
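A minimal sketch of the first example (toy dimensions assumed, not code from the text): in a single-layer linear autoencoder with tied weights, the decoder reuses the transpose of the encoder matrix $W$ rather than learning a second, independent parameter matrix.

```python
import numpy as np

# Single-layer linear autoencoder with tied (shared) weights.
rng = np.random.default_rng(0)
d, k = 10, 3                              # input and hidden dimensions (assumed)
W = 0.1 * rng.standard_normal((k, d))     # the only parameter matrix

X = rng.standard_normal((100, d))         # toy data
H = X @ W.T                               # encoder: R^d -> R^k
X_hat = H @ W                             # decoder: R^k -> R^d, reuses W
loss = np.mean((X - X_hat) ** 2)          # reconstruction error to be minimized
print(loss)
```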

The Bias-Variance Trade-Off

The bias-variance trade-off states that the squared error of a learning algorithm can be partitioned into three components:

1. Bias: The bias is the error caused by the simplifying assumptions in the model, which causes certain test instances to have consistent errors across different choices of training data sets. Even if the model has access to an infinite source of training data, the bias cannot be removed. For example, in the case of Figure 4.2, the linear model has a higher model bias than the polynomial model, because it can never fit the (slightly curved) data distribution exactly, no matter how much data is available. The prediction of a particular out-of-sample test instance at $x = 2$ will always have an error in a particular direction when using a linear model for any choice of training sample. If we assume that the linear and curved lines in the top left of Figure 4.2 were estimated using an infinite amount of data, then the difference between the two at any particular values...
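As a quick sketch of the decomposition in standard notation, with $f$ the true function, $\hat{f}$ the model learned from a randomly drawn training set, and $\sigma^2$ the irreducible noise in the targets, the expected squared error at a test point $x$ splits as

$$E\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}},$$

where the expectations are taken over the choice of training data set.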

Dropout

Parametric Model Selection and Averaging

One challenge in the case of neural network construction is the selection of a large number of hyperparameters like the depth of the network and the number of neurons in each layer. Furthermore, the choice of the activation function also has an effect on performance, depending on the application at hand. The presence of a large number of parameters creates problems in model construction, because the performance might be sensitive to the particular configuration used. One possibility is to hold out a portion of the training data and try different combinations of parameters and model choices. The selection that provides the highest accuracy on the held-out portion of the training data is then used for prediction. This is, of course, the standard approach used for parameter tuning in all machine learning models, and is also referred to as model selection. In a sense, model selection is inherently an ensemble-centric app...
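A minimal sketch of this hold-out procedure (the learner and metric below are hypothetical placeholders, not from the text):

```python
import numpy as np

def select_model(X, y, configs, train_model, accuracy, holdout_frac=0.25, seed=0):
    """Pick the hyperparameter configuration with the best held-out accuracy.

    `train_model(X, y, **cfg)` and `accuracy(model, X, y)` are hypothetical
    stand-ins for whatever learner and evaluation metric are being tuned.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_hold = int(holdout_frac * len(X))
    hold, train = idx[:n_hold], idx[n_hold:]

    best_cfg, best_acc = None, -np.inf
    for cfg in configs:                          # e.g. depth, width, activation
        model = train_model(X[train], y[train], **cfg)
        acc = accuracy(model, X[hold], y[hold])
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc
```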

Convolution Operation

In its most general form, convolution is an operation on two functions of a real-valued argument. To motivate the definition of convolution, we start with examples of two functions we might use. Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output $x(t)$, the position of the spaceship at time $t$. Both $x$ and $t$ are real-valued, i.e., we can get a different reading from the laser sensor at any instant in time. Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship’s position, we would like to average together several measurements. Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements. We can do this with a weighting function $w(a)$, where $a$ is the age of a measurement. If we apply such a weighted average operation at every moment, we obtain a new function providing a smoothed estimate ...
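A minimal sketch of the discrete analogue of this idea (the sensor readings and the shape of $w$ are assumed for illustration): the weighted average is applied at every time step via a discrete convolution.

```python
import numpy as np

# Noisy position readings x(t), sampled at discrete time steps (assumed signal).
rng = np.random.default_rng(0)
t = np.arange(200)
x = np.sin(t / 20.0) + 0.3 * rng.standard_normal(t.size)

# Weighting function w(a): more weight on recent measurements (a = age).
w = np.exp(-np.arange(10) / 3.0)
w /= w.sum()                         # normalize so the result is a weighted average

# Discrete convolution: s[t] = sum_a x[t - a] * w[a], a smoothed position estimate.
s = np.convolve(x, w, mode="valid")
print(s[:5])
```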

Introduction to CNN

Convolutional networks (LeCun, 1989), also known as convolutional neural networks or CNNs, are a specialized kind of neural network for processing data that has a known, grid-like topology. Examples include time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image data, which can be thought of as a 2D grid of pixels. Convolutional networks have been tremendously successful in practical applications. The name “convolutional neural network” indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers. The vast majority of applications of convolutional neural networks focus on image data, although one can also use these networks for all types of temporal, spatial, and spatiotemporal data. An important property of image...
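The claim that convolution is a specialized linear operation can be made concrete with a small sketch (input and kernel values assumed): a 1D convolution is the same as multiplication by a sparse matrix whose rows are shifted copies of one small kernel.

```python
import numpy as np

# A 1D convolution written two ways: as a sparse, weight-sharing matrix
# multiplication, and via np.convolve.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.25, 0.5, 0.25])

n_out = x.size - w.size + 1
A = np.zeros((n_out, x.size))
for i in range(n_out):
    A[i, i:i + w.size] = w[::-1]          # each row reuses the same (flipped) kernel

print(A @ x)                              # general matrix-multiplication view
print(np.convolve(x, w, mode="valid"))    # same numbers via convolution
```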

Motivation

Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing, and equivariant representations. Moreover, convolution provides a means for working with inputs of variable size. We now describe each of these ideas in turn. Traditional neural network layers use matrix multiplication by a matrix of parameters with a separate parameter describing the interaction between each input unit and each output unit. This means every output unit interacts with every input unit. Convolutional networks, however, typically have sparse interactions (also referred to as sparse connectivity or sparse weights). This is accomplished by making the kernel smaller than the input. For example, when processing an image, the input image might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means that we need to store fewer parameter...
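A back-of-the-envelope sketch (sizes assumed for illustration) makes the savings from sparse interactions and parameter sharing concrete:

```python
# A dense layer mapping a 1000x1000 image to a same-sized output stores one
# weight per input-output pair; a small shared kernel stores only a handful.
n_in = n_out = 1000 * 1000        # number of input pixels / output units (assumed)
dense_params = n_in * n_out       # every output interacts with every input
kernel_params = 5 * 5             # one small edge-detecting kernel, reused everywhere

print(f"fully connected: {dense_params:,} weights")   # 1,000,000,000,000
print(f"convolutional:   {kernel_params:,} weights")  # 25
```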

Padding

One observation is that the convolution operation reduces the size of the $(q + 1)$th layer in comparison with the size of the $q$th layer. This type of reduction in size is not desirable in general, because it tends to lose some information along the borders of the image (or of the feature map, in the case of hidden layers). This problem can be resolved by using padding. In padding, one adds $(F_q − 1)/2$ “pixels” all around the borders of the feature map in order to maintain the spatial footprint. Note that these pixels are really feature values in the case of padding hidden layers. The value of each of these padded feature values is set to 0, irrespective of whether the input or the hidden layers are being padded. As a result, the spatial height and width of the input volume will both increase by $(F_q − 1)$, which is exactly what they reduce by (in the output volume) after the convolution is performed. The padded portions do ...
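A minimal sketch of this arithmetic (a hypothetical $32 \times 32$ feature map and filter size $F_q = 5$ are assumed) shows the spatial footprint being preserved:

```python
import numpy as np

# Pad (F_q - 1)/2 zero "pixels" on every border, then apply a valid convolution.
L_q = B_q = 32                       # height and width of the q-th layer (assumed)
F_q = 5                              # filter size (assumed)
pad = (F_q - 1) // 2                 # 2 pixels of zero-padding on each side

feature_map = np.random.randn(L_q, B_q)
padded = np.pad(feature_map, pad_width=pad, mode="constant", constant_values=0.0)

out_h = padded.shape[0] - F_q + 1    # size after convolving the padded map
out_w = padded.shape[1] - F_q + 1
print(padded.shape)                  # (36, 36): grew by F_q - 1 = 4
print(out_h, out_w)                  # 32 32: spatial footprint maintained
```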