Teaching Deep Learners to Generalize
Neural networks are powerful learners that have repeatedly proven to be capable of learning
complex functions in many domains. However, this power is also their greatest weakness: neural networks readily overfit the training data unless the learning process is designed carefully.
In practical terms, overfitting means that a neural network provides excellent prediction performance on the training data it is built on, but performs poorly on unseen test instances. This happens because the learning process often remembers random artifacts of the training data that do not generalize well to the test data. Extreme forms of overfitting are referred to as memorization.
A helpful analogy is to think of a child who can solve all the analytical problems for which
he or she has seen the solutions, but is unable to provide useful solutions to a new problem.
However, if the child is exposed to the solutions of more and more different types of problems,
he or she will be more likely to solve a new problem by abstracting out the essence of the
patterns that are repeated across different problems and their solutions. Machine learning
proceeds in a similar way by identifying patterns that are useful for prediction. For example,
in a spam detection application, if the pattern “Free Money!!” occurs thousands of times
in spam emails, the machine learner generalizes this rule to identify spam email instances
it has not seen before. On the other hand, a prediction that is based on the patterns seen
in a tiny training data set of two emails will lead to good performance on those emails but
not on new emails. The ability of a learner to provide useful predictions for instances it has
not seen before is referred to as generalization.
Generalization is a useful practical property, and is therefore the holy grail in all machine learning applications. After all, if the training examples are already labeled, there is little practical use in predicting their labels again. For example, in an image-captioning application, one is always looking to use the labeled images in order to learn captions for images that the learner has not seen before.
The level of overfitting depends both on the complexity of the model and on the amount of data available. The complexity of the model defined by a neural network depends on the number of underlying parameters. Parameters provide additional degrees of freedom, which can be used to explain specific training data points without generalizing well to unseen points. For example, imagine a situation in which we attempt to predict the variable $y$ from $x$ using the following formula for polynomial regression:
$\hat{y}=\sum_{i=0}^d w_ix^i$
This is a model that uses $(d + 1)$ parameters $w_0 \ldots w_d$ in order to explain the pairs $(x, y)$ available to us. One could implement this model using a neural network with $d$ inputs corresponding to $x^1, x^2 \ldots x^d$, and a single bias neuron whose coefficient is $w_0$. The loss function uses the squared difference between the observed value $y$ and the predicted value $\hat{y}$. In general, larger values of $d$ allow the model to capture more nonlinearity. For example, in the case of the figure below, a nonlinear model with $d = 4$ should be able to fit the data better than a linear model with $d = 1$, given an infinite amount (or a lot) of data. However, when working with a small, finite data set, this does not always turn out to be the case.
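As an illustration, here is a minimal numpy sketch of this setup (the helper names are ours, chosen for illustration, not part of any library): the degree-$d$ model is just a single linear neuron acting on the derived inputs $x^1, \ldots, x^d$ together with a bias coefficient $w_0$, fitted by minimizing the squared loss.

```python
import numpy as np

def poly_features(x, d):
    """Map each scalar x to the feature vector [x^0, x^1, ..., x^d]."""
    x = np.asarray(x, dtype=float)
    return np.stack([x ** i for i in range(d + 1)], axis=1)

def fit_polynomial(x, y, d):
    """Choose w_0..w_d to minimize the squared difference between y and y_hat.

    The model y_hat = sum_i w_i * x^i is exactly the computation performed by a
    single linear neuron with inputs x^1, ..., x^d and a bias coefficient w_0.
    """
    X = poly_features(x, d)                                  # shape (n, d + 1)
    w, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return w

def predict(w, x):
    """Evaluate y_hat = sum_i w_i * x^i at the given array of x values."""
    return poly_features(x, len(w) - 1) @ w
```

For example, `fit_polynomial(x, y, d=4)` returns the five coefficients of a degree-4 fit, while `d=1` gives the linear model discussed above.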
If we have $(d+1)$ or fewer training pairs $(x, y)$, it is possible to fit the data exactly with zero error, irrespective of how well these training pairs reflect the true distribution. For example, consider a situation in which we have five training points available. One can show that it is possible to fit the training points exactly with zero error using a polynomial of degree 4. This does not, however, mean that zero error will be achieved on unseen test data.

An example of this situation is illustrated in the figure below, where both the linear and the polynomial models are fitted to three sets of five randomly chosen data points. It is clear that the linear model is stable, although it is unable to exactly model the curved nature of the true data distribution. On the other hand, even though the polynomial model is capable of modeling the true data distribution more closely, it varies wildly over the different training data sets. Therefore, the same test instance at $x = 2$ (shown in the figure) would receive similar predictions from the linear model, but would receive very different predictions from the polynomial model over different choices of training data sets. This behavior of the polynomial model is undesirable to a practitioner, who would expect similar predictions for a particular test instance even when different samples of the training data set are used. Since all the different predictions of the polynomial model cannot be correct, it is evident that the increased power of the polynomial model over the linear model actually increases the error rather than reducing it. This difference in predictions for the same test instance (but different training data sets) is manifested as the variance of a model. As evident from the figure, models with high variance tend to memorize random artifacts of the training data, causing inconsistency and inaccuracy in the prediction of unseen test instances. It is noteworthy that a polynomial model of higher degree is inherently more powerful than a linear model, because the higher-order coefficients could always be set to 0; however, it is unable to achieve its full potential when the amount of data is limited. Simply speaking, the variance inherent in the finiteness of the data set causes increased complexity to become counterproductive. This trade-off between the power of a model and its performance on limited data is captured by the bias-variance trade-off.
This kind of overfitting typically shows up in two ways:
1. When a model is trained on different data sets, the same test instance might obtain very different predictions. This is a sign that the training process is memorizing the nuances of the specific training data set, rather than learning patterns that generalize to unseen test instances. Note that the three predictions at $x = 2$ in the figure are quite different for the polynomial model. This is not the case for the linear model.
2. The gap between the error on training instances and on unseen test instances is rather large. Note that in the figure, the predictions at the unseen test point $x = 2$ are often more inaccurate for the polynomial model than for the linear model. On the other hand, the training error is always zero for the polynomial model, whereas it is always nonzero for the linear model.
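The variance effect described above can be reproduced with a few lines of code. The sketch below is illustrative only: the "true" distribution is a hypothetical noisy quadratic (the data behind the figure is not given), and we fit a linear and a degree-4 model to several random samples of five points, then compare their predictions at $x = 2$. The degree-4 predictions typically swing far more between trials, even though the degree-4 fit has zero training error each time.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_five_points():
    """Draw five (x, y) pairs from a hypothetical noisy nonlinear distribution
    (a stand-in for the true distribution in the figure, which is not given)."""
    x = rng.uniform(-1.0, 3.0, size=5)
    y = x ** 2 - 2.0 * x + rng.normal(scale=0.3, size=5)
    return x, y

for trial in range(3):
    x, y = sample_five_points()
    w_linear = np.polyfit(x, y, deg=1)   # d = 1: stable but biased
    w_poly = np.polyfit(x, y, deg=4)     # d = 4: interpolates the 5 points exactly
    print(f"trial {trial}: "
          f"linear prediction at x=2: {np.polyval(w_linear, 2.0):+.2f}   "
          f"degree-4 prediction at x=2: {np.polyval(w_poly, 2.0):+.2f}")
```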
Because of the large gaps between training and test error, models are evaluated on data that was not used to train them. A portion of the available labeled data is therefore held out early on and used to make different types of algorithmic decisions, such as parameter tuning; this set of points is referred to as the validation set. The final accuracy is measured on a fully out-of-sample set of points that was used neither for model building nor for parameter tuning. The error on this out-of-sample test data is also referred to as the generalization error.
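A minimal sketch of such a split, assuming the labeled data is held in numpy arrays `X` and `y` (names chosen here for illustration):

```python
import numpy as np

def train_validation_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Hold out a validation set (for tuning decisions) and a test set
    (touched only once, to estimate the generalization error)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_frac * len(X))
    n_val = int(val_frac * len(X))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))
```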
Neural networks are large models that might have millions of parameters in complex applications, which makes overfitting a constant concern. In spite of this, there are a number of tricks that one can use in order to ensure that overfitting is not a problem. The choice of method depends on the specific setting and the type of neural network used. The key methods for avoiding overfitting in a neural network are as follows:
1. Penalty-based regularization: Penalty-based regularization is the most common technique used by neural networks in order to avoid overfitting. The idea in regularization is to impose a penalty or other types of constraints on the parameters in order to favor simpler models. For example, in the case of polynomial regression, a possible constraint on the parameters would be to ensure that at most $k$ different values of $w_i$ are non-zero, which forces a simpler model. However, since it is hard to impose such constraints explicitly, a simpler approach is to add a softer penalty like $\lambda\sum_{i=0}^d w_i^2$ to the loss function. Such an approach roughly amounts to multiplying each parameter $w_i$ by a multiplicative decay factor of $(1 - \alpha\lambda)$ before each update at learning rate $\alpha$ (a minimal sketch of this update appears after this list). Aside from penalizing the parameters of the network, one can also choose to penalize the activations of hidden units. This approach often leads to sparse hidden representations.
2. Generic and tailored ensemble methods: Many ensemble methods are not specific to neural networks, but can be used for other machine learning problems as well. We will discuss bagging and subsampling, which are two of the simplest ensemble methods that can be implemented for virtually any model or learning problem. These methods are inherited from traditional machine learning. There are also several ensemble methods that are specifically designed for neural networks. A straightforward approach is to average the predictions of different neural architectures obtained by quick-and-dirty hyper-parameter optimization. Dropout is another ensemble technique that is designed for neural networks. This technique uses the selective dropping of nodes to create different neural networks, and the predictions of the different networks are combined to create the final result (a sketch of the dropout mask appears after this list). Dropout reduces overfitting by indirectly acting as a regularizer.
3. Early stopping: In early stopping, the iterative optimization method is terminated early, without converging to the optimal solution on the training data. The stopping point is determined using a portion of the training data that is not used for model building; one terminates when the error on this held-out data begins to rise (a sketch of such a loop appears after this list). Even though the resulting solution is not optimal on the training data, it tends to perform well on the test data because the stopping point is determined on the basis of the held-out data.
4. Pretraining: Pretraining is a form of learning in which a greedy algorithm is used to find a good initialization. The weights in different layers of the neural network are trained sequentially in greedy fashion. These trained weights are used as a good starting point for the overall process of learning. Pretraining can be shown to be an indirect form of regularization.
5. Continuation and curriculum methods: These methods train a model more effectively by first training simple models, and then gradually making them more complex. The idea is that it is easy to train simpler models without overfitting. Furthermore, starting with the optimum point of the simpler model provides a good initialization for a closely related, more complex model. It is noteworthy that some of these methods can be considered similar to pretraining, which also proceeds from the simple to the complex by decomposing the training of a deep neural network into the training of a set of shallow layers.
6. Sharing parameters with domain-specific insights: In some data domains, such as text and images, one often has insight about the structure of the parameter space. In such cases, some of the parameters in different parts of the network can be constrained to take on the same value. This reduces the number of degrees of freedom of the model. Such an approach is used in recurrent neural networks (for sequence data) and convolutional neural networks (for image data). Sharing parameters does come with its own set of challenges, because the backpropagation algorithm needs to be appropriately modified to account for the sharing.
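For item (1) in the list above, the penalized update can be sketched in a few lines of numpy. This is an illustrative sketch rather than a library API; `grad` stands in for the gradient of the unpenalized loss, e.g., as computed by backpropagation.

```python
import numpy as np

def step_with_weight_decay(w, grad, alpha=0.01, lam=1e-4):
    """One update of gradient descent on  loss(w) + lam * sum_i w_i^2.

    Shrinking every weight by the factor (1 - alpha * lam) before the ordinary
    gradient step is the familiar "weight decay" form of this update (the
    constant from differentiating w_i^2 is absorbed into lam here).
    """
    return w * (1.0 - alpha * lam) - alpha * np.asarray(grad)
```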
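For item (2), the node-dropping step of Dropout can be sketched as a masking operation on a layer's activations. This uses the common "inverted dropout" scaling convention and is a sketch, not the exact formulation of any particular framework.

```python
import numpy as np

def dropout(h, p_drop=0.5, train=True, rng=None):
    """Inverted dropout applied to a layer of hidden activations h.

    During training, each unit is zeroed with probability p_drop and the
    survivors are scaled by 1 / (1 - p_drop); at test time the activations
    pass through unchanged, which implicitly averages the ensemble of
    thinned networks sampled during training.
    """
    if not train:
        return h
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(np.shape(h)) >= p_drop
    return h * mask / (1.0 - p_drop)
```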
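For item (3), early stopping is simple to sketch once a held-out validation set is available. The callbacks and the `copy()` method below are hypothetical placeholders for whatever training loop and model representation are actually in use.

```python
def train_with_early_stopping(model, run_epoch, validation_error,
                              max_epochs=200, patience=5):
    """Stop the iterative optimization when the held-out error stops improving.

    `run_epoch` performs one pass of optimization over the training data and
    `validation_error` measures the error on the held-out portion; both are
    assumed callbacks.  `model` is assumed to expose a copy() method so the
    best parameters seen so far can be kept.
    """
    best_error, best_model, epochs_since_best = float("inf"), None, 0
    for _ in range(max_epochs):
        run_epoch(model)
        error = validation_error(model)
        if error < best_error:
            best_error, best_model, epochs_since_best = error, model.copy(), 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                      # held-out error has begun to rise
    return best_model
```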
An interesting observation is that several forms of regularization can be shown to be roughly equivalent to the injection of noise into either the input data or the hidden variables. For example, it can be shown that many penalty-based regularizers are equivalent to the addition of noise. Furthermore, even the use of stochastic gradient descent instead of gradient descent can be viewed as a kind of noise addition to the steps of the algorithm. As a result, stochastic gradient descent often shows excellent accuracy on the test data, even though its performance on the training data might not be as good as that of gradient descent. Furthermore, some ensemble techniques like Dropout and data perturbation are equivalent to injecting noise.
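As a concrete (if simplified) example of such noise injection, one can perturb each training batch with small Gaussian noise before it is fed to the network; the function name and noise level here are illustrative choices, not prescriptions.

```python
import numpy as np

def add_input_noise(X, sigma=0.1, rng=None):
    """Return a copy of the training inputs with small Gaussian noise added.

    Training on such perturbed copies is one concrete form of noise injection;
    many penalty-based regularizers can be shown to have a similar effect.
    """
    if rng is None:
        rng = np.random.default_rng()
    return np.asarray(X, dtype=float) + rng.normal(scale=sigma, size=np.shape(X))
```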
Even though a natural way of avoiding overfitting is to simply build smaller networks (with fewer units and parameters), it has often been observed that it is better to build large networks and then regularize them in order to avoid overfitting. This is because large networks retain the option of building a more complex model if it is truly warranted. At the same time, the regularization process can smooth out the random artifacts that are not supported by sufficient data. By using this approach, we are giving the model the choice to decide what complexity it needs, rather than making a rigid decision for the model up front (which might even underfit the data).