There are several important issues associated with the setup of the neural network, preprocessing, and initialization. First, the hyperparameters of the neural network (such as the learning rates and regularization parameters) need to be selected. Feature preprocessing and initialization can also be rather important. Neural networks tend to have larger parameter spaces compared to other machine learning algorithms, which magnifies the effect of preprocessing and initialization in many ways. In the following, we will discuss the basic methods used for feature preprocessing and initialization. Strictly speaking, advanced methods like pretraining can also be considered initialization techniques.

Tuning Hyperparameters
Neural networks have a large number of hyperparameters such as the learning rate, the weight of regularization, and so on. The term “hyperparameter” is used to specifically refer to the parameters regulating the design of the model (like learning rate and regularization), and they are different from the more fundamental parameters representing the weights of connections in the neural network.

In a neural network, the primary model parameters, such as the weights, are optimized with backpropagation only after the hyperparameters have been fixed, either manually or through a tuning phase. The hyperparameters should not be tuned using the same data used for gradient descent. Rather, a portion of the data is held out as validation data, and the performance of the model is tested on the validation set with various choices of hyperparameters. This approach ensures that the tuning process does not overfit to the training data set, which would result in poor performance on test data.

How should the candidate hyperparameters be selected for testing? The most well-known technique is grid search, in which a set of values is selected for each hyperparameter. In the most straightforward implementation of grid search, all combinations of selected values of the hyperparameters are tested in order to determine the optimal choice. One issue with this procedure is that the number of hyperparameters might be large, and the number of points in the grid increases exponentially with the number of hyperparameters. For example, if we have 5 hyperparameters, and we test 10 values for each hyperparameter, the training procedure needs to be executed $10^5 = 100000$ times to test its accuracy. Although one does not run such testing procedures to completion, the number of runs is still too large to be reasonably executed for most settings of even modest size. Therefore, a commonly used trick is to first work with coarse grids. Later, when one narrows down to a particular range of interest, finer grids are used.
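
The coarse-to-fine procedure can be sketched as follows. This is only an illustration under stated assumptions: `validation_error` is a hypothetical stand-in for training a model and measuring its error on held-out validation data, and the hyperparameter names, grid values, and refinement factors are arbitrary choices made for the example.

```python
import itertools

def validation_error(learning_rate, reg_weight):
    """Hypothetical stand-in: train on the training split, then return
    the error on the held-out validation split."""
    # Toy error surface with its minimum at learning_rate=0.015, reg_weight=0.001.
    return (learning_rate - 0.015) ** 2 + (reg_weight - 0.001) ** 2

def grid_search(grid):
    """Evaluate every combination in `grid`; return (best_error, best_setting)."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        setting = dict(zip(names, values))
        error = validation_error(**setting)
        if best is None or error < best[0]:
            best = (error, setting)
    return best

# Coarse pass: a few widely spaced values per hyperparameter (4 x 4 = 16 runs).
coarse_grid = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "reg_weight": [0.0001, 0.001, 0.01, 0.1],
}
coarse_error, coarse_best = grid_search(coarse_grid)

# Fine pass: a denser grid around the coarse winner (5 x 5 = 25 runs).
fine_grid = {
    name: [value * factor for factor in (0.5, 0.75, 1.0, 1.5, 2.0)]
    for name, value in coarse_best.items()
}
fine_error, fine_best = grid_search(fine_grid)
```

The two passes together use 41 evaluations, whereas a single grid with the resolution of the fine pass over the whole coarse range would need far more. The cost of each pass is still the product of the number of values per hyperparameter, so the exponential growth with the number of hyperparameters remains.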

It has been pointed out that grid-based hyperparameter exploration is not necessarily the best choice. In some cases, it makes sense to randomly sample the hyperparameters uniformly within the grid range. As in the case of grid ranges, one can perform multi-resolution sampling, where one first samples in the full grid range. One then creates a new set of grid ranges that are geometrically smaller than the previous grid ranges and centered around the optimal parameters from the previously explored samples. Sampling is repeated on this smaller box and the entire process is iteratively repeated multiple times to refine the parameters.
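
A minimal sketch of multi-resolution random sampling is given below. The function `validation_error`, the hyperparameter names `x` and `y`, the number of rounds, the number of samples per round, and the shrink factor of 0.5 are all assumptions made for illustration, not prescribed values.

```python
import random

def validation_error(x, y):
    """Hypothetical stand-in for training a model and scoring it on validation data."""
    return (x - 0.3) ** 2 + (y - 0.7) ** 2

def random_search(ranges, num_samples, rng):
    """Sample each hyperparameter uniformly within its range; return the best sample."""
    best_error, best_point = float("inf"), None
    for _ in range(num_samples):
        point = {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        error = validation_error(**point)
        if error < best_error:
            best_error, best_point = error, point
    return best_error, best_point

def shrink_ranges(ranges, center, factor):
    """Return ranges `factor` times as wide, centered on `center`, clipped to the old box."""
    new_ranges = {}
    for name, (low, high) in ranges.items():
        half_width = factor * (high - low) / 2.0
        new_ranges[name] = (max(low, center[name] - half_width),
                            min(high, center[name] + half_width))
    return new_ranges

rng = random.Random(0)
ranges = {"x": (0.0, 1.0), "y": (0.0, 1.0)}
best_error, best_point = float("inf"), None
history = []  # best validation error seen after each round
for _ in range(4):
    error, point = random_search(ranges, num_samples=20, rng=rng)
    if error < best_error:
        best_error, best_point = error, point
    history.append(best_error)
    # Geometrically smaller box centered on the best point found so far.
    ranges = shrink_ranges(ranges, best_point, factor=0.5)
```

Each round spends the same number of samples on a box that is at most half as wide in every dimension, so the sampling density around the current best point increases from round to round.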

Another key point about sampling many types of hyperparameters is that the logarithms of the hyperparameters are sampled uniformly rather than the hyperparameters themselves. Two examples of such parameters are the regularization rate and the learning rate. For example, instead of sampling the learning rate $α$ uniformly between 0.001 and 0.1, we first sample $\log_{10}(α)$ uniformly between −3 and −1, and then raise 10 to that power to obtain $α$. It is more common to search for hyperparameters in the logarithmic space, although there are some hyperparameters that should be searched for on a uniform scale.
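
The difference between the two sampling schemes can be checked with a short sketch. The helper name `sample_log_uniform` and the sample count are assumptions made for this example.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Sample a value whose base-10 logarithm is uniform on [log10(low), log10(high)]."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10.0 ** exponent

rng = random.Random(0)

# Learning rates between 0.001 and 0.1, sampled uniformly in log space.
log_rates = [sample_log_uniform(0.001, 0.1, rng) for _ in range(10000)]

# The same range sampled uniformly on the original scale, for comparison.
uniform_rates = [rng.uniform(0.001, 0.1) for _ in range(10000)]

# Fraction of samples that land in the lower decade [0.001, 0.01).
log_fraction_low = sum(rate < 0.01 for rate in log_rates) / len(log_rates)
uniform_fraction_low = sum(rate < 0.01 for rate in uniform_rates) / len(uniform_rates)
```

With log-space sampling, roughly half of the samples fall in each decade. With uniform sampling on the original scale, only about 9% of the samples fall below 0.01, so small learning rates are barely explored.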

Finally, a key point about large-scale settings is that it is sometimes impossible to run these algorithms to completion because of the large training times involved. For example, a single run of a convolutional neural network in image processing might take a couple of weeks. Trying to run the algorithm over many different choices of parameter combinations is impractical. However, one can often obtain a reasonable estimate of the broader behavior of the algorithm in a short time. Therefore, the algorithms are often run for a certain number of epochs to test the progress. Runs that are obviously poor or that fail to converge can be quickly killed. In many cases, multiple threads of the process with different hyperparameters can be run, and one can successively terminate poor runs or add newly sampled ones. In the end, only one winner is allowed to train to completion. Sometimes a few winners may be allowed to train to completion, and their predictions are averaged as an ensemble.
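
One way to organize such early termination is sketched below; it loosely resembles what is often called successive halving. Everything in it is a toy assumption: `new_run` and `train_one_epoch` are hypothetical stand-ins for starting and advancing a real training run, and the simulated loss curve merely decays toward a floor that depends on the learning rate.

```python
import math
import random

def new_run(learning_rate):
    """Hypothetical stand-in for starting a training run."""
    # Toy assumption: the achievable loss floor is lowest near learning_rate = 0.01.
    floor = abs(math.log10(learning_rate) + 2.0) / 10.0
    return {"learning_rate": learning_rate, "loss": 1.0, "floor": floor}

def train_one_epoch(run):
    """Hypothetical stand-in for one epoch; updates the validation loss in place."""
    run["loss"] = run["floor"] + 0.5 * (run["loss"] - run["floor"])

rng = random.Random(0)

# Eight candidate learning rates, sampled uniformly in log space.
runs = [new_run(10.0 ** rng.uniform(-4.0, 0.0)) for _ in range(8)]
initial_floors = [run["floor"] for run in runs]

epochs_per_round = 3
while len(runs) > 1:
    for run in runs:
        for _ in range(epochs_per_round):
            train_one_epoch(run)
    # Keep the better half of the runs and terminate the rest.
    runs.sort(key=lambda run: run["loss"])
    runs = runs[: len(runs) // 2]

winner = runs[0]  # the only run that would be trained to completion
```

In this toy setting the early loss ranks the runs perfectly. With real training curves the early ranking is only an estimate, which is one reason to let a few winners finish and average their predictions.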

A mathematically justified way of choosing hyperparameters is the use of Bayesian optimization. However, these methods are often too slow to be practical for large-scale neural networks and remain an intellectual curiosity for researchers. For smaller networks, it is possible to use libraries such as Hyperopt, Spearmint, and SMAC.

Feature Preprocessing
The feature processing methods used for neural network training are not very different from those in other machine learning algorithms. There are two forms of feature preprocessing used in machine learning algorithms:

1. Additive preprocessing and mean-centering: It can be useful to mean-center the data in order to remove certain types of bias effects. Many algorithms in traditional machine learning (such as principal component analysis) also work with the assumption of mean-centered data. In such cases, a vector of column-wise means is subtracted from each data point. Mean-centering is often paired with standardization, which is discussed under feature normalization below.

A second type of pre-processing is used when it is desired for all feature values to be non-negative. In such a case, the absolute value of the most negative entry of a feature is added to the corresponding feature value of each data point. The latter is typically combined with min-max normalization, which is discussed below.

2. Feature normalization: A common type of normalization is to divide each feature value by its standard deviation. When this type of feature scaling is combined with mean-centering, the data is said to have been standardized. The basic idea is that each feature is presumed to have been drawn from a standard normal distribution with zero mean and unit variance.
The other type of feature normalization is useful when the data needs to be scaled to the range $[0, 1]$. Let $min_j$ and $max_j$ be the minimum and maximum values of the $j$th attribute. Then, each feature value $x_{ij}$ for the $j$th dimension of the $i$th point is scaled by min-max normalization as follows:
$x_{ij} \Leftarrow \frac{x_{ij} - min_j}{max_j - min_j}$
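
The preprocessing operations above can be sketched with NumPy on a toy data matrix. The matrix `X` and all variable names are assumptions made for illustration, and the sketch does not guard against constant features, for which the divisions below would divide by zero.

```python
import numpy as np

# Toy data matrix: 4 points (rows), 2 features (columns) on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

# Mean-centering: subtract the vector of column-wise means from each data point.
centered = X - X.mean(axis=0)

# Standardization: mean-centering combined with division by the standard deviation.
standardized = centered / X.std(axis=0)

# Non-negative shift: add the absolute value of the most negative entry of each feature.
shift = np.abs(np.minimum(centered.min(axis=0), 0.0))
non_negative = centered + shift

# Min-max normalization: (x_ij - min_j) / (max_j - min_j) for every feature j.
col_min, col_max = X.min(axis=0), X.max(axis=0)
min_max = (X - col_min) / (col_max - col_min)
```

In practice the column statistics (means, standard deviations, minima, and maxima) are usually computed on the training data only and then reused unchanged for validation and test data.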

Feature normalization often improves performance, because it is common for the relative values of features to vary by more than an order of magnitude. In such cases, parameter learning faces the problem of ill-conditioning, in which the loss function has an inherent tendency to be more sensitive to some parameters than others. As we will see later in this chapter, this type of ill-conditioning affects the performance of gradient descent. Therefore, it is advisable to perform the feature scaling up front.
Another form of feature pre-processing is referred to as whitening, in which the axis system is rotated to create a new set of de-correlated features, each of which is scaled to unit variance. Typically, principal component analysis is used to achieve this goal. Principal component analysis can be viewed as the application of singular value decomposition after mean-centering a data matrix (i.e., subtracting the mean from each column).
Let $D$ be an $n × d$ data matrix that has already been mean-centered. Let $C$ be the $d × d$ co-variance matrix of $D$ in which the $(i, j)$th entry is the co-variance between the dimensions $i$ and $j$. Because the matrix $D$ is mean-centered, we have the following:
$C = \frac{D^T D}{n} \propto D^T D$
The eigenvectors of the co-variance matrix provide the de-correlated directions in the data. Furthermore, the eigenvalues provide the variance along each of the directions. Therefore, if one uses the top-$k$ eigenvectors (i.e., those with the $k$ largest eigenvalues) of the co-variance matrix, most of the variance in the data will be retained and much of the noise will be removed.

Let $P$ be a $d × k$ matrix in which each column contains one of the top-$k$ eigenvectors. Then, the data matrix $D$ can be transformed into the $k$-dimensional axis system by post-multiplying it with the matrix $P$. The resulting $n × k$ matrix $U$, whose rows contain the transformed $k$-dimensional data points, is given by the following:
$U = DP$
Note that the variances of the columns of $U$ are the corresponding eigenvalues, because this is a property of the de-correlating transformation of principal component analysis. In whitening, each column of $U$ is scaled to unit variance by dividing it by its standard deviation (i.e., the square root of the corresponding eigenvalue). The transformed features are fed into the neural network. Since whitening might reduce the number of features, this type of preprocessing might also affect the architecture of the network, because it reduces the number of inputs.
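
The whitening steps can be sketched with NumPy as follows. The toy data, the mixing matrix used to create correlated features, and the choice $k = 2$ are assumptions made for the example. The symbols `D`, `C`, `P`, and `U` follow the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy correlated data: n = 500 points in d = 3 dimensions (illustrative only).
n, d, k = 500, 3, 2
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.2, 0.1]])
D = rng.standard_normal((n, d)) @ mixing
D = D - D.mean(axis=0)                   # mean-center each column

C = D.T @ D / n                          # d x d co-variance matrix

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Keep the eigenvectors with the k largest eigenvalues as the columns of P.
order = np.argsort(eigenvalues)[::-1][:k]
P = eigenvectors[:, order]               # d x k
top_eigenvalues = eigenvalues[order]

U = D @ P                                # n x k de-correlated representation
U_white = U / np.sqrt(top_eigenvalues)   # scale each column to unit variance
```

After this transformation, the co-variance matrix of `U_white` is the $k × k$ identity matrix up to floating-point error. If an eigenvalue is close to zero, the division amplifies noise along that direction, which is one reason to keep only the top-$k$ directions.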

The basic idea behind whitening is that data is assumed to be generated from an independent Gaussian distribution along each principal component. By whitening, one assumes that each such distribution is a standard normal distribution, and provides equal importance to the different features. Note that after whitening, the scatter plot of the data will roughly have a spherical shape, even if the original data is elliptically elongated with an arbitrary orientation. The idea is that the uncorrelated concepts in the data have now been scaled to equal importance (on an a priori basis), and the neural network can decide which of them to emphasize in the learning process. Another issue is that when different features are scaled very differently, the activations and gradients will be dominated by the “large” features in the initial phase of learning (if the weights are initialized randomly to values of similar magnitude). This might hurt the relative learning rate of some of the important weights in the network.

Initialization
Initialization is particularly important in neural networks because of the stability issues associated with neural network training. As you will learn later, neural networks often exhibit stability problems in the sense that the activations of each layer either become successively weaker or successively stronger. The effect is exponentially related to the depth of the network, and is therefore particularly severe in deep networks. One way of ameliorating this effect to some extent is to choose good initialization points in such a way that the gradients are stable across the different layers.

One possible approach to initialize the weights is to generate random values from a Gaussian distribution with zero mean and a small standard deviation, such as $10^{-2}$. Typically, this will result in small random values that are both positive and negative. One problem with this initialization is that it is not sensitive to the number of inputs to a specific neuron. For example, if one neuron has only 2 inputs and another has 100 inputs, the output of the latter is far more sensitive to the average weight because of the additive effect of more inputs (which will show up as a much larger gradient). In general, it can be shown that the variance of the outputs scales linearly with the number of inputs, and therefore the standard deviation scales with the square root of the number of inputs. To balance this effect, each weight is initialized to a value drawn from a Gaussian distribution with standard deviation $1/\sqrt{r}$, where $r$ is the number of inputs to that neuron. Bias neurons are always initialized to zero weight. Alternatively, one can initialize the weight to a value that is uniformly distributed in $[-1/\sqrt{r}, 1/\sqrt{r}]$.
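
The scaling argument can be checked empirically with a small simulation. The sketch assumes independent inputs with zero mean and unit variance and looks only at the weighted sum before any activation function. The helper name `pre_activation_std` and the number of trials are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def pre_activation_std(num_inputs, weight_std, num_trials=20000):
    """Empirical standard deviation of the weighted sum w . x over random w and x."""
    x = rng.standard_normal((num_trials, num_inputs))               # unit-variance inputs
    w = rng.normal(0.0, weight_std, size=(num_trials, num_inputs))  # Gaussian weights
    return float(np.sum(w * x, axis=1).std())

# Fixed standard deviation of 0.01: the spread grows with the square root of r.
fixed_2 = pre_activation_std(2, 0.01)
fixed_100 = pre_activation_std(100, 0.01)

# Standard deviation 1/sqrt(r): the spread is about the same for both neurons.
scaled_2 = pre_activation_std(2, 1.0 / np.sqrt(2))
scaled_100 = pre_activation_std(100, 1.0 / np.sqrt(100))
```

With the fixed standard deviation, the ratio `fixed_100 / fixed_2` should come out close to $\sqrt{50} \approx 7.1$, whereas `scaled_2` and `scaled_100` should both be close to 1.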

More sophisticated rules for initialization consider the fact that the nodes in different layers interact with one another to contribute to output sensitivity. Let $r_{in}$ and $r_{out}$ respectively be the fan-in and fan-out for a particular neuron. One suggested initialization rule, referred to as Xavier initialization or Glorot initialization, is to use a Gaussian distribution with a standard deviation of $\sqrt{2/(r_{in} + r_{out})}$. An important reason for using randomized methods is symmetry breaking. If all weights are initialized to the same value (such as 0), all updates will move in lock-step in a layer. As a result, identical features will be created by the neurons in a layer. It is important to have a source of asymmetry among the neurons to begin with.
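
Both points can be illustrated with a short NumPy sketch. The tiny two-layer network with tanh hidden units, the squared loss, the input `x`, and the helper names are all assumptions made for the demonstration. A nonzero constant is used for the symmetric case so that the identical updates are visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(r_in, r_out):
    """Gaussian weights with standard deviation sqrt(2 / (r_in + r_out))."""
    std = np.sqrt(2.0 / (r_in + r_out))
    return rng.normal(0.0, std, size=(r_in, r_out))

def hidden_weight_gradient(W1, w2, x, y):
    """Gradient of 0.5 * (w2 . tanh(W1^T x) - y)^2 with respect to W1."""
    h = np.tanh(W1.T @ x)              # hidden activations, one per column of W1
    error = w2 @ h - y                 # scalar prediction error
    return np.outer(x, error * w2 * (1.0 - h ** 2))

W = xavier_normal(300, 100)            # example layer: fan-in 300, fan-out 100

x, y = np.array([0.5, -1.0, 2.0]), 1.0

# Symmetric start: every weight has the same value, so every hidden unit
# (every column of W1) receives exactly the same gradient.
grad_constant = hidden_weight_gradient(np.full((3, 4), 0.1), np.full(4, 0.1), x, y)

# Random start: the columns receive different gradients, so the units can diverge.
grad_random = hidden_weight_gradient(xavier_normal(3, 4), xavier_normal(4, 1)[:, 0], x, y)
```

Because the columns of `grad_constant` are identical, a gradient-descent step keeps the hidden units identical to one another, and the same holds after every later step. The random start removes this tie.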


