Deep Feedforward Network -learning XOR
Modern deep learning provides a very powerful framework for supervised learning. By adding more layers and more units within a layer, a deep network can represent functions of increasing complexity. Most tasks that consist of mapping an input vector to an output vector, and that are easy for a person to do rapidly, can be accomplished via deep learning, given sufficiently large models and sufficiently large datasets of labeled training examples.
Deep feedforward networks, also often called feedforward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function $f^∗$. For example, for a classifier, $y = f^∗(x)$ maps an input $x$ to a category $y$. A feedforward network defines a mapping $y = f (x; θ)$ and learns the value of the parameters $\theta $ that result in the best function approximation.
These models are called feedforward because information flows through the function being evaluated from $x$, through the intermediate computations used to define $f$ , and finally to the output $y$. There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks,
Feedforward networks are of extreme importance to machine learning practitioners.They form the basis of many important commercial applications. For example, the convolutional networks used for object recognition from photos are a specialized kind of feedforward network. Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications.
Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions $f^{ (1)}, f ^{(2)}$, and $f ^{(3)}$ connected in a chain, to form $f(x) = f^{(3)}(f ^{(2)}(f^{(1)}(x )))$. These chain structures are the most commonly used structures of neural networks. In this case, $f^{(1)}$ is called the first layer of the network, $f^{ (2)}$ is called the second layer, and so on. The overall length of the chain gives the depth of the model. It is from this terminology that the name “deep learning” arises. The final layer of a feedforward network is called the output layer. During neural network training, we drive $f(x)$ to match $f^∗(x)$.The training data provides us with noisy, approximate examples of $f^ ∗(x)$ evaluated at different training points. Each example $x$ is accompanied by a label$ y ≈ f ^∗(x)$.
The training examples specify directly what the output layer must do at each point $x$; it must produce a value that is close to $y$. The behavior of the other layers is not directly specified by the training data. The learning algorithm must decide how to use those layers to produce the desired output, but the training data does not say what each individual layer should do. Instead, the learning algorithm must decide how to use these layers to best implement an approximation of $f^∗$. Because the training data does not show the desired output for each of these layers, these layers are called hidden layers.
Finally, these networks are called neural because they are loosely inspired by neuroscience. Each hidden layer of the network is typically vector-valued. The dimensionality of these hidden layers determines the width of the model. Each element of the vector may be interpreted as playing a role analogous to a neuron.Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function. Each unit resembles a neuron in the sense that it receives input from many other units and computes its own activation value. The idea of using many layers of vector-valued representation is drawn from neuroscience. The choice of the functions $f^{(i)}(x)$ used to compute these representations is also loosely guided by neuroscientific observations about the functions that biological neurons compute. However, modern neural network research is guided by many mathematical and engineering disciplines, and the goal of neural networks is not to perfectly model the brain. It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization, occasionally drawing some insights from what we know about the brain, rather than as models of brain function.
One way to understand feedforward networks is to begin with linear models and consider how to overcome their limitations. Linear models, such as logistic regression and linear regression, are appealing because they may be fit efficiently and reliably, either in closed form or with convex optimization. Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables.To extend linear models to represent nonlinear functions of $x$, we can apply the linear model not to $x$ itself but to a transformed input $\Phi(x $), where $\Phi$ is a nonlinear transformation.
Example: learning XOR
The XOR function (“exclusive or”) is an operation on two binary values, $x_1$ and $x_2$. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function $y = f^∗(x)$ that we want to learn. Our model provides a function $y = f(x;θ)$ and our learning algorithm will adapt the parameters $\theta$ to make $f$ as similar as possible to $f^∗$.
In this simple example, we will not be concerned with statistical generalization.We want our network to perform correctly on the four points $X = \{[0, 0]^T, [0,1]^T,[1, 0]^T, and [1, 1]^T\}$. We will train the network on all four of these points. The only challenge is to fit the training set.Note that the linear model cannot solve this problem
We will introduce a very simple feedforward network with one hidden layer containing two hidden units.This feedforward network has a vector of hidden units h that are computed by a function $f^{(1)}(x;W, c)$. The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network. The output layer is still just a linear regression model, but now it is applied to $h$ rather than to $x$ . The network now contains two functions chained together: $h = f^{(1)}(x;W, c)$ and $y = f ^(2)(h;w, b$), with the complete model being $f(x;W, c,w, b) = f^{(2)}(f ^{(1)}(x))$.
Suppose$ f ^{(1)}(x ) = W^Tx$ and $f^{(2)}( h) = h ^Tw$. Then $f (x) = w^TW ^Tx$. We could represent this function as $f(x) = x^T{w}'$ where ${w}'= Ww$.Clearly, we must use a nonlinear function to describe the features. Most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed, nonlinear function called an activation function. We use that
strategy here, by defining $h = g(W ^Tx + c)$ , where $W$ provides the weights of a linear transformation and $c$ the biases. Previously, to describe a linear regression model, we used a vector of weights and a scalar bias parameter to describe an affine transformation from an input vector to an output scalar. Now, we describe an affine transformation from a vector $x$ to a vector $h$, so an entire vector of bias parameters is needed. The activation function $g$ is typically chosen to be a function that is applied element-wise, with $h_i = g(x^TW_{:,i} + c_i)$. In modern neural networks, the default recommendation is to use the rectified linear unit or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a) defined by the activation function $g(z) = max\{0, z\}$
We can now specify our complete network as
$f(x;W, c,w, b) = w^T .max\{0,W^Tx + c\} + b.$
We can now specify the solution of the XOR problem
$W=\begin{bmatrix}1 & 1 \\
1 & 1 \\
\end{bmatrix}$
$c=\begin{bmatrix}
0 \\
-1 \\
\end{bmatrix}$
0 \\
-1 \\
\end{bmatrix}$
$w=\begin{bmatrix}
1 \\
-2 \\
\end{bmatrix}$
1 \\
-2 \\
\end{bmatrix}$
and $b=0$
output
$X=\begin{bmatrix}0 &0 \\
0 & 1 \\
1 &0 \\
1 & 1 \\
\end{bmatrix}$
$XW=\begin{bmatrix}
0 &0 \\
1 & 1 \\
1 &1 \\
2 & 2 \\
\end{bmatrix}$
Next, we add the bias vector $c$, to obtain
$XW+c =\begin{bmatrix}
0 &-1 \\
1 & 0 \\
1 &0\\
2 & 1 \\
\end{bmatrix}$
To finish computing the value
of h for each example, we apply the rectified linear transformation:ReLU
$h =\begin{bmatrix}
0 &0 \\
1 & 0 \\
1 &0\\
2 & 1 \\
\end{bmatrix}$
We finish by multiplying by the weight vector w:
$hw =\begin{bmatrix}
0 \\
1 \\
1 \\
0 \\
\end{bmatrix}$
In this example, we simply specified the solution, then showed that it obtained zero error. In a real situation, there might be billions of model parameters and billions of training examples, so one cannot simply guess the solution as we did here. Instead, a gradient-based optimization algorithm can find parameters that produce very little error. The solution we described to the XOR problem is at a global minimum of the loss function, so gradient descent could converge to this point. There are other equivalent solutions to the XOR problem that gradient descent could also find. The convergence point of gradient descent depends on the initial values of the parameters. In practice, gradient descent would usually not find clean, easily understood, integer-valued solutions like the one we presented here.
Gradient-Based Learning
Designing and training a neural network is not much different from training any other machine learning model with gradient descent.
The largest difference between the linear models we have seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex. This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the linear equation solvers used to train linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs. Convex optimization converges starting from any initial parameters (in theory—in practice it is very robust but can encounter numerical problems). Stochastic gradient descent applied to non-convex loss functions has no such convergence guarantee, and is sensitive to the values of the initial parameters. For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values. The iterative gradient-based optimization algorithms used to train feedforward networks and almost all other deep models will be described in detail later.
Comments
Post a Comment