
Fully Connected Layers

Each feature in the final spatial layer is connected to each hidden state in the first fully connected layer. This layer functions in exactly the same way as a traditional feed-forward network. In most cases, more than one fully connected layer is used to increase the power of the computations towards the end, and the connections among these layers are structured exactly like those of a traditional feed-forward network. Since the fully connected layers are densely connected, the vast majority of the parameters lie in them. For example, if each of two fully connected layers has 4096 hidden units, then the connections between them have more than 16 million weights. Similarly, the connections from the last spatial layer to the first fully connected layer have a large number of parameters. Even though the convolutional layers have a larger number of activations (and a larger memory footprint), the fully connected layers often have a larger number of connections (and parameters).
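As a rough illustration of this arithmetic, the sketch below counts the weights between two fully connected layers of 4096 units each, and between a final spatial layer and the first fully connected layer. The 7×7×512 spatial size is an assumption chosen purely for illustration, not a value from the text.

```python
# Rough parameter count for the dense portion of a CNN (spatial size is assumed for illustration).
spatial_units = 7 * 7 * 512          # activations in an assumed final spatial layer
fc1_units = 4096
fc2_units = 4096

params_spatial_to_fc1 = spatial_units * fc1_units   # weights into the first fully connected layer
params_fc1_to_fc2 = fc1_units * fc2_units           # weights between the two fully connected layers

print(f"spatial -> fc1: {params_spatial_to_fc1:,} weights")   # ~102.8 million
print(f"fc1 -> fc2:     {params_fc1_to_fc2:,} weights")       # ~16.8 million
```

Even this toy count makes the point: the dense connections dominate the parameter budget, even though the convolutional layers dominate the activation memory.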

Local Response Normalization

A trick introduced in CNNs is local response normalization, which is used immediately after the ReLU layer. This trick aids generalization. The basic idea of this normalization approach is inspired by biological principles, and it is intended to create competition among different filters. First, we describe the normalization formula using all filters, and then we describe how it is actually computed using only a subset of filters. Consider a situation in which a layer contains $N$ filters, and the activation values of these $N$ filters at a particular spatial position $(x, y)$ are given by $a_1 \ldots a_N$. Then, each $a_i$ is converted into a normalized value $b_i$ using the following formula: $b_i=\frac{a_i}{(k+\alpha \sum_j a_j^2)^\beta}$ The values of the underlying parameters used in the original paper are $k = 2$, $\alpha = 10^{-4}$, and $\beta = 0.75$. However, in practice, one does not normalize over all $N$ filters. Rather, the filters are ordered a...
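A minimal sketch of the formula as stated above, normalizing over all $N$ filter activations at a single spatial position (in practice the sum would run only over a window of adjacent filters, as the text goes on to describe). The parameter defaults follow the values quoted from the original paper; the activation values are made up for the example.

```python
import numpy as np

def local_response_normalize(a, k=2.0, alpha=1e-4, beta=0.75):
    """Normalize the activations a_1 ... a_N of N filters at one spatial position.

    a : 1-D array of the N activation values at a fixed (x, y) location.
    Returns b_i = a_i / (k + alpha * sum_j a_j**2) ** beta for each i.
    """
    denom = (k + alpha * np.sum(a ** 2)) ** beta
    return a / denom

# Example: activations of N = 5 filters at one spatial position (post-ReLU).
activations = np.array([0.5, 1.2, 0.0, 2.0, 0.3])
print(local_response_normalize(activations))
```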

Hierarchical Feature Engineering

It is instructive to examine the activations of the filters created by real-world images in different layers. The activations of the filters in the early layers are low-level features like edges, whereas those in later layers put together these low-level features. For example, a mid-level feature might put together edges to create a hexagon, whereas a higher-level feature might put together the mid-level hexagons to create a honeycomb. It is fairly easy to see why a low-level filter might detect edges. Consider a situation in which the color of the image changes along an edge. As a result, the difference between neighboring pixel values will be non-zero only across the edge. This can be achieved by choosing the appropriate weights in the corresponding low-level filter. Note that the filter to detect a horizontal edge will not be the same as that to detect a vertical edge. This brings us back to Hubel and Wiesel's experiments in wh...
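As a small illustration of this point, the sketch below convolves a toy image containing a vertical edge with two hand-chosen 3×3 difference filters (simple finite-difference kernels chosen for illustration, not filters learned by any network). The vertical-edge filter responds strongly across the edge, while the horizontal-edge filter produces no response.

```python
import numpy as np
from scipy.signal import convolve2d

# Toy 6x6 image: dark on the left half, bright on the right half (a vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Difference filters: neighboring-pixel differences are non-zero only across an edge.
vertical_edge_filter = np.array([[-1, 0, 1],
                                 [-1, 0, 1],
                                 [-1, 0, 1]], dtype=float)
horizontal_edge_filter = vertical_edge_filter.T

vertical_response = convolve2d(image, vertical_edge_filter, mode="valid")
horizontal_response = convolve2d(image, horizontal_edge_filter, mode="valid")

print(np.abs(vertical_response).max())    # large response across the vertical edge
print(np.abs(horizontal_response).max())  # zero response: the image has no horizontal edge
```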

Case Studies - Convolutional Architecture - AlexNet

In the following, we provide some case studies of convolutional architectures. These case studies were derived from successful entries to the ILSVRC competition in recent years. They are instructive because they provide an understanding of the important factors in neural network design that can make these networks work well. Even though recent years have seen some changes in architectural design (like ReLU activation), it is striking how similar the modern architectures are to the basic design of LeNet-5. The main changes from LeNet-5 to modern architectures are the explosion of depth, the use of ReLU activation, and the training efficiency enabled by modern hardware/optimization enhancements. Modern architectures are deeper, and they use a variety of computational, architectural, and hardware tricks to efficiently train these networks with large amounts of data. Hardware advancements should not be underestimated; modern GPU-based platforms are 10,000 times faster than ...

Gated Recurrent Units - GRUs

The Gated Recurrent Unit (GRU) can be viewed as a simplification of the LSTM that does not use explicit cell states. Another difference is that the LSTM directly controls the amount of information changed in the hidden state using separate forget and output gates, whereas a GRU uses a single reset gate to achieve the same goal. However, the basic idea in the GRU is quite similar to that of an LSTM in terms of how it partially resets the hidden states. The GRU was introduced by Kyunghyun Cho et al. in 2014. It does not have a separate cell state ($C_t$); it only has a hidden state ($H_t$). Due to the simpler architecture, GRUs are faster to train. They are similar to LSTMs except that they have only two gates: a reset gate and an update gate. The reset gate determines how to combine the new input with the previous memory, and the update gate determines how much of the previous state to keep. The update gate in a GRU plays the role that the input and forget gates play in an LSTM. We don't have the...
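A minimal sketch of a single GRU time step in NumPy, following one common convention for the update/reset-gate equations. The weight names, shapes, and random initialization below are illustrative assumptions, not tied to any particular library or to the original paper's notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time step: returns the new hidden state H_t.

    x_t    : input vector at time t, shape (d,)
    h_prev : previous hidden state H_{t-1}, shape (p,)
    params : dict with weight matrices W_* (p x d), U_* (p x p) and biases b_* (p,)
    """
    # Update gate: how much of the previous state to keep.
    z = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev + params["b_z"])
    # Reset gate: how to combine the new input with the previous memory.
    r = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev + params["b_r"])
    # Candidate state, computed from the partially reset previous state.
    h_tilde = np.tanh(params["W_h"] @ x_t + params["U_h"] @ (r * h_prev) + params["b_h"])
    # Interpolate between the old state and the candidate state.
    return z * h_prev + (1.0 - z) * h_tilde

# Usage on a short random sequence (d-dimensional inputs, p-dimensional hidden state).
d, p = 4, 3
rng = np.random.default_rng(0)
params = {name: rng.standard_normal(shape) for name, shape in
          [("W_z", (p, d)), ("U_z", (p, p)), ("b_z", (p,)),
           ("W_r", (p, d)), ("U_r", (p, p)), ("b_r", (p,)),
           ("W_h", (p, d)), ("U_h", (p, p)), ("b_h", (p,))]}
h = np.zeros(p)
for x in rng.standard_normal((5, d)):
    h = gru_step(x, h, params)
print(h)
```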

Long Short-Term Memory - LSTM

Recurrent neural networks have problems associated with vanishing and exploding gradients. This is a common problem in neural network updates in which successive multiplication by the matrix $W^{(k)}$ is inherently unstable; it either results in the gradient disappearing during backpropagation, or in it blowing up to large values in an unstable way. This type of instability is the direct result of successive multiplication with the (recurrent) weight matrix at various time-stamps. One way of viewing this problem is that a neural network that uses only multiplicative updates is good only at learning over short sequences, and is therefore inherently endowed with good short-term memory but poor long-term memory. To address this problem, a solution is to change the recurrence equation for the hidden vector by using the LSTM, which incorporates a long-term memory. The operations of the LSTM are designed to have fine-grained control over the data written into this long-term memory. LSTM ...
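To see the instability of purely multiplicative updates concretely, the toy sketch below repeatedly multiplies a hidden vector by the same recurrent matrix, scaled to two different spectral radii. The matrix size, scales, and number of steps are arbitrary choices for illustration; the point is only that the norm either decays towards zero (vanishing) or blows up (exploding) over long sequences.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 10))
# Rescale W so that its spectral radius (largest eigenvalue magnitude) is exactly 1.
W = W / np.max(np.abs(np.linalg.eigvals(W)))

h0 = rng.standard_normal(10)
for scale, label in [(0.9, "vanishing"), (1.1, "exploding")]:
    h = h0.copy()
    for t in range(50):                 # 50 steps of h_t = (scale * W) @ h_{t-1}
        h = (scale * W) @ h
    print(f"{label}: ||h_50|| = {np.linalg.norm(h):.3e}")
```

The same repeated multiplication appears in the backpropagated gradients, which is why the LSTM replaces the purely multiplicative recurrence with gated, additive updates to a long-term memory.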

Applications of Recurrent Neural Networks

Recurrent neural networks have numerous applications in machine learning, including information retrieval, speech recognition, and handwriting recognition. Text data forms the predominant setting for applications of RNNs, although there are several applications to computational biology as well. Most of the applications of RNNs fall into one of two categories: 1. Conditional language modeling: When the output of a recurrent network is a language model, one can enhance it with context in order to provide an output that is relevant to that context. In most of these cases, the context is the neural output of another neural network. To provide one example, in image captioning the context is the neural representation of an image provided by a convolutional network, and the language model provides a caption for the image. In machine translation, the context is the representation of a sentence in a source language (produced by another RNN), and the language model in the t...