Posts

Showing posts from September, 2022

Hyperparameters and Validation Sets

Most machine learning algorithms have several settings that we can use to control the behavior of the learning algorithm. These settings are called hyperparameters. The values of hyperparameters are not adapted by the learning algorithm itself (though we can design a nested learning procedure in which one learning algorithm learns the best hyperparameters for another learning algorithm). In the polynomial regression example we saw earlier, there is a single hyperparameter: the degree of the polynomial, which acts as a capacity hyperparameter. The $\lambda$ value used to control the strength of weight decay is another example of a hyperparameter. Sometimes a setting is chosen to be a hyperparameter that the learning algorithm does not learn because it is difficult to optimize. More frequently, the setting must be a hyperparameter because it is not appropriate to learn that hyperparameter on the training set. This applies to all hyperparameters that control model capacity. If learned on
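As a rough illustration of treating the weight decay strength $\lambda$ as a hyperparameter, the sketch below fits a one-parameter model on a training split for each candidate $\lambda$ and keeps the value with the lowest error on a separate validation split. Everything here is an assumption made for illustration: the one-dimensional model $\hat{y} = wx$, the closed-form fit, the function names, the toy data, and the candidate values are not taken from the post.

```python
def fit_ridge_1d(train, lam):
    """Closed-form minimizer of sum((y - w*x)^2) + lam * w^2 for a scalar w."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam)


def mse(w, data):
    """Mean squared error of the model y_hat = w * x on the given data."""
    return sum((y - w * x) ** 2 for x, y in data) / len(data)


def select_lambda(train, valid, candidates):
    """Fit on the training split for each lambda; score on the validation split."""
    scored = [(mse(fit_ridge_1d(train, lam), valid), lam) for lam in candidates]
    return min(scored)[1]  # lambda with the lowest validation error


# Hypothetical toy data: (x, y) pairs, split by hand into train and validation.
train = [(1.0, 2.3), (2.0, 3.8), (3.0, 6.4)]
valid = [(1.5, 3.0), (2.5, 5.0)]
candidates = [0.0, 0.1, 1.0, 10.0]

best_lam = select_lambda(train, valid, candidates)
```

The parameter $w$ is learned from the training split only, while $\lambda$ is chosen using data the fit never saw.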

Regularization

The no free lunch theorem implies that we must design our machine learning algorithms to perform well on a specific task. We do so by building a set of preferences into the learning algorithm. When these preferences are aligned with the learning problems we ask the algorithm to solve, it performs better. So far, the only method of modifying a learning algorithm that we have discussed concretely is to increase or decrease the model’s representational capacity by adding or removing functions from the hypothesis space of solutions the learning algorithm is able to choose. We gave the specific example of increasing or decreasing the degree of a polynomial for a regression problem. The behavior of our algorithm is strongly affected not just by how large we make the set of functions allowed in its hypothesis space, but by the specific identity of those functions. The learning algorithm we have studied so far, linear regression, has a hypothesis space consisting of the set of linear funct

Linear regression

Our definition of a machine learning algorithm as an algorithm that is capable of improving a computer program’s performance at some task via experience is somewhat abstract. To make this more concrete, we present an example of a simple machine learning algorithm: linear regression. As the name implies, linear regression solves a regression problem. In other words, the goal is to build a system that can take a vector $x \in \mathbb{R}^n$ as input and predict the value of a scalar $y \in \mathbb{R}$ as its output. In the case of linear regression, the output is a linear function of the input. Let $\hat{y}$ be the value that our model predicts $y$ should take on. We define the output to be $\hat{y} = w^Tx$, where $w \in \mathbb{R}^n$ is a vector of parameters. Parameters are values that control the behavior of the system. In this case, $w_i$ is the coefficient that we multiply by feature $x_i$ before summing up the contributions from all the features. We can think of $w$ as a set of weights that determine how eac
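The prediction rule $\hat{y} = w^Tx$ can be written out in a few lines. This is a minimal sketch using plain Python lists; the function name `predict` and the example values of `w` and `x` are made up for illustration.

```python
def predict(w, x):
    """Return y_hat = w^T x: multiply each feature by its weight and sum."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


# Hypothetical weights and features with n = 3.
w = [0.5, -1.0, 2.0]
x = [2.0, 3.0, 1.0]

y_hat = predict(w, x)  # 0.5*2.0 + (-1.0)*3.0 + 2.0*1.0 = 0.0
```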

Gradient Descent, Stochastic Gradient Descent, Batch Gradient Descent

Gradient Descent

Gradient descent is an optimization algorithm often used for finding the weights or coefficients of machine learning algorithms, such as artificial neural networks and logistic regression. It works by having the model make predictions on training data and using the error on the predictions to update the model in such a way as to reduce the error. The goal of the algorithm is to find model parameters (e.g. coefficients or weights) that minimize the error of the model on the training dataset. It does this by making changes to the model that move it along a gradient or slope of errors down toward a minimum error value. This gives the algorithm its name of “gradient descent.”

Types of Gradient Descent

Gradient descent can vary in the number of training patterns used to calculate the error, which is in turn used to update the model. The number of patterns used to calculate the error influences how stable the gradient used to update the model is. We will see that t
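To make the update rule concrete, here is a minimal sketch of gradient descent on a one-parameter model $\hat{y} = wx$ with mean squared error, where `batch_size` sets how many training patterns are used per update. Using the whole dataset per update gives the batch variant, and using one pattern per update gives the stochastic variant. The model, learning rate, epoch count, toy data, and function names are all assumptions chosen for illustration, not the post's own code.

```python
import random


def gradient(w, batch):
    """Gradient of the mean squared error of y_hat = w * x over the batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def gradient_descent(data, batch_size, lr=0.05, epochs=200, seed=0):
    """Repeatedly step w against the gradient computed on each batch."""
    rng = random.Random(seed)  # fixed seed so the shuffling is reproducible
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * gradient(w, batch)  # move down the error slope
    return w


# Hypothetical noise-free data generated from y = 3x.
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]

w_batch = gradient_descent(data, batch_size=len(data))  # one update per epoch
w_sgd = gradient_descent(data, batch_size=1)            # one update per pattern
```

On this noise-free toy problem both settings approach $w = 3$; on noisy data, single-pattern updates would generally follow a less stable path than full-batch updates.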

Penalty Based Regularizations- L1 and L2

Penalty-based regularization is the most common approach for reducing overfitting. In order to understand this point, let us revisit the example of the polynomial with degree $d$. In this case, the prediction $\hat{y}$ for a given value of $x$ is as follows: $\hat{y}=\sum_{i=0}^d w_ix^i$. It is possible to use a single-layer network with $d$ inputs and a single bias neuron with weight $w_0$ in order to model this prediction. The $i$th input is $x^i$. This neural network uses linear activations, and the squared loss function for a set of training instances $(x, y)$ from data set $D$ can be defined as follows: $L=\sum_{(x,y) \in D} (y-\hat{y})^2$. As discussed earlier, a large value of $d$ tends to increase overfitting. One possible solution to this problem is to reduce the value of $d$. In other words, using a model with economy in parameters leads to a simpler model. For example, reducing $d$ to 1 creates a linear model that has fewer degrees of freedom and tends to fit the data in a simi
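The polynomial prediction and squared loss above translate directly into code. The sketch below also adds a penalty term scaled by a strength `lam`; the specific penalty forms used here ($\lambda \sum_i w_i^2$ for L2 and $\lambda \sum_i |w_i|$ for L1, applied to every weight including $w_0$) are the commonly used definitions and are assumed for illustration, since the excerpt does not reach them. Function names and the toy data are likewise hypothetical.

```python
def predict_poly(w, x):
    """y_hat = sum_{i=0}^{d} w_i * x^i, where d = len(w) - 1."""
    return sum(w_i * x ** i for i, w_i in enumerate(w))


def squared_loss(w, data):
    """L = sum over (x, y) in the data set of (y - y_hat)^2."""
    return sum((y - predict_poly(w, x)) ** 2 for x, y in data)


def penalized_loss(w, data, lam, norm="l2"):
    """Squared loss plus an assumed L1 or L2 penalty on all weights."""
    if norm == "l2":
        penalty = sum(w_i ** 2 for w_i in w)
    elif norm == "l1":
        penalty = sum(abs(w_i) for w_i in w)
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return squared_loss(w, data) + lam * penalty


# Hypothetical degree-2 polynomial and three training instances.
w = [1.0, 2.0, -1.0]
data = [(0.0, 1.0), (1.0, 2.0), (2.0, 0.0)]

plain = squared_loss(w, data)                       # 1.0
with_l2 = penalized_loss(w, data, lam=0.5)          # 1.0 + 0.5 * 6.0 = 4.0
with_l1 = penalized_loss(w, data, lam=0.5, norm="l1")  # 1.0 + 0.5 * 4.0 = 3.0
```

The penalty leaves the degree $d$ unchanged and instead makes large weights costly, which is a different route to a simpler fit than reducing $d$.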