Penalty-Based Regularization: L1 and L2
Penalty-based regularization is the most common approach for reducing overfitting. In order to understand this point, let us revisit the example of the polynomial with degree $d$. In this case, the prediction $\hat{y}$ for a given value of $x$ is as follows:
$\hat{y}=\sum_{i=0}^d w_ix^i$
It is possible to use a single-layer network with $d$ inputs and a single bias neuron with weight $w_0$ in order to model this prediction. The $i$th input is $x^i$. This neural network uses linear activations, and the squared loss function for a set of training instances $(x, y)$ from data set $D$ can be defined as follows:
$L=\sum_{(x,y) \in D} (y-\hat{y})^2$
As discussed earlier, a large value of $d$ tends to increase overfitting. One possible solution to this problem is to reduce the value of $d$. In other words, using a model with economy in parameters leads to a simpler model. For example, reducing $d$ to 1 creates a linear model that has fewer degrees of freedom and tends to fit the data in a similar way over different training samples. However, doing so loses some expressivity when the data patterns are indeed complex. In other words, oversimplification reduces the expressive power of a neural network, so that it is unable to adjust sufficiently to the needs of different types of data sets.
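Before introducing penalties, it helps to see this trade-off concretely. The following is a minimal sketch (using synthetic data of my own choosing, not from the text) that fits a degree-1 and a degree-9 polynomial to ten noisy points drawn from a truly linear function; the high-degree fit typically achieves a near-zero training error but a much higher test error:
import numpy as np
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.2, size=10)  # truly linear data + noise
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test                                     # noise-free ground truth
for d in (1, 9):
    w = np.polyfit(x_train, y_train, deg=d)             # least-squares polynomial of degree d
    train_err = np.mean((np.polyval(w, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(w, x_test) - y_test) ** 2)
    print(f"degree {d}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")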
L2-Regularization (Ridge Regression)
How can one retain some of this expressiveness without causing too much overfitting? Instead of reducing the number of parameters in a hard way, one can use a soft penalty on the use of parameters. Furthermore, large (absolute) values of the parameters are penalized more than small values, because small values do not affect the prediction significantly. What kind of penalty can one use? The most common choice is L2-regularization (ridge regression), which is also referred to as Tikhonov regularization. In such a case, the additional penalty is defined by the sum of squares of the values of the parameters. Then, for the regularization parameter $\lambda > 0$, one can define the objective function as follows:
$L=\sum_{(x,y) \in D} (y-\hat{y})^2 + \lambda \sum_{i=0}^d w_i^2$
Increasing the value of $\lambda$ strengthens the penalty, while decreasing it softens the penalty. One advantage of this type of parameterized penalty is that one can tune this parameter for optimum performance on a portion of the training data set that is not used for learning the parameters. This type of approach is referred to as model validation. Using this type of approach provides greater flexibility than fixing the economy of the model up front. Consider the case of polynomial regression discussed above. Restricting the number of parameters up front severely constrains the learned polynomial to a specific shape (e.g., a linear model), whereas a soft penalty is able to control the shape of the learned polynomial in a more data-driven manner. In general, it has been experimentally observed that it is more desirable to use complex models (e.g., larger neural networks) with regularization rather than simple models without regularization. The former also provides greater flexibility via a tunable knob (i.e., the regularization parameter), which can be chosen in a data-driven manner on a held-out portion of the data set.

How does regularization affect the updates in a neural network? For any given weight $w_i$ in the neural network, the updates are defined by gradient descent (or the batched version of it):
$w_i=w_i-\alpha \frac{\partial L }{\partial w_i}$
Here, $\alpha$ is the learning rate. The use of L2-regularization is roughly equivalent to imposing a decay on the weights after each parameter update:
$w_i=w_i(1-\alpha\lambda)- \alpha \frac{\partial L }{\partial w_i}$
Note that the update above first multiplies the weight with the decay factor $(1 − αλ)$, and then uses the gradient-based update. The decay of the weights can also be understood in terms of a biological interpretation, if we assume that the initial values of the weights are close to 0. One can view weight decay as a kind of forgetting mechanism, which brings the weights closer to their initial values. This ensures that only the repeated updates have a significant effect on the absolute magnitude of the weights. A forgetting mechanism prevents a model from memorizing the training data, because only significant and repeated updates will be reflected in the weights.
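The following is a minimal sketch of this decay-then-step update for a single weight, with an assumed quadratic loss whose unregularized optimum is at 0.5 (all values here are illustrative, not from the text):
alpha, lam = 0.1, 0.01  # learning rate and regularization strength (assumed values)
w = 1.0
def loss_gradient(w):
    # gradient of the assumed unregularized loss (w - 0.5)^2
    return 2 * (w - 0.5)
for _ in range(100):
    w = w * (1 - alpha * lam) - alpha * loss_gradient(w)  # decay first, then gradient step
print(w)  # settles slightly below 0.5: the decay keeps pulling the weight toward 0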
Mathematical Intuition:
During gradient descent optimization of the cost function, the added L2 penalty term shrinks the weights of the model toward zero (close to, but typically not exactly, zero). Due to the penalization of weights, our hypothesis gets simpler, more generalized, and less prone to overfitting. In each update, all weights are shrunk by the same multiplicative factor $(1-\alpha\lambda)$. We can control the strength of regularization through the hyperparameter $\lambda$.
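This intuition can also be seen in closed form. For the linear model above, ridge regression has a classical closed-form minimizer (a standard result, stated here for intuition rather than derived in the text; $X$ stacks the inputs as rows and $y$ stacks the targets):

$\hat{w}=(X^TX+\lambda I)^{-1}X^Ty$

At $\lambda=0$ this reduces to ordinary least squares (when $X^TX$ is invertible), and as $\lambda \to \infty$ the weights shrink toward zero, matching the cases listed below.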
Different cases for tuning the value of $\lambda$:
- If $\lambda$ is set to 0, ridge regression reduces to linear regression.
- If $\lambda$ is set to infinity, all weights are shrunk to zero.
- Therefore, $\lambda$ should be set somewhere between 0 and infinity.
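As discussed earlier, $\lambda$ is usually tuned on held-out data. The following is a minimal sketch of that tuning loop using scikit-learn's RidgeCV (synthetic data and an arbitrary grid of alphas, chosen here only for illustration):
import numpy as np
from sklearn.linear_model import RidgeCV
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + rng.normal(scale=0.1, size=100)
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5)  # picks alpha by 5-fold cross-validation
model.fit(X, y)
print("chosen alpha:", model.alpha_)
print("weights:", model.coef_)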
L1-Regularization (Lasso Regression: Least Absolute Shrinkage and Selection Operator)
The use of the squared norm penalty, which is also referred to as L2-regularization, is the most common approach for regularization. However, it is possible to use other types of penalties on the parameters. A common approach is L1-regularization, in which the squared penalty is replaced with a penalty on the sum of the absolute magnitudes of the coefficients. Therefore, the new objective function is as follows:
$L=\sum_{(x,y) \in D} (y-\hat{y})^2 + \lambda \sum_{i=0}^d |w_i|$
The main problem with this objective function is that it contains the term $|w_i|$, which is not differentiable when $w_i$ is exactly equal to 0. This requires some modifications to the gradient-descent method for the case when $w_i$ is 0. When $w_i$ is non-zero, one can use the straightforward update obtained by computing the partial derivative. By differentiating the above objective function, we can define the update equation, at least for the case when $w_i$ is different from 0:
$w_i=w_i-\alpha\lambda s_i- \alpha \frac{\partial L }{\partial w_i}$
The value of $s_i$, which is the partial derivative of $|w_i|$ (with respect to $w_i$), is as follows:
$s_i = \begin{cases} -1 & w_i < 0 \\ +1 & w_i > 0 \end{cases}$
However, we also need to set the partial derivative of $|w_i|$ for cases in which the value of $w_i$ is exactly 0. One possibility is to use the subgradient method, in which the value of $s_i$ is set stochastically to a value in $\{−1, +1\}$. However, this is not necessary in practice. Computers are of finite precision, and computational errors will rarely cause $w_i$ to be exactly 0. Therefore, the computational errors will often perform the task that would otherwise be achieved by stochastic sampling. Furthermore, for the rare cases in which the value of $w_i$ is exactly 0, one can omit the regularization and simply set $s_i$ to 0. This type of approximation to the subgradient method works reasonably well in many settings.

One difference between the update equations for L1-regularization and those of L2-regularization is that L2-regularization uses multiplicative decay as a forgetting mechanism, whereas L1-regularization uses additive updates as a forgetting mechanism. In both cases, the regularization portions of the updates tend to move the coefficients closer to 0. However, there are some differences in the types of solutions found in the two cases, which are discussed in the next section.
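The following is a minimal sketch of the L1 update above on three weights (the gradient function and all constants are assumed for illustration); note that np.sign returns 0 at exactly 0, which matches the convention of setting $s_i$ to 0 there:
import numpy as np
alpha, lam = 0.1, 0.05  # learning rate and regularization strength (assumed values)
w = np.array([1.0, -0.8, 0.0])
targets = np.array([0.02, -0.01, 0.0])  # unregularized optima (assumed small)
def loss_gradient(w):
    return 2 * (w - targets)  # gradient of the assumed quadratic loss
for _ in range(200):
    s = np.sign(w)  # subgradient of |w_i|: -1, +1, or 0 at exactly 0
    w = w - alpha * lam * s - alpha * loss_gradient(w)
print(w)  # weights whose optima are smaller than the penalty scale end up chattering near 0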
Mathematical Intuition:
During gradient descent optimization, the added L1 penalty shrinks weights to zero or close to zero. Weights that are shrunk exactly to zero eliminate the corresponding features from the hypothesis function, so irrelevant features do not participate in the predictive model. This penalization of weights makes the hypothesis simpler and encourages sparsity (a model with few effective parameters).
The intercept (bias) term, if included, is typically not penalized and therefore remains unaffected by the regularization.
We can control the strength of regularization through the hyperparameter $\lambda$. In each update, all weights are shrunk by the same additive amount (in contrast to the multiplicative factor used in L2).
Different cases for tuning the value of $\lambda$:
- If $\lambda$ is set to 0, lasso regression reduces to linear regression.
- If $\lambda$ is set to infinity, all weights are shrunk to zero.
- Increasing $\lambda$ increases bias; decreasing $\lambda$ increases variance. As $\lambda$ increases, more and more weights are shrunk to zero, eliminating the corresponding features from the model.
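The sparsity effect can be made precise in a classical special case: if the columns of the input matrix are orthonormal (a simplifying assumption, not made in the text above), the lasso objective decouples per coefficient and its minimizer soft-thresholds the least-squares solution $w_i^{LS}$:

$w_i=\mathrm{sign}(w_i^{LS})\max\left(|w_i^{LS}|-\frac{\lambda}{2},0\right)$

Any coefficient whose least-squares value is smaller in magnitude than $\lambda/2$ is set exactly to zero, which is why lasso eliminates features outright while ridge only shrinks them.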
L1- or L2-Regularization?
A question arises as to whether L1- or L2-regularization is desirable. From an accuracy point of view, L2-regularization usually outperforms L1-regularization. This is the reason that L2-regularization is almost always preferred over L1-regularization in most implementations. The performance gap is small when the number of inputs and units is large.

However, L1-regularization does have specific applications from an interpretability point of view. An interesting property of L1-regularization is that it creates sparse solutions in which the vast majority of the values of $w_i$ are 0s (after ignoring computational errors). If the value of $w_i$ is zero for a connection incident on the input layer, then that particular input has no effect on the final prediction. In other words, such an input can be dropped, and the L1-regularizer acts as a feature selector. Therefore, one can use L1-regularization to estimate which features are predictive for the application at hand. What about the connections in the hidden layers whose weights are set to 0? These connections can be dropped, which results in a sparse neural network. Such sparse neural networks can be useful in cases where one repeatedly performs training on the same type of data set, but the nature and broader characteristics of the data set do not change significantly with time. Since the sparse neural network will contain only a small fraction of the connections in the original neural network, it can be retrained much more efficiently whenever more training data is received.
Penalizing Hidden Units: Learning Sparse Representations
The penalty-based methods, which have been discussed so far, penalize the parameters of the neural network. A different approach is to penalize the activations of the neural network, so that only a small subset of the neurons are activated for any given data instance. In other words, even though the neural network might be large and complex, only a small part of it is used for predicting any given data instance. The simplest way to achieve sparsity is to impose an L1-penalty on the hidden units. Therefore, the original loss function $L$ is modified to the regularized loss function $L'$ as follows:
$L'=L+\lambda \sum_{i=1}^M |h_i|$
Here, $M$ is the total number of hidden units in the network, and $h_i$ is the value of the $i$th hidden unit. Furthermore, the regularization parameter is denoted by $\lambda$. In many cases, a single layer of the network is regularized, so that a sparse feature representation can be extracted from the activations of that particular layer. How does this change to the objective function affect the backpropagation algorithm? The main difference is that the loss function is aggregated not only over nodes in the output layer, but also over nodes in the hidden layers. At a fundamental level, this change does not affect the overall dynamics and principles of backpropagation.
Once the value of $\delta(h_r, N(h_r))$ is modified at a given node $h_r$, the changes will automatically be backpropagated to all nodes that reach $h_r$. This is the only change that is required in order to enforce L1-regularization of the hidden units. In a sense, incorporating penalties on nodes in intermediate layers does not change the backpropagation algorithm in a fundamental way, except that hidden nodes are now also treated as output nodes in terms of contributing to the gradient flow.
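The following is a minimal sketch of this idea in PyTorch (PyTorch is my choice here, not used elsewhere in this text; the architecture, data, and $\lambda$ are all illustrative), adding the $\lambda \sum_i |h_i|$ term on one hidden layer's activations:
import torch
import torch.nn as nn
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
lam = 1e-3                 # regularization strength (assumed value)
X = torch.randn(64, 10)    # synthetic batch
y = torch.randn(64, 1)
h = model[1](model[0](X))  # hidden activations after the ReLU
out = model[2](h)
loss = nn.functional.mse_loss(out, y) + lam * h.abs().sum()  # L' = L + lambda * sum_i |h_i|
optimizer.zero_grad()
loss.backward()            # autograd propagates the extra hidden-layer term automatically
optimizer.step()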
L1 and L2 regularization are two commonly used techniques in machine learning to prevent overfitting and improve the generalization performance of models. They work by adding penalty terms to the model's cost function, which encourage the model to have smaller weight values.

L1 Regularization (Lasso Regularization): L1 regularization adds a penalty term to the cost function proportional to the absolute values of the model's weights. The L1 regularization term is defined as the sum of the absolute values of all the model's weights multiplied by a hyperparameter called the regularization strength (usually denoted by lambda or alpha).
The L1 regularization term can be expressed as $\lambda \sum_i |w_i|$, where $w_i$ represents an individual weight and $\lambda$ is the regularization strength.
The effect of L1 regularization is that it encourages the model to have sparse weight values, i.e., some of the weights may become exactly zero. This can be useful for feature selection, as it effectively eliminates less important features from the model.

L2 Regularization (Ridge Regularization): L2 regularization adds a penalty term to the cost function proportional to the square of the model's weights. The L2 regularization term is defined as the sum of the squares of all the model's weights multiplied by the regularization strength.
The L2 regularization term can be expressed as $\lambda \sum_i w_i^2$, where $w_i$ represents an individual weight and $\lambda$ is the regularization strength.
The effect of L2 regularization is that it penalizes large weight values, encouraging the model to have small, but non-zero, weight values for all features. L2 regularization helps in preventing overfitting and also leads to a more stable model.
Both L1 and L2 regularization are controlled by the regularization strength (λ or alpha), which is a hyperparameter that needs to be set prior to training. The appropriate choice of regularization strength depends on the specific problem and dataset, and it is often determined using techniques like cross-validation.
In summary, L1 regularization encourages sparsity and feature selection, while L2 regularization promotes weight shrinkage and generalization. Many machine learning algorithms, such as linear regression, logistic regression, and neural networks, can be regularized using either L1 or L2 regularization or a combination of both (known as Elastic Net regularization).
Showing the weight shrinkage in L1 (Lasso) regularization: Python code
Because the L1 regularization term is proportional to the absolute values of the weights, it has the effect of "pulling" some of the weights towards zero. When the regularization strength is sufficiently large, the model is encouraged to set many of the weights exactly to zero, effectively removing those corresponding features from the model.
This is how L1 regularization introduces sparsity in the model – it "encourages" the model to disregard some features by setting their associated weights to zero. The sparsity achieved through L1 regularization is valuable for feature selection, as it can help identify and retain only the most important features, leading to more interpretable and efficient models.
In summary, L1 regularization promotes sparsity by adding a penalty term based on the absolute values of the model's weights, which encourages the model to set many of the weights to zero during training, effectively removing less important features from the model.
Let's demonstrate how L1 regularization introduces sparsity with a simple example using linear regression. In this example, we'll create a synthetic dataset and apply L1 regularization to see how it affects the weights and feature selection.
Suppose we have a synthetic dataset with three features (X1, X2, X3) and a target variable (y). Here's the dataset:
| X1 | X2 | X3 | y |
|----|----|----|-----|
| 1 | 5 | 2 | 10 |
| 2 | 4 | 3 | 15 |
| 3 | 3 | 4 | 20 |
| 4 | 2 | 5 | 25 |
import numpy as np
from sklearn.linear_model import Lasso
# Create the dataset
X = np.array([[1, 5, 2],
[2, 4, 3],
[3, 3, 4],
[4, 2, 5]])
y = np.array([10, 15, 20, 25])
# Apply L1 regularization (Lasso)
lasso_model = Lasso(alpha=0.1) # Regularization strength (lambda) is 0.1
lasso_model.fit(X, y)
# Get the weights of the model
weights = lasso_model.coef_
print("Feature Weights:", weights)
In this example, we use the Lasso regression from scikit-learn, which applies L1 regularization. The regularization strength (alpha) is set to 0.1 for demonstration purposes.
When we run this code, the output will be:
Feature Weights: [2.81609195 0. 1.01694915]
Here's how to interpret the results:
- The first weight (2.816) corresponds to X1.
- The second weight (0.000) corresponds to X2.
- The third weight (1.017) corresponds to X3.
As you can see, the L1 regularization (Lasso) has introduced sparsity by setting the weight of the feature X2 to exactly zero. This means that feature X2 has been completely disregarded by the model. The other features (X1 and X3) still have non-zero weights, indicating that they are considered relevant for the model's predictions.
This example illustrates how L1 regularization encourages sparsity by driving some weights to zero, effectively performing feature selection and retaining only the most important features in the model.
Next, let's vary the regularization strength on a second synthetic dataset to observe how increasing alpha drives the weights toward zero:
import numpy as np
from sklearn.linear_model import Lasso
# Create the dataset
X = np.array([[1, 2],
              [2, 3],
              [3, 4],
              [4, 5]])
y = np.array([2, 4, 6, 8])
# Apply L1 regularization (Lasso) with different regularization strengths
regularization_strengths = [0.0, 0.1, 0.5, 1.0]  # Vary the strength of regularization
for alpha in regularization_strengths:
    lasso_model = Lasso(alpha=alpha)
    lasso_model.fit(X, y)
    weights = lasso_model.coef_
    print(f"Regularization Strength (alpha={alpha}):")
    print("Feature Weights:", weights)
    print()
In this example, we use the Lasso regression from scikit-learn, applying L1 regularization. We vary the regularization strength (alpha) to observe how it affects the weight values.
When we run this code, the output will be:
Regularization Strength (alpha=0.0):
Feature Weights: [1. 1.]
Regularization Strength (alpha=0.1):
Feature Weights: [0.78947368 0.47368421]
Regularization Strength (alpha=0.5):
Feature Weights: [0.4 0. ]
Regularization Strength (alpha=1.0):
Feature Weights: [0. 0.]
Here's how to interpret the results:
When there is no regularization (alpha=0.0), the weights are [1.0, 1.0]. The model tries to fit the data without any penalty on the weight values.
As we increase the regularization strength (alpha=0.1), the weights decrease towards zero. The model penalizes larger weight values, leading to smaller weights. Some features are still retained in the model, but their weights are reduced compared to the non-regularized case.
With higher regularization strength (alpha=0.5), one of the features (X2) gets reduced to zero, effectively removing it from the model. Only feature X1 is kept with a reduced weight.
When the regularization strength is very high (alpha=1.0), both features' weights are reduced to exactly zero, indicating that the model has removed both features from the model.
As we can see from the output, increasing the regularization strength gradually reduces the weight values of the features. This demonstrates how L1 regularization effectively reduces the magnitude of the weights, leading to a simpler model with fewer relevant features (and sometimes complete removal of less important features).
L2 (Ridge) regularization demonstration using Python code
To demonstrate how L2 regularization (Ridge regression) is used for regularization, let's use a simple linear regression example. We will create a synthetic dataset and apply L2 regularization to observe how it affects the weight values.
import numpy as np
from sklearn.linear_model import Ridge
# Create the dataset
X = np.array([[1, 2],
              [2, 3],
              [3, 4],
              [4, 5]])
y = np.array([2, 4, 6, 8])
# Apply L2 regularization (Ridge) with different regularization strengths
regularization_strengths = [0.0, 0.1, 0.5, 1.0]  # Vary the strength of regularization
for alpha in regularization_strengths:
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X, y)
    weights = ridge_model.coef_
    print(f"Regularization Strength (alpha={alpha}):")
    print("Feature Weights:", weights)
    print()
In this example, we use the Ridge regression from scikit-learn, applying L2 regularization. We vary the regularization strength (alpha) to observe how it affects the weight values.
When we run this code, the output will be:
Regularization Strength (alpha=0.0):
Feature Weights: [1. 1.]
Regularization Strength (alpha=0.1):
Feature Weights: [0.93023256 0.8372093 ]
Regularization Strength (alpha=0.5):
Feature Weights: [0.78082192 0.69520548]
Regularization Strength (alpha=1.0):
Feature Weights: [0.62162162 0.55405405]
When there is no regularization (alpha=0.0), the weights are [1.0, 1.0]. The model tries to fit the data without any penalty on the weight values.
As we increase the regularization strength (alpha=0.1), the weights get closer to zero but remain non-zero. The model penalizes larger weight values, leading to smaller weights.
With higher regularization strength (alpha=0.5), the weights decrease further. The model penalizes larger weights more strongly, leading to even smaller weights.
When the regularization strength is very high (alpha=1.0), the weights are reduced even more. The model heavily penalizes larger weight values, leading to significantly smaller weights.
As we can see from the output, increasing the regularization strength gradually reduces the weight values of the features. This demonstrates how L2 regularization effectively reduces the magnitude of the weights, leading to a simpler model with smaller weight values for all features. L2 regularization helps in preventing overfitting and makes the model more robust to variations in the input data.
Python Implementation of Ridge Regression using sklearn
# importing libraries
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2; requires an older version
from sklearn.preprocessing import StandardScaler
# loading boston dataset
boston = load_boston()
X = boston.data[:, :13]
y = boston.target
print("Boston dataset keys : \n", boston.keys())
print("\nBoston data : \n", boston.data)
# scaling the inputs
scaler = StandardScaler()
scaled_X = scaler.fit_transform(X)
# Train Test split will be used for both models
X_train, X_test, y_train, y_test = train_test_split(scaled_X, y, test_size=0.3)
# training model with 0.5 alpha value
model = Ridge(alpha=0.5, tol=0.001, solver='auto', random_state=42)
model.fit(X_train, y_train)
# predicting the y_test
y_pred = model.predict(X_test)
# finding score for our model
score = model.score(X_test, y_test)
print("\n\nModel score : ", score)
Python Implementation of Ridge Regression from scratch
Dataset used in this implementation is salary_data.csv
It has 2 columns — “YearsExperience” and “Salary” — for 30 employees in a company. We will train a Ridge Regression model to learn the correlation between an employee's years of experience and their salary. Once the model is trained, we will be able to predict the salary of an employee on the basis of their years of experience.
# Importing libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Ridge Regression
class RidgeRegression():
    def __init__(self, learning_rate, iterations, l2_penalty):
        self.learning_rate = learning_rate
        self.iterations = iterations
        self.l2_penalty = l2_penalty
    # Function for model training
    def fit(self, X, Y):
        # no_of_training_examples, no_of_features
        self.m, self.n = X.shape
        # weight initialization
        self.W = np.zeros(self.n)
        self.b = 0
        self.X = X
        self.Y = Y
        # gradient descent learning
        for i in range(self.iterations):
            self.update_weights()
        return self
    # Helper function to update weights in gradient descent
    def update_weights(self):
        Y_pred = self.predict(self.X)
        # calculate gradients (the 2 * l2_penalty * W term comes from the L2 penalty)
        dW = (-(2 * (self.X.T).dot(self.Y - Y_pred)) + (2 * self.l2_penalty * self.W)) / self.m
        db = -2 * np.sum(self.Y - Y_pred) / self.m
        # update weights
        self.W = self.W - self.learning_rate * dW
        self.b = self.b - self.learning_rate * db
        return self
    # Hypothesis function h(x)
    def predict(self, X):
        return X.dot(self.W) + self.b
# Driver code
def main():
    # Importing dataset
    df = pd.read_csv("salary_data.csv")
    X = df.iloc[:, :-1].values
    Y = df.iloc[:, 1].values
    # Splitting dataset into train and test set
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1/3, random_state=0)
    # Model training
    model = RidgeRegression(iterations=1000, learning_rate=0.01, l2_penalty=1)
    model.fit(X_train, Y_train)
    # Prediction on test set
    Y_pred = model.predict(X_test)
    print("Predicted values ", np.round(Y_pred[:3], 2))
    print("Real values      ", Y_test[:3])
    print("Trained W        ", round(model.W[0], 2))
    print("Trained b        ", round(model.b, 2))
    # Visualization on test set
    plt.scatter(X_test, Y_test, color='blue')
    plt.plot(X_test, Y_pred, color='orange')
    plt.title('Salary vs Experience')
    plt.xlabel('Years of Experience')
    plt.ylabel('Salary')
    plt.show()
if __name__ == "__main__":
    main()
Python Implementation of Lasso Regression from scratch
Dataset used in this implementation is salary_data.csv. It has 2 columns — “YearsExperience” and “Salary” — for 30 employees in a company. We will train a Lasso Regression model to learn the correlation between an employee's years of experience and their salary. Once the model is trained, we will be able to predict the salary of an employee on the basis of their years of experience.
# Importing libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Lasso Regression
class LassoRegression():
    def __init__(self, learning_rate, iterations, l1_penalty):
        self.learning_rate = learning_rate
        self.iterations = iterations
        self.l1_penalty = l1_penalty
    # Function for model training
    def fit(self, X, Y):
        # no_of_training_examples, no_of_features
        self.m, self.n = X.shape
        # weight initialization
        self.W = np.zeros(self.n)
        self.b = 0
        self.X = X
        self.Y = Y
        # gradient descent learning
        for i in range(self.iterations):
            self.update_weights()
        return self
    # Helper function to update weights in gradient descent
    def update_weights(self):
        Y_pred = self.predict(self.X)
        # calculate gradients (subgradient of the L1 term: +/- l1_penalty by sign of W[j])
        dW = np.zeros(self.n)
        for j in range(self.n):
            if self.W[j] > 0:
                dW[j] = (-(2 * (self.X[:, j]).dot(self.Y - Y_pred)) + self.l1_penalty) / self.m
            else:
                dW[j] = (-(2 * (self.X[:, j]).dot(self.Y - Y_pred)) - self.l1_penalty) / self.m
        db = -2 * np.sum(self.Y - Y_pred) / self.m
        # update weights
        self.W = self.W - self.learning_rate * dW
        self.b = self.b - self.learning_rate * db
        return self
    # Hypothesis function h(x)
    def predict(self, X):
        return X.dot(self.W) + self.b
# Driver code
def main():
    # Importing dataset
    df = pd.read_csv("salary_data.csv")
    X = df.iloc[:, :-1].values
    Y = df.iloc[:, 1].values
    # Splitting dataset into train and test set
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=1/3, random_state=0)
    # Model training
    model = LassoRegression(iterations=1000, learning_rate=0.01, l1_penalty=500)
    model.fit(X_train, Y_train)
    # Prediction on test set
    Y_pred = model.predict(X_test)
    print("Predicted values ", np.round(Y_pred[:3], 2))
    print("Real values      ", Y_test[:3])
    print("Trained W        ", round(model.W[0], 2))
    print("Trained b        ", round(model.b, 2))
    # Visualization on test set
    plt.scatter(X_test, Y_test, color='blue')
    plt.plot(X_test, Y_pred, color='orange')
    plt.title('Salary vs Experience')
    plt.xlabel('Years of Experience')
    plt.ylabel('Salary')
    plt.show()
if __name__ == "__main__":
    main()
Comparing Linear Regression, Ridge Regression, and Lasso Regression
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
boston_dataset = datasets.load_boston()  # note: removed in scikit-learn 1.2; requires an older version
boston_pd = pd.DataFrame(boston_dataset.data)
boston_pd.columns = boston_dataset.feature_names
boston_pd_target = np.asarray(boston_dataset.target)
boston_pd['House Price'] = pd.Series(boston_pd_target)
print(boston_pd)
x=boston_pd.iloc[:,:-1]
y=boston_pd.iloc[:,-1]
x_train,x_test,y_train,y_test=train_test_split(boston_pd.iloc[:,:-1],boston_pd.iloc[:,-1],test_size=0.25)
print("train data shape",x_train.shape,y_train.shape)
print("test data shape",x_test.shape,y_test.shape)
## Linear Regression
lreg=LinearRegression()
lreg.fit(x_train,y_train)
lreg_y_pred=lreg.predict(x_test)
msqrerror=np.mean((lreg_y_pred-y_test)**2)
print('Mean sqrd error on test data set',msqrerror)
lreg_coefficient=pd.DataFrame()
lreg_coefficient['columns']=x_train.columns
lreg_coefficient['cof estimates']=pd.Series(lreg.coef_)
print(lreg_coefficient)
## Ridge Regression
from sklearn.linear_model import Ridge
ridger=Ridge(alpha=1)
ridger.fit(x_train,y_train)
ridge_y_pred=ridger.predict(x_test)
msqrerrorridge=np.mean((ridge_y_pred-y_test)**2)
print('Mean sqrd error on test data set',msqrerrorridge)
ridgereg_coefficient=pd.DataFrame()
ridgereg_coefficient['columns']=x_train.columns
ridgereg_coefficient['cof estimates']=pd.Series(ridger.coef_)
print(ridgereg_coefficient)
## Lasso Regression
from sklearn.linear_model import Lasso
lasso=Lasso(alpha=1)
lasso.fit(x_train,y_train)
lasso_y_pred=lasso.predict(x_test)
msqrerrorlasso=np.mean((lasso_y_pred-y_test)**2)
print('Mean sqrd error on test data set',msqrerrorlasso)
lassoreg_coefficient=pd.DataFrame()
lassoreg_coefficient['columns']=x_train.columns
lassoreg_coefficient['cof estimates']=pd.Series(lasso.coef_)
print(lassoreg_coefficient)
Elastic Net:
Elastic Net is a combination of L1 and L2 regularization; it includes both the absolute values and the squares of the coefficients in the cost function. It has two hyperparameters: one controlling the strength of the L1 term and one controlling the strength of the L2 term.
Elastic Net combines the benefits of both L1 and L2 regularization and is useful in situations where both feature selection and preventing overfitting are important.
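The following is a minimal sketch using scikit-learn's ElasticNet (synthetic data assumed); in this API, alpha scales the total penalty and l1_ratio balances the L1 part against the L2 part:
import numpy as np
from sklearn.linear_model import ElasticNet
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)
model = ElasticNet(alpha=0.1, l1_ratio=0.5)  # half L1, half L2
model.fit(X, y)
print("weights:", model.coef_)  # the L1 part can drive weak features to exactly zero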