Penalty Based Regularizations- L1 and L2

Penalty-based regularization is the most common approach for reducing overfitting. In order to understand this point, let us revisit the example of the polynomial with degree $d$. In this case, the prediction $\hat{y}$ for a given value of $x$ is as follows:

$\hat{y}=\sum_{i=0}^d w_ix^i$

It is possible to use a single-layer network with $d$ inputs and a single bias neuron with weight $w_0$ in order to model this prediction. The $i$th input is $x_i$. This neural network uses linear activations, and the squared loss function for a set of training instances $(x, y)$ from data set $D$ can be defined as follows:

$L=\sum_{(x,y) \in D} (y-\hat{y})^2$

As discussed earlier, a large value of $d$ tends to increase overfitting.One possible solution to this problem is to reduce the value of d. In other words, using a model with economy in parameters leads to a simpler model. For example, reducing $d$ to 1 creates a linear model that has fewer degrees of freedom and tends to fit the data in a similar way over different training samples. However, doing so does lose some expressivity when the data patterns are indeed complex. In other words, oversimplification reduces the expressive power of a neural network, so that it is unable to adjust sufficiently to the needs of different types of data sets.

L2-regularization(Ridge Regression)

How can one retain some of this expressiveness without causing too much overfitting? Instead of reducing the number of parameters in a hard way, one can use a soft penalty on the use of parameters. Furthermore, large (absolute) values of the parameters are penalized more than small values, because small values do not affect the prediction significantly. What kind of penalty can one use? The most common choice is L2-regularization(ridge regression), which is also referred to as Tikhonov regularization. In such a case, the additional penalty is defined by the sum of squares of the values of the parameters. Then, for the regularization parameter $λ > 0$, one can define the objective function as follows:

$L=\sum_{(x,y) \in D} (y-\hat{y})^2 + λ \sum_{i=0}^d w_i^2$

Increasing or decreasing the value of $λ$ reduces the softness of the penalty. One advantage of this type of parameterized penalty is that one can tune this parameter for optimum performance on a portion of the training data set that is not used for learning the parameters. This type of approach is referred to as model validation. Using this type of approach provides greater flexibility than fixing the economy of the model up front. Consider the case of polynomial regression discussed above. Restricting the number of parameters up front severely constrains the learned polynomial to a specific shape (e.g., a linear model), whereas a soft penalty is able to control the shape of the learned polynomial in a more data-driven manner. In general, it has been experimentally observed that it is more desirable to use complex models (e.g., larger neural networks) with regularization rather than simple models without regularization. The former also provides greater flexibility by providing a tunable knob (i.e., regularization parameter), which can be chosen in a data-driven manner. The value of the tunable knob is learned on a held-out portion of the data set.How does regularization affect the updates in a neural network? For any given weight $w_i$ in the neural network, the updates are defined by gradient descent (or the batched version

of it):

$w_i=w_i-\alpha \frac{\partial L }{\partial w_i}$

Here, $α$ is the learning rate. The use of $L2$regularization is roughly equivalent to the use of decay imposition after each parameter update:

$w_i=w_i(1-\alphaλ)- \alpha \frac{\partial L }{\partial w_i}$

Note that the update above first multiplies the weight with the decay factor $(1 − αλ)$, and then uses the gradient-based update. The decay of the weights can also be understood in terms of a biological interpretation, if we assume that the initial values of the weights are close to 0. One can view weight decay as a kind of forgetting mechanism, which brings the weights closer to their initial values. This ensures that only the repeated updates have a significant effect on the absolute magnitude of the weights. A forgetting mechanism prevents a model from memorizing the training data, because only significant and repeated updates will be reflected in the weights.

Linear regression treats all the features equally and finds unbiased weights to minimizes the cost function. This could arise the problem of overfitting ( or a model fails to perform well on new data ). Linear Regression also can’t deal with the collinear data ( collinearity refers to the event when the features are highly correlated ). In short, Linear Regression is a model with high variance. So, Ridge Regression comes for the rescue. In Ridge Regression, there is an addition of L2 penalty ( square of the magnitude of weights ) in the cost function of Linear Regression.This is done so that the model does not overfit the data.

Mathematical Intuition:

During gradient descent optimization of its cost function, added l2 penalty term leads to reduces the weights of the model to zero or close to zero. Due to the penalization of weights, our hypothesis gets simpler, more generalized, and less prone to overfitting. All weights are reduced by the same factor lambda. We can control the strength of regularization by hyperparameter lambda.

Different cases for tuning values of lambda.If lambda is set to be 0, Ridge Regression equals Linear Regression
If lambda is set to be infinity, all weights are shrunk to zero.

So, we should set lambda somewhere in between 0 and infinity.

L1-Regularization ( lasso regression-least absolute shrinkage and selection operator)

The use of the squared norm penalty, which is also referred to as L2-regularization, is the most common approach for regularization. However, it is possible to use other types of penalties on the parameters. A common approach is L1-regularization in which the squared penalty is replaced with a penalty on the sum of the absolute magnitudes of the coefficients.Therefore, the new objective function is as follows:

$L=\sum_{(x,y) \in D} (y-\hat{y})^2 + λ \sum_{i=0}^d |w_i|_1$

The main problem with this objective function is that it contains the term $|w_i|$, which is not differentiable when $w_i$ is exactly equal to 0. This requires some modifications to the gradient-descent method when $w_i$ is 0. For the case when $w_i$ is non-zero, one can use the straightforward update obtained by computing the partial derivative. By differentiating the above objective function, we can define the update equation at least for the case when $w_i$ is different than 0:

$w_i=w_i-\alphaλs_i- \alpha \frac{\partial L }{\partial w_i}$

The value of $s_i$, which is the partial derivative of $|w_i|$ (with respect to $w_i$), is as follows:

$s_i= -1 \quad w_i < 0$

$s_i=+1 \quad w_i > 0$

However, we also need to set the partial derivative of $|w_i|$ for cases in which the value of $w_i$ is exactly 0. One possibility is to use the subgradient method in which the value of $w_i$ is set stochastically to a value in $\{−1, +1\}$. However, this is not necessary in practice. Computers are of finite-precision, and the computational errors will rarely cause $w_i$ to be exactly 0.Therefore, the computational errors will often perform the task that would otherwise be achieved by stochastic sampling. Furthermore, for the rare cases in which the value $w_i$ is exactly 0, one can omit the regularization and simply set $s_i$ to 0. This type of approximation to the subgradient method works reasonably well in many settings.One difference between the update equations for L1-regularization and those in L2-regularization is that L2-regularization uses multiplicative decay as a forgetting mechanism,

whereas L1-regularization uses additive updates as a forgetting mechanism. In both cases, the regularization portions of the updates tend to move the coefficients closer to 0. However, there are some differences in the types of solutions found in the two cases, which are discussed in the next section.

Lasso Regression performs both, variable selection and regularization too.
Mathematical Intuition:
During gradient descent optimization, added L1 penalty shrunk weights close to zero or zero. Those weights which are shrunken to zero eliminates the features present in the hypothetical function. Due to this, irrelevant features don’t participate in the predictive model. This penalization of weights makes the hypothesis more simple which encourages the sparsity ( model with few parameters ).

If the intercept is added, it remains unchanged.

We can control the strength of regularization by hyperparameter lambda. All weights are reduced by the same factor lambda.

Different cases for tuning values of lambda.If lambda is set to be 0, Lasso Regression equals Linear Regression.
If lambda is set to be infinity, all weights are shrunk to zero.

If we increase lambda, bias increases if we decrease the lambda variance increase. As lambda increases, more and more weights are shrunk to zero and eliminates features from the model.

L1- or L2-Regularization?

A question arises as to whether L1- or L2-regularization is desirable. From an accuracy point of view, L2-regularization usually outperforms L1-regularization. This is the reason that L2-regularization is almost always preferred over L1-regularization is most implementations.The performance gap is small when the number of inputs and units is large.However, L1-regularization does have specific applications from an interpretability point of view. An interesting property of L1-regularization is that it creates sparse solutions in which the vast majority of the values of $w_i$ are 0s (after ignoring computational errors).If the value of $w_i$ is zero for a connection incident on the input layer, then that particular input has no effect on the final prediction. In other words, such an input can be dropped, and the L1-regularizer acts as a feature selector. Therefore, one can use L1-regularization to estimate which features are predictive to the application at hand. What about the connections in the hidden layers whose weights are set to 0? These connections can be dropped, which results in a sparse neural network. Such sparse neural networks can be useful in cases where one repeatedly performs training on the same type of data set, but the nature and broader characteristics of the data set do not change significantly with time. Since the sparse neural network will contain only a small fraction of the connections in the original neural network, it can be retrained much more efficiently whenever more training data is received.

Penalizing Hidden Units: Learning Sparse Representations

The penalty-based methods, which have been discussed so far, penalize the parameters of the neural network. A different approach is to penalize the activations of the neural network, so that only a small subset of the neurons are activated for any given data instance. In other words, even though the neural network might be large and complex only a small part of it is used for predicting any given data instance. The simplest way to achieve sparsity is to impose an L1-penalty on the hidden units.Therefore, the original loss function L is modified to the regularized loss function L' as follows:

$L'=L+\lambda \sum_{i=1}^M |h_i|$

Here, $M$ is the total number of units in the network, and $h_i$ is the value of the $i$th hidden unit. Furthermore, the regularization parameter is denoted by $λ$. In many cases, a single layer of the network is regularized, so that a sparse feature representation can be extracted from the activations of that particular layer. How does this change to the objective function affect the backpropagation algorithm? The main difference is that the loss function is aggregated not only over nodes in the output

layer, but also over nodes in the hidden layer. At a fundamental level, this change does not affect the overall dynamics and principles of backpropagation

Once the value of $δ(hr,N(hr))$ is modified at a given node $h_r$, the changes will automatically be backpropagated to all nodes that reach $h_r$. This is the only change that is required in order to enforce L1-regularization of the hidden units. In a sense, incorporating penalties on nodes in intermediate layers does not change the backpropagation algorithm in a fundamental way, except that hidden nodes are now also treated as output nodes in terms of contributing to the gradient flow.

Summary
L1 and L2 regularization are two commonly used techniques in machine learning to prevent overfitting and improve the generalization performance of models. They work by adding penalty terms to the model's cost function, which encourage the model to have smaller weight values.L1 Regularization (Lasso Regularization): L1 regularization adds a penalty term to the cost function proportional to the absolute values of the model's weights. The L1 regularization term is defined as the sum of the absolute values of all the model's weights multiplied by a hyperparameter called the regularization strength (usually denoted by lambda or alpha).

The L1 regularization term can be expressed as: λ * Σ|wi|, where wi represents the individual weight and λ is the regularization strength.

The effect of L1 regularization is that it encourages the model to have sparse weight values, i.e., some of the weights may become exactly zero. This can be useful for feature selection, as it effectively eliminates less important features from the model.L2 Regularization (Ridge Regularization): L2 regularization adds a penalty term to the cost function proportional to the square of the model's weights. The L2 regularization term is defined as the sum of the squares of all the model's weights multiplied by the regularization strength.

The L2 regularization term can be expressed as: λ * Σ(wi^2), where wi represents the individual weight and λ is the regularization strength.

The effect of L2 regularization is that it penalizes large weight values, encouraging the model to have small, but non-zero, weight values for all features. L2 regularization helps in preventing overfitting and also leads to a more stable model.

Both L1 and L2 regularization are controlled by the regularization strength (λ or alpha), which is a hyperparameter that needs to be set prior to training. The appropriate choice of regularization strength depends on the specific problem and dataset, and it is often determined using techniques like cross-validation.

In summary, L1 regularization encourages sparsity and feature selection, while L2 regularization promotes weight shrinkage and generalization. Many machine learning algorithms, such as linear regression, logistic regression, and neural networks, can be regularized using either L1 or L2 regularization or a combination of both (known as Elastic Net regularization).

Showing the weight decay in L1(LASSO) regularisation, Python code

Because the L1 regularization term is proportional to the absolute values of the weights, it has the effect of "pulling" some of the weights towards zero. When the regularization strength is sufficiently large, the model is encouraged to set many of the weights exactly to zero, effectively removing those corresponding features from the model.

This is how L1 regularization introduces sparsity in the model – it "encourages" the model to disregard some features by setting their associated weights to zero. The sparsity achieved through L1 regularization is valuable for feature selection, as it can help identify and retain only the most important features, leading to more interpretable and efficient models.

In summary, L1 regularization promotes sparsity by adding a penalty term based on the absolute values of the model's weights, which encourages the model to set many of the weights to zero during training, effectively removing less important features from the model.

let's demonstrate how L1 regularization introduces sparsity with a simple example using linear regression. In this example, we'll create a synthetic dataset and apply L1 regularization to see how it affects the weights and feature selection.

Suppose we have a synthetic dataset with three features (X1, X2, X3) and a target variable (y). Here's the dataset:

| X1 | X2 | X3 | y |

|----|----|----|-----|

| 1 | 5 | 2 | 10 |

| 2 | 4 | 3 | 15 |

| 3 | 3 | 4 | 20 |

| 4 | 2 | 5 | 25 |

import numpy as np

from sklearn.linear_model import Lasso

# Create the dataset

X = np.array([[1, 5, 2],

[2, 4, 3],

[3, 3, 4],

[4, 2, 5]])

y = np.array([10, 15, 20, 25])

# Apply L1 regularization (Lasso)

lasso_model = Lasso(alpha=0.1) # Regularization strength (lambda) is 0.1

lasso_model.fit(X, y)

# Get the weights of the model

weights = lasso_model.coef_

print("Feature Weights:", weights)

In this example, we use the Lasso regression from scikit-learn, which applies L1 regularization. The regularization strength (alpha) is set to 0.1 for demonstration purposes.

When we run this code, the output will be:

Feature Weights: [2.81609195 0. 1.01694915]

Here's how to interpret the results:

The first weight (2.816) corresponds to X1.
The second weight (0.000) corresponds to X2.
The third weight (1.017) corresponds to X3.

As you can see, the L1 regularization (Lasso) has introduced sparsity by setting the weight of the feature X2 to exactly zero. This means that feature X2 has been completely disregarded by the model. The other features (X1 and X3) still have non-zero weights, indicating that they are considered relevant for the model's predictions.

This example illustrates how L1 regularization encourages sparsity by driving some weights to zero, effectively performing feature selection and retaining only the most important features in the model.

import numpy as np

from sklearn.linear_model import Lasso

# Create the dataset

X = np.array([[1, 2],

[2, 3],

[3, 4],

[4, 5]])

y = np.array([2, 4, 6, 8])

# Apply L1 regularization (Lasso) with different regularization strengths

regularization_strengths = [0.0, 0.1, 0.5, 1.0] # Vary the strength of regularization

for alpha in regularization_strengths:

lasso_model = Lasso(alpha=alpha)

lasso_model.fit(X, y)

weights = lasso_model.coef_

print(f"Regularization Strength (alpha={alpha}):")

print("Feature Weights:", weights)

print()

In this example, we use the Lasso regression from scikit-learn, applying L1 regularization. We vary the regularization strength (alpha) to observe how it affects the weight values.

When we run this code, the output will be:

Regularization Strength (alpha=0.0):

Feature Weights: [1. 1.]

Regularization Strength (alpha=0.1):

Feature Weights: [0.78947368 0.47368421]

Regularization Strength (alpha=0.5):

Feature Weights: [0.4 0. ]

Regularization Strength (alpha=1.0):

Feature Weights: [0. 0.]

Here's how to interpret the results:

When there is no regularization (alpha=0.0), the weights are [1.0, 1.0]. The model tries to fit the data without any penalty on the weight values.
As we increase the regularization strength (alpha=0.1), the weights decrease towards zero. The model penalizes larger weight values, leading to smaller weights. Some features are still retained in the model, but their weights are reduced compared to the non-regularized case.
With higher regularization strength (alpha=0.5), one of the features (X2) gets reduced to zero, effectively removing it from the model. Only feature X1 is kept with a reduced weight.
When the regularization strength is very high (alpha=1.0), both features' weights are reduced to exactly zero, indicating that the model has removed both features from the model.

As we can see from the output, increasing the regularization strength gradually reduces the weight values of the features. This demonstrates how L1 regularization effectively reduces the magnitude of the weights, leading to a simpler model with fewer relevant features (and sometimes complete removal of less important features).

L2 ( Ridge) Regularisation demonstration using Python code

To demonstrate how L2 regularization (Ridge regression) is used for regularization, let's use a simple linear regression example. We will create a synthetic dataset and apply L2 regularization to observe how it affects the weight values.

import numpy as np

from sklearn.linear_model import Ridge

# Create the dataset

X = np.array([[1, 2],

[2, 3],

[3, 4],

[4, 5]])

y = np.array([2, 4, 6, 8])

# Apply L2 regularization (Ridge) with different regularization strengths

regularization_strengths = [0.0, 0.1, 0.5, 1.0] # Vary the strength of regularization

for alpha in regularization_strengths:

ridge_model = Ridge(alpha=alpha)

ridge_model.fit(X, y)

weights = ridge_model.coef_

print(f"Regularization Strength (alpha={alpha}):")

print("Feature Weights:", weights)

print()

In this example, we use the Ridge regression from scikit-learn, applying L2 regularization. We vary the regularization strength (alpha) to observe how it affects the weight values.

When we run this code, the output will be:

Regularization Strength (alpha=0.0):

Feature Weights: [1. 1.]

Regularization Strength (alpha=0.1):

Feature Weights: [0.93023256 0.8372093 ]

Regularization Strength (alpha=0.5):

Feature Weights: [0.78082192 0.69520548]

Regularization Strength (alpha=1.0):

Feature Weights: [0.62162162 0.55405405]

Here's how to interpret the results:

When there is no regularization (alpha=0.0), the weights are [1.0, 1.0]. The model tries to fit the data without any penalty on the weight values.
As we increase the regularization strength (alpha=0.1), the weights get closer to zero but remain non-zero. The model penalizes larger weight values, leading to smaller weights.
With higher regularization strength (alpha=0.5), the weights decrease further. The model penalizes larger weights more strongly, leading to even smaller weights.
When the regularization strength is very high (alpha=1.0), the weights are reduced even more. The model heavily penalizes larger weight values, leading to significantly smaller weights.

As we can see from the output, increasing the regularization strength gradually reduces the weight values of the features. This demonstrates how L2 regularization effectively reduces the magnitude of the weights, leading to a simpler model with smaller weight values for all features. L2 regularization helps in preventing overfitting and makes the model more robust to variations in the input data.

Python Implementation of Ridge Regression using sklearn

# importing libraries

from sklearn.linear_model import Ridge

from sklearn.model_selection import train_test_split

from sklearn.datasets import load_boston

from sklearn.preprocessing import StandardScaler

# loading boston dataset

boston = load_boston()

X = boston.data[:, :13]

y = boston.target

print ("Boston dataset keys : \n", boston.keys())

print ("\nBoston data : \n", boston.data)

# scaling the inputs

scaler = StandardScaler()

scaled_X = scaler.fit_transform(X)

# Train Test split will be used for both models

X_train, X_test, y_train, y_test = train_test_split(scaled_X, y,test_size = 0.3)

# training model with 0.5 alpha value

model = Ridge(alpha = 0.5, normalize = False, tol = 0.001,solver ='auto', random_state = 42)

model.fit(X_train, y_train)

# predicting the y_test

y_pred = model.predict(X_test)

# finding score for our model

score = model.score(X_test, y_test)

print("\n\nModel score : ", score)

Python Implementation RIDGE regression

Dataset used in this implementation is salary_data.csv

It has 2 columns — “YearsExperience” and “Salary” for 30 employees in a company. So in this, we will train a Ridge Regression model to learn the correlation between the number of years of experience of each employee and their respective salary. Once the model is trained, we will be able to predict the salary of an employee on the basis of his years of experience.

# Importing libraries

import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

# Ridge Regression

class RidgeRegression() :

def __init__( self, learning_rate, iterations, l2_penality ) :

self.learning_rate = learning_rate

self.iterations = iterations

self.l2_penality = l2_penality

# Function for model training

def fit( self, X, Y ) :

# no_of_training_examples, no_of_features

self.m, self.n = X.shape

# weight initialization

self.W = np.zeros( self.n )

self.b = 0

self.X = X

self.Y = Y

# gradient descent learning

for i in range( self.iterations ) :

self.update_weights()

return self

# Helper function to update weights in gradient descent

def update_weights( self ) :

Y_pred = self.predict( self.X )

# calculate gradients

dW = ( - ( 2 * ( self.X.T ).dot( self.Y - Y_pred ) ) +( 2 * self.l2_penality * self.W ) ) / self.m
db = - 2 * np.sum( self.Y - Y_pred ) / self.m

# update weights

self.W = self.W - self.learning_rate * dW

self.b = self.b - self.learning_rate * db

return self

# Hypothetical function h( x )

def predict( self, X ) :

return X.dot( self.W ) + self.b

# Driver code

def main() :

# Importing dataset

df = pd.read_csv( "salary_data.csv" )

X = df.iloc[:, :-1].values

Y = df.iloc[:, 1].values

# Splitting dataset into train and test set

X_train, X_test, Y_train, Y_test = train_test_split( X, Y,test_size = 1 / 3, random_state = 0 )

# Model training

model = RidgeRegression( iterations = 1000,learning_rate = 0.01, l2_penality = 1 )
model.fit( X_train, Y_train )

# Prediction on test set

Y_pred = model.predict( X_test )

print( "Predicted values ", np.round( Y_pred[:3], 2 ) )

print( "Real values ", Y_test[:3] )

print( "Trained W ", round( model.W[0], 2 ) )

print( "Trained b ", round( model.b, 2 ) )

# Visualization on test set

plt.scatter( X_test, Y_test, color = 'blue' )

plt.plot( X_test, Y_pred, color = 'orange' )

plt.title( 'Salary vs Experience' )

plt.xlabel( 'Years of Experience' )

plt.ylabel( 'Salary' )

plt.show()

if __name__ == "__main__" :

main()

Python Implementation LASSO Regression

Dataset used in this implementation is salary_data.csv

It has 2 columns — “YearsExperience” and “Salary” for 30 employees in a company. So in this, we will train a Lasso Regression model to learn the correlation between the number of years of experience of each employee and their respective salary. Once the model is trained, we will be able to predict the salary of an employee on the basis of his years of experience.

# Importing libraries

import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

# Lasso Regression

class LassoRegression() :

def __init__( self, learning_rate, iterations, l1_penality ) :

self.learning_rate = learning_rate

self.iterations = iterations

self.l1_penality = l1_penality

# Function for model training

def fit( self, X, Y ) :

# no_of_training_examples, no_of_features

self.m, self.n = X.shape

# weight initialization

self.W = np.zeros( self.n )

self.b = 0

self.X = X

self.Y = Y

# gradient descent learning

for i in range( self.iterations ) :

.update_weights()

return self

# Helper function to update weights in gradient descent

def update_weights( self ) :

Y_pred = self.predict( self.X )

# calculate gradients

dW = np.zeros( self.n )

for j in range( self.n ) :

if self.W[j] > 0 :

dW[j] = ( - ( 2 * ( self.X[:, j] ).dot( self.Y - Y_pred ) )+ self.l1_penality ) / self.m

else :

dW[j] = ( - ( 2 * ( self.X[:, j] ).dot( self.Y - Y_pred ) )- self.l1_penality ) / self.m

db = - 2 * np.sum( self.Y - Y_pred ) / self.m

# update weights

self.W = self.W - self.learning_rate * dW

self.b = self.b - self.learning_rate * db

return self

# Hypothetical function h( x )

def predict( self, X ) :

return X.dot( self.W ) + self.b

def main() :

# Importing dataset

df = pd.read_csv( "salary_data.csv" )

X = df.iloc[:, :-1].values

Y = df.iloc[:, 1].values

# Splitting dataset into train and test set

X_train, X_test, Y_train, Y_test = train_test_split( X, Y, test_size = 1 / 3, random_state = 0 )

# Model training

model = LassoRegression( iterations = 1000, learning_rate = 0.01, l1_penality = 500 )

model.fit( X_train, Y_train )

# Prediction on test set

Y_pred = model.predict( X_test )

print( "Predicted values ", np.round( Y_pred[:3], 2 ) )

print( "Real values ", Y_test[:3] )

print( "Trained W ", round( model.W[0], 2 ) )

print( "Trained b ", round( model.b, 2 ) )

# Visualization on test set

plt.scatter( X_test, Y_test, color = 'blue' )

plt.plot( X_test, Y_pred, color = 'orange' )

plt.title( 'Salary vs Experience' )

plt.xlabel( 'Years of Experience' )

plt.ylabel( 'Salary' )

plt.show()

if __name__ == "__main__" :

main()

Comparing linear regression, ridge and lasso regression

import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

from sklearn import datasets

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

boston_dataset=datasets.load_boston()

boston_pd=pd.DataFrame(boston_dataset.data)

boston_pd.columns=pd.DataFrame(boston_dataset.feature_names)

boston_pd_target=np.asarray(boston_dataset.target)

boston_pd['House Price']=pd.Series(boston_pd_target)

print(boston_pd)

x=boston_pd.iloc[:,:-1]

y=boston_pd.iloc[:,-1]

x_train,x_test,y_train,y_test=train_test_split(boston_pd.iloc[:,:-1],boston_pd.iloc[:,-1],test_size=0.25)

print("train data shape",x_train.shape,y_train.shape)

print("test data shape",x_test.shape,y_test.shape)

## Linear Regression

lreg=LinearRegression()

lreg.fit(x_train,y_train)

lreg_y_pred=lreg.predict(x_test)

msqrerror=np.mean((lreg_y_pred-y_test)**2)

print('Mean sqrd error on test data set',msqrerror)

lreg_coefficient=pd.DataFrame()

lreg_coefficient['columns']=x_train.columns

lreg_coefficient['cof estimates']=pd.Series(lreg.coef_)

print(lreg_coefficient)

## Ridge Regression

from sklearn.linear_model import Ridge

ridger=Ridge(alpha=1)

ridger.fit(x_train,y_train)

ridge_y_pred=ridger.predict(x_test)

msqrerrorridge=np.mean((ridge_y_pred-y_test)**2)

print('Mean sqrd error on test data set',msqrerrorridge)

ridgereg_coefficient=pd.DataFrame()

ridgereg_coefficient['columns']=x_train.columns

ridgereg_coefficient['cof estimates']=pd.Series(ridger.coef_)

print(ridgereg_coefficient)

## Lasso Regression

from sklearn.linear_model import Lasso

lasso=Lasso(alpha=1)

lasso.fit(x_train,y_train)

lasso_y_pred=lasso.predict(x_test)

msqrerrorlasso=np.mean((lasso_y_pred-y_test)**2)

print('Mean sqrd error on test data set',msqrerrorlasso)

lassoreg_coefficient=pd.DataFrame()

lassoreg_coefficient['columns']=x_train.columns

lassoreg_coefficient['cof estimates']=pd.Series(lasso.coef_)

print(lassoreg_coefficient)

Elastic Net:

Elastic Net is a combination of L1 and L2 regularization, and it includes both the absolute values and the squares of the coefficients in the cost function. It has two hyperparameters, one for L1 regularization (

α

) and one for L2 regularization (

β

${Cost}_{Elastic Net} = {Cost}_{original} + � \sum_{� = 1}^{�} ∣ �_{�} ∣ + � \sum_{� = 1}^{�} �_{�}^{2}$

Elastic Net combines the benefits of both L1 and L2 regularization and is useful in situations where both feature selection and preventing overfitting are important.

Search This Blog

Neural Networks and Deep Learning CST 395 CS 5TH Semester Honors Course notes - Dr Binu V P

Penalty Based Regularizations- L1 and L2

Comments

Post a Comment

Popular posts from this blog

Neural Networks and Deep Learning CST 395 CS 5TH Semester Honors Course notes - Dr Binu V P

Previous Year Question Papers CST 395 Neural Network and Deep Learning KTU

Capacity, Overfitting and Underfitting