Differentiation of the Sigmoid activation and cross-entropy loss function

 


A step-by-step differentiation of the Sigmoid activation and the cross-entropy loss function is discussed here.
Understanding the derivatives of these two functions is essential in machine learning when performing back-propagation during model training.

Derivative of Sigmoid Function

The Sigmoid (Logistic) function is defined as:
$g(x)=\frac{1}{1+e^{-x}} \in (0,1)$

For any value of $x$, the Sigmoid function $g(x)$ falls in the range $(0, 1)$. As $x$ decreases, $g(x)$ approaches 0, whereas as $x$ increases, $g(x)$ tends to 1. For example,

$g(-5.5)=0.0041$
$g(6.5)=0.9985$
$g(0.4)=0.5987$
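
Differentiating $g(x) = (1+e^{-x})^{-1}$ using the chain rule gives

$g'(x) = -(1+e^{-x})^{-2} \cdot \frac{d}{dx}\left(1+e^{-x}\right) = -(1+e^{-x})^{-2} \cdot (-e^{-x}) = \frac{e^{-x}}{(1+e^{-x})^{2}}$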

It is noted that we can further simplify the derivative and write it in terms of $g(x)$:
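
Substituting $e^{-x} = (1+e^{-x}) - 1$ in the numerator:

$g'(x) = \frac{e^{-x}}{(1+e^{-x})^{2}} = \frac{1}{1+e^{-x}} \cdot \frac{(1+e^{-x}) - 1}{1+e^{-x}} = g(x)\left(1 - \frac{1}{1+e^{-x}}\right) = g(x)\left(1 - g(x)\right)$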


Why do we use this version of the derivative?

In the forward propagation step, you compute the sigmoid function $g(x)$ and have its value handy. While computing the derivative in the backpropagation step, all you have to do is plug the value of $g(x)$ into the formula derived above.
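
A minimal sketch of this idea in Python; the helper names and the sample inputs are illustrative, not from any particular library:

import numpy as np

def sigmoid(x):
    # g(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative_from_output(g_of_x):
    # Reuses the cached forward-pass value: g'(x) = g(x) * (1 - g(x))
    return g_of_x * (1.0 - g_of_x)

x = np.array([-5.5, 0.4, 6.5])
activation = sigmoid(x)                             # computed (and stored) in the forward pass
grad = sigmoid_derivative_from_output(activation)   # reused in the backward pass
print(activation)   # ~ [0.0041 0.5987 0.9985]
print(grad)         # ~ [0.0041 0.2402 0.0015]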


Here is the plot of the sigmoid function and its derivative.
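
A minimal sketch to reproduce such a plot, assuming numpy and matplotlib are available:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 400)
g = 1.0 / (1.0 + np.exp(-x))   # sigmoid g(x)
g_prime = g * (1.0 - g)        # its derivative g(x)(1 - g(x))

plt.plot(x, g, label="g(x)")
plt.plot(x, g_prime, label="g'(x)")
plt.xlabel("x")
plt.legend()
plt.show()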

Derivative of Cross-Entropy Function

The Cross-Entropy loss function is a very important cost function used for classification problems. The concept of cross-entropy traces back to the field of Information Theory, where Claude Shannon introduced the concept of entropy in 1948. Before diving into the Cross-Entropy cost function, let us first introduce entropy.

Entropy

Entropy of a random variable $X$ is the level of uncertainty inherent in the variable's possible outcomes.

For a random variable $X$ with probability distribution $p(x)$, entropy is defined as follows

$H(X) = -\sum_{x} p(x) \ln(p(x))$

For example, a fair coin with $p(\text{heads}) = p(\text{tails}) = 0.5$ has entropy $H(X) = -(0.5\ln 0.5 + 0.5\ln 0.5) = \ln 2 \approx 0.693$.

The Cross-Entropy loss function is also called logarithmic loss, log loss or logistic loss. Each predicted class probability is compared with the desired output for the actual class (0 or 1), and a score/loss is calculated that penalizes the probability based on how far it is from the expected value. The penalty is logarithmic in nature: differences close to 1 yield a large loss, while differences tending to 0 yield a small loss. For example, predicting a probability of 0.9 for the true class gives a loss of $-\ln(0.9) \approx 0.105$, whereas predicting 0.1 gives $-\ln(0.1) \approx 2.303$.

Cross-entropy loss is used when adjusting model weights during training. The aim is to minimize the loss, i.e., the smaller the loss, the better the model. A perfect model has a cross-entropy loss of 0.

Cross-entropy is defined as

$E_g = -\sum_{i=1}^{n} t_i \ln(p_i)$
where $t_i$ is the truth value and $p_i$ is the probability of the $i$ᵗʰ class.
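
A small Python sketch of this computation; the one-hot target and the predicted probability vectors below are made-up illustrative values:

import numpy as np

def cross_entropy(t, p):
    # E_g = -sum_i t_i * ln(p_i)
    return -np.sum(t * np.log(p))

t = np.array([0.0, 1.0, 0.0])           # one-hot truth: the second class is the actual class
p_good = np.array([0.05, 0.90, 0.05])   # confident, correct prediction
p_bad  = np.array([0.60, 0.20, 0.20])   # poor prediction

print(cross_entropy(t, p_good))   # ~0.105 (small loss)
print(cross_entropy(t, p_bad))    # ~1.609 (large loss)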

For classification with two classes, we have the binary cross-entropy loss, which is defined as follows

$E_b = -[t \ln(\hat{y}) + (1-t) \ln(1-\hat{y})]$

The truth label, $t$, in the binary loss is a known value, whereas $\hat{y}$ is a variable. This means that the function will be differentiated with respect to $\hat{y}$, treating $t$ as a constant. Let's go ahead and work on the derivative now.
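
Differentiating each term with respect to $\hat{y}$, using $\frac{d}{d\hat{y}}\ln(\hat{y}) = \frac{1}{\hat{y}}$ and $\frac{d}{d\hat{y}}\ln(1-\hat{y}) = -\frac{1}{1-\hat{y}}$:

$\frac{\partial E_b}{\partial \hat{y}} = -\left[\frac{t}{\hat{y}} - \frac{1-t}{1-\hat{y}}\right] = \frac{-t(1-\hat{y}) + (1-t)\hat{y}}{\hat{y}(1-\hat{y})}$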

And therefore, the derivative of the binary cross-entropy loss function becomes

$\frac{\partial E_b}{\partial \hat{y}} = \frac{\hat{y} - t}{\hat{y}(1-\hat{y})}$
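
A short Python sketch that evaluates this loss and its gradient and compares the result against a finite-difference approximation (the function names and test values are illustrative):

import numpy as np

def binary_cross_entropy(t, y_hat):
    # E_b = -[ t*ln(y_hat) + (1 - t)*ln(1 - y_hat) ]
    return -(t * np.log(y_hat) + (1 - t) * np.log(1 - y_hat))

def binary_cross_entropy_grad(t, y_hat):
    # dE_b/dy_hat = (y_hat - t) / (y_hat * (1 - y_hat))
    return (y_hat - t) / (y_hat * (1 - y_hat))

t, y_hat, eps = 1.0, 0.7, 1e-6
analytic = binary_cross_entropy_grad(t, y_hat)
numeric = (binary_cross_entropy(t, y_hat + eps) -
           binary_cross_entropy(t, y_hat - eps)) / (2 * eps)
print(analytic, numeric)   # both ~ -1.4286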

Partial Derivative of Cost Function for Linear Regression
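
A sketch of this derivative, assuming the usual mean-squared-error cost with a linear hypothesis $h_\theta(x) = \theta^{T}x$:

$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^{2}$

Applying the chain rule, the squared term contributes a factor of $\left(h_\theta(x^{(i)}) - y^{(i)}\right)$ and the linear hypothesis contributes $\frac{\partial h_\theta(x^{(i)})}{\partial \theta_j} = x_j^{(i)}$, so

$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$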




Partial Derivative of Cost Function for Logistic Regression
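
A sketch under similar assumptions, with the sigmoid hypothesis $h_\theta(x) = g(\theta^{T}x)$ and the binary cross-entropy cost

$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[t^{(i)}\ln h_\theta(x^{(i)}) + \left(1-t^{(i)}\right)\ln\left(1-h_\theta(x^{(i)})\right)\right]$

Chaining the binary cross-entropy derivative obtained above, the sigmoid derivative $g'(z) = g(z)(1-g(z))$, and $\frac{\partial (\theta^{T}x^{(i)})}{\partial \theta_j} = x_j^{(i)}$:

$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\frac{h_\theta(x^{(i)}) - t^{(i)}}{h_\theta(x^{(i)})\left(1-h_\theta(x^{(i)})\right)}\cdot h_\theta(x^{(i)})\left(1-h_\theta(x^{(i)})\right)\cdot x_j^{(i)} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - t^{(i)}\right)x_j^{(i)}$

The gradient has the same form as in linear regression, with the sigmoid hypothesis in place of the linear one.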





