I have received the following problem:
Consider the following simple model of a neuron:
\begin{align}
z &= wx + b && \text{logits}, \\
\hat{y} &= g(z) && \text{activation}, \\
L_2(w, b) &= \tfrac{1}{2} (y - \hat{y})^2 && \text{quadratic loss (Mean Squared Error (MSE), $L_2$ loss, $\ell_2$-norm)}, \\
L_1(w, b) &= |y - \hat{y}| && \text{absolute value loss (Mean Absolute Error (MAE), $L_1$ loss, $\ell_1$-norm)},
\end{align}
with x,w,b ∈ R. Calculate the derivatives ∂L/∂w and ∂L/∂b for updating the weight w and bias b. Determine the results for both loss functions (L1, L2) and assume a sigmoid and a tanh activation function g(z). Write down all steps of your derivation. Hint: You have to use the chain rule.
I have considered the following approach, here for example $L_2$ differentiated with respect to $w$:
\begin{equation}
L_2 = \frac{1}{2} (y - \hat{y})^2 = \frac{1}{2} \left( y - g(z) \right)^2 = \frac{1}{2} \left( y - g(wx + b) \right)^2 = \frac{1}{2} \left( y - \frac{1}{1 + e^{-(wx + b)}} \right)^2
\end{equation}
This expression could then be differentiated directly using the chain rule.
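Spelled out (a sketch for the sigmoid case, using $g'(z) = g(z)\,(1 - g(z))$), this direct differentiation would give:
\begin{equation}
\frac{\partial L_2}{\partial w} = -\left( y - g(wx + b) \right) \, g'(wx + b) \, x = -\left( y - \hat{y} \right) \, \hat{y} \, (1 - \hat{y}) \, x
\end{equation}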
However, in the literature I find the following approach:
\begin{equation}
\frac{\partial L_2}{\partial w} = \frac{\partial L_2}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial z} \times \frac{\partial z}{\partial w} = -(y - \hat{y}) \times g'(z) \times x
\end{equation}
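To compare the two, I also put together a quick numerical check (a minimal sketch with SymPy; the symbol names and the sigmoid choice are my own, not from the literature source):

```python
import sympy as sp

# symbols for input, weight, bias and target
x, w, b, y = sp.symbols('x w b y', real=True)

# forward pass with a sigmoid activation
z = w * x + b
y_hat = 1 / (1 + sp.exp(-z))                 # g(z) = sigmoid(z)
L2 = sp.Rational(1, 2) * (y - y_hat)**2

# approach 1: differentiate the fully expanded expression directly
dL2_dw_direct = sp.diff(L2, w)

# approach 2: the factored chain-rule product -(y - y_hat) * g'(z) * x
g_prime = y_hat * (1 - y_hat)                # derivative of the sigmoid
dL2_dw_chain = -(y - y_hat) * g_prime * x

# the difference simplifies to zero, i.e. both yield the same derivative
print(sp.simplify(dL2_dw_direct - dL2_dw_chain))   # -> 0
```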
So which of the two approaches is the right one in this context? Or is there a connection between the two?