Cost Function

How to train a model?

  • Model $p_\theta(x)$
    • Random variable $x$
    • Parameter $\theta$
  • Data
    • $D = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$
  • Model learning is finding the $\theta$ that maximizes the probability of observing the data set $D$ under the model $p_\theta(x)$; this requires
    • Cost function
    • Parameter learning algorithm

Likelihood Estimation

  • Probability of the model $p_\theta(x)$ observing the data set $D=\{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}$, assuming the samples are i.i.d. so the joint probability factorizes

    $p_\theta(x^{(1)}, x^{(2)}, \dots, x^{(N)}) = p_\theta(x^{(1)})p_\theta(x^{(2)})\cdots p_\theta(x^{(N)}) = \displaystyle\prod_{i=1}^{N}p_\theta(x^{(i)})$

  • Probability of the conditional model $p_\theta(y|x)$ observing the data set $D=\{(y^{(1)}, x^{(1)}), (y^{(2)}, x^{(2)}), \dots, (y^{(N)}, x^{(N)})\}$

    $p_\theta(y^{(1)}, y^{(2)}, \dots, y^{(N)}|x^{(1)}, x^{(2)}, \dots, x^{(N)}) = p_\theta(y^{(1)}|x^{(1)})p_\theta(y^{(2)}|x^{(2)})\cdots p_\theta(y^{(N)}|x^{(N)}) = \displaystyle\prod_{i=1}^{N}p_\theta(y^{(i)}|x^{(i)})$
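The product form above can be sketched numerically. A minimal example, assuming a hypothetical Bernoulli model where $\theta$ is the probability of observing $x=1$:

```python
import numpy as np

# Hypothetical Bernoulli model p_theta(x): theta = P(x = 1).
theta = 0.7
data = np.array([1, 0, 1, 1, 0])  # D = {x^(1), ..., x^(N)}

# Per-sample probabilities p_theta(x^(i)), assuming i.i.d. samples.
probs = np.where(data == 1, theta, 1 - theta)

# Likelihood of the whole data set = product of per-sample probabilities.
likelihood = np.prod(probs)  # 0.7^3 * 0.3^2
```

Note how the likelihood shrinks exponentially with $N$, which is one reason the log form below is preferred in practice.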

Maximum Likelihood Estimation

  • MLE is a methodology that finds the parameter $\hat\theta$ maximizing the probability of the model observing the data set $D$

    $\hat\theta = \displaystyle\operatorname*{argmax}_{\theta}\displaystyle\prod_{i=1}^{N}p_\theta(y^{(i)}|x^{(i)})$

  • Negative log likelihood
    • Taking the log converts the product of probabilities into a sum of log terms, which is numerically far more stable to compute; negating the sum turns the maximization into an equivalent minimization

      $\hat\theta = \displaystyle\operatorname*{argmin}_{\theta}-\displaystyle\sum_{i=1}^{N}\log p_\theta(y^{(i)}|x^{(i)})$

  • Cost function

    $J(\theta) = -\frac{1}{N}\displaystyle\sum_{i=1}^{N}\log p_\theta(y^{(i)}|x^{(i)})$

    $\hat\theta = \displaystyle\operatorname*{argmin}_{\theta}J(\theta)$
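Minimizing $J(\theta)$ can be sketched end to end. A minimal example using a hypothetical unconditional Bernoulli model (for which the MLE is known to be the sample mean) and a simple grid search standing in for a real learning algorithm:

```python
import numpy as np

# Hypothetical Bernoulli data; the MLE of theta = P(x=1) is the sample mean.
data = np.array([1, 0, 1, 1, 0, 1])

def cost(theta, data):
    # J(theta) = -(1/N) * sum_i log p_theta(x^(i))
    probs = np.where(data == 1, theta, 1 - theta)
    return -np.mean(np.log(probs))

# Grid search over theta as a stand-in for a parameter learning algorithm.
grid = np.linspace(0.01, 0.99, 99)
theta_hat = grid[np.argmin([cost(t, data) for t in grid])]
# theta_hat lands on the grid point nearest the sample mean (4/6 ~ 0.667).
```

Gradient-based optimizers replace the grid search in practice, but the objective being minimized is exactly this averaged negative log likelihood.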

Cost function for Logistic Regression

  • Logistic Regression

    $P_\theta(y=1|\boldsymbol{x})=h_{\boldsymbol\theta}(\boldsymbol{x})=\frac{1}{1+e^{-(\boldsymbol\theta^T\boldsymbol{x}+\theta_0)}}$

  • Cost function for logistic regression

    $J(\boldsymbol\theta)=-\frac{1}{N}\displaystyle\sum_{i=1}^{N}\log P_\theta(y^{(i)}|x^{(i)})$
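The two formulas above combine directly into code. A minimal sketch with a hypothetical one-dimensional data set, where `theta` and `theta0` play the roles of $\boldsymbol\theta$ and $\theta_0$:

```python
import numpy as np

def sigmoid(z):
    # h(x) = 1 / (1 + exp(-(theta^T x + theta_0)))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-D, linearly separable data with binary labels.
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0, 0, 1, 1])

def cost(theta, theta0, X, y):
    h = sigmoid(theta * X + theta0)        # P_theta(y=1 | x)
    p = np.where(y == 1, h, 1 - h)         # P_theta(y^(i) | x^(i))
    return -np.mean(np.log(p))             # J(theta)
```

On this toy data, a steeper decision boundary (larger `theta`) assigns higher probability to every label, so `cost(2.0, 0.0, X, y)` is lower than `cost(0.5, 0.0, X, y)`.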

Cost function, Loss function

  • Loss function
    • Difference between gold and predicted values per sample: $L(\boldsymbol\theta) = -\log P_{\boldsymbol\theta}(y|\boldsymbol{x}) = -y\log h_{\boldsymbol\theta}(\boldsymbol{x})-(1-y)\log(1-h_{\boldsymbol\theta}(\boldsymbol{x}))$
  • Cost function
    • Average difference between gold and predicted values over the entire data set: $J(\boldsymbol\theta)=-\frac{1}{N}\displaystyle\sum_{i=1}^{N}\log P_{\boldsymbol\theta}(y^{(i)}|\boldsymbol{x}^{(i)})$
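The loss/cost distinction can be made concrete: the loss is a vector with one entry per sample, and the cost is its mean. A minimal sketch with hypothetical predictions `h` standing in for $h_{\boldsymbol\theta}(\boldsymbol{x}^{(i)})$:

```python
import numpy as np

def loss(y, h):
    # Per-sample loss: L = -y*log(h) - (1-y)*log(1-h)
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# Hypothetical model outputs h_theta(x^(i)) for gold labels y^(i).
y = np.array([1, 0, 1])
h = np.array([0.9, 0.2, 0.6])

per_sample = loss(y, h)   # one loss value per sample
J = per_sample.mean()     # cost = average loss over the data set
```

Training minimizes `J`, but inspecting `per_sample` is useful for spotting individual examples the model gets badly wrong.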
