Lecture 03

  • Lecture 02 recap
  • Learning machines:
    • experience E: supervised, unsupervised, reinforcement, etc.
    • tasks T: classification, regression, density estimation, etc.
    • performance measure P: accuracy, error rate
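
As a concrete instance of P, here is a minimal sketch of accuracy and error rate for a classifier (numpy-based; the toy labels are purely illustrative):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
acc = accuracy(y_true, y_pred)   # 0.8
err = 1.0 - acc                  # error rate is the complement: 0.2
```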
  • Example: linear regression (2D space)
    • What: $\hat{y} = w^Tx$
    • How: $D = \{(x_0, y_0), \dots, (x_n, y_n)\}$, find $w$ that minimizes $Loss(D, w)$ (see the sketch after this example).
    • Train error vs. Validation error vs. Test error
    • Generalization capability: the gap between train error and test error
    • Underfitting vs. Overfitting
    • Hyperparameter & validation set
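
A minimal sketch of the linear regression recipe above, assuming a 1-D input with an appended bias feature; least squares via `np.linalg.lstsq` stands in for "find $w$ that minimizes $Loss(D, w)$", and the toy data and split are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset D = {(x_i, y_i)}: y = 2x + 1 plus noise.
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1 + 0.1 * rng.normal(size=100)

# Split into train and validation sets.
x_tr, y_tr = x[:80], y[:80]
x_va, y_va = x[80:], y[80:]

def features(x):
    # Append a constant feature so w^T x includes a bias term.
    return np.stack([x, np.ones_like(x)], axis=1)

# Closed-form least squares: w = argmin ||Xw - y||^2.
X = features(x_tr)
w, *_ = np.linalg.lstsq(X, y_tr, rcond=None)

def mse(x, y, w):
    return float(np.mean((features(x) @ w - y) ** 2))

train_err = mse(x_tr, y_tr, w)
val_err = mse(x_va, y_va, w)   # val_err - train_err ~ generalization gap
```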
  • Perceptron
    • Just a linear classifier (its learning rule is sketched just below)
    • To adapt to non-linear problems: kernel methods, deep learning
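
A minimal sketch of the perceptron as a linear classifier, assuming the classic mistake-driven learning rule with labels in $\{-1, +1\}$; the toy data and epoch count are illustrative:

```python
import numpy as np

def train_perceptron(X, y, epochs=10, lr=1.0):
    """Classic perceptron rule: w += lr * y_i * x_i on each mistake.

    X: (n, d) inputs with a bias feature appended; y: labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
                w += lr * yi * xi
    return w

# Linearly separable toy problem: label = sign(x1 + x2).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sign(X[:, 0] + X[:, 1])
Xb = np.hstack([X, np.ones((50, 1))])   # bias feature
w = train_perceptron(Xb, y)
```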
  • Deep learning: Single hidden layer
  • Deep neural network
    • A composite function composed of many first-order differentiable functions parameterized by $w$.
    • Ingredients: cost functions, output units, hidden layers, architecture
    • Cost function: cross entropy $L(\hat{y}) = -\sum\limits_{i=1}^{C}y_i\log{\hat{y}_i}$. Euclidean loss: the $L_2$ loss
      • Why we conventionally write $\frac{\partial L}{\partial \hat{y}}$ rather than $\frac{dL}{dw}$: it separates the loss from the prediction: $\frac{dL}{dw} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w}$.
    • Output units: sigmoid, softmax
    • Hidden units: ReLU, sigmoid, tanh (both kinds of unit are sketched just below)
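
A minimal sketch of these output and hidden units plus the cross-entropy cost; the max-shift in softmax and the `eps` inside the log are standard numerical-stability details, not part of the lecture formulas:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()              # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y, y_hat, eps=1e-12):
    # L(y_hat) = -sum_i y_i * log(y_hat_i), with y one-hot.
    return float(-np.sum(y * np.log(y_hat + eps)))

logits = np.array([2.0, 0.5, -1.0])
y = np.array([1.0, 0.0, 0.0])    # one-hot target, C = 3
loss = cross_entropy(y, softmax(logits))
```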
  • Architecture design
    • Networks are organized into groups of units called layers. Layers form a chain.
    • Depth = number of layers; width = number of units in a layer (see the sketch after this list)
    • Universal approximation theorem: a single hidden layer with enough units can approximate any continuous function (on a compact domain) arbitrarily well
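
A minimal sketch of the chain-of-layers view, assuming fully connected layers with ReLU hidden units and a linear output; depth and width are just the length and entries of the hypothetical `widths` list:

```python
import numpy as np

def init_chain(widths, rng):
    """widths = [d_in, h1, ..., hk, d_out]; depth = len(widths) - 1."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(widths[:-1], widths[1:])]

def forward(layers, x):
    # Each hidden layer maps x -> relu(x @ W + b); the chain composes them.
    for W, b in layers[:-1]:
        x = np.maximum(0.0, x @ W + b)
    W, b = layers[-1]
    return x @ W + b               # linear output layer

rng = np.random.default_rng(0)
layers = init_chain([2, 16, 16, 1], rng)    # depth 3, hidden width 16
y_hat = forward(layers, np.array([0.5, -0.3]))
```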
  • Forward, backward and repetition
    • Forward: Model => Prediction => Loss
    • Backward: Loss gradient (w.r.t. $\hat{y}$) => Prediction gradient => Model gradient
    • Repeat: Repeat "Forward" and "Backward" until convergence
  • Gradient descent: $w^{(t+1)} = w^{(t)} - \eta^{(t)}\nabla L\big|_{w = w^{(t)}}$.
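
Putting it together, a minimal sketch of forward, backward, and repeat with gradient descent on a one-hidden-layer network and $L_2$ loss; the backward pass starts from $\frac{\partial L}{\partial \hat{y}}$ exactly as in the chain-rule note above. The architecture, toy data, and $\eta$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, :1] ** 2 + X[:, 1:] ** 2          # target: x1^2 + x2^2, shape (200, 1)

# One hidden layer (ReLU), linear output, L2 loss.
W1, b1 = rng.normal(scale=0.5, size=(2, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.5, size=(32, 1)), np.zeros(1)
eta = 0.05                                  # learning rate eta^(t), kept constant here

for step in range(500):
    # Forward: model => prediction => loss.
    h = np.maximum(0.0, X @ W1 + b1)        # hidden activations
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)
    if step % 100 == 0:
        print(step, loss)

    # Backward: loss gradient (w.r.t. y_hat) => prediction gradient => model gradient.
    n = len(X)
    d_yhat = 2 * (y_hat - y) / n            # dL/dy_hat
    dW2, db2 = h.T @ d_yhat, d_yhat.sum(0)
    d_h = (d_yhat @ W2.T) * (h > 0)         # chain rule through the ReLU
    dW1, db1 = X.T @ d_h, d_h.sum(0)

    # Repeat: gradient descent update w <- w - eta * grad.
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2
```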