Lecture 03

  • Lecture 02 recap
  • Learning machines:
    • experience E: supervised, unsupervised, reinforcement, etc.
    • tasks T: classification, regression, density estimation, etc.
    • performance measure P: accuracy, error rate
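
As a concrete instance of P, here is a minimal sketch of accuracy and error rate for a classifier (numpy-based; the toy labels are purely illustrative):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
acc = accuracy(y_true, y_pred)   # 0.8
err = 1.0 - acc                  # error rate is the complement: 0.2
```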
  • Example: linear regression (2D space)
    • What: $\hat{y} = w^Tx$
    • How: $D = \{(x_0, y_0), \dots, (x_n, y_n)\}$, find $w$ that minimizes $Loss(D, w)$ (see the sketch after this example).
    • Train error vs. Validation error vs. Test error
    • Generalization capability: the gap between train error and test error
    • Underfitting vs. Overfitting
    • Hyperparameter & validation set
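
A minimal sketch of the linear regression recipe above, assuming a 1-D input with an appended bias feature; least squares via `np.linalg.lstsq` stands in for "find $w$ that minimizes $Loss(D, w)$", and the toy data and split are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset D = {(x_i, y_i)}: y = 2x + 1 plus noise.
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1 + 0.1 * rng.normal(size=100)

# Split into train and validation sets.
x_tr, y_tr = x[:80], y[:80]
x_va, y_va = x[80:], y[80:]

def features(x):
    # Append a constant feature so w^T x includes a bias term.
    return np.stack([x, np.ones_like(x)], axis=1)

# Closed-form least squares: w = argmin ||Xw - y||^2.
X = features(x_tr)
w, *_ = np.linalg.lstsq(X, y_tr, rcond=None)

def mse(x, y, w):
    return float(np.mean((features(x) @ w - y) ** 2))

train_err = mse(x_tr, y_tr, w)
val_err = mse(x_va, y_va, w)   # val_err - train_err ~ generalization gap
```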
  • Perceptron
    • Just a linear classifier (its learning rule is sketched just below)
    • To adapt to non-linear problems: kernel methods, deep learning
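
A minimal sketch of the perceptron as a linear classifier, assuming the classic mistake-driven learning rule with labels in $\{-1, +1\}$; the toy data and epoch count are illustrative:

```python
import numpy as np

def train_perceptron(X, y, epochs=10, lr=1.0):
    """Classic perceptron rule: w += lr * y_i * x_i on each mistake.

    X: (n, d) inputs with a bias feature appended; y: labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
                w += lr * yi * xi
    return w

# Linearly separable toy problem: label = sign(x1 + x2).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.sign(X[:, 0] + X[:, 1])
Xb = np.hstack([X, np.ones((50, 1))])   # bias feature
w = train_perceptron(Xb, y)
```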
  • Deep learning: Single hidden layer
  • Deep neural network
    • A composite function composed of many first-order differentiable functions parameterized by $w$.
    • Ingredients: cost functions, output units, hidden layers, architecture
    • Cost function: cross entropy $L(\hat{y}) = -\sum\limits_{i=1}^{C}y_i\log{\hat{y}_i}$. Euclidean loss: the $L_2$ loss
      • Why we conventionally write $\frac{\partial L}{\partial \hat{y}}$ rather than $\frac{dL}{dw}$: it separates the loss from the prediction: $\frac{dL}{dw} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w}$.
    • Output units: sigmoid, softmax
    • Hidden units: ReLU, sigmoid, tanh (both kinds of unit are sketched just below)
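
A minimal sketch of these output and hidden units plus the cross-entropy cost; the max-shift in softmax and the `eps` inside the log are standard numerical-stability details, not part of the lecture formulas:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()              # shift logits for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(y, y_hat, eps=1e-12):
    # L(y_hat) = -sum_i y_i * log(y_hat_i), with y one-hot.
    return float(-np.sum(y * np.log(y_hat + eps)))

logits = np.array([2.0, 0.5, -1.0])
y = np.array([1.0, 0.0, 0.0])    # one-hot target, C = 3
loss = cross_entropy(y, softmax(logits))
```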
  • Architecture design
    • Networks are organized into groups of units called layers. Layers form a chain.
    • Depth = number of layers; width = number of units in a layer (see the sketch after this list)
    • Universal approximation theorem: a single hidden layer with enough units can approximate any continuous function (on a compact domain) arbitrarily well
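
A minimal sketch of the chain-of-layers view, assuming fully connected layers with ReLU hidden units and a linear output; depth and width are just the length and entries of the hypothetical `widths` list:

```python
import numpy as np

def init_chain(widths, rng):
    """widths = [d_in, h1, ..., hk, d_out]; depth = len(widths) - 1."""
    return [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(widths[:-1], widths[1:])]

def forward(layers, x):
    # Each hidden layer maps x -> relu(x @ W + b); the chain composes them.
    for W, b in layers[:-1]:
        x = np.maximum(0.0, x @ W + b)
    W, b = layers[-1]
    return x @ W + b               # linear output layer

rng = np.random.default_rng(0)
layers = init_chain([2, 16, 16, 1], rng)    # depth 3, hidden width 16
y_hat = forward(layers, np.array([0.5, -0.3]))
```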
  • Forward, backward and repetition
    • Forward: Model => Prediction => Loss
    • Backward: Loss gradient (w.r.t. $\hat{y}$) => Prediction gradient => Model gradient
    • Repeat: Repeat "Forward" and "Backward" until convergence
  • Gradient descent: $w^{(t+1)} = w^{(t)} - \eta^{(t)}\nabla L\big|_{w = w^{(t)}}$.
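
Putting it together, a minimal sketch of forward, backward, and repeat with gradient descent on a one-hidden-layer network and $L_2$ loss; the backward pass starts from $\frac{\partial L}{\partial \hat{y}}$ exactly as in the chain-rule note above. The architecture, toy data, and $\eta$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, :1] ** 2 + X[:, 1:] ** 2          # target: x1^2 + x2^2, shape (200, 1)

# One hidden layer (ReLU), linear output, L2 loss.
W1, b1 = rng.normal(scale=0.5, size=(2, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.5, size=(32, 1)), np.zeros(1)
eta = 0.05                                  # learning rate eta^(t), kept constant here

for step in range(500):
    # Forward: model => prediction => loss.
    h = np.maximum(0.0, X @ W1 + b1)        # hidden activations
    y_hat = h @ W2 + b2
    loss = np.mean((y_hat - y) ** 2)
    if step % 100 == 0:
        print(step, loss)

    # Backward: loss gradient (w.r.t. y_hat) => prediction gradient => model gradient.
    n = len(X)
    d_yhat = 2 * (y_hat - y) / n            # dL/dy_hat
    dW2, db2 = h.T @ d_yhat, d_yhat.sum(0)
    d_h = (d_yhat @ W2.T) * (h > 0)         # chain rule through the ReLU
    dW1, db1 = X.T @ d_h, d_h.sum(0)

    # Repeat: gradient descent update w <- w - eta * grad.
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2
```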