Machine Learning and Pattern Recognition (Revision)
$\nabla_{\mathbf x} (\mathbf x^\top \mathbf a) = \mathbf a$, $\nabla_{\mathbf x} (\mathbf x^\top A \mathbf x) = A \mathbf x + A^\top \mathbf x$
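A quick finite-difference check of these two identities (a minimal numpy sketch with arbitrary $\mathbf a$, $A$, $\mathbf x$):

```python
import numpy as np

# Finite-difference check of the gradient identities above
D = 4
a, x = np.random.randn(D), np.random.randn(D)
A = np.random.randn(D, D)
eps = 1e-6

num_grad_xa = np.array([((x + eps*e) @ a - (x - eps*e) @ a) / (2*eps) for e in np.eye(D)])
num_grad_xAx = np.array([((x + eps*e) @ A @ (x + eps*e) - (x - eps*e) @ A @ (x - eps*e)) / (2*eps)
                         for e in np.eye(D)])
print(np.allclose(num_grad_xa, a))                  # matches a
print(np.allclose(num_grad_xAx, A @ x + A.T @ x))   # matches (A + A^T) x
```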
import numpy as np

# Least-squares weights from the normal equations (X^T X) w = X^T y;
# X is the design matrix and y the targets, assumed defined earlier.
try:
    print(np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y)))
except Exception as exception:
    print(type(exception).__name__)

# Equivalent: solve (X^T X) Z = X^T for Z = (X^T X)^{-1} X^T, then multiply by y.
try:
    print(np.dot(np.linalg.solve(np.dot(X.T, X), X.T), y))
except Exception as exception:
    print(type(exception).__name__)
For a univariate input, $\boldsymbol\phi (x) = (1, x, x^2, \cdots, x^K)^\top$.
For a multivariate input, $\boldsymbol\phi (\mathbf x) = (1, x_1, x_2, \cdots, x_D, x_1 x_2, x_1 x_3, \cdots, x_{D-1} x_D, x_1^2, \cdots, x_D^2, \cdots)^\top$
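For instance, a univariate polynomial design matrix (a minimal sketch; the inputs `xx` and degree `K` are arbitrary):

```python
import numpy as np

# Univariate polynomial basis: row n of Phi is phi(x_n) = (1, x_n, x_n^2, ..., x_n^K)
xx = np.linspace(-1, 1, 5)
K = 3
Phi = np.stack([xx**k for k in range(K + 1)], axis=1)
print(Phi.shape)   # (5, K+1)
```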
Properties:
Regularisation penalises extreme weight values by adding the sum of squared weights to the cost function:
$$E_\lambda (\mathbf w; \mathbf y, \Phi) = (\mathbf y - \Phi \mathbf w)^\top (\mathbf y - \Phi \mathbf w) + \lambda \mathbf w^\top \mathbf w$$
By rewriting $\mathbf y' = \begin{pmatrix} \mathbf y \\ \mathbf 0_K \end{pmatrix}$ and $\Phi' = \begin{pmatrix} \Phi \\ \sqrt{\lambda} \mathbb I_K \end{pmatrix}$, then
$$E_\lambda (\mathbf w; \mathbf y', \Phi') = (\mathbf y' - \Phi' \mathbf w)^\top (\mathbf y' - \Phi' \mathbf w)$$
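A minimal numpy sketch of this augmented-system fit, assuming a design matrix `Phi` and targets `yy` are already defined (e.g. from the basis expansion above) and `lam` stands for $\lambda$:

```python
import numpy as np

# Regularised least squares by appending sqrt(lambda) * I to Phi and zeros to y
lam = 0.1                                     # assumed regularisation constant
K = Phi.shape[1]                              # number of basis functions / weights
Phi_aug = np.vstack([Phi, np.sqrt(lam) * np.eye(K)])
yy_aug = np.concatenate([yy, np.zeros(K)])
w_fit = np.linalg.lstsq(Phi_aug, yy_aug, rcond=None)[0]
```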
Simple baseline models: for example, $f(x) = b$ or $f(\mathbf x) = \mathbf w^\top \mathbf x + b$.
Nested models (models with some weights constrained to 0) never fit the training data better than the unconstrained model, although they can generalise better.
If test cases are from some (data) distribution $p(\mathbf x, y)$,
$$\text{Generalisation error} = \mathbb E_{p(\mathbf x, y)} \left[ L(y, f(\mathbf x)) \right] = \int L(y, f(\mathbf x)) \, p(\mathbf x, y) \, \mathrm d \mathbf x \, \mathrm d y$$
Using a Monte Carlo (unbiased) estimate,
$$\text{Average test error} = \frac{1}{M} \sum_{m=1}^{M} L \left( y^{(m)}, f(\mathbf x^{(m)}) \right), \quad \mathbf x^{(m)}, y^{(m)} \sim p(\mathbf x, y)$$
One can show $\mathbb E [\text{Average test error}] = \text{Generalisation error}$.
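A tiny sketch of this Monte Carlo estimate with square loss (the predictor `f` and the test set here are made up):

```python
import numpy as np

# Average test error as a Monte Carlo estimate of the generalisation error
f = lambda x: 2.0 * x                               # hypothetical fitted predictor
x_test = np.random.randn(1000)                      # draws standing in for p(x, y)
y_test = 2.0 * x_test + 0.1 * np.random.randn(1000)
avg_test_error = np.mean((y_test - f(x_test))**2)   # square loss L(y, f(x))
print(avg_test_error)
```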
One sensible split: 80% training, 10% validation, 10% testing.
The sum (or mean) of many random variables sampled independently and identically from some distribution (with finite variance) is approximately Gaussian (Central Limit Theorem).
Or, more formally, compare two models' test losses with a paired t-test.
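For example, a paired t-test on per-test-case losses of two models (a sketch with made-up losses, using `scipy.stats.ttest_rel`):

```python
import numpy as np
from scipy import stats

# Paired t-test: do the two models' losses on the *same* test cases differ on average?
losses_a = np.random.rand(100)                           # loss of model A on each test case
losses_b = losses_a + 0.05 + 0.1 * np.random.randn(100)  # loss of model B on the same cases
result = stats.ttest_rel(losses_a, losses_b)
print(result.statistic, result.pvalue)
```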
Transformation: $\begin{cases} \mathbf x \overset{iid}{\sim} \mathcal N(0,1) \\ \mathbf y = A \mathbf x \end{cases} \Rightarrow \mathrm{cov}(\mathbf y) = \Sigma = AA^\top$
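A quick numpy check of this transformation (arbitrary $A$):

```python
import numpy as np

# y = A x with x ~ N(0, I): the empirical covariance of y approaches A A^T
A = np.array([[1.0, 0.0], [0.8, 0.5]])
X = np.random.randn(2, 100000)      # columns are i.i.d. standard normal vectors
Y = A @ X
print(np.cov(Y))                    # close to A A^T below
print(A @ A.T)
```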
$$p(\mathbf x) = \mathcal N(\mathbf x; \boldsymbol\mu, \Sigma) = \frac{1}{|\Sigma|^{1/2} (2\pi)^{D/2}} \exp \left( -\frac{1}{2} (\mathbf x - \boldsymbol\mu)^\top \Sigma^{-1} (\mathbf x - \boldsymbol\mu) \right)$$
$\Sigma$ must be positive semi-definite to be a valid covariance, and positive definite for this density to exist.
If $\Sigma$ is only positive semi-definite (not positive definite), it is not invertible. For example, if $x_1 = x_2$ then $\Sigma = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, which is singular.
Given a joint distribution
$$p(\mathbf f, \mathbf g) = \mathcal N \left( \begin{bmatrix} \mathbf f \\ \mathbf g \end{bmatrix}; \begin{bmatrix} \boldsymbol\mu_f \\ \boldsymbol\mu_g \end{bmatrix}, \begin{bmatrix} \Sigma_f & C \\ C^\top & \Sigma_g \end{bmatrix} \right) \text{,}$$
the conditional is also Gaussian:
$$p(\mathbf f | \mathbf g) = \mathcal N \left( \mathbf f; \; \boldsymbol\mu_f + C \Sigma_g^{-1} (\mathbf g - \boldsymbol\mu_g), \; \Sigma_f - C \Sigma_g^{-1} C^\top \right)$$
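A small numpy sketch of conditioning on an observed $\mathbf g$ (the block values below are arbitrary):

```python
import numpy as np

# p(f | g) for a joint Gaussian with blocks Sigma_f, Sigma_g and cross-covariance C
mu_f, mu_g = np.array([0.0]), np.array([1.0])
Sigma_f, Sigma_g, C = np.array([[2.0]]), np.array([[1.0]]), np.array([[0.5]])
g_obs = np.array([2.0])

cond_mean = mu_f + C @ np.linalg.solve(Sigma_g, g_obs - mu_g)
cond_cov = Sigma_f - C @ np.linalg.solve(Sigma_g, C.T)
print(cond_mean, cond_cov)
```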
One-hot encoding: represent a categorical variable as a binary vector, usually for features that follow an arbitrary discrete distribution.
Gaussian naive Bayes: the class-conditional covariance matrices $\Sigma_k$ are diagonal.
$K$ possible classes. Actual class $c$. $y_k = \delta_{kc} = \begin{cases} 1 & k=c \\ 0 & k \neq c \end{cases}$.
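For example, in numpy (a small sketch; the labels here are made up and 0-based):

```python
import numpy as np

# One-hot encode integer class labels c in {0, ..., K-1}
K = 4
labels = np.array([0, 2, 1, 3])
Y = np.eye(K)[labels]      # row n is the indicator vector y for example n
print(Y)
```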
E-step: Set soft responsibilities $$r_k^{(n)} = P(z^{(n)}=k | \boldsymbol x^{(n)}, \boldsymbol\theta) = \frac{\pi_k \mathcal N(\boldsymbol x^{(n)}; \boldsymbol\mu_k, \Sigma_k)}{\sum_{k'}\pi_{k'} \mathcal N(\boldsymbol x^{(n)}; \boldsymbol\mu_{k'}, \Sigma_{k'})} \text{,} \quad r_k = \sum_{n=1}^N r_k^{(n)}$$
M-step: Update parameter $\theta$ $$\pi_k = \frac{r_k}{N} \text{,} \quad \boldsymbol\mu_k = \frac{1}{r_k} \sum_{n=1}^N r_k^{(n)} \boldsymbol x^{(n)} \text{,} \quad \Sigma_k = \displaystyle\frac{1}{r_k} \sum_{n=1}^N r_k^{(n)} \boldsymbol x^{(n)} \boldsymbol x^{(n)\top} - \boldsymbol\mu_k \boldsymbol\mu_k^\top$$
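A single EM iteration in numpy might look as follows (a sketch on made-up 2-D data with $K=2$ components; `scipy.stats.multivariate_normal` evaluates the Gaussian densities):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy data and initial parameters
X = np.random.randn(200, 2)
N, D = X.shape
K = 2
pi = np.full(K, 1.0 / K)
mu = np.random.randn(K, D)
Sigma = np.stack([np.eye(D) for _ in range(K)])

# E-step: responsibilities r[n, k] = P(z^(n) = k | x^(n), theta)
r = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
r /= r.sum(axis=1, keepdims=True)
r_k = r.sum(axis=0)

# M-step: update mixing proportions, means and covariances
pi = r_k / N
mu = (r.T @ X) / r_k[:, None]
Sigma = np.stack([(r[:, k, None] * X).T @ X / r_k[k] - np.outer(mu[k], mu[k])
                  for k in range(K)])
```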
$\mathbf w$ large $\quad \Rightarrow \quad \sigma(\mathbf w^\top \mathbf x) \to 0$ or $1$ (saturates) $\quad \Rightarrow \quad \nabla_{\mathbf w} \sigma = \sigma (1-\sigma) \mathbf x \to \mathbf 0$
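A quick numeric illustration of this saturation (arbitrary $\mathbf x$ and weight scales):

```python
import numpy as np

# As the weights are scaled up, sigma saturates and the gradient sigma(1 - sigma) x vanishes
sigma = lambda a: 1.0 / (1.0 + np.exp(-a))
x = np.array([1.0, 2.0])
for scale in [1.0, 10.0, 100.0]:
    w = scale * np.array([0.5, 0.5])
    s = sigma(w @ x)
    print(scale, s, s * (1 - s) * x)
```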
best_val_cost, best_w, checks_since_improvement = np.inf, None, 0
for epoch in range(num_epochs):
    w = training_step(w)                    # placeholder: one pass over the training data
    val_cost = validation_cost(w)           # placeholder: cost on the validation set
    if val_cost < best_val_cost:            # smallest validation cost ever seen
        best_val_cost, best_w, checks_since_improvement = val_cost, np.copy(w), 0
    else:
        checks_since_improvement += 1
    if checks_since_improvement >= 20:      # no improvement in the last 20 checks
        break                               # early stopping: keep the stored weights best_w
- New axes (eigenvectors) are orthogonal
- If $V$ is a $D \times D$ matrix, $VV^\top = \mathbb I$, i.e. $V$ is an orthogonal matrix and no information is lost
- Transformation, not fitting, so no risk of overfitting
- Viewed as a linear autoencoder, the reconstruction lies in a $K$-dimensional linear subspace
Truncated SVD: the best low-rank approximation of a matrix, as measured by squared error
PCA: linear dimensionality reduction method minimising the least squares error of the distortion
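A minimal PCA-by-SVD sketch (made-up data, keeping $K=2$ components):

```python
import numpy as np

# PCA via the truncated SVD of the centred data matrix
X = np.random.randn(100, 5)
K = 2
mu = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
V = Vt[:K].T                 # D x K matrix whose columns are the top-K principal directions
Z = (X - mu) @ V             # K-dimensional representation of each point
X_hat = Z @ V.T + mu         # reconstruction: best rank-K approximation in squared error
print(np.mean((X - X_hat)**2))
```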
Rewriting with $\mathcal D = \{ \mathbf x^{(n)}, y^{(n)} \}$,
$$p(\mathbf w | \mathcal D) \propto P(\mathcal D | \mathbf w) p(\mathbf w)$$
Step 1: Matching the distributions
$$p(\mathbf w | \mathcal D) \approx \mathcal N(\mathbf w; \mathbf w^*, H^{-1})$$
where $\mathbf w^*$ is the mode of the posterior and $H$ is the Hessian of $-\log p(\mathbf w, \mathcal D)$ at $\mathbf w^*$.
Step 2: Approximating the normaliser
$$P(\mathcal D) \approx p(\mathbf w^*, \mathcal D) \left| 2 \pi H^{-1} \right|^{1/2}$$
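A toy 1-D sketch of both steps, using a made-up unnormalised log joint $\log p(w, \mathcal D)$ and `scipy.optimize.minimize` to find the mode:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up unnormalised log joint log p(w, D); any smooth, unimodal choice will do here
log_joint = lambda w: -0.5 * w**2 + 2.0 * w - np.log1p(np.exp(w))
energy = lambda w: -log_joint(w[0])                  # negative log joint

w_star = minimize(energy, x0=np.zeros(1)).x[0]       # Step 1: posterior mode w*

eps = 1e-4                                           # Hessian of the energy at w* (finite diff.)
H = (energy([w_star + eps]) - 2 * energy([w_star]) + energy([w_star - eps])) / eps**2

posterior_var = 1.0 / H                              # Gaussian approximation N(w; w*, H^{-1})
Z = np.exp(log_joint(w_star)) * np.sqrt(2 * np.pi / H)   # Step 2: approximate normaliser P(D)
print(w_star, posterior_var, Z)
```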
Limitations:
Ingredients:
NB: $D_\text{KL}(p \| q) \ne D_\text{KL}(q \| p)$
NB: $\log p(\mathcal D) \geq -J(q)$, where $-J(q)$ is called ELBO (Evidence Lower BOund)
Goal:
$\mathbf x^\top A \mathbf x = \mathrm{Tr} \left( \mathbf{xx}^\top A \right) = \mathrm{Tr} \left( A \mathbf{xx}^\top \right)$, $V = LL^\top$ (Cholesky decomposition)
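A quick check of both identities (arbitrary $\mathbf x$, $A$, and a random positive-definite $V$):

```python
import numpy as np

# x^T A x = Tr(x x^T A), and V = L L^T via the Cholesky decomposition
D = 3
x, A = np.random.randn(D), np.random.randn(D, D)
print(np.isclose(x @ A @ x, np.trace(np.outer(x, x) @ A)))

B = np.random.randn(D, D)
V = B @ B.T + D * np.eye(D)        # a positive-definite matrix
L = np.linalg.cholesky(V)
print(np.allclose(L @ L.T, V))
```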
Method 1: Monte-Carlo approximation (with $S=1$)
Method 2: Reparameterisation trick
$$\mathbb E_{\mathcal N(\mathbf w; \mathbf m, V)} \left[ f(\mathbf w) \right] = \mathbb E_{\mathcal N(\mathbf v; \mathbf 0, \mathbb I)} \left[ f(\mathbf m + L \mathbf v) \right]$$
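A numpy sketch of this identity as a Monte Carlo estimator (arbitrary $\mathbf m$, $V$, and test function $f$):

```python
import numpy as np

# E_{N(w; m, V)}[f(w)] estimated by sampling v ~ N(0, I) and transforming w = m + L v
D, S = 3, 100000
m = np.random.randn(D)
V = 0.5 * np.eye(D)
L = np.linalg.cholesky(V)
f = lambda w: np.sum(w**2, axis=-1)        # any function of the weights (made up here)

v = np.random.randn(S, D)
w = m + v @ L.T                            # each row has mean m and covariance L L^T = V
print(np.mean(f(w)))                       # Monte Carlo estimate of the expectation
print(m @ m + np.trace(V))                 # analytic E[w^T w] for comparison
```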
1-D posterior:
Setting hyperparameters:
Regularising:
Author: s1680642
Licensing:
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.