These are my review notes for the DME course (Data Mining and Exploration, INFR11007, 2019) at the University of Edinburgh. Remember to read the 'Lab' section of each chapter.
Non-robust Measure
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i}$
Robust Measure
Median: $$ \operatorname{median}(x) = \begin{cases} x_{((n+1)/2)}& \text{if $n$ is odd}\\ \frac{1}{2}\left[x_{(n/2)}+x_{(n/2+1)}\right]& \text{if $n$ is even} \end{cases} $$
Mode: the value that occurs most frequently
Example
import numpy as np
import pandas as pd
set1 = np.array([0, 1, 1, 1, 2, 3, 4, 4, 5, 9])
set2 = np.array([0, 1, 1, 1, 2, 3, 4, 4, 5, 9000])
print('Set 1: {}'.format(', '.join(list(map(str, set1)))))
print('Set 2: {}'.format(', '.join(list(map(str, set2)))))
d = {
    'mean': [set1.mean(), set2.mean()],
    'median': [np.median(set1), np.median(set2)],
    '$Q_1$': [np.quantile(set1, 0.25), np.quantile(set2, 0.25)],
    '$Q_2$': [np.quantile(set1, 0.5), np.quantile(set2, 0.5)],
    '$Q_3$': [np.quantile(set1, 0.75), np.quantile(set2, 0.75)]
}
pd.DataFrame(data=d, index=['Set 1','Set 2'])
Non-robust Measure: variance, standard deviation
Robust Measure: median absolute deviation (MAD), interquartile range (IQR)
from scipy.stats import iqr
def mad(x):
    # median absolute deviation: a robust measure of spread
    return np.median(np.abs(x - np.median(x)))

d = {
    'variance': [set1.var(), set2.var()],
    'std': [set1.std(), set2.std()],
    'MAD': [mad(set1), mad(set2)],
    'IQR': [iqr(set1), iqr(set2)]
}
pd.DataFrame(data=d, index=['Set 1','Set 2'])
Non-robust Measure: skewness, kurtosis
Robust Measure: Galton's skewness measure, robust (octile-based) kurtosis
from scipy.stats import skew, kurtosis
def Galton(x):
    # Galton's robust skewness measure, based on the quartiles
    q1 = np.quantile(x, 0.25)
    q2 = np.quantile(x, 0.5)
    q3 = np.quantile(x, 0.75)
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)

def robust_kurt(x):
    # robust kurtosis measure, based on the octiles
    o1 = np.quantile(x, 1/8)
    o2 = np.quantile(x, 1/4)
    o3 = np.quantile(x, 3/8)
    o5 = np.quantile(x, 5/8)
    o6 = np.quantile(x, 3/4)
    o7 = np.quantile(x, 7/8)
    return ((o7 - o5) + (o3 - o1)) / (o6 - o2)

d = {
    'skewness': [skew(set1), skew(set2)],
    'Galton': [Galton(set1), Galton(set2)],
    'kurtosis': [kurtosis(set1), kurtosis(set2)],
    'robustKurt': [robust_kurt(set1), robust_kurt(set2)]
}
pd.DataFrame(data=d, index=['Set 1','Set 2'])
Sample Covariance: $$\text{Cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_{i} - \bar{x}) (y_{i} - \bar{y})$$
Pearson's Correlation Coefficient: $$\rho(x,y) = \frac{\text{Cov}(x,y)}{\text{Std}(x)\,\text{Std}(y)}$$
Covariance Matrix: $$Cov[X] = \mathbb{E}[(X-\mathbb{E}[X])(X-\mathbb{E}[X])^{T}]$$
Correlation Matrix:$$\rho(X) = diag\left( \frac{1}{std(X)} \right) Cov[X]diag\left( \frac{1}{std(X)} \right)$$
Rank Correlation - Kendall's $\tau$: $$\tau(x,y) = \frac{n_{c}(x,y) - n_{d}(x,y)}{n(n-1)/2}$$
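A quick numerical illustration of these pairwise measures on toy vectors (a minimal sketch; note that scipy's kendalltau computes the tie-corrected tau-b, which coincides with the formula above when there are no ties):
import numpy as np
from scipy.stats import kendalltau, pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# sample covariance with 1/n normalisation (bias=True), as in the formula above
cov_xy = np.cov(x, y, bias=True)[0, 1]

# Pearson's correlation coefficient
rho, _ = pearsonr(x, y)

# Kendall's tau (rank correlation)
tau, _ = kendalltau(x, y)

print(cov_xy, rho, tau)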
import seaborn as sns
sns.set()
iris = sns.load_dataset("iris")
titanic = sns.load_dataset("titanic")
titanic.head()
sns.catplot(x="sex", y="survived", hue="class", kind="bar", data=titanic);
g = sns.catplot(x="fare", y="survived", row="class",
kind="box", orient="h", height=1.5, aspect=4,
data=titanic.query("fare > 0"))
g.set(xscale="log");
sns.distplot(titanic.age, bins=40, kde=False, color="r");
sns.kdeplot(titanic.age, shade=True, color="r")
ax = sns.violinplot(x="class", y="age", data=titanic)
Normalising data to have 0 (sample) mean and unit (sample) variance:
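A minimal numpy sketch (assuming the samples are stored in the rows of the matrix): subtract the per-feature sample mean and divide by the per-feature sample standard deviation. The toy data here is arbitrary.
import numpy as np

X = np.random.randn(100, 5) * 3.0 + 10.0            # toy data, one sample per row

# standardise each feature to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))  # ~0 and ~1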
Principal Component (PC) direction: $\boldsymbol{w}$; projected data: $\boldsymbol{w}^{T} \boldsymbol{x}$
Consider the eigenvalue decomposition of the covariance matrix $\Sigma$: $\Sigma = U \Lambda U^{T}$
Let $\boldsymbol{w_{1}} = \sum_{i=1}^{n} a_{i} \boldsymbol{u_{i}} = U \boldsymbol{a}$, then
\begin{aligned} & \boldsymbol{w_{1}}^T \Sigma \boldsymbol{w_{1}} = \sum_{i=1}^{n} a_{i}^{2} \lambda_{i} \\ & ||\boldsymbol{w_{1}}||^{2} = \boldsymbol{w_{1}}^{T} \boldsymbol{w_{1}} = \sum_{i=1}^{n} a_{i}^{2} = 1 \end{aligned}Thus, the optimisation problem can be written as:
\begin{aligned} & {\text{maximise}} & & \sum_{i=1}^{n} a_{i}^{2} \lambda_{i} \\ & \text{subject to} & & \sum_{i=1}^{n} a_{i}^{2} = 1 \end{aligned}$\boldsymbol{a} = (1, 0, \dots, 0)^T$ is the unique solution if $\lambda_{1} > \lambda_{i}$ for all $i > 1$.
So the first PC direction is
$$\boldsymbol{w_{1}} = U \boldsymbol{a} = \boldsymbol{u_{1}}$$, i.e. the first PC direction is given by the first eigenvector, $\boldsymbol{u_{1}}$, of $\Sigma$ corresponding to the first (largest) eigenvalue $\lambda_{1}$.
First PC scores: $\boldsymbol{z_{1}}^{T} = \boldsymbol{w_{1}}^{T} X_{d \times n}$
Subsequent PC Direction $\boldsymbol{w_{m}}$:
Solution: similar to the previous procedure
$\boldsymbol{w_{m}} = \boldsymbol{u_{m}}$: the m-th PC direction is given by the m-th eigenvector of $\Sigma$ corresponding to the m-th largest eigenvalue $\lambda_{m}$.
$Var(z_{m}) = \lambda_{m}$, $\mathbb{E}(z_{m}) = 0$
PCs (scores) are uncorrelated: $Cov(z_{i}, z_{j}) = 0$ for $i \neq j$.
Projection Matrix: $$ P = \sum_{i=1}^{k}\boldsymbol{w_{i}} \boldsymbol{w_{i}}^{T} = W_{k} W_{k}^{T} $$ , where $W_{k} = (\boldsymbol{w_{1}}, \dots, \boldsymbol{w_{k}})$ is a $d \times k$ matrix.
Approximation of $\boldsymbol{x}$ in the subspace: $\boldsymbol{\hat{x}} = P \boldsymbol{x} = \sum_{i=1}^{k}\boldsymbol{w_{i}} \boldsymbol{w_{i}}^{T} \boldsymbol{x}$
Approximation Error: $\mathbb{E}||\boldsymbol{x} - P \boldsymbol{x}||^2 = \mathbb{E}||\boldsymbol{x} - W_{k} W_{k}^T \boldsymbol{x}||^2 = \mathbb{E}||\boldsymbol{x} - \sum_{i=1}^{k}\boldsymbol{w_{i}} \boldsymbol{w_{i}}^T \boldsymbol{x}||^2$
Optimisation Problem:
\begin{aligned} & \underset{W_{k}}{\text{minimise}} & & \mathbb{E}||\boldsymbol{x} - W_{k} W_{k}^T \boldsymbol{x}||^2 \\ & \text{subject to} & & W_{k}^{T} W_{k} = I_{k} \end{aligned}
So, the solution is $W_{k} = U_{k} = (\boldsymbol{u_{1}}, \dots, \boldsymbol{u_{k}})$, the top $k$ eigenvectors of $\Sigma$, and the minimal approximation error is $\sum_{i=k+1}^{d} \lambda_{i}$.
Relative Approximation Error: $$ \frac{\mathbb{E}||\boldsymbol{x} - U_{k} U_{k}^T \boldsymbol{x}||^2}{\mathbb{E}||\boldsymbol{x}||^2} = 1 - \frac{\sum_{i=1}^{k} \lambda_{i}}{\sum_{i=1}^{d} \lambda_{i}} = 1 - \text{fraction of variance explained} $$
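A minimal numpy sketch of this derivation on toy data stored as a $d \times n$ matrix (one column per sample): center, eigendecompose the covariance matrix, project onto the top $k$ eigenvectors, and check the fraction of variance explained. The toy data and the choice of $k$ are arbitrary.
import numpy as np

rng = np.random.RandomState(0)
X_tilde = rng.randn(5, 200) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])[:, None]  # toy d x n data

X = X_tilde - X_tilde.mean(axis=1, keepdims=True)   # center the data
Sigma = X @ X.T / X.shape[1]                        # sample covariance matrix (1/n convention)

lam, U = np.linalg.eigh(Sigma)                      # eigh returns eigenvalues in ascending order
order = np.argsort(lam)[::-1]                       # sort into descending order
lam, U = lam[order], U[:, order]

k = 2
W_k = U[:, :k]                                      # first k PC directions
Z = W_k.T @ X                                       # k x n matrix of PC scores

frac_var = lam[:k].sum() / lam.sum()                # fraction of variance explained
rel_err = 1 - frac_var                              # relative approximation error
print(frac_var, rel_err)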
Gram Matrix: $$G = X^T X \text{, where } (G)_{ij} = \boldsymbol{x_i}^T\boldsymbol{x_j}$$
According to the SVD of $X$: $$ G = X^T X = (USV^T)^T(USV^T) = V S^T U^T U S V^T = VS^T SV^T = V \tilde{\Lambda} V^T = \sum_{i=1}^{n} s_i^2 \boldsymbol{v_i} \boldsymbol{v_i}^T $$
Thus, the best rank-$k$ approximation of $G$ is $\hat{G} = \sum_{i=1}^{k} s_i^2 \boldsymbol{v_i} \boldsymbol{v_i}^T$.
Advantages: the Gram matrix only involves inner products between data points, so PC scores can still be computed when only inner products (kernels) or pairwise distances are available (see below); moreover, $G$ is $n \times n$, which is cheaper to work with than the $d \times d$ covariance matrix when $d \gg n$.
Probabilistic Model: $\boldsymbol{x} = W\boldsymbol{z} + \boldsymbol{\mu} + \boldsymbol{\epsilon}$, with latent variable $\boldsymbol{z} \sim \mathcal{N}(\boldsymbol{0}, I_k)$ and noise $\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \sigma^2 I_d)$.
Joint, Conditional and Observation Distribution
Conditional Distribution:
$$ p(\boldsymbol{x}|\boldsymbol{z}) = \mathcal{N}(\boldsymbol{x};\; W \boldsymbol{z} + \boldsymbol{\mu},\; \sigma^2I_{d}) $$
Joint Distribution:
\begin{aligned} p(\boldsymbol{z},\; \boldsymbol{x}) & = p(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z}) = \mathcal{N}(\boldsymbol{x};\; W \boldsymbol{z} + \boldsymbol{\mu},\; \sigma^2I_{d})\, \mathcal{N}(\boldsymbol{z};\; 0,\; I_k)\\ & = \frac{1}{\text{const}}\exp \left[ -\frac{1}{2} \left[(\boldsymbol{x} - W \boldsymbol{z} - \boldsymbol{\mu})^{T} \left(\frac{1}{\sigma^2}I_{d}\right) (\boldsymbol{x} - W \boldsymbol{z} - \boldsymbol{\mu}) + \boldsymbol{z}^{T} \boldsymbol{z}\right] \right] \end{aligned}
Observation Distribution: $p(\boldsymbol{x}) = \int p(\boldsymbol{x}|\boldsymbol{z})\, p(\boldsymbol{z})\, d\boldsymbol{z} = \mathcal{N}(\boldsymbol{x};\; \boldsymbol{\mu},\; WW^{T} + \sigma^{2}I_{d})$
Important Equations: for a multivariate normal distribution, the exponent satisfies
\begin{aligned} -\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x}-\boldsymbol{\mu}) & = -\frac{1}{2}\boldsymbol{x}^T \Sigma^{-1} \boldsymbol{x} + \boldsymbol{x}^{T} \Sigma^{-1}\mu + const\\ & = -\frac{1}{2}\boldsymbol{x}^T A \boldsymbol{x} + \boldsymbol{x}^{T} \xi + const \end{aligned}
Thus, matching the quadratic and linear terms gives $\Sigma = A^{-1}$ and $\boldsymbol{\mu} = \Sigma \ \xi$ .
Another option to find $W$ and $\sigma^2$ is the EM algorithm.
Relation to PCA:
Posterior distribution of the latent variable: $p(\boldsymbol{z}|\boldsymbol{x}) = \mathcal{N}\left(\boldsymbol{z};\; M^{-1}W^{T}(\boldsymbol{x}-\boldsymbol{\mu}),\; \sigma^{2}M^{-1}\right)$, where $M = W^T W + \sigma^2 I$ .
With the maximum-likelihood estimates, the posterior mean is $M_{ML}^{-1}W_{ML}^{T}(\boldsymbol{x}-\boldsymbol{\mu})$, where $M_{ML} = W_{ML}^{T} W_{ML} + \sigma^{2}I \;$ and $\; W_{ML} = U_k (\Lambda_k - \sigma^2 I)^{\frac{1}{2}}$ (up to a rotation). As $\sigma^{2} \to 0$, the posterior mean tends to $\Lambda_k^{-1/2} U_k^{T} (\boldsymbol{x}-\boldsymbol{\mu})$, i.e. whitened PCA scores, which is the sense in which probabilistic PCA recovers PCA.
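A minimal numpy sketch of these closed-form quantities on toy data. Note that the formula used for the ML noise variance (the average of the discarded eigenvalues) is taken from Tipping and Bishop's paper rather than from the notes above, so treat it as an assumption of this sketch rather than the course's reference implementation.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(5, 200) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])[:, None]
X = X - X.mean(axis=1, keepdims=True)                    # centred d x n toy data

lam, U = np.linalg.eigh(X @ X.T / X.shape[1])            # eigendecomposition of the sample covariance
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

k = 2
sigma2_ml = lam[k:].mean()                               # ML noise variance: average of discarded eigenvalues (assumed, from Tipping & Bishop)
W_ml = U[:, :k] * np.sqrt(lam[:k] - sigma2_ml)           # W_ML = U_k (Lambda_k - sigma^2 I)^(1/2), up to a rotation
M_ml = W_ml.T @ W_ml + sigma2_ml * np.eye(k)             # M_ML as defined above
Z_post = np.linalg.solve(M_ml, W_ml.T @ X)               # posterior means E[z|x] of the latent variables
print(sigma2_ml, Z_post.shape)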
Observed (uncentered) data: $\tilde{X} = (\boldsymbol{x_1}, \boldsymbol{x_2}, \dots, \boldsymbol{x_n})_{d \times n}$
Center data: $X = \tilde{X} C_n$ , where $C_n = I_{n} - \frac{1}{n} 1_n 1_n^{T}\ $ .
Option 1 - compute PC scores via the eigenvalue decomposition of the covariance matrix:
\begin{aligned} \Sigma & = \frac{1}{n}X X^T = U \Lambda U^T \end{aligned}
Denote by $U_k$ the matrix of the first $k$ eigenvectors of $\Sigma$ corresponding to the top $k$ eigenvalues: $U_k = (\boldsymbol{u_1}, \boldsymbol{u_2}, \dots, \boldsymbol{u_k})_{d \times k}$
PC scores:
\begin{aligned} \underset{k \times 1}{\boldsymbol{z}_i} = \underset{k \times d}{U_k^T} \; \underset{d \times 1}{\boldsymbol{x}_i} , & & \underset{k \times n}{Z} = \underset{k \times d}{U_k^T} \; \underset{d \times n }{X} \end{aligned}
Option 2 - compute PC scores via Gram Matrix:
\begin{aligned} G = X^T X = (USV^T)^T(USV^T) = V S^T U^T U S V^T = VS^T SV^T = V \tilde{\Lambda} V^T \end{aligned}
\begin{aligned} \underset{k \times n}{Z} = \underset{k \times k}{\sqrt{\tilde{\Lambda}_k}} \underset{k \times n}{V_k^T}, & & V_k = (\boldsymbol{v}_1, \dots, \boldsymbol{v}_k) \end{aligned}
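A sketch comparing the two options on toy data; the resulting PC scores agree up to the sign of each component. The data dimensions are arbitrary choices.
import numpy as np

rng = np.random.RandomState(1)
X_tilde = rng.randn(4, 50)                               # d x n uncentred toy data
n = X_tilde.shape[1]

C_n = np.eye(n) - np.ones((n, n)) / n                    # centering matrix
X = X_tilde @ C_n                                        # centred data

k = 2
# Option 1: eigendecomposition of the covariance matrix
lam, U = np.linalg.eigh(X @ X.T / n)
order = np.argsort(lam)[::-1]
U_k = U[:, order[:k]]
Z1 = U_k.T @ X

# Option 2: eigendecomposition of the Gram matrix G = X^T X
g_lam, V = np.linalg.eigh(X.T @ X)
g_order = np.argsort(g_lam)[::-1]
V_k = V[:, g_order[:k]]
Z2 = np.sqrt(g_lam[g_order[:k]])[:, None] * V_k.T        # Z = sqrt(Lambda_k) V_k^T

# the two sets of scores agree up to the sign of each component
print(np.allclose(np.abs(Z1), np.abs(Z2)))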
If only the squared distances $\delta_{ij}^2$ between data points $\tilde{\boldsymbol{x_i}}$ and $\tilde{\boldsymbol{x_j}}$ are given:
$$ \delta_{ij}^2 = ||\tilde{\boldsymbol{x_i}} - \tilde{\boldsymbol{x_j}}||^2 = (\tilde{\boldsymbol{x_i}} - \tilde{\boldsymbol{x_j}})^T (\tilde{\boldsymbol{x_i}} - \tilde{\boldsymbol{x_j}}) $$The distance matrix $\Delta$ contains the elements $\delta_{ij}^2 \ $.
$$ \delta_{ij}^2 = ||(\tilde{\boldsymbol{x_i}} -\mu) - (\tilde{\boldsymbol{x_j}} - \mu)||^2 = ||\boldsymbol{x_i} - \boldsymbol{x_j}||^2 = (\boldsymbol{x_i} - \boldsymbol{x_j})^T(\boldsymbol{x_i} - \boldsymbol{x_j})\\ \delta_{ij}^2 = ||\boldsymbol{x_i}||^2 + ||\boldsymbol{x_j}||^2 -2\boldsymbol{x_i}^T \boldsymbol{x_j} $$Centring the distances recovers the Gram matrix:
$$ (C_n \Delta C_n)_{ij} = (\Delta C_n)_{ij} - \frac{1}{n} \sum_{i} (\Delta C_n)_{ij} = - 2\boldsymbol{x_i}^T \boldsymbol{x_j}\\ G = -\frac{1}{2}C_n \Delta C_n $$Kernel PCA: transform the data with a feature map $\phi(\boldsymbol{x}_i)$ to obtain a new data matrix $\Phi$:
$$ \Phi = (\phi_1, \phi_2, \dots, \phi_n) = (\phi(\boldsymbol{x}_1), \phi(\boldsymbol{x}_2), \dots, \phi(\boldsymbol{x}_n)) $$Kernel Trick: inner products of the transformed data can be computed directly with a kernel function:
$$ \phi(\boldsymbol{x}_i)^T \phi(\boldsymbol{x}_j) = k(\boldsymbol{x}_i, \boldsymbol{x}_j) $$The uncentred Gram matrix $\tilde{G}$ of $\Phi$ therefore has elements
$$ (\tilde{G})_{ij} = \phi(\boldsymbol{x}_i)^T \phi(\boldsymbol{x}_j) = k(\boldsymbol{x}_i, \boldsymbol{x}_j) $$Example Kernel: the Gaussian (RBF) kernel $k(\boldsymbol{x}_i, \boldsymbol{x}_j) = \exp\left(-\frac{||\boldsymbol{x}_i - \boldsymbol{x}_j||^2}{2\sigma^2}\right)$.
Then apply the methods in Sec 3.1.2 and Sec 3.1.1 to compute the PC scores.
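A minimal kernel PCA sketch following this recipe (kernel matrix, double centering, eigendecomposition); the Gaussian kernel width and the toy data are arbitrary choices.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(2)
X_tilde = rng.randn(3, 60)                               # d x n toy data (columns = samples)
n = X_tilde.shape[1]

sigma = 1.0                                              # kernel width (illustrative choice)
D2 = cdist(X_tilde.T, X_tilde.T, 'sqeuclidean')          # pairwise squared distances
G_tilde = np.exp(-D2 / (2 * sigma ** 2))                 # uncentred kernel (Gram) matrix

C_n = np.eye(n) - np.ones((n, n)) / n
G = C_n @ G_tilde @ C_n                                  # centred Gram matrix

lam, V = np.linalg.eigh(G)
order = np.argsort(lam)[::-1]
k = 2
Z = np.sqrt(lam[order[:k]])[:, None] * V[:, order[:k]].T # k x n kernel PC scores
print(Z.shape)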
Metric MDS. Assumption: the numerical values of the dissimilarities $\delta_{ij}$ (e.g. Euclidean distances) carry information.
Optimisation Problem:
\begin{aligned} \underset{\mathbf z_1, \ldots, \mathbf z_n}{\text{minimise}} \quad \sum_{i<j} w_{ij} ( \| \mathbf z_i - \mathbf z_j \| - \delta_{ij})^2 \end{aligned}Non-metric MDS. Assumption: only the ordering of the $ \delta_{ij} $ matters, i.e. whether $ \delta_{12} > \delta_{13} $ or $ \delta_{12} < \delta_{13} $.
Optimisation Problem:
\begin{aligned} \underset{\boldsymbol{z_1}, \boldsymbol{z_2}, \dots, \boldsymbol{z_n}, f}{\text{minimise}}& & \sum_{i < j} w_{ij} (||\boldsymbol{z}_i - \boldsymbol{z}_j|| - f(\delta_{ij}))^2 \end{aligned}with $f$ a monotonically increasing function. Classical MDS. Assumption: the numerical values of $\ \delta_{ij} \ $ matter.
Dissimilarities $\ \delta_{ij} \ $ are (squared) Euclidean distances between some unknown vectors.
Distance matrix $\ \Delta \ $ is formed by $\ \delta_{ij} \ $
Using the method in Sec 3.1.3:
Compute the top $k$ eigenvalues $\ \sigma_1^2, \dots, \sigma_k^2 \ $ and corresponding eigenvectors $\ \boldsymbol{v}_1, \dots, \boldsymbol{v}_k \ $ of $\ G \ $, and form $\ \tilde{\Lambda}_k = diag(\sigma_1^2, \sigma_2^2, \dots, \sigma_k^2) \ $ and $\ V_k = (\boldsymbol{v}_1, \boldsymbol{v}_2, \dots, \boldsymbol{v}_k)_{n \times k}$
$\underset{k \times n}{Z} = \underset{k \times k}{\sqrt{\tilde{\Lambda}_k}} \; \underset{k \times n}{V_k^T}$
The Gram matrix $\ G = -\frac{1}{2} C_n \Delta C_n \ $ is not necessarily positive semi-definite when the dissimilarities are not Euclidean; thus, some eigenvalues might be negative.
Solution: choose $\ k \ $small enough to avoid negative eigen values.
The classical MDS solution for $\ k' < k \ $ is directly given by the first $\ k' \ $ coordinates of the $\ k \ $-dimensional $\ \boldsymbol{z} \ $.
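A sketch of classical MDS starting from a squared-distance matrix only; the hidden 2-D points are used solely to generate the distances, and the embedding reproduces them up to rotation and reflection.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(3)
points = rng.randn(30, 2)                                # hidden 2-D points, used only to build distances
Delta = cdist(points, points, 'sqeuclidean')             # matrix of squared distances delta_ij^2
n = Delta.shape[0]

C_n = np.eye(n) - np.ones((n, n)) / n
G = -0.5 * C_n @ Delta @ C_n                             # Gram matrix recovered by double centering

lam, V = np.linalg.eigh(G)
order = np.argsort(lam)[::-1]
k = 2                                                    # keep k small enough that eigenvalues are non-negative
Z = np.sqrt(lam[order[:k]])[:, None] * V[:, order[:k]].T # k x n embedding

# the embedding reproduces the pairwise squared distances
print(np.allclose(cdist(Z.T, Z.T, 'sqeuclidean'), Delta))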
Steps of Isomap: build a neighbourhood graph on the data (e.g. connect each point to its nearest neighbours), compute geodesic distances as shortest-path distances in this graph, then apply classical MDS to the resulting distance matrix.
The geodesic distance between two points is the shortest distance between them when one is only allowed to travel on the data manifold, from one neighbouring data point to the next.
Isomap represents circular structure well when the learned neighbourhood graph is connected.
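scikit-learn's Isomap bundles these steps (neighbourhood graph, shortest-path geodesic distances, classical MDS); a minimal usage sketch on a toy Swiss-roll dataset, where the number of neighbours is an arbitrary choice.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, colour = make_swiss_roll(n_samples=500, random_state=0)   # toy data, one sample per row

embedding = Isomap(n_neighbors=10, n_components=2)
Z = embedding.fit_transform(X)                               # 2-D coordinates preserving geodesic distances
print(Z.shape)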
For a prediction function $\hat{h}$, the expected prediction loss is
$$ \mathcal{J}(\hat{h}) = \mathbb{E}_{\boldsymbol{x}, \ y} \left[ \mathcal{L}(\hat{h}(\boldsymbol{x}), \ y) \right] $$For an algorithm $\mathcal{A}$, taking the expectation over training sets as well:
$$ \bar{\mathcal{J}}(\mathcal{A}) = \mathbb{E}_{D^{train}}\left[ \mathcal{J}(\hat{h}) \right] = \mathbb{E}_{D^{train}}\left[ \mathcal{J}(\mathcal{A}(D^{train})) \right] $$See DME Lecture Notes for more details.
Overfitting: the prediction (generalisation) loss decreases when the model complexity is reduced.
Underfitting: the prediction loss decreases when the model complexity is increased.
Solutions: model selection or regularisation.
Regularisation:
\begin{aligned} & \text{minimise} & & \mathcal{J}_{\boldsymbol{\lambda}}(\boldsymbol{\theta}) + \lambda_{reg} R(\boldsymbol{\theta}) \end{aligned}
Both model complexity and the size of the training data affect generalisation performance; see the example in Section 4.2.3 of the DME Lecture Notes.
We typically need to estimate the generalisation performance twice: Once for hyperparameter selection, and once for final performance evaluation.
Prediction function:
\begin{aligned} \hat{h} = \mathcal{A}(D^{train}) \end{aligned}Prediction loss on a test/validation set $\ \tilde{D} \ $ with $n$ points:
\begin{aligned} \hat{\mathcal{J}}(\hat{h}: \ \tilde{D}) = \frac{1}{n}\sum_{i=1}^{n}\mathcal{L} \left( \hat{h}(\tilde{\boldsymbol{x}}_i), \ \tilde{y}_i \right) \end{aligned}K-fold cross-validation: construct $K$ pairs of $\ D^{train} \ $ and $\ D^{val} \ $.
\begin{aligned} & D_k^{train} = \cup_{i \neq k} D_i & & D_k^{val} = D_k \end{aligned}
$K$ prediction functions: obtained by training on the $K$ training sets.
\begin{aligned} \hat{h}_k = \mathcal{A}(D_{k}^{train}) \end{aligned}
$K$ performance estimates: evaluated on the $K$ validation sets.
\begin{aligned} \hat{\mathcal{J}}_k = \hat{\mathcal{J}}(\hat{h}_k : \ D_k^{val}) \end{aligned}
Cross-Validation (CV) Score: the average of the $K$ estimates $\ \hat{\mathcal{J}}_k \ $
\begin{aligned} CV = \frac{1}{K} \sum_{k=1}^{K}\hat{\mathcal{J}}_k = \frac{1}{K} \sum_{k=1}^{K}\hat{\mathcal{J}} \left(\mathcal{A} (D_k^{train}): D_k^{val} \right) = \hat{{\bar{\mathcal{J}}}} (\mathcal{A}) \end{aligned}
Estimate of the variability of the CV score:
\begin{aligned} Var(CV) \approx \frac{1}{K} Var(\hat{\mathcal{J}}_k), & & Var(\hat{\mathcal{J}}_k) \approx \frac{1}{K} \sum_{k=1}^{K} \left(\hat{\mathcal{J}}_k - CV\right)^2 \end{aligned}
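A sketch of the K-fold computations with scikit-learn; the ridge model, the toy regression data and the squared-error loss are placeholder choices.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

K = 5
losses = []
for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])                      # h_k = A(D_k^train)
    losses.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))      # J_k on D_k^val

losses = np.array(losses)
cv_score = losses.mean()                                  # CV score
var_cv = losses.var() / K                                 # rough variability estimate of the CV score
print(cv_score, var_cv)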
Hyperparameter selection with a validation set: for each candidate value of the tuning parameters $\ \boldsymbol{\lambda} \ $, estimate the prediction function on $\ D^{train} \ $:
\begin{aligned} \hat{h}_{\boldsymbol{\lambda}} = \mathcal{A}_{\boldsymbol{\lambda}} (D^{train}) \end{aligned}
Compute prediction loss $\ PL({\boldsymbol{\lambda}}) \ $ on $\ D^{val} \ $.
\begin{aligned} PL(\boldsymbol{\lambda}) = \hat{\mathcal{J}} (\hat{h}_{\boldsymbol{\lambda}}: \ D^{val}) \end{aligned}
and choose $\ \hat{\boldsymbol{\lambda}} \ $ by minimising $\ PL(\boldsymbol{\lambda}) \ $:
\begin{aligned} \hat{\boldsymbol{\lambda}} = \underset{\boldsymbol{\lambda}}{\text{argmin }} PL(\boldsymbol{\lambda}) \end{aligned}
Using $\ \hat{\boldsymbol{\lambda}} \ $, re-estimate $\ \boldsymbol{\theta} \ $ on the union of $\ D^{train} \ $ and $\ D^{val} \ $.
\begin{aligned} \hat{h} = \mathcal{A}_{\hat{\boldsymbol{\lambda}}} \left( D^{train} \cup D^{val} \right) \end{aligned}
Compute prediction loss on $\ D^{test} \ $.
\begin{aligned} \hat{\mathcal{J}} = \hat{\mathcal{J}}(\hat{h}:\ D^{test}) \end{aligned}
Finally, $\ \hat{h} \ $ can be re-estimated on all data $\ D \ $.
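A sketch of this hold-out procedure end to end; ridge regression, the grid of candidate lambda values, the toy data and the split proportions are placeholder choices.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# split into training, validation and test sets
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

lambdas = [0.01, 0.1, 1.0, 10.0]                                      # candidate tuning parameters
pl = [mean_squared_error(y_val, Ridge(alpha=lam).fit(X_train, y_train).predict(X_val))
      for lam in lambdas]                                             # PL(lambda) on D^val
lam_hat = lambdas[int(np.argmin(pl))]                                 # argmin of PL(lambda)

# re-estimate on the union of D^train and D^val with lambda_hat, then evaluate once on D^test
h_hat = Ridge(alpha=lam_hat).fit(np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))
test_loss = mean_squared_error(y_test, h_hat.predict(X_test))

# optionally, re-estimate on all data D for deployment
h_final = Ridge(alpha=lam_hat).fit(X, y)
print(lam_hat, test_loss)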
For each candidate $\ \boldsymbol{\lambda} \ $, compute the CV score on the remaining data $\ D^{train} \ $ (with $\ D^{test} \ $ held out):
$$ EPL(\boldsymbol{\lambda}) = CV $$
Choose $\ \hat{\boldsymbol{\lambda}} = \underset{\boldsymbol{\lambda}}{\text{argmin }}EPL(\boldsymbol{\lambda}) \ $
Re-estimate $\ \boldsymbol{\theta} \ $ on $\ D^{train} \ $ using $\ \hat{\boldsymbol{\lambda}} \ $.
$$ \hat{h} = \mathcal{A}_{\hat{\boldsymbol{\lambda}}} (D^{train}) $$
Compute prediction loss on $\ D^{test} \ $.
Assume $k$ different classes; the loss function $\ L(\hat{y}, \ y) \ $ can then be represented as a $\ k \times k \ $ matrix.
$$ L(\hat{y}, \ y) = \begin{bmatrix} L(1,1) & L(1,2) & \dots & L(1,k) \\ L(2,1) & L(2,2) & \dots & L(2,k) \\ \vdots & \vdots & \ddots & \vdots \\ L(k,1) & L(k,2) & \dots & L(k,k) \end{bmatrix} $$Zero-one loss: $L(i, j) = 1$ for $i \neq j$ and $L(i, i) = 0$, so the expected loss is the misclassification probability
$$ \mathbb{E}\left[L(\hat{y}, y)\right] = \sum_{i \neq j} p(i, j) = P(\hat{y} \neq y) $$, where $\ p(i,\ j) = p(\hat{y} = i, \ y = j) \ $
Binary classification:
| $\hat y$ | $y$ | event | joint probability $p(\hat y, y)$ | shorthand notation | rate | conditional probability $p(\hat{y}\lvert y)$ |
|---|---|---|---|---|---|---|
| 1 | 1 | True Positive | $\mathbb P(\hat y=1, y=1)$ | $p(1,1)$ | TP rate (sensitivity, hit rate, recall) | $\mathbb P(\hat{y}=1 \lvert y=1)$ |
| 1 | -1 | False Positive | $\mathbb P(\hat y=1, y=-1)$ | $p(1,-1)$ | FP rate (type 1 error, fall-out) | $\mathbb P(\hat{y}=1 \lvert y=-1)$ |
| -1 | 1 | False Negative | $\mathbb P(\hat y=-1, y=1)$ | $p(-1,1)$ | FN rate (type 2 error) | $\mathbb P(\hat{y}=-1 \lvert y=1)$ |
| -1 | -1 | True Negative | $\mathbb P(\hat y=-1, y=-1)$ | $p(-1,-1)$ | TN rate (specificity) | $\mathbb P(\hat{y}=-1 \lvert y=-1)$ |
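These rates can be estimated from a confusion matrix; a minimal sketch with scikit-learn, using the -1/1 labels of the table (the toy labels and predictions are made up).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, -1, -1, -1, -1, 1, -1, 1])
y_pred = np.array([1, -1, 1, -1, 1, -1, -1, 1, -1, -1])

# rows = true label, columns = predicted label, ordered as (-1, 1)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()

tpr = tp / (tp + fn)          # sensitivity / recall: P(yhat=1 | y=1)
fpr = fp / (fp + tn)          # fall-out:            P(yhat=1 | y=-1)
fnr = fn / (tp + fn)          # miss rate:           P(yhat=-1 | y=1)
tnr = tn / (fp + tn)          # specificity:         P(yhat=-1 | y=-1)
print(tpr, fpr, fnr, tnr)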
Loss function penalising the FP and FN rates:
$$\mathbf L_{\hat y, y} = \begin{pmatrix} 0 & \frac{1}{\mathbb P(y=-1)} \\ \frac{1}{\mathbb P(y=1)} & 0 \\ \end{pmatrix}$$so that the expected loss is $\frac{p(1,-1)}{\mathbb P(y=-1)} + \frac{p(-1,1)}{\mathbb P(y=1)} = \text{FP rate} + \text{FN rate}$. Receiver Operating Characteristic (ROC) curve
Minimising the false-positive (or false-negative) rate alone is not a very meaningful strategy: The reason is that the trivial classifier $\ h(x) = \hat{y} = −1 \ $ would be the optimal solution. But for such a classifier the true-positive rate would be zero.
The ROC curve visualises the trade-off between the true-positive rate (TPR) and the false-positive rate (FPR).
For simplicity, we consider here binary classification only. Let us assume that $\ \hat{y} ∈{−1,1} \ $ is given by
$$ \hat{y}(\boldsymbol{x}) = sign(h(\boldsymbol{x})) $$, where $\ h(\boldsymbol{x})\ $ is real-valued.
$$ \text{correct classification of } \boldsymbol{x} \iff y\,h(\boldsymbol{x}) > 0. $$
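scikit-learn computes the ROC curve directly from real-valued scores $h(\boldsymbol{x})$; a minimal sketch using a logistic regression's decision function as the score (the classifier and toy data are arbitrary choices).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.decision_function(X_test)                 # real-valued h(x); yhat = sign(h(x))

fpr, tpr, thresholds = roc_curve(y_test, scores)       # TPR vs FPR as the decision threshold varies
print(roc_auc_score(y_test, scores))                   # area under the ROC curve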
Author: Zhanhang Zeng, Chito Wong
Licensing:
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.