Statistical Learning

2019-05-31 | Edit: 2020-05-17 | Zhanhang (Matthew) ZENG | 34 minutes read (About 5078 words)

Summary notes on statistical learning methods. Not finished yet.

1. k-Nearest Neighbors

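Intuition:

Classification: find the k points in the data closest to the target point, and assign the target the majority class among those k points.
Regression: find the k points in the data closest to the target point, and use the mean of their values as the prediction for the target.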
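
The two rules above can be sketched in a few lines of NumPy. This is a minimal brute-force illustration, not a reference implementation: the function names, the Euclidean distance, and the toy data are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def k_nearest_indices(X_train, x_query, k):
    """Indices of the k training points closest to x_query (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    return np.argsort(distances)[:k]


def knn_classify(X_train, y_train, x_query, k=3):
    """Predict the majority class among the k nearest neighbors."""
    idx = k_nearest_indices(X_train, x_query, k)
    return Counter(y_train[idx].tolist()).most_common(1)[0][0]


def knn_regress(X_train, y_train, x_query, k=3):
    """Predict the mean target value of the k nearest neighbors."""
    idx = k_nearest_indices(X_train, x_query, k)
    return float(np.mean(y_train[idx]))


# Toy data (made up for illustration): two well-separated clusters on a line.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
targets = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])

predicted_class = knn_classify(X, labels, np.array([1.5]), k=3)
predicted_value = knn_regress(X, targets, np.array([10.5]), k=3)
print(predicted_class, predicted_value)
```

Ties in the vote are broken here by whichever class `Counter` encounters first; a real implementation would need an explicit tie-breaking policy.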
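Advantages:
- $k$-nearest neighbors is a non-parametric learning algorithm: it has no parameters to learn ($k$ is a hyperparameter, not a learned parameter).
- The nearest-neighbor model has very high capacity, which lets it reach high accuracy when the number of training samples is large.
Disadvantages:
1. High computational cost. A brute-force implementation builds an $N \times N$ distance matrix, which takes $O(N^2)$ distance computations, where $N$ is the number of training samples.
2. When the dataset contains billions of samples, this amount of computation is unacceptable.
3. When the training set is small, generalization is poor and the model overfits very easily.
4. It cannot assess the importance of individual features.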
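
To make the quadratic cost concrete, the sketch below builds the full pairwise distance matrix for a small random dataset. The sizes `N` and `d` and the random data are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5                      # hypothetical sample count and feature dimension
X = rng.normal(size=(N, d))

# Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
sq_norms = np.sum(X**2, axis=1)
D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
D = np.sqrt(np.maximum(D2, 0.0))    # clip tiny negatives caused by rounding

print(D.shape)                      # one row and one column per sample
print(D.nbytes / 1e6, "MB")         # 8 bytes per float64 entry, N * N entries
```

With `float64` entries the matrix takes $8N^2$ bytes, so storage alone grows a hundredfold every time $N$ grows tenfold.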

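1.1 The k-Nearest Neighbor Model

The model is determined by three basic elements: the distance metric, the choice of k, and the classification decision rule.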
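
As an illustration of two of these elements, one common choice of distance metric is the $L_p$ (Minkowski) distance, and the usual decision rule is majority voting. The notation here is assumed for the example and does not come from the excerpt: $n$ is the number of features, $x^{(l)}$ is the $l$-th feature of $x$, $N_k(x)$ is the set of the $k$ nearest training points to $x$, $c_j$ ranges over the classes, and $I(\cdot)$ is the indicator function.

$$ L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^{p} \right)^{1/p}, \quad p \ge 1 $$

$$ y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j) $$

Setting $p = 2$ gives the Euclidean distance and $p = 1$ the Manhattan distance.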

#Data Mining Machine Learning Statistical Learning

DME - Data Mining and Exploration (INFR 11007) Review

2019-05-14 | Edit: 2020-07-21 | Zhanhang (Matthew) ZENG | 34 minutes read (About 5027 words)


These are my review notes for the DME course (Data Mining and Exploration (INFR11007), 2019) at the University of Edinburgh. The notes cover every step of developing machine learning models and the related background, e.g., Exploratory Data Analysis (EDA), Data Preprocessing, Modeling, and Model Evaluation. Remember to read the 'Lab' section of each chapter.

1. Exploratory Data Analysis

1.1 Numerical Data Description

1.1.1 Location

Non-robust Measure
- Sample Mean (arithmetic mean or average): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i}$
  - for a random variable: $\mathbb{E}[x] = \int x\,p(x)\,dx$
Robust Measure
- Median:
  $$ \operatorname{median}(x) = \begin{cases} x_{((n+1)/2)} & \text{if $n$ is odd}\\ \frac{1}{2}\left[x_{(n/2)}+x_{(n/2+1)}\right] & \text{if $n$ is even} \end{cases} $$
  where $x_{(i)}$ denotes the $i$-th smallest value in the sample.
- Mode: the value that occurs most frequently
- $\alpha$-th Sample Quantile: roughly the data point $q_{\alpha} \approx x_{([n\alpha])}$
  - $Q_{1} = q_{0.25}$, $Q_{2} = q_{0.5}$, $Q_{3} = q_{0.75}$
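
The location measures above can be compared on a small sample. The following is a sketch with made-up data; the helper name `summarize` is hypothetical. It also shows why the mean is called non-robust: a single extreme value moves it a long way, while the median and mode barely change.

```python
from collections import Counter

import numpy as np


def summarize(sample):
    """Return the mean, median, mode, and quartiles of a 1-D sample."""
    mode_value, _ = Counter(sample.tolist()).most_common(1)[0]
    q1, q2, q3 = np.quantile(sample, [0.25, 0.5, 0.75])
    return {
        "mean": float(np.mean(sample)),
        "median": float(np.median(sample)),
        "mode": mode_value,
        "Q1": float(q1),
        "Q2": float(q2),
        "Q3": float(q3),
    }


x = np.array([2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 7.0])   # made-up sample
x_outlier = np.append(x, 1000.0)                     # same sample plus one extreme value

clean = summarize(x)
contaminated = summarize(x_outlier)
print(clean)
print(contaminated)
```

Note that `np.quantile` interpolates linearly between data points by default, so its quartiles can differ slightly from the rough rule $q_{\alpha} \approx x_{([n\alpha])}$.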

#Data Mining CoursesReview

Numpy&Pandas Tutorial

2019-05-10 | Edit: 2020-05-16 | Zhanhang (Matthew) ZENG | a minute read (About 198 words)


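NumPy and Pandas are important for data processing in Python. For data analysis and data mining in particular, Pandas is almost indispensable. I started this tutorial because in an interview I was asked which NumPy function removes duplicates, and I realized how unfamiliar I was with NumPy, so I hope writing this will help it stick... (haven't started yet)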
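
For the interview question mentioned above, the usual NumPy answer is `np.unique`. A minimal sketch follows, with a made-up array and hypothetical variable names:

```python
import numpy as np

a = np.array([3, 1, 2, 3, 1])

# np.unique returns the distinct values in sorted order.
unique_sorted = np.unique(a)

# return_counts=True also reports how often each distinct value occurs.
values, counts = np.unique(a, return_counts=True)

# return_index=True gives the first position of each value, which can be
# used to recover the distinct values in their original order of appearance.
_, first_positions = np.unique(a, return_index=True)
unique_in_order = a[np.sort(first_positions)]

print(unique_sorted, counts, unique_in_order)
```

In Pandas, the closest counterpart is `Series.drop_duplicates`, which keeps the first occurrence of each value and preserves the original order.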

#Python Cheat Sheet Data Mining Big Data
