Train XGBoost with Spark

2021-01-16 | Edit: 2021-01-16 | Zhanhang (Matthew) ZENG | a few seconds read (about 55 words)

Categories: Big Data / Machine Learning / Spark / Statistical Learning
Tags: Big Data, Spark, Machine Learning

# XGB training script
# run spark-shell on cluster

spark-shell --name xxx --num-executors 15 --executor-cores 4 --executor-memory 20G --jars /tmp/xgboost4j-0.82.jar,/tmp/xgboost4j-spark-0.82.jar --driver-class-path /tmp/xgboost4j-0.82.jar:/tmp/xgboost4j-spark-0.82.jar

// import dependencies (Scala, inside the spark-shell session)
import org.apache.spark.sql.types._
import scala.collection.mutable.ArrayBuilder


Building K-Means with Spark

2020-12-18 | Edit: 2021-01-05 | Zhanhang (Matthew) ZENG | 8 minutes read (about 1238 words)

Categories: Big Data / Machine Learning / Spark / Statistical Learning
Tags: Big Data, Spark, Machine Learning

Industrial applications of machine learning generally require handling massive datasets. Spark provides a machine learning library, MLlib, that lets us build machine learning models efficiently and in parallel.

This post walks through a Spark ML modelling example, K-Means, written in Python with PySpark, and explains the basic steps and the Spark APIs involved in building an ML model on Spark.

For the complete code of the K-Means example, see Sec. 2, Spark K-Means code summarization.


Statistical Learning

2019-05-31 | Edit: 2020-05-17 | Zhanhang (Matthew) ZENG | 34 minutes read (about 5078 words)

Categories: Big Data / Machine Learning / Data Mining / Statistical Learning
Tags: Data Mining, Machine Learning, Statistical Learning

A summary of my notes on statistical learning methods (统计学习方法). Not finished yet.

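1. k-Nearest Neighbors (k近邻法)

Intuition:

Classification: find the k points in the data that are nearest to a given target point, and assign the target the majority class among those k points.
Regression: find the k points in the data that are nearest to the target point; the mean of their values is the prediction for the target.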
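
The two rules above can be sketched in a few lines of plain Python. This is a minimal illustration rather than code from these notes: it assumes Euclidean distance, and the function names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are hypothetical.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to query (Euclidean distance, an assumption)."""
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]

def knn_classify(train, query, k):
    """Classification: majority class among the k nearest neighbors."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k):
    """Regression: mean target value of the k nearest neighbors."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)

# Hypothetical toy data: two well-separated classes, and a 1-D regression set.
labeled = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((1.0, 0.0), "a"),
           ((5.0, 5.0), "b"), ((5.0, 6.0), "b"), ((6.0, 5.0), "b")]
valued = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

predicted_class = knn_classify(labeled, (0.5, 0.5), k=3)  # neighbors are all class "a"
predicted_value = knn_regress(valued, (2.1,), k=3)        # mean of 10, 20 and 30
```

Sorting the whole training set for every query keeps the sketch short; it is not how a large-scale implementation would search for neighbors.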
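Advantages:
- $k$-nearest neighbors is a non-parametric learning algorithm: it has no parameters to learn ($k$ is a hyperparameter, not a learned parameter).
- The nearest-neighbor model has very high capacity, which lets it reach high accuracy when the number of training samples is large.
Disadvantages:
1. The computational cost is high. An $N \times N$ distance matrix has to be built, which takes $O(N^2)$ computation, where $N$ is the number of training samples.
2. When the dataset contains billions of samples, this amount of computation is unacceptable.
3. With a small training set, generalization is poor and the model overfits very easily.
4. It cannot tell how important each feature is.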
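
To make the $O(N^2)$ cost concrete, the arithmetic below counts the entries of a dense $N \times N$ distance matrix and the memory they would need. It is a back-of-the-envelope sketch: the 8 bytes per entry (one double-precision float) and the helper name `pairwise_cost` are assumptions for illustration.

```python
def pairwise_cost(n_samples, bytes_per_entry=8):
    """Entries and bytes for a dense n_samples x n_samples distance matrix.

    bytes_per_entry=8 assumes one 64-bit float per distance.
    """
    entries = n_samples * n_samples
    return entries, entries * bytes_per_entry

for n in (10**3, 10**6, 10**9):
    entries, size_bytes = pairwise_cost(n)
    # 1 TB is taken as 10**12 bytes here.
    print(f"N = {n}: {entries} entries, {size_bytes / 10**12} TB")
```

Under these assumptions, one million samples already need $10^{12}$ entries, about 8 TB, which is why the quadratic cost rules out a dense matrix long before the billion-sample range.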

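1.1 The k-nearest-neighbor model

The model is determined by three basic elements: the distance metric, the choice of k, and the classification decision rule.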
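
One way to see the three elements is to make each of them an explicit argument. The sketch below is illustrative only: it assumes a Minkowski distance and a majority-vote rule as defaults, and the names `minkowski`, `majority_vote`, `knn_predict` and the toy data are hypothetical.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Element 1, the distance metric: L_p distance (p=2 is Euclidean, p=1 is Manhattan)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def majority_vote(neighbors):
    """Element 3, the decision rule: most common label among (distance, label) pairs."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def knn_predict(train, query, k, distance=minkowski, decide=majority_vote):
    """Combine the three elements: distance metric, k (element 2), and decision rule."""
    neighbors = sorted((distance(x, query), label) for x, label in train)[:k]
    return decide(neighbors)

# Hypothetical toy data: the single closest point is "a", but "b" dominates nearby.
train = [((1.0, 1.0), "a"), ((2.0, 1.5), "b"), ((2.0, 2.0), "b")]
query = (1.2, 1.2)

with_k1 = knn_predict(train, query, k=1)  # only the closest point votes
with_k3 = knn_predict(train, query, k=3)  # all three points vote
```

With this data the prediction flips from "a" to "b" when k goes from 1 to 3, which shows that the choice of k is part of the model and not only a tuning detail.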

