Train XGBoost with Spark

# XGB training script
# run spark-shell on cluster

spark-shell --name xxx --num-executors 15 --executor-cores 4 --executor-memory 20G --jars /tmp/xgboost4j-0.82.jar,/tmp/xgboost4j-spark-0.82.jar --driver-class-path /tmp/xgboost4j-0.82.jar,/tmp/xgboost4j-spark-0.82.jar

// import dependencies (run inside spark-shell)
import org.apache.spark.sql.types._
import scala.collection.mutable.ArrayBuilder

Building K-Means with Spark

Industry applications of machine learning generally require the ability to handle massive datasets. Spark provides a machine learning library, MLlib, that lets us build machine learning models efficiently and in parallel.

This post starts with a Spark ML modelling example, K-Means, written in Python with PySpark, and explains the basic steps and the Spark APIs used when building an ML model on Spark.

For the complete code of the K-Means example, please refer to Sec2. Spark K-Means code summarization.


Statistical Learning

A summary of my notes on statistical learning methods. Not finished yet.

1. k-Nearest Neighbors (k-NN)


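Intuition:
  • Classification: find the k points in the data that are closest to a given (target) point, and assign the target to the class held by the majority of those k points.
  • Regression: find the k points in the data that are closest to the target point; the mean of their values is the prediction for the target.

  • Advantages:

    • $k$-NN is a non-parametric learning algorithm: it has no parameters to learn ($k$ is a hyperparameter, not a learned parameter).
    • The nearest-neighbor model has very high capacity, which lets it reach high accuracy when the number of training samples is large.
  • Disadvantages:

    1. High computational cost. An $N \times N$ distance matrix has to be built, which costs $O(N^2)$, where $N$ is the number of training samples.
    2. When the dataset contains billions of samples, that amount of computation is unacceptable.
    3. With a small training set, generalization is poor and the model overfits very easily.
    4. It cannot assess the importance of individual features.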
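The classification and regression rules described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not code from the original notes: the function names (`k_nearest`, `knn_classify`, `knn_regress`), the Euclidean distance, and the toy data are all assumptions made for the example, and ties in the vote are broken arbitrarily.

```python
import math
from collections import Counter


def k_nearest(train_X, query, k):
    """Return the indices of the k training points closest to `query`.

    Assumes Euclidean distance; any other metric could be swapped in.
    """
    distances = sorted((math.dist(x, query), i) for i, x in enumerate(train_X))
    return [i for _, i in distances[:k]]


def knn_classify(train_X, train_y, query, k):
    """Classification: majority class among the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    votes = Counter(train_y[i] for i in neighbors)
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k):
    """Regression: mean target value of the k nearest neighbors."""
    neighbors = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbors) / len(neighbors)


# Hypothetical toy data: two well-separated clusters.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

predicted_class = knn_classify(X, labels, (0.5, 0.5), k=3)
predicted_value = knn_regress(X, values, (5.5, 5.5), k=3)
print(predicted_class, predicted_value)
```

Note that there is no training step: every prediction scans all training points, which is where the cost discussed in the disadvantages comes from.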

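1.1 The k-NN model

  • The model is determined by three basic elements: the distance metric, the choice of k, and the classification decision rule.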
