Train XGBoost with Spark

2021-01-16 | Edit: 2021-01-16 | Zhanhang (Matthew) ZENG | a few seconds read (about 55 words)

Big Data / Machine Learning / Spark / Statistical Learning Big Data,Spark,Machine Learning

# XGB training script
# run spark-shell on cluster (--jars takes a comma-separated list; --driver-class-path is a classpath, so its entries are colon-separated on Linux)

spark-shell --name xxx --num-executors 15 --executor-cores 4 --executor-memory 20G --jars /tmp/xgboost4j-0.82.jar,/tmp/xgboost4j-spark-0.82.jar --driver-class-path /tmp/xgboost4j-0.82.jar:/tmp/xgboost4j-spark-0.82.jar

// import dependencies (run inside the spark-shell session)
import org.apache.spark.sql.types._
import scala.collection.mutable.ArrayBuilder


Building K-Means with Spark

2020-12-18 | Edit: 2021-01-05 | Zhanhang (Matthew) ZENG | 8 minutes read (about 1238 words)

Big Data / Machine Learning / Spark / Statistical Learning Big Data,Spark,Machine Learning

Industry applications of machine learning generally require the ability to handle massive datasets. Spark provides a machine learning library, MLlib, that lets us build machine learning models efficiently and in parallel.

This post starts with a Spark ML modelling example, K-Means, written in Python with pyspark, and explains the basic steps and the Spark APIs used when building an ML model on Spark.

For the complete code of the K-Means example, please refer to Sec. 2, Spark K-Means code summarization.


Keras Notes

2019-06-12 | Edit: 2020-05-17 | Zhanhang (Matthew) ZENG | 6 minutes read (about 836 words)

Machine Learning / Deep Learning / Python Deep Learning Python

Notes on the essential Keras elements for building basic neural networks. The explanations for each section are not finished yet.

1. Single Layer Neural Network (Linear Regression)

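A single-layer neural network is equivalent to a (non)linear regression model. The first example builds the simplest possible model: a univariate linear regression.

Create data
A single-layer neural network needs data to train on, so we use numpy to create some synthetic data, where $y$ follows $y = ax + b$.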

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
import matplotlib.pyplot as plt
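plt.style.use('seaborn')  # this style is named 'seaborn-v0_8' on Matplotlib >= 3.6

# create data
X = np.linspace(-1, 1, 200)
np.random.shuffle(X)  # shuffle the sample order in place
Y = 2*X + 10 + np.random.normal(0, 0.05, (200,))  # y = 2x + 10 plus Gaussian noise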

# plot data
plt.scatter(X, Y)
plt.show()
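
As a rough illustration of what a single linear unit learns from this kind of data, the sketch below fits $y = ax + b$ by plain gradient descent on the mean squared error, using only the Python standard library. It is not the Keras model from this post: the function name `fit_line`, the learning rate, the epoch count and the fixed seed are all assumptions chosen for illustration.

```python
import random

random.seed(0)  # fixed seed (an assumption) so the sketch is reproducible

# Synthetic data following the setup above: y = 2x + 10 plus small Gaussian noise
n = 200
xs = [-1 + 2 * i / (n - 1) for i in range(n)]
ys = [2 * x + 10 + random.gauss(0, 0.05) for x in xs]


def fit_line(xs, ys, lr=0.1, epochs=500):
    """Fit y = a*x + b by full-batch gradient descent on mean squared error.

    fit_line, lr and epochs are hypothetical names/values for this sketch.
    """
    a, b = 0.0, 0.0
    m = len(xs)
    for _ in range(epochs):
        # Prediction error for every sample under the current a and b
        errors = [a * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to a and b
        grad_a = 2 * sum(e * x for e, x in zip(errors, xs)) / m
        grad_b = 2 * sum(errors) / m
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


a, b = fit_line(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}")
```

The learned `a` and `b` should land close to the true values 2 and 10; a single dense layer with one unit and no activation optimizes the same two parameters.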


Statistical Learning

2019-05-31 | Edit: 2020-05-17 | Zhanhang (Matthew) ZENG | 34 minutes read (about 5078 words)

Big Data / Machine Learning / Data Mining / Statistical Learning Data Mining Machine Learning Statistical Learning

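Summary notes on statistical learning methods. Not finished yet.

1. k-Nearest Neighbors (k-NN)

Intuition:

Classification: find the k points in the data that are closest to a given (target) point, and assign the target to the majority class among those k points.
Regression: find the k points in the data that are closest to the target point; the mean of their values is the prediction for the target.
Advantages:
- $k$-NN is a nonparametric learning algorithm: it has no parameters to learn ($k$ is a hyperparameter, not a learned parameter).
- The nearest-neighbor model has very high capacity, which lets it reach high accuracy when the number of training samples is large.
Disadvantages:
1. The computational cost is high. An $N \times N$ distance matrix has to be built, which costs $O(N^2)$, where $N$ is the number of training samples.
2. When the dataset has billions of samples, the amount of computation is unacceptable.
3. When the training set is small, generalization is poor and the model overfits very easily.
4. It cannot tell how important each feature is.

1.1 The k-NN model

The model is determined by three basic elements: the distance metric, the choice of k, and the classification decision rule.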
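
The intuition and the three elements above can be sketched in a few lines of plain Python. This is a minimal brute-force illustration, assuming Euclidean distance as the metric and majority voting as the decision rule; the function names and the toy data are hypothetical and do not come from the original notes.

```python
import math
from collections import Counter


def k_nearest(train, target, k):
    """Return the k (features, label_or_value) pairs closest to target.

    Distance metric: Euclidean distance (an assumption for this sketch).
    """
    return sorted(train, key=lambda pair: math.dist(pair[0], target))[:k]


def knn_classify(train, target, k):
    """Decision rule: majority vote among the k nearest neighbors."""
    labels = [label for _, label in k_nearest(train, target, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train, target, k):
    """Prediction: mean of the k nearest neighbors' values."""
    values = [value for _, value in k_nearest(train, target, k)]
    return sum(values) / len(values)


# Hypothetical toy data, for illustration only
points = [
    ((0.0, 0.0), "a"),
    ((0.1, 0.2), "a"),
    ((1.0, 1.0), "b"),
    ((0.9, 1.1), "b"),
    ((1.2, 0.8), "b"),
]
label = knn_classify(points, (0.2, 0.1), k=3)

values = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 3.0), ((10.0,), 50.0)]
estimate = knn_regress(values, (1.1,), k=3)

print(label, estimate)
```

Each query sorts all training points by distance, so the cost grows with the size of the training set, which is the drawback listed above. Changing `k` or swapping `math.dist` for another metric changes the model without any retraining.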
