Train XGBoost with Spark

# XGB training script
# run spark-shell on cluster

spark-shell --name xxx --num-executors 15 --executor-cores 4 --executor-memory 20G --jars /tmp/xgboost4j-0.82.jar,/tmp/xgboost4j-spark-0.82.jar --driver-class-path /tmp/xgboost4j-0.82.jar,/tmp/xgboost4j-spark-0.82.jar

# import dependencies
import org.apache.spark.sql.types._
import scala.collection.mutable.ArrayBuilder

Building K-Means with Spark

Industrial applications of machine learning usually require the ability to handle massive datasets. Spark provides a machine learning library, MLlib, that lets us build machine learning models efficiently and in parallel.

This post starts with a Spark ML modelling example, K-Means, written in Python with PySpark, and explains the basic steps and the Spark API usage involved in building an ML model on Spark.

For the complete code of the K-Means example, please refer to Sec2. Spark K-Means code summarization.


Useful Trick with Linux Command

Notes on some useful Linux commands I have come across.

1. Download Google Drive files with wget

Example Google Drive shared file:

https://drive.google.com/open?id=[ThisIsFileID]


For general usage (not a big file)

Example command:

wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=FILEID' -O "FILENAME"

Here, FILEID is the [ThisIsFileID] shown above.
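The FILEID substitution described above can also be done programmatically. The following is a minimal Python sketch, assuming the shared link carries the file ID in an `id` query parameter as in the example link; the function name `drive_download_url` is hypothetical and only used for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def drive_download_url(shared_url: str) -> str:
    """Build the direct-download URL used by the wget command from a shared link.

    Assumes the file ID is in the `id` query parameter of the shared link.
    """
    query = parse_qs(urlparse(shared_url).query)
    ids = query.get("id")
    if not ids:
        raise ValueError("no 'id' query parameter found in the shared link")
    # Same URL shape as in the wget example: export=download plus the file ID.
    return "https://docs.google.com/uc?" + urlencode(
        {"export": "download", "id": ids[0]}
    )


if __name__ == "__main__":
    print(drive_download_url("https://drive.google.com/open?id=abc123"))
```

The resulting URL can then be passed to wget in place of the hand-edited one.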


Download big files

Command for downloading a big file from Google Drive (big files require a download confirmation):

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget \
--quiet \
--save-cookies /tmp/cookies.txt \
--keep-session-cookies \
--no-check-certificate 'https://docs.google.com/uc?export=download&id=FILEID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=FILEID" -O "FILENAME" && rm -rf /tmp/cookies.txt

Again, substitute FILEID and FILENAME. Note that FILEID appears twice.
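To see what the sed expression in the command is doing, here is a rough Python equivalent of the token-extraction step only. It is a sketch under assumptions: the sample page text in the usage line is invented for illustration, and the function name `extract_confirm_tokens` is hypothetical. It does not check what Google Drive actually returns.

```python
import re
from typing import List

# Same character class as the sed expression: confirm=([0-9A-Za-z_]+)
# The leading greedy ".*" mirrors sed, so the last "confirm=" on a line wins.
CONFIRM_RE = re.compile(r".*confirm=([0-9A-Za-z_]+)")


def extract_confirm_tokens(page: str) -> List[str]:
    """Return one confirm token per line that contains 'confirm=<token>'."""
    tokens = []
    for line in page.splitlines():
        match = CONFIRM_RE.match(line)
        if match:
            tokens.append(match.group(1))
    return tokens


if __name__ == "__main__":
    # Invented sample input, not real Google Drive output.
    sample = '<a href="/uc?export=download&amp;confirm=AbC_1&amp;id=FILEID">Download</a>'
    print(extract_confirm_tokens(sample))
```

In the shell command, the extracted token is spliced into the second URL as the `confirm` parameter, and the cookies saved by the inner wget are reused by the outer one.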


Using Python with Docker

Notes on running deep learning models with Python inside Docker containers, and on basic Docker usage.

1. Running Python (Deep Learning) with Docker

A common headache in software projects is ensuring that the correct versions of all dependencies are available on the current development system. Often you may be working on several distinct projects simultaneously, each with its own potentially conflicting dependencies on external libraries. Additionally, you may be working across multiple machines (for example, a personal laptop and university computers) with possibly different operating systems. Further, you may not have root-level access to the system you are working on, and so may not be able to install software system-wide; system updates may also change library versions to incompatible ones.

One way of overcoming these issues is to use project-specific virtual environments. In this context, a virtual environment or virtual machine is an isolated development environment in which the external dependencies of a project can be installed and managed independently of the system-wide versions (and of the environments of other projects).

Here, we show how to use Docker to create such an isolated environment for running Python 3.


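Principles and Applications of Big Data Technology - (11). Stream Computing

[Part 3] - Big Data Processing and Analysis, "Principles and Applications of Big Data Technology", Lin Ziyu

This part introduces the technologies related to big data processing and analysis, including

Big data computing includes batch computing and stream computing. Unlike batch processing, stream computing (processing) operates on data streams and demands lower latency or real-time output of results.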


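Principles and Applications of Big Data Technology - (10). Spark

[Part 3] - Big Data Processing and Analysis, "Principles and Applications of Big Data Technology", Lin Ziyu

This part introduces the technologies related to big data processing and analysis, including

Spark originated at the AMP Lab at UC Berkeley. It is a fast, general-purpose engine for large-scale data processing, and is now one of the top-level open-source projects of the Apache Software Foundation. While borrowing the strengths of Hadoop MapReduce, Spark also addresses the problems that MapReduce faces.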


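Principles and Applications of Big Data Technology - (9). Optimization and Development of Hadoop

[Part 3] - Big Data Processing and Analysis, "Principles and Applications of Big Data Technology", Lin Ziyu

This part introduces the technologies related to big data processing and analysis, including

This post describes how Hadoop 2.0 addresses the shortcomings and limitations of Hadoop 1.0, and introduces the new features of Hadoop 2.0 and YARN, the new-generation resource management and scheduling framework.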


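Principles and Applications of Big Data Technology - (8). Hive - A Data Warehouse Based on Hadoop

[Part 3] - Big Data Processing and Analysis, "Principles and Applications of Big Data Technology", Lin Ziyu

This part introduces the technologies related to big data processing and analysis, including

Hive is a data warehouse platform built on Hadoop. With Hive, ETL work becomes convenient. Hive defines a SQL-like query language, HiveQL, and translates user-written HiveQL into the corresponding MapReduce programs, which run on Hadoop. In essence, Hive can be described as a MapReduce-based computing framework on top of HDFS, used to analyze and manage the data stored in HDFS.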


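Principles and Applications of Big Data Technology - (7). MapReduce

[Part 3] - Big Data Processing and Analysis, "Principles and Applications of Big Data Technology", Lin Ziyu

This part introduces the technologies related to big data processing and analysis, including

MapReduce is a parallel programming model for parallel computation over large-scale datasets (larger than 1 TB). It abstracts complex parallel computation running on large clusters into two functions: Map and Reduce.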
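The two-function abstraction can be illustrated with a word count. The sketch below is a single-process Python toy, not Hadoop code: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the grouping step stands in for the shuffle that a real cluster performs between the Map and Reduce phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce: sum all the counts emitted for one word."""
    return key, sum(values)


def run_mapreduce(lines: List[str]) -> Dict[str, int]:
    """Run Map, group the pairs by key (the 'shuffle'), then run Reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_mapreduce(["big data", "Big compute"]))
```

On a real cluster, the Map calls run in parallel on different input splits and the Reduce calls run in parallel on different keys; the user only writes the two functions.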


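Principles and Applications of Big Data Technology - (5). NoSQL Databases

[Part 2] - Big Data Storage and Management, "Principles and Applications of Big Data Technology", Lin Ziyu

This part introduces the concepts and principles of the technologies related to big data storage and management, including

NoSQL (Not only SQL) is an approach to database management system design that differs from that of relational databases.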

