Train XGBoost with Spark

2021-01-16 | Edit: 2021-01-16 | Zhanhang (Matthew) ZENG | a few seconds read (About 55 words)

Big Data / Machine Learning / Spark / Statistical Learning Big Data,Spark,Machine Learning

# XGB training script
# run spark-shell on cluster

spark-shell --name xxx --num-executors 15 --executor-cores 4 --executor-memory 20G --jars /tmp/xgboost4j-0.82.jar,/tmp/xgboost4j-spark-0.82.jar --driver-class-path /tmp/xgboost4j-0.82.jar:/tmp/xgboost4j-spark-0.82.jar

// import dependencies (inside the spark-shell session)
import org.apache.spark.sql.types._
import scala.collection.mutable.ArrayBuilder

#Big Data,Spark,Machine Learning

Building K-Means with Spark

2020-12-18 | Edit: 2021-01-05 | Zhanhang (Matthew) ZENG | 8 minutes read (About 1238 words)

Big Data / Machine Learning / Spark / Statistical Learning Big Data,Spark,Machine Learning

Industry applications of machine learning generally require handling massive datasets. Spark provides a machine learning library, MLlib, that lets us build machine learning models efficiently and in parallel.

This post starts with a Spark ML modelling example, K-Means, written in Python with PySpark, and explains the basic steps and the Spark API usage involved in building an ML model on Spark.

For the complete code of the K-Means example, please refer to Sec. 2, Spark K-Means code summarization.

#Big Data,Spark,Machine Learning

Useful Tricks with Linux Commands

2020-08-10 | Edit: 2020-12-29 | Zhanhang (Matthew) ZENG | 3 minutes read (About 398 words)

Tech Linux

Notes on some useful Linux command usage I have come across.

1. Download Google Drive files with wget

Example Google Drive shared file:

https://drive.google.com/open?id=[ThisIsFileID]

For general usage (not a big file)

Example command:

wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=FILEID' -O "FILENAME"

Here, FILEID is the [ThisIsFileID] shown above.
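
As a small illustration of the substitution step, the following Python sketch pulls the file ID out of a share link of the form shown above and builds the direct-download URL that the wget command targets. The function names `extract_file_id` and `build_download_url` are made up for this example, and it assumes the share link carries the ID in an `id` query parameter; other link formats are not handled.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def extract_file_id(share_url: str) -> str:
    """Return the value of the `id` query parameter of a share link."""
    query = parse_qs(urlparse(share_url).query)
    ids = query.get("id")
    if not ids:
        raise ValueError(f"no 'id' parameter found in {share_url!r}")
    return ids[0]


def build_download_url(file_id: str) -> str:
    """Build the direct-download URL used in the wget example."""
    params = urlencode({"export": "download", "id": file_id})
    return "https://docs.google.com/uc?" + params


if __name__ == "__main__":
    # Placeholder ID taken from the example share link, not a real file.
    file_id = extract_file_id("https://drive.google.com/open?id=ThisIsFileID")
    print(build_download_url(file_id))
```

The resulting URL can then be passed to wget in place of the hand-edited one.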

Download big files

Command for downloading any big file from Google Drive (big files require a download confirmation):

wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget \
    --quiet \
    --save-cookies /tmp/cookies.txt \
    --keep-session-cookies \
    --no-check-certificate 'https://docs.google.com/uc?export=download&id=FILEID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=FILEID" -O "FILENAME" && rm -rf /tmp/cookies.txt

As before, substitute FILEID and FILENAME. Note that FILEID appears twice in the command.
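
The sed expression in the command above extracts a confirmation token from the first response and feeds it into the second request. The token-extraction step alone can be sketched in Python as below, using the same character class as the sed pattern. The helper name `extract_confirm_token` and the sample HTML string are invented for illustration; the real page returned by Google Drive may look different and may change over time.

```python
import re
from typing import Optional

# Same character class as the sed pattern: confirm=([0-9A-Za-z_]+)
CONFIRM_PATTERN = re.compile(r"confirm=([0-9A-Za-z_]+)")


def extract_confirm_token(page: str) -> Optional[str]:
    """Return the first confirm token found in the text, or None.

    Simplification: sed prints one token per matching line, while this
    helper only returns the first match in the whole text.
    """
    match = CONFIRM_PATTERN.search(page)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical response fragment, used only to exercise the pattern.
    sample = '<a href="/uc?export=download&amp;confirm=AbC1&amp;id=FILEID">Download anyway</a>'
    print(extract_confirm_token(sample))
```

When no token is present the helper returns `None`, which is the case where the simpler command from the general-usage section is usually enough.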

#Linux

Using Python with Docker

2020-07-14 | Edit: 2020-12-29 | Zhanhang (Matthew) ZENG | 11 minutes read (About 1639 words)

Tech Linux

Notes on running deep learning models with Python inside Docker containers, plus basic Docker usage.

1. Running Python (Deep Learning) with Docker

A common headache in software projects is ensuring the correct versions of all dependencies are available on the current development system. Often you may be working on several distinct projects simultaneously, each with its own potentially conflicting dependencies on external libraries. Additionally, you may be working across multiple machines (for example a personal laptop and university computers) with possibly different operating systems. Further, you may not have root-level access to a system you are working on and so cannot install software system-wide, and system updates may change library versions to incompatible ones.

One way of overcoming these issues is to use project-specific virtual environments. In this context, a virtual environment or virtual machine is an isolated development environment in which a project's external dependencies can be installed and managed independently of the system-wide versions (and of the environments of other projects).

Here, we show how to use Docker to create an isolated environment (a container) for running Python 3.

#Linux

Principles and Applications of Big Data Technology - (11). Stream Computing

2020-06-06 | Edit: 2020-12-29 | Zhanhang (Matthew) ZENG | 24 minutes read (About 3601 words)

Big Data / Spark Big Data

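[Part 3] - Big Data Processing and Analysis, from "Principles and Applications of Big Data Technology" by Lin Ziyu

This part covers technologies for big data processing and analysis, including:

Chapter 7 - MapReduce
Chapter 8 - Hive - A Hadoop-based data warehouse
Chapter 9 - Optimization and Development of Hadoop
Chapter 10 - Spark
Chapter 11 - Stream Computing
Chapter 12 - Graph Computing

Big data computing includes both batch computing and stream computing. Unlike batch processing, stream computing operates on data streams and demands lower latency or real-time output of results.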

#Big Data

Principles and Applications of Big Data Technology - (10). Spark

2020-06-03 | Edit: 2020-12-29 | Zhanhang (Matthew) ZENG | 26 minutes read (About 3859 words)

Big Data / Spark Big Data

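[Part 3] - Big Data Processing and Analysis, from "Principles and Applications of Big Data Technology" by Lin Ziyu

This part covers technologies for big data processing and analysis, including:

Chapter 7 - MapReduce
Chapter 8 - Hive - A Hadoop-based data warehouse
Chapter 9 - Optimization and Development of Hadoop
Chapter 10 - Spark
Chapter 11 - Stream Computing
Chapter 12 - Graph Computing

Spark originated at the AMPLab at UC Berkeley. It is a fast, general-purpose engine for large-scale data processing and is now one of the top-level open-source projects of the Apache Software Foundation. While borrowing the strengths of Hadoop MapReduce, Spark also addresses many of the problems MapReduce faces.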

#Big Data

Principles and Applications of Big Data Technology - (9). Optimization and Development of Hadoop

2020-06-01 | Edit: 2020-12-29 | Zhanhang (Matthew) ZENG | 27 minutes read (About 3982 words)

Big Data / Spark Big Data

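[Part 3] - Big Data Processing and Analysis, from "Principles and Applications of Big Data Technology" by Lin Ziyu

This part covers technologies for big data processing and analysis, including:

Chapter 7 - MapReduce
Chapter 8 - Hive - A Hadoop-based data warehouse
Chapter 9 - Optimization and Development of Hadoop
Chapter 10 - Spark
Chapter 11 - Stream Computing
Chapter 12 - Graph Computing

This post covers how Hadoop 2.0 addresses the shortcomings and limitations of Hadoop 1.0, and introduces the new features of Hadoop 2.0 along with YARN, the new-generation resource management and scheduling framework.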

#Big Data

Principles and Applications of Big Data Technology - (8). Hive - A Hadoop-Based Data Warehouse

2020-05-29 | Edit: 2020-12-29 | Zhanhang (Matthew) ZENG | 26 minutes read (About 3908 words)

Big Data / Spark Big Data

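[Part 3] - Big Data Processing and Analysis, from "Principles and Applications of Big Data Technology" by Lin Ziyu

This part covers technologies for big data processing and analysis, including:

Chapter 7 - MapReduce
Chapter 8 - Hive - A Hadoop-based data warehouse
Chapter 9 - Optimization and Development of Hadoop
Chapter 10 - Spark
Chapter 11 - Stream Computing
Chapter 12 - Graph Computing

Hive is a data warehouse platform built on Hadoop. With Hive, ETL work becomes convenient. Hive defines a SQL-like query language, HiveQL, and translates user-written HiveQL into the corresponding MapReduce programs, which then run on Hadoop. In essence, Hive can be seen as a MapReduce computing framework on top of HDFS that analyzes and manages the data stored in HDFS.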

#Big Data