Train XGBoost with Spark

2021-01-16 | Edit: 2021-01-16 | Zhanhang (Matthew) ZENG | a few seconds read (about 55 words)

Categories: Big Data / Machine Learning / Spark / Statistical Learning | Tags: Big Data, Spark, Machine Learning

# XGB training script
# Run spark-shell on the cluster.
# --jars takes a comma-separated list; --driver-class-path is a classpath,
# so its entries are separated by ":".

spark-shell --name xxx \
  --num-executors 15 --executor-cores 4 --executor-memory 20G \
  --jars /tmp/xgboost4j-0.82.jar,/tmp/xgboost4j-spark-0.82.jar \
  --driver-class-path /tmp/xgboost4j-0.82.jar:/tmp/xgboost4j-spark-0.82.jar

// Import dependencies (inside the Scala shell)
import org.apache.spark.sql.types._
import scala.collection.mutable.ArrayBuilder
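
To make the resource flags of the spark-shell command concrete, here is a small back-of-the-envelope calculation. It is only a sketch: it assumes one core per task (Spark's default `spark.task.cpus=1`) and counts executor heap memory only, ignoring memory overhead and the driver.

```python
# Values copied from the spark-shell flags above.
num_executors = 15
executor_cores = 4
executor_memory_gb = 20

# Assumption: each task uses one core (spark.task.cpus=1, the default).
max_parallel_tasks = num_executors * executor_cores

# Executor heap only; memory overhead and the driver are not included.
total_executor_memory_gb = num_executors * executor_memory_gb

print(f"parallel task slots: {max_parallel_tasks}")
print(f"total executor heap: {total_executor_memory_gb} GB")
```

Under these assumptions the job can run up to 60 tasks at once, so datasets are usually split into at least that many partitions to keep every core busy.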

Building K-Means with Spark

2020-12-18 | Edit: 2021-01-05 | Zhanhang (Matthew) ZENG | 8 minutes read (about 1238 words)

Categories: Big Data / Machine Learning / Spark / Statistical Learning | Tags: Big Data, Spark, Machine Learning

Industrial applications of machine learning generally require the ability to handle massive datasets. Spark provides a machine learning library, MLlib, that lets us build machine learning models efficiently and in parallel.

This post starts with a Spark ML modelling example, K-Means, written in Python with PySpark, and explains the basic steps and the Spark APIs used when building an ML model on Spark.

For the complete code of the K-Means example, see Sec. 2, "Spark K-Means code summarization".
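
As a rough illustration of what K-Means computes, independent of Spark, the plain-Python sketch below alternates the assignment and update steps (Lloyd's algorithm). It is not the PySpark code from the post: the function names, the toy data, and the fixed initial centroids are assumptions made for this example, and Spark's MLlib distributes this work across executors and uses its own initialization.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points]


def update(points, labels, centroids):
    """Update step: move each centroid to the mean of its assigned points."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:  # an empty cluster keeps its previous centroid
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(p[d] for p in members) / len(members)
                                   for d in range(len(old))))
    return new_centroids


def kmeans(points, init_centroids, max_iter=20):
    """Alternate the two steps until the assignments stop changing."""
    centroids = list(init_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return centroids, labels


# Hypothetical toy data: two well-separated groups in 2D.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(points, init_centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

The assignment step is independent for every point, which is why the algorithm parallelizes well over a partitioned dataset; only the per-cluster sums and counts need to be combined across partitions.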

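Principles and Applications of Big Data Technology - (11). Stream Computing

2020-06-06 | Edit: 2020-12-29 | Zhanhang (Matthew) ZENG | 24 minutes read (about 3601 words)

Categories: Big Data / Spark | Tags: Big Data

[Part 3] - Big Data Processing and Analysis, from "Principles and Applications of Big Data Technology" by Lin Ziyu

This part introduces the technologies used for big data processing and analysis, including:

Chapter 7 - MapReduce
Chapter 8 - Hive - A Hadoop-Based Data Warehouse
Chapter 9 - Optimization and Development of Hadoop
Chapter 10 - Spark
Chapter 11 - Stream Computing
Chapter 12 - Graph Computing

Big data computing includes both batch computing and stream computing. Unlike batch processing, stream computing (stream processing) operates on a continuous data stream and requires lower latency or real-time output of results.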
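
The difference between the two models can be sketched in a few lines of plain Python. This is a conceptual illustration only, not tied to any streaming framework; the function names and the sensor readings are hypothetical.

```python
def batch_mean(records):
    """Batch style: wait for the whole dataset, then compute once."""
    return sum(records) / len(records)


def streaming_mean(stream):
    """Stream style: keep a small running state and emit an
    up-to-date result immediately after each record arrives."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0, 40.0]  # hypothetical sensor readings

batch_result = batch_mean(readings)                     # one result at the end
stream_results = list(streaming_mean(iter(readings)))   # one result per record

print(batch_result)
print(stream_results)
```

The batch version produces nothing until all data is available, while the streaming version yields a usable result after every record, which is where the low-latency requirement comes from.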

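Principles and Applications of Big Data Technology - (10). Spark

2020-06-03 | Edit: 2020-12-29 | Zhanhang (Matthew) ZENG | 26 minutes read (about 3859 words)

Categories: Big Data / Spark | Tags: Big Data

[Part 3] - Big Data Processing and Analysis, from "Principles and Applications of Big Data Technology" by Lin Ziyu

This part introduces the technologies used for big data processing and analysis, including:

Chapter 7 - MapReduce
Chapter 8 - Hive - A Hadoop-Based Data Warehouse
Chapter 9 - Optimization and Development of Hadoop
Chapter 10 - Spark
Chapter 11 - Stream Computing
Chapter 12 - Graph Computing

Spark originated at the AMP Lab of UC Berkeley. It is a fast, general-purpose engine for large-scale data processing and is now one of the top-level open source projects of the Apache Software Foundation. While borrowing the strengths of Hadoop MapReduce, Spark also addresses the problems that MapReduce faces.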

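Principles and Applications of Big Data Technology - (9). Optimization and Development of Hadoop

2020-06-01 | Edit: 2020-12-29 | Zhanhang (Matthew) ZENG | 27 minutes read (about 3982 words)

Categories: Big Data / Spark | Tags: Big Data

[Part 3] - Big Data Processing and Analysis, from "Principles and Applications of Big Data Technology" by Lin Ziyu

This part introduces the technologies used for big data processing and analysis, including:

Chapter 7 - MapReduce
Chapter 8 - Hive - A Hadoop-Based Data Warehouse
Chapter 9 - Optimization and Development of Hadoop
Chapter 10 - Spark
Chapter 11 - Stream Computing
Chapter 12 - Graph Computing

This post covers how Hadoop 2.0 addresses the shortcomings and limitations of Hadoop 1.0, the new features of Hadoop 2.0, and YARN, the new-generation resource management and scheduling framework.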

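Principles and Applications of Big Data Technology - (8). Hive - A Hadoop-Based Data Warehouse

2020-05-29 | Edit: 2020-12-29 | Zhanhang (Matthew) ZENG | 26 minutes read (about 3908 words)

Categories: Big Data / Spark | Tags: Big Data

[Part 3] - Big Data Processing and Analysis, from "Principles and Applications of Big Data Technology" by Lin Ziyu

This part introduces the technologies used for big data processing and analysis, including:

Chapter 7 - MapReduce
Chapter 8 - Hive - A Hadoop-Based Data Warehouse
Chapter 9 - Optimization and Development of Hadoop
Chapter 10 - Spark
Chapter 11 - Stream Computing
Chapter 12 - Graph Computing

Hive is a data warehouse platform built on Hadoop that makes ETL work convenient. Hive defines a SQL-like query language, HiveQL, and translates user-written HiveQL into the corresponding MapReduce programs, which then run on Hadoop. In essence, Hive can be viewed as a MapReduce-based computing framework on top of HDFS for analyzing and managing the data stored there.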
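
To give a feel for that translation, the plain-Python sketch below shows how a query such as `SELECT dept, COUNT(*) FROM employees GROUP BY dept` could be expressed as map, shuffle, and reduce phases. It is a conceptual illustration only: the `employees` table, its rows, and the function names are hypothetical, and the plans Hive actually generates are more involved.

```python
from collections import defaultdict

# Hypothetical rows of a table employees(name, dept).
rows = [("ann", "sales"), ("bob", "eng"), ("cat", "eng"),
        ("dan", "sales"), ("eve", "eng")]


def map_phase(rows):
    """Map: emit (group key, 1) for every input row."""
    for _name, dept in rows:
        yield dept, 1


def shuffle_phase(pairs):
    """Shuffle: bring all values with the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values of each key (here COUNT(*))."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle_phase(map_phase(rows)))
print(counts)
```

The `GROUP BY` column becomes the map output key and the aggregate function becomes the reducer, which is the basic pattern behind running SQL-like aggregations on MapReduce.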
