Getting Started with Spark ETL

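Introduction#

Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, along with an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.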

Single-Machine Deployment#

Spark SQL provides many aggregate functions and SQL query features, so it can take the place of a traditional ETL tool.

  • Start the master
<code>sbin/start-master.sh</code>
  • Start a worker and register it with the master
<code>sbin/start-slave.sh spark://localhost:7077</code>
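
Once both scripts have run, it can be handy to confirm that the master is actually listening before pointing a shell or a job at it. The sketch below is one way to do that from Python with only the standard library. `is_port_open` is a hypothetical helper name, and 7077 is simply the port taken from the master URL above. It only checks that a TCP connection can be opened, not that the process on the other end is a healthy Spark master.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Assumption: the master listens on localhost:7077, as in the URL above.
    print("master reachable:", is_port_open("localhost", 7077))
```

If the check fails, the master log and the address printed there are the first things to look at, since the master may have bound to a different host name than localhost.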

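Installing Dependencies#

Spark already bundles many dependencies. If the one you need is not included, pass its Maven coordinates to spark-shell --packages groupId:artifactId:version and Spark will resolve and download it at startup.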

<code>spark-2.4.4 jet$ bin/spark-shell --packages com.ibm.db2.jcc:db2jcc:db2jcc4
Ivy Default Cache set to: /Users/jet/.ivy2/cache
The jars for the packages stored in: /Users/jet/.ivy2/jars
:: loading settings :: url = jar:file:/Users/jet/app/spark-2.4.4/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.ibm.db2.jcc#db2jcc added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-50c2736a-0697-41c6-9989-ea5333c30e2e;1.0
    confs: [default]
    found com.ibm.db2.jcc#db2jcc;db2jcc4 in central
downloading https://repo1.maven.org/maven2/com/ibm/db2/jcc/db2jcc/db2jcc4/db2jcc-db2jcc4.jar ...
    [SUCCESSFUL ] com.ibm.db2.jcc#db2jcc;db2jcc4!db2jcc.jar (13149ms)
:: resolution report :: resolve 5264ms :: artifacts dl 13155ms
    :: modules in use:
    com.ibm.db2.jcc#db2jcc;db2jcc4 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   1   |   1   |   1   |   0   ||   1   |   1   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-50c2736a-0697-41c6-9989-ea5333c30e2e
    confs: [default]
    1 artifacts copied, 0 already retrieved (4140kB/17ms)
20/02/11 19:48:20 WARN Utils: Your hostname, jetdeMacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.31.25 instead (on interface en0)
20/02/11 19:48:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/02/11 19:48:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.31.25:4040
Spark context available as 'sc' (master = spark://192.168.31.25:7077, app id = app-20200211194827-0002).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
Type in expressions to have them evaluated.
Type :help for more information.

scala></code>
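
The value given to --packages is a comma-separated list of Maven coordinates, each in groupId:artifactId:version form. A typo in a coordinate only surfaces once Ivy tries to resolve it, so a small check before launching can save a round trip. The sketch below is one possible way to validate coordinates and assemble the command line. `MavenCoordinate`, `parse_coordinate`, and `spark_shell_command` are hypothetical names for illustration and are not part of Spark.

```python
from typing import List, NamedTuple


class MavenCoordinate(NamedTuple):
    group_id: str
    artifact_id: str
    version: str


def parse_coordinate(text: str) -> MavenCoordinate:
    """Split 'groupId:artifactId:version' and reject malformed input."""
    parts = text.strip().split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected groupId:artifactId:version, got {text!r}")
    return MavenCoordinate(*parts)


def spark_shell_command(
    coordinates: List[str], spark_shell: str = "bin/spark-shell"
) -> List[str]:
    """Build an argv list for spark-shell with every coordinate validated first."""
    checked = [":".join(parse_coordinate(c)) for c in coordinates]
    # --packages takes a single comma-separated argument.
    return [spark_shell, "--packages", ",".join(checked)]


if __name__ == "__main__":
    # The DB2 driver coordinate from the session shown above.
    print(" ".join(spark_shell_command(["com.ibm.db2.jcc:db2jcc:db2jcc4"])))
```

The returned list can be handed to `subprocess.run` as is, which avoids shell quoting problems when several packages are listed.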

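Cassandra Package Dependencies#

The Scala suffix of the connector artifact (_2.11 or _2.12) has to match the Scala version your Spark build uses; the spark-shell banner above reports Scala 2.11.12. The first command also sets the Cassandra contact point through spark.cassandra.connection.host.

<code>$SPARK_HOME/bin/spark-shell --conf spark.cassandra.connection.host=127.0.0.1 --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.2

$SPARK_HOME/bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:2.4.2</code>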
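
The same two settings can also be supplied from a PySpark script instead of the command line. The sketch below assumes the `spark.jars.packages` configuration key as the counterpart of --packages, and reuses the host and connector coordinate from the commands above. `cassandra_session_conf` and `build_session` are hypothetical helper names, and the default values are only the ones shown in this article.

```python
from typing import Dict


def cassandra_session_conf(
    host: str = "127.0.0.1",
    scala_binary: str = "2.11",
    connector_version: str = "2.4.2",
) -> Dict[str, str]:
    """Return Spark settings equivalent to the --conf / --packages flags above."""
    package = (
        f"com.datastax.spark:spark-cassandra-connector_{scala_binary}:{connector_version}"
    )
    return {
        "spark.cassandra.connection.host": host,
        "spark.jars.packages": package,
    }


def build_session(conf: Dict[str, str], master: str = "local[*]"):
    """Create a SparkSession with the given settings applied."""
    # Imported here so the helper above stays usable without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(master).appName("Python Spark ETL")
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the settings in a plain dictionary makes it easy to print or log exactly what the session was started with.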

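Local Debugging#

<code>from pyspark.sql import SparkSession

# Option 1: connect to a remote Spark standalone cluster
spark = (
    SparkSession.builder
    .master("spark://192.168.12.21:7077")
    .appName("Python Spark ETL")
    .getOrCreate()
)

# Option 2: run Spark locally, using all available cores
# (use one option or the other; getOrCreate() reuses an existing session)
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("Python Spark ETL")
    .getOrCreate()
)</code>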
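
To see a session doing some actual ETL work, here is a minimal sketch that switches between local and remote mode and runs one Spark SQL aggregation. It assumes PySpark is installed. `master_url` and `run_sample_etl` are hypothetical names, and the `orders` view, its columns, and its rows are made up purely for illustration.

```python
from typing import List, Optional, Tuple


def master_url(host: Optional[str] = None, port: int = 7077) -> str:
    """Return local[*] for local debugging, or a standalone master URL for `host`."""
    return "local[*]" if host is None else f"spark://{host}:{port}"


def run_sample_etl(host: Optional[str] = None) -> List[Tuple[str, float]]:
    """Load a few in-memory rows, aggregate them with Spark SQL, return the result."""
    from pyspark.sql import SparkSession  # requires PySpark

    spark = (
        SparkSession.builder
        .master(master_url(host))
        .appName("Python Spark ETL")
        .getOrCreate()
    )
    try:
        # Extract: illustrative rows standing in for a real source table.
        orders = spark.createDataFrame(
            [("a", 10.0), ("a", 5.0), ("b", 7.5)], ["customer", "amount"]
        )
        orders.createOrReplaceTempView("orders")
        # Transform: a plain SQL aggregation over the temporary view.
        totals = spark.sql(
            "SELECT customer, SUM(amount) AS total "
            "FROM orders GROUP BY customer ORDER BY customer"
        )
        # Load: collected to the driver here; a real job would write to a sink.
        return [(row["customer"], row["total"]) for row in totals.collect()]
    finally:
        spark.stop()
```

Calling `run_sample_etl()` with no argument runs everything in the local process, which is usually the quickest way to debug a transformation before passing the cluster host.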

