Getting Started with Spark ETL

Introduction#

Apache Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, plus an optimized engine that supports general execution graphs. It also ships with a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Standalone Deployment#

Spark SQL provides a rich set of aggregate functions and general SQL querying, which can stand in for a traditional ETL tool.
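As a minimal sketch of that idea in PySpark (the CSV path, table name, and column names below are hypothetical placeholders), a typical ETL aggregation can be expressed directly in SQL:

<code>from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("Spark SQL aggregation sketch")
         .getOrCreate())

# Extract: load raw data (path and schema are placeholders)
orders = spark.read.csv("/tmp/orders.csv", header=True, inferSchema=True)
orders.createOrReplaceTempView("orders")

# Transform: aggregate with plain SQL instead of a dedicated ETL tool
summary = spark.sql("""
    SELECT customer_id,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id
""")
summary.show()</code>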

  • Start the master, then attach a worker to it
<code>bin/start-master.sh</code>
<code>bin/start-slave.sh spark://localhost:7077</code>

Installing Dependencies#

Spark already bundles many dependencies. If yours is not among them, you can pull it in with spark-shell --packages groupId:artifactId:version, for example:

<code>spark-2.4.4 jet$ bin/spark-shell --packages com.ibm.db2.jcc:db2jcc:db2jcc4
Ivy Default Cache set to: /Users/jet/.ivy2/cache
The jars for the packages stored in: /Users/jet/.ivy2/jars
:: loading settings :: url = jar:file:/Users/jet/app/spark-2.4.4/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.ibm.db2.jcc#db2jcc added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-50c2736a-0697-41c6-9989-ea5333c30e2e;1.0
    confs: [default]
    found com.ibm.db2.jcc#db2jcc;db2jcc4 in central
downloading https://repo1.maven.org/maven2/com/ibm/db2/jcc/db2jcc/db2jcc4/db2jcc-db2jcc4.jar ...
    [SUCCESSFUL ] com.ibm.db2.jcc#db2jcc;db2jcc4!db2jcc.jar (13149ms)
:: resolution report :: resolve 5264ms :: artifacts dl 13155ms
    :: modules in use:
    com.ibm.db2.jcc#db2jcc;db2jcc4 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   1   |   1   |   1   |   0   ||   1   |   1   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-50c2736a-0697-41c6-9989-ea5333c30e2e
    confs: [default]
    1 artifacts copied, 0 already retrieved (4140kB/17ms)
20/02/11 19:48:20 WARN Utils: Your hostname, jetdeMacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.31.25 instead (on interface en0)
20/02/11 19:48:20 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/02/11 19:48:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.31.25:4040
Spark context available as 'sc' (master = spark://192.168.31.25:7077, app id = app-20200211194827-0002).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
Type in expressions to have them evaluated.
Type :help for more information.

scala> </code>
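The same --packages flag works with bin/pyspark. Once the driver is resolved, the session can extract from DB2 over JDBC; below is a minimal sketch, in which the host, port, database, table, and credentials are all placeholders to substitute with your own:

<code>from pyspark.sql import SparkSession

# Launch with: bin/pyspark --packages com.ibm.db2.jcc:db2jcc:db2jcc4
spark = SparkSession.builder.appName("DB2 extract sketch").getOrCreate()

# Connection details below are placeholders
df = (spark.read.format("jdbc")
      .option("url", "jdbc:db2://db2host:50000/SAMPLE")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("dbtable", "MYSCHEMA.ORDERS")
      .option("user", "db2user")
      .option("password", "secret")
      .load())
df.show()</code>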

Cassandra Package Dependencies#

Pick the connector artifact that matches the Scala version of your Spark build (2.11 or 2.12):

<code>$SPARK_HOME/bin/spark-shell --conf spark.cassandra.connection.host=127.0.0.1 --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.2

$SPARK_HOME/bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:2.4.2</code>
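With the connector loaded, a Cassandra table can be read as a DataFrame. A sketch in PySpark, assuming a hypothetical keyspace ks and table users:

<code>from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Cassandra read sketch")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

# Keyspace and table names are placeholders
df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="ks", table="users")
      .load())
df.show()</code>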

Local Debugging#

<code>from pyspark.sql import SparkSession

# Connect to a remote Spark cluster
spark = SparkSession \
    .builder.master("spark://192.168.12.21:7077") \
    .appName("Python Spark ETL") \
    .getOrCreate()

# Or run Spark locally, in-process, using all cores
spark = SparkSession \
    .builder.master("local[*]") \
    .appName("Python Spark ETL") \
    .getOrCreate()</code>
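Once the session is up, the usual extract-transform-load loop follows. A minimal sketch reusing the spark session created above; the input and output paths and column names are placeholders:

<code>from pyspark.sql import functions as F

# Extract: read raw events (path is a placeholder)
raw = spark.read.json("/tmp/raw_events.json")

# Transform: drop malformed rows and aggregate per day
daily = (raw.dropna(subset=["event_time", "value"])
         .withColumn("day", F.to_date("event_time"))
         .groupBy("day")
         .agg(F.sum("value").alias("total_value")))

# Load: write the summary out as partitioned Parquet (path is a placeholder)
daily.write.mode("overwrite").partitionBy("day").parquet("/tmp/daily_summary")</code>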

