Spark Usage Guide
Updated at: 2025-11-03
Spark
Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. It is a versatile parallel framework, open-sourced by UC Berkeley's AMP Lab, similar to Hadoop MapReduce. Spark retains the advantages of Hadoop MapReduce, but unlike MapReduce it keeps the intermediate output of jobs in memory, avoiding reads and writes to HDFS. This makes Spark well suited to iterative algorithms, such as those common in data mining and machine learning.
Install
1. Spark environment preparation
Bash
# Download to a path
wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
# Unpack
tar -zxvf spark-3.1.1-bin-hadoop3.2.tgz
# Set environment variables
export SPARK_HOME=/home/hadoop/spark-3.1.1-bin-hadoop3.2
export PATH=$PATH:$HADOOP_HOME/bin:$SPARK_HOME/bin
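Variables exported in a shell are lost when the session ends, so it is common to append them to `~/.bashrc` as well. A minimal sketch, assuming the unpack location above:

```shell
# Persist the Spark environment variables across sessions
# (path assumes Spark was unpacked to /home/hadoop as above)
cat >> ~/.bashrc <<'EOF'
export SPARK_HOME=/home/hadoop/spark-3.1.1-bin-hadoop3.2
export PATH=$PATH:$SPARK_HOME/bin
EOF
```

Open a new shell (or `source ~/.bashrc`) for the change to take effect.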
2. Add dependency jar packages
Download BOS-HDFS
Bash
# Unzip and copy the jar into Spark's dependency path
unzip bos-hdfs-sdk-1.0.3-community.jar.zip
cp bos-hdfs-sdk-1.0.3-community.jar ${SPARK_HOME}/jars
# Required configuration for accessing BOS
cp ${SPARK_HOME}/conf/spark-defaults.conf.template ${SPARK_HOME}/conf/spark-defaults.conf
vim ${SPARK_HOME}/conf/spark-defaults.conf
...
cat ${SPARK_HOME}/conf/spark-defaults.conf
spark.hadoop.fs.bos.access.key={your ak}
spark.hadoop.fs.bos.secret.access.key={your sk}
spark.hadoop.fs.bos.endpoint=http://bj.bcebos.com {your bucket endpoint}
spark.hadoop.fs.AbstractFileSystem.bos.impl=org.apache.hadoop.fs.bos.BOS
spark.hadoop.fs.bos.impl=org.apache.hadoop.fs.bos.BaiduBosFileSystem
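A missing key in `spark-defaults.conf` typically surfaces only at job runtime, so a quick sanity check of the edited file can save a failed submit. The sketch below is an illustrative stdlib helper (not part of Spark or the BOS-HDFS SDK) that parses `key=value` lines and reports any missing BOS keys:

```python
# Sanity-check that a spark-defaults.conf text defines the BOS keys.
# Illustrative helper only; key names follow the configuration above.

REQUIRED_KEYS = {
    "spark.hadoop.fs.bos.access.key",
    "spark.hadoop.fs.bos.secret.access.key",
    "spark.hadoop.fs.bos.endpoint",
    "spark.hadoop.fs.bos.impl",
}

def parse_conf(text):
    """Parse key=value lines, skipping blanks and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf

# Example content mirroring the settings above (placeholder credentials)
sample = """
spark.hadoop.fs.bos.access.key=AK
spark.hadoop.fs.bos.secret.access.key=SK
spark.hadoop.fs.bos.endpoint=http://bj.bcebos.com
spark.hadoop.fs.bos.impl=org.apache.hadoop.fs.bos.BaiduBosFileSystem
"""

missing = REQUIRED_KEYS - parse_conf(sample).keys()
print("missing keys:", sorted(missing))
```

In practice you would read the real file, e.g. `parse_conf(open(f"{SPARK_HOME}/conf/spark-defaults.conf").read())`.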
Use
Create a demo.py script that writes a small student DataFrame to my-bucket as Parquet, reads it back, and queries the students older than 22.
Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example_app")\
    .config("spark.driver.bindAddress", "localhost")\
    .getOrCreate()

bosFile = "bos://my-bucket/student"

# Write a small DataFrame to BOS as Parquet
data = [("abc", 22), ("def", 17), ("ghi", 34)]
df = spark.createDataFrame(data, ["name", "age"])
df.write.parquet(bosFile)

# Read it back and query students older than 22
df = spark.read.parquet(bosFile)
df.printSchema()
df.createOrReplaceTempView("students")
sqlDF = spark.sql("SELECT * FROM students WHERE age > 22 LIMIT 10")
sqlDF.show()
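For a quick check of the query logic itself, the same filter can be expressed over the sample rows in plain Python, with no Spark cluster needed:

```python
# Sample rows from demo.py
data = [("abc", 22), ("def", 17), ("ghi", 34)]

# Equivalent of: SELECT * FROM students WHERE age > 22
matches = [(name, age) for name, age in data if age > 22]
print(matches)  # [('ghi', 34)]
```

Only the row ("ghi", 34) satisfies the predicate, which is what sqlDF.show() should display.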
Run
Bash
spark-submit demo.py
