Spark Usage Guide
Updated at: 2025-11-03
Spark
Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. It is a versatile parallel framework open-sourced by the UC Berkeley AMP Lab. Spark retains the strengths of Hadoop MapReduce, but unlike MapReduce it can keep intermediate job output in memory instead of writing it to HDFS between stages. This makes Spark particularly well suited to iterative algorithms, such as those used in data mining and machine learning.
Install
1. Spark environment preparation
Bash
# Download to a path
wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
# Extract the archive
tar -zxvf spark-3.1.1-bin-hadoop3.2.tgz
# Set environment variables (assumes HADOOP_HOME is already set)
export SPARK_HOME=/home/hadoop/spark-3.1.1-bin-hadoop3.2
export PATH=$PATH:$HADOOP_HOME/bin:$SPARK_HOME/bin
2. Add the dependency JAR package
Download BOS-HDFS
Bash
# Unzip and copy the JAR to the Spark dependency path
unzip bos-hdfs-sdk-1.0.3-community.jar.zip
cp bos-hdfs-sdk-1.0.3-community.jar ${SPARK_HOME}/jars
# Required configuration for accessing BOS
cp ${SPARK_HOME}/conf/spark-defaults.conf.template ${SPARK_HOME}/conf/spark-defaults.conf
vim ${SPARK_HOME}/conf/spark-defaults.conf
...
cat ${SPARK_HOME}/conf/spark-defaults.conf
spark.hadoop.fs.bos.access.key={your ak}
spark.hadoop.fs.bos.secret.access.key={your sk}
spark.hadoop.fs.bos.endpoint={your bucket endpoint, e.g. http://bj.bcebos.com}
spark.hadoop.fs.AbstractFileSystem.bos.impl=org.apache.hadoop.fs.bos.BOS
spark.hadoop.fs.bos.impl=org.apache.hadoop.fs.bos.BaiduBosFileSystem
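If any of the five properties above is missing or misspelled, Spark jobs will fail when they first touch a bos:// path. A minimal sketch of a pre-flight check (the parser and function names here are hypothetical helpers, not part of the BOS-HDFS SDK):

```python
# Keys the BOS-HDFS connector expects in spark-defaults.conf (from the steps above)
REQUIRED_KEYS = [
    "spark.hadoop.fs.bos.access.key",
    "spark.hadoop.fs.bos.secret.access.key",
    "spark.hadoop.fs.bos.endpoint",
    "spark.hadoop.fs.AbstractFileSystem.bos.impl",
    "spark.hadoop.fs.bos.impl",
]


def load_spark_defaults(path):
    """Parse a spark-defaults.conf file into a dict.

    Spark properties files allow 'key=value' or whitespace-separated
    'key value' lines; '#' lines are comments.
    """
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if "=" in line:
                key, _, value = line.partition("=")
            else:
                key, _, value = line.partition(" ")
            props[key.strip()] = value.strip()
    return props


def missing_bos_keys(path):
    """Return the required BOS keys absent from the given config file."""
    props = load_spark_defaults(path)
    return [k for k in REQUIRED_KEYS if k not in props]
```

Running `missing_bos_keys("${SPARK_HOME}/conf/spark-defaults.conf")` (with the path expanded) before submitting a job surfaces missing keys early instead of at runtime.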
Use
Create a demo.py script that writes a small student dataset to my-bucket as a Parquet file, reads it back, and lists the students older than 22.
Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example_app")\
    .config("spark.driver.bindAddress", "localhost")\
    .getOrCreate()

bosFile = "bos://my-bucket/student"

# Write sample data to BOS as Parquet
data = [("abc", 22), ("def", 17), ("ghi", 34)]
df = spark.createDataFrame(data, ["name", "age"])
df.write.parquet(bosFile)

# Read it back and query with Spark SQL
df = spark.read.parquet(bosFile)
df.printSchema()
df.createOrReplaceTempView("students")
sqlDF = spark.sql("SELECT * FROM students WHERE age > 22 LIMIT 10")
sqlDF.show()
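Because the job generates its own sample data, the expected query result can be sanity-checked locally with plain Python, no Spark or BOS access required. The rows below mirror those in demo.py:

```python
# Same sample rows that demo.py writes to BOS
data = [("abc", 22), ("def", 17), ("ghi", 34)]

# Plain-Python equivalent of: SELECT * FROM students WHERE age > 22
older = [row for row in data if row[1] > 22]
print(older)  # [('ghi', 34)]
```

Only ("ghi", 34) qualifies: 22 is not strictly greater than 22, so "abc" is excluded.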
Run
Bash
spark-submit demo.py
