Spark Usage Guide
Updated at: 2025-11-03
Spark
Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. It is a versatile parallel framework, open-sourced by UC Berkeley's AMP Lab, similar to Hadoop MapReduce. Spark retains the advantages of Hadoop MapReduce, but unlike MapReduce it keeps the intermediate output of jobs in memory, avoiding reads and writes to HDFS. This makes Spark well suited to iterative algorithms, such as those common in data mining and machine learning.
Install
1. Spark environment preparation
Bash
# Download to a path
wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
# Unpack
tar -zxvf spark-3.1.1-bin-hadoop3.2.tgz
# Set environment variables
export SPARK_HOME=/home/hadoop/spark-3.1.1-bin-hadoop3.2
export PATH=$PATH:$HADOOP_HOME/bin:$SPARK_HOME/bin
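Variables exported in a shell are lost when the session ends, so it is common to append them to `~/.bashrc` as well. A minimal sketch, assuming the unpack location above:

```shell
# Persist the Spark environment variables across sessions
# (path assumes Spark was unpacked to /home/hadoop as above)
cat >> ~/.bashrc <<'EOF'
export SPARK_HOME=/home/hadoop/spark-3.1.1-bin-hadoop3.2
export PATH=$PATH:$SPARK_HOME/bin
EOF
```

Open a new shell (or `source ~/.bashrc`) for the change to take effect.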
2. Add dependency jar packages
Download BOS-HDFS
Bash
# Unzip and copy the jar into Spark's dependency path
unzip bos-hdfs-sdk-1.0.3-community.jar.zip
cp bos-hdfs-sdk-1.0.3-community.jar ${SPARK_HOME}/jars
# Required configuration for accessing BOS
cp ${SPARK_HOME}/conf/spark-defaults.conf.template ${SPARK_HOME}/conf/spark-defaults.conf
vim ${SPARK_HOME}/conf/spark-defaults.conf
...
cat ${SPARK_HOME}/conf/spark-defaults.conf
spark.hadoop.fs.bos.access.key={your ak}
spark.hadoop.fs.bos.secret.access.key={your sk}
spark.hadoop.fs.bos.endpoint=http://bj.bcebos.com {your bucket endpoint}
spark.hadoop.fs.AbstractFileSystem.bos.impl=org.apache.hadoop.fs.bos.BOS
spark.hadoop.fs.bos.impl=org.apache.hadoop.fs.bos.BaiduBosFileSystem
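A missing key in `spark-defaults.conf` typically surfaces only at job runtime, so a quick sanity check of the edited file can save a failed submit. The sketch below is an illustrative stdlib helper (not part of Spark or the BOS-HDFS SDK) that parses `key=value` lines and reports any missing BOS keys:

```python
# Sanity-check that a spark-defaults.conf text defines the BOS keys.
# Illustrative helper only; key names follow the configuration above.

REQUIRED_KEYS = {
    "spark.hadoop.fs.bos.access.key",
    "spark.hadoop.fs.bos.secret.access.key",
    "spark.hadoop.fs.bos.endpoint",
    "spark.hadoop.fs.bos.impl",
}

def parse_conf(text):
    """Parse key=value lines, skipping blanks and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf

# Example content mirroring the settings above (placeholder credentials)
sample = """
spark.hadoop.fs.bos.access.key=AK
spark.hadoop.fs.bos.secret.access.key=SK
spark.hadoop.fs.bos.endpoint=http://bj.bcebos.com
spark.hadoop.fs.bos.impl=org.apache.hadoop.fs.bos.BaiduBosFileSystem
"""

missing = REQUIRED_KEYS - parse_conf(sample).keys()
print("missing keys:", sorted(missing))
```

In practice you would read the real file, e.g. `parse_conf(open(f"{SPARK_HOME}/conf/spark-defaults.conf").read())`.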
Use
Create a demo.py script that writes a small student DataFrame to my-bucket as Parquet, reads it back, and queries the students older than 22.
Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example_app")\
    .config("spark.driver.bindAddress", "localhost")\
    .getOrCreate()

bosFile = "bos://my-bucket/student"

# Write a small DataFrame to BOS as Parquet
data = [("abc", 22), ("def", 17), ("ghi", 34)]
df = spark.createDataFrame(data, ["name", "age"])
df.write.parquet(bosFile)

# Read it back and query students older than 22
df = spark.read.parquet(bosFile)
df.printSchema()
df.createOrReplaceTempView("students")
sqlDF = spark.sql("SELECT * FROM students WHERE age > 22 LIMIT 10")
sqlDF.show()
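For a quick check of the query logic itself, the same filter can be expressed over the sample rows in plain Python, with no Spark cluster needed:

```python
# Sample rows from demo.py
data = [("abc", 22), ("def", 17), ("ghi", 34)]

# Equivalent of: SELECT * FROM students WHERE age > 22
matches = [(name, age) for name, age in data if age > 22]
print(matches)  # [('ghi', 34)]
```

Only the row ("ghi", 34) satisfies the predicate, which is what sqlDF.show() should display.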
Run
Bash
spark-submit demo.py
