Spark Usage Guide
Updated at: 2025-11-03
Spark
Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. It is a versatile parallel framework open-sourced by the UC Berkeley AMP Lab. Spark retains the strengths of Hadoop MapReduce, but unlike MapReduce it can keep intermediate job output in memory instead of writing it to HDFS between stages. This makes Spark particularly well suited to iterative algorithms, such as those used in data mining and machine learning.
Install
1. Spark environment preparation
Bash
# Download to a path
wget https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
# Extract the archive
tar -zxvf spark-3.1.1-bin-hadoop3.2.tgz
# Set environment variables (assumes HADOOP_HOME is already set)
export SPARK_HOME=/home/hadoop/spark-3.1.1-bin-hadoop3.2
export PATH=$PATH:$HADOOP_HOME/bin:$SPARK_HOME/bin
2. Add the dependency JAR package
Download BOS-HDFS
Bash
# Unzip and copy the JAR to the Spark dependency path
unzip bos-hdfs-sdk-1.0.3-community.jar.zip
cp bos-hdfs-sdk-1.0.3-community.jar ${SPARK_HOME}/jars
# Required configuration for accessing BOS
cp ${SPARK_HOME}/conf/spark-defaults.conf.template ${SPARK_HOME}/conf/spark-defaults.conf
vim ${SPARK_HOME}/conf/spark-defaults.conf
...
cat ${SPARK_HOME}/conf/spark-defaults.conf
spark.hadoop.fs.bos.access.key={your ak}
spark.hadoop.fs.bos.secret.access.key={your sk}
spark.hadoop.fs.bos.endpoint={your bucket endpoint, e.g. http://bj.bcebos.com}
spark.hadoop.fs.AbstractFileSystem.bos.impl=org.apache.hadoop.fs.bos.BOS
spark.hadoop.fs.bos.impl=org.apache.hadoop.fs.bos.BaiduBosFileSystem
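If any of the five properties above is missing or misspelled, Spark jobs will fail when they first touch a bos:// path. A minimal sketch of a pre-flight check (the parser and function names here are hypothetical helpers, not part of the BOS-HDFS SDK):

```python
# Keys the BOS-HDFS connector expects in spark-defaults.conf (from the steps above)
REQUIRED_KEYS = [
    "spark.hadoop.fs.bos.access.key",
    "spark.hadoop.fs.bos.secret.access.key",
    "spark.hadoop.fs.bos.endpoint",
    "spark.hadoop.fs.AbstractFileSystem.bos.impl",
    "spark.hadoop.fs.bos.impl",
]


def load_spark_defaults(path):
    """Parse a spark-defaults.conf file into a dict.

    Spark properties files allow 'key=value' or whitespace-separated
    'key value' lines; '#' lines are comments.
    """
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            if "=" in line:
                key, _, value = line.partition("=")
            else:
                key, _, value = line.partition(" ")
            props[key.strip()] = value.strip()
    return props


def missing_bos_keys(path):
    """Return the required BOS keys absent from the given config file."""
    props = load_spark_defaults(path)
    return [k for k in REQUIRED_KEYS if k not in props]
```

Running `missing_bos_keys("${SPARK_HOME}/conf/spark-defaults.conf")` (with the path expanded) before submitting a job surfaces missing keys early instead of at runtime.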
Use
Create a demo.py script that writes a small student dataset to my-bucket as a Parquet file, reads it back, and lists the students older than 22.
Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example_app")\
    .config("spark.driver.bindAddress", "localhost")\
    .getOrCreate()

bosFile = "bos://my-bucket/student"

# Write sample data to BOS as Parquet
data = [("abc", 22), ("def", 17), ("ghi", 34)]
df = spark.createDataFrame(data, ["name", "age"])
df.write.parquet(bosFile)

# Read it back and query with Spark SQL
df = spark.read.parquet(bosFile)
df.printSchema()
df.createOrReplaceTempView("students")
sqlDF = spark.sql("SELECT * FROM students WHERE age > 22 LIMIT 10")
sqlDF.show()
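Because the job generates its own sample data, the expected query result can be sanity-checked locally with plain Python, no Spark or BOS access required. The rows below mirror those in demo.py:

```python
# Same sample rows that demo.py writes to BOS
data = [("abc", 22), ("def", 17), ("ghi", 34)]

# Plain-Python equivalent of: SELECT * FROM students WHERE age > 22
older = [row for row in data if row[1] > 22]
print(older)  # [('ghi', 34)]
```

Only ("ghi", 34) qualifies: 22 is not strictly greater than 22, so "abc" is excluded.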
Run
Bash
spark-submit demo.py
