Apache Spark Local Setup Guide
In this article, we’re going to talk through setting up Apache Spark on your local machine, along with the development environment to follow along with this course.
If you haven’t read our previous post on what Apache Spark is and whether you should be using it, check it out here.
What we are setting up
This guide focuses on writing Apache Spark code with Python. That means we’re going to install PySpark, the Python library for Spark. We’re going to install Anaconda to manage this process and make it easy. Then we’re going to use Jupyter notebooks to write and run our code.
Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing, data science, and machine learning. It includes a wide range of packages and tools for data analysis and visualization, as well as popular libraries such as NumPy, Pandas, scikit-learn, and TensorFlow.
One of the main benefits of Anaconda is that it simplifies the process of installing and managing packages and libraries. Instead of installing each package individually, you can use Anaconda to install everything you need in one go. Anaconda also comes with a package manager called conda, which allows you to easily install, update, and remove packages, as well as create and manage virtual environments.
Download Anaconda: The first step is to download Anaconda from the Anaconda website (https://www.anaconda.com/products/individual). You should select the version that is compatible with your operating system (e.g., Windows, macOS, or Linux).
Install Anaconda: Once the download is complete, open the installation file and follow the prompts to install Anaconda. You may be asked to choose which components to install and where to install Anaconda. It is recommended to accept the default options.
Launch Anaconda Navigator: After the installation is complete, launch Anaconda Navigator. This will open a window that allows you to manage your Anaconda installation and launch various applications, including Jupyter notebooks.
Once you’ve completed the steps above, you should see the Anaconda Navigator home screen, which lists the applications you can launch.
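If you want to confirm that the Python you are running really comes from your Anaconda installation, a small check like the one below can help. This is only a sketch: the function name describe_python_environment is made up for illustration, and the CONDA_DEFAULT_ENV variable is usually set by conda when an environment is activated, but it may be missing in some setups.

```python
import os
import sys


def describe_python_environment():
    """Return basic facts about the interpreter that is running this code."""
    return {
        # Full path to the Python binary in use
        "executable": sys.executable,
        # Root folder of the active environment
        "prefix": sys.prefix,
        # Name of the active conda environment, or None if the variable is not set
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # Python version number without the build details
        "version": sys.version.split()[0],
    }


for key, value in describe_python_environment().items():
    print(f"{key}: {value}")
```

If the executable path points inside your Anaconda folder, the notebook or prompt is using the Anaconda Python.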
Now that you have Anaconda, we can install PySpark really simply.
Open up Anaconda Prompt.
Enter the following command:
pip install pyspark
Once that’s done, we’re ready to start writing some code.
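Before moving on, you may want to check that the install actually landed in the environment you are using. The sketch below uses Python’s standard importlib.metadata module to look up the installed version; the helper name installed_version is hypothetical and only there for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


pyspark_version = installed_version("pyspark")
if pyspark_version is None:
    print("PySpark is not installed in this environment.")
else:
    print(f"PySpark {pyspark_version} is installed.")
```

Note that this only checks the Python package. PySpark also needs a Java runtime on your machine before it can start a session.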
We’re going to be running our code in Jupyter notebooks. One of the main benefits of Jupyter notebooks is that they allow you to write and run code in a flexible and interactive way. You can mix code blocks with text, equations, and visualizations, and you can run the code blocks one at a time or all at once. This makes it easy to test and debug code, as well as to document and share your work.
To get started, simply hit launch where you see Jupyter in Anaconda Navigator.
This will launch a terminal session, along with a window in your web browser where you can navigate.
If you just want to launch a Jupyter notebook, you can also search for Jupyter on your machine and launch it from there.
Create a new notebook
I’ve created a folder called Pyspark Tutorial, which I will be using to store all of the files for this course.
From there, in the top right you should see a button labelled New. Simply hit New and choose the Python notebook option.
This should launch another web browser window, and you’re now ready to start writing some code!
If you’re not familiar with Jupyter notebooks, you can check out our introduction to Jupyter notebooks guide.
That guide goes over the key features of Jupyter notebooks that are essential to learn.
Test PySpark is working
We will cover this in more detail later. For now, let’s just validate that PySpark is working:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
print(type(spark))
The output should be:
<class 'pyspark.sql.session.SparkSession'>
Let’s look at what this is doing
Creating a Spark Session
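from pyspark.sql import SparkSession: This imports the SparkSession class from the pyspark.sql module. A SparkSession is the entry point to Spark. It connects your program to a Spark cluster (or, on your own machine, to a local Spark instance) and is used to create DataFrames, the main data structure you work with in PySpark. The typed Dataset API you may see mentioned elsewhere is only available in Scala and Java.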
spark = SparkSession.builder.getOrCreate(): This creates a SparkSession object. The builder attribute is used to create a SparkSession.Builder, which can be used to configure the SparkSession. The getOrCreate() method creates a SparkSession, or if one already exists, returns the active one.
So, in summary, this code creates a SparkSession object, which connects your program to Spark and is used to create DataFrames. You will write this code at the beginning of each notebook if you’re developing in a local environment. If you’re working in an integrated environment such as Databricks, a session should already be set up for you and configured on the cluster, so you won’t need this code.
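As a rough sketch of how the builder can be configured before calling getOrCreate(), the helper below sets a master URL and an application name. The function name build_local_session and the app name pyspark-tutorial are made up for illustration. master() and appName() are standard builder methods, and local[*] asks Spark to run on this machine using all available cores.

```python
def build_local_session(app_name="pyspark-tutorial"):
    """Create (or reuse) a SparkSession that runs on this machine only."""
    # Imported inside the function so this cell still loads if PySpark is missing;
    # the import error then only appears when the function is called.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")   # run Spark locally on all available CPU cores
        .appName(app_name)    # name shown in the Spark UI and logs
        .getOrCreate()        # reuse the active session if one already exists
    )
```

In a notebook cell you could then write spark = build_local_session() and carry on exactly as with the shorter version above. Because getOrCreate() reuses an existing session, calling the helper a second time in the same notebook should hand back the session you already have rather than start a new one.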
Now that you have everything set up and working, we’re ready to start using Apache Spark with Python.