Spark for Machine Learning AI

Prem Vishnoi(cloudvala)
6 min read · Jan 14, 2024


Learning objectives
Machine learning workflows
Organizing data in DataFrames
Preprocessing and data preparation steps for machine learning
Clustering data
Classification algorithms
Regression methods available in Spark MLlib
Common approaches to designing recommendation systems

Hi, I’m Pv and in this course I’ll be describing how to use the Apache Spark Platform for machine learning.

We’ll start by installing Spark and reviewing the basics of the DataFrame data structure. We’ll cover how to preprocess both numeric and text data so that it’s ready to use with Spark’s MLlib machine learning library.

We’ll describe multiple algorithms for clustering, classification, and regression.

We’ll briefly describe a recommendation system as well. Throughout the course we’ll see a common pattern of preprocessing, model training, and evaluation that helps streamline the building of machine learning pipelines. So let’s dive into learning with Spark MLlib.

Steps in the machine learning process

There are three broad steps in the machine learning process.
The first is preprocessing, which includes collecting, reformatting, and transforming data, so that it can be readily used by machine learning algorithms.
The second step is model building, in which we apply machine learning algorithms to training data to build models.
Models are pieces of code that capture the information implicit in training data.
The last step is validation, in which we measure how well models are performing.
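
The three steps above can be sketched in a few lines of plain Python. This is only an illustration, not Spark code: the function names (preprocess, fit, predict, validate), the toy data, and the simple threshold "model" are all assumptions made up for this example.

```python
from statistics import mean

def preprocess(rows):
    """Step 1: drop records with a missing feature and convert features to float."""
    return [(float(x), label) for x, label in rows if x is not None]

def fit(train):
    """Step 2: 'learn' a threshold halfway between the two class means."""
    mean0 = mean(x for x, label in train if label == 0)
    mean1 = mean(x for x, label in train if label == 1)
    return (mean0 + mean1) / 2

def predict(threshold, x):
    """Predict class 1 when the feature is above the learned threshold."""
    return 1 if x > threshold else 0

def validate(threshold, test):
    """Step 3: measure accuracy on data the model has not seen."""
    correct = sum(1 for x, label in test if predict(threshold, x) == label)
    return correct / len(test)

# Hypothetical toy data: (feature, label) pairs, one with a missing feature.
raw_train = [(1, 0), (2, 0), (None, 1), (8, 1), (9, 1)]
raw_test = [(1.5, 0), (7, 1), (3, 0), (10, 1)]

model = fit(preprocess(raw_train))
accuracy = validate(model, preprocess(raw_test))
print("threshold:", model, "accuracy:", accuracy)
```

Here the "model" is just one learned number, but the shape of the workflow (prepare the data, fit on training data, score on held-out data) is the same pattern we will follow with Spark MLlib.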

There are multiple ways to measure performance.
The preprocessing phase includes extracting, transforming, and loading data.
This is similar to the ETL process used in business intelligence and data warehousing.
It’s a good practice to review data at this stage to determine if there are any missing or invalid values.
If values are missing or invalid, you may want to set those values to some default value or ignore those records.
The best way to deal with missing or invalid data will depend upon your use case.
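
As a rough sketch of those two options, here is a plain Python example that either fills in a default or drops the record. The records, the is_valid_age rule, and DEFAULT_AGE are hypothetical choices for illustration. On a Spark DataFrame, the analogous operations are available as DataFrame.na.fill and DataFrame.na.drop.

```python
# Hypothetical records: one with a missing age, one with an invalid age.
records = [
    {"id": 1, "age": 34, "country": "USA"},
    {"id": 2, "age": None, "country": "CAN"},
    {"id": 3, "age": -5, "country": "MEX"},
]

DEFAULT_AGE = 0  # assumed default; the right value depends on your use case

def is_valid_age(age):
    """Treat None and out-of-range values as missing or invalid (assumed rule)."""
    return age is not None and 0 <= age <= 120

# Option 1: replace missing or invalid values with a default.
filled = [
    {**r, "age": r["age"] if is_valid_age(r["age"]) else DEFAULT_AGE}
    for r in records
]

# Option 2: ignore (drop) records with missing or invalid values.
dropped = [r for r in records if is_valid_age(r["age"])]

print(filled)
print(dropped)
```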

Normalizing and scaling data is the process of changing the range of values in a feature or attribute.
We’ll discuss normalizing and scaling in an upcoming video.
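
To make the idea concrete before then, here is a minimal plain Python sketch of two common ways to change the range of a feature: min-max scaling to the range 0 to 1, and standardizing to zero mean and unit standard deviation. The function names and the incomes list are made up for this example. Spark provides comparable transformers, MinMaxScaler and StandardScaler, in pyspark.ml.feature, though their exact defaults may differ from this sketch.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift to zero mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes = [20.0, 40.0, 60.0, 100.0]  # hypothetical feature values

scaled = min_max_scale(incomes)
standardized = standardize(incomes)

print(scaled)
print(standardized)
```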
Another good preprocessing practice is to standardize categorical variables.
For example, you could ensure that all country names in your data set are designated by three-letter ISO codes.
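
One simple way to do this is a lookup table from the spellings you find in your data to the standard code. The sketch below is plain Python; the ISO3_BY_NAME table and the to_iso3 helper are hypothetical and cover only three countries.

```python
# Hypothetical lookup table: known spellings mapped to three-letter ISO codes.
ISO3_BY_NAME = {
    "united states": "USA",
    "usa": "USA",
    "u.s.": "USA",
    "canada": "CAN",
    "mexico": "MEX",
}

def to_iso3(name):
    """Return the three-letter code for a country name, or None if unknown."""
    key = name.strip().lower()
    return ISO3_BY_NAME.get(key)

raw_countries = [" United States ", "usa", "Canada", "Atlantis"]
codes = [to_iso3(c) for c in raw_countries]
print(codes)
```

Returning None for unknown names makes it easy to review the values that still need a mapping, rather than silently guessing.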
In the model building stage, we may experiment with different algorithms to see which works well with our data and use case.
Applying an algorithm to a data set is called fitting a model to the data.
Some algorithms require us to specify parameters, such as how many levels to have in a decision tree.
These are called hyperparameters, and we often need to experiment to find optimal hyperparameters.
Now, just a note about terminology: when we set a value for the machine learning algorithm ourselves, before training, we call it a hyperparameter.
When the machine learning algorithm learns the value from training data, we simply call it a parameter.
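
Here is a small plain Python sketch of that distinction. Instead of a decision tree, it uses a simpler stand-in: fitting a line through the origin with an L2 penalty. The function fit_slope and the data are assumptions for illustration. The penalty strength reg is a hyperparameter because we choose it, and the returned slope is a parameter because it is learned from the training data.

```python
def fit_slope(xs, ys, reg):
    """Fit y = w * x by least squares with an L2 penalty.

    reg is a hyperparameter (we set it before training);
    the returned slope w is a parameter (learned from the data).
    """
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + reg
    return numerator / denominator

# Hypothetical training data that follows y = 2 * x exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

slope_no_penalty = fit_slope(xs, ys, reg=0.0)
slope_with_penalty = fit_slope(xs, ys, reg=14.0)

print(slope_no_penalty, slope_with_penalty)
```

The same data gives a different learned slope for each setting of reg, which is why we usually have to experiment with hyperparameter values.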
The last step in the machine learning process is validating models.
In this step, we’re trying to understand how well our models work with data they have not seen before.

We can use metrics, like accuracy, precision, and sensitivity.
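
For a binary classifier, these three metrics can be computed from counts of true and false positives and negatives. The sketch below is plain Python with made-up labels; the helper names are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count true positives, false positives, true negatives, false negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """Share of predicted positives that were actually positive."""
    return tp / (tp + fp)

def sensitivity(tp, fn):
    """Share of actual positives that were found (also called recall)."""
    return tp / (tp + fn)

# Hypothetical labels for ten held-out examples (1 = positive, 0 = negative).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
print(accuracy(tp, fp, tn, fn), precision(tp, fp), sensitivity(tp, fn))
```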
These three steps constitute the basic steps in building machine learning models.
In practice, we typically do these steps repeatedly.
Each iteration often gives us new information about our data and our assumptions and helps us hone our models.

Install Spark
Let’s install Spark.
I’ve opened a browser, and navigated to the Spark download site, at spark.apache.org/downloads.html.
I’m going to use all the default values for various options, and directly download a compressed file.
I’ll just Click on the File, and that starts the download.
The Microsoft Edge browser may not download the .tgz file correctly.
I suggest using Chrome, or another browser.
Now let’s open up the downloaded folder, and see the Spark Installation File.
Since it’s a gzipped tar file, which I can tell from the .tgz extension, I can decompress this file to create a Spark directory.
On macOS, I decompress it by double-clicking.
This creates a directory with the full name of the Spark version.
To minimize typing, I’m going to simply Rename this directory to Spark.
And now I’m going to Move the Spark Folder into my Home Directory.
Now let’s Open a Terminal Window.
Spark requires that we have Java installed.
You can check if Java is installed, by typing java -version.
And you should see some message about which version is running.
Now, if you do not have Java installed, you can Navigate to this Java Download Site, and Download the Appropriate Java Installation File for your operating system.
You will also need to have Python installed.
Now, most Mac and Linux systems come with Python installed, but Windows typically does not.
I recommend using the Anaconda Python Distribution, which is available for download.
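
If you prefer to check both prerequisites in one go, a small Python script can look for the commands on your PATH. This is just a convenience sketch; find_tool and report are hypothetical helper names, and on some systems the Python command is python3 rather than python.

```python
import shutil

def find_tool(name):
    """Return the full path of a command on the PATH, or None if it is not found."""
    return shutil.which(name)

def report(names):
    """Map each command name to its location (or None when missing)."""
    return {name: find_tool(name) for name in names}

# Spark needs Java; pyspark also needs Python.
# Assumption: the commands are named "java" and "python" on your system.
status = report(["java", "python"])

for name, path in status.items():
    if path is None:
        print(name + ": not found on PATH")
    else:
        print(name + ": found at " + path)
```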
Now, I have Java installed.
I’m going to Navigate to the Spark Directory, and I’m going to List the Files in this directory.
I use the ls command on macOS, but Windows users will have to use the dir command.
I’d like to point out two directories here.
The Data Directory contains sample datasets, including some for machine learning.
The BIN Directory, which I will move into right now, contains binary files for running Spark command-line processors.
We will use pyspark, which allows us to use Python.
Now if you’d like to work with Scala, you can use the spark-shell command.
If you’d like to use R for Data Science and Machine Learning, you might want to try the sparkR command interpreter.
So let’s start the pyspark interpreter by typing ./pyspark.
On Windows, type .\pyspark.
And by default, Spark will display some logging messages, and once you see the banner for Spark, you will have successfully installed and started Spark.
And so at this point I’ll just Exit.
And that concludes the installation.
