Apache Spark MLlib Tutorial – Learn about Spark’s Scalable Machine Learning Library
MLlib is the machine learning library that comes with Apache Spark. It provides scalable algorithms and utilities for building machine learning workflows on distributed data.
In practice, Spark MLlib is used when the training data is too large for a single machine or when the machine learning pipeline must run alongside existing Spark ETL, SQL, streaming, or data lake workloads. It supports common machine learning tasks such as classification, regression, clustering, recommendation, feature transformation, model evaluation, and pipeline construction.
What Spark MLlib means in modern Apache Spark projects
Apache Spark documentation uses MLlib as the overall name for Spark’s machine learning library. There are two package families to understand:
- spark.ml is the DataFrame-based machine learning API and is the primary API for new Spark MLlib work.
- spark.mllib is the older RDD-based API. It is in maintenance mode, so it is mainly used for older codebases that already depend on RDD-based MLlib classes.
So MLlib itself has not been removed or deprecated; only the RDD-based spark.mllib API is in maintenance mode. For new development, prefer the DataFrame-based spark.ml API because it supports ML pipelines and integrates well with Spark SQL DataFrames.
Reference: Apache Spark MLlib Main Guide
Programming languages supported by Spark MLlib
MLlib applications can be developed in Java using Spark's APIs.
Spark MLlib can also be used from Scala, Python, and R, depending on the Spark API selected for the project. Python users commonly use PySpark MLlib with DataFrames, while Java and Scala are often used in production Spark applications. R users can work with SparkR and related Spark machine learning APIs where supported.
MLlib also interoperates with NumPy in Python and with R libraries.
Data sources used in Apache Spark MLlib workflows
Using MLlib, one can access HDFS (Hadoop Distributed File System) and HBase, in addition to local files. This makes it easy to plug MLlib into Hadoop workflows.
In current Spark applications, MLlib usually reads data through Spark DataFrames. The source may be a local file, HDFS, object storage, Hive table, JDBC source, Parquet file, CSV file, JSON file, or another Spark-supported data source. After loading the data, the usual next steps are cleaning columns, converting categorical columns, assembling feature vectors, training a model, and evaluating the output.
Why Spark MLlib fits scalable machine learning workloads
Spark excels at iterative computation, which lets the iterative parts of MLlib algorithms run fast. MLlib also contains high-quality algorithms for classification, regression, recommendation, clustering, topic modelling, and more.
Machine learning often repeats the same operations many times while fitting a model. Spark can distribute those computations across a cluster and can reuse cached intermediate data. This is useful when feature preparation, training, and evaluation must run over large datasets.

Spark MLlib machine learning pipeline components
The DataFrame-based Spark MLlib API is organized around pipeline stages. These stages make a model workflow easier to read, save, reload, and run again on similar data.
- DataFrame: The tabular input data used by the ML pipeline.
- Transformer: A stage that converts one DataFrame into another, such as adding a features column.
- Estimator: A stage that learns from data and produces a model. Logistic regression is an example of an estimator.
- Pipeline: A sequence of transformers and estimators executed in order.
- Evaluator: A component used to measure model quality with metrics such as accuracy, area under ROC, or regression error.
- Parameter tuning: Cross-validation and train-validation split utilities help compare parameter combinations.
PySpark MLlib example with a simple classification pipeline
The following small PySpark example shows the basic shape of a Spark MLlib classification workflow. It creates a DataFrame, assembles numeric feature columns, trains a logistic regression model, and evaluates predictions.
```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("SparkMLlibExample").getOrCreate()

# Small in-memory dataset: two numeric features and a binary label.
data = spark.createDataFrame([
    (1.0, 2.0, 0.0),
    (2.0, 1.0, 0.0),
    (3.0, 4.0, 1.0),
    (4.0, 3.0, 1.0)
], ["feature_one", "feature_two", "label"])

# Combine the numeric columns into a single vector column.
assembler = VectorAssembler(
    inputCols=["feature_one", "feature_two"],
    outputCol="features"
)

# Logistic regression estimator that reads the assembled features.
logistic_regression = LogisticRegression(
    featuresCol="features",
    labelCol="label"
)

# Chain the transformer and the estimator, then fit the pipeline.
pipeline = Pipeline(stages=[assembler, logistic_regression])
model = pipeline.fit(data)

# Score the same rows (for illustration only) and compute area under ROC.
predictions = model.transform(data)
evaluator = BinaryClassificationEvaluator(
    labelCol="label",
    rawPredictionCol="rawPrediction"
)
area_under_roc = evaluator.evaluate(predictions)
print(area_under_roc)
```
This example is intentionally small so that the pipeline structure is clear. In a real Spark MLlib project, the input data usually comes from a file, table, or data lake, and the feature engineering stage may include missing-value handling, string indexing, one-hot encoding, scaling, and other transformations.
Apache Spark MLlib algorithms for scalable model building
The following are some of the MLlib algorithms, each with a step-by-step walkthrough of ML Pipeline construction and model building:
- Classification using Logistic Regression
- Classification using Naive Bayes
- Generalized Regression
- Survival Regression
- Decision Trees
- Random Forests
- Gradient Boosted Trees
- Recommendation using Alternating Least Squares (ALS)
- Clustering using KMeans
- Clustering using Gaussian Mixtures
- Topic Modelling using Latent Dirichlet Allocation (LDA)
- Frequent Itemsets
- Association Rules
- Sequential Pattern Mining
At a high level, these algorithms fit into common machine learning categories:
- Classification: Predicts a category, such as spam or not spam.
- Regression: Predicts a numeric value, such as price or demand.
- Clustering: Groups similar records when labels are not available.
- Recommendation: Suggests items or content based on user-item patterns.
- Frequent pattern mining: Finds common itemsets, association rules, and sequence patterns.
MLlib utilities for feature engineering, evaluation, and tuning
MLlib provides the following workflow utilities:
- Feature Transformation
- ML Pipeline construction
- Model Evaluation
- Hyper-parameter tuning
- Saving and loading of models and pipelines
- Distributed Linear Algebra
- Statistics
These utilities are important because model quality depends not only on the algorithm, but also on how the data is prepared and evaluated. For example, categorical features may need to be indexed, multiple numeric columns may need to be assembled into a single vector column, and model parameters may need to be compared using cross-validation.
Spark MLlib model training steps from data to prediction
A typical Spark MLlib workflow follows these steps:
- Load data into a Spark DataFrame. Read the training data from a file, table, or supported storage system.
- Prepare input columns. Clean missing values, convert labels, encode categorical columns, and assemble features.
- Split data for training and testing. Keep a portion of data aside for evaluation.
- Build a Spark ML pipeline. Add feature transformers and the selected algorithm as pipeline stages.
- Fit the model. Train the model on the training DataFrame.
- Evaluate predictions. Use a metric suitable for the problem, such as accuracy, area under ROC, RMSE, or MAE.
- Save the model or pipeline. Persist the fitted model when it has to be reused for batch scoring or production inference.
Difference between Spark ML and Spark MLlib
The terms Spark ML and MLlib are often used together, so the naming can be confusing. MLlib is the machine learning library in Apache Spark. The spark.ml package is the DataFrame-based API inside MLlib and is the recommended API for most new work. The spark.mllib package is the older RDD-based API and is mainly kept for compatibility and maintenance.
| Term | Meaning in Spark machine learning |
|---|---|
| MLlib | The overall Spark machine learning library. |
| spark.ml | DataFrame-based API used for modern Spark ML pipelines. |
| spark.mllib | Older RDD-based API that is in maintenance mode. |
| ML Pipeline | A sequence of feature transformations and model stages. |
When to use Apache Spark MLlib for machine learning
Spark MLlib is a good fit when the data is already in Spark or when preprocessing and training must be distributed across a cluster. It is also useful when the same platform handles ETL, feature preparation, training, and batch scoring.
For small datasets that fit comfortably on one machine, a single-node machine learning library may be simpler. For large distributed datasets, Spark MLlib reduces data movement by allowing feature engineering and model training to run in the Spark environment.
Common mistakes in Spark MLlib tutorials and beginner projects
- Using the RDD-based API for new code without a reason: Prefer spark.ml DataFrame pipelines for new projects.
- Skipping feature preparation: Many algorithms expect a single vector column named features and a numeric label column.
- Training and testing on the same data: Always separate training and evaluation data when checking model performance.
- Ignoring categorical feature handling: String columns usually need indexing or encoding before training.
- Choosing Spark MLlib for very small problems: Spark has overhead, so it is most useful when scale or integration with Spark data workflows matters.
QA checklist for this Spark MLlib scalable machine learning tutorial
- Confirm that the tutorial explains MLlib as the overall Spark machine learning library.
- Check that the difference between spark.ml and spark.mllib is stated clearly.
- Verify that new examples use the DataFrame-based spark.ml API.
- Confirm that existing internal links to Java MLlib examples remain unchanged.
- Check that the PySpark example includes a feature vector column and a label column.
- Review whether the page explains algorithms, utilities, data sources, pipelines, evaluation, and tuning.
FAQs on Apache Spark MLlib scalable machine learning library
Is Spark MLlib deprecated?
No, MLlib as Spark’s machine learning library is not deprecated. The older RDD-based spark.mllib API is in maintenance mode, while the DataFrame-based spark.ml API is the primary API for current Spark machine learning work.
Which library is used for machine learning in Apache Spark?
Apache Spark uses MLlib for machine learning. For new projects, use the DataFrame-based spark.ml API, which provides transformers, estimators, pipelines, evaluators, and tuning utilities.
What is the difference between Spark ML and MLlib?
MLlib is the overall machine learning library in Spark. Spark ML usually refers to the DataFrame-based spark.ml API inside MLlib. The older spark.mllib package is RDD-based and is mainly maintained for compatibility.
What are the main types of ML models supported by Spark MLlib?
Spark MLlib supports common model types such as classification, regression, clustering, recommendation, frequent pattern mining, and topic modelling. It also provides feature transformation, evaluation, and tuning utilities around these models.
Should beginners learn PySpark MLlib or Java Spark MLlib first?
Beginners can start with either language depending on their project. PySpark is convenient for data science workflows, while Java and Scala are common in production Spark applications. The main concept to learn is the DataFrame-based ML pipeline structure.
Conclusion
In this Apache Spark Tutorial – Spark MLlib Tutorial, we have learnt about different machine learning algorithms available in Spark MLlib and different utilities MLlib provides. We also covered the modern difference between spark.ml and spark.mllib, the typical model training workflow, and a simple PySpark MLlib pipeline example.
TutorialKart.com