Apache Spark RDD (Resilient Distributed Dataset)
In Apache Spark, an RDD, or Resilient Distributed Dataset, is an immutable, fault-tolerant collection of elements that can be processed in parallel across a cluster.
The RDD is the foundational low-level data abstraction in Spark. It represents data split into partitions, each of which can be processed on a different worker node. Spark can create RDDs from external storage systems, such as HDFS or local files, or from an existing collection in the driver program.
A Spark RDD can contain objects of almost any type, provided they can be serialized for transfer between nodes. For example, an RDD may contain strings, numbers, rows parsed from a file, key-value pairs, or custom objects in a Spark application.
Meaning of Resilient, Distributed, and Dataset in Spark RDD
The name RDD describes the main behavior of this Spark data structure.
- Resilient means Spark can recover lost partitions by using lineage information instead of always storing every intermediate result on disk.
- Distributed means the data is split into partitions and stored or processed across multiple nodes in a cluster.
- Dataset means it is a collection of records or objects that Spark can transform and compute on.
This combination makes RDDs useful when you need fine-grained control over distributed data processing in Spark.
How Apache Spark Creates an RDD
There are two common ways to create an RDD in Spark.
- Parallelize an existing collection from the driver program.
- Load data from an external source, such as a text file.
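The following PySpark example creates an RDD from a Python list. Here and in the later examples, sc is an existing SparkContext, such as the one the PySpark shell provides.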
numbers = sc.parallelize([1, 2, 3, 4, 5])
squares = numbers.map(lambda x: x * x)
print(squares.collect())
[1, 4, 9, 16, 25]
In this example, parallelize() creates the RDD, map() defines a transformation, and collect() is an action that brings the result back to the driver.
Spark RDD Operations: Transformations and Actions
There are two types of RDD operations in Apache Spark.
- Transformations: Create a new RDD from an existing RDD.
- Actions: Run a computation on the RDD and return a value to the driver program or write data to an external storage system.
Common RDD transformations include map(), filter(), flatMap(), distinct(), union(), reduceByKey(), and groupByKey(). Common RDD actions include collect(), count(), first(), take(), reduce(), foreach(), and saveAsTextFile().
The following example keeps only the even numbers in an RDD and then counts them.
numbers = sc.parallelize([10, 15, 20, 25, 30])
even_numbers = numbers.filter(lambda x: x % 2 == 0)
print(even_numbers.count())
3
Here, filter() is a transformation because it creates a new RDD. The count() operation is an action because it triggers computation and returns a result.
RDD Transformations Are Lazy in Spark
In Spark, transformations are lazy. This means Spark does not immediately execute a transformation when it appears in the program. Instead, Spark records the transformation in a lineage graph and waits until an action is called.
Spark keeps the recorded transformations as a chain of dependencies between RDDs. When an action is called, Spark builds a job from the transformations that the action depends on and executes it.
words = sc.parallelize(["spark", "rdd", "spark", "dataset"])
spark_words = words.filter(lambda word: word == "spark")
# Nothing is computed until this action runs.
print(spark_words.count())
2
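Lazy evaluation lets Spark see the whole chain of transformations before running anything. It can then pipeline consecutive steps such as map() and filter() into a single pass over each partition, and it never computes an RDD that no action depends on. For example, if only a count is required, Spark computes only what is needed for that count.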
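The lazy behavior described above can be modeled in plain Python without a cluster. The sketch below is a toy illustration, not Spark's implementation: LazyDataset and its methods are hypothetical names chosen to mirror the RDD API, and partitions are ignored entirely.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source      # original input data
        self._steps = steps        # recorded (kind, function) pairs

    def map(self, fn):
        # Transformation: return a new object with one more recorded step.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        # Transformation: also only recorded, never executed here.
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def _run(self):
        # Apply the recorded steps in order, one record at a time.
        for record in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                yield record

    def count(self):
        # Action: this is the point where work actually happens.
        return sum(1 for _ in self._run())


calls = []

def is_spark(word):
    calls.append(word)             # remember each invocation for inspection
    return word == "spark"

words = LazyDataset(["spark", "rdd", "spark", "dataset"])
spark_words = words.filter(is_spark)
calls_before_action = len(calls)   # 0: nothing has run yet
result = spark_words.count()       # 2: the filter runs now
calls_after_action = len(calls)    # 4: one call per input record
```

The predicate is not called when filter() is applied, only when count() runs, which is the same order of events the PySpark example shows.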
Why RDDs Are Fault Tolerant in Apache Spark
RDDs are fault tolerant because Spark tracks how each RDD was derived from its source data. This tracking information is called lineage.
An RDD's data is held in partitions. Each transformation creates a new RDD from an existing one, and Spark remembers the sequence of transformations used to build it. Because of this, Spark does not need to write every intermediate dataset to disk.
If a partition is lost because a worker node fails, Spark can recompute that partition from the original input data by using the lineage. This is how Spark provides fault tolerance without having to replicate every intermediate dataset.
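Lineage-based recovery can be sketched in plain Python. This is a simplified model under stated assumptions, not Spark's recovery code: build_partition is a hypothetical helper, the source data is assumed to be durable, and every transformation works on one partition at a time (no shuffles).

```python
def build_partition(source_partition, lineage):
    """Recompute one partition by replaying its lineage on the source data."""
    records = list(source_partition)
    for kind, fn in lineage:
        if kind == "map":
            records = [fn(r) for r in records]
        elif kind == "filter":
            records = [r for r in records if fn(r)]
    return records


# Source data split into three partitions (assumed durable, e.g. a file).
source = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Lineage: the ordered transformations that define the derived dataset.
lineage = [("map", lambda x: x * 10), ("filter", lambda x: x % 20 == 0)]

# Materialize every partition once.
partitions = {i: build_partition(p, lineage) for i, p in enumerate(source)}
original = dict(partitions)

# Simulate a worker failure that loses partition 1.
del partitions[1]

# Recover only the lost partition; partitions 0 and 2 are left untouched.
partitions[1] = build_partition(source[1], lineage)
recovered = partitions[1]          # [40, 60]
```

Only the lost partition is rebuilt, and the result is identical to what was there before the failure because the recorded transformations are deterministic.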
RDD Partitions and Parallel Processing in Spark
An RDD is divided into partitions. Each partition is a logical chunk of data that can be processed independently. Spark schedules tasks for these partitions and distributes the work across available executors.
The number of partitions affects parallelism. Too few partitions may underuse the cluster, while too many small partitions may add scheduling overhead. Spark can infer partitions from input data, but you can also set the partition count when creating an RDD or change it later with repartition() or coalesce().
data = sc.parallelize(range(1, 11), 4)
print(data.getNumPartitions())
4
In this example, the RDD is created with four partitions.
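To picture what partitioning means for parallel work, the following plain-Python sketch splits the same ten numbers into four chunks and processes each chunk independently. The helper split_into_partitions is hypothetical, and the contiguous, near-equal split is an assumption for illustration; the exact boundaries Spark chooses may differ.

```python
def split_into_partitions(records, num_partitions):
    """Split records into contiguous, near-equal chunks."""
    records = list(records)
    n = len(records)
    return [
        records[(i * n) // num_partitions : ((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


partitions = split_into_partitions(range(1, 11), 4)

# One task per partition: each chunk is summed independently,
# then the partial results are combined into the final answer.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)          # 55
```

In PySpark, data.glom().collect() returns the actual contents of each partition, which is a quick way to check how an RDD was really split.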
When to Use RDD Instead of DataFrame or Dataset
In modern Spark applications, DataFrames and Datasets are often preferred for structured data because they provide higher-level APIs and query optimization. However, RDDs are still useful in specific cases.
- Use RDDs when you need low-level control over each record.
- Use RDDs when your data is unstructured and does not fit naturally into rows and columns.
- Use RDDs when you need custom partitioning or transformations that are difficult to express with DataFrame operations.
- Use RDDs when you are learning Spark internals, because many higher-level Spark APIs are built on top of RDD concepts.
For SQL-like processing, aggregations on structured data, and most analytics workloads, DataFrames are usually easier to write and optimize. For record-level distributed programming, RDDs provide more direct control.
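As an illustration of the custom partitioning case, the sketch below routes key-value records to partitions with a user-supplied function. It is a plain-Python model, not Spark code: partition_by and region_partitioner are hypothetical names, and the routing rule is made up for the example.

```python
def partition_by(records, num_partitions, partition_func):
    """Route each (key, value) pair to partition_func(key) % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_func(key) % num_partitions].append((key, value))
    return partitions


def region_partitioner(key):
    # Hypothetical rule: keys starting with "eu" go to 0, everything else to 1.
    return 0 if key.startswith("eu") else 1


events = [("eu-west", 1), ("us-east", 2), ("eu-north", 3), ("ap-south", 4)]
partitions = partition_by(events, 2, region_partitioner)
```

PySpark's RDD.partitionBy(numPartitions, partitionFunc) accepts a comparable function for key-value RDDs, so a rule like this can decide which records end up together.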
Small RDD Example with Key-Value Pairs
RDDs are commonly used with key-value pairs. In PySpark, a key-value RDD can be represented as an RDD of tuples.
items = sc.parallelize([
("apple", 2),
("banana", 1),
("apple", 3),
("banana", 4)
])
totals = items.reduceByKey(lambda a, b: a + b)
print(totals.collect())
[('apple', 5), ('banana', 5)]
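The reduceByKey() transformation merges all values that share a key by applying the given function to them. It is commonly used for counts, sums, and other aggregations on key-value data. The order of the pairs returned by collect() is not guaranteed, so the result may list banana before apple.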
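One reason reduceByKey() is usually preferred over groupByKey() for aggregations is that values are combined inside each partition before data moves between nodes. The plain-Python sketch below models that two-step idea. The function names are hypothetical, and placing the four records from the example above into two partitions is an assumption.

```python
def combine_locally(partition, fn):
    """Merge values that share a key within a single partition."""
    combined = {}
    for key, value in partition:
        combined[key] = fn(combined[key], value) if key in combined else value
    return combined


def reduce_by_key(partitions, fn):
    """Toy reduceByKey: local combine per partition, then a final merge."""
    local_results = [combine_locally(p, fn) for p in partitions]
    merged = {}
    for local in local_results:    # stands in for the shuffle between nodes
        for key, value in local.items():
            merged[key] = fn(merged[key], value) if key in merged else value
    return merged


# The same four records, assumed to sit in two partitions.
partitions = [
    [("apple", 2), ("banana", 1)],
    [("apple", 3), ("banana", 4)],
]
totals = reduce_by_key(partitions, lambda a, b: a + b)
```

Because each partition sends at most one value per key to the merge step, the function passed in should be associative and commutative, as addition is here.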
Useful RDD Concepts to Remember
- An RDD is immutable. A transformation creates a new RDD instead of changing the existing one.
- An RDD is distributed. Its partitions can be processed on different nodes.
- An RDD is lazy. Transformations are evaluated only when an action is called.
- An RDD is fault tolerant. Spark can recompute lost partitions using lineage.
- An RDD can be cached or persisted when the same data is reused across multiple actions.
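The last point, caching, is easiest to see by counting how often a function runs. The sketch below is a plain-Python model rather than Spark's storage layer: CachedDataset and expensive_square are hypothetical names, and the data is kept in a simple in-memory list.

```python
class CachedDataset:
    """Toy dataset whose computed records can be kept after the first action."""

    def __init__(self, source, fn):
        self._source = source
        self._fn = fn
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Only sets a flag; nothing is computed until an action runs.
        self._use_cache = True
        return self

    def _records(self):
        if self._cached is not None:
            return self._cached
        records = [self._fn(x) for x in self._source]
        if self._use_cache:
            self._cached = records
        return records

    def count(self):
        return len(self._records())

    def total(self):
        return sum(self._records())


calls = {"n": 0}

def expensive_square(x):
    calls["n"] += 1                # count every invocation
    return x * x

uncached = CachedDataset([1, 2, 3], expensive_square)
uncached.count()
uncached.total()
calls_without_cache = calls["n"]   # 6: both actions recompute every record

calls["n"] = 0
cached = CachedDataset([1, 2, 3], expensive_square).cache()
cached.count()
cached_total = cached.total()
calls_with_cache = calls["n"]      # 3: the second action reuses stored records
```

In PySpark the corresponding calls are rdd.cache() and rdd.persist(). As in this model, they mark the RDD for reuse, and the data is stored when the first action computes it.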
Apache Spark RDD FAQ
What is an RDD in Apache Spark?
An RDD in Apache Spark is an immutable, distributed collection of elements that Spark can process in parallel. RDD stands for Resilient Distributed Dataset.
Why are RDD transformations lazy?
RDD transformations are lazy because Spark records them first and executes them only when an action is called. This allows Spark to build an execution plan and avoid unnecessary computation.
How does Spark RDD fault tolerance work?
Spark RDD fault tolerance works through lineage. If a partition is lost, Spark can recompute it from the original data by replaying the transformations used to create that partition.
What is the difference between RDD and DataFrame in Spark?
An RDD is a low-level distributed collection of objects, while a DataFrame is a higher-level structured API with named columns and query optimization. DataFrames are usually preferred for structured data, while RDDs are useful for low-level control.
Are RDDs still used in Spark?
RDDs are still part of Spark and are useful for low-level transformations, custom data processing, and understanding Spark internals. For many structured data tasks, DataFrames or Spark SQL are more commonly used.
Editorial QA Checklist for This Spark RDD Tutorial
- Confirm that the tutorial defines RDD as Resilient Distributed Dataset and explains immutable, distributed, and fault-tolerant behavior.
- Check that transformation and action examples correctly show lazy evaluation in Spark.
- Verify that PySpark examples use valid RDD methods such as parallelize(), map(), filter(), count(), and reduceByKey().
- Ensure that RDD vs DataFrame guidance does not imply that RDDs are removed or unsupported.
- Keep the examples small enough for beginners to run in a Spark shell or notebook.
Conclusion
In this Spark tutorial, we learned what an Apache Spark RDD is, how RDD transformations and actions work, why RDDs are fault tolerant, and when RDDs are useful compared with higher-level Spark APIs.
TutorialKart.com