Spark RDD reduce()

In this Spark Tutorial, we shall learn to reduce an RDD to a single element. Reduce is an action operation that aggregates the elements of an RDD by repeatedly applying a two-argument function.

Spark RDD Reduce

RDD.reduce() is commonly used for sums, minimum values, maximum values, logical operations, and other aggregations where a collection can be combined into one result. Because Spark evaluates RDD partitions in parallel, the reducing function must be safe to run in different grouping and ordering patterns.

How Spark RDD reduce() works across partitions

When reduce() is called on an RDD, Spark first combines elements within each partition and then combines those partial results to produce one final value at the driver. The function receives two values at a time and returns one value of the same type.

For example, for the RDD [14, 21, 88, 99, 455], the function (a, b) -> a + b may be applied in different internal groupings, but the final result is still 677 because addition is suitable for distributed reduction.

Required reduce() function properties: associative and commutative

Following are the two important properties that an aggregation function should have:

  1. Commutative    A+B = B+A  – ensuring that the result would be independent of the order of elements in the RDD being aggregated.
  2. Associative    (A+B)+C = A+(B+C) – ensuring that any two elements associated in the aggregation at a time does not affect the final result.

Examples of such functions include addition, multiplication, logical OR, logical AND, and XOR.

Avoid using reduce() with functions that depend on a fixed order, such as subtraction, division, string formatting that assumes sequence, or functions with side effects. In a distributed job, Spark is free to combine partitions in a way that may not match the order in which you see elements in a small local list.

Syntax of RDD.reduce()

The syntax of RDD reduce() method is

</>
Copy
RDD.reduce(<function>)

<function> is the aggregation function. It could be passed as an argument or you may use lambda function to define the aggregation function.

In Java RDD code, reduce() accepts a function such as (a, b) -> a + b. In PySpark RDD code, reduce() accepts a callable such as lambda a, b: a + b. The return type should match the element type because the returned value may be combined again with another element or partial result.

Java example of Spark RDD reduce() for sum of integers

In this example, we will take an RDD of Integers and reduce them to their sum using RDD.reduce() method.

RDDreduceExample.java

</>
Copy
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDreduceExample {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
											.setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);

		// read text file to RDD
		JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(14,21,88,99,455));

		// aggregate numbers using addition operator
		int sum = numbers.reduce((a,b)->a+b); 

		System.out.println("Sum of numbers is : "+sum);
	}

}

Run the above Spark Java application, and you would get the following output in console.

17/11/29 11:26:42 INFO DAGScheduler: ResultStage 0 (reduce at RDDreduceExample.java:20) finished in 0.330 s
17/11/29 11:26:43 INFO DAGScheduler: Job 0 finished: reduce at RDDreduceExample.java:20, took 0.943121 s
Sum of numbers is : 677
17/11/29 11:26:43 INFO SparkContext: Invoking stop() from shutdown hook

The Java example creates a local Spark context, parallelizes a list of integers, and then reduces the RDD with addition. The important line is numbers.reduce((a,b)->a+b); Spark calls this function until only one integer remains.

PySpark RDD reduce() example for sum of integers

In this example, we will implement the same use case of reducing integers in RDD to their sum, but we shall do that using Python.

</>
Copy
import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Read Text to RDD - Python")
  sc = SparkContext(conf=conf)

  # read input text file to RDD
  numbers = sc.parallelize([1,7,8,9,5,77,48])

  # aggregate RDD elements using addition function
  sum = numbers.reduce(lambda a,b:a+b)

  print "sum is : " + str(sum)

Run the above Spark RDD reduce Python Example using spark-submit

$ spark-submit spark-rdd-reduce-example.py

You will get the following output in console.

17/11/29 11:39:06 INFO DAGScheduler: ResultStage 0 (reduce at /home/arjun/workspace/spark/spark-rdd-reduce-example.py:15) finished in 0.960 s
17/11/29 11:39:06 INFO DAGScheduler: Job 0 finished: reduce at /home/arjun/workspace/spark/spark-rdd-reduce-example.py:15, took 1.552233 s
sum is : 155
17/11/29 11:39:06 INFO SparkContext: Invoking stop() from shutdown hook

The above PySpark example uses the older Python 2 style print statement because the original sample was written for that environment. In Python 3, write the final line as shown below.

</>
Copy
print("sum is : " + str(sum))

Finding minimum and maximum values with RDD reduce()

RDD.reduce() can also be used to find the minimum or maximum element in an RDD. These operations are suitable because the final result does not depend on a fixed element order.

</>
Copy
numbers = sc.parallelize([14, 21, 88, 99, 455])

minimum = numbers.reduce(lambda a, b: a if a < b else b)
maximum = numbers.reduce(lambda a, b: a if a > b else b)

print("Minimum:", minimum)
print("Maximum:", maximum)
Minimum: 14
Maximum: 455

Why subtraction is a poor fit for Spark RDD reduce()

Subtraction is not associative or commutative, so it is not a reliable reducing function for a distributed RDD. The following syntax may run, but the answer can vary depending on how Spark groups values across partitions.

</>
Copy
numbers.reduce(lambda a, b: a - b)

Use reduce() only when the function represents a valid distributed aggregation. For ordered operations, sort or collect only when the dataset is small enough for the driver, or use an API that explicitly preserves the logic you need.

Handling empty RDDs before calling reduce()

RDD.reduce() needs at least one element. If the RDD is empty, reduce() cannot produce a final value because there is no neutral starting value. Check for empty data when there is a chance that filters or input conditions may remove all records.

</>
Copy
numbers = sc.parallelize([])

if numbers.isEmpty():
    total = 0
else:
    total = numbers.reduce(lambda a, b: a + b)

print(total)

When you need a default value for an empty RDD, consider whether fold() or aggregate() is a better choice, because they let you provide a zero value and are often clearer for more complex aggregations.

RDD reduce() compared with fold(), aggregate(), and reduceByKey()

Use the following comparison to choose the right Spark RDD aggregation method.

Spark RDD methodBest use caseKey point
reduce()Combine all elements into one valueNo explicit zero value is supplied
fold()Combine elements with a known neutral valueWorks with a supplied zero value
aggregate()Build a result of a different type or run separate sequence and combine logicUseful for custom aggregations
reduceByKey()Aggregate values per key in a pair RDDReturns one result per key instead of one result for the whole RDD

For key-value RDDs, do not use reduce() if you want a total per key. Use reduceByKey() so Spark can combine values for each key and return a pair RDD with the grouped results.

Common mistakes in Spark RDD reduce() examples

  • Using a non-associative function such as subtraction and expecting a stable result.
  • Calling reduce() on an RDD that may be empty without checking the input first.
  • Using reduce() for per-key aggregation instead of reduceByKey().
  • Changing external variables inside the reducing function. The function should return the combined value, not depend on side effects.
  • Collecting a large RDD to the driver and reducing locally instead of letting Spark reduce the data in parallel.

Official Spark reference for RDD reduce()

For additional reference, see the Apache Spark RDD programming guide. The official guide explains RDD actions and transformations, including how actions return values to the driver program.

Frequently asked questions about Spark RDD reduce()

Is Spark RDD reduce() a transformation or an action?

Spark RDD reduce() is an action. It triggers computation and returns one final value to the driver program.

Can RDD.reduce() return an RDD?

No. RDD.reduce() returns a single value, not another RDD. If you need an RDD after aggregation by key, use methods such as reduceByKey() on a pair RDD.

What happens when reduce() is called on an empty RDD?

reduce() cannot produce a value from an empty RDD. Check for empty input first or use an aggregation method where a zero value is appropriate.

Why should the reduce() function be associative and commutative?

Spark may combine elements within partitions and then combine partial results across partitions. Associative and commutative functions help keep the final result independent of grouping and ordering.

Should I use reduce() or reduceByKey() for key-value RDDs?

Use reduce() when you need one final value for the whole RDD. Use reduceByKey() when you need one reduced value for each key.

Editorial QA checklist for this Spark RDD reduce() tutorial

  • Confirms that RDD.reduce() is an action that returns a single value to the driver.
  • Explains why the reducing function should be associative and commutative for distributed execution.
  • Shows Java and PySpark examples without changing the original sample code blocks.
  • Warns about empty RDDs, non-associative functions, and per-key aggregation mistakes.
  • Mentions related Spark RDD methods such as fold(), aggregate(), and reduceByKey() where they help choose the correct API.

Spark RDD reduce() summary

In this Spark Tutorial, we learned the syntax and examples for RDD.reduce() method. Spark RDD reduce() is useful when all elements of an RDD must be combined into one value, such as a sum, minimum, maximum, or logical result. Use it with a reducing function that is safe for distributed execution, and choose fold(), aggregate(), or reduceByKey() when those APIs better match the aggregation requirement.