Spark RDD foreach action
Spark RDD foreach is an action used to apply a function to every element of an RDD. It is commonly used for side-effect operations such as printing records in local mode, writing each record to an external system, or running custom logic for every item.
In this tutorial, we shall learn how to use the RDD.foreach() method with example Spark applications.
A key point to remember is that foreach() does not create a new RDD. It returns no value to the driver program. If you want to transform each element and keep the result as another RDD, use map() instead of foreach().
Syntax of Spark RDD foreach in Java
public void foreach(scala.Function1<T,scala.runtime.BoxedUnit> f)
The signature above is how the Scala RDD method appears from Java. On a JavaRDD<T>, foreach() accepts an org.apache.spark.api.java.function.VoidFunction<T>, so the argument can be a lambda expression or a method reference that uses this functional interface as its target type.
The foreach() method does not modify the contents of the RDD.
In Java Spark applications, the function passed to foreach() is executed for each element in the RDD. Since this is an action, it triggers Spark job execution if the RDD has pending transformations before it.
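The two ways of passing the argument can be sketched as follows. This is a minimal illustration, assuming a local Spark setup; the class name VoidFunctionSketch and the helper printItem are made up for this example and are not part of Spark.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

class VoidFunctionSketch {
    // Hypothetical static helper used as a method reference target.
    static void printItem(String item) {
        System.out.println("* " + item);
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("VoidFunction sketch")
                .setMaster("local[2]");
        // JavaSparkContext is Closeable, so try-with-resources stops it at the end.
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<String> data = Arrays.asList("Learn", "Apache", "Spark");
            JavaRDD<String> items = sc.parallelize(data, 1);

            // 1. Lambda assigned to an explicitly typed VoidFunction variable.
            VoidFunction<String> printer = item -> System.out.println("* " + item);
            items.foreach(printer);

            // 2. Static method reference. It captures no object, so nothing
            //    extra has to be serialized along with the function.
            items.foreach(VoidFunctionSketch::printItem);
        }
    }
}
```

A static method reference is used here on purpose. A bound reference such as System.out::println captures the PrintStream object, which is not serializable, so Spark is likely to reject it with a "Task not serializable" error.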
Example – Spark RDD foreach to print each element
In this example, we will take an RDD with strings as elements. We shall use RDD.foreach() on this RDD, and for each item in the RDD, we shall print the item.
RDDforEach.java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDforEach {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Spark RDD foreach Example")
                .setMaster("local[2]").set("spark.executor.memory","2g");
        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        // read list to RDD (a single partition keeps the printed order predictable)
        List<String> data = Arrays.asList("Learn","Apache","Spark","with","Tutorial Kart");
        JavaRDD<String> items = sc.parallelize(data,1);
        // apply a function for each element of RDD
        items.foreach(item -> {
            System.out.println("* "+item);
        });
        // stop the spark context and release its resources
        sc.close();
    }
}
Output
* Learn
* Apache
* Spark
* with
* Tutorial Kart
The example uses one partition with sc.parallelize(data, 1), so the printed output appears in the same order as the input list. With multiple partitions, the order of printed records is not guaranteed because Spark may execute partitions in parallel.
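When the order of the printed records matters and the RDD is small, one option is to bring the elements back to the driver with collect() and print them there. The sketch below contrasts the two approaches on an RDD with several partitions. It assumes a local Spark setup, and the class and method names are made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

class OrderedOutputSketch {
    // Hypothetical helper: prints via foreach(), then returns the elements
    // collected to the driver.
    static List<String> collectInOrder(List<String> data, int numPartitions) {
        SparkConf conf = new SparkConf()
                .setAppName("Ordered output sketch")
                .setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> items = sc.parallelize(data, numPartitions);

            // Runs in executor tasks; lines from different partitions may interleave.
            items.foreach(item -> System.out.println("foreach: " + item));

            // collect() returns the elements to the driver in partition order.
            return items.collect();
        }
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("Learn", "Apache", "Spark", "with", "Tutorial Kart");
        // This loop is a plain driver-side loop, so its output follows the list order.
        for (String item : collectInOrder(data, 3)) {
            System.out.println("collected: " + item);
        }
    }
}
```

This pattern suits learning and debugging only, because collect() loads the whole RDD into driver memory.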
RDD foreach runs on executors, not as a driver-side loop
foreach() looks like a normal loop, but it is not the same as iterating over a Java collection in the driver program. Spark sends the function to worker nodes, and the function runs against RDD elements in partitions.
- In local mode, System.out.println() usually appears in the console where you run the application.
- In cluster mode, print statements usually appear in executor logs, not necessarily in the driver console.
- Do not depend on foreach() output order unless your RDD has a single partition and the operation is purely local.
- Avoid updating normal driver-side variables inside foreach(); use Spark actions such as count(), reduce(), or accumulators where appropriate.
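The last point can be sketched with a LongAccumulator. The example counts elements of at least a given length inside foreach() and reads the total back on the driver. It is a minimal sketch assuming a local Spark 2.x or later setup; the class name, method name, and accumulator name are made up for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.LongAccumulator;

class AccumulatorSketch {
    // Hypothetical helper: counts elements with at least minLength characters.
    static long countLongWords(List<String> data, int minLength) {
        SparkConf conf = new SparkConf()
                .setAppName("Accumulator sketch")
                .setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> items = sc.parallelize(data, 2);

            // Created on the driver; tasks add to it and Spark merges the updates.
            LongAccumulator longWords = sc.sc().longAccumulator("longWords");

            items.foreach(item -> {
                if (item.length() >= minLength) {
                    longWords.add(1L);
                }
            });

            // Read the merged value on the driver after the action has finished.
            return longWords.value();
        }
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("Learn", "Apache", "Spark", "with", "Tutorial Kart");
        System.out.println("Words with 5+ characters: " + countLongWords(data, 5));
    }
}
```

For a simple count like this one, items.filter(...).count() gives the same number without an accumulator. The accumulator version is shown only because it mirrors the "update a counter in a loop" habit that does not work with plain driver variables.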
You can refer to the official Apache Spark RDD foreach API for the PySpark API behavior. The same practical idea applies: a function is applied to every RDD element as an action.
Spark RDD foreach versus map for element processing
Use foreach() when you want to perform an action for each element and do not need a returned RDD. Use map() when you want to convert each element into another value and continue processing the transformed RDD.
| Method | Type | Returns | Use case |
|---|---|---|---|
| foreach() | Action | No new RDD | Run side-effect logic for every element |
| map() | Transformation | New RDD | Convert each element and keep the result |
| foreachPartition() | Action | No new RDD | Run side-effect logic once per partition |
| mapPartitions() | Transformation | New RDD | Transform records partition by partition |
The following Java snippet shows the difference between foreach() and map().
// foreach() performs an action and returns void
items.foreach(item -> System.out.println(item));
// map() transforms each element and returns another RDD
JavaRDD<String> upperCaseItems = items.map(item -> item.toUpperCase());
Use foreachPartition when the Spark RDD action needs external connections
If each element has to be written to an external system, opening a connection inside foreach() for every record can be inefficient. In such cases, foreachPartition() is usually a better fit because you can create one connection per partition and reuse it for all records in that partition.
items.foreachPartition(partition -> {
    while (partition.hasNext()) {
        String item = partition.next();
        System.out.println(item);
    }
});
The snippet above still prints records, but the pattern is useful when the per-partition block is used to create and close resources such as database connections or client objects. Make sure any external write operation is safe for retries, because Spark may rerun failed tasks.
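The resource-per-partition pattern might look like the sketch below. RecordWriter is a hypothetical stand-in for a database connection or client object, and it only prints what it receives. The accumulator is there purely to show that one writer is opened per partition. A local Spark setup is assumed.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.LongAccumulator;

class PartitionWriterSketch {
    // Hypothetical stand-in for a real connection or client.
    static class RecordWriter implements AutoCloseable {
        void write(String record) {
            System.out.println("write: " + record);
        }

        @Override
        public void close() {
            System.out.println("writer closed");
        }
    }

    // Writes every record and returns how many writers were opened.
    static long writeAll(List<String> data, int numPartitions) {
        SparkConf conf = new SparkConf()
                .setAppName("foreachPartition sketch")
                .setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> items = sc.parallelize(data, numPartitions);
            LongAccumulator writersOpened = sc.sc().longAccumulator("writersOpened");

            items.foreachPartition(partition -> {
                // The writer is created inside the task, so it never has to be
                // serialized, and try-with-resources closes it for the partition.
                try (RecordWriter writer = new RecordWriter()) {
                    writersOpened.add(1L);
                    while (partition.hasNext()) {
                        writer.write(partition.next());
                    }
                }
            });
            return writersOpened.value();
        }
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("Learn", "Apache", "Spark", "with", "Tutorial Kart");
        System.out.println("Writers opened: " + writeAll(data, 2));
    }
}
```

With five records in two partitions, the sketch opens two writers instead of five. A real writer would replace the print statements with calls to the external system.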
PySpark RDD foreach equivalent
If you are using PySpark, the equivalent API is also named foreach. The function is applied to every element of the RDD.
items = sc.parallelize(["Learn", "Apache", "Spark"])
items.foreach(lambda item: print(item))
For learning and debugging small RDDs, collect() followed by a local print loop can be easier to inspect. Do not use collect() on large RDDs because it brings all data to the driver.
for item in items.collect():
    print(item)
Common mistakes with Spark RDD foreach
- Expecting a returned RDD: foreach() returns no transformed dataset. Use map() when you need a result RDD.
- Expecting ordered output: RDD partitions may run in parallel, so printed records may not appear in input order.
- Looking only at driver logs: In cluster mode, executor-side prints and errors may be in executor logs.
- Changing driver variables: Updates to local variables inside foreach() do not work like updates in a normal local loop.
- Opening resources per row: For external writes, prefer a partition-level pattern with foreachPartition().
FAQs on Spark RDD foreach
What is foreach in Spark RDD?
foreach() is an RDD action that applies a given function to every element of the RDD. It is used when you need side effects such as printing, logging, or writing records, not when you need to create a new RDD.
What is the difference between RDD foreach and RDD map?
foreach() is an action and returns no new RDD. map() is a transformation and returns a new RDD by applying a function to every element.
Why does Spark RDD foreach output not appear in order?
RDD data is split into partitions, and Spark can process partitions in parallel. Because of this, output from foreach() may appear in a different order from the input unless the RDD has a single partition.
When should I use foreachPartition instead of foreach?
Use foreachPartition() when setup work can be shared across records in the same partition, such as creating one external connection per partition instead of one connection per row.
Can Spark RDD foreach update a variable in the driver program?
Do not rely on normal driver-side variable updates inside foreach(). The function runs on worker-side tasks. Use Spark actions, accumulators, or a different design depending on the result you need.
QA checklist for Spark RDD foreach examples
- Confirm that the article explains foreach() as a Spark action, not a transformation.
- Verify that the Java example uses items.foreach(...) only for per-element side-effect logic.
- Check that readers are warned not to expect a new RDD from foreach().
- Confirm that the difference between foreach(), map(), foreachPartition(), and mapPartitions() is clear.
- Check that cluster-mode logging and unordered output behavior are mentioned for practical debugging.
Conclusion
In this Spark Tutorial – RDD foreach, we have learnt to apply a function to each element of an RDD using the RDD.foreach() method. We also covered when to use map() for transformations, when foreachPartition() is more suitable, and why printed output may behave differently in local and cluster execution.
TutorialKart.com