Spark RDD foreach

Spark RDD foreach action

Spark RDD foreach is an action used to apply a function to every element of an RDD. It is commonly used for side-effect operations such as printing records in local mode, writing each record to an external system, or running custom logic for every item.

In this tutorial, we shall learn how to use the RDD.foreach() method with example Spark applications.

A key point to remember is that foreach() does not create a new RDD. It returns no value to the driver program. If you want to transform each element and keep the result as another RDD, use map() instead of foreach().

Syntax of Spark RDD foreach in Java

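public void foreach(VoidFunction<T> f)

This is the signature declared for JavaRDD. The argument f is an org.apache.spark.api.java.function.VoidFunction<T>, a functional interface, so you can pass a lambda expression or a method reference. The underlying Scala method RDD.foreach takes a scala.Function1<T, scala.runtime.BoxedUnit> instead, which is the form shown in the Javadoc of the Scala RDD class.

The foreach() method does not modify the contents of the RDD.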

In Java Spark applications, the function passed to foreach() is executed for each element in the RDD. Since this is an action, it triggers Spark job execution if the RDD has pending transformations before it.

Example – Spark RDD foreach to print each element

In this example, we will take an RDD with strings as elements. We shall use RDD.foreach() on this RDD, and for each item in the RDD, we shall print the item.

RDDforEach.java

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDforEach {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Spark RDD foreach Example")
				.setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);

		// read list to RDD
		List<String> data = Arrays.asList("Learn","Apache","Spark","with","Tutorial Kart"); 
		JavaRDD<String> items = sc.parallelize(data,1);

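		// apply a function to each element of the RDD (the function runs inside Spark tasks)
		items.foreach(item -> {
			System.out.println("* "+item);
		});

		// stop the Spark context and release its resources
		sc.close();
	}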
}

Output

* Learn
* Apache
* Spark
* with
* Tutorial Kart

The example uses one partition with sc.parallelize(data, 1), so the printed output appears in the same order as the input list. With multiple partitions, the order of printed records is not guaranteed because Spark may execute partitions in parallel.
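The effect of partitioning on output order can be imitated without a cluster. The following is a plain-Java simulation, not Spark code: the class PartitionOrderDemo, the method processInParallel, and the two-thread pool standing in for local[2] are all assumptions made for illustration. Each inner list plays the role of one partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PartitionOrderDemo {

    // Processes every partition as its own task and records the order
    // in which elements were seen across all tasks.
    static List<String> processInParallel(List<List<String>> partitions) {
        List<String> seen = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(2); // assumption: mimics local[2]
        for (List<String> partition : partitions) {
            pool.submit(() -> {
                // Inside one partition, elements are visited in order.
                for (String item : partition) {
                    seen.add(item);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("partitions did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("Learn", "Apache"),
                Arrays.asList("Spark", "with", "Tutorial Kart"));
        // The interleaving of the two partitions can differ from run to run.
        System.out.println(processInParallel(partitions));
    }
}
```

In this sketch every element is always processed and the order inside a partition is kept, but elements from different partitions may interleave differently on each run. That is the same reason multi-partition foreach() output should not be treated as ordered.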

RDD foreach runs on executors, not as a driver-side loop

foreach() looks like a normal loop, but it is not the same as iterating over a Java collection in the driver program. Spark sends the function to worker nodes, and the function runs against RDD elements in partitions.

In local mode, System.out.println() usually appears in the console where you run the application.
In cluster mode, print statements usually appear in executor logs, not necessarily in the driver console.
Do not depend on foreach() output order unless your RDD has a single partition and the operation is purely local.
Avoid updating normal driver-side variables inside foreach(); use Spark actions such as count() or reduce(), or use accumulators, where appropriate.
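The last point is easier to see with a small experiment. Spark serializes the function passed to foreach() and runs copies of it inside tasks, so changes made by a copy do not flow back to the original object in the driver. The sketch below imitates that with plain Java serialization. It is not Spark code, and the names ClosureCopyDemo, CountingTask, and shipCopy are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

class ClosureCopyDemo {

    // Hypothetical per-element task that counts the elements it has seen.
    static class CountingTask implements Serializable {
        private static final long serialVersionUID = 1L;
        int seen = 0;

        void call(String item) {
            seen++;
        }
    }

    // Round-trips the task through serialization, imitating how a function
    // is shipped from the driver to a task running elsewhere.
    static CountingTask shipCopy(CountingTask task) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(task);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CountingTask) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not copy the task", e);
        }
    }

    // Returns {count held by the driver-side original, count held by the shipped copy}.
    static int[] run(List<String> data) {
        CountingTask driverSide = new CountingTask();
        CountingTask workerSide = shipCopy(driverSide);
        for (String item : data) {
            workerSide.call(item); // only the copy is updated
        }
        return new int[] { driverSide.seen, workerSide.seen };
    }

    public static void main(String[] args) {
        int[] counts = run(Arrays.asList("Learn", "Apache", "Spark"));
        System.out.println("driver-side count: " + counts[0]); // prints 0
        System.out.println("worker-side count: " + counts[1]); // prints 3
    }
}
```

The driver-side object still reports 0 after the copy has counted three elements. In a real Spark job the result would be brought back through an action such as count() or through an accumulator, as noted above.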

For the PySpark behavior, refer to the official Apache Spark API documentation for RDD.foreach. The same practical idea applies: foreach is an action that applies a function to every element of the RDD.

Spark RDD foreach versus map for element processing

Use foreach() when you want to perform an action for each element and do not need a returned RDD. Use map() when you want to convert each element into another value and continue processing the transformed RDD.

Method	Type	Returns	Use case
`foreach()`	Action	No new RDD	Run side-effect logic for every element
`map()`	Transformation	New RDD	Convert each element and keep the result
`foreachPartition()`	Action	No new RDD	Run side-effect logic once per partition
`mapPartitions()`	Transformation	New RDD	Transform records partition by partition

The following Java snippet shows the difference between foreach() and map().

// foreach() performs an action and returns void
items.foreach(item -> System.out.println(item));

// map() transforms each element and returns another RDD
JavaRDD<String> upperCaseItems = items.map(item -> item.toUpperCase());

Use foreachPartition when the Spark RDD action needs external connections

If each element has to be written to an external system, opening a connection inside foreach() for every record can be inefficient. In such cases, foreachPartition() is usually a better fit because you can create one connection per partition and reuse it for all records in that partition.

items.foreachPartition(partition -> {
    while (partition.hasNext()) {
        String item = partition.next();
        System.out.println(item);
    }
});

The snippet above still prints records, but the pattern is useful when the per-partition block is used to create and close resources such as database connections or client objects. Make sure any external write operation is safe for retries, because Spark may rerun failed tasks.
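The resource-per-partition pattern can be sketched in plain Java, outside Spark. FakeConnection is a hypothetical stand-in for a database or client object, and PartitionConnectionDemo, writePartition, and writeAll are names invented for this illustration. writePartition receives an Iterator<String>, the same shape of argument that the foreachPartition() lambda receives in the snippet above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class PartitionConnectionDemo {

    // Hypothetical connection that counts how many instances were opened.
    static class FakeConnection implements AutoCloseable {
        static int opened = 0;
        final List<String> written = new ArrayList<>();

        FakeConnection() {
            opened++;
        }

        void write(String record) {
            written.add(record);
        }

        @Override
        public void close() {
            // A real connection would release its resources here.
        }
    }

    // Opens one connection, reuses it for every record of the partition,
    // and closes it when the partition is exhausted. Returns records written.
    static int writePartition(Iterator<String> partition) {
        int count = 0;
        try (FakeConnection conn = new FakeConnection()) {
            while (partition.hasNext()) {
                conn.write(partition.next());
                count++;
            }
        }
        return count;
    }

    // Returns {connections opened, records written} for the given partitions.
    static int[] writeAll(List<List<String>> partitions) {
        FakeConnection.opened = 0;
        int records = 0;
        for (List<String> partition : partitions) {
            records += writePartition(partition.iterator());
        }
        return new int[] { FakeConnection.opened, records };
    }

    public static void main(String[] args) {
        int[] result = writeAll(Arrays.asList(
                Arrays.asList("Learn", "Apache"),
                Arrays.asList("Spark", "with", "Tutorial Kart")));
        // prints "connections opened: 2, records written: 5"
        System.out.println("connections opened: " + result[0]
                + ", records written: " + result[1]);
    }
}
```

Five records are written with two connections, one per partition, where a per-record approach would have opened five. The try-with-resources block also closes the connection if a write throws, which matters when Spark reruns a failed task.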

PySpark RDD foreach equivalent

If you are using PySpark, the equivalent API is also named foreach. The function is applied to every element of the RDD.

items = sc.parallelize(["Learn", "Apache", "Spark"])
items.foreach(lambda item: print(item))

For learning and debugging small RDDs, collect() followed by a local print loop can be easier to inspect. Do not use collect() on large RDDs because it brings all data to the driver.

for item in items.collect():
    print(item)

Common mistakes with Spark RDD foreach

Expecting a returned RDD: foreach() returns no transformed dataset. Use map() when you need a result RDD.
Expecting ordered output: RDD partitions may run in parallel, so printed records may not appear in input order.
Looking only at driver logs: In cluster mode, executor-side prints and errors may be in executor logs.
Changing driver variables: Updates to local variables inside foreach() do not work like updates in a normal local loop.
Opening resources per row: For external writes, prefer a partition-level pattern with foreachPartition().

FAQs on Spark RDD foreach

What is foreach in Spark RDD?

foreach() is an RDD action that applies a given function to every element of the RDD. It is used when you need side effects such as printing, logging, or writing records, not when you need to create a new RDD.

What is the difference between RDD foreach and RDD map?

foreach() is an action and returns no new RDD. map() is a transformation and returns a new RDD by applying a function to every element.

Why does Spark RDD foreach output not appear in order?

RDD data is split into partitions, and Spark can process partitions in parallel. Because of this, output from foreach() may appear in a different order from the input unless all the data sits in a single partition.

When should I use foreachPartition instead of foreach?

Use foreachPartition() when setup work can be shared across records in the same partition, such as creating one external connection per partition instead of one connection per row.

Can Spark RDD foreach update a variable in the driver program?

Do not rely on normal driver-side variable updates inside foreach(). The function runs on worker-side tasks. Use Spark actions, accumulators, or a different design depending on the result you need.

QA checklist for Spark RDD foreach examples

Confirm that the article explains foreach() as a Spark action, not a transformation.
Verify that the Java example uses items.foreach(...) only for per-element side-effect logic.
Check that readers are warned not to expect a new RDD from foreach().
Confirm that the difference between foreach(), map(), foreachPartition(), and mapPartitions() is clear.
Check that cluster-mode logging and unordered output behavior are mentioned for practical debugging.

Conclusion

In this Spark Tutorial – RDD foreach, we have learnt how to apply a function to each element of an RDD using the RDD.foreach() method. We also covered when to use map() for transformations, when foreachPartition() is more suitable, and why printed output may behave differently in local and cluster execution.

TutorialKart.com
