Spark RDD print contents using collect(), take(), and foreach()

An RDD (Resilient Distributed Dataset) is a fault-tolerant collection of elements that can be operated on in parallel.

To print RDD contents in Spark, you can use actions such as collect(), take(), foreach(), or toLocalIterator(). The best choice depends on whether you are printing a small sample for debugging or trying to inspect records from a large distributed dataset.

RDD.collect() returns all elements of the dataset to the driver program as an array or list. You can then use a normal loop to print each element. This is simple, but it should be used only when the RDD is small enough to fit in driver memory.

RDD.foreach(f) runs a function f on each element of the dataset. In local mode, the output often appears in the console. In cluster mode, the print output may appear in executor logs instead of the driver console.

In this tutorial, we will go through Java and Python examples for printing RDD contents using collect() and foreach(), and then cover safer options such as take() for previewing only a few records. For the official behavior of Spark RDD actions, refer to the Apache Spark RDD programming guide.

Quick choice: which Spark RDD print method should you use?

| Method | Use it when | Important note |
| --- | --- | --- |
| rdd.collect() | You know the RDD is small and want to print every element from the driver. | Brings all data to the driver, so it can cause memory issues on large RDDs. |
| rdd.take(n) | You want to preview the first few elements for debugging. | Usually safer than collect() for quick inspection. |
| rdd.foreach(f) | You want to run a function for each element across partitions. | Print output may go to executor logs in cluster mode. |
| rdd.toLocalIterator() | You want to iterate through records on the driver without creating one full list at once. | Still sends data to the driver, so use it carefully. |

RDD.collect() – Print RDD – Java Example

In the following example, a Java program loads an RDD from a text file and prints its contents to the console using RDD.collect().

This method is easy to understand because printing happens in the driver program after the RDD elements are collected. Use this only for small files or sample datasets.

PrintRDD.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PrintRDD {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
		                                .setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// read text files to RDD
		JavaRDD<String> lines = sc.textFile("data/rdd/input/file1.txt");
		
		// collect RDD for printing
		for(String line:lines.collect()){
		    System.out.println("* "+line);
		}
	}
}

file1.txt

Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD

Output

18/02/10 16:31:33 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
18/02/10 16:31:33 INFO DAGScheduler: ResultStage 0 (collect at PrintRDD.java:18) finished in 0.513 s
18/02/10 16:31:33 INFO DAGScheduler: Job 0 finished: collect at PrintRDD.java:18, took 0.726936 s
* Welcome to TutorialKart
* Learn Apache Spark
* Learn to work with RDD
18/02/10 16:31:33 INFO SparkContext: Invoking stop() from shutdown hook
18/02/10 16:31:33 INFO SparkUI: Stopped Spark web UI at http://192.168.1.104:4040
18/02/10 16:31:33 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

The lines prefixed with * are the actual RDD elements printed by the program. The other lines are Spark log messages. In real applications, you may configure Spark logging separately if you want cleaner console output.
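
One way to reduce the log noise shown above is the setLogLevel() method, which both SparkContext (PySpark) and JavaSparkContext provide. The sketch below wraps it in a helper; the name quiet_logs and the VALID_LEVELS set are illustrative assumptions, not part of Spark. It only affects messages logged after the call, so startup lines may still appear.

```python
# Log level names accepted by SparkContext.setLogLevel().
VALID_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}


def quiet_logs(sc, level="WARN"):
    """Lower Spark's log verbosity for an existing SparkContext.

    quiet_logs is a hypothetical helper name; sc.setLogLevel() is the
    real SparkContext method that does the work.
    """
    level = level.upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"Unsupported log level: {level}")
    sc.setLogLevel(level)
    return level
```

Calling quiet_logs(sc) right after the context is created should hide most INFO lines, which makes the printed RDD elements easier to spot.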

RDD.collect() – Print RDD – Python Example

In the following example, a Python program loads an RDD from a text file and prints its contents to the console using RDD.collect().

In PySpark, collect() returns the RDD elements as a Python list in the driver process. You can then print each element with a standard Python for loop.

print-rdd.py

import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Print Contents of RDD - Python")
  sc = SparkContext(conf=conf)

  # read input text file to RDD
  rdd = sc.textFile("data/rdd/input/file1.txt")

  # collect the RDD to a list
  list_elements = rdd.collect()

  # print the list
  for element in list_elements:
    print(element)

Run this Python program from a terminal or command prompt as shown below.

$ spark-submit print-rdd.py

Output

18/02/10 16:37:05 INFO DAGScheduler: ResultStage 0 (collect at /home/arjun/workspace/spark/readToRDD/print-rdd.py:15) finished in 0.378 s
18/02/10 16:37:05 INFO DAGScheduler: Job 0 finished: collect at /home/arjun/workspace/spark/readToRDD/print-rdd.py:15, took 0.546189 s
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
18/02/10 16:37:05 INFO SparkContext: Invoking stop() from shutdown hook

If the input file is large, replace collect() with take(n) during debugging. This avoids pulling the complete RDD to the driver.

for element in rdd.take(5):
    print(element)

Print only the first few RDD elements with take() in PySpark

For day-to-day debugging, take() is often safer than collect(). It returns only the first n elements from the RDD to the driver, so you can inspect a sample without loading the complete dataset.

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Preview RDD Elements")
sc = SparkContext(conf=conf)

rdd = sc.textFile("data/rdd/input/file1.txt")

for element in rdd.take(3):
    print(element)

The same idea can be used in Java by calling take(n) on the JavaRDD.

for (String line : lines.take(3)) {
    System.out.println(line);
}

take() is intended for previewing a small number of records. It is not a replacement for saving large results to files or tables.
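
When previewing, it can help to know whether the sample is the whole RDD or only its beginning. A minimal sketch, assuming a helper named print_preview (a hypothetical name): it asks take() for one extra element and prints a notice if that extra element exists.

```python
def print_preview(rdd, n=5):
    """Print up to n elements on the driver and report whether more exist.

    print_preview is a hypothetical helper; it relies only on rdd.take().
    """
    # Ask for one extra element so we can tell whether the RDD has more than n.
    rows = rdd.take(n + 1)
    shown = rows[:n]
    for index, row in enumerate(shown, start=1):
        print(f"{index}: {row}")
    if len(rows) > n:
        print(f"... more elements not shown (limit {n})")
    return shown
```

With the rdd from the earlier PySpark example, print_preview(rdd, 3) would print at most three numbered lines from the driver. The extra element costs little, because take() stops scanning partitions once it has enough rows.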

RDD.foreach() – Print RDD – Java Example

In the following example, a Java program loads an RDD from a text file and prints its contents using RDD.foreach().

Unlike collect(), foreach() executes the function on the distributed RDD partitions. When the program runs on a cluster, the printed lines may be available in executor logs rather than directly in the driver terminal.

PrintRDD.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class PrintRDD {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
		                                .setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// read text files to RDD
		JavaRDD<String> lines = sc.textFile("data/rdd/input/file1.txt");
		
		// print each element; this function runs on the executors
		lines.foreach(new VoidFunction<String>() {
			@Override
			public void call(String line) {
				System.out.println("* " + line);
			}
		});
	}
}

Use foreach() when the function you want to run belongs naturally on each RDD element. For simple debugging output from the driver, take() or collect() is usually easier to read.

RDD.foreach() – Print RDD – Python Example

In the following example, a Python program loads an RDD from a text file and prints its contents using RDD.foreach().

print-rdd.py

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Print Contents of RDD - Python")
  sc = SparkContext(conf=conf)

  # read input text file to RDD
  rdd = sc.textFile("data/rdd/input/file1.txt")

  # f runs on the executors, so its output goes to executor stdout
  def f(x):
    print(x)

  # apply f(x) for each element of rdd
  rdd.foreach(f)

When you run this in local mode, you may see the printed elements in the terminal. On a Spark cluster, check executor logs if the output does not appear in the driver console.

Print RDD contents from the driver with toLocalIterator()

toLocalIterator() returns an iterator over the RDD elements to the driver. It can be useful when you want driver-side iteration without building one full list with collect(). However, the data is still transferred to the driver, so it should not be used casually on very large RDDs.

for element in rdd.toLocalIterator():
    print(element)

If you need only a small preview, prefer take(). If you need to inspect or preserve a large result, write the RDD to storage instead of printing it.
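
When toLocalIterator() is the right tool, one benefit is that the driver can stop reading early. The iterator fetches data one partition at a time, so the driver needs roughly as much memory as the largest partition. The sketch below is an illustration under assumptions: print_until, should_stop, and max_rows are hypothetical names, and only toLocalIterator() is a real RDD method.

```python
def print_until(rdd, should_stop, max_rows=1000):
    """Print elements from the driver until should_stop(element) is true.

    print_until and should_stop are hypothetical names; toLocalIterator()
    is the real RDD method. Returns the number of elements printed.
    """
    printed = 0
    for element in rdd.toLocalIterator():
        # Stop at the row cap or at the first element matching the condition.
        if printed >= max_rows or should_stop(element):
            break
        print(element)
        printed += 1
    return printed
```

For example, print_until(rdd, lambda line: line.startswith("Learn")) would print lines until the first one that starts with "Learn". The max_rows cap is a safety net so that a condition that never matches does not print the entire RDD.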

Why RDD.foreach() output may not appear in the Spark driver console

A common confusion is that rdd.foreach(print) does not always print where you expect. Spark runs the function passed to foreach() inside executor processes, so anything it prints goes to each executor's standard output. On a cluster, that output ends up in executor logs rather than the driver console. This is normal behavior in distributed mode.

For predictable driver-console output while debugging, collect or take a small amount of data first, then print it in the driver program.

sample = rdd.take(10)
for element in sample:
    print(element)
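
If you also want to see how records are spread across partitions while still printing from the driver, one option is to tag each element with its partition index before sampling. This is a sketch, not the only approach: tag_with_partition and print_with_partitions are hypothetical names, while mapPartitionsWithIndex() and take() are existing RDD methods.

```python
def tag_with_partition(index, elements):
    """Pair every element with the index of the partition that holds it."""
    for element in elements:
        yield (index, element)


def print_with_partitions(rdd, n=10):
    """Print up to n (partition index, element) pairs from the driver.

    Both function names are hypothetical; mapPartitionsWithIndex() and
    take() are the real RDD methods used here.
    """
    tagged = rdd.mapPartitionsWithIndex(tag_with_partition).take(n)
    for index, element in tagged:
        print(f"partition {index}: {element}")
    return tagged
```

The tagging runs on the executors, but the print() calls run in the driver, so the output appears in the driver console in both local and cluster mode.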

Safe practices for printing Spark RDD contents

  • Use take(n) for debugging: preview a small number of elements instead of printing the complete RDD.
  • Avoid collect() on large RDDs: it brings all elements to the driver and can cause driver memory errors.
  • Expect foreach() output in executor logs: this is especially common when running on a cluster.
  • Do not depend on printed order unless you sort first: distributed partitions may not print in the order you expect.
  • Write large output to storage: use actions such as saveAsTextFile() instead of printing thousands or millions of records.
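
The practices above can be combined into one small helper. The sketch below is an assumption-laden illustration: show_or_save, output_dir, and max_print are hypothetical names, and the threshold of 100 is arbitrary. It prints a small RDD from the driver and otherwise writes the RDD to storage with saveAsTextFile().

```python
def show_or_save(rdd, output_dir, max_print=100):
    """Print a small RDD from the driver, or save a large one to storage.

    show_or_save, output_dir, and max_print are hypothetical names;
    take() and saveAsTextFile() are the real RDD methods used.
    Returns "printed" or "saved" so callers can tell what happened.
    """
    # Fetch one element more than the limit to detect a larger RDD cheaply.
    rows = rdd.take(max_print + 1)
    if len(rows) <= max_print:
        for row in rows:
            print(row)
        return "printed"
    # saveAsTextFile() raises an error if output_dir already exists.
    rdd.saveAsTextFile(output_dir)
    return "saved"
```

Using take(max_print + 1) instead of count() avoids scanning the whole RDD just to decide whether it is small enough to print.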

FAQ on printing Spark RDD contents

How do I print all elements of an RDD in Spark?

For a small RDD, call rdd.collect() and print each element in a loop. Do not use this approach for large RDDs because all records are copied to the driver program.

How do I print only the first few elements of a PySpark RDD?

Use rdd.take(n). For example, rdd.take(5) returns the first five elements to the driver, and you can print them with a Python for loop.

Why does rdd.foreach(print) not show output in my console?

foreach() runs on Spark executors. In cluster mode, the print output may be written to executor logs instead of the driver console. Use take() or collect() on a small sample if you want output in the driver terminal.

Is collect() safe for printing RDD contents?

collect() is safe only for small RDDs. It returns the complete dataset to the driver, so it can fail or slow down the application when the RDD is large.

Can I print RDD elements in a fixed order?

collect() and take() return elements in partition order, but foreach() prints in whatever order the tasks happen to run, so its output order can change between runs. If order matters, apply a sort such as sortBy() before printing a small sample. Remember that sorting can be expensive on large datasets.
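
For a small sorted sample, the takeOrdered() action can stand in for a full sort followed by take(). A minimal sketch, assuming a helper named print_smallest (a hypothetical name):

```python
def print_smallest(rdd, n=5, key=None):
    """Print the n smallest elements in sorted order from the driver.

    print_smallest is a hypothetical helper; takeOrdered() is the real
    RDD method and accepts an optional key function.
    """
    rows = rdd.takeOrdered(n, key=key)
    for row in rows:
        print(row)
    return rows
```

For example, print_smallest(rdd, 3, key=len) would print the three shortest lines. Only n elements are returned to the driver, so keep n small.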

Conclusion: printing RDD contents without overloading the Spark driver

In this Spark tutorial, we learned to print the elements of an RDD using the collect(), take(), foreach(), and toLocalIterator() actions, with Java and Python examples. For small RDDs, collect() followed by a loop is simple. For debugging larger RDDs, prefer take(). Use foreach() when you understand that the print function runs on executors and the output may be found in executor logs.