Spark Create RDD Examples in Java

In Apache Spark, an RDD, or Resilient Distributed Dataset, is a distributed collection of elements that can be processed in parallel across a cluster. You can create an RDD by parallelizing an existing collection in the driver program, by loading data from an external storage system such as a text file, or by converting another Spark abstraction such as a DataFrame into an RDD.

This tutorial shows how to create Spark RDDs in Java using three practical inputs:

  1. Create RDD from List<T> using JavaSparkContext.parallelize().
  2. Create RDD from a text file using JavaSparkContext.textFile().
  3. Create RDD from a JSON file by reading JSON as a DataFrame and converting it to JavaRDD<Row>.

The examples use local mode so that you can run them on a development machine. The same RDD creation methods work in a Spark cluster when the input paths and Spark configuration are adjusted for that environment.
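As a minimal sketch of how the master can be left adjustable instead of hard-coded, the configuration below falls back to local[2] only when spark.master has not already been set elsewhere, for example by spark-submit. The class name MasterConfigSketch and the method buildConf() are hypothetical names used only for illustration.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class MasterConfigSketch {

	// Builds a configuration that uses local mode only when no master was supplied.
	public static SparkConf buildConf() {
		return new SparkConf()
				.setAppName("Master Config Sketch")
				// keeps a master that was set externally; otherwise defaults to two local threads
				.setIfMissing("spark.master", "local[2]");
	}

	public static void main(String[] args) {
		SparkConf conf = buildConf();
		try (JavaSparkContext sc = new JavaSparkContext(conf)) {
			System.out.println("Running with master: " + sc.master());
		}
	}
}
```

With this approach the same jar can run on a development machine and on a cluster without a code change, as long as the input paths are also valid in both environments.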

Before Creating Spark RDDs in Java

For the first two examples, the program uses JavaSparkContext. For the JSON example, the program uses SparkSession because Spark reads JSON as a structured DataFrame before converting it to an RDD. In all cases, creating the RDD does not compute it. Spark evaluates RDD transformations lazily and runs the work only when an action such as foreach() or collect() is called.

The examples below use these common imports and Spark classes:

  • SparkConf to set the application name, master, and other Spark properties.
  • JavaSparkContext to create Java RDDs from local collections and files.
  • JavaRDD<T> to represent the distributed dataset.
  • SparkSession to read JSON data through Spark SQL.
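The lazy evaluation described above can be observed with a small experiment. The sketch below, which assumes a local run without task retries, counts how often a map() function runs by using a LongAccumulator, and reads the counter before and after an action. The class and method names are hypothetical.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.LongAccumulator;

public class LazyEvaluationSketch {

	// Returns {map calls counted before the action, map calls counted after the action}.
	public static long[] countMapCalls() {
		SparkConf conf = new SparkConf().setAppName("Lazy Evaluation Sketch").setMaster("local[2]");
		try (JavaSparkContext sc = new JavaSparkContext(conf)) {
			// accumulator that records how many times the map function ran
			LongAccumulator calls = sc.sc().longAccumulator("mapCalls");

			// map() is a transformation: it only describes the work to be done
			JavaRDD<Integer> doubled = sc.parallelize(Arrays.asList(1, 2, 3), 1).map(x -> {
				calls.add(1L);
				return x * 2;
			});
			long before = calls.value();

			// count() is an action: it triggers the actual computation
			doubled.count();
			long after = calls.value();

			return new long[] { before, after };
		}
	}

	public static void main(String[] args) {
		long[] result = countMapCalls();
		System.out.println("map calls before action: " + result[0]);
		System.out.println("map calls after action: " + result[1]);
	}
}
```

In this local run the counter stays at zero until count() is called. Accumulator updates made inside transformations can be applied more than once if a task is re-executed, so this pattern is suited to demonstrations rather than exact bookkeeping.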

Create Spark RDD from List<T> using parallelize()

Use parallelize() when you already have a small local collection in the driver program and want to distribute it as an RDD. This method is useful for examples, tests, lookup data, and small in-memory inputs. It is not meant for large production datasets, because the whole collection must first fit in driver memory.

In the following Java program, a List<String> is converted into a JavaRDD<String>. The second argument in sc.parallelize(data, 1) sets the number of partitions to one for this simple example.

RDDfromList.java

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDfromList {

	public static void main(String[] args) {
		// configure Spark to run locally with two threads
		SparkConf sparkConf = new SparkConf().setAppName("Create RDD from List")
				.setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);

		// distribute the local list as an RDD with one partition
		List<String> data = Arrays.asList("Learn","Apache","Spark","with","Tutorial Kart");
		JavaRDD<String> items = sc.parallelize(data,1);

		// apply a function for each element of RDD
		items.foreach(item -> {
			System.out.println("* "+item);
		});

		// release Spark resources
		sc.close();
	}
}

A typical output from the local run is shown below. In a real cluster, output from foreach() is written from executor processes, so it may appear in executor logs instead of the driver console.

* Learn
* Apache
* Spark
* with
* Tutorial Kart
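The second argument to parallelize() can be checked with getNumPartitions(). The following sketch, using the hypothetical class PartitionCountSketch, creates the same list with a requested partition count and reports how many partitions the RDD has. The main method also uses glom() to show how the elements are grouped per partition.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionCountSketch {

	private static final List<String> DATA =
			Arrays.asList("Learn", "Apache", "Spark", "with", "Tutorial Kart");

	// Creates an RDD with the requested number of partitions and returns the actual count.
	public static int partitionsFor(int requested) {
		SparkConf conf = new SparkConf().setAppName("Partition Count Sketch").setMaster("local[2]");
		try (JavaSparkContext sc = new JavaSparkContext(conf)) {
			JavaRDD<String> items = sc.parallelize(DATA, requested);
			return items.getNumPartitions();
		}
	}

	public static void main(String[] args) {
		System.out.println("Requested 1, got " + partitionsFor(1));
		System.out.println("Requested 3, got " + partitionsFor(3));

		SparkConf conf = new SparkConf().setAppName("Partition Layout Sketch").setMaster("local[2]");
		try (JavaSparkContext sc = new JavaSparkContext(conf)) {
			// glom() turns each partition into a list so the grouping becomes visible
			List<List<String>> layout = sc.parallelize(DATA, 2).glom().collect();
			System.out.println("Layout with 2 partitions: " + layout);
		}
	}
}
```

One partition keeps the tutorial output in a single task. More partitions allow more parallelism, but the print order of foreach() is then no longer predictable.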

Create Spark RDD from Text File using textFile()

Use textFile() when the input data is stored as plain text. Spark reads the file and creates an RDD where each element is one line from the file. The path can point to a local file while testing, or to a distributed storage location such as HDFS, S3-compatible storage, or another supported file system in a cluster setup.

For this example, assume that data/rdd/input/sample.txt contains a few lines of text.

Learn Apache Spark
Create RDD from text file
Run Java examples

ReadTextToRDD.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadTextToRDD {

	public static void main(String[] args) {
		// configure Spark to run locally with two threads
		SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
				.setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);

		// provide path to input text file
		String path = "data/rdd/input/sample.txt";

		// read text file to RDD, one element per line
		JavaRDD<String> lines = sc.textFile(path);

		// collect RDD to the driver for printing (small files only)
		for (String line : lines.collect()) {
			System.out.println(line);
		}

		// release Spark resources
		sc.close();
	}
}

The collect() action brings all RDD elements back to the driver. It is convenient for a small tutorial file. For large files, prefer take() to inspect a few records, count() to get the number of lines, or saveAsTextFile() to write results from the executors instead of collecting them. For the sample file above, the program prints:

Learn Apache Spark
Create RDD from text file
Run Java examples
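As a hedged illustration of the alternatives to collect(), the sketch below reads a text file and uses count() and take() so that only a number, or a few lines, reach the driver. To stay self-contained it writes a temporary file with the same three lines as the sample. The class name, helper methods, and temporary file are assumptions made for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class TextFileActionsSketch {

	// Writes a small temporary input file (hypothetical stand-in for sample.txt).
	public static String writeSampleFile() {
		try {
			Path file = Files.createTempFile("sample", ".txt");
			Files.write(file, Arrays.asList(
					"Learn Apache Spark", "Create RDD from text file", "Run Java examples"));
			return file.toUri().toString();
		} catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}

	private static JavaSparkContext newContext() {
		return new JavaSparkContext(
				new SparkConf().setAppName("Text File Actions Sketch").setMaster("local[2]"));
	}

	// count() returns only the number of lines to the driver.
	public static long countLines(String path) {
		try (JavaSparkContext sc = newContext()) {
			return sc.textFile(path).count();
		}
	}

	// take(n) returns at most n lines to the driver instead of the whole file.
	public static List<String> firstLines(String path, int n) {
		try (JavaSparkContext sc = newContext()) {
			return sc.textFile(path).take(n);
		}
	}

	public static void main(String[] args) {
		String path = writeSampleFile();
		System.out.println("Line count: " + countLines(path));
		System.out.println("First two lines: " + firstLines(path, 2));
	}
}
```

Each helper opens its own context only to keep the sketch short. A real program would normally create one JavaSparkContext and reuse it for all actions.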

Create Spark RDD from JSON File using SparkSession

JSON is structured data, so Spark usually reads it into a DataFrame first. You can then convert the DataFrame to JavaRDD<Row> with toJavaRDD(). This approach is useful when you want to use Spark SQL for schema inference or structured processing and still need an RDD for low-level operations.

Assume the JSON file is available at data/employees.json. By default, spark.read().json() expects newline-delimited JSON, with one complete JSON object per line, as shown below.

{"id": 1, "name": "Ravi", "department": "Engineering"}
{"id": 2, "name": "Meera", "department": "Finance"}

JSONtoRDD.java

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JSONtoRDD {

	public static void main(String[] args) {
		// create a Spark session that runs locally with two threads
		SparkSession spark = SparkSession
				.builder()
				.appName("Spark Example - Read JSON to RDD")
				.master("local[2]")
				.getOrCreate();

		// read the JSON file as a DataFrame, then convert it to an RDD of Row objects
		String jsonPath = "data/employees.json";
		JavaRDD<Row> items = spark.read().json(jsonPath).toJavaRDD();

		// print each Row
		items.foreach(item -> {
			System.out.println(item);
		});

		// release Spark resources
		spark.stop();
	}
}

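The RDD elements are Row objects, not raw JSON strings. You can access fields from each row by column name or position, depending on how you want to process the data. Spark infers the schema from the JSON and orders the inferred columns alphabetically, which is why each printed row lists department, id, and name in that order.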

[Engineering,1,Ravi]
[Finance,2,Meera]
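Reading fields from a Row by name or by position might look like the following sketch. It assumes the same two employee records, supplied here as in-memory JSON strings instead of data/employees.json so that the example runs on its own. The class name RowFieldAccessSketch and the method describeEmployees() are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RowFieldAccessSketch {

	// Builds one description string per employee by reading individual Row fields.
	public static List<String> describeEmployees() {
		SparkSession spark = SparkSession.builder()
				.appName("Row Field Access Sketch")
				.master("local[2]")
				.getOrCreate();
		try {
			// in-memory JSON lines stand in for data/employees.json (assumption for this sketch)
			Dataset<String> json = spark.createDataset(Arrays.asList(
					"{\"id\": 1, \"name\": \"Ravi\", \"department\": \"Engineering\"}",
					"{\"id\": 2, \"name\": \"Meera\", \"department\": \"Finance\"}"),
					Encoders.STRING());
			JavaRDD<Row> rows = spark.read().json(json).toJavaRDD();

			return rows.map(row -> {
				// access by column name
				String name = row.<String>getAs("name");
				// JSON integers are inferred as long values
				long id = row.getLong(row.fieldIndex("id"));
				// access by position: department is column 0 in the inferred schema
				String department = row.getString(0);
				return id + ": " + name + " (" + department + ")";
			}).collect();
		} finally {
			spark.stop();
		}
	}

	public static void main(String[] args) {
		describeEmployees().forEach(System.out::println);
	}
}
```

Access by name is usually the safer choice, because positions depend on the inferred column order and change when fields are added to the JSON.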

Choosing the Right Spark RDD Creation Method

| Input source | Recommended method | RDD element type | Best used for |
| --- | --- | --- | --- |
| Java List or collection | sc.parallelize(data) | Type of list item, such as String | Small local data, examples, tests, and lookup values |
| Plain text file | sc.textFile(path) | String, one element per line | Line-based logs, CSV-like text, and simple file input |
| JSON file | spark.read().json(path).toJavaRDD() | Row | Structured JSON data that may also need Spark SQL processing |

If your work is mainly structured data processing, a DataFrame or Dataset is usually easier to use than an RDD. Use an RDD when you need lower-level control over distributed elements, custom transformations, or APIs that specifically require RDDs. You can also refer to the official Apache Spark RDD programming guide for the core RDD creation patterns.
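To make the comparison concrete, the sketch below selects the names of Engineering employees twice: once with DataFrame operations and once with RDD functions over Row objects. It is an illustration under assumptions, not a recommended design. The class name, method names, and in-memory JSON input are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameVersusRddSketch {

	// DataFrame route: the filter is a SQL expression evaluated by Spark SQL
	public static List<String> namesViaDataFrame(Dataset<Row> employees) {
		return employees.filter("department = 'Engineering'")
				.select("name")
				.as(Encoders.STRING())
				.collectAsList();
	}

	// RDD route: the same logic written as plain Java functions over Row objects
	public static List<String> namesViaRdd(Dataset<Row> employees) {
		return employees.toJavaRDD()
				.filter(row -> "Engineering".equals(row.<String>getAs("department")))
				.map(row -> row.<String>getAs("name"))
				.collect();
	}

	// Runs both routes on the same input and returns the two result lists.
	public static List<List<String>> run() {
		SparkSession spark = SparkSession.builder()
				.appName("DataFrame Versus RDD Sketch")
				.master("local[2]")
				.getOrCreate();
		try {
			// in-memory JSON lines used instead of a file (assumption for this sketch)
			Dataset<String> json = spark.createDataset(Arrays.asList(
					"{\"id\": 1, \"name\": \"Ravi\", \"department\": \"Engineering\"}",
					"{\"id\": 2, \"name\": \"Meera\", \"department\": \"Finance\"}"),
					Encoders.STRING());
			Dataset<Row> employees = spark.read().json(json);
			return Arrays.asList(namesViaDataFrame(employees), namesViaRdd(employees));
		} finally {
			spark.stop();
		}
	}

	public static void main(String[] args) {
		List<List<String>> results = run();
		System.out.println("DataFrame result: " + results.get(0));
		System.out.println("RDD result: " + results.get(1));
	}
}
```

Both routes return the same names. The DataFrame version states what to select, while the RDD version spells out how each Row is handled.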

Common Spark RDD Creation Errors and Fixes

| Problem | Likely reason | Fix |
| --- | --- | --- |
| File path not found | The path is relative to the working directory or unavailable to executors. | Use an absolute path while testing, or a shared storage path in a cluster. |
| Unexpected order in printed output | RDD partitions can execute in parallel. | Do not depend on foreach() print order. Use ordered processing only when required. |
| Driver memory error after collect() | Too many records are brought back to the driver. | Use take() for sampling or write results to storage instead of collecting everything. |
| JSON values appear as Row output | spark.read().json() creates a DataFrame before conversion. | Read fields from the Row object or process the data as a DataFrame. |
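For the print-order problem listed above, one option for small data is to bring the elements to the driver with collect(), which returns them in partition order, and print them there. The sketch below assumes a list small enough to fit in driver memory. The class name OrderedOutputSketch and the method collectInOrder() are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class OrderedOutputSketch {

	// Returns the elements in their original order, even with more than one partition.
	public static List<String> collectInOrder() {
		SparkConf conf = new SparkConf().setAppName("Ordered Output Sketch").setMaster("local[2]");
		try (JavaSparkContext sc = new JavaSparkContext(conf)) {
			List<String> data = Arrays.asList("Learn", "Apache", "Spark", "with", "Tutorial Kart");
			JavaRDD<String> items = sc.parallelize(data, 2);

			// foreach() runs once per partition, possibly in parallel: print order may vary
			items.foreach(item -> System.out.println("unordered: " + item));

			// collect() concatenates the partitions in order on the driver
			return items.collect();
		}
	}

	public static void main(String[] args) {
		for (String item : collectInOrder()) {
			System.out.println("ordered: " + item);
		}
	}
}
```

This only suits results that are small enough for the driver. For large results, writing to storage avoids the driver memory problem from the same table.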

Spark Create RDD FAQs

What is the easiest way to create an RDD in Spark Java?

The easiest way is to use JavaSparkContext.parallelize() on a small Java collection. For file-based data, use methods such as textFile() or read the data with SparkSession and convert it to an RDD when needed.

How does textFile() create an RDD in Spark?

textFile() reads a text file and returns a JavaRDD<String>. Each element in the RDD represents one line from the input file.

Can Spark create an RDD directly from a JSON file?

In Java, the common approach is to read JSON with SparkSession as a DataFrame and then call toJavaRDD(). The resulting RDD contains Row objects.

Should I use RDD or DataFrame for JSON processing in Spark?

For most structured JSON processing, a DataFrame is more convenient because it supports schema-aware operations and Spark SQL. Convert to an RDD only when you need low-level RDD transformations or an API that requires an RDD.

Spark RDD Tutorial QA Checklist

  • Verify that every Java example uses the correct Spark class: JavaSparkContext for list and text input, and SparkSession for JSON input.
  • Confirm that file paths such as data/rdd/input/sample.txt and data/employees.json exist before running the examples.
  • Use output blocks only for expected program results, not for Java syntax examples.
  • Check that examples using collect() are clearly described as small-data tutorial examples.
  • Keep the distinction clear between JavaRDD<String> from text input and JavaRDD<Row> from JSON input.

Summary of Spark RDD Creation Examples

In this Spark tutorial, we learned how to create a Spark RDD from a Java List, how to read a text file as an RDD, and how to read a JSON file as a DataFrame and convert it into a Java RDD. For small local data, use parallelize(). For line-based files, use textFile(). For JSON data, use SparkSession and convert to an RDD only when RDD-level processing is required.