Spark - Read JSON file to RDD

JSON is commonly used for application logs, API exports, event streams, and data exchange between services. In Apache Spark, JSON files are usually read through the SQL reader first, because Spark can infer a schema and represent each JSON object as a row.

In this tutorial, we shall learn how to read a JSON file to an RDD with the help of SparkSession, DataFrameReader, and Dataset<Row>.toJavaRDD(). We will also look at the expected JSON file format, how to access values from the resulting Row objects, and when a plain text RDD is a better choice.

Expected JSON format before converting Spark JSON data to RDD

By default, Spark expects a JSON Lines style file, where each line contains one complete JSON object. This format works well for distributed reading because Spark can process different lines in parallel.

For this tutorial, the input file contains one employee record per line.

employees.json

{"name":"Michael", "salary":3000}
{"name":"Andy", "salary":4500}
{"name":"Justin", "salary":3500}
{"name":"Berta", "salary":4000}
{"name":"Raju", "salary":3000}

If your input is a single pretty-printed JSON document or a JSON array spanning multiple lines, use the multiLine reader option before converting the dataset to an RDD.

JavaRDD<Row> items = spark.read()
        .option("multiLine", "true")
        .json(jsonPath)
        .toJavaRDD();
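
When it is unclear which of the two formats a file uses, a rough pre-check in plain Java can help decide whether the multiLine option is needed. The sketch below is only a heuristic: the class JsonLinesCheck and its method looksLikeJsonLines are hypothetical helper names, not Spark APIs, and the check only looks at the first and last character of each line without validating the JSON. It also reads the whole file into memory, so it is meant for small samples.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of Spark: a rough format pre-check.
class JsonLinesCheck {

    // Returns true when every non-blank line starts with '{' and ends with '}'.
    // This is a heuristic only; it does not validate the JSON itself.
    static boolean looksLikeJsonLines(List<String> lines) {
        boolean sawRecord = false;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank lines carry no record
            }
            if (!trimmed.startsWith("{") || !trimmed.endsWith("}")) {
                return false; // array bracket, lone brace, or partial object
            }
            sawRecord = true;
        }
        return sawRecord;
    }

    public static void main(String[] args) throws IOException {
        // Path taken from this tutorial's example; adjust it for your data.
        List<String> lines = Files.readAllLines(Paths.get("data/employees.json"));
        boolean multiLine = !looksLikeJsonLines(lines);
        System.out.println("multiLine option needed: " + multiLine);
    }
}
```

If the check reports that the file is not in JSON Lines form, read it with .option("multiLine", "true") as shown above.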

Steps to read JSON file into JavaRDD<Row> in Spark

To read a JSON file into a Spark RDD, follow these steps:

Create a SparkSession.

SparkSession spark = SparkSession
		.builder()
		.appName("Spark Example - Read JSON to RDD")
		.master("local[2]")
		.getOrCreate();

Get DataFrameReader of the SparkSession.

spark.read()

Use DataFrameReader.json(String jsonFilePath) to read the contents of JSON to Dataset<Row>.

spark.read().json(jsonPath)

Use Dataset<Row>.toJavaRDD() to convert Dataset<Row> to JavaRDD<Row>.

spark.read().json(jsonPath).toJavaRDD()

The important point is that Spark reads the JSON file as a structured dataset first. The RDD you get from toJavaRDD() is an RDD of Row objects, not an RDD of raw JSON strings.

Java example to read JSON file as Spark RDD

The following Java program reads a JSON file into a Spark RDD and prints its contents.

JSONtoRDD.java

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JSONtoRDD {
	public static void main(String[] args) {
		// configure spark
		SparkSession spark = SparkSession
				.builder()
				.appName("Spark Example - Read JSON to RDD")
				.master("local[2]")
				.getOrCreate();

		// read JSON file to RDD
		String jsonPath = "data/employees.json";
		JavaRDD<Row> items = spark.read().json(jsonPath).toJavaRDD();

		items.foreach(item -> {
			System.out.println(item); 
		});
	}
}

Output

[Michael,3000]
[Andy,4500]
[Justin,3500]
[Berta,4000]
[Raju,3000]

Access fields from Row after reading JSON file to RDD

Printing a Row is useful for checking the result, but in a real Spark job you usually need to read individual fields. Since the JSON reader creates structured rows, you can access values by column name or by index.

JavaRDD<String> employeeNames = items.map(row -> row.<String>getAs("name"));

// salary is inferred as long, so read it as Long rather than Integer
JavaRDD<Long> employeeSalaries = items.map(row -> row.<Long>getAs("salary"));

You can also inspect the inferred schema before converting the dataset to an RDD. This helps you confirm field names and data types.

spark.read().json(jsonPath).printSchema();

root
 |-- name: string (nullable = true)
 |-- salary: long (nullable = true)

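Spark's JSON schema inference reads whole-number values such as 3000 as long, so the salary value in each Row is a Long, and treating it as an Integer fails with a ClassCastException at runtime. If you need a specific type, cast it in a Dataset/DataFrame step before converting to RDD, or handle the value type carefully in your RDD transformation.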
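
The careful handling mentioned here can be sketched in plain Java. The class NumericFields and its method toInt are hypothetical helper names introduced for illustration, not part of Spark. The idea is to accept whatever boxed number the field holds and narrow it explicitly, instead of casting to a fixed wrapper type.

```java
// Hypothetical helper, not part of Spark: converts a boxed numeric field value to int.
class NumericFields {

    // Accepts any boxed number (for example the Long that schema inference yields)
    // and narrows it to int. Fractional values are truncated, and values outside
    // the int range raise ArithmeticException instead of silently wrapping.
    static int toInt(Object value) {
        if (!(value instanceof Number)) {
            throw new IllegalArgumentException("Not a numeric value: " + value);
        }
        return Math.toIntExact(((Number) value).longValue());
    }

    public static void main(String[] args) {
        Object inferred = Long.valueOf(3000L); // what an inferred long field holds

        // A direct cast such as (Integer) inferred would throw ClassCastException.
        System.out.println(toInt(inferred));
    }
}
```

In an RDD step, such a helper could be applied as items.map(row -> NumericFields.toInt(row.get(row.fieldIndex("salary")))), assuming the helper class is available on the classpath of the job.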

Read raw JSON lines as RDD instead of Row RDD

If you want an RDD where each element is the original JSON text line, do not use spark.read().json(). Use textFile() instead. This is useful when you want to pass each raw JSON string to a custom parser or keep malformed records for separate handling.

JavaRDD<String> jsonLines = spark.sparkContext()
        .textFile(jsonPath, 1)
        .toJavaRDD();

The difference is simple: spark.read().json(jsonPath).toJavaRDD() gives JavaRDD<Row>, while textFile(jsonPath) gives JavaRDD<String>.

Requirement	Recommended Spark API	RDD element type
Read valid JSON records with schema inference	`spark.read().json(jsonPath).toJavaRDD()`	`Row`
Read each JSON line as plain text	`spark.sparkContext().textFile(jsonPath, 1).toJavaRDD()`	`String`
Read pretty-printed or multi-line JSON	`spark.read().option("multiLine", "true").json(jsonPath)`	`Row`

Common issues while reading JSON file to Spark RDD

RDD contains Row objects, not JSON strings: Use row.getAs("columnName") to access values, or use textFile() if raw JSON text is required.
Fields appear as null: Check whether the JSON keys are spelled consistently across all records.
Multi-line JSON does not load as expected: Add .option("multiLine", "true") when the file is not in JSON Lines format.
Numeric type mismatch: Spark may infer numeric values as Long or Double depending on the input. Inspect the schema before mapping values.
Large JSON files are slow with inference: Provide a schema when the structure is known, so Spark does not need to infer it from the data.
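
For the null-field issue above, a quick way to spot inconsistently spelled keys is to compare the key names found in each record. The sketch below does this in plain Java on a sample of raw lines. The class KeyConsistencyCheck and its methods are hypothetical helper names, not Spark APIs, and the regular expression is a rough heuristic for flat records like the ones in employees.json. It is not a JSON parser and can be fooled by string values that contain a quote followed by a colon.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spark: spots keys missing from some records.
class KeyConsistencyCheck {

    // Matches a quoted string followed by a colon, e.g. "name":
    private static final Pattern KEY = Pattern.compile("\"([^\"\\\\]+)\"\\s*:");

    // Collects the key names that appear in one JSON line.
    static Set<String> keysOf(String jsonLine) {
        Set<String> keys = new TreeSet<>();
        Matcher matcher = KEY.matcher(jsonLine);
        while (matcher.find()) {
            keys.add(matcher.group(1));
        }
        return keys;
    }

    // Returns keys that appear in at least one record but not in all of them.
    static Set<String> inconsistentKeys(List<String> jsonLines) {
        Set<String> union = new TreeSet<>();
        List<Set<String>> perLine = new ArrayList<>();
        for (String line : jsonLines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            Set<String> keys = keysOf(line);
            perLine.add(keys);
            union.addAll(keys);
        }
        Set<String> inconsistent = new TreeSet<>();
        for (String key : union) {
            for (Set<String> keys : perLine) {
                if (!keys.contains(key)) {
                    inconsistent.add(key);
                    break;
                }
            }
        }
        return inconsistent;
    }
}
```

A non-empty result, for example both Name and name, points to records where Spark would create separate columns and fill the missing side with null.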

Providing schema before converting JSON dataset to RDD

For production jobs, it is often better to provide the JSON schema explicitly. This avoids surprises from schema inference and makes the output RDD easier to work with.

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

StructType employeeSchema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("name", DataTypes.StringType, true),
        DataTypes.createStructField("salary", DataTypes.IntegerType, true)
});

JavaRDD<Row> employees = spark.read()
        .schema(employeeSchema)
        .json(jsonPath)
        .toJavaRDD();

This keeps the salary field as an integer according to the schema you provide, instead of depending on automatic inference.

FAQ on Spark reading JSON file to RDD

Does Spark read JSON directly into an RDD?

Spark usually reads JSON through spark.read().json(), which returns a Dataset/DataFrame. You can then call toJavaRDD() to convert it to JavaRDD<Row>.

Why does spark.read().json().toJavaRDD() return Row instead of String?

The JSON reader parses the file as structured data and creates columns from JSON fields. Therefore, each RDD element is a Row. Use textFile() when you need each input line as a raw JSON string.

How do I read a multi-line JSON file into Spark RDD?

Use .option("multiLine", "true") with the JSON reader, and then call toJavaRDD(). This is needed when a single JSON object or JSON array spans multiple lines.

Should I use DataFrame or RDD for JSON processing in Spark?

Use DataFrame or Dataset APIs for most JSON processing because they support schema handling, column operations, and query optimization. Convert to RDD only when you need low-level transformations or custom processing that is easier with RDD functions.

How can I avoid wrong data types when reading JSON in Spark?

Inspect the inferred schema with printSchema(). For stable jobs, provide a StructType schema before reading the JSON file, especially when numeric fields must have exact types.

QA checklist for Spark JSON to RDD examples

Confirm that the sample JSON file uses one complete JSON object per line unless multiLine is shown.
Check that the tutorial explains the output type as JavaRDD<Row>, not JavaRDD<String>.
Verify that field access examples use actual JSON key names such as name and salary.
Include a note about schema inference and numeric type handling for salary-like fields.
Show textFile() separately when the requirement is to read raw JSON lines.

Conclusion

In this Spark tutorial, we learned how to read a JSON file into a Spark RDD with the help of an example Java program. The usual approach is to read JSON as a structured dataset using spark.read().json() and then convert it to JavaRDD<Row>. If the requirement is to keep each JSON record as raw text, use Spark’s textFile() API instead.

TutorialKart.com