Spark – Read JSON file to RDD
JSON is commonly used for application logs, API exports, event streams, and data exchange between services. In Apache Spark, JSON files are usually read through the SQL reader first, because Spark can infer a schema and represent each JSON object as a row.
In this tutorial, we shall learn how to read a JSON file to an RDD with the help of SparkSession, DataFrameReader, and Dataset<Row>.toJavaRDD(). We will also look at the expected JSON file format, how to access values from the resulting Row objects, and when a plain text RDD is a better choice.
Expected JSON format before converting Spark JSON data to RDD
By default, Spark expects a JSON Lines style file, where each line contains one complete JSON object. This format works well for distributed reading because Spark can process different lines in parallel.
For this tutorial, the input file contains one employee record per line.
employees.json
{"name":"Michael", "salary":3000}
{"name":"Andy", "salary":4500}
{"name":"Justin", "salary":3500}
{"name":"Berta", "salary":4000}
{"name":"Raju", "salary":3000}
If your input is a single pretty-printed JSON document or a JSON array spanning multiple lines, use the multiLine reader option before converting the dataset to an RDD.
JavaRDD<Row> items = spark.read()
.option("multiLine", "true")
.json(jsonPath)
.toJavaRDD();
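The multiLine option can be tried on a small pretty-printed file. The following is a minimal sketch, assuming the Spark SQL library is on the classpath; the class name MultiLineJsonToRDD, the helper methods, and the temporary sample file are hypothetical and only serve the illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MultiLineJsonToRDD {

    // Writes a pretty-printed JSON array (hypothetical sample data) to a temp file.
    static String writeSampleFile() {
        String json = String.join("\n",
                "[",
                "  {",
                "    \"name\": \"Michael\",",
                "    \"salary\": 3000",
                "  },",
                "  {",
                "    \"name\": \"Andy\",",
                "    \"salary\": 4500",
                "  }",
                "]");
        try {
            Path file = Files.createTempFile("employees-multiline", ".json");
            Files.write(file, json.getBytes(StandardCharsets.UTF_8));
            return file.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the multi-line file and returns how many rows Spark produced.
    static long countRecords() {
        SparkSession spark = SparkSession.builder()
                .appName("Multi-line JSON sketch")
                .master("local[2]")
                .getOrCreate();
        try {
            JavaRDD<Row> items = spark.read()
                    .option("multiLine", "true")
                    .json(writeSampleFile())
                    .toJavaRDD();
            return items.count();
        } finally {
            spark.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println("rows: " + countRecords());
    }
}
```

With multiLine enabled, each element of the top-level array should become one Row, so this sample is expected to produce two rows.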
Steps to read JSON file into JavaRDD<Row> in Spark
To read a JSON file into a Spark RDD:
- Create a SparkSession.
  SparkSession spark = SparkSession.builder().appName("Spark Example - Read JSON to RDD").master("local[2]").getOrCreate();
- Get the DataFrameReader of the SparkSession.
  spark.read()
- Use DataFrameReader.json(String jsonFilePath) to read the contents of the JSON file into a Dataset<Row>.
  spark.read().json(jsonPath)
- Use Dataset<Row>.toJavaRDD() to convert the Dataset<Row> to a JavaRDD<Row>.
  spark.read().json(jsonPath).toJavaRDD()
The important point is that Spark reads the JSON file as a structured dataset first. The RDD you get from toJavaRDD() is an RDD of Row objects, not an RDD of raw JSON strings.
Java example to read JSON file as Spark RDD
The following Java program reads a JSON file into a Spark RDD and prints its contents.
JSONtoRDD.java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
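public class JSONtoRDD {
    public static void main(String[] args) {
        // create a SparkSession that runs locally with two threads
        SparkSession spark = SparkSession
                .builder()
                .appName("Spark Example - Read JSON to RDD")
                .master("local[2]")
                .getOrCreate();
        // read the JSON file as Dataset<Row> and convert it to JavaRDD<Row>
        String jsonPath = "data/employees.json";
        JavaRDD<Row> items = spark.read().json(jsonPath).toJavaRDD();
        // print each Row; the order may vary when the RDD has more than one partition
        items.foreach(item -> {
            System.out.println(item);
        });
        // release Spark resources
        spark.stop();
    }
}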
Output
[Michael,3000]
[Andy,4500]
[Justin,3500]
[Berta,4000]
[Raju,3000]
Access fields from Row after reading JSON file to RDD
Printing a Row is useful for checking the result, but in a real Spark job you usually need to read individual fields. Since the JSON reader creates structured rows, you can access values by column name or by index.
JavaRDD<String> employeeNames = items.map(row -> row.<String>getAs("name"));
// salary is inferred as long, so read it as Long rather than Integer
JavaRDD<Long> employeeSalaries = items.map(row -> row.<Long>getAs("salary"));
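Access by index works alongside access by name. The sketch below is one possible way to combine Row.fieldIndex(), getString(), and getLong(); the class name RowFieldAccess is hypothetical, and the records are passed as an in-memory Dataset<String> (an assumption made here so that no input file is needed) instead of a file path.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RowFieldAccess {

    // Builds one description string per employee using index-based getters.
    static List<String> describeEmployees() {
        SparkSession spark = SparkSession.builder()
                .appName("Row field access sketch")
                .master("local[2]")
                .getOrCreate();
        try {
            // Two records in JSON Lines form, held in memory instead of a file.
            Dataset<String> jsonLines = spark.createDataset(Arrays.asList(
                    "{\"name\":\"Michael\", \"salary\":3000}",
                    "{\"name\":\"Andy\", \"salary\":4500}"), Encoders.STRING());
            JavaRDD<Row> items = spark.read().json(jsonLines).toJavaRDD();

            return items.map(row -> {
                // Look up the position from the column name, then read by index.
                String name = row.getString(row.fieldIndex("name"));
                // salary is inferred as long, so getLong() matches the column type.
                long salary = row.getLong(row.fieldIndex("salary"));
                return name + " earns " + salary;
            }).collect();
        } finally {
            spark.stop();
        }
    }

    public static void main(String[] args) {
        describeEmployees().forEach(System.out::println);
    }
}
```

Resolving the index with fieldIndex() avoids hard-coding column positions, which depend on the order of fields in the inferred schema.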
You can also inspect the inferred schema before converting the dataset to an RDD. This helps you confirm field names and data types.
spark.read().json(jsonPath).printSchema();
root
|-- name: string (nullable = true)
|-- salary: long (nullable = true)
Spark's JSON schema inference usually maps whole-number values such as salary to long rather than int, as the schema above shows. If you need a specific type, cast the column in a Dataset/DataFrame step before converting to RDD, or read the value with the matching Java type (for example, Long) in your RDD transformation.
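Casting in a DataFrame step could look like the following sketch. It assumes the salary values fit in an int; the class name CastSalaryBeforeRdd and the in-memory sample records are hypothetical.

```java
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

public class CastSalaryBeforeRdd {

    // Casts the inferred long column to int before switching to the RDD API.
    static List<Integer> salariesAsInt() {
        SparkSession spark = SparkSession.builder()
                .appName("Cast before RDD sketch")
                .master("local[2]")
                .getOrCreate();
        try {
            Dataset<String> jsonLines = spark.createDataset(Arrays.asList(
                    "{\"name\":\"Michael\", \"salary\":3000}",
                    "{\"name\":\"Andy\", \"salary\":4500}",
                    "{\"name\":\"Justin\", \"salary\":3500}"), Encoders.STRING());
            Dataset<Row> employees = spark.read().json(jsonLines);

            // Replace the salary column with an int-typed version of itself.
            Dataset<Row> typed = employees.withColumn(
                    "salary", col("salary").cast(DataTypes.IntegerType));

            // getInt() is safe here because the column type is now integer.
            return typed.toJavaRDD()
                    .map(row -> row.getInt(row.fieldIndex("salary")))
                    .collect();
        } finally {
            spark.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println(salariesAsInt());
    }
}
```

Doing the cast on the Dataset keeps the type decision in one place, so later RDD code does not need to convert each value by hand.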
Read raw JSON lines as RDD instead of Row RDD
If you want an RDD where each element is the original JSON text line, do not use spark.read().json(). Use textFile() instead. This is useful when you want to pass each raw JSON string to a custom parser or keep malformed records for separate handling.
JavaRDD<String> jsonLines = spark.sparkContext()
.textFile(jsonPath, 1)
.toJavaRDD();
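The difference is simple: spark.read().json(jsonPath).toJavaRDD() gives JavaRDD<Row>, while spark.sparkContext().textFile(jsonPath, 1).toJavaRDD() gives JavaRDD<String>. When calling SparkContext.textFile() from Java, pass the minimum number of partitions explicitly, because Scala default arguments are not available in Java.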
| Requirement | Recommended Spark API | RDD element type |
|---|---|---|
| Read valid JSON records with schema inference | spark.read().json(jsonPath).toJavaRDD() | Row |
| Read each JSON line as plain text | spark.sparkContext().textFile(jsonPath, 1).toJavaRDD() | String |
| Read pretty-printed or multi-line JSON | spark.read().option("multiLine", "true").json(jsonPath) | Row |
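To illustrate the raw-text route, the sketch below reads lines through JavaSparkContext, whose textFile(String) overload returns JavaRDD<String> directly, and sets aside lines that do not look like complete JSON objects. The looksLikeJsonObject() check is a deliberately crude placeholder for a real parser, and the class name RawJsonLines and the sample file are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

public class RawJsonLines {

    // Crude heuristic, NOT a JSON parser: the line starts with { and ends with }.
    static boolean looksLikeJsonObject(String line) {
        String trimmed = line.trim();
        return trimmed.startsWith("{") && trimmed.endsWith("}");
    }

    // Writes three lines to a temp file; the second one is truncated on purpose.
    static String writeSampleFile() {
        try {
            Path file = Files.createTempFile("employees-raw", ".json");
            Files.write(file, Arrays.asList(
                    "{\"name\":\"Michael\", \"salary\":3000}",
                    "{\"name\":\"Andy\", \"salary\":4500",
                    "{\"name\":\"Justin\", \"salary\":3500}"));
            return file.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns {number of plausible lines, number of suspicious lines}.
    static long[] countLines() {
        SparkSession spark = SparkSession.builder()
                .appName("Raw JSON lines sketch")
                .master("local[2]")
                .getOrCreate();
        try {
            JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
            // Each element is one unparsed line of the input file.
            JavaRDD<String> jsonLines = jsc.textFile(writeSampleFile());
            JavaRDD<String> plausible = jsonLines.filter(RawJsonLines::looksLikeJsonObject);
            JavaRDD<String> suspicious = jsonLines.filter(line -> !looksLikeJsonObject(line));
            return new long[] {plausible.count(), suspicious.count()};
        } finally {
            spark.stop();
        }
    }

    public static void main(String[] args) {
        long[] counts = countLines();
        System.out.println("plausible=" + counts[0] + ", suspicious=" + counts[1]);
    }
}
```

In a real job the heuristic would be replaced by the custom parser of your choice; the point of the sketch is only that each RDD element is the untouched input line.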
Common issues while reading JSON file to Spark RDD
- RDD contains Row objects, not JSON strings: Use row.getAs("columnName") to access values, or use textFile() if raw JSON text is required.
- Fields appear as null: Check whether the JSON keys are spelled consistently across all records.
- Multi-line JSON does not load as expected: Add .option("multiLine", "true") when the file is not in JSON Lines format.
- Numeric type mismatch: Spark may infer numeric values as Long or Double depending on the input. Inspect the schema before mapping values.
- Large JSON files are slow with inference: Provide a schema when the structure is known, so Spark does not need to infer it from the data.
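The "fields appear as null" case from the list above can be reproduced with a misspelled key. In this sketch, which uses hypothetical in-memory records and a hypothetical class name NullFieldCheck, one record uses "nmae" instead of "name", so Spark infers both columns and leaves name empty for that row.

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class NullFieldCheck {

    // Counts rows whose "name" column is null after schema inference.
    static long countMissingNames() {
        SparkSession spark = SparkSession.builder()
                .appName("Null field check sketch")
                .master("local[2]")
                .getOrCreate();
        try {
            Dataset<String> jsonLines = spark.createDataset(Arrays.asList(
                    "{\"name\":\"Michael\", \"salary\":3000}",
                    "{\"name\":\"Andy\", \"salary\":4500}",
                    // Misspelled key: this record has no "name" field.
                    "{\"nmae\":\"Justin\", \"salary\":3500}"), Encoders.STRING());
            JavaRDD<Row> items = spark.read().json(jsonLines).toJavaRDD();

            // isNullAt() avoids a NullPointerException when reading the field later.
            return items.filter(row -> row.isNullAt(row.fieldIndex("name"))).count();
        } finally {
            spark.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println("rows without a name: " + countMissingNames());
    }
}
```

A count above zero is a hint to inspect the schema with printSchema() and look for unexpected extra columns such as nmae.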
Providing schema before converting JSON dataset to RDD
For production jobs, it is often better to provide the JSON schema explicitly. This avoids surprises from schema inference and makes the output RDD easier to work with.
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
StructType employeeSchema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField("name", DataTypes.StringType, true),
DataTypes.createStructField("salary", DataTypes.IntegerType, true)
});
JavaRDD<Row> employees = spark.read()
.schema(employeeSchema)
.json(jsonPath)
.toJavaRDD();
This keeps the salary field as an integer according to the schema you provide, instead of depending on automatic inference.
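As a rough end-to-end illustration, the sketch below applies the same schema and then sums the salaries with the RDD API. The class name ExplicitSchemaSum is hypothetical, and the five employee records are supplied as an in-memory Dataset<String> (an assumption, so that no file is required).

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class ExplicitSchemaSum {

    // Reads the records with an explicit schema and adds up the salary column.
    static int totalSalary() {
        SparkSession spark = SparkSession.builder()
                .appName("Explicit schema sketch")
                .master("local[2]")
                .getOrCreate();
        try {
            StructType employeeSchema = DataTypes.createStructType(new StructField[] {
                    DataTypes.createStructField("name", DataTypes.StringType, true),
                    DataTypes.createStructField("salary", DataTypes.IntegerType, true)
            });
            Dataset<String> jsonLines = spark.createDataset(Arrays.asList(
                    "{\"name\":\"Michael\", \"salary\":3000}",
                    "{\"name\":\"Andy\", \"salary\":4500}",
                    "{\"name\":\"Justin\", \"salary\":3500}",
                    "{\"name\":\"Berta\", \"salary\":4000}",
                    "{\"name\":\"Raju\", \"salary\":3000}"), Encoders.STRING());

            JavaRDD<Row> employees = spark.read()
                    .schema(employeeSchema)
                    .json(jsonLines)
                    .toJavaRDD();

            // The schema declares salary as integer, so getInt() matches the column.
            // This sample has no null salaries; real data may need an isNullAt() check.
            return employees
                    .map(row -> row.getInt(row.fieldIndex("salary")))
                    .reduce(Integer::sum);
        } finally {
            spark.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println("total salary: " + totalSalary());
    }
}
```

For the five sample employees the total is 3000 + 4500 + 3500 + 4000 + 3000 = 18000.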
FAQ on Spark reading JSON file to RDD
Does Spark read JSON directly into an RDD?
Spark usually reads JSON through spark.read().json(), which returns a Dataset/DataFrame. You can then call toJavaRDD() to convert it to JavaRDD<Row>.
Why does spark.read().json().toJavaRDD() return Row instead of String?
The JSON reader parses the file as structured data and creates columns from JSON fields. Therefore, each RDD element is a Row. Use textFile() when you need each input line as a raw JSON string.
How do I read a multi-line JSON file into Spark RDD?
Use .option("multiLine", "true") with the JSON reader, and then call toJavaRDD(). This is needed when a single JSON object or JSON array spans multiple lines.
Should I use DataFrame or RDD for JSON processing in Spark?
Use DataFrame or Dataset APIs for most JSON processing because they support schema handling, column operations, and query optimization. Convert to RDD only when you need low-level transformations or custom processing that is easier with RDD functions.
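One way to combine the two APIs is to filter and project with DataFrame operations first and convert to an RDD only at the end. The sketch below assumes a salary threshold of 3000 chosen purely for illustration; the class name DataFrameThenRdd and the in-memory records are hypothetical.

```java
import static org.apache.spark.sql.functions.col;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameThenRdd {

    // Returns the names of employees whose salary is above 3000.
    static List<String> highEarners() {
        SparkSession spark = SparkSession.builder()
                .appName("DataFrame then RDD sketch")
                .master("local[2]")
                .getOrCreate();
        try {
            Dataset<String> jsonLines = spark.createDataset(Arrays.asList(
                    "{\"name\":\"Michael\", \"salary\":3000}",
                    "{\"name\":\"Andy\", \"salary\":4500}",
                    "{\"name\":\"Justin\", \"salary\":3500}",
                    "{\"name\":\"Berta\", \"salary\":4000}",
                    "{\"name\":\"Raju\", \"salary\":3000}"), Encoders.STRING());

            // Column-based filter and projection stay in the DataFrame API.
            Dataset<Row> filtered = spark.read().json(jsonLines)
                    .filter(col("salary").gt(3000))
                    .select("name");

            // Convert to RDD only for the final low-level step.
            return filtered.toJavaRDD()
                    .map(row -> row.getString(0))
                    .collect();
        } finally {
            spark.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println(highEarners());
    }
}
```

Keeping the filter in the DataFrame API lets Spark plan it as part of the query, while the RDD step is left for logic that is awkward to express with columns.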
How can I avoid wrong data types when reading JSON in Spark?
Inspect the inferred schema with printSchema(). For stable jobs, provide a StructType schema before reading the JSON file, especially when numeric fields must have exact types.
QA checklist for Spark JSON to RDD examples
- Confirm that the sample JSON file uses one complete JSON object per line unless multiLine is shown.
- Check that the tutorial explains the output type as JavaRDD<Row>, not JavaRDD<String>.
- Verify that field access examples use actual JSON key names such as name and salary.
- Include a note about schema inference and numeric type handling for salary-like fields.
- Show textFile() separately when the requirement is to read raw JSON lines.
Conclusion
In this Spark tutorial, we learned how to read a JSON file into a Spark RDD with the help of an example Java program. The usual approach is to read JSON as a structured dataset using spark.read().json() and then convert it to JavaRDD<Row>. If the requirement is to keep each JSON record as raw text, use Spark’s textFile() API instead.
TutorialKart.com