Spark read JSON file to Dataset in Java
To read a JSON file to a Spark Dataset in Java, load the file with SparkSession.read().json(path) and then convert the resulting Dataset<Row> to a typed Dataset<Employee> using .as(Encoders.bean(Employee.class)). This combines Spark SQL’s JSON reader with typed access to each record through a Java bean class.
Spark’s JSON reader is part of Spark SQL. By default, it expects newline-delimited JSON, where each line is a complete JSON object. If your source file is a normal pretty-printed JSON document or a JSON array spread across multiple lines, read it with the multiLine option. The official Spark JSON data source documentation describes these JSON file expectations and the available reader options at spark.apache.org/docs/latest/sql-data-sources-json.html.
When Spark Dataset is useful for reading JSON objects
A Spark DataFrame is already a Dataset<Row>. Use a typed Dataset when you want each JSON record to be represented as a Java object such as Employee, Customer, or Order. This is useful when later transformations use Java methods, typed encoders, or domain classes instead of only column expressions.
For simple SQL-style selection, filtering, grouping, and joins, a DataFrame is often enough. For typed processing, the usual flow is: JSON file to Dataset<Row>, then Dataset<Row> to Dataset<YourBean>.
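The flow above can be sketched as follows. This is an illustration rather than a reference implementation: it assumes Spark 2.2 or later (for the json(Dataset<String>) overload), keeps the records in memory so that it runs without an input file, and uses hypothetical names (TypedEmployeeFilter, namesEarningAtLeast) and a salary typed as long.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TypedEmployeeFilter {

    // Hypothetical bean for one JSON record (assumption: salary fits in a long).
    public static class Employee implements Serializable {
        private String name;
        private long salary;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public long getSalary() { return salary; }
        public void setSalary(long salary) { this.salary = salary; }
    }

    // JSON lines -> Dataset<Row> -> Dataset<Employee> -> typed filter and map.
    static List<String> namesEarningAtLeast(SparkSession spark, List<String> jsonLines, long minSalary) {
        Dataset<Row> rows = spark.read().json(spark.createDataset(jsonLines, Encoders.STRING()));
        Dataset<Employee> employees = rows.as(Encoders.bean(Employee.class));
        return employees
                // typed filter: the lambda receives an Employee object, not a Row
                .filter((FilterFunction<Employee>) e -> e.getSalary() >= minSalary)
                // typed map: a result encoder is required for the new element type
                .map((MapFunction<Employee, String>) Employee::getName, Encoders.STRING())
                .collectAsList();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Typed Dataset sketch")
                .master("local[2]")
                .getOrCreate();

        List<String> jsonLines = Arrays.asList(
                "{\"name\":\"Michael\",\"salary\":3000}",
                "{\"name\":\"Andy\",\"salary\":4500}",
                "{\"name\":\"Berta\",\"salary\":4000}");

        System.out.println(namesEarningAtLeast(spark, jsonLines, 4000L));
        spark.stop();
    }
}
```

The same filter could be written as a column expression on the DataFrame; the typed form pays off mainly when the logic already lives in Java methods on the domain class.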
JSON file format expected by Spark read().json()
The example in this tutorial uses a newline-delimited file named data/employees.json. Each line is a separate JSON object with field names that match the Java bean properties.
{"name":"Michael","salary":3000}
{"name":"Andy","salary":4500}
{"name":"Justin","salary":3500}
{"name":"Berta","salary":4000}
{"name":"Raju","salary":3000}
{"name":"Chandy","salary":4500}
{"name":"Joey","salary":3500}
{"name":"Mon","salary":4000}
{"name":"Rachel","salary":4000}
If the input is a directory, Spark reads all matching JSON files in that directory. The path may point to a local file system path while testing, or to a distributed storage path such as HDFS, S3, or cloud storage in a cluster setup.
Steps to read JSON file to Dataset in Spark Java
To read a JSON file to a typed Dataset in Spark Java, follow these steps.
- Create a Java bean class that represents one JSON record. Keep property names aligned with JSON field names.
- Create a SparkSession, because JSON reading is provided by Spark SQL.
- Create an Encoder for the Java bean class using Encoders.bean(Employee.class).
- Read the JSON file with spark.read().json(jsonPath). This first creates a Dataset<Row>.
- Call .as(employeeEncoder) to convert the rows into Dataset<Employee>.
- Use printSchema() or show() to verify that Spark read the expected columns and values.
Java example to read JSON file to Dataset<Employee>
The following Java example defines an Employee class that describes one record of the JSON file, and reads the file into a Dataset<Employee>.
JSONtoDataSet.java
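import java.io.Serializable;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class JSONtoDataSet {

    // Java bean for one JSON record; Encoders.bean() maps columns through getters and setters
    public static class Employee implements Serializable {
        private String name;
        // long, because Spark infers JSON integers as bigint
        private long salary;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public long getSalary() { return salary; }
        public void setSalary(long salary) { this.salary = salary; }
    }

    public static void main(String[] args) {
        // configure Spark
        SparkSession spark = SparkSession
                .builder()
                .appName("Read JSON File to DataSet")
                .master("local[2]")
                .getOrCreate();

        // encoder that maps rows to the Employee bean
        Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);

        String jsonPath = "data/employees.json";

        // read the JSON file to Dataset<Row>, then convert it to Dataset<Employee>
        Dataset<Employee> ds = spark.read().json(jsonPath).as(employeeEncoder);
        ds.show();

        spark.stop();
    }
}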
Output
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
|   Raju|  3000|
| Chandy|  4500|
|   Joey|  3500|
|    Mon|  4000|
| Rachel|  4000|
+-------+------+
The statement spark.read().json(jsonPath) performs schema inference and returns a Dataset<Row>. The statement .as(employeeEncoder) asks Spark to map those rows to the Employee type. Encoders.bean() follows JavaBean conventions, so use a bean-style class with a no-argument constructor and getters/setters, especially when the class is reused outside a small tutorial example. Spark infers JSON integers as bigint, so declare such properties as long or Long; an int property can fail the up-cast check in .as().
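To see what the getter/setter convention means in practice, the JDK's own java.beans.Introspector can list the properties a class exposes as a JavaBean. The sketch below uses only the JDK and hypothetical class names (BeanPropertyCheck, FieldOnlyEmployee, BeanEmployee); it illustrates the convention and is not a copy of Spark's internal logic.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BeanPropertyCheck {

    // Public fields only: no getters or setters, so no JavaBean properties.
    public static class FieldOnlyEmployee {
        public String name;
        public long salary;
    }

    // Bean-style class: private fields with a getter and setter per property.
    public static class BeanEmployee {
        private String name;
        private long salary;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public long getSalary() { return salary; }
        public void setSalary(long salary) { this.salary = salary; }
    }

    // Returns the sorted names of properties that have both a getter and a setter.
    static List<String> propertyNames(Class<?> type) {
        List<String> names = new ArrayList<>();
        try {
            // Stopping at Object.class leaves out the implicit "class" property.
            for (PropertyDescriptor pd
                    : Introspector.getBeanInfo(type, Object.class).getPropertyDescriptors()) {
                if (pd.getReadMethod() != null && pd.getWriteMethod() != null) {
                    names.add(pd.getName());
                }
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException("Could not introspect " + type.getName(), e);
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        System.out.println(propertyNames(FieldOnlyEmployee.class)); // prints []
        System.out.println(propertyNames(BeanEmployee.class));      // prints [name, salary]
    }
}
```

A class that reports no properties here gives a bean encoder nothing to map, which is why the getter/setter form is the safer default.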
Reading a multi-line JSON file into a Spark Dataset
If your JSON file contains one array or one object spread across multiple lines, set multiLine to true. Without this option, Spark treats each line as a separate JSON record and may mark the input as corrupt or return unexpected null values.
Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
Dataset<Employee> employees = spark.read()
        .option("multiLine", "true")
        .json("data/employees-array.json")
        .as(employeeEncoder);
employees.show();
Use multiLine for files such as the following JSON array. For large distributed datasets, newline-delimited JSON is usually easier for Spark to split and process in parallel.
[
{"name":"Michael","salary":3000},
{"name":"Andy","salary":4500},
{"name":"Justin","salary":3500}
]
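One quick way to decide whether multiLine is needed is to look at the shape of the lines before handing the file to Spark. The helper below is a rough heuristic under stated assumptions, not a JSON parser: JsonLayoutCheck and looksNewlineDelimited are hypothetical names, and the check only tests that every non-blank line starts with { and ends with }.

```java
import java.util.Arrays;
import java.util.List;

class JsonLayoutCheck {

    // Heuristic only: true when every non-blank line looks like one complete JSON object.
    static boolean looksNewlineDelimited(List<String> lines) {
        boolean sawRecord = false;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank lines carry no record, so they are skipped
            }
            if (!(trimmed.startsWith("{") && trimmed.endsWith("}"))) {
                return false; // e.g. "[", "]", or an object followed by a comma
            }
            sawRecord = true;
        }
        return sawRecord;
    }

    public static void main(String[] args) {
        List<String> ndjson = Arrays.asList(
                "{\"name\":\"Michael\",\"salary\":3000}",
                "{\"name\":\"Andy\",\"salary\":4500}");
        List<String> array = Arrays.asList(
                "[",
                "{\"name\":\"Michael\",\"salary\":3000},",
                "{\"name\":\"Andy\",\"salary\":4500}",
                "]");

        System.out.println(looksNewlineDelimited(ndjson)); // prints true
        System.out.println(looksNewlineDelimited(array));  // prints false
    }
}
```

If the check returns false for a sample of the file, reading with .option("multiLine", "true") is the likelier fit; when in doubt, rely on printSchema() and the corrupt-record column instead.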
Reading a directory of JSON files as one Dataset
The JSON path can point to a directory. Spark will read the JSON files under that directory and combine the records into one Dataset, provided the files have compatible fields.
Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
String jsonPath = "data/employees/";
Dataset<Employee> employees = spark.read()
        .json(jsonPath)
        .as(employeeEncoder);
employees.show();
This is commonly used when JSON output is partitioned into many part files. If some files contain additional fields that are not present in the bean class, Spark can read them in the row schema, but they will not be available on the typed Java object unless the bean class also contains matching properties.
Checking Spark JSON schema before converting rows to Dataset objects
When a JSON file does not map correctly to a Dataset, inspect the inferred schema before calling .as(). This helps you catch field-name mismatches, unexpected nested structures, and numeric types inferred differently from the Java class.
Dataset<org.apache.spark.sql.Row> rows = spark.read().json("data/employees.json");
rows.printSchema();
rows.show(false);
Dataset<Employee> employees = rows.as(Encoders.bean(Employee.class));
For nullable numeric fields, use wrapper types such as Integer, Long, or Double in the Java class instead of primitive types such as int or double. This is safer when some JSON records have missing or null values.
Handling malformed JSON records while reading a Dataset
Spark’s JSON reader supports parse modes. The default mode is PERMISSIVE: Spark keeps reading, sets the fields of a malformed record to null, and stores the raw text in a corrupt-record column (named _corrupt_record unless columnNameOfCorruptRecord is set) when the schema includes that column. You can make the behavior explicit while debugging JSON input files.
Dataset<org.apache.spark.sql.Row> rows = spark.read()
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", "_bad_record")
        .json("data/employees.json");
rows.show(false);
For strict pipelines, use FAILFAST so that Spark throws an error when it finds malformed JSON. For exploratory work, permissive mode is easier because you can inspect the bad records and fix the source file.
Spark JSON Dataset troubleshooting in Java
| Problem while reading JSON | Likely reason | Practical fix |
|---|---|---|
| Dataset columns are null, or .as() throws an AnalysisException | JSON field names do not match the Java bean property names, or the inferred type does not match the bean type. | Print the row schema first, then align bean property names and types with the JSON fields. |
| Spark shows a corrupt record column | The file may be pretty-printed JSON, a JSON array, or an invalid JSON line. | Use multiLine for one multi-line JSON document, or fix invalid lines in newline-delimited JSON. |
| Path does not exist error | The path is relative to the Spark application’s working directory, or the cluster cannot access the local path. | Use an absolute path for local testing, or use a shared storage path such as HDFS/S3 in cluster mode. |
| Nested JSON does not fit the bean | The JSON contains nested objects or arrays, but the Java class has only flat fields. | Create nested bean classes, define a schema, or keep the data as Dataset<Row> until nested fields are flattened. |
| Missing numeric values cause mapping issues | Primitive numeric fields cannot represent null safely. | Use wrapper classes such as Integer or Long for fields that may be absent or null. |
FAQs on Spark read JSON file to Dataset
Does Spark read().json() return Dataset<Employee> directly?
No. In Java, spark.read().json(path) returns Dataset<Row>. To get Dataset<Employee>, create an encoder with Encoders.bean(Employee.class) and call .as(employeeEncoder).
Why does Spark expect one JSON object per line?
Spark’s default JSON reader is designed for newline-delimited JSON, where each line is an independent JSON object. This format is easier to split and process in parallel. For a single multi-line JSON document, use .option("multiLine", "true").
Can Spark read a JSON array into a Dataset?
Yes. If the file is a JSON array spread across multiple lines, read it with the multiLine option and then convert the rows with .as(employeeEncoder).
Should I manually define schema when reading JSON to Dataset?
For small examples, schema inference is convenient. For production jobs, an explicit schema or a carefully maintained bean class is safer because it avoids surprises when input data changes.
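A minimal sketch of the explicit-schema approach, assuming the same two fields as the sample file. The class name ExplicitSchemaExample and the helper names are hypothetical, salary is declared as a nullable Long so that a record without the field maps to null, and the records are kept in memory (Spark 2.2 or later) only to keep the sketch runnable without a file.

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ExplicitSchemaExample {

    // Hypothetical bean; Long (not long) so a missing salary can be null.
    public static class Employee implements Serializable {
        private String name;
        private Long salary;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public Long getSalary() { return salary; }
        public void setSalary(Long salary) { this.salary = salary; }
    }

    // Schema declared by hand instead of inferred from the data.
    static StructType employeeSchema() {
        return new StructType()
                .add("name", DataTypes.StringType, true)
                .add("salary", DataTypes.LongType, true);
    }

    // Applies the schema while reading, then converts the rows to Employee objects.
    static Dataset<Employee> readEmployees(SparkSession spark, Dataset<String> jsonLines) {
        return spark.read()
                .schema(employeeSchema())
                .json(jsonLines)
                .as(Encoders.bean(Employee.class));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Explicit schema sketch")
                .master("local[2]")
                .getOrCreate();

        Dataset<String> jsonLines = spark.createDataset(
                Arrays.asList(
                        "{\"name\":\"Michael\",\"salary\":3000}",
                        "{\"name\":\"Andy\"}"), // salary is missing on purpose
                Encoders.STRING());

        Dataset<Employee> employees = readEmployees(spark, jsonLines);
        employees.printSchema();
        employees.show();
        spark.stop();
    }
}
```

With a file as input, the same schema would be passed as spark.read().schema(employeeSchema()).json(jsonPath). Supplying the schema also skips the inference pass over the data.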
Can nested JSON be converted to a Java Dataset?
Yes, but the Java type must match the nested structure. For complex nested JSON, it is often simpler to read the file as Dataset<Row>, inspect or flatten the nested columns, and then convert to a typed Dataset.
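The read-then-flatten route can be sketched as below. Everything specific here is an assumption made for illustration: the nested address object with a city field, the EmployeeCity bean, and the class name FlattenNestedJson do not come from the sample file used in this tutorial, and the in-memory input assumes Spark 2.2 or later.

```java
import static org.apache.spark.sql.functions.col;

import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FlattenNestedJson {

    // Hypothetical flat bean that receives the selected columns.
    public static class EmployeeCity implements Serializable {
        private String name;
        private String city;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public String getCity() { return city; }
        public void setCity(String city) { this.city = city; }
    }

    // Pulls the nested address.city value up to a top-level column, then converts.
    static Dataset<EmployeeCity> flatten(Dataset<Row> rows) {
        return rows
                .select(col("name"), col("address.city").alias("city"))
                .as(Encoders.bean(EmployeeCity.class));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Flatten nested JSON sketch")
                .master("local[2]")
                .getOrCreate();

        Dataset<Row> rows = spark.read().json(spark.createDataset(
                Arrays.asList(
                        "{\"name\":\"Michael\",\"address\":{\"city\":\"Pune\",\"zip\":\"411001\"}}"),
                Encoders.STRING()));

        rows.printSchema();   // address is inferred as a struct column
        flatten(rows).show(); // flat columns: name, city
        spark.stop();
    }
}
```

Selecting only the needed columns before .as() also keeps unrelated nested fields out of the typed Dataset.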
QA checklist for a Spark JSON to Dataset Java example
- Confirm whether the sample JSON is newline-delimited JSON or a multi-line JSON array.
- Check that JSON field names match the Java bean property names used by Encoders.bean().
- Run printSchema() before converting to a typed Dataset when the mapping is unclear.
- Use wrapper numeric types in the bean when JSON fields may be null or missing.
- Verify that the JSON path is accessible from the Spark driver and executors in the target environment.
What to remember when Spark reads JSON into a typed Dataset
In this Spark Tutorial – Read JSON file to Dataset, we have learnt to read JSON records with SparkSession.read().json() and convert them to typed Java objects with an encoder. Use the default reader for newline-delimited JSON, use multiLine for a single multi-line JSON document, and inspect the schema whenever the JSON-to-bean mapping is not clear.
TutorialKart.com