Spark RDD with custom class objects
In Apache Spark, an RDD can store elements of a custom Java class, such as Person, Employee, Order, or any domain object used in your application. To create a Spark RDD with custom class objects in Java, make the custom class serializable, create a list of objects, and pass the list to JavaSparkContext.parallelize(). JavaSparkContext is the Java-friendly wrapper that accepts a java.util.List, whereas the underlying Scala SparkContext.parallelize() expects a Scala Seq.
The important point is serialization. Spark sends functions and data between the driver and executor processes. If a custom object is part of an RDD, Spark must be able to serialize that object before it can distribute work across the cluster.
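To make the serialization requirement concrete, the following sketch writes a Person to bytes and reads it back using plain Java serialization, which is the mechanism behind Spark's default serializer. It does not start Spark. The SerializationRoundTrip class and its roundTrip() method are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Same shape as the Person class used in this tutorial.
class Person implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

// Hypothetical helper class for illustration; not a Spark API.
class SerializationRoundTrip {

    // Write the object to a byte array and read it back, as a stand-in
    // for an object travelling from one JVM process to another.
    static Person roundTrip(Person person) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(person);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Person) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Person copy = roundTrip(new Person("Arjun", 25));
        System.out.println(copy.name + " - " + copy.age); // prints "Arjun - 25"
    }
}
```

If Person did not implement Serializable, writeObject() would throw java.io.NotSerializableException instead of producing bytes.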
Requirements for a Java custom class used inside Spark RDD
Before using a Java class as the element type of an RDD, check these basic requirements.
- The custom class should implement java.io.Serializable.
- Fields used inside Spark transformations should also be serializable.
- Avoid storing non-serializable resources, such as open database connections, file handles, or Spark contexts, inside the object.
- Keep the class simple when possible, especially for beginner examples.
- Use transformations such as map() and filter(), and actions such as reduce(), on the RDD after it is created.
A minimal custom class for an RDD may look like the following.
class Person implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    int age;
    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
}
The serialVersionUID is not mandatory for every small example, but it is commonly added to serializable Java classes to make serialization behavior explicit.
Java Example
The following example creates an RDD from a list of custom class objects.
CustomObjectsRDD.java
import java.io.Serializable;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import com.google.common.collect.ImmutableList;
public class CustomObjectsRDD {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
                .setMaster("local[2]").set("spark.executor.memory","2g");
        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        // prepare list of objects
        List<Person> personList = ImmutableList.of(
                new Person("Arjun", 25),
                new Person("Akhil", 2));
        // parallelize the list using JavaSparkContext
        JavaRDD<Person> perJavaRDD = sc.parallelize(personList);
        // collect() brings the elements back to the driver for printing
        for(Person person : perJavaRDD.collect()){
            System.out.println(person.name);
        }
        sc.close();
    }
}
class Person implements Serializable{
    private static final long serialVersionUID = -2685444218382696366L;
    String name;
    int age;
    public Person(String name, int age){
        this.name = name;
        this.age = age;
    }
}
Output
18/02/10 21:59:10 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/02/10 21:59:10 INFO DAGScheduler: ResultStage 0 (collect at CustomObjectsRDD.java:29) finished in 0.223 s
18/02/10 21:59:10 INFO DAGScheduler: Job 0 finished: collect at CustomObjectsRDD.java:29, took 0.661038 s
Arjun
Akhil
18/02/10 21:59:10 INFO SparkContext: Invoking stop() from shutdown hook
18/02/10 21:59:10 INFO SparkUI: Stopped Spark web UI at http://192.168.1.104:4040
18/02/10 21:59:10 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
How the custom object RDD example works
The example first creates a SparkConf object and starts a JavaSparkContext. The Person objects are stored in an immutable list. Then sc.parallelize(personList) converts that local Java list into a distributed JavaRDD<Person>.
The call to collect() is an action. It brings all elements of the RDD back to the driver program, where the example prints the name field of each Person. This is fine for a small demonstration. For large RDDs, avoid collecting all records to the driver because it can use too much driver memory.
Filtering a Spark RDD of custom Person objects
After creating an RDD with custom objects, you can apply normal RDD transformations. The following example filters only the persons whose age is at least 18.
JavaRDD<Person> adultsRDD = perJavaRDD.filter(person -> person.age >= 18);
List<Person> adults = adultsRDD.collect();
for (Person person : adults) {
    System.out.println(person.name + " - " + person.age);
}
Output
Arjun - 25
Here, filter() is a transformation. Spark creates a new RDD containing only the matching Person objects. The actual computation happens when collect() is called.
Mapping custom class objects from a Spark RDD
You can also use map() to convert an RDD of custom class objects into another RDD. For example, the following code extracts only names from JavaRDD<Person>.
JavaRDD<String> namesRDD = perJavaRDD.map(person -> person.name);
for (String name : namesRDD.collect()) {
    System.out.println(name);
}
Output
Arjun
Akhil
This pattern is common when you need to read custom records, transform them, and then extract fields for further processing.
Avoiding serialization errors with Spark custom objects
The most common issue with an RDD of custom class objects is a serialization error. If the class does not implement Serializable, Spark may fail when it tries to send objects or closures to executor nodes.
A typical error may include java.io.NotSerializableException. To avoid it, make the custom class serializable and keep non-serializable dependencies outside the object.
class Person implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    int age;
}
If a field inside Person is another custom type, that nested type should also be serializable, unless the field is marked transient and does not need to travel with the RDD element.
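The nested-field rule can be checked without Spark. The sketch below uses plain Java serialization and three hypothetical classes (Address, PersonWithAddress, PersonWithTransientAddress) that are not part of the tutorial's example; the isSerializable() helper is also an assumption made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical nested type that does NOT implement Serializable.
class Address {
    String city;
    Address(String city) { this.city = city; }
}

// The address field is written along with the object, so serialization fails.
class PersonWithAddress implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    Address address;
    PersonWithAddress(String name, Address address) {
        this.name = name;
        this.address = address;
    }
}

// A transient field is skipped by Java serialization.
// After deserialization the field would be null.
class PersonWithTransientAddress implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    transient Address address;
    PersonWithTransientAddress(String name, Address address) {
        this.name = name;
        this.address = address;
    }
}

// Hypothetical helper class for illustration; not a Spark API.
class NestedFieldCheck {

    // Returns true when the object can be written with Java serialization.
    static boolean isSerializable(Object value) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Address address = new Address("Sample City");
        System.out.println(isSerializable(new PersonWithAddress("Arjun", address)));          // false
        System.out.println(isSerializable(new PersonWithTransientAddress("Arjun", address))); // true
    }
}
```

Use transient only when the field is not needed on the executors, because its value does not survive the trip.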
When to use RDD of custom class objects in Spark
An RDD of custom class objects is useful when your processing is object-oriented or when the data does not naturally fit into a structured table. It gives you direct control over each object and the transformations applied to it.
- Use custom object RDDs for low-level transformations on domain objects.
- Use them when you already have Java objects created by application logic.
- Use them when the logic is easier to express with object methods than with table-style columns.
- Consider DataFrames for structured data, SQL queries, and workloads that can benefit from Spark SQL optimization.
For many modern Spark applications, DataFrames are easier for structured processing. However, RDDs remain useful when you need lower-level control over custom Java objects.
Best practices for Spark RDD with Java custom classes
- Make the custom class implement Serializable.
- Do not keep JavaSparkContext, database connections, or other runtime resources as fields in the custom object.
- Use collect() only for small results that safely fit in driver memory.
- Prefer take() when you only want to inspect a few custom objects.
- Use map(), filter(), and other transformations instead of manually looping over large local collections.
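As a rough illustration of the last three points, the following sketch inspects one element with take() and computes a total with map() and reduce(), so that only a single number returns to the driver. It assumes Spark is on the classpath and reuses the local[2] master from the earlier example. The PersonRddActions class and its totalAge() method are hypothetical names.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

class Person implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

// Hypothetical example class; the name is chosen only for this sketch.
class PersonRddActions {

    // Sum the ages on the executors; only the final number reaches the driver.
    static int totalAge(JavaSparkContext sc, List<Person> people) {
        JavaRDD<Person> rdd = sc.parallelize(people);
        return rdd.map(person -> person.age).reduce((a, b) -> a + b);
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("PersonRddActions").setMaster("local[2]");
        // JavaSparkContext is Closeable, so try-with-resources stops it for us.
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            List<Person> people = Arrays.asList(
                    new Person("Arjun", 25),
                    new Person("Akhil", 2));

            // take(1) fetches a single element instead of the whole RDD.
            List<Person> firstOnly = sc.parallelize(people).take(1);
            System.out.println(firstOnly.get(0).name); // Arjun

            System.out.println(totalAge(sc, people)); // 27
        }
    }
}
```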
Spark RDD custom class objects FAQ
Can a Spark RDD contain custom Java objects?
Yes. A Spark RDD can contain custom Java objects. The custom class should be serializable so Spark can send the objects between the driver and executor processes.
Why should a custom class implement Serializable for Spark RDD?
Spark distributes data and tasks across worker nodes. A custom class used inside an RDD should implement Serializable so Spark can serialize its objects when required during distributed execution.
What causes NotSerializableException in an RDD of custom objects?
NotSerializableException can occur when the custom class, a nested field, or an object captured by a Spark transformation cannot be serialized. Make the required classes serializable and avoid capturing non-serializable resources in transformations.
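The captured-object case is the easiest one to miss, because the custom class itself may be perfectly serializable. The sketch below reproduces it in plain Java, without Spark. SerializableFunction is a hypothetical stand-in for Spark's Java function interfaces, which also extend Serializable; NameFormatter and ClosureCaptureCheck are likewise made-up names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Function;

// Hypothetical stand-in for a serializable function type.
interface SerializableFunction<T, R> extends Function<T, R>, Serializable {}

// Hypothetical helper object that is NOT serializable
// (think of a connection or a client object).
class NameFormatter {
    String format(String name) {
        return "Name: " + name;
    }
}

// Hypothetical helper class for illustration; not a Spark API.
class ClosureCaptureCheck {

    // Returns true when the object can be written with Java serialization.
    static boolean canSerialize(Object value) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        }
    }

    // The lambda captures the formatter instance, so the formatter
    // must be serialized together with the lambda.
    static SerializableFunction<String, String> capturing(NameFormatter formatter) {
        return name -> formatter.format(name);
    }

    // The lambda creates the formatter when it runs, so nothing
    // non-serializable is captured.
    static SerializableFunction<String, String> selfContained() {
        return name -> new NameFormatter().format(name);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(canSerialize(capturing(new NameFormatter()))); // false
        System.out.println(canSerialize(selfContained()));                // true
    }
}
```

The same idea applies to a lambda passed to map() or filter(): create the non-serializable helper inside the function, or make the helper serializable.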
Should I use RDD custom objects or Spark DataFrame?
Use an RDD of custom objects when you need low-level object-based processing. Use a DataFrame when the data is structured into rows and columns and you want Spark SQL optimization and column-based operations.
Is collect() safe for an RDD of custom class objects?
collect() is safe only when the result is small enough to fit in driver memory. For large RDDs, use actions such as take(), write the result to storage, or process the data in distributed transformations.
Editorial QA checklist for this Spark custom object RDD tutorial
- Confirm that the Java custom class implements Serializable before being used as an RDD element.
- Check that the tutorial explains why Spark needs serialization for custom class objects.
- Verify that the examples use valid Java RDD methods such as parallelize(), filter(), map(), and collect().
- Ensure that warnings about collect() and driver memory are included for large RDDs.
- Make sure the RDD vs DataFrame guidance is balanced and does not say that RDDs are unsupported.
Conclusion
In this Spark tutorial on RDDs with custom class objects, we learned how to create an RDD from an immutable list of custom objects using JavaSparkContext.parallelize(), with the help of an example. We also covered why the custom class should be serializable, how to apply RDD transformations to custom objects, and how to avoid common serialization problems.
TutorialKart.com