Spark RDD with custom class objects

In Apache Spark, an RDD can store elements of a custom Java class, such as Person, Employee, Order, or any domain object used in your application. To create a Spark RDD with custom class objects in Java, make the custom class serializable, create a list of objects, and pass the list to SparkContext.parallelize() or JavaSparkContext.parallelize().

The important point is serialization. Spark sends functions and data between the driver and executor processes. If a custom object is part of an RDD, Spark must be able to serialize that object before it can distribute work across the cluster.

Requirements for a Java custom class used inside Spark RDD

Before using a Java class as the element type of an RDD, check these basic requirements.

The custom class should implement java.io.Serializable.
Fields used inside Spark transformations should also be serializable.
Avoid storing non-serializable resources, such as open database connections, file handles, or Spark contexts, inside the object.
Keep the class simple when possible, especially for beginner examples.
Use transformations such as map(), filter(), and reduce() on the RDD after it is created.

A minimal custom class for an RDD may look like the following.

</>

Copy

class Person implements Serializable {
    private static final long serialVersionUID = 1L;

    String name;
    int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }
}

The serialVersionUID is not mandatory for every small example, but it is commonly added to serializable Java classes to make serialization behavior explicit.

Java Example

Following example demonstrates the creation of RDD with list of class objects.

CustomObjectsRDD.java

</>

Copy

import java.io.Serializable;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import com.google.common.collect.ImmutableList;

public class CustomObjectsRDD {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
		                                .setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// prepare list of objects
		List<Person> personList = ImmutableList.of(
			    new Person("Arjun", 25),
			    new Person("Akhil", 2));
		
		// parallelize the list using SparkContext
		JavaRDD<Person> perJavaRDD = sc.parallelize(personList);
		
		for(Person person : perJavaRDD.collect()){
			System.out.println(person.name);
		}
		
		sc.close();
	}
}

class Person implements Serializable{
	private static final long serialVersionUID = -2685444218382696366L;
	String name;
	int age;
	public Person(String name, int age){
		this.name = name;
		this.age = age;
	}
}

Output

18/02/10 21:59:10 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
18/02/10 21:59:10 INFO DAGScheduler: ResultStage 0 (collect at CustomObjectsRDD.java:29) finished in 0.223 s
18/02/10 21:59:10 INFO DAGScheduler: Job 0 finished: collect at CustomObjectsRDD.java:29, took 0.661038 s
Arjun
Akhil
18/02/10 21:59:10 INFO SparkContext: Invoking stop() from shutdown hook
18/02/10 21:59:10 INFO SparkUI: Stopped Spark web UI at http://192.168.1.104:4040
18/02/10 21:59:10 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

How the custom object RDD example works

The example first creates a SparkConf object and starts a JavaSparkContext. The Person objects are stored in an immutable list. Then sc.parallelize(personList) converts that local Java list into a distributed JavaRDD<Person>.

The call to collect() is an action. It brings all elements of the RDD back to the driver program, where the example prints the name field of each Person. This is fine for a small demonstration. For large RDDs, avoid collecting all records to the driver because it can use too much driver memory.

Filtering a Spark RDD of custom Person objects

After creating an RDD with custom objects, you can apply normal RDD transformations. The following example filters only the persons whose age is at least 18.

</>

Copy

JavaRDD<Person> adultsRDD = perJavaRDD.filter(person -> person.age >= 18);

List<Person> adults = adultsRDD.collect();

for (Person person : adults) {
    System.out.println(person.name + " - " + person.age);
}

Arjun - 25

Here, filter() is a transformation. Spark creates a new RDD containing only the matching Person objects. The actual computation happens when collect() is called.

Mapping custom class objects from a Spark RDD

You can also use map() to convert an RDD of custom class objects into another RDD. For example, the following code extracts only names from JavaRDD<Person>.

</>

Copy

JavaRDD<String> namesRDD = perJavaRDD.map(person -> person.name);

for (String name : namesRDD.collect()) {
    System.out.println(name);
}

Arjun
Akhil

This pattern is common when you need to read custom records, transform them, and then extract fields for further processing.

Avoiding serialization errors with Spark custom objects

The most common issue with an RDD of custom class objects is a serialization error. If the class does not implement Serializable, Spark may fail when it tries to send objects or closures to executor nodes.

A typical error may include java.io.NotSerializableException. To avoid it, make the custom class serializable and keep non-serializable dependencies outside the object.

</>

Copy

class Person implements Serializable {
    private static final long serialVersionUID = 1L;

    String name;
    int age;
}

If a field inside Person is another custom type, that nested type should also be serializable, unless the field is marked transient and does not need to travel with the RDD element.

When to use RDD of custom class objects in Spark

An RDD of custom class objects is useful when your processing is object-oriented or when the data does not naturally fit into a structured table. It gives you direct control over each object and the transformations applied to it.

Use custom object RDDs for low-level transformations on domain objects.
Use them when you already have Java objects created by application logic.
Use them when the logic is easier to express with object methods than with table-style columns.
Consider DataFrames for structured data, SQL queries, and workloads that can benefit from Spark SQL optimization.

For many modern Spark applications, DataFrames are easier for structured processing. However, RDDs remain useful when you need lower-level control over custom Java objects.

Best practices for Spark RDD with Java custom classes

Make the custom class implement Serializable.
Do not keep JavaSparkContext, database connections, or other runtime resources as fields in the custom object.
Use collect() only for small results that safely fit in driver memory.
Prefer take() when you only want to inspect a few custom objects.
Use map(), filter(), and other transformations instead of manually looping over large local collections.

Spark RDD custom class objects FAQ

Can a Spark RDD contain custom Java objects?

Yes. A Spark RDD can contain custom Java objects. The custom class should be serializable so Spark can send the objects between the driver and executor processes.

Why should a custom class implement Serializable for Spark RDD?

Spark distributes data and tasks across worker nodes. A custom class used inside an RDD should implement Serializable so Spark can serialize its objects when required during distributed execution.

What causes NotSerializableException in an RDD of custom objects?

NotSerializableException can occur when the custom class, a nested field, or an object captured by a Spark transformation cannot be serialized. Make the required classes serializable and avoid capturing non-serializable resources in transformations.

Should I use RDD custom objects or Spark DataFrame?

Use an RDD of custom objects when you need low-level object-based processing. Use a DataFrame when the data is structured into rows and columns and you want Spark SQL optimization and column-based operations.

Is collect() safe for an RDD of custom class objects?

collect() is safe only when the result is small enough to fit in driver memory. For large RDDs, use actions such as take(), write the result to storage, or process the data in distributed transformations.

Editorial QA checklist for this Spark custom object RDD tutorial

Confirm that the Java custom class implements Serializable before being used as an RDD element.
Check that the tutorial explains why Spark needs serialization for custom class objects.
Verify that the examples use valid Java RDD methods such as parallelize(), filter(), map(), and collect().
Ensure that warnings about collect() and driver memory are included for large RDDs.
Make sure the RDD vs DataFrame guidance is balanced and does not say that RDDs are unsupported.

Conclusion

In this Spark Tutorial – Spark RDD with custom class objects, we have learnt to initialize RDD from an immutable list of custom objects using SparkContext.parallelize(), with the help of an Example. We also covered why the custom class should be serializable, how to apply RDD transformations on custom objects, and how to avoid common serialization problems.

TutorialKart.com

Spark RDD with custom class objects