Spark - RDD.filter() - Filter Elements

Spark – RDD.filter()

Spark RDD Filter : RDD.filter() method returns an RDD with those elements which pass a filter condition (function) that is given as argument to the method. In this tutorial, we learn to filter RDD containing Integers, and an RDD containing Tuples, with example programs.

The filter() operation is a Spark RDD transformation. It does not change the original RDD. Instead, it creates a new RDD of the same element type after applying a boolean condition to every element.

How Spark RDD.filter() works with a boolean condition

For each element in the source RDD, Spark calls the function passed to filter(). If the function returns true, the element is kept in the output RDD. If the function returns false, the element is removed from the output RDD.

</>

Copy

JavaRDD<T> filteredRDD = sourceRDD.filter(element -> condition);

Since filter() is lazy, Spark does not immediately scan the RDD when this line is executed. The filtering is performed only when an action such as collect(), count(), foreach(), or saveAsTextFile() is called.

Steps to apply filter to Spark RDD

To apply filter to Spark RDD,

Create a Filter Function to be applied on an RDD.
Use RDD<T>.filter() method with filter function passed as argument to it. The filter() method returns RDD<T> with elements filtered as per the function provided to it.

In Java Spark applications, the filter condition can be written as a named Function<T, Boolean> or as a lambda expression. Both approaches are shown in the examples below.

Spark RDD.filter() Java example for integer values

In this example, we will take an RDD with integers, and filter them using RDD.filter() method.

FilterRDD.java

</>

Copy

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class FilterRDD {

	public static void main(String[] args) {
		// configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Spark RDD Filter")
                .setMaster("local[2]").set("spark.executor.memory","2g");
        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
 
        // read list to RDD
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10); 
        JavaRDD<Integer> rdd = sc.parallelize(data,1);
        
        // filter : where the number (rdd element) is exactly divisible by 3
        Function<Integer, Boolean> filter = k -> ( k % 3 == 0);
        
        // apply filter on rdd with filter passed as argument
        JavaRDD<Integer> rddf = rdd.filter(filter);
 
        // print the filtered rdd
        rddf.foreach(element -> {
            System.out.println(element); 
        });
        
        sc.close();
	}
}

Output

3
6
9

The input RDD has numbers from 1 to 10. The predicate k % 3 == 0 returns true only for numbers divisible by 3, so the filtered RDD contains 3, 6, and 9.

Using an inline lambda with JavaRDD.filter()

For short filter conditions, you can pass the lambda expression directly to filter(). The following syntax gives the same result as the integer example above.

</>

Copy

JavaRDD<Integer> rddf = rdd.filter(number -> number % 3 == 0);

A separate Function variable is useful when the filter logic is reused or when the condition needs more than one line. An inline lambda is usually clearer for simple conditions.

Spark RDD.filter() example with Tuple2 pair RDD

In this example, we will take an RDD with Tuples as elements. We will filter this RDD using filter method. We will filter the elements based on condition that the length of string, which is second element in tuple, is equal to 5.

FilterRDD.java

</>

Copy

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class FilterRDD {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Spark RDD filter")
				.setMaster("local[2]")
				.set("spark.executor.memory", "2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);

		// read list to RDD
		List<String> data = Arrays.asList("Learn", "Apache", "Spark", "with", "Tutorial Kart");
		JavaRDD<String> words = sc.parallelize(data, 1);

		// map each word to -> (word, word length)
		JavaPairRDD<String, Integer> wordsRDD = words.mapToPair(s -> new Tuple2<>(s, s.length()));

		// filter : where the second element in tuple is equal to 5. (i.e., word length == 5)
		Function<Tuple2<String, Integer>, Boolean> filterFunction = w -> (w._2 == 5);

		// apply the filter on wordsRDD
		JavaPairRDD<String, Integer> rddf = wordsRDD.filter(filterFunction);

		// print filtered rdd
		rddf.foreach(item -> {
			System.out.println(item);
		});
		
		sc.close();
	}
}

Output

(Learn,5)
(Spark,5)

In a JavaPairRDD<String, Integer>, each element is a Scala Tuple2. The first value is accessed with _1 and the second value is accessed with _2. In this example, w._2 == 5 keeps only the tuples where the word length is 5.

Filtering Spark RDD strings by text condition

The same RDD.filter() method can be used with string RDDs. For example, the following filter keeps only words that start with the letter S.

</>

Copy

JavaRDD<String> wordsStartingWithS = words.filter(word -> word.startsWith("S"));

If the RDD may contain null values, check for null before calling string methods.

</>

Copy

JavaRDD<String> validWords = words.filter(word -> word != null && word.length() > 4);

This avoids a NullPointerException when a null element is present in the RDD.

Chaining multiple filter conditions on Spark RDD

You can combine multiple conditions inside a single filter() call, or chain multiple filter() transformations. For simple checks, combining conditions is compact.

</>

Copy

JavaRDD<Integer> filteredNumbers = rdd.filter(number -> number > 2 && number % 3 == 0);

For longer logic, chaining can make each condition easier to read.

</>

Copy

JavaRDD<Integer> filteredNumbers = rdd
        .filter(number -> number > 2)
        .filter(number -> number % 3 == 0);

Both versions keep numbers that are greater than 2 and exactly divisible by 3.

RDD.filter() return type and behavior in Spark

RDD before filter	Filter condition example	RDD after filter
`JavaRDD<Integer>`	`number % 3 == 0`	`JavaRDD<Integer>`
`JavaRDD<String>`	`word.startsWith("S")`	`JavaRDD<String>`
`JavaPairRDD<String, Integer>`	`tuple._2 == 5`	`JavaPairRDD<String, Integer>`

The element type is preserved because filter() only removes elements. It does not map elements to a new structure. If you want to change each element, use map() or mapToPair() along with filtering where required.

Common mistakes in Spark RDD.filter() conditions

Expecting immediate execution: filter() is a transformation and runs only when an action is called.
Using expensive logic inside the predicate: The filter function runs for many elements, so keep the condition simple where possible.
Forgetting null checks: When filtering strings or objects, check for null before calling methods on the element.
Changing element shape inside filter: A filter predicate must return a boolean. Use map() when the element needs to be converted.
Confusing PairRDD tuple fields: In Tuple2, use _1 for the first value and _2 for the second value.

When to use RDD.filter() instead of DataFrame filter

Use RDD.filter() when the data is already represented as an RDD or when the filtering logic is easier to express with normal Java objects. If your data is structured as rows and columns, Spark DataFrame or Dataset filtering is usually more suitable because Spark can optimize column-based operations.

For more RDD basics, you may also read the Spark RDD tutorial. For the official API behavior, refer to the Apache Spark documentation for RDD transformations.

FAQ on Spark RDD.filter()

What does Spark RDD.filter() return?

RDD.filter() returns a new RDD containing only the elements for which the filter function returns true. The original RDD is not modified.

Is RDD.filter() a transformation or an action?

filter() is a transformation. It is evaluated lazily and runs only when an action such as count(), collect(), or foreach() is executed.

Can I use Spark RDD.filter() with PairRDD tuples?

Yes. You can call filter() on a JavaPairRDD. The predicate receives a Tuple2, so you can filter using _1 for the key or _2 for the value.

Does RDD.filter() change the type of elements?

No. filter() only keeps or removes elements. It does not transform an element into another type. To change element values or structure, use map(), flatMap(), or mapToPair().

How do I filter null values from a Spark RDD?

Use a predicate such as rdd.filter(element -> element != null). For strings, perform the null check before calling methods like length() or startsWith().

QA checklist for Spark RDD.filter() tutorial examples

Verify that every filter predicate returns a boolean value.
Confirm that the example explains filter() as a lazy Spark transformation, not an action.
Check that integer, string, and tuple filter examples preserve the input RDD element type.
Make sure PairRDD examples use _1 and _2 correctly for Tuple2 fields.
Include null-safe filtering when examples call methods on string or object elements.

Conclusion

In this Spark Tutorial – Spark RDD.filter(), we have learnt to filter elements of Spark RDD with example programs. The filter() transformation keeps only the elements that satisfy the given condition, while preserving the RDD element type. It can be used with simple RDDs, string RDDs, and PairRDD tuples.

TutorialKart.com

Spark – RDD.filter() – Filter Elements

Spark – RDD.filter()

How Spark RDD.filter() works with a boolean condition

Steps to apply filter to Spark RDD

Spark RDD.filter() Java example for integer values

Using an inline lambda with JavaRDD.filter()

Spark RDD.filter() example with Tuple2 pair RDD

Filtering Spark RDD strings by text condition

Chaining multiple filter conditions on Spark RDD

RDD.filter() return type and behavior in Spark

Common mistakes in Spark RDD.filter() conditions

When to use RDD.filter() instead of DataFrame filter

FAQ on Spark RDD.filter()

QA checklist for Spark RDD.filter() tutorial examples

Conclusion

Popular Courses

SAP

CRM

SAP Resources

Apache

GUI

Programming

Databases

Mobile

Linux

Web & Server

Testing

Learning