Spark – RDD.filter()
Spark RDD Filter : RDD.filter() method returns an RDD with those elements which pass a filter condition (function) that is given as argument to the method. In this tutorial, we learn to filter RDD containing Integers, and an RDD containing Tuples, with example programs.
The filter() operation is a Spark RDD transformation. It does not change the original RDD. Instead, it creates a new RDD of the same element type after applying a boolean condition to every element.
How Spark RDD.filter() works with a boolean condition
For each element in the source RDD, Spark calls the function passed to filter(). If the function returns true, the element is kept in the output RDD. If the function returns false, the element is removed from the output RDD.
JavaRDD<T> filteredRDD = sourceRDD.filter(element -> condition);
Since filter() is lazy, Spark does not immediately scan the RDD when this line is executed. The filtering is performed only when an action such as collect(), count(), foreach(), or saveAsTextFile() is called.
Steps to apply filter to Spark RDD
To apply filter to Spark RDD,
- Create a Filter Function to be applied on an RDD.
- Use RDD<T>.filter() method with filter function passed as argument to it. The filter() method returns RDD<T> with elements filtered as per the function provided to it.
In Java Spark applications, the filter condition can be written as a named Function<T, Boolean> or as a lambda expression. Both approaches are shown in the examples below.
Spark RDD.filter() Java example for integer values
In this example, we will take an RDD with integers, and filter them using RDD.filter() method.
FilterRDD.java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
public class FilterRDD {
public static void main(String[] args) {
// configure spark
SparkConf sparkConf = new SparkConf().setAppName("Spark RDD Filter")
.setMaster("local[2]").set("spark.executor.memory","2g");
// start a spark context
JavaSparkContext sc = new JavaSparkContext(sparkConf);
// read list to RDD
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
JavaRDD<Integer> rdd = sc.parallelize(data,1);
// filter : where the number (rdd element) is exactly divisible by 3
Function<Integer, Boolean> filter = k -> ( k % 3 == 0);
// apply filter on rdd with filter passed as argument
JavaRDD<Integer> rddf = rdd.filter(filter);
// print the filtered rdd
rddf.foreach(element -> {
System.out.println(element);
});
sc.close();
}
}
Output
3
6
9
The input RDD has numbers from 1 to 10. The predicate k % 3 == 0 returns true only for numbers divisible by 3, so the filtered RDD contains 3, 6, and 9.
Using an inline lambda with JavaRDD.filter()
For short filter conditions, you can pass the lambda expression directly to filter(). The following syntax gives the same result as the integer example above.
JavaRDD<Integer> rddf = rdd.filter(number -> number % 3 == 0);
A separate Function variable is useful when the filter logic is reused or when the condition needs more than one line. An inline lambda is usually clearer for simple conditions.
Spark RDD.filter() example with Tuple2 pair RDD
In this example, we will take an RDD with Tuples as elements. We will filter this RDD using filter method. We will filter the elements based on condition that the length of string, which is second element in tuple, is equal to 5.
FilterRDD.java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;
public class FilterRDD {
public static void main(String[] args) {
// configure spark
SparkConf sparkConf = new SparkConf().setAppName("Spark RDD filter")
.setMaster("local[2]")
.set("spark.executor.memory", "2g");
// start a spark context
JavaSparkContext sc = new JavaSparkContext(sparkConf);
// read list to RDD
List<String> data = Arrays.asList("Learn", "Apache", "Spark", "with", "Tutorial Kart");
JavaRDD<String> words = sc.parallelize(data, 1);
// map each word to -> (word, word length)
JavaPairRDD<String, Integer> wordsRDD = words.mapToPair(s -> new Tuple2<>(s, s.length()));
// filter : where the second element in tuple is equal to 5. (i.e., word length == 5)
Function<Tuple2<String, Integer>, Boolean> filterFunction = w -> (w._2 == 5);
// apply the filter on wordsRDD
JavaPairRDD<String, Integer> rddf = wordsRDD.filter(filterFunction);
// print filtered rdd
rddf.foreach(item -> {
System.out.println(item);
});
sc.close();
}
}
Output
(Learn,5)
(Spark,5)
In a JavaPairRDD<String, Integer>, each element is a Scala Tuple2. The first value is accessed with _1 and the second value is accessed with _2. In this example, w._2 == 5 keeps only the tuples where the word length is 5.
Filtering Spark RDD strings by text condition
The same RDD.filter() method can be used with string RDDs. For example, the following filter keeps only words that start with the letter S.
JavaRDD<String> wordsStartingWithS = words.filter(word -> word.startsWith("S"));
If the RDD may contain null values, check for null before calling string methods.
JavaRDD<String> validWords = words.filter(word -> word != null && word.length() > 4);
This avoids a NullPointerException when a null element is present in the RDD.
Chaining multiple filter conditions on Spark RDD
You can combine multiple conditions inside a single filter() call, or chain multiple filter() transformations. For simple checks, combining conditions is compact.
JavaRDD<Integer> filteredNumbers = rdd.filter(number -> number > 2 && number % 3 == 0);
For longer logic, chaining can make each condition easier to read.
JavaRDD<Integer> filteredNumbers = rdd
.filter(number -> number > 2)
.filter(number -> number % 3 == 0);
Both versions keep numbers that are greater than 2 and exactly divisible by 3.
RDD.filter() return type and behavior in Spark
| RDD before filter | Filter condition example | RDD after filter |
|---|---|---|
JavaRDD<Integer> | number % 3 == 0 | JavaRDD<Integer> |
JavaRDD<String> | word.startsWith("S") | JavaRDD<String> |
JavaPairRDD<String, Integer> | tuple._2 == 5 | JavaPairRDD<String, Integer> |
The element type is preserved because filter() only removes elements. It does not map elements to a new structure. If you want to change each element, use map() or mapToPair() along with filtering where required.
Common mistakes in Spark RDD.filter() conditions
- Expecting immediate execution:
filter()is a transformation and runs only when an action is called. - Using expensive logic inside the predicate: The filter function runs for many elements, so keep the condition simple where possible.
- Forgetting null checks: When filtering strings or objects, check for null before calling methods on the element.
- Changing element shape inside filter: A filter predicate must return a boolean. Use
map()when the element needs to be converted. - Confusing PairRDD tuple fields: In
Tuple2, use_1for the first value and_2for the second value.
When to use RDD.filter() instead of DataFrame filter
Use RDD.filter() when the data is already represented as an RDD or when the filtering logic is easier to express with normal Java objects. If your data is structured as rows and columns, Spark DataFrame or Dataset filtering is usually more suitable because Spark can optimize column-based operations.
For more RDD basics, you may also read the Spark RDD tutorial. For the official API behavior, refer to the Apache Spark documentation for RDD transformations.
FAQ on Spark RDD.filter()
What does Spark RDD.filter() return?
RDD.filter() returns a new RDD containing only the elements for which the filter function returns true. The original RDD is not modified.
Is RDD.filter() a transformation or an action?
filter() is a transformation. It is evaluated lazily and runs only when an action such as count(), collect(), or foreach() is executed.
Can I use Spark RDD.filter() with PairRDD tuples?
Yes. You can call filter() on a JavaPairRDD. The predicate receives a Tuple2, so you can filter using _1 for the key or _2 for the value.
Does RDD.filter() change the type of elements?
No. filter() only keeps or removes elements. It does not transform an element into another type. To change element values or structure, use map(), flatMap(), or mapToPair().
How do I filter null values from a Spark RDD?
Use a predicate such as rdd.filter(element -> element != null). For strings, perform the null check before calling methods like length() or startsWith().
QA checklist for Spark RDD.filter() tutorial examples
- Verify that every filter predicate returns a boolean value.
- Confirm that the example explains
filter()as a lazy Spark transformation, not an action. - Check that integer, string, and tuple filter examples preserve the input RDD element type.
- Make sure PairRDD examples use
_1and_2correctly forTuple2fields. - Include null-safe filtering when examples call methods on string or object elements.
Conclusion
In this Spark Tutorial – Spark RDD.filter(), we have learnt to filter elements of Spark RDD with example programs. The filter() transformation keeps only the elements that satisfy the given condition, while preserving the RDD element type. It can be used with simple RDDs, string RDDs, and PairRDD tuples.
TutorialKart.com