Spark parallelize() creates an RDD from a local collection that already exists in the driver program. In this tutorial, we will see how SparkContext.parallelize() works, how to set the number of partitions, and when this method is a good fit in Apache Spark applications.
Spark parallelize() for creating an RDD from a driver collection
To parallelize Collections in Driver program, Spark provides SparkContext.parallelize() method. When spark parallelize method is applied on a Collection (with elements), a new distributed data set is created with specified number of partitions and the elements of the collection are copied to the distributed dataset (RDD).
Use parallelize() when the source data is already available in the driver as a small list, array, sequence, or collection. Common uses include learning Spark, building test RDDs, distributing a small list of configuration values, or running a simple example without reading from a file system.
Do not use parallelize() to load large external datasets into Spark. If the data is in files, tables, object storage, or databases, use Spark readers such as textFile(), spark.read, or the DataFrame API so that Spark can read the data in a distributed way.
SparkContext.parallelize() syntax in Java and Scala
Following is the syntax of SparkContext’s parallelize() method.
public <T> RDD<T> parallelize(scala.collection.Seq<T> seq, int numSlices, scala.reflect.ClassTag<T> evidence$1)
In Java Spark applications, you normally call the method through JavaSparkContext.
JavaRDD<T> rdd = sc.parallelize(collection);
JavaRDD<T> rddWithPartitions = sc.parallelize(collection, numSlices);
In Scala Spark applications, the call is similar.
val rdd = sc.parallelize(collection)
val rddWithPartitions = sc.parallelize(collection, numSlices)
| Parameter | Description |
|---|---|
| seq | Mandatory. Collection of items to parallelize. |
| numSlices | Optional. An integer value. The number of partitions the data would be parallelized to. |
| evidence$1 | Optional. |
Spark parallelize() method creates N number of partitions if N is specified, else Spark would set N based on the Spark Cluster the driver program is running on.
parallelize() method returns an RDD.
Note: It is important to note that parallelize() method acts lazy. Meaning parallelize() method is not actually acted upon until there is an action on the RDD. If there is any modification done to the collection(which we are parallelizing) before the action on the RDD, then when the RDD is acted upon, the modified Collection would be parallelized to RDD, not the Collection with the state you would expect at the program line SparkContext.parallelize(Collection).
Use parallelize() method only when the index of elements does not matter, because once parallelized to partitions, any transformation are done parallelly on partitions.
How numSlices controls Spark parallelize partitions
The numSlices argument controls the number of partitions created for the RDD. Each partition becomes a unit of work that Spark can schedule as a task. For example, sc.parallelize(collection, 2) creates two partitions, while sc.parallelize(collection, 4) creates four partitions.
If numSlices is not specified, Spark uses its default parallelism. In local mode, this is closely related to the number of worker threads configured in local[n]. In a cluster, it depends on the Spark configuration and available execution resources.
JavaRDD<Integer> twoPartitions = sc.parallelize(collection, 2);
JavaRDD<Integer> fourPartitions = sc.parallelize(collection, 4);
System.out.println(twoPartitions.getNumPartitions());
System.out.println(fourPartitions.getNumPartitions());
2
4
Choosing more partitions does not always make a job faster. Too few partitions may underuse the cluster, while too many partitions may add scheduling overhead. For small collections used with parallelize(), choose a simple partition count that matches the work you want to distribute.
Examples – Spark Parallelize
In the following examples we shall parallelize a Collection of elements to RDD with specified number of partitions.
Following is a Java Application to demonstrate SparkContext.parallelize()
SparkParallelizeExample.java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;
public class SparkParallelizeExample {
public static void main(String[] args) {
// configure spark
SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
.setMaster("local[2]").set("spark.executor.memory","2g");
// start a spark context
JavaSparkContext sc = new JavaSparkContext(sparkConf);
// sample collection
List<Integer> collection = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
// parallelize the collection to two partitions
JavaRDD<Integer> rdd = sc.parallelize(collection, 2);
System.out.println("Number of partitions : "+rdd.getNumPartitions());
rdd.foreach(new VoidFunction<Integer>(){
public void call(Integer number) {
System.out.println(number);
}});
}
}
Output
Number of partitions : 2
6
1
7
2
8
3
9
4
10
5
Please observe in the output that, when printing elements of RDD with two partitions, the partitions are acted upon parallelly.
The printed order can differ between runs because foreach() runs on partitions in parallel. If the order of records matters for display or testing, collect the data carefully or use transformations that explicitly handle ordering.
In the following example, we will not specify the number of partitions to parallelize() method.
ParallelizeExample.java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;
public class ParallelizeExample {
public static void main(String[] args) {
// configure spark
SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
.setMaster("local[4]").set("spark.executor.memory","2g");
// start a spark context
JavaSparkContext sc = new JavaSparkContext(sparkConf);
// sample collection
List<Integer> collection = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
// parallelize the collection to two partitions
JavaRDD<Integer> rdd = sc.parallelize(collection);
System.out.println("Number of partitions : "+rdd.getNumPartitions());
rdd.foreach(new VoidFunction<Integer>(){
public void call(Integer number) {
System.out.println(number);
}});
}
}
When Number of Partitions is not specified, it takes into account, the number of threads you mentioned while configuring your Spark Master. (setMaster(local[4]) where master can use 4 threads) .
PySpark parallelize() equivalent for Python users
The same idea is available in PySpark through SparkContext.parallelize(). The first argument is the Python collection, and the optional second argument is the number of partitions.
from pyspark import SparkContext
sc = SparkContext("local[2]", "PySpark Parallelize Example")
data = [1, 2, 3, 4, 5, 6]
rdd = sc.parallelize(data, 2)
print("Number of partitions:", rdd.getNumPartitions())
print("Collected values:", rdd.collect())
sc.stop()
Number of partitions: 2
Collected values: [1, 2, 3, 4, 5, 6]
Use collect() only for small results that can safely fit in driver memory. For large RDDs, write the output to distributed storage or use actions such as take() to inspect a limited number of records.
When Spark parallelize() is useful and when to avoid it
| Use case | Recommended approach |
|---|---|
| Small in-memory list used for a Spark example | Use sc.parallelize(list). |
| Unit test data for an RDD transformation | Use parallelize() with a small collection and known partition count. |
| Large CSV, JSON, Parquet, or text file | Use Spark file readers instead of collecting data to the driver and parallelizing it. |
| Database or table data | Use Spark SQL, JDBC, or connector-based reads where possible. |
| Many independent external API calls | Parallelize only a small control list, and handle rate limits, retries, and side effects carefully. |
The important rule is to avoid making the driver a bottleneck. parallelize() starts with data that is already present in the driver program, so it is not a replacement for distributed data loading.
Spark parallelize() checklist for Java and PySpark examples
- Use
parallelize()only for small local collections or controlled test data. - Pass
numSliceswhen you want a specific number of RDD partitions. - Check partition count with
getNumPartitions()while learning or debugging. - Do not assume
foreach()output will be printed in collection order. - Avoid
collect()on large RDDs because it brings data back to the driver. - Close or stop the Spark context after a standalone example program completes.
Common mistakes while using SparkContext.parallelize()
- Parallelizing large data from the driver: this can overload driver memory. Use Spark readers for large datasets.
- Expecting ordered console output: Spark tasks run in parallel, so output from
foreach()may appear in a different order. - Using too many partitions for tiny data: extra partitions can add overhead without useful parallelism.
- Using too few partitions for expensive work: limited partitions may reduce the amount of work Spark can run at the same time.
- Calling actions repeatedly during debugging: each action can trigger Spark jobs unless the RDD is cached where appropriate.
FAQs on Spark parallelize()
What does SparkContext.parallelize() do?
SparkContext.parallelize() creates an RDD from a local collection in the driver program. Spark splits the collection into partitions so that transformations and actions can run across those partitions.
What is numSlices in Spark parallelize()?
numSlices is the requested number of partitions for the RDD. For example, sc.parallelize(data, 4) creates an RDD with four partitions.
Should I use parallelize() for large files in Spark?
No. For large files, use Spark readers such as textFile() or spark.read. parallelize() is meant for data that is already available as a local driver-side collection.
Does Spark parallelize() preserve the order of printed output?
The RDD may contain the values from the original collection, but actions such as foreach() run on partitions in parallel. Therefore, printed output should not be expected to appear in the original list order.
What is the PySpark equivalent of JavaSparkContext parallelize?
In PySpark, use sc.parallelize(data) or sc.parallelize(data, numSlices), where data is a Python collection and numSlices is the optional partition count.
Reference links for Spark parallelize()
For API-level details, refer to the Apache Spark documentation for PySpark SparkContext.parallelize() and the general Spark RDD programming guide.
Conclusion
In this Spark Tutorial – Spark Parallelize, we have learnt how to parallelize a collection to distributed dataset (RDD) in driver program.
Use parallelize() for small driver-side collections, pass numSlices when partition count matters, and prefer Spark’s distributed readers for real datasets stored outside the driver program.
TutorialKart.com