Spark Parallelize

Spark parallelize() creates an RDD from a local collection that already exists in the driver program. In this tutorial, we will see how SparkContext.parallelize() works, how to set the number of partitions, and when this method is a good fit in Apache Spark applications.

Spark parallelize() for creating an RDD from a driver collection

To parallelize Collections in Driver program, Spark provides SparkContext.parallelize() method. When spark parallelize method is applied on a Collection (with elements), a new distributed data set is created with specified number of partitions and the elements of the collection are copied to the distributed dataset (RDD).

Use parallelize() when the source data is already available in the driver as a small list, array, sequence, or collection. Common uses include learning Spark, building test RDDs, distributing a small list of configuration values, or running a simple example without reading from a file system.

Do not use parallelize() to load large external datasets into Spark. If the data is in files, tables, object storage, or databases, use Spark readers such as textFile(), spark.read, or the DataFrame API so that Spark can read the data in a distributed way.

SparkContext.parallelize() syntax in Java and Scala

Following is the syntax of SparkContext’s parallelize() method.

</>

Copy

public <T> RDD<T> parallelize(scala.collection.Seq<T> seq, int numSlices, scala.reflect.ClassTag<T> evidence$1)

In Java Spark applications, you normally call the method through JavaSparkContext.

</>

Copy

JavaRDD<T> rdd = sc.parallelize(collection);
JavaRDD<T> rddWithPartitions = sc.parallelize(collection, numSlices);

In Scala Spark applications, the call is similar.

</>

Copy

val rdd = sc.parallelize(collection)
val rddWithPartitions = sc.parallelize(collection, numSlices)

Parameter	Description
seq	Mandatory. Collection of items to parallelize.
numSlices	Optional. An integer value. The number of partitions the data would be parallelized to.
evidence$1	Optional.

Spark parallelize() method creates N number of partitions if N is specified, else Spark would set N based on the Spark Cluster the driver program is running on.

parallelize() method returns an RDD.

Note: It is important to note that parallelize() method acts lazy. Meaning parallelize() method is not actually acted upon until there is an action on the RDD. If there is any modification done to the collection(which we are parallelizing) before the action on the RDD, then when the RDD is acted upon, the modified Collection would be parallelized to RDD, not the Collection with the state you would expect at the program line SparkContext.parallelize(Collection).

Use parallelize() method only when the index of elements does not matter, because once parallelized to partitions, any transformation are done parallelly on partitions.

How numSlices controls Spark parallelize partitions

The numSlices argument controls the number of partitions created for the RDD. Each partition becomes a unit of work that Spark can schedule as a task. For example, sc.parallelize(collection, 2) creates two partitions, while sc.parallelize(collection, 4) creates four partitions.

If numSlices is not specified, Spark uses its default parallelism. In local mode, this is closely related to the number of worker threads configured in local[n]. In a cluster, it depends on the Spark configuration and available execution resources.

</>

Copy

JavaRDD<Integer> twoPartitions = sc.parallelize(collection, 2);
JavaRDD<Integer> fourPartitions = sc.parallelize(collection, 4);

System.out.println(twoPartitions.getNumPartitions());
System.out.println(fourPartitions.getNumPartitions());

2
4

Choosing more partitions does not always make a job faster. Too few partitions may underuse the cluster, while too many partitions may add scheduling overhead. For small collections used with parallelize(), choose a simple partition count that matches the work you want to distribute.

Examples – Spark Parallelize

In the following examples we shall parallelize a Collection of elements to RDD with specified number of partitions.

Following is a Java Application to demonstrate SparkContext.parallelize()

SparkParallelizeExample.java

</>

Copy

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class SparkParallelizeExample {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
		                                .setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// sample collection
		List<Integer> collection = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
		
		// parallelize the collection to two partitions
		JavaRDD<Integer> rdd = sc.parallelize(collection, 2);
		
		System.out.println("Number of partitions : "+rdd.getNumPartitions());
		
		rdd.foreach(new VoidFunction<Integer>(){ 
	          public void call(Integer number) {
	              System.out.println(number); 
	    }});
	}
}

Output

Number of partitions : 2


6
1
7
2
8
3
9
4
10
5

Please observe in the output that, when printing elements of RDD with two partitions, the partitions are acted upon parallelly.

The printed order can differ between runs because foreach() runs on partitions in parallel. If the order of records matters for display or testing, collect the data carefully or use transformations that explicitly handle ordering.

In the following example, we will not specify the number of partitions to parallelize() method.

ParallelizeExample.java

</>

Copy

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class ParallelizeExample {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Print Elements of RDD")
		                                .setMaster("local[4]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// sample collection
		List<Integer> collection = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
		
		// parallelize the collection to two partitions
		JavaRDD<Integer> rdd = sc.parallelize(collection);
		
		System.out.println("Number of partitions : "+rdd.getNumPartitions());
		
		rdd.foreach(new VoidFunction<Integer>(){ 
	          public void call(Integer number) {
	              System.out.println(number); 
	    }});
	}
}

When Number of Partitions is not specified, it takes into account, the number of threads you mentioned while configuring your Spark Master. (setMaster(local[4]) where master can use 4 threads) .

PySpark parallelize() equivalent for Python users

The same idea is available in PySpark through SparkContext.parallelize(). The first argument is the Python collection, and the optional second argument is the number of partitions.

</>

Copy

from pyspark import SparkContext

sc = SparkContext("local[2]", "PySpark Parallelize Example")

data = [1, 2, 3, 4, 5, 6]
rdd = sc.parallelize(data, 2)

print("Number of partitions:", rdd.getNumPartitions())
print("Collected values:", rdd.collect())

sc.stop()

Number of partitions: 2
Collected values: [1, 2, 3, 4, 5, 6]

Use collect() only for small results that can safely fit in driver memory. For large RDDs, write the output to distributed storage or use actions such as take() to inspect a limited number of records.

When Spark parallelize() is useful and when to avoid it

Use case	Recommended approach
Small in-memory list used for a Spark example	Use `sc.parallelize(list)`.
Unit test data for an RDD transformation	Use `parallelize()` with a small collection and known partition count.
Large CSV, JSON, Parquet, or text file	Use Spark file readers instead of collecting data to the driver and parallelizing it.
Database or table data	Use Spark SQL, JDBC, or connector-based reads where possible.
Many independent external API calls	Parallelize only a small control list, and handle rate limits, retries, and side effects carefully.

The important rule is to avoid making the driver a bottleneck. parallelize() starts with data that is already present in the driver program, so it is not a replacement for distributed data loading.

Spark parallelize() checklist for Java and PySpark examples

Use parallelize() only for small local collections or controlled test data.
Pass numSlices when you want a specific number of RDD partitions.
Check partition count with getNumPartitions() while learning or debugging.
Do not assume foreach() output will be printed in collection order.
Avoid collect() on large RDDs because it brings data back to the driver.
Close or stop the Spark context after a standalone example program completes.

Common mistakes while using SparkContext.parallelize()

Parallelizing large data from the driver: this can overload driver memory. Use Spark readers for large datasets.
Expecting ordered console output: Spark tasks run in parallel, so output from foreach() may appear in a different order.
Using too many partitions for tiny data: extra partitions can add overhead without useful parallelism.
Using too few partitions for expensive work: limited partitions may reduce the amount of work Spark can run at the same time.
Calling actions repeatedly during debugging: each action can trigger Spark jobs unless the RDD is cached where appropriate.

FAQs on Spark parallelize()

What does SparkContext.parallelize() do?

SparkContext.parallelize() creates an RDD from a local collection in the driver program. Spark splits the collection into partitions so that transformations and actions can run across those partitions.

What is numSlices in Spark parallelize()?

numSlices is the requested number of partitions for the RDD. For example, sc.parallelize(data, 4) creates an RDD with four partitions.

Should I use parallelize() for large files in Spark?

No. For large files, use Spark readers such as textFile() or spark.read. parallelize() is meant for data that is already available as a local driver-side collection.

Does Spark parallelize() preserve the order of printed output?

The RDD may contain the values from the original collection, but actions such as foreach() run on partitions in parallel. Therefore, printed output should not be expected to appear in the original list order.

What is the PySpark equivalent of JavaSparkContext parallelize?

In PySpark, use sc.parallelize(data) or sc.parallelize(data, numSlices), where data is a Python collection and numSlices is the optional partition count.

Reference links for Spark parallelize()

For API-level details, refer to the Apache Spark documentation for PySpark SparkContext.parallelize() and the general Spark RDD programming guide.

Conclusion

In this Spark Tutorial – Spark Parallelize, we have learnt how to parallelize a collection to distributed dataset (RDD) in driver program.

Use parallelize() for small driver-side collections, pass numSlices when partition count matters, and prefer Spark’s distributed readers for real datasets stored outside the driver program.

TutorialKart.com

Spark Parallelize – Example

Spark parallelize() for creating an RDD from a driver collection

SparkContext.parallelize() syntax in Java and Scala

How numSlices controls Spark parallelize partitions

Examples – Spark Parallelize

PySpark parallelize() equivalent for Python users

When Spark parallelize() is useful and when to avoid it

Spark parallelize() checklist for Java and PySpark examples

Common mistakes while using SparkContext.parallelize()

FAQs on Spark parallelize()

Reference links for Spark parallelize()

Conclusion

Popular Courses

SAP

CRM

SAP Resources

Apache

GUI

Programming

Databases

Mobile

Linux

Web & Server

Testing

Learning