Spark - Read multiple text files to single RDD

Spark textFile() to read multiple text files into one RDD

To read multiple text files into a single RDD in Apache Spark, pass a comma-separated list of file paths, a directory path, or a file-name pattern to SparkContext.textFile(). The returned RDD contains one record for each line found in the input text files.

This tutorial shows the common RDD-based ways to read text files in Java and PySpark: selected files, all files in a directory, files from multiple directories, and files that match a pattern. The examples use local paths, but the same idea applies to supported file systems such as HDFS, S3, or other Hadoop-compatible storage when the path is configured correctly.

For DataFrame-based text loading, Spark also provides spark.read.text(). Use textFile() when your processing is already written with RDD transformations, and use the Spark SQL text data source when you want a DataFrame with a value column. Reference: Apache Spark RDD programming guide and Spark text data source documentation.

Examples covered for reading multiple text files to one RDD

Read multiple text files to single RDD [Java Example] [Python Example]
Read all text files in a directory to single RDD [Java Example] [Python Example]
Read all text files in multiple directories to single RDD [Java Example] [Python Example]
Read all text files matching a pattern to single RDD [Java Example] [Python Example]

Spark textFile path formats for multiple input files

The path passed to textFile() is interpreted by the file system used by Spark. These are the formats used most often when reading multiple text files into one RDD.

Requirement	Path format	Example
Specific files	Comma-separated file paths	`file1.txt,file2.txt,file3.txt`
All files in one folder	Directory path	`data/rdd/input`
Files from multiple folders	Comma-separated directory paths	`data/rdd/input,data/rdd/anotherFolder`
Files matching names	Glob pattern	`data/rdd/input/file[0-3].txt`

Do not put spaces around commas in the input path string. For example, use file1.txt,file2.txt, not file1.txt, file2.txt. A space after the comma becomes part of the next path and can make Spark look for a path that does not exist.
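
One way to avoid stray spaces is to build the path string from a list instead of typing it by hand. The following is a minimal plain-Python sketch of that idea. The helper name join_input_paths is hypothetical and is not part of the Spark API.

```python
def join_input_paths(paths):
    """Join paths into one comma-separated string with no surrounding spaces."""
    # Strip accidental whitespace from each path and skip empty entries
    cleaned = [p.strip() for p in paths if p.strip()]
    return ",".join(cleaned)


paths = [
    "data/rdd/input/file1.txt",
    " data/rdd/input/file2.txt",   # leading space added on purpose
    "data/rdd/input/file3.txt ",   # trailing space added on purpose
]
print(join_input_paths(paths))
```

The resulting string can then be passed to textFile() in place of a hand-written literal.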

Read multiple text files to single RDD using comma-separated paths

In this example, we have three text files to read. We write the paths of these three files as comma-separated values in a single string literal. Then, using the textFile() method, we read the content of all three text files into a single RDD.

This approach is useful when the input files are known in advance and you do not want to read every file from the directory.

First we shall write this using Java.

FileToRddExample.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
		                                .setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// provide text file paths to be read to RDD, separated by comma (no spaces around the commas)
		String files = "data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt";
		
		// read text files to RDD
		JavaRDD<String> lines = sc.textFile(files);
		
		// collect RDD for printing
		for(String line:lines.collect()){
		    System.out.println(line);
		}
	}
}

Note: Take care when providing the input file paths. The path string must not contain spaces before or after the commas.

For example, if the files are in the same directory, write the comma-separated path string without spaces as shown below.

String files = "data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt";

file1.txt

This is File 1
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD

file2.txt

This is File 2
Learn to read multiple text files to a single RDD

file3.txt

This is File 3
Learn to read multiple text files to a single RDD

Output

18/02/10 12:13:26 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
18/02/10 12:13:26 INFO DAGScheduler: ResultStage 0 (collect at FileToRddExample.java:21) finished in 0.613 s
18/02/10 12:13:26 INFO DAGScheduler: Job 0 finished: collect at FileToRddExample.java:21, took 0.888843 s
This is File 1
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
This is File 2
Learn to read multiple text files to a single RDD
This is File 3
Learn to read multiple text files to a single RDD
18/02/10 12:13:26 INFO SparkContext: Invoking stop() from shutdown hook
18/02/10 12:13:26 INFO SparkUI: Stopped Spark web UI at http://192.168.1.104:4040
18/02/10 12:13:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!

The output contains lines from all three files in one RDD. The exact ordering can depend on how Spark partitions and reads the files, so do not depend on this output order for application logic unless you explicitly add ordering after reading.

Next, we use Python to read the same three text files into a single RDD with the textFile() method.

readToRdd.py

import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Read Text to RDD - Python")
  sc = SparkContext(conf=conf)

  # read the three input text files, given as comma-separated paths, to one RDD
  lines = sc.textFile("data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt")

  # collect the RDD to a list
  llist = lines.collect()

  # print the list
  for line in llist:
    print(line)

Run this Spark Application using spark-submit by executing the following command.

$ spark-submit readToRdd.py

Note: As in the Java example, the path string must not contain spaces before or after the commas.

In newer PySpark applications, it is also common to create a SparkSession first and then use its sparkContext for RDD operations.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Read Multiple Text Files to RDD").getOrCreate()
sc = spark.sparkContext

paths = "data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt"
lines = sc.textFile(paths)

for line in lines.take(10):
    print(line)

The example uses take(10) instead of collect() to preview a small number of records. Use collect() only for small examples, because it brings all records from the RDD to the driver program.

Read all text files in a directory to single RDD

Now, we shall write a Spark application that reads all the text files in a given directory into a single RDD.

When a directory path is passed to textFile(), Spark reads the files under that directory as text input. Use this method when all files in the folder are part of the same logical dataset.

Following is a Spark Application written in Java to read the content of all text files, in a directory, to an RDD.

FileToRddExample.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
		                                .setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// provide path to directory containing text files
		String files = "data/rdd/input";
		
		// read text files to RDD
		JavaRDD<String> lines = sc.textFile(files);
		
		// collect RDD for printing
		for(String line:lines.collect()){
		    System.out.println(line);
		}
	}
}

In the above example, we have given the directory path via the variable files.

All the text files inside the given directory path, data/rdd/input, are read into the lines RDD.

Before using a directory path, keep only the files you want Spark to read in that folder, or use a pattern if the directory also contains unrelated files.
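
Before handing a directory to Spark, it can help to list which files a pattern would select. The following is a minimal sketch that uses only the Python standard library, not Spark. The function name preview_matching_files is hypothetical, and Python's glob rules are only assumed to agree with the file system's rules for simple patterns such as *.txt.

```python
from pathlib import Path


def preview_matching_files(directory, pattern="*.txt"):
    """Return the sorted names of files in `directory` that match `pattern`."""
    return sorted(p.name for p in Path(directory).glob(pattern) if p.is_file())


if __name__ == "__main__":
    # data/rdd/input is the example directory used in this tutorial
    for name in preview_matching_files("data/rdd/input"):
        print(name)
```

If the listing contains unrelated files, narrow the pattern or move those files out of the folder before reading it.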

Now, we shall write a Spark application in Python that does the same job: it reads all text files in a directory into an RDD.

readToRdd.py

import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Read Text to RDD - Python")
  sc = SparkContext(conf=conf)

  # read input text files present in the directory to RDD
  lines = sc.textFile("data/rdd/input")

  # collect the RDD to a list
  llist = lines.collect()

  # print the list
  for line in llist:
    print(line)

Run the above Python Spark application by executing the following command in a console.

$ spark-submit readToRdd.py

Read all text files in multiple directories to single RDD

The previous sections read selected text files and all text files in one directory. This section reads all text files from several directories into a single RDD.

Use comma-separated directory paths when the same dataset is split across more than one folder. Spark reads the input paths together and creates one RDD of lines.

First, we shall write a Java application to read all text files in multiple directories.

FileToRddExample.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
		                                .setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// provide paths to the directories containing text files, separated by comma
		String directories = "data/rdd/input,data/rdd/anotherFolder";
		
		// read text files to RDD
		JavaRDD<String> lines = sc.textFile(directories);
		
		// collect RDD for printing
		for(String line:lines.collect()){
		    System.out.println(line);
		}
	}
}

All the text files in both directories listed in the variable directories are read into the RDD. Similarly, you may provide more than two directories.

Let us write the same program in Python.

readToRdd.py

import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Read Text to RDD - Python")
  sc = SparkContext(conf=conf)

  # read the text files present in both directories to one RDD
  lines = sc.textFile("data/rdd/input,data/rdd/anotherFolder")

  # collect the RDD to a list
  llist = lines.collect()

  # print the list
  for line in llist:
    print(line)

You may submit this Python application to Spark, by running the following command.

$ spark-submit readToRdd.py

Read text files matching a glob pattern to single RDD

In this scenario, a pattern selects the files to read: every file whose name matches the pattern is read into the RDD. These patterns are file-system glob patterns rather than full regular expressions. They are useful when file names follow a predictable naming rule, such as file1.txt, file2.txt, and file3.txt.

Let us write a Java application that reads only the files that match a given pattern.

FileToRddExample.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
		                                .setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// provide glob patterns for the text files to read, separated by comma
		String files = "data/rdd/input/file[0-3].txt,data/rdd/anotherFolder/file*";
		
		// read text files to RDD
		JavaRDD<String> lines = sc.textFile(files);
		
		// collect RDD for printing
		for(String line:lines.collect()){
		    System.out.println(line);
		}
	}
}

file[0-3].txt matches file0.txt, file1.txt, file2.txt, and file3.txt. Whichever of these files exist are read into the RDD.
file* matches any file whose name starts with the string file, for example file-hello.txt, file2.txt, or filehing.txt.
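
The matching rules above can be tried out without Spark. The following is a small sketch that uses Python's fnmatch module as a stand-in; it is not the matcher Spark uses, and the two are only assumed to agree for simple patterns like these. The file names in the list are made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only to demonstrate the two patterns
names = ["file0.txt", "file2.txt", "file4.txt", "file-hello.txt", "notes.txt"]

# file[0-3].txt: "file", then exactly one character in the range 0-3, then ".txt"
range_matches = [n for n in names if fnmatchcase(n, "file[0-3].txt")]

# file*: any name that starts with "file"
prefix_matches = [n for n in names if fnmatchcase(n, "file*")]

print(range_matches)
print(prefix_matches)
```

Here file4.txt is excluded by the first pattern because 4 is outside the range, and notes.txt is excluded by both patterns.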

Following is a Python application that reads into an RDD the files whose names match a specific pattern.

readToRdd.py

import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Read Text to RDD - Python")
  sc = SparkContext(conf=conf)

  # read the text files whose names match the glob patterns to one RDD
  lines = sc.textFile("data/rdd/input/file[0-3].txt,data/rdd/anotherFolder/file*")

  # collect the RDD to a list
  llist = lines.collect()

  # print the list
  for line in llist:
    print(line)

Check which file each RDD line came from with wholeTextFiles()

textFile() returns an RDD of lines and does not keep the file name in each record. If your processing needs the source file name along with file content, use wholeTextFiles(). It returns pairs where the key is the file path and the value is the complete file content.

file_contents = sc.wholeTextFiles("data/rdd/input/*.txt")

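for file_path, content in file_contents.take(3):
    print(file_path)
    # print the first line of the file, or an empty string if the file is empty
    file_lines = content.splitlines()
    print(file_lines[0] if file_lines else "")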

Use wholeTextFiles() carefully for very large files, because each file’s content is loaded as one value. For normal line-by-line processing, textFile() is usually the better RDD API.
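
If each line needs to carry its source file, the (path, content) pairs can be split into (path, line) pairs. The following is a minimal sketch of such a splitting function in plain Python. The name lines_with_source is hypothetical, and the sample pair is a hand-written stand-in for one record from wholeTextFiles(), not real Spark output.

```python
def lines_with_source(pair):
    """Turn one (file_path, content) pair into a list of (file_path, line) pairs."""
    file_path, content = pair
    return [(file_path, line) for line in content.splitlines()]


# Hand-written stand-in for one (path, content) record; the path is illustrative
sample = (
    "file:/data/rdd/input/file2.txt",
    "This is File 2\nLearn to read multiple text files to a single RDD\n",
)

for file_path, line in lines_with_source(sample):
    print(file_path, "|", line)
```

With a live SparkContext, a function like this could be passed to flatMap() on the RDD returned by wholeTextFiles(), which would give one (path, line) record per input line.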

Spark RDD text file reading issues and fixes

Spaces in comma-separated paths: remove spaces around commas, because Spark treats the space as part of the next path.
Wrong relative path: a relative path without a scheme is resolved against the default file system. For the local file system, that is the working directory of the driver started by spark-submit. Use absolute paths or storage URIs when needed.
Too much data in driver memory: avoid collect() on large RDDs. Use actions such as take(), count(), or write the result to storage.
Unexpected files read from a directory: pass a more specific glob pattern, such as data/rdd/input/*.txt.
Need DataFrame output: use spark.read.text(path) when you want a DataFrame instead of an RDD.
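
For the relative-path issue above, one option is to resolve the path on the driver and pass an explicit file URI. The following is a minimal sketch using the Python standard library. The helper name to_file_uri is hypothetical, and it assumes the input is on the local file system of the machine where the script runs.

```python
from pathlib import Path


def to_file_uri(path):
    """Resolve a local path against the current working directory and return a file URI."""
    return Path(path).resolve().as_uri()


# Prints a URI that starts with file:// and ends with the given relative path
print(to_file_uri("data/rdd/input"))
```

The printed URI depends on the directory the script is started from. On a cluster, a local path like this must also be accessible at the same location on the worker nodes.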

FAQ on Spark reading multiple text files into one RDD

Can Spark read multiple text files into one RDD?

Yes. Spark can read multiple text files into one RDD by using sc.textFile() with comma-separated paths, a directory path, or a glob pattern. Each line in the input files becomes one element in the resulting RDD.

How do I read only selected text files in PySpark?

Pass the selected file paths as one comma-separated string to sc.textFile(), without spaces around commas. Example: sc.textFile("input/a.txt,input/b.txt").

How do I read all text files from a folder in Spark?

Pass the folder path to textFile(), such as sc.textFile("data/rdd/input"). Spark reads the text input from that directory into a single RDD.

Is Spark textFile pattern matching a regular expression?

No. In file paths, Spark commonly uses file-system glob patterns such as * and [0-3], not full regular expressions. The exact pattern behavior depends on the underlying file system.

Should I use sc.textFile() or spark.read.text()?

Use sc.textFile() when you need an RDD. Use spark.read.text() when you want a DataFrame and plan to work with Spark SQL or DataFrame transformations.

Editorial QA checklist for this Spark RDD tutorial

Confirm every textFile() example explains whether the input is selected files, one directory, multiple directories, or a glob pattern.
Check that comma-separated Spark paths do not include spaces in newly added examples.
Keep Java and PySpark examples aligned so that both demonstrate the same Spark RDD behavior.
Mention that collect() is only for small sample output and should not be used to pull large RDDs to the driver.
Use spark.read.text() only as a DataFrame alternative, not as a replacement for the RDD examples.

Conclusion: choosing the right Spark textFile input pattern

In this Spark Tutorial – Read multiple text files to single RDD, we have covered different scenarios of reading multiple files. Use comma-separated paths for known files, a directory path for all files in one folder, comma-separated directories for split input folders, and glob patterns when file names follow a matching rule.
