Spark textFile() to read multiple text files into one RDD
To read multiple text files into a single RDD in Apache Spark, pass a comma-separated list of file paths, a directory path, or a file-name pattern to SparkContext.textFile(). The returned RDD contains one record for each line found in the input text files.
This tutorial shows the common RDD-based ways to read text files in Java and PySpark: selected files, all files in a directory, files from multiple directories, and files that match a pattern. The examples use local paths, but the same idea applies to supported file systems such as HDFS, S3, or other Hadoop-compatible storage when the path is configured correctly.
For DataFrame-based text loading, Spark also provides spark.read.text(). Use textFile() when your processing is already written with RDD transformations, and use the Spark SQL text data source when you want a DataFrame with a value column. Reference: Apache Spark RDD programming guide and Spark text data source documentation.
Examples covered for reading multiple text files to one RDD
- Read multiple text files to single RDD [Java Example] [Python Example]
- Read all text files in a directory to single RDD [Java Example] [Python Example]
- Read all text files in multiple directories to single RDD [Java Example] [Python Example]
- Read all text files matching a pattern to single RDD [Java Example] [Python Example]
Spark textFile path formats for multiple input files
The path passed to textFile() is interpreted by the file system used by Spark. These are the formats used most often when reading multiple text files into one RDD.
| Requirement | Path format | Example |
|---|---|---|
| Specific files | Comma-separated file paths | file1.txt,file2.txt,file3.txt |
| All files in one folder | Directory path | data/rdd/input |
| Files from multiple folders | Comma-separated directory paths | data/rdd/input,data/rdd/anotherFolder |
| Files matching names | Glob pattern | data/rdd/input/file[0-3].txt |
Do not put spaces around commas in the input path string. For example, use file1.txt,file2.txt, not file1.txt, file2.txt. A space after the comma becomes part of the next path and can make Spark look for a path that does not exist.
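The rule above can be enforced in code before the path string reaches textFile(). The following is a minimal sketch in plain Python; join_input_paths is a hypothetical helper written for this tutorial, not a Spark API. It trims stray spaces from each path and joins the paths with commas.

```python
def join_input_paths(paths):
    """Join paths into one comma-separated string for textFile().

    Hypothetical helper for illustration; it is not part of Spark.
    """
    cleaned = []
    for path in paths:
        path = path.strip()  # drop stray spaces around each path
        if not path:
            continue  # skip empty entries
        if "," in path:
            # a comma inside a path would be read as a separator
            raise ValueError(f"path contains a comma: {path!r}")
        cleaned.append(path)
    return ",".join(cleaned)


paths = join_input_paths([
    "data/rdd/input/file1.txt",
    " data/rdd/input/file2.txt",   # leading space is removed
    "data/rdd/input/file3.txt ",   # trailing space is removed
])
print(paths)
```

The resulting string can then be passed as the single argument to sc.textFile().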
Read multiple text files to single RDD using comma-separated paths
In this example, we have three text files to read. We pass their paths as comma-separated values in a single string literal, and textFile() reads the content of all three files into a single RDD.
This approach is useful when the input files are known in advance and you do not want to read every file from the directory.
First, we write this in Java.
FileToRddExample.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                .setMaster("local[2]").set("spark.executor.memory","2g");
        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        // provide text file paths to be read to RDD, separated by comma (no spaces)
        String files = "data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt";
        // read text files to RDD
        JavaRDD<String> lines = sc.textFile(files);
        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}
Note: Take care when providing the input file paths. The path string must not contain any spaces around the commas that separate the paths.
For example, if the files are in the same directory, write the comma-separated path string without spaces as shown below.
String files = "data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt";
file1.txt
This is File 1
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
file2.txt
This is File 2
Learn to read multiple text files to a single RDD
file3.txt
This is File 3
Learn to read multiple text files to a single RDD
Output
18/02/10 12:13:26 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/02/10 12:13:26 INFO DAGScheduler: ResultStage 0 (collect at FileToRddExample.java:21) finished in 0.613 s
18/02/10 12:13:26 INFO DAGScheduler: Job 0 finished: collect at FileToRddExample.java:21, took 0.888843 s
This is File 1
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
This is File 2
Learn to read multiple text files to a single RDD
This is File 3
Learn to read multiple text files to a single RDD
18/02/10 12:13:26 INFO SparkContext: Invoking stop() from shutdown hook
18/02/10 12:13:26 INFO SparkUI: Stopped Spark web UI at http://192.168.1.104:4040
18/02/10 12:13:26 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
The output contains lines from all three files in one RDD. The exact ordering can depend on how Spark partitions and reads the files, so do not depend on this output order for application logic unless you explicitly add ordering after reading.
Now, we read the same text files into an RDD with the textFile() method in Python.
readToRdd.py
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)
    # read the comma-separated input text files to RDD
    lines = sc.textFile("data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt")
    # collect the RDD to a list
    llist = lines.collect()
    # print the list
    for line in llist:
        print(line)
Run this Spark Application using spark-submit by executing the following command.
$ spark-submit readToRdd.py
Note: As in the Java example, the path string must not contain any spaces around the commas that separate the paths.
In newer PySpark applications, it is also common to create a SparkSession first and then use its sparkContext for RDD operations.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Read Multiple Text Files to RDD").getOrCreate()
sc = spark.sparkContext
paths = "data/rdd/input/file1.txt,data/rdd/input/file2.txt,data/rdd/input/file3.txt"
lines = sc.textFile(paths)
for line in lines.take(10):
    print(line)
The example uses take(10) instead of collect() to preview a small number of records. Use collect() only for small examples, because it brings all records from the RDD to the driver program.
Read all text files in a directory to single RDD
Next, we write a Spark application that reads all the text files in a given directory into a single RDD.
When a directory path is passed to textFile(), Spark reads the files under that directory as text input. Use this method when all files in the folder are part of the same logical dataset.
The following Spark application, written in Java, reads the content of all text files in a directory into an RDD.
FileToRddExample.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                .setMaster("local[2]").set("spark.executor.memory","2g");
        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        // provide path to directory containing text files
        String files = "data/rdd/input";
        // read text files to RDD
        JavaRDD<String> lines = sc.textFile(files);
        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}
In the above example, the directory path is given in the variable files.
All the text files inside the given directory, data/rdd/input, are read into the lines RDD.
Before using a directory path, keep only the files you want Spark to read in that folder, or use a pattern if the directory also contains unrelated files.
Now, we write a Spark application in Python that does the same job: reading data from all text files in a directory into an RDD.
readToRdd.py
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)
    # read input text files present in the directory to RDD
    lines = sc.textFile("data/rdd/input")
    # collect the RDD to a list
    llist = lines.collect()
    # print the list
    for line in llist:
        print(line)
Run the above Python Spark application by executing the following command in a console.
$ spark-submit readToRdd.py
Read all text files in multiple directories to single RDD
This scenario builds on the previous ones. We have seen how to read selected text files, or all text files in one directory, into an RDD. Now we read all text files from multiple directories into a single RDD.
Use comma-separated directory paths when the same dataset is split across more than one folder. Spark reads the input paths together and creates one RDD of lines.
First, we write a Java application that reads all text files in multiple directories.
FileToRddExample.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                .setMaster("local[2]").set("spark.executor.memory","2g");
        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        // provide paths to directories containing text files, separated by comma
        String directories = "data/rdd/input,data/rdd/anotherFolder";
        // read text files to RDD
        JavaRDD<String> lines = sc.textFile(directories);
        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}
All the text files in both directories, provided in the variable directories, are read into the RDD. Similarly, you may provide more than two directories.
Let us write the same program in Python.
readToRdd.py
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)
    # read input text files present in both directories to RDD
    lines = sc.textFile("data/rdd/input,data/rdd/anotherFolder")
    # collect the RDD to a list
    llist = lines.collect()
    # print the list
    for line in llist:
        print(line)
You may submit this Python application to Spark by running the following command.
$ spark-submit readToRdd.py
Read text files matching a glob pattern to single RDD
This scenario uses a pattern to match file names. All files whose names match the given pattern are read into the RDD.
In practice, these patterns are file-system glob patterns rather than full regular expressions. They are useful when file names follow a predictable naming rule, such as file1.txt, file2.txt, and file3.txt.
Let us write a Java application that reads only the files that match a given pattern.
FileToRddExample.java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileToRddExample {
    public static void main(String[] args) {
        // configure spark
        SparkConf sparkConf = new SparkConf().setAppName("Read Multiple Text Files to RDD")
                .setMaster("local[2]").set("spark.executor.memory","2g");
        // start a spark context
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        // provide glob patterns for the file names to read, separated by comma
        String files = "data/rdd/input/file[0-3].txt,data/rdd/anotherFolder/file*";
        // read matching text files to RDD
        JavaRDD<String> lines = sc.textFile(files);
        // collect RDD for printing
        for (String line : lines.collect()) {
            System.out.println(line);
        }
    }
}
- file[0-3].txt matches file0.txt, file1.txt, file2.txt, and file3.txt. Whichever of these files are present are read into the RDD.
- file* matches any file whose name starts with the string file, for example file-hello.txt, file2.txt, or filehing.txt.
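To get a feel for what these two patterns select, the matching can be previewed locally in plain Python. This is only a sketch: it uses Python's fnmatch module, which is not the glob implementation Spark relies on, and the assumption here is that both behave the same for the simple * and [0-3] forms shown above. Other glob syntax may differ. The helper preview_matches and the sample file names are hypothetical.

```python
from fnmatch import fnmatchcase


def preview_matches(pattern, file_names):
    """Return the names that match a simple glob pattern.

    Local approximation for illustration; Spark does not call this.
    """
    return [name for name in file_names if fnmatchcase(name, pattern)]


# hypothetical directory listing
names = ["file0.txt", "file2.txt", "file7.txt", "file-hello.txt", "notes.txt"]

# one character in the range 0 to 3 between "file" and ".txt"
print(preview_matches("file[0-3].txt", names))
# any name that starts with "file"
print(preview_matches("file*", names))
```

The first pattern keeps only file0.txt and file2.txt from this list, while the second keeps every name except notes.txt.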
The following Python application reads files whose names match a specific pattern into an RDD.
readToRdd.py
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
    # create Spark context with Spark configuration
    conf = SparkConf().setAppName("Read Text to RDD - Python")
    sc = SparkContext(conf=conf)
    # read input text files whose names match the glob patterns to RDD
    lines = sc.textFile("data/rdd/input/file[0-3].txt,data/rdd/anotherFolder/file*")
    # collect the RDD to a list
    llist = lines.collect()
    # print the list
    for line in llist:
        print(line)
Check which file each RDD line came from with wholeTextFiles()
textFile() returns an RDD of lines and does not keep the file name in each record. If your processing needs the source file name along with file content, use wholeTextFiles(). It returns pairs where the key is the file path and the value is the complete file content.
file_contents = sc.wholeTextFiles("data/rdd/input/*.txt")
for file_path, content in file_contents.take(3):
    print(file_path)
    print(content.splitlines()[0])
Use wholeTextFiles() carefully for very large files, because each file’s content is loaded as one value. For normal line-by-line processing, textFile() is usually the better RDD API.
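If each line needs to carry its source file, one option is to split every (file path, content) pair into (file path, line) pairs. The function below is a minimal sketch in plain Python; lines_with_source is a hypothetical name, and the sample path is made up for illustration.

```python
def lines_with_source(file_record):
    """Turn one (file_path, content) pair into a list of (file_path, line) pairs."""
    file_path, content = file_record
    return [(file_path, line) for line in content.splitlines()]


# hypothetical record in the shape that wholeTextFiles() produces
sample = (
    "file:/data/rdd/input/file2.txt",
    "This is File 2\nLearn to read multiple text files to a single RDD\n",
)

for path, line in lines_with_source(sample):
    print(path, "->", line)
```

With an existing SparkContext sc, passing this function to flatMap, as in sc.wholeTextFiles("data/rdd/input/*.txt").flatMap(lines_with_source), should give an RDD of (file path, line) pairs. The same caution about very large files still applies, because each file is loaded whole before it is split.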
Spark RDD text file reading issues and fixes
- Spaces in comma-separated paths: remove spaces around commas, because Spark treats the space as part of the next path.
- Wrong relative path: when running with spark-submit, relative paths are resolved from the working directory of the driver. Use absolute paths or storage URIs when needed.
- Too much data in driver memory: avoid collect() on large RDDs. Use actions such as take() or count(), or write the result to storage.
- Unexpected files read from a directory: pass a more specific glob pattern, such as data/rdd/input/*.txt.
- Need DataFrame output: use spark.read.text(path) when you want a DataFrame instead of an RDD.
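One way to avoid depending on the working directory is to expand relative paths before passing them to textFile(). The following is a rough sketch in plain Python, assuming the input files are on the local file system; to_absolute_paths is a hypothetical helper, not a Spark API. Entries that already carry a scheme such as hdfs:// are left unchanged.

```python
import os


def to_absolute_paths(comma_separated, base_dir=None):
    """Expand relative local paths in a comma-separated path string.

    Hypothetical helper for illustration; it is not part of Spark.
    """
    base_dir = base_dir or os.getcwd()
    absolute = []
    for path in comma_separated.split(","):
        if "://" in path:
            # storage URI (for example hdfs:// or s3a://): keep as given
            absolute.append(path)
        else:
            # os.path.join keeps paths that are already absolute
            absolute.append(os.path.normpath(os.path.join(base_dir, path)))
    return ",".join(absolute)


print(to_absolute_paths("data/rdd/input/file1.txt,data/rdd/input/file2.txt"))
```

On a cluster, a local path also has to be readable at the same location on every worker node, so a shared file system or a storage URI is usually the safer choice there.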
FAQ on Spark reading multiple text files into one RDD
Can Spark read multiple text files into one RDD?
Yes. Spark can read multiple text files into one RDD by using sc.textFile() with comma-separated paths, a directory path, or a glob pattern. Each line in the input files becomes one element in the resulting RDD.
How do I read only selected text files in PySpark?
Pass the selected file paths as one comma-separated string to sc.textFile(), without spaces around commas. Example: sc.textFile("input/a.txt,input/b.txt").
How do I read all text files from a folder in Spark?
Pass the folder path to textFile(), such as sc.textFile("data/rdd/input"). Spark reads the text input from that directory into a single RDD.
Is Spark textFile pattern matching a regular expression?
No. In file paths, Spark commonly uses file-system glob patterns such as * and [0-3], not full regular expressions. The exact pattern behavior depends on the underlying file system.
Should I use sc.textFile() or spark.read.text()?
Use sc.textFile() when you need an RDD. Use spark.read.text() when you want a DataFrame and plan to work with Spark SQL or DataFrame transformations.
Editorial QA checklist for this Spark RDD tutorial
- Confirm every textFile() example explains whether the input is selected files, one directory, multiple directories, or a glob pattern.
- Check that comma-separated Spark paths do not include spaces in newly added examples.
- Keep Java and PySpark examples aligned so that both demonstrate the same Spark RDD behavior.
- Mention that collect() is only for small sample output and should not be used to pull large RDDs to the driver.
- Use spark.read.text() only as a DataFrame alternative, not as a replacement for the RDD examples.
Conclusion: choosing the right Spark textFile input pattern
In this Spark Tutorial – Read multiple text files to single RDD, we have covered different scenarios of reading multiple files. Use comma-separated paths for known files, a directory path for all files in one folder, comma-separated directories for split input folders, and glob patterns when file names follow a matching rule.
TutorialKart.com