Read input text file to Spark RDD with SparkContext.textFile()

To read an input text file to an RDD in Apache Spark, use the SparkContext.textFile() method. It reads one or more text files from a local path, HDFS path, or any Hadoop-supported file system path, and returns an RDD where each record is one line from the input.

In this tutorial, we will learn the syntax of SparkContext.textFile(), how the path and partition arguments work, and how to load a text file into an RDD with Java and Python examples. We will also cover common path formats, multiple text files, and mistakes to avoid when using collect() on large files.

For reference, Spark documents text file RDDs in the RDD Programming Guide, and PySpark documents the Python method in SparkContext.textFile.

What SparkContext.textFile() returns when reading a text file

textFile() returns an RDD of strings. Each element in the RDD is normally one line from the input text file. Newline characters are not included in the returned line values.

For example, if the file has three lines, the resulting RDD contains three string records. You can then apply RDD transformations such as map(), flatMap(), filter(), and count() on those lines.
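As a minimal sketch of these transformations in PySpark, the function below assumes that lines is the RDD returned by sc.textFile(). The names split_words and summarize_lines are illustrative helpers, not part of the Spark API.

```python
def split_words(line):
    """Split one line of text into whitespace-separated words."""
    return line.split()


def summarize_lines(lines):
    """Apply a few common transformations and actions to an RDD of lines.

    `lines` is assumed to be the RDD returned by sc.textFile(path).
    """
    words = lines.flatMap(split_words)                    # one record per word
    line_lengths = lines.map(len)                         # one record per line
    spark_lines = lines.filter(lambda l: "Spark" in l)    # keep matching lines

    return {
        "line_count": lines.count(),
        "word_count": words.count(),
        "total_chars": line_lengths.sum(),
        "lines_with_spark": spark_lines.count(),
    }
```

With an existing SparkContext sc, you could call summarize_lines(sc.textFile("data/rdd/input/sample.txt")). Transformations such as map(), flatMap(), and filter() are lazy, so Spark reads the file only when an action such as count() runs.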

SparkContext.textFile() syntax for reading text files into RDD

The Java API signature of the textFile() method is:

JavaRDD<String> textFile(String path, int minPartitions)

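The textFile() method reads a text file from HDFS, the local file system, or any Hadoop-supported file system URI and returns it as an RDD of strings. It uses minPartitions as a hint for how many partitions to create. A one-argument form, textFile(String path), is also available.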

| Parameter | Description |
|---|---|
| path | Required. Specifies the path to the text file. |
| minPartitions | Optional. Suggests the minimum number of partitions the resulting RDD should have. |

The minPartitions value is a suggested minimum number of partitions. Spark may create more partitions depending on the input format, file size, file blocks, and cluster configuration. If you omit this argument, Spark uses the context's defaultMinPartitions value, which is the smaller of the default parallelism and 2.

Path formats supported by Spark textFile()

The path argument can point to different file systems. In local mode, a normal local path is often enough. In a cluster, make sure the path is accessible from the Spark driver and worker nodes, or use a distributed storage path such as HDFS or cloud storage.

| Path type | Example path | When to use it |
|---|---|---|
| Local file path | /home/user/data/sample.txt | Local testing or single-machine Spark execution. |
| HDFS path | hdfs://namenode:8020/data/sample.txt | Cluster workloads where files are stored in HDFS. |
| Directory path | /data/logs/ | Read all supported text files inside a directory. |
| Wildcard path | /data/logs/*.txt | Read multiple matching text files into one RDD. |

If you use a local file path in a real cluster, the same file must be available at the same path on every worker that needs to read it. For shared data, use HDFS or another distributed file system instead of a file that exists only on your laptop.

Spark textFile() Java example for a local text file

The following Java example reads a local text file and loads it into an RDD.

ReadTextToRDD.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadTextToRDD {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
										.setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// provide path to input text file
		String path = "data/rdd/input/sample.txt";
		
		// read text file to RDD
		JavaRDD<String> lines = sc.textFile(path);
		
		// collect RDD for printing
		for(String line:lines.collect()){
			System.out.println(line);
		}
	}
}

Input Text File

Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD

Output

17/11/28 10:33:55 INFO DAGScheduler: ResultStage 0 (collect at ReadTextToRDD.java:20) finished in 0.407 s
17/11/28 10:33:55 INFO DAGScheduler: Job 0 finished: collect at ReadTextToRDD.java:20, took 0.751794 s
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
17/11/28 10:33:55 INFO SparkContext: Invoking stop() from shutdown hook

The Java example uses collect() only to print a small sample file. For large input files, avoid collecting the whole RDD to the driver. Use actions such as take(), count(), or write the transformed data back to storage.

PySpark textFile() example for reading a text file into RDD

The following Python example reads a local text file and loads it into an RDD.

read-text-file-to-rdd.py

import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Read Text to RDD - Python")
  sc = SparkContext(conf=conf)

  # read input text file to RDD
  lines = sc.textFile("/home/arjun/workspace/spark/sample.txt")

  # collect the RDD to a list
  llist = lines.collect()

  # print the list
  for line in llist:
    print(line)

Submit this Python application to Spark using the following command.

~$ spark-submit /workspace/spark/read-text-file-to-rdd.py
17/11/28 15:03:13 INFO DAGScheduler: ResultStage 0 (collect at /home/arjun/workspace/spark/read-text-file-to-rdd.py:15) finished in 0.508 s
17/11/28 15:03:13 INFO DAGScheduler: Job 0 finished: collect at /home/arjun/workspace/spark/read-text-file-to-rdd.py:15, took 0.699556 s
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
17/11/28 15:03:13 INFO SparkContext: Invoking stop() from shutdown hook

The above PySpark program is suitable for a small tutorial file. For a larger file, read the RDD and inspect only a few records with take() before applying transformations.

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Preview Text File RDD")
sc = SparkContext(conf=conf)

lines = sc.textFile("/home/arjun/workspace/spark/sample.txt")

for line in lines.take(5):
    print(line)

print("Total lines:", lines.count())

sc.stop()
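When the goal is to keep the transformed data rather than print it, the result can be written back to storage instead of being collected. The following is a minimal sketch, assuming an existing SparkContext sc; to_upper and preview_and_save are hypothetical helper names, and the uppercase conversion stands in for any real transformation.

```python
def to_upper(line):
    """Example transformation: convert one line to uppercase."""
    return line.upper()


def preview_and_save(sc, input_path, output_dir, preview_size=5):
    """Preview a few lines, count them, and save the transformed RDD.

    `sc` is assumed to be an existing SparkContext. `output_dir` must not
    exist yet, because saveAsTextFile() refuses to overwrite a directory.
    """
    lines = sc.textFile(input_path)

    preview = lines.take(preview_size)   # only a few records reach the driver
    total = lines.count()                # only a single number reaches the driver

    # Executors write the transformed lines as part files under output_dir.
    lines.map(to_upper).saveAsTextFile(output_dir)

    return preview, total
```

Here the driver receives only the previewed lines and the count, while the full transformed dataset is written by the executors.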

Read multiple text files into a single Spark RDD

textFile() can read more than one text file into a single RDD. You can pass a directory path or use a wildcard pattern if you want to read a selected group of files.

logs = sc.textFile("/data/app-logs/*.txt")

error_lines = logs.filter(lambda line: "ERROR" in line)

for line in error_lines.take(10):
    print(line)

The resulting RDD contains lines from all matching files. This is useful for log analysis, batch text processing, and simple ETL jobs where many small text files need to be processed together.
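If the files do not share a directory or a naming pattern, textFile() also accepts several paths joined by commas in a single string. The sketch below assumes an existing SparkContext sc; join_paths and read_selected_files are illustrative helpers, and the file names in the usage note are hypothetical.

```python
def join_paths(paths):
    """Join several input paths into the comma-separated form textFile() accepts."""
    return ",".join(paths)


def read_selected_files(sc, paths):
    """Read an explicit list of text files into one RDD of lines.

    `sc` is assumed to be an existing SparkContext.
    """
    return sc.textFile(join_paths(paths))
```

For example, read_selected_files(sc, ["/data/app-logs/monday.txt", "/data/app-logs/friday.txt"]) would return one RDD containing the lines of both files.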

Read a text file from HDFS into Spark RDD

In a cluster environment, the input file is often stored in HDFS. The same textFile() method can read from an HDFS URI.

lines = sc.textFile("hdfs://namenode:8020/user/data/sample.txt")

print(lines.count())

Use the correct HDFS host, port, and file path for your cluster. If your Spark configuration already knows the default file system, a shorter path such as /user/data/sample.txt may also work.

Use minPartitions when reading text files with SparkContext.textFile()

The optional minPartitions argument lets you suggest how many partitions Spark should use for the resulting RDD. More partitions can increase parallelism, but too many partitions can add scheduling overhead. A reasonable value depends on the input size, cluster size, and later transformations.

lines = sc.textFile("/data/large-input.txt", minPartitions=8)

print(lines.getNumPartitions())

For Java, pass the second argument to textFile() as shown below.

JavaRDD<String> lines = sc.textFile("data/rdd/input/sample.txt", 4);

Use Spark textFile() for RDDs and spark.read.text() for DataFrames

Use SparkContext.textFile() when you specifically need an RDD of strings. If you are working with Spark SQL or DataFrames, use spark.read.text() instead. The DataFrame API returns a DataFrame with a string column named value, which is often more convenient for SQL-style processing.

| Method | Returns | Use when |
|---|---|---|
| sc.textFile(path) | RDD of strings | You want RDD transformations and actions. |
| spark.read.text(path) | DataFrame with a value column | You want DataFrame, SQL, or structured processing. |

df = spark.read.text("/data/sample.txt")

df.show(truncate=False)
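The two APIs can also be bridged. As a rough sketch, assuming an existing SparkSession named spark, the lines of a text DataFrame can be exposed as an RDD of plain strings through its value column. The helper names row_to_line and dataframe_lines_as_rdd are illustrative only.

```python
def row_to_line(row):
    """Extract the text of one line from a Row that has a `value` column."""
    return row.value


def dataframe_lines_as_rdd(spark, path):
    """Read text as a DataFrame, then expose the lines as an RDD of strings.

    `spark` is assumed to be an existing SparkSession.
    """
    df = spark.read.text(path)        # DataFrame with one string column: value
    return df.rdd.map(row_to_line)    # RDD of strings, similar to sc.textFile()
```

This can be handy when most of a job uses DataFrames but one step is easier to express with RDD operations.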

Common Spark textFile() mistakes while reading text files

  • Using collect() on large files: collect() brings all records to the driver. Use it only for small examples or debugging.
  • Using a local-only path in cluster mode: the path must be reachable by worker nodes, not only by the driver machine.
  • Expecting structured columns: textFile() gives one string per line. Parse the line yourself or use a DataFrame reader for structured data.
  • Confusing minPartitions with an exact partition count: Spark treats it as a minimum suggestion, not always an exact final count.
  • Reading many tiny files without planning: many small files can create overhead. Consider combining files or using a more suitable storage layout for large workloads.
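To illustrate the point about structured columns, the sketch below parses each line into fields after reading it. It assumes a hypothetical input format of id,name,score per line and an existing SparkContext sc; parse_record and load_records are illustrative names.

```python
def parse_record(line):
    """Parse an 'id,name,score' line into a tuple, or return None if malformed.

    The three-field layout is an assumed example format, not something
    textFile() knows about.
    """
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 3:
        return None
    try:
        return int(parts[0]), parts[1], float(parts[2])
    except ValueError:
        return None


def load_records(sc, path):
    """Read lines with textFile() and keep only the ones that parse cleanly.

    `sc` is assumed to be an existing SparkContext.
    """
    lines = sc.textFile(path)
    return lines.map(parse_record).filter(lambda record: record is not None)
```

Returning None for malformed lines and filtering them out keeps one bad record from failing the whole job. For regularly structured data, a DataFrame reader is usually the simpler choice.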

Spark textFile() FAQ for reading text files into RDD

How to read a text file in PySpark RDD?

Create or access a SparkContext and call sc.textFile("path/to/file.txt"). The method returns an RDD of strings, where each element is one line from the file.

How to create an RDD from a text file in Spark?

Use SparkContext.textFile(path). In Java, it returns a JavaRDD<String>. In Python, it returns a PySpark RDD containing string records.

Can Spark textFile() read multiple text files?

Yes. Pass a directory path or a wildcard path such as /data/logs/*.txt. Spark reads the matching files and creates one RDD containing the lines from those files.

What is the difference between sc.textFile() and spark.read.text()?

sc.textFile() returns an RDD of strings. spark.read.text() returns a DataFrame with a string column named value. Use RDDs for RDD-style processing and DataFrames for Spark SQL or structured processing.

Does Spark textFile() read a text file line by line?

Yes. The resulting RDD is made of line records. Each record is a string representing one line from the input text file, without the newline character.

Spark textFile() tutorial summary

In this Spark tutorial, we learned how to read data from a text file into an RDD using the SparkContext.textFile() method, with Java and Python examples.

In the next tutorial, we will learn how to read multiple text files into a single RDD.