Spark - textFile() - Read Text file to RDD

Read input text file to Spark RDD with SparkContext.textFile()

To read an input text file to an RDD in Apache Spark, use the SparkContext.textFile() method. It reads one or more text files from a local path, HDFS path, or any Hadoop-supported file system path, and returns an RDD where each record is one line from the input.

In this tutorial, we will learn the syntax of SparkContext.textFile(), how the path and partition arguments work, and how to load a text file into an RDD with Java and Python examples. We will also cover common path formats, multiple text files, and mistakes to avoid when using collect() on large files.

For reference, Spark documents text file RDDs in the RDD Programming Guide, and PySpark documents the Python method in SparkContext.textFile.

What SparkContext.textFile() returns when reading a text file

textFile() returns an RDD of strings. Each element in the RDD is normally one line from the input text file. Newline characters are not included in the returned line values.

For example, if the file has three lines, the resulting RDD contains three string records. You can then apply RDD transformations such as map(), flatMap(), and filter(), or actions such as count(), on those lines.
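
As a rough illustration of the paragraph above, the sketch below applies flatMap() and filter() to an RDD of lines and then calls actions on the results. The names split_words, mentions_spark, and summarize are hypothetical helpers, and lines is assumed to be an RDD already returned by sc.textFile().

```python
def split_words(line):
    """Split one line of text into whitespace-separated words."""
    return line.split()


def mentions_spark(line):
    """Return True when the line contains the substring 'Spark'."""
    return "Spark" in line


def summarize(lines):
    """Summarize an RDD of strings (assumed to come from sc.textFile())."""
    words = lines.flatMap(split_words)          # transformation: one record per word
    spark_lines = lines.filter(mentions_spark)  # transformation: keep matching lines
    return {
        "line_count": lines.count(),            # action: runs a job
        "word_count": words.count(),            # action
        "spark_lines": spark_lines.take(5),     # action: at most 5 records reach the driver
    }
```

Transformations are lazy, so nothing is read from the file until one of the actions runs.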

SparkContext.textFile() syntax for reading text files into RDD

The syntax of the textFile() method in the Java API is:

JavaRDD<String> textFile(String path, int minPartitions)

The textFile() method reads a text file from HDFS, the local file system, or any Hadoop-supported file system URI, and returns it as an RDD of Strings. The minPartitions argument is optional; a one-argument overload, textFile(String path), is also available.

Parameter	Description
path	Required. Path to the text file, directory, or wildcard pattern to read.
minPartitions	Optional. Suggested minimum number of partitions for the resulting RDD.

The minPartitions value is a suggested minimum number of partitions. Spark may create more partitions depending on the input format, file size, file blocks, and cluster configuration. If you omit this argument, Spark falls back to SparkContext.defaultMinPartitions, which is the smaller of 2 and the default parallelism of the environment.

Path formats supported by Spark textFile()

The path argument can point to different file systems. In local mode, a normal local path is often enough. In a cluster, make sure the path is accessible from the Spark driver and worker nodes, or use a distributed storage path such as HDFS or cloud storage.

Path type	Example path	When to use it
Local file path	`/home/user/data/sample.txt`	Local testing or single-machine Spark execution.
HDFS path	`hdfs://namenode:8020/data/sample.txt`	Cluster workloads where files are stored in HDFS.
Directory path	`/data/logs/`	Read all supported text files inside a directory.
Wildcard path	`/data/logs/*.txt`	Read multiple matching text files into one RDD.

If you use a local file path in a real cluster, the same file must be available at the same path on every worker that needs to read it. For shared data, use HDFS or another distributed file system instead of a file that exists only on your laptop.
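
One way to make the intended file system explicit is to put a URI scheme on the path, so that a bare path is not resolved against the cluster's default file system. The sketch below builds a file:// URI from an absolute local path. local_file_uri and read_local are hypothetical helper names, and sc is assumed to be an existing SparkContext.

```python
def local_file_uri(path):
    """Return a file:// URI for an absolute POSIX-style local path."""
    if "://" in path:
        return path  # already carries a scheme; leave it unchanged
    if not path.startswith("/"):
        raise ValueError("expected an absolute path, got: " + path)
    return "file://" + path


def read_local(sc, path):
    """Read a local file through an explicit file:// URI (sc is a SparkContext)."""
    return sc.textFile(local_file_uri(path))
```

The file must still exist at that path on every node that reads it; the scheme only removes ambiguity about which file system is meant.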

Spark textFile() Java example for a local text file

The following Java example reads a local text file and loads it into an RDD.

ReadTextToRDD.java

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadTextToRDD {

	public static void main(String[] args) {
		// configure spark
		SparkConf sparkConf = new SparkConf().setAppName("Read Text to RDD")
										.setMaster("local[2]").set("spark.executor.memory","2g");
		// start a spark context
		JavaSparkContext sc = new JavaSparkContext(sparkConf);
		
		// provide path to input text file
		String path = "data/rdd/input/sample.txt";
		
		// read text file to RDD
		JavaRDD<String> lines = sc.textFile(path);
		
		// collect RDD for printing
		for(String line:lines.collect()){
			System.out.println(line);
		}
	}
}

Input Text File

Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD

Output

17/11/28 10:33:55 INFO DAGScheduler: ResultStage 0 (collect at ReadTextToRDD.java:20) finished in 0.407 s
17/11/28 10:33:55 INFO DAGScheduler: Job 0 finished: collect at ReadTextToRDD.java:20, took 0.751794 s
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
17/11/28 10:33:55 INFO SparkContext: Invoking stop() from shutdown hook

The Java example uses collect() only to print a small sample file. For large input files, avoid collecting the whole RDD to the driver. Use actions such as take(), count(), or write the transformed data back to storage.

PySpark textFile() example for reading a text file into RDD

The following Python example reads a local text file and loads it into an RDD.

read-text-file-to-rdd.py

import sys

from pyspark import SparkContext, SparkConf

if __name__ == "__main__":

  # create Spark context with Spark configuration
  conf = SparkConf().setAppName("Read Text to RDD - Python")
  sc = SparkContext(conf=conf)

  # read input text file to RDD
  lines = sc.textFile("/home/arjun/workspace/spark/sample.txt")

  # collect the RDD to a list
  llist = lines.collect()

  # print the list
  for line in llist:
    print(line)

Submit this Python application to Spark using the following command.

~$ spark-submit /home/arjun/workspace/spark/read-text-file-to-rdd.py

Output

17/11/28 15:03:13 INFO DAGScheduler: ResultStage 0 (collect at /home/arjun/workspace/spark/read-text-file-to-rdd.py:15) finished in 0.508 s
17/11/28 15:03:13 INFO DAGScheduler: Job 0 finished: collect at /home/arjun/workspace/spark/read-text-file-to-rdd.py:15, took 0.699556 s
Welcome to TutorialKart
Learn Apache Spark
Learn to work with RDD
17/11/28 15:03:13 INFO SparkContext: Invoking stop() from shutdown hook

The above PySpark program is suitable for a small tutorial file. For a larger file, read the RDD and inspect only a few records with take() before applying transformations.

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Preview Text File RDD")
sc = SparkContext(conf=conf)

lines = sc.textFile("/home/arjun/workspace/spark/sample.txt")

for line in lines.take(5):
    print(line)

print("Total lines:", lines.count())

sc.stop()
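
For the other option mentioned earlier, writing results back to storage instead of collecting them, a minimal sketch could look like the following. clean_line and save_cleaned are hypothetical names, lines is assumed to be an RDD from sc.textFile(), and output_dir is a placeholder path.

```python
def clean_line(line):
    """Trim surrounding whitespace from one line."""
    return line.strip()


def save_cleaned(lines, output_dir):
    """Write non-empty, trimmed lines back to storage without collect()."""
    cleaned = lines.map(clean_line).filter(lambda line: line != "")
    # saveAsTextFile() writes one part file per partition into output_dir
    # and fails if output_dir already exists.
    cleaned.saveAsTextFile(output_dir)
```

Because the data is written by the executors, the driver never has to hold the full file in memory.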

Read multiple text files into a single Spark RDD

textFile() can read more than one text file into a single RDD. You can pass a directory path or use a wildcard pattern if you want to read a selected group of files.

logs = sc.textFile("/data/app-logs/*.txt")

error_lines = logs.filter(lambda line: "ERROR" in line)

for line in error_lines.take(10):
    print(line)

The resulting RDD contains lines from all matching files. This is useful for log analysis, batch text processing, and simple ETL jobs where many small text files need to be processed together.
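
textFile() also accepts a comma-separated list of paths in a single string, which helps when the files do not share a wildcard pattern. A small sketch, with join_paths and read_selected as hypothetical helpers and the file names as placeholders:

```python
def join_paths(paths):
    """Join several input paths into the comma-separated form textFile() accepts."""
    return ",".join(paths)


def read_selected(sc, paths):
    """Read the listed files into one RDD (sc is an existing SparkContext)."""
    return sc.textFile(join_paths(paths))


# Example call, assuming these placeholder files exist:
# logs = read_selected(sc, ["/data/app-logs/a.txt", "/data/app-logs/b.txt"])
```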

Read a text file from HDFS into Spark RDD

In a cluster environment, the input file is often stored in HDFS. The same textFile() method can read from an HDFS URI.

lines = sc.textFile("hdfs://namenode:8020/user/data/sample.txt")

print(lines.count())

Use the correct HDFS host, port, and file path for your cluster. If your Spark configuration already knows the default file system, a shorter path such as /user/data/sample.txt may also work.

Use minPartitions when reading text files with SparkContext.textFile()

The optional minPartitions argument lets you suggest how many partitions Spark should use for the resulting RDD. More partitions can increase parallelism, but too many partitions can add scheduling overhead. A reasonable value depends on the input size, cluster size, and later transformations.

lines = sc.textFile("/data/large-input.txt", minPartitions=8)

print(lines.getNumPartitions())

For Java, pass the second argument to textFile() as shown below.

JavaRDD<String> lines = sc.textFile("data/rdd/input/sample.txt", 4);

Use Spark textFile() for RDDs and spark.read.text() for DataFrames

Use SparkContext.textFile() when you specifically need an RDD of strings. If you are working with Spark SQL or DataFrames, use spark.read.text() instead. The DataFrame API returns a DataFrame with a string column named value, which is often more convenient for SQL-style processing.

Method	Returns	Use when
`sc.textFile(path)`	RDD of strings	You want RDD transformations and actions.
`spark.read.text(path)`	DataFrame with a `value` column	You want DataFrame, SQL, or structured processing.

# Assumes an existing SparkSession named spark, e.g. SparkSession.builder.getOrCreate()
df = spark.read.text("/data/sample.txt")

df.show(truncate=False)
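
If part of a job needs the RDD API after reading with spark.read.text(), the DataFrame can be converted rather than read a second time. This sketch assumes df is a DataFrame returned by spark.read.text(); row_to_line and dataframe_to_lines are hypothetical names.

```python
def row_to_line(row):
    """Extract the text from one Row of a spark.read.text() DataFrame."""
    return row.value


def dataframe_to_lines(df):
    """Convert a single-column text DataFrame into an RDD of strings."""
    return df.rdd.map(row_to_line)
```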

Common Spark textFile() mistakes while reading text files

Using collect() on large files: collect() brings all records to the driver. Use it only for small examples or debugging.
Using a local-only path in cluster mode: the path must be reachable by worker nodes, not only by the driver machine.
Expecting structured columns: textFile() gives one string per line. Parse the line yourself or use a DataFrame reader for structured data.
Confusing minPartitions with an exact partition count: Spark treats it as a minimum suggestion, not always an exact final count.
Reading many tiny files without planning: many small files can create overhead. Consider combining files or using a more suitable storage layout for large workloads.
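
For the structured-columns point above, parsing each line yourself might look like this sketch. The "timestamp level message" layout is a made-up log format used only for illustration, and parse_log_line and parse_logs are hypothetical names.

```python
def parse_log_line(line):
    """Parse 'timestamp level message' into a dict, or return None if malformed."""
    parts = line.split(" ", 2)  # split on the first two spaces only
    if len(parts) != 3:
        return None
    timestamp, level, message = parts
    return {"timestamp": timestamp, "level": level, "message": message}


def parse_logs(lines):
    """Turn an RDD of raw lines into an RDD of parsed records, dropping bad lines."""
    return lines.map(parse_log_line).filter(lambda record: record is not None)
```

Returning None for malformed lines and filtering them out keeps one bad record from failing the whole job; whether to drop or log such lines depends on the workload.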

Spark textFile() FAQ for reading text files into RDD

How to read a text file in PySpark RDD?

Create or access a SparkContext and call sc.textFile("path/to/file.txt"). The method returns an RDD of strings, where each element is one line from the file.

How to create an RDD from a text file in Spark?

Use SparkContext.textFile(path). In Java, it returns a JavaRDD<String>. In Python, it returns a PySpark RDD containing string records.

Can Spark textFile() read multiple text files?

Yes. Pass a directory path or a wildcard path such as /data/logs/*.txt. Spark reads the matching files and creates one RDD containing the lines from those files.

What is the difference between sc.textFile() and spark.read.text()?

sc.textFile() returns an RDD of strings. spark.read.text() returns a DataFrame with a string column named value. Use RDDs for RDD-style processing and DataFrames for Spark SQL or structured processing.

Does Spark textFile() read a text file line by line?

Yes. The resulting RDD is made of line records. Each record is a string representing one line from the input text file, without the newline character.

Spark textFile() tutorial summary

In this Spark tutorial, we learned how to read data from a text file into an RDD using the SparkContext.textFile() method, with Java and Python examples.

In the next tutorial, we shall learn how to read multiple text files into a single RDD.

TutorialKart.com
