Scala - Spark Application - Word Count Example

A Spark Scala application is a JVM program written in Scala that uses Apache Spark libraries to process data locally or on a Spark cluster. In this tutorial, we shall set up a Scala project with Apache Spark in Eclipse IDE and run a simple WordCount example.

The example uses Spark Core RDD operations, so it is easy to see the basic flow: read a text file, split each line into words, convert each word into a pair, reduce the pairs by key, and write the result. The same structure is useful when you later package the application as a JAR and run it with spark-submit.

Spark Scala WordCount project requirements in Eclipse

Before creating the Eclipse project, check these items. Most Spark Scala setup errors come from using an incompatible Scala library, missing Spark JARs, or writing output to a folder that already exists.

Java: Install a JDK version supported by the Spark distribution you are using.
Scala: Use the same Scala binary version that your Spark build uses. For example, Spark artifacts ending in _2.13 require Scala 2.13.
Apache Spark: Download Spark from the official Spark downloads page and use the JARs from its jars directory for this Eclipse setup.
Eclipse Scala support: Install a Scala plugin that works with your Eclipse version, or use a Scala-enabled Eclipse distribution if that is available for your environment.
Input file: Create a local input file at data/wordcount/input.txt, because the example below reads from that path.

For current Spark and Scala compatibility notes, refer to the official Apache Spark documentation. If you are building a production-style project, prefer sbt or Maven dependency management instead of adding every Spark JAR manually.

Set up a Spark Scala application in Eclipse

The following is a step-by-step process to set up a Spark Scala application in Eclipse.

1. Install Scala support in Eclipse for Spark Scala code

Download a Scala-enabled Eclipse build (for example, on Ubuntu) or install a Scala plugin from the Eclipse Marketplace. After installation, restart Eclipse and confirm that you can create a Scala project.

If your Eclipse installation does not provide stable Scala support for the Spark version you want to use, you can still use this tutorial as a project-structure reference and build the same WordCount program with sbt from the command line.

2. Create a new Scala project for Spark WordCount

Open Eclipse and create a new Scala project. Give the project a clear name such as SparkScalaWordCount. Inside the project, create the folder data/wordcount and save the following sample text as data/wordcount/input.txt.

spark scala spark
apache spark wordcount
scala application example

The above sample input is enough to test whether the project reads the file and counts repeated words correctly.
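If you prefer to create the input file from code rather than by hand, the following is a minimal sketch using only the Java standard library. The object name CreateSampleInput and its write helper are hypothetical names chosen for illustration; the path and the three lines come from the tutorial.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper, not part of the tutorial project.
object CreateSampleInput {
  // Sample lines copied from the tutorial's input.txt.
  val sampleLines: Seq[String] = Seq(
    "spark scala spark",
    "apache spark wordcount",
    "scala application example"
  )

  /** Writes the sample lines to `target`, creating parent folders if needed. */
  def write(target: Path): Path = {
    Option(target.getParent).foreach(parent => Files.createDirectories(parent))
    val bytes = sampleLines.mkString("\n").getBytes(StandardCharsets.UTF_8)
    Files.write(target, bytes)
  }

  def main(args: Array[String]): Unit = {
    // Relative path: resolved against the working directory of the run.
    val written = write(Paths.get("data", "wordcount", "input.txt"))
    println(s"Wrote sample input to ${written.toAbsolutePath}")
  }
}
```

Run it once from the project root so that the relative path lands inside the Eclipse project.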

3. Download Apache Spark for the Scala Eclipse application

Open the Spark downloads page at https://spark.apache.org/downloads.html.

Choose a Spark package that matches your environment. For a local WordCount example, a pre-built Spark package is sufficient. Extract the downloaded archive and note the location of its jars directory, because Eclipse needs those libraries on the project build path.

4. Add Apache Spark libraries to the Eclipse Scala build path

Open the project's Java Build Path and add all the JARs under spark-n.n.n-bin-hadoopN.N/jars/. The process is similar to creating a Java project with Apache Spark libraries.

This manual JAR method is acceptable for a local learning example. However, it is easy to miss a dependency or mix versions. When you move beyond this tutorial, use sbt or Maven so that Spark dependencies are declared in one build file.

5. Match Scala library version with the Spark Scala binary version

If Eclipse reports errors related to the Scala version, change the Scala library version of the project and try again. To change the Scala version of your project:
Java Build Path -> Libraries -> Add Library -> Scala Library -> choose a version lower than the latest one and click Finish. If the error persists, try the other available versions.

A more precise way to check compatibility is to look at the Spark artifact suffix. A dependency such as spark-core_2.13 means the application must compile with Scala 2.13. A dependency such as spark-core_2.12 means the application must compile with Scala 2.12. Mixing Scala binary versions is a common reason for NoSuchMethodError, ClassNotFoundException, and unresolved Spark imports.
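To see which Scala library the project actually runs with, a small check like the one below can help. It is a sketch, not part of the original tutorial: the object name ScalaVersionCheck and the binaryVersion helper are hypothetical, and the "first two components" rule assumes a Scala 2.x version string such as 2.13.16.

```scala
// Hypothetical helper for comparing the Scala library with the Spark artifact suffix.
object ScalaVersionCheck {
  /** Reduces a full Scala 2.x version such as "2.13.16" to its binary version "2.13". */
  def binaryVersion(fullVersion: String): String =
    fullVersion.split('.').take(2).mkString(".")

  def main(args: Array[String]): Unit = {
    // Version of the scala-library JAR that is on the classpath of this run.
    val full = scala.util.Properties.versionNumberString
    val suffix = binaryVersion(full)
    val javaVersion = System.getProperty("java.version")

    println(s"Scala library on the classpath: $full")
    println(s"Spark artifacts should end in: _$suffix")
    println(s"Java runtime: $javaVersion")
  }
}
```

If the printed suffix differs from the suffix in the names of the Spark JARs on the build path, change the Scala library of the project as described in this step.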

6. Create WordCount.scala for the Spark Scala RDD example

Right-click the project and create a new Scala object. Name it WordCount; Eclipse creates the file WordCount.scala.
The following example reads its input from data/wordcount/input.txt. The output is written to the root of the project, or you may change its location. The output folder contains part- files with the result and, after a successful run, a _SUCCESS status marker.

WordCount.scala

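import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {

    // Configure the Spark application to run locally with one worker thread.
    val conf = new SparkConf()
      .setAppName("Spark Scala WordCount Example")
      .setMaster("local[1]")

    // Create the Spark context, the entry point for RDD operations.
    val sc = new SparkContext(conf)

    // Map: read the lines, split each line into words, and pair each word with 1.
    val pairs = sc.textFile("data/wordcount/input.txt")
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))

    // Reduce: add up the 1s of each distinct word.
    val counts = pairs.reduceByKey(_ + _)

    // Print the result on the driver (suitable only for small results).
    counts.collect().foreach(println)

    // Save the result; Spark writes a directory named out.txt.
    counts.saveAsTextFile("out.txt")

    // Release the resources held by the Spark context.
    sc.stop()
  }
}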

7. Run the Spark Scala WordCount application from Eclipse

Run WordCount.scala as a Scala application. After a successful run, the result is stored in the out.txt folder.

When the program runs, Spark creates a folder named out.txt, not a single text file. Inside that folder, you should see one or more part- files and a _SUCCESS marker. If out.txt already exists, delete it before running the program again or change the output path in counts.saveAsTextFile(...).

The counts.collect().foreach(println) line prints the following pairs to the console:

(spark,3)
(scala,2)
(apache,1)
(wordcount,1)
(application,1)
(example,1)

The order of the printed pairs can vary because reduceByKey distributes keys across partitions by hash and does not guarantee any ordering. The important part is that each word is grouped with its final count.
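Deleting the old output folder before each run can also be done from code. The following sketch uses only java.io.File; the object name CleanOutputDir and the deleteRecursively helper are hypothetical, and the sketch assumes that out.txt is a plain local folder without symbolic links.

```scala
import java.io.{File, IOException}

// Hypothetical helper for removing a previous Spark output folder.
object CleanOutputDir {
  /** Deletes `target` and everything under it; does nothing if it is absent. */
  def deleteRecursively(target: File): Unit = {
    if (target.isDirectory) {
      // listFiles() can return null on I/O errors, so wrap it in Option.
      Option(target.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    }
    if (target.exists() && !target.delete()) {
      throw new IOException(s"Could not delete $target")
    }
  }

  def main(args: Array[String]): Unit = {
    // Same relative output path that WordCount.scala passes to saveAsTextFile.
    deleteRecursively(new File("out.txt"))
    println("Output folder removed (or it did not exist).")
  }
}
```

Calling a helper like this at the start of main, before saveAsTextFile, is one option for local experiments. On a cluster the output usually lives on a distributed file system, where this local-file approach does not apply.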

How the Spark Scala WordCount example works

The WordCount program uses a short chain of Spark RDD transformations and actions.

new SparkConf() creates the application configuration. The app name appears in Spark logs and UI screens.
setMaster("local[1]") runs the example locally with one worker thread. You can use local[*] to use all available local cores for testing.
sc.textFile(...) reads the input text file as an RDD of lines.
flatMap(line => line.split(" ")) splits each line into words and flattens the result into one RDD of words.
map(word => (word, 1)) creates a pair RDD where every word starts with count 1.
reduceByKey(_ + _) groups matching words and adds their counts.
collect() brings the result to the driver for printing. Use it only for small results.
saveAsTextFile("out.txt") writes the result to an output directory.
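The same chain can be tried without Spark to see what each step produces. The sketch below mirrors the flatMap, map, and reduce-by-key steps with plain Scala collections. It is only an analogy for understanding the flow: the object name WordCountSketch is hypothetical, and groupBy followed by a sum stands in for reduceByKey, which Spark executes per partition and across a shuffle.

```scala
// Hypothetical, Spark-free analogy of the WordCount chain.
object WordCountSketch {
  /** Counts words in the given lines using plain Scala collections. */
  def countWords(lines: Seq[String]): Map[String, Int] = {
    // Like flatMap(line => line.split(" ")): one flat sequence of words.
    val words = lines.flatMap(line => line.split(" "))
    // Like map(word => (word, 1)): every word starts with count 1.
    val pairs = words.map(word => (word, 1))
    // Like reduceByKey(_ + _): group the pairs by word and add their counts.
    pairs.groupBy(_._1).map { case (word, group) => (word, group.map(_._2).sum) }
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "spark scala spark",
      "apache spark wordcount",
      "scala application example"
    )
    // Map iteration order is unspecified, as is the order of the Spark output.
    countWords(lines).foreach(println)
  }
}
```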

For a tutorial-size input file, collect() is fine. For large data, avoid collecting all records to the driver. Save the result, inspect a limited sample, or use actions that do not require the complete dataset in driver memory.

Optional sbt build file for a Spark Scala WordCount JAR

After the Eclipse run works, the usual next step is to package the application as a JAR and run it with spark-submit. The Eclipse setup above adds Spark JARs directly, but a real Spark Scala application is usually built with sbt or Maven. With sbt, the Spark dependency is declared in build.sbt.

name := "spark-scala-wordcount"

version := "1.0"

scalaVersion := "2.13.16"

libraryDependencies += "org.apache.spark" %% "spark-core" % "4.1.1" % "provided"

Change scalaVersion and spark-core version to match the Spark release you are using. The %% operator adds the Scala binary suffix automatically. The provided scope is commonly used when the Spark runtime already supplies Spark libraries on the cluster.

sbt package

After packaging, submit the generated JAR with Spark. The exact JAR path depends on your project name, Scala binary version, and package settings.

spark-submit \
  --class WordCount \
  --master "local[1]" \
  target/scala-2.13/spark-scala-wordcount_2.13-1.0.jar

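For cluster deployment details, refer to the official Spark submitting applications guide. When submitting to a cluster, do not hard-code local-only file paths unless the file is available to the driver and executors in that environment. Also remove setMaster("local[1]") from the code in that case, because a master set on SparkConf takes precedence over the --master option of spark-submit.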

Common Spark Scala WordCount errors in Eclipse

The following checks help resolve the most frequent issues when a Scala WordCount JAR or Eclipse run does not work as expected.

Issue	Likely cause	Fix
`object apache is not a member of package org`	Spark JARs are not on the build path.	Add all JARs from the Spark `jars` directory or use sbt/Maven dependencies.
`NoSuchMethodError` or binary incompatibility errors	Scala version in Eclipse does not match Spark artifacts.	Use the Scala binary version shown by the Spark artifact suffix, such as `_2.13`.
Output path already exists	`saveAsTextFile` does not overwrite an existing directory.	Delete `out.txt` before rerunning or write to a new output path.
JAR runs in Eclipse but not with `spark-submit`	Main class, packaging, or dependency scope is wrong.	Confirm the object name, use the right `--class`, and package dependencies correctly.
Input file not found	The relative path is resolved from a different working directory.	Use the correct project working directory or provide an absolute path for local testing.
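For the "Input file not found" case, it helps to print the working directory that the run actually uses. The following is a small diagnostic sketch, assuming the tutorial's relative input path; the object name InputPathCheck and the resolve helper are hypothetical.

```scala
import java.nio.file.{Files, Path, Paths}

// Hypothetical diagnostic for relative input paths.
object InputPathCheck {
  /** Resolves a possibly relative path against the JVM working directory. */
  def resolve(relative: String): Path =
    Paths.get(relative).toAbsolutePath.normalize()

  def main(args: Array[String]): Unit = {
    val workingDir = System.getProperty("user.dir")
    val input = resolve("data/wordcount/input.txt")

    println(s"Working directory: $workingDir")
    println(s"Input resolves to: $input")
    println(s"Input file exists: ${Files.exists(input)}")
  }
}
```

If the last line prints false, either move input.txt to the printed location or change the working directory in the Eclipse run configuration.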

Spark Scala application FAQ for WordCount in Eclipse

Can I run this Spark Scala WordCount example without installing Hadoop?

Yes. For local testing, a pre-built Spark package can run this WordCount example with local[1] or local[*]. Hadoop libraries included with the Spark distribution are enough for the local file example shown here.

Why does Spark create an out.txt folder instead of an out.txt file?

saveAsTextFile writes distributed output as a directory. The actual records are written in one or more part- files inside that directory, along with status files such as _SUCCESS.

Should I use Eclipse JARs or sbt for a Spark Scala application?

Adding Spark JARs in Eclipse is simple for learning. For repeatable builds, dependency upgrades, and JAR packaging, sbt or Maven is better because the Spark version and Scala version are declared in a build file.

Which Scala version should I use for Spark WordCount?

Use the Scala binary version that matches your Spark artifacts. For example, spark-core_2.13 should be compiled with Scala 2.13, while spark-core_2.12 should be compiled with Scala 2.12.

Why does my Spark Scala WordCount JAR fail with ClassNotFoundException?

Common causes are an incorrect --class value, a package name mismatch, a JAR that was not rebuilt after code changes, or Spark dependencies missing from the runtime. Check the main object name and rebuild the JAR before running spark-submit.

Spark Scala WordCount QA checklist before publishing or rerunning

Verify that the input path in sc.textFile("data/wordcount/input.txt") exists relative to the Eclipse working directory.
Confirm that the Spark JARs on the Eclipse build path all come from the same Spark distribution.
Confirm that the Scala library selected in Eclipse matches the Spark Scala binary version.
Delete the existing out.txt output directory before rerunning the same example.
Use local[1] for predictable beginner output and local[*] only when you want local parallel execution.
For JAR submission, verify the --class value and the generated JAR path before running spark-submit.

Spark Scala WordCount application recap

In this Apache Spark tutorial on Spark Scala applications, we learned to set up a Scala project in Eclipse with Apache Spark libraries and run a WordCount example application. We also reviewed Scala version matching, output folder behavior, sbt packaging, spark-submit, and common Eclipse setup errors.

TutorialKart.com