Spark Dataset union() to append or concatenate rows

In Apache Spark, appending one Dataset to another means adding the rows of the second Dataset below the rows of the first Dataset. For Java Dataset<Row> objects, use the union() method when both Datasets have matching columns in the same order.

Spark provides the union() method in the Dataset class for this purpose. Call union() on the first Dataset and pass the second Dataset as the argument.

Note: union() can only be performed on Datasets with the same number of columns.

Spark Dataset append is a row-wise union, not a column-wise join

The word concatenate can mean two different things in data processing. In this tutorial, concatenate means row-wise append: the output contains all rows from ds1 followed by all rows from ds2. It does not place columns from two Datasets side by side.

| Spark operation | Use it when | Result |
|---|---|---|
| ds1.union(ds2) | Both Datasets have the same column count and matching column order | Rows from both Datasets are appended |
| ds1.unionByName(ds2) | The same column names may appear in a different order | Rows are appended after matching columns by name |
| ds1.join(ds2, ...) | You need to combine columns using a key such as employee id | Columns from both Datasets are placed side by side based on join conditions |
| ds1.union(ds2).distinct() | You want appended rows but duplicate rows should be removed | Rows are appended, then exact duplicate rows are removed |

Dataset.union() syntax in Java for appending rows

For a row-wise append, the method you need is union(). Do not use join() when your goal is to add the rows of one Dataset below another. For contrast, the simplest join() signature is shown here; it combines Datasets by columns and does not append rows:

public Dataset<Row> join(Dataset<?> right)

The Java signature of union() for a row-wise append is:

public Dataset<T> union(Dataset<T> other)

The function returns a new Dataset with the specified Dataset concatenated/appended to this Dataset. The original input Datasets are not modified.

Example – Concatenate two Datasets with Dataset.union()

In the following example, two Datasets with employee information are read from two different data files. We use the union() method to concatenate them.

ConcatenateDatasets.java

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConcatenateDatasets {

	public static void main(String[] args) {
		// configure spark
		SparkSession spark = SparkSession
				.builder()
				.appName("Spark Example - Append/Concatenate two Datasets")
				.master("local[2]")
				.getOrCreate();

		Dataset<Row> ds1 = spark.read().json("data/employees.json");
		Dataset<Row> ds2 = spark.read().json("data/employees2.json");
		
		// print both input datasets
		System.out.println("Dataset 1\n==============");
		ds1.show();
		System.out.println("Dataset 2\n==============");
		ds2.show();
		
		// concatenate datasets
		Dataset<Row> ds3 = ds1.union(ds2);
		
		System.out.println("Dataset 3 = Dataset 1 + Dataset 2\n==============================");
		ds3.show();
		
		spark.stop();
	}
}


Output

Dataset 1
==============
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
|   Raju|  3000|
+-------+------+

Dataset 2
==============
+------+------+
|  name|salary|
+------+------+
|Chandy|  4500|
|  Joey|  3500|
|   Mon|  4000|
|Rachel|  4000|
+------+------+

Dataset 3 = Dataset 1 + Dataset 2
==============================
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
|   Raju|  3000|
| Chandy|  4500|
|   Joey|  3500|
|    Mon|  4000|
| Rachel|  4000|
+-------+------+

The key line in the example is Dataset<Row> ds3 = ds1.union(ds2). When you preview both input Datasets, call show() on each one (ds1.show() and ds2.show()) so that each preview reflects its own data.

What Dataset.union() checks before appending Spark rows

Before using union(), compare the schemas of the two Datasets. Spark expects the same number of columns and compatible data types at the same positions. Column names alone are not enough for safe use of union().

  • Same column count: both Datasets must have the same number of columns.
  • Compatible column types: values in the same position should be compatible, such as string with string or long with long.
  • Correct column order: union() resolves columns by position, so values from the first column of ds2 are appended under the first column of ds1, even if the names differ. The result keeps the column names of ds1.
  • Duplicate rows are preserved: union() behaves like SQL UNION ALL. Use distinct() if exact duplicate rows must be removed.

You can inspect schemas with printSchema() before the union operation.

ds1.printSchema();
ds2.printSchema();

Dataset<Row> appendedEmployees = ds1.union(ds2);
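
The checks listed above can also be done in code instead of by reading printSchema() output. The following is a minimal sketch, not a Spark API: the class name UnionSchemaCheck and the helper samePositionalSchema are hypothetical names introduced here. It is deliberately stricter than Spark, because it requires identical types at each position, whereas union() may accept some differing but compatible types and widen them.

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class UnionSchemaCheck {

    // Hypothetical helper (not part of Spark): true when both schemas have the
    // same number of columns and identical data types at every position.
    // Column names are ignored on purpose, because union() ignores them too.
    public static boolean samePositionalSchema(StructType left, StructType right) {
        StructField[] l = left.fields();
        StructField[] r = right.fields();
        if (l.length != r.length) {
            return false;
        }
        for (int i = 0; i < l.length; i++) {
            if (!l[i].dataType().equals(r[i].dataType())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Schemas built by hand for illustration; no SparkSession is needed here.
        StructType employees = new StructType()
                .add("name", DataTypes.StringType)
                .add("salary", DataTypes.LongType);
        StructType withExtraColumn = employees.add("nn", DataTypes.LongType);

        System.out.println(samePositionalSchema(employees, employees));
        System.out.println(samePositionalSchema(employees, withExtraColumn));
    }
}
```

With real Datasets, the same helper could be called as samePositionalSchema(ds1.schema(), ds2.schema()) before ds1.union(ds2).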

Use unionByName() when Spark Dataset column order may differ

If two Datasets contain the same logical columns but the columns may be arranged in a different order, prefer unionByName(). This avoids a common mistake where values are placed under the wrong output column only because the input schema order changed.

Dataset<Row> combinedByName = ds1.unionByName(ds2);

Since Spark 3.1.0, unionByName(ds2, true) sets the allowMissingColumns flag, which fills columns that are missing from either Dataset with nulls. Use this only when nulls are valid for the missing fields in your data model.

Dataset<Row> combinedWithMissingColumns = ds1.unionByName(ds2, true);
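
To make the difference between positional and by-name alignment concrete, here is a small sketch that builds two one-row Datasets in memory with the same columns in opposite order. The class name UnionByNameSketch, the helper oneRow, and the dept column with its sample values are all made up for illustration and do not come from the employee files used earlier.

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class UnionByNameSketch {

    // Hypothetical helper: a one-row Dataset with two string columns.
    public static Dataset<Row> oneRow(SparkSession spark, String col1, String val1,
                                      String col2, String val2) {
        StructType schema = new StructType()
                .add(col1, DataTypes.StringType)
                .add(col2, DataTypes.StringType);
        return spark.createDataFrame(Arrays.asList(RowFactory.create(val1, val2)), schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("unionByName sketch")
                .master("local[2]")
                .getOrCreate();

        // Same logical columns, opposite order.
        Dataset<Row> nameFirst = oneRow(spark, "name", "Andy", "dept", "Sales");
        Dataset<Row> deptFirst = oneRow(spark, "dept", "HR", "name", "Berta");

        // Positional: both columns are strings, so Spark raises no error,
        // but "HR" lands under name and "Berta" under dept.
        nameFirst.union(deptFirst).show();

        // By name: "Berta" lands under name and "HR" under dept.
        nameFirst.unionByName(deptFirst).show();

        spark.stop();
    }
}
```

Because both columns share a type, the positional union succeeds silently, which is exactly the case where unionByName() is the safer choice.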

For method details, refer to the official Apache Spark Dataset Java API.

Append three or more Spark Datasets in Java

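To append more than two Datasets, chain union() calls when all input Datasets share the same schema and column order. In the snippet below, ds3 stands for a third input Dataset, not the union result from the earlier example.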

Dataset<Row> allEmployees = ds1
        .union(ds2)
        .union(ds3);

For a larger list of Datasets, validate the schema of each input before appending. A single Dataset with a changed column order or extra column can make the final result incorrect or fail during analysis.
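
For a list of Datasets, the chaining and the validation can be folded into one helper. This is a sketch under assumptions: UnionMany and unionAll are hypothetical names, not Spark APIs, and the check compares only column names and their order, not data types.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class UnionMany {

    // Hypothetical helper: appends every Dataset in the list after checking
    // that each one has the same column names in the same order as the first.
    public static Dataset<Row> unionAll(List<Dataset<Row>> parts) {
        if (parts.isEmpty()) {
            throw new IllegalArgumentException("No Datasets to append");
        }
        String[] expected = parts.get(0).columns();
        for (Dataset<Row> part : parts) {
            if (!Arrays.equals(expected, part.columns())) {
                throw new IllegalArgumentException(
                        "Column mismatch: expected " + Arrays.toString(expected)
                                + " but found " + Arrays.toString(part.columns()));
            }
        }
        // Equivalent to parts[0].union(parts[1]).union(parts[2]) and so on.
        return parts.stream().reduce((a, b) -> a.union(b)).get();
    }
}
```

A call might look like UnionMany.unionAll(Arrays.asList(ds1, ds2, ds3)). Failing early with the offending column list is a design choice here; it tends to be easier to debug than an AnalysisException raised deep inside a long chain.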

Remove duplicate rows after Spark Dataset union

union() keeps duplicates. This is usually correct for event logs, transactions, or employee records from different sources where repeated-looking rows may still be valid. If the output should contain only unique rows, call distinct() after the append.

Dataset<Row> appended = ds1.union(ds2);
Dataset<Row> uniqueRows = appended.distinct();

General Pitfalls while concatenating Datasets

If the number of columns in the two Datasets does not match, union() throws an AnalysisException, as shown below:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 2 columns and the second table has 3 columns;;
'Union
:- Relation[name#8,salary#9L] json
+- Relation[name#21,nn#22L,salary#23L] json

In the above case, there are two columns in the first Dataset, while the second Dataset has three columns.

Fix AnalysisException when Spark Dataset union columns do not match

When you see this error, do not add a random empty column just to make the code run. First decide whether the extra field is required in the final output.

  • If both Datasets should have the same schema, select the columns in the same order before calling union().
  • If the column order differs but the names are correct, use unionByName().
  • If one Dataset has missing columns and your Spark version supports it, use unionByName(other, true) and confirm that null values are acceptable.
  • If you actually need to combine columns from two Datasets, use a Spark join instead of union.
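
The first option above, selecting the same columns in the same order before union(), can be sketched as follows. AlignBeforeUnion and appendAligned are hypothetical names. The sketch assumes that every column of the first Dataset also exists in the second, and that extra columns in the second Dataset (such as nn in the error above) can be dropped; if a column is missing, select() itself fails with an AnalysisException.

```java
import java.util.Arrays;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

public class AlignBeforeUnion {

    // Hypothetical helper: projects `other` onto the columns of `base`,
    // in base's order, and then appends the rows positionally.
    public static Dataset<Row> appendAligned(Dataset<Row> base, Dataset<Row> other) {
        Column[] baseColumns = Arrays.stream(base.columns())
                .map(name -> functions.col(name))
                .toArray(Column[]::new);
        return base.union(other.select(baseColumns));
    }
}
```

For the two Datasets in the error message, appendAligned(ds1, ds2) would behave like ds1.union(ds2.select("name", "salary")).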

FAQs on Spark append and concatenate Datasets

Does Spark Dataset.union() remove duplicate rows?

No. Dataset.union() keeps duplicate rows. If you need only unique rows, call distinct() after union().

Does Spark Dataset.union() match columns by name?

No. union() matches columns by position. Use unionByName() when you want Spark to align columns by their names.

Can I concatenate Spark Datasets with different numbers of columns?

Not with plain union(). The input Datasets must have the same number of columns. For missing columns, use unionByName(other, true) if your Spark version supports it and null values are acceptable.

What is the difference between Spark union and join?

union() appends rows from one Dataset below another Dataset. join() combines columns from two Datasets based on a join condition such as matching ids.

How can I append multiple Spark Datasets safely?

Check that every Dataset has the same schema and column order, then chain union() calls. If column order may vary, align columns first or use unionByName().

QA checklist for this Spark Dataset union example

  • Confirm that you need a row-wise append, not a column-wise combination.
  • Verify that your Spark Java code uses ds1.union(ds2) for appending rows.
  • Check that the input Datasets have the same column count before running the example.
  • Confirm whether column order is safe for union() or whether unionByName() is the better method.
  • Decide whether duplicate rows should be preserved or removed with distinct().

Conclusion

In this Apache Spark tutorial, we learned to use the Dataset.union() method to append one Dataset to another with the same number of columns. We also covered when to use unionByName(), why column order matters, and how to remove duplicates after appending Spark Datasets.