Spark Dataset union() to append or concatenate rows
In Apache Spark, appending one Dataset to another means adding the rows of the second Dataset below the rows of the first Dataset. For Java Dataset<Row> objects, use the union() method when both Datasets have matching columns in the same order.
Spark provides the union() method in the Dataset class for this purpose. Call union() on the first Dataset and pass the second Dataset as the argument.
Note: union() can only be performed on Datasets with the same number of columns.
Spark Dataset append is a row-wise union, not a column-wise join
The word concatenate can mean two different things in data processing. In this tutorial, concatenate means row-wise append: the output contains all rows from ds1 followed by all rows from ds2. It does not place columns from two Datasets side by side.
| Spark operation | Use it when | Result |
|---|---|---|
| ds1.union(ds2) | Both Datasets have the same column count and matching column order | Rows from both Datasets are appended |
| ds1.unionByName(ds2) | The same column names may appear in a different order | Rows are appended after matching columns by name |
| ds1.join(ds2, ...) | You need to combine columns using a key such as employee id | Columns from both Datasets are placed side by side based on join conditions |
| ds1.union(ds2).distinct() | You want appended rows but duplicate rows should be removed | Rows are appended, then exact duplicate rows are removed |
Dataset.union() syntax in Java for appending rows
For row-wise append, the method you need is union(). Do not use join() when your goal is to add the rows of one Dataset below another. A join-style method signature looks like this, but it is for combining Datasets by columns, not for appending rows:
public Dataset<Row> join(Dataset<?> right)
The Java signature of union(), the method to use for row-wise append, is:
public Dataset<T> union(Dataset<T> other)
The function returns a new Dataset with the specified Dataset concatenated/appended to this Dataset. The original input Datasets are not modified.
Example – Concatenate two Datasets with Dataset.union()
In the following example, two Datasets with employee information are read from two different data files. We use the union() method to concatenate them.
ConcatenateDatasets.java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class ConcatenateDatasets {
    public static void main(String[] args) {
        // create a local SparkSession with two worker threads
        SparkSession spark = SparkSession
                .builder()
                .appName("Spark Example - Append/Concatenate two Datasets")
                .master("local[2]")
                .getOrCreate();
        // read the two input Datasets from JSON files
        Dataset<Row> ds1 = spark.read().json("data/employees.json");
        Dataset<Row> ds2 = spark.read().json("data/employees2.json");
        // print both input Datasets
        System.out.println("Dataset 1\n==============");
        ds1.show();
        System.out.println("Dataset 2\n==============");
        ds2.show();
        // append the rows of ds2 below the rows of ds1
        Dataset<Row> ds3 = ds1.union(ds2);
        System.out.println("Dataset 3 = Dataset 1 + Dataset 2\n==============================");
        ds3.show();
        spark.stop();
    }
}
Output
Dataset 1
==============
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
|   Raju|  3000|
+-------+------+
Dataset 2
==============
+------+------+
|  name|salary|
+------+------+
|Chandy|  4500|
|  Joey|  3500|
|   Mon|  4000|
|Rachel|  4000|
+------+------+
Dataset 3 = Dataset 1 + Dataset 2
==============================
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
|   Raju|  3000|
| Chandy|  4500|
|   Joey|  3500|
|    Mon|  4000|
| Rachel|  4000|
+-------+------+
The important line in the example is Dataset<Row> ds3 = ds1.union(ds2);. When you preview the inputs for checking, call show() on each Dataset separately (ds1.show() and ds2.show()) so that you inspect both sources and not the same one twice.
What Dataset.union() checks before appending Spark rows
Before using union(), compare the schemas of the two Datasets. Spark expects the same number of columns and compatible data types at the same positions. Column names alone are not enough for safe use of union().
- Same column count: both Datasets must have the same number of columns.
- Compatible column types: values in the same position should be compatible, such as string with string or long with long.
- Correct column order: union() resolves columns by position, so the first column of ds2 is appended below the first column of ds1, even if the names differ.
- Duplicate rows are preserved: union() behaves like SQL UNION ALL. Use distinct() if exact duplicate rows must be removed.
You can inspect schemas with printSchema() before the union operation.
ds1.printSchema();
ds2.printSchema();
Dataset<Row> appendedEmployees = ds1.union(ds2);
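Reading two printSchema() outputs by eye is easy to get wrong, so the comparison can also be done in code. The following is a minimal sketch, not part of the Spark API: SchemaPreflight and its Result values are hypothetical names, and it only compares column names (case-sensitively, assuming no duplicate names), not data types.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: compares two column-name arrays before a union. */
final class SchemaPreflight {

    /** Possible outcomes of the name comparison. */
    enum Result { SAME_ORDER, SAME_NAMES_DIFFERENT_ORDER, DIFFERENT_NAMES, DIFFERENT_COUNT }

    private SchemaPreflight() { }

    /**
     * Compares column names such as those returned by Dataset.columns().
     * Assumption: names are unique and compared case-sensitively; types are not checked.
     */
    static Result compare(String[] left, String[] right) {
        if (left.length != right.length) {
            return Result.DIFFERENT_COUNT;            // union() would fail during analysis
        }
        if (Arrays.equals(left, right)) {
            return Result.SAME_ORDER;                 // positions and names agree
        }
        Set<String> leftNames = new HashSet<>(Arrays.asList(left));
        Set<String> rightNames = new HashSet<>(Arrays.asList(right));
        return leftNames.equals(rightNames)
                ? Result.SAME_NAMES_DIFFERENT_ORDER   // unionByName() is the safer choice
                : Result.DIFFERENT_NAMES;             // union() would still append by position
    }
}
```

With Spark on the classpath, a call such as SchemaPreflight.compare(ds1.columns(), ds2.columns()) would tell you whether plain union() is safe or whether the column order needs attention first.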
Use unionByName() when Spark Dataset column order may differ
If two Datasets contain the same logical columns but the columns may be arranged in a different order, prefer unionByName(). This avoids a common mistake where values are placed under the wrong output column only because the input schema order changed.
Dataset<Row> combinedByName = ds1.unionByName(ds2);
In Spark 3.1 and later, unionByName(ds2, true) can fill missing columns with nulls. Use this only when nulls are valid for the missing fields in your data model.
Dataset<Row> combinedWithMissingColumns = ds1.unionByName(ds2, true);
For method details, refer to the official Apache Spark Dataset Java API.
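To see why name-based alignment matters, the plain-Java model below reorders one row by column name. It is only an illustration of the idea, not Spark code: ByNameModel and alignByName are hypothetical names, and the model keeps only the target columns, which is a simplification of what Spark does.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model (not Spark code) of aligning one row by column name. */
final class ByNameModel {

    private ByNameModel() { }

    /**
     * Reorders the values of one source row to match the target column order.
     * Columns missing from the source become null, which mirrors the idea
     * behind unionByName(other, true).
     */
    static List<Object> alignByName(List<String> targetColumns,
                                    List<String> sourceColumns,
                                    List<Object> sourceRow) {
        List<Object> aligned = new ArrayList<>();
        for (String column : targetColumns) {
            int index = sourceColumns.indexOf(column);   // -1 when the source lacks the column
            aligned.add(index >= 0 ? sourceRow.get(index) : null);
        }
        return aligned;
    }
}
```

If the target order is (name, salary) and a source row arrives as (4500, "Chandy") under the columns (salary, name), a positional append would place 4500 under name. Aligning by name first yields ("Chandy", 4500).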
Append three or more Spark Datasets in Java
To append more than two Datasets, chain union() calls when all input Datasets share the same schema and column order.
Dataset<Row> allEmployees = ds1
.union(ds2)
.union(ds3);
For a larger list of Datasets, validate the schema of each input before appending. A single Dataset with a changed column order or extra column can make the final result incorrect or fail during analysis.
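When the number of inputs is not known in advance, chaining by hand does not scale. One possible approach is a small generic fold, sketched below. UnionAll and unionAll are hypothetical names, and the helper is written against java.util.function.BinaryOperator so it does not depend on Spark itself.

```java
import java.util.List;
import java.util.function.BinaryOperator;

/** Hypothetical helper: folds a list of inputs into one using a union function. */
final class UnionAll {

    private UnionAll() { }

    /**
     * Applies the union function from left to right: union(union(a, b), c), and so on.
     * Throws IllegalArgumentException for an empty list because there is nothing to return.
     */
    static <T> T unionAll(List<T> inputs, BinaryOperator<T> union) {
        if (inputs.isEmpty()) {
            throw new IllegalArgumentException("At least one input is required");
        }
        T result = inputs.get(0);
        for (T next : inputs.subList(1, inputs.size())) {
            result = union.apply(result, next);
        }
        return result;
    }
}
```

With Spark on the classpath, a call such as UnionAll.unionAll(Arrays.asList(ds1, ds2, ds3), Dataset::union) should produce the same result as the chained form above. Passing Dataset::unionByName would align the inputs by column name.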
Remove duplicate rows after Spark Dataset union
union() keeps duplicates. This is usually correct for event logs, transactions, or employee records from different sources where repeated-looking rows may still be valid. If the output should contain only unique rows, call distinct() after the append.
Dataset<Row> appended = ds1.union(ds2);
Dataset<Row> uniqueRows = appended.distinct();
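The difference between the two results can be illustrated with a plain-Java model that treats each row as a list of values. This is only a sketch of the semantics, not Spark code. DuplicateModel and its methods are hypothetical names, and the model keeps first-seen order for readability, which Spark does not guarantee after distinct().

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

/** Plain-Java model (not Spark code) of union() versus union().distinct(). */
final class DuplicateModel {

    private DuplicateModel() { }

    /** Keeps every row from both inputs, like union() or SQL UNION ALL. */
    static List<List<Object>> appendAll(List<List<Object>> first, List<List<Object>> second) {
        List<List<Object>> rows = new ArrayList<>(first);
        rows.addAll(second);
        return rows;
    }

    /** Drops rows whose values all equal an earlier row, like distinct() after union(). */
    static List<List<Object>> appendDistinct(List<List<Object>> first, List<List<Object>> second) {
        return new ArrayList<>(new LinkedHashSet<>(appendAll(first, second)));
    }
}
```

If both inputs contain the row ("Andy", 4500), appendAll keeps it twice while appendDistinct keeps it once. A row only counts as a duplicate when every column value matches.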
General Pitfalls while concatenating Datasets
If the number of columns in the two Datasets does not match, the union() method throws an AnalysisException, as shown below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the same number of columns, but the first table has 2 columns and the second table has 3 columns;;
'Union
:- Relation[name#8,salary#9L] json
+- Relation[name#21,nn#22L,salary#23L] json
In the above case, there are two columns in the first Dataset, while the second Dataset has three columns.
Fix AnalysisException when Spark Dataset union columns do not match
When you see this error, do not add a random empty column just to make the code run. First decide whether the extra field is required in the final output.
- If both Datasets should have the same schema, select the columns in the same order before calling union().
- If the column order differs but the names are correct, use unionByName().
- If one Dataset has missing columns and your Spark version supports it, use unionByName(other, true) and confirm that null values are acceptable.
- If you actually need to combine columns from two Datasets, use a Spark join instead of union.
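For the first option in the list above, the column order to select can be derived from the first Dataset. The sketch below is one way to do that. ColumnOrder and selectOrder are hypothetical names, and the helper assumes unique, case-sensitive column names. It refuses to guess when the two schemas do not contain the same columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: works out the column order needed before a positional union(). */
final class ColumnOrder {

    private ColumnOrder() { }

    /**
     * Returns the target column order when the source has exactly the same column names.
     * Throws IllegalArgumentException naming the missing and extra columns otherwise.
     */
    static String[] selectOrder(String[] targetColumns, String[] sourceColumns) {
        List<String> missing = new ArrayList<>(Arrays.asList(targetColumns));
        missing.removeAll(Arrays.asList(sourceColumns));   // in target but not in source
        List<String> extra = new ArrayList<>(Arrays.asList(sourceColumns));
        extra.removeAll(Arrays.asList(targetColumns));     // in source but not in target
        if (!missing.isEmpty() || !extra.isEmpty()) {
            throw new IllegalArgumentException(
                    "Schemas differ. Missing: " + missing + ", extra: " + extra);
        }
        return targetColumns.clone();
    }
}
```

With Spark on the classpath, the returned array could be passed to select, for example ds2.select(order[0], Arrays.copyOfRange(order, 1, order.length)), before calling ds1.union(...) on the result. For the two schemas in the error message above (name, salary versus name, nn, salary), the helper throws and reports nn as the extra column, which is the point where you decide whether that field belongs in the output.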
FAQs on Spark append and concatenate Datasets
Does Spark Dataset.union() remove duplicate rows?
No. Dataset.union() keeps duplicate rows. If you need only unique rows, call distinct() after union().
Does Spark Dataset.union() match columns by name?
No. union() matches columns by position. Use unionByName() when you want Spark to align columns by their names.
Can I concatenate Spark Datasets with different numbers of columns?
Not with plain union(). The input Datasets must have the same number of columns. For missing columns, use unionByName(other, true) if your Spark version supports it and null values are acceptable.
What is the difference between Spark union and join?
union() appends rows from one Dataset below another Dataset. join() combines columns from two Datasets based on a join condition such as matching ids.
How can I append multiple Spark Datasets safely?
Check that every Dataset has the same schema and column order, then chain union() calls. If column order may vary, align columns first or use unionByName().
QA checklist for this Spark Dataset union example
- Confirm that the tutorial explains row-wise append, not column-wise concatenation.
- Verify that the Spark Java example uses ds1.union(ds2) for appending rows.
- Check that the input Datasets have the same column count before running the example.
- Confirm whether column order is safe for union() or whether unionByName() is the better method.
- Decide whether duplicate rows should be preserved or removed with distinct().
Conclusion
In this Apache Spark tutorial on concatenating two Datasets, we learned to use the Dataset.union() method to append one Dataset to another with the same number of columns. We also covered when to use unionByName(), why column order matters, and how to remove duplicates after appending Spark Datasets.
TutorialKart.com