
BigQuery UNION and UNION ALL: A Comprehensive Guide with Examples

When working with large datasets in Google BigQuery, combining results from multiple queries is a common task. But which one should you use: UNION DISTINCT or UNION ALL?

UNION

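  • The UNION operator combines the results of two or more SELECT queries into a single result set and removes duplicate rows. In BigQuery it must be written as UNION DISTINCT; a bare UNION is rejected as a syntax error.
  • Each SELECT statement within the UNION must return the same number of columns.
  • Columns are matched by position, not by name, so they must appear in the same order, and the columns in each position must have compatible data types.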
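As a minimal sketch of the behavior described above, the query below combines three inline rows with UNION DISTINCT. The literal rows are made-up illustration data, not taken from any real table, so the query can be run as-is in the BigQuery console.

```sql
-- Inline literal rows stand in for real tables (illustrative data only).
SELECT 1 AS id, 'Alice' AS name
UNION DISTINCT
SELECT 1, 'Alice'   -- exact duplicate of the first row, so it is removed
UNION DISTINCT
SELECT 2, 'Bob'
ORDER BY id;
-- Returns two rows: (1, 'Alice') and (2, 'Bob').
```

A row counts as a duplicate only when every column matches, so (1, 'Alice') and (1, 'Alicia') would both be kept.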

What Are the Use Cases for UNION in SQL?

Here are some common use cases for the UNION operator.

  • Combining Results From Different Tables: When you want to combine data from multiple tables and ensure there are no duplicate records, UNION is the go-to operator.
  • Handling Different Data Sources: UNION is useful when the same records may arrive from more than one data source and you want each record to appear only once.
  • Removing Duplicates Across Queries: When the combined result set must contain only unique rows, use UNION.

UNION ALL

  • The UNION ALL operator combines the results of two or more SELECT queries and keeps every row, including duplicates.
  • It is generally faster than UNION because it skips the deduplication step.
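For comparison, here is a small sketch of UNION ALL over inline literal rows. The data is invented for illustration only.

```sql
-- Inline literal rows stand in for real tables (illustrative data only).
SELECT 1 AS id, 'Alice' AS name
UNION ALL
SELECT 1, 'Alice'   -- duplicate row is kept
UNION ALL
SELECT 2, 'Bob'
ORDER BY id;
-- Returns three rows: (1, 'Alice'), (1, 'Alice') and (2, 'Bob').
```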

What Are the Use Cases for UNION ALL in SQL?

Here are some common use cases for the UNION ALL operator.

  • Combining Results With Duplicates: Use UNION ALL when you need to combine results from multiple queries and preserve all duplicate rows.
  • Performance Considerations: UNION ALL is more time-efficient than UNION because it doesn’t require the additional step of removing duplicates.
  • Aggregating Data From Different Periods: When aggregating data from different periods or sources, repeated values are often legitimate (two sales of the same amount, for example), so UNION ALL is preferred.
  • Reporting and Analysis: For reporting purposes where every record, including duplicates, is necessary, UNION ALL is suitable.
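To illustrate why the aggregation case matters, the sketch below sums sales amounts from two hypothetical monthly sources, jan and feb. Both CTE names and the amounts are assumptions made up for this example.

```sql
-- jan and feb are hypothetical inline sources (illustrative data only).
WITH jan AS (
  SELECT 50 AS amount
  UNION ALL
  SELECT 50            -- a second, separate sale of the same amount
),
feb AS (
  SELECT 50 AS amount
)
SELECT
  (SELECT SUM(amount)
   FROM (SELECT amount FROM jan UNION ALL SELECT amount FROM feb)) AS total_all,
  (SELECT SUM(amount)
   FROM (SELECT amount FROM jan UNION DISTINCT SELECT amount FROM feb)) AS total_distinct;
-- total_all = 150 (three sales of 50); total_distinct = 50 (the equal rows collapse into one).
```

Deduplicating before aggregating silently drops two of the three sales here, which is why UNION ALL is usually the safer choice when the rows feed a SUM or COUNT.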

Requirements for UNION and UNION ALL

Every SELECT statement in the query must agree on:

  • The number of columns,
  • The data types of the columns in each position (they must be compatible),
  • The order of the columns.
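The requirements above can be sketched with a short query. It assumes two made-up inline rows whose column names differ and whose numeric types differ; BigQuery matches the columns by position and coerces INT64 and FLOAT64 to a common type.

```sql
-- Illustrative data only; column names and values are invented for this sketch.
SELECT 'Q1' AS period, 100 AS amount       -- amount is INT64 here
UNION ALL
SELECT 'Q2' AS quarter, 250.5 AS total;    -- FLOAT64 here
-- The result columns take their names from the first SELECT (period, amount),
-- and amount is returned as FLOAT64.

-- Swapping the column order in one branch, for example SELECT 250.5, 'Q2',
-- would fail because STRING and numeric columns have no common type.
```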

Difference between UNION and UNION ALL

  • The key difference is that UNION removes duplicate records, whereas UNION ALL includes all duplicates.
  • This distinction changes the number of rows in the query result and also affects performance, because removing duplicates requires comparing rows across all of the combined inputs.

When to Use Which:

  • Use UNION when you need a distinct list of records from multiple sources and want to eliminate duplicates.
  • Use UNION ALL when you need to combine all records from multiple sources, including duplicates, or when performance is a concern and duplicates are either impossible or harmless.
  • UNION ALL is also useful when you explicitly want to see how many times a record appears across different sets.
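Counting how often a record appears across sets can be sketched by wrapping a UNION ALL in a CTE and grouping the result. The CTE name combined and the inline rows are assumptions for illustration.

```sql
-- Illustrative data only: the second and third rows represent the same record
-- arriving from two different hypothetical sources.
WITH combined AS (
  SELECT 1 AS id, 'Alice' AS name
  UNION ALL
  SELECT 2, 'Bob'
  UNION ALL
  SELECT 2, 'Bob'
)
SELECT id, name, COUNT(*) AS appearances
FROM combined
GROUP BY id, name
ORDER BY id;
-- Returns (1, 'Alice', 1) and (2, 'Bob', 2).
```

With UNION DISTINCT in place of UNION ALL, every count would be 1, so the information about repeats would be lost.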

Hands-On with UNION vs UNION ALL in BigQuery

1. Create the sales_team table with employee_id and employee_name columns

CREATE TABLE `amiable-might-453515-g6.demo_dataset.sales_team` (
  employee_id INT64,
  employee_name STRING
);

2. Insert sample data into the sales_team table

INSERT INTO `amiable-might-453515-g6.demo_dataset.sales_team` (employee_id, employee_name)
VALUES
  (1, 'Alice'),
  (2, 'Bob'),
  (3, 'Charlie');

3. Select all data from the sales_team table to verify the insertion

SELECT * FROM `amiable-might-453515-g6.demo_dataset.sales_team`;

4. Create the customer_team table with employee_id and employee_name columns

CREATE TABLE `amiable-might-453515-g6.demo_dataset.customer_team` (
  employee_id INT64,
  employee_name STRING
);

5. Insert sample data into the customer_team table

INSERT INTO `amiable-might-453515-g6.demo_dataset.customer_team` (employee_id, employee_name)
VALUES
  (3, 'Charlie'),  -- also present in sales_team, so it acts as the duplicate row
  (4, 'David'),
  (5, 'Eve');

6. Select all data from the customer_team table to verify the insertion

SELECT * FROM `amiable-might-453515-g6.demo_dataset.customer_team`;

7. Use UNION DISTINCT to combine data from sales_team and customer_team tables, excluding duplicates

-- (3, 'Charlie') exists in both tables and appears once in the result.
SELECT employee_id, employee_name
FROM `amiable-might-453515-g6.demo_dataset.sales_team`
UNION DISTINCT
SELECT employee_id, employee_name
FROM `amiable-might-453515-g6.demo_dataset.customer_team`
ORDER BY employee_id;

8. Use UNION ALL to combine data from sales_team and customer_team tables, including duplicates

-- (3, 'Charlie') exists in both tables and appears twice in the result.
SELECT employee_id, employee_name
FROM `amiable-might-453515-g6.demo_dataset.sales_team`
UNION ALL
SELECT employee_id, employee_name
FROM `amiable-might-453515-g6.demo_dataset.customer_team`
ORDER BY employee_id;
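As a quick check on the two queries above, the sketch below compares their row counts side by side. It assumes both tables contain only the sample rows inserted in steps 2 and 5; the aliases distinct_rows and all_rows are names chosen for this example.

```sql
SELECT
  (SELECT COUNT(*)
   FROM (
     SELECT employee_id, employee_name
     FROM `amiable-might-453515-g6.demo_dataset.sales_team`
     UNION DISTINCT
     SELECT employee_id, employee_name
     FROM `amiable-might-453515-g6.demo_dataset.customer_team`
   )) AS distinct_rows,
  (SELECT COUNT(*)
   FROM (
     SELECT employee_id, employee_name
     FROM `amiable-might-453515-g6.demo_dataset.sales_team`
     UNION ALL
     SELECT employee_id, employee_name
     FROM `amiable-might-453515-g6.demo_dataset.customer_team`
   )) AS all_rows;
-- With only the sample rows loaded: distinct_rows = 5, all_rows = 6.
```

The difference of one row is (3, 'Charlie'), the only record present in both tables.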

Conclusion:

Understanding the differences between UNION and UNION ALL in BigQuery is key to writing optimized and purposeful SQL queries. Whether you’re focused on preserving every row for detailed analysis or eliminating duplicates for cleaner reporting, selecting the appropriate operator can make a significant impact on your results and performance. With these practical insights and examples, you’re now better equipped to apply UNION and UNION ALL effectively within your BigQuery workflows, enabling smarter data processing and more reliable analytics.

Happy Learning!
