SparkSQL优化的行列转换

王一川2025-02-082025-07-09

行列转换是数据处理与分析中的关键操作，它能够将数据的结构从行转为列，或从列转为行。这种转换不仅简化了复杂的数据展示，还提升了数据分析的效率。在业务场景中，行列转换常用于报表生成、数据透视和多维度数据分析，通过更直观的方式呈现数据，帮助管理者快速获取关键信息。此外，它还能有效减少数据冗余，优化查询性能，满足灵活多变的业务需求。无论是在财务报表、销售分析，还是市场趋势分析中，行列转换都是不可或缺的工具。

本文会基于 SparkSQL 3.5.x 给出常用的行列转换方式，但本文的重点是介绍pivot和unpivot子句在行列转换场景的应用，其中细节、优雅程度交由开发者自己选择

一、数据准备

以下是城市各年GDP 数据的表结构和测试数据，用于后续演示行列转换

create table city_gdp
(
    city string comment '城市名',
    year int comment '年份',
    gdp  double comment '单位：亿'
) comment '城市 gdp'
    stored as parquet;

insert into city_gdp (city, year, gdp)
values ('北京', 2018, 30320.00),
       ('北京', 2019, 35370.00),
       ('北京', 2020, 36100.00),
       ('上海', 2018, 32600.00),
       ('上海', 2019, 38160.00),
       ('上海', 2020, 38700.00),
       ('广州', 2018, 23000.00),
       ('广州', 2019, 23628.00),
       ('广州', 2020, 25019.00);

二、行列转换

2.1 传统方式

2.1.1 行转列

统计各城市 2018-2020 之间的 GDP，不同年份作为单独列显示。下面是 sql 实现

select city,
       max(case when year = 2018 then gdp end) as `2018`,
       max(case when year = 2019 then gdp end) as `2019`,
       max(case when year = 2020 then gdp end) as `2020`
from city_gdp
group by city;

+-------+----------+----------+----------+
| city  |   2018   |   2019   |   2020   |
+-------+----------+----------+----------+
| 上海    | 32600.0  | 38160.0  | 38700.0  |
| 广州    | 23000.0  | 23628.0  | 25019.0  |
| 北京    | 30320.0  | 35370.0  | 36100.0  |
+-------+----------+----------+----------+

2.1.2 列转行

为了更好的演示，将 2.1.1 的结果存储到临时表

create table tmp_city_gdp stored as parquet as
select city,
       max(case when year = 2018 then gdp end) as `2018`,
       max(case when year = 2019 then gdp end) as `2019`,
       max(case when year = 2020 then gdp end) as `2020`
from city_gdp
group by city;

基于 2.1.1 的结果，统计各城市 2018-2020 之间的 GDP，不同年份作为单独行显示。下面是 sql 实现

select city, '2018' as year, `2018` as gdp
from tmp_city_gdp
union all
select city, '2019', `2019`
from tmp_city_gdp
union all
select city, '2020', `2020`
from tmp_city_gdp;

+-------+-------+----------+
| city  | year  |   gdp    |
+-------+-------+----------+
| 上海    | 2018  | 32600.0  |
| 北京    | 2018  | 30320.0  |
| 广州    | 2018  | 23000.0  |
| 上海    | 2019  | 38160.0  |
| 北京    | 2019  | 35370.0  |
| 广州    | 2019  | 23628.0  |
| 上海    | 2020  | 38700.0  |
| 北京    | 2020  | 36100.0  |
| 广州    | 2020  | 25019.0  |
+-------+-------+----------+

2.2 pivot 和 unpivot

2.2.1 行转列

pivot的标准语法如下

SELECT [columns]
FROM (
    SELECT [columns]
    FROM table_name
) 
PIVOT (
    aggregate_function([column]) FOR [column_to_pivot] IN ([pivot_values])
)

pivot子句是可以完全替代case when

select *
from (select city, year, gdp
      from city_gdp)
pivot (
    sum(gdp) for year in (2018, 2019, 2020)
);

+-------+----------+----------+----------+
| city  |   2018   |   2019   |   2020   |
+-------+----------+----------+----------+
| 上海    | 32600.0  | 38160.0  | 38700.0  |
| 北京    | 30320.0  | 35370.0  | 36100.0  |
| 广州    | 23000.0  | 23628.0  | 25019.0  |
+-------+----------+----------+----------+

2.2.2 列转行

unpivot的标准语法如下

SELECT [columns]
FROM table_name 
UNPIVOT [ { INCLUDE | EXCLUDE } NULLS ](
    column_value_unpivot FOR name_column IN (unpivot_column [as alias])
)

Tips:

unpivot 子句会产生新的字段，列名需要自定义
column_value_unpivot 定义 unpivot 度量值字段名
name_column 定义 unpivot 维度值字段名
unpivot_column 限定 unpivot 需要转成行的列，从 table_name 中选取且作为 name_colume 的取值
unpivot 列转行时默认剔除 null 值，如需要体现 null 值所在的 unpivot_column 需要使用 unpivot include nulls

unpivot子句是可以完全替代union all

select *
from tmp_city_gdp
unpivot include nulls (
    gdp for year in (`2018`, `2019`, `2020`)
);

+-------+-------+----------+
| city  | year  |   gdp    |
+-------+-------+----------+
| 上海    | 2018  | 32600.0  |
| 上海    | 2019  | 38160.0  |
| 上海    | 2020  | 38700.0  |
| 北京    | 2018  | 30320.0  |
| 北京    | 2019  | 35370.0  |
| 北京    | 2020  | 36100.0  |
| 广州    | 2018  | 23000.0  |
| 广州    | 2019  | 23628.0  |
| 广州    | 2020  | 25019.0  |
+-------+-------+----------+