← 返回目录

第六章:聚合与分组

GROUP BY、聚合函数与窗口函数

1 聚合函数

聚合函数对一组值执行计算并返回单个结果。它们通常与 GROUP BY 配合使用,但也可以单独使用来计算整个表的统计值。

COUNT — 计数

-- COUNT(*) 统计所有行,包括 NULL
SELECT COUNT(*) AS total_orders FROM orders;

-- COUNT(column) 统计该列非 NULL 的行数
SELECT COUNT(email) AS has_email FROM customers;

-- COUNT(DISTINCT column) 统计去重后的非 NULL 值数
SELECT COUNT(DISTINCT category) AS category_count FROM products;

-- 组合使用:一条语句获取多个统计值
SELECT
    COUNT(*) AS total_products,
    COUNT(discount_price) AS on_sale,
    COUNT(DISTINCT brand) AS brand_count
FROM products;

SUM / AVG — 求和与平均

-- 订单总金额
SELECT SUM(amount) AS total_revenue FROM orders
WHERE order_date >= '2024-01-01';

-- 商品平均价格
SELECT AVG(price) AS avg_price FROM products;

-- SUM 和 AVG 自动忽略 NULL 值
SELECT
    SUM(bonus) AS total_bonus,
    AVG(bonus) AS avg_bonus
FROM employees;
-- 如果 10 条记录中 3 条 bonus 为 NULL,AVG 的分母是 7 而非 10

MIN / MAX — 最小与最大

-- 最低价和最高价
SELECT
    MIN(price) AS cheapest,
    MAX(price) AS most_expensive
FROM products;

-- MIN/MAX 也可用于日期和字符串
SELECT
    MIN(order_date) AS first_order,
    MAX(order_date) AS last_order
FROM orders;

-- 综合示例:订单统计概览
SELECT
    COUNT(*) AS order_count,
    SUM(amount) AS total_amount,
    AVG(amount) AS avg_amount,
    MIN(amount) AS min_amount,
    MAX(amount) AS max_amount
FROM orders
WHERE YEAR(order_date) = 2024;

💡 COUNT(*) vs COUNT(column)

COUNT(*) 统计所有行(包含 NULL),COUNT(column) 只统计该列值不为 NULL 的行。当列中有 NULL 值时,两者结果不同。如果只想知道行数,用 COUNT(*) 性能更好。

2 GROUP BY 分组

GROUP BY 将结果集按一个或多个列的值分组,每组返回一行。SELECT 中只能出现分组列或聚合函数。

基本分组

-- 每个类别的商品数量
SELECT category, COUNT(*) AS product_count
FROM products
GROUP BY category;

-- 每个客户的订单总额
SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(amount) AS total_spent,
    AVG(amount) AS avg_order
FROM orders
GROUP BY customer_id
ORDER BY total_spent DESC;

多列分组

-- 按年月统计销售额
SELECT
    YEAR(order_date) AS order_year,
    MONTH(order_date) AS order_month,
    COUNT(*) AS order_count,
    SUM(amount) AS monthly_revenue
FROM orders
GROUP BY YEAR(order_date), MONTH(order_date)
ORDER BY order_year, order_month;

-- 每个部门、每个职位的平均工资
SELECT
    department,
    job_title,
    COUNT(*) AS headcount,
    AVG(salary) AS avg_salary
FROM employees
GROUP BY department, job_title
ORDER BY department, avg_salary DESC;

WITH ROLLUP — 小计与合计

WITH ROLLUP 在分组结果后自动添加小计行和总计行,适合生成报表。

-- 每个类别的销售额,附带总计
SELECT
    COALESCE(category, '【总计】') AS category,
    COUNT(*) AS product_count,
    SUM(price) AS total_price
FROM products
GROUP BY category WITH ROLLUP;

-- 结果示例:
-- +-----------+---------------+-------------+
-- | category  | product_count | total_price |
-- +-----------+---------------+-------------+
-- | 手机      |             5 |    25000.00 |
-- | 笔记本    |             3 |    18000.00 |
-- | 平板      |             4 |    12000.00 |
-- | 【总计】  |            12 |    55000.00 |
-- +-----------+---------------+-------------+

-- 多列 ROLLUP:先按 year 小计,再总计
SELECT
    COALESCE(CAST(YEAR(order_date) AS CHAR), '合计') AS order_year,
    COALESCE(CAST(MONTH(order_date) AS CHAR), '小计') AS order_month,
    SUM(amount) AS revenue
FROM orders
GROUP BY YEAR(order_date), MONTH(order_date) WITH ROLLUP;

3 HAVING 过滤

HAVING 在分组之后过滤,而 WHERE 在分组之前过滤行。这是 SQL 执行顺序决定的。

WHERE vs HAVING

特性 WHERE HAVING
执行时机分组之前分组之后
过滤对象单行数据分组后的结果
能否用聚合函数不能
能否引用别名不能MySQL 允许

HAVING 使用示例

-- 找出平均价格超过 100 的类别
SELECT
    category,
    AVG(price) AS avg_price,
    COUNT(*) AS product_count
FROM products
GROUP BY category
HAVING avg_price > 100;

-- 找出下单超过 5 次的客户
SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING order_count >= 5
ORDER BY total_spent DESC;

-- WHERE 和 HAVING 同时使用
-- 先用 WHERE 过滤 2024 年的订单,再用 HAVING 筛选大客户
SELECT
    customer_id,
    COUNT(*) AS order_count,
    SUM(amount) AS total_spent
FROM orders
WHERE YEAR(order_date) = 2024
GROUP BY customer_id
HAVING total_spent > 10000
ORDER BY total_spent DESC;

SQL 执行顺序

理解执行顺序是掌握 WHERE 与 HAVING 的关键:

-- SQL 的逻辑执行顺序(非书写顺序):
-- 1. FROM       — 确定数据源
-- 2. WHERE      — 过滤行
-- 3. GROUP BY   — 分组
-- 4. HAVING     — 过滤分组
-- 5. SELECT     — 选择列 / 计算表达式
-- 6. DISTINCT   — 去重
-- 7. ORDER BY   — 排序
-- 8. LIMIT      — 限制行数

💡 性能提示

能用 WHERE 过滤的条件不要放在 HAVING 中。WHERE 先执行,减少了参与分组的数据量,效率更高。

4 窗口函数(MySQL 8.0+)

窗口函数在不减少结果行数的情况下进行聚合计算。与 GROUP BY 不同,窗口函数保留每一行的数据,同时添加统计列。语法:函数() OVER (PARTITION BY ... ORDER BY ...)

排名函数

-- ROW_NUMBER():顺序编号,不存在并列
-- RANK():并列时编号相同,后续编号跳过
-- DENSE_RANK():并列时编号相同,后续编号不跳过

SELECT
    name,
    department,
    salary,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
    RANK()       OVER (ORDER BY salary DESC) AS rank_num,
    DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank_num
FROM employees;

-- 结果示例(salary: 10000, 8000, 8000, 6000):
-- row_num: 1, 2, 3, 4
-- rank_num: 1, 2, 2, 4       (跳过了 3)
-- dense_rank_num: 1, 2, 2, 3  (不跳过)

PARTITION BY — 分组内计算

-- 每个部门内按工资排名
SELECT
    name,
    department,
    salary,
    RANK() OVER (
        PARTITION BY department
        ORDER BY salary DESC
    ) AS dept_rank
FROM employees;

-- 每个类别内的商品价格排名
SELECT
    product_name,
    category,
    price,
    DENSE_RANK() OVER (
        PARTITION BY category
        ORDER BY price DESC
    ) AS price_rank
FROM products;

聚合窗口函数

-- SUM() OVER — 累计求和(Running Total)
SELECT
    order_date,
    amount,
    SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM orders;

-- AVG() OVER — 每个部门的平均工资(附在每行)
SELECT
    name,
    department,
    salary,
    AVG(salary) OVER (PARTITION BY department) AS dept_avg,
    salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_avg
FROM employees;

-- COUNT() OVER — 累计计数
SELECT
    order_date,
    customer_id,
    COUNT(*) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS customer_order_seq
FROM orders;

LEAD / LAG — 前后行对比

-- LAG(column, offset, default):取前 N 行的值
-- LEAD(column, offset, default):取后 N 行的值

SELECT
    order_date,
    amount,
    LAG(amount, 1, 0) OVER (ORDER BY order_date) AS prev_amount,
    amount - LAG(amount, 1, 0) OVER (ORDER BY order_date) AS change_amount,
    LEAD(amount, 1) OVER (ORDER BY order_date) AS next_amount
FROM orders;

-- 环比增长率
SELECT
    order_month,
    monthly_revenue,
    LAG(monthly_revenue, 1) OVER (ORDER BY order_month) AS prev_month_revenue,
    ROUND(
        (monthly_revenue - LAG(monthly_revenue, 1) OVER (ORDER BY order_month))
        / LAG(monthly_revenue, 1) OVER (ORDER BY order_month) * 100,
        2
    ) AS growth_rate_pct
FROM (
    SELECT
        DATE_FORMAT(order_date, '%Y-%m') AS order_month,
        SUM(amount) AS monthly_revenue
    FROM orders
    GROUP BY DATE_FORMAT(order_date, '%Y-%m')
) monthly;

Frame 子句 — 精确控制窗口范围

-- ROWS BETWEEN 定义窗口的物理范围
-- 3 日移动平均
SELECT
    order_date,
    daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ) AS moving_avg_3day
FROM (
    SELECT order_date, SUM(amount) AS daily_revenue
    FROM orders
    GROUP BY order_date
) daily;

-- 常用 Frame 写法:
-- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW   从第一行到当前行(默认)
-- ROWS BETWEEN 2 PRECEDING AND CURRENT ROW            前 2 行 + 当前行
-- ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING             前 1 行 + 当前行 + 后 1 行
-- ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING    当前行到最后一行

🔄 MySQL vs PostgreSQL 窗口函数

MySQL 8.0 引入了窗口函数,PostgreSQL 从 8.4(2009 年)起就支持。PostgreSQL 提供更丰富的功能:

-- 两者通用语法相同
SELECT RANK() OVER (PARTITION BY dept ORDER BY salary DESC) FROM employees;

-- PostgreSQL 额外支持:
-- 1. FILTER 子句
--    SELECT COUNT(*) FILTER (WHERE status = 'active') OVER () FROM users;
-- 2. GROUPS frame 类型(基于值的分组)
--    GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING
-- 3. EXCLUDE 子句
--    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING EXCLUDE CURRENT ROW

-- MySQL 8.0 支持 ROWS 和 RANGE,但不支持 GROUPS 和 EXCLUDE

5 实战:销售报表

以下示例基于订单表和商品表,展示实际业务场景中的聚合查询。

月度销售汇总

SELECT
    DATE_FORMAT(o.order_date, '%Y-%m') AS month,
    COUNT(DISTINCT o.id) AS order_count,
    COUNT(DISTINCT o.customer_id) AS customer_count,
    SUM(oi.quantity) AS items_sold,
    SUM(oi.quantity * oi.unit_price) AS revenue,
    ROUND(SUM(oi.quantity * oi.unit_price) / COUNT(DISTINCT o.id), 2) AS avg_order_value
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
WHERE o.order_date >= '2024-01-01'
GROUP BY DATE_FORMAT(o.order_date, '%Y-%m')
ORDER BY month;

Top N 商品(每个类别)

-- 每个类别销量前 3 的商品
SELECT category, product_name, total_sold, category_rank
FROM (
    SELECT
        p.category,
        p.product_name,
        SUM(oi.quantity) AS total_sold,
        DENSE_RANK() OVER (
            PARTITION BY p.category
            ORDER BY SUM(oi.quantity) DESC
        ) AS category_rank
    FROM products p
    JOIN order_items oi ON p.id = oi.product_id
    GROUP BY p.category, p.product_name
) ranked
WHERE category_rank <= 3
ORDER BY category, category_rank;

累计销售额与移动平均

-- 每日累计销售额 + 7 日移动平均
SELECT
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_revenue,
    ROUND(AVG(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 2) AS moving_avg_7day
FROM (
    SELECT
        order_date,
        SUM(amount) AS daily_revenue
    FROM orders
    WHERE order_date >= '2024-01-01'
    GROUP BY order_date
) daily
ORDER BY order_date;

同比分析(Year-over-Year)

-- 月度同比对比
SELECT
    curr.month_label,
    curr.revenue AS current_year,
    prev.revenue AS previous_year,
    ROUND(
        (curr.revenue - prev.revenue) / prev.revenue * 100, 2
    ) AS yoy_growth_pct
FROM (
    SELECT
        MONTH(order_date) AS m,
        DATE_FORMAT(order_date, '%m月') AS month_label,
        SUM(amount) AS revenue
    FROM orders
    WHERE YEAR(order_date) = 2024
    GROUP BY MONTH(order_date), DATE_FORMAT(order_date, '%m月')
) curr
LEFT JOIN (
    SELECT
        MONTH(order_date) AS m,
        SUM(amount) AS revenue
    FROM orders
    WHERE YEAR(order_date) = 2023
    GROUP BY MONTH(order_date)
) prev ON curr.m = prev.m
ORDER BY curr.m;

💡 报表查询优化

对于频繁查询的报表,考虑以下优化方式:① 在 order_date 列上建立索引;② 对于历史数据,使用汇总表(summary table)预先计算结果;③ 复杂报表可创建视图简化查询。

6 本章要点

🔢 聚合函数

  • COUNT(*) 统计所有行,COUNT(col) 排除 NULL
  • SUM()AVG() 自动忽略 NULL 值
  • MIN() / MAX() 适用于数字、日期和字符串
  • COUNT(DISTINCT col) 统计不重复的非 NULL 值

📦 GROUP BY 分组

  • • SELECT 中非聚合列必须出现在 GROUP BY 中
  • • 支持多列分组和表达式分组
  • WITH ROLLUP 生成小计和总计行
  • HAVING 过滤分组,WHERE 过滤行

🪟 窗口函数

  • ROW_NUMBER()RANK()DENSE_RANK() 排名
  • PARTITION BY 分组 + ORDER BY 排序
  • LAG() / LEAD() 访问前后行
  • ROWS BETWEEN 定义窗口范围

📊 实战技巧

  • • 用窗口函数实现 Top N、累计求和、移动平均
  • • 同比 / 环比分析结合子查询或 LAG()
  • • 能用 WHERE 过滤的不要放到 HAVING
  • • 复杂报表考虑视图和汇总表优化