← Back to Index

Chapter 6: Aggregation

GROUP BY, aggregate functions & window functions

1. Aggregate Functions

Aggregate functions compute a single result from a set of rows. They are the foundation of data analysis in SQL. MySQL provides five core aggregate functions:

Function Description NULLs
COUNT()Counts number of rowsSee below
SUM()Totals numeric valuesIgnored
AVG()Averages numeric valuesIgnored
MIN()Finds minimum valueIgnored
MAX()Finds maximum valueIgnored

Let's work with a sample orders table:

CREATE TABLE orders (
    id          INT PRIMARY KEY AUTO_INCREMENT,
    customer_id INT NOT NULL,
    product     VARCHAR(100),
    amount      DECIMAL(10,2),
    quantity    INT,
    order_date  DATE
);

INSERT INTO orders (customer_id, product, amount, quantity, order_date) VALUES
(1, 'Laptop',     999.99, 1, '2024-01-15'),
(2, 'Mouse',       29.99, 3, '2024-01-16'),
(1, 'Keyboard',    79.99, 1, '2024-01-20'),
(3, 'Monitor',    349.99, 2, '2024-02-01'),
(2, 'Webcam',      59.99, 1, '2024-02-05'),
(1, 'Headset',     89.99, 1, '2024-02-10'),
(3, 'USB Hub',     NULL,  1, '2024-02-15'),
(4, 'SSD',        129.99, 2, '2024-03-01'),
(2, 'RAM',         64.99, 4, '2024-03-10'),
(1, 'Cable',        9.99, 5, '2024-03-15');

Basic Usage

-- Count all rows
SELECT COUNT(*) AS total_orders FROM orders;
-- Result: 10

-- Sum of all amounts (NULLs ignored)
SELECT SUM(amount) AS total_revenue FROM orders;

-- Average order amount
SELECT AVG(amount) AS avg_order FROM orders;

-- Cheapest and most expensive products
SELECT MIN(amount) AS cheapest, MAX(amount) AS most_expensive FROM orders;

-- Combine multiple aggregates
SELECT
    COUNT(*)       AS total_orders,
    SUM(amount)    AS total_revenue,
    AVG(amount)    AS avg_amount,
    MIN(order_date) AS first_order,
    MAX(order_date) AS last_order
FROM orders;

COUNT(*) vs COUNT(column)

Key Point: COUNT(*) vs COUNT(column)

COUNT(*) counts all rows, including those with NULL values. COUNT(column) counts only non-NULL values in that column. This distinction is critical when your data contains NULLs.

-- COUNT(*) counts all rows: 10
SELECT COUNT(*) AS all_rows FROM orders;

-- COUNT(amount) ignores the NULL row: 9
SELECT COUNT(amount) AS non_null_amounts FROM orders;

-- COUNT(DISTINCT ...) counts unique values
SELECT COUNT(DISTINCT customer_id) AS unique_customers FROM orders;
-- Result: 4

Using DISTINCT Inside Aggregates

-- Total quantity of distinct products ordered
SELECT COUNT(DISTINCT product) AS unique_products FROM orders;

-- Sum only distinct amounts (rarely used, but valid)
SELECT SUM(DISTINCT amount) AS sum_distinct FROM orders;

2. GROUP BY

GROUP BY divides rows into groups. Each group is then processed by an aggregate function to produce one output row per group.

Basic Grouping

-- Orders per customer
SELECT
    customer_id,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id;

-- Result:
-- customer_id | order_count | total_spent
-- 1           | 4           | 1179.96
-- 2           | 3           | 154.97
-- 3           | 2           | 349.99
-- 4           | 1           | 129.99

Grouping by Multiple Columns

-- Monthly orders per customer
SELECT
    customer_id,
    DATE_FORMAT(order_date, '%Y-%m') AS month,
    COUNT(*)    AS order_count,
    SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id, DATE_FORMAT(order_date, '%Y-%m')
ORDER BY customer_id, month;

GROUP BY with Expressions

-- Group by year and month
SELECT
    YEAR(order_date)  AS yr,
    MONTH(order_date) AS mo,
    SUM(amount)       AS monthly_revenue
FROM orders
GROUP BY YEAR(order_date), MONTH(order_date)
ORDER BY yr, mo;

-- Group by price range
SELECT
    CASE
        WHEN amount < 50   THEN 'Budget'
        WHEN amount < 200  THEN 'Mid-range'
        ELSE                    'Premium'
    END AS price_tier,
    COUNT(*) AS product_count,
    AVG(amount) AS avg_price
FROM orders
WHERE amount IS NOT NULL
GROUP BY price_tier
ORDER BY avg_price;

WITH ROLLUP

WITH ROLLUP generates subtotals and a grand total automatically:

SELECT
    customer_id,
    SUM(amount) AS total_spent,
    COUNT(*)    AS order_count
FROM orders
GROUP BY customer_id WITH ROLLUP;

-- The last row with customer_id = NULL is the grand total
-- customer_id | total_spent | order_count
-- 1           | 1179.96     | 4
-- 2           | 154.97      | 3
-- 3           | 349.99      | 2
-- 4           | 129.99      | 1
-- NULL        | 1814.91     | 10

-- Use GROUPING() to identify rollup rows (MySQL 8.0+)
SELECT
    IF(GROUPING(customer_id), 'Grand Total', customer_id) AS customer,
    SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id WITH ROLLUP;

3. HAVING

HAVING filters groups after aggregation. Think of it as WHERE for groups.

Clause Filters Timing Can Use Aggregates?
WHERE Individual rows Before grouping No
HAVING Groups After grouping Yes
-- Customers who have placed more than 2 orders
SELECT
    customer_id,
    COUNT(*) AS order_count
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 2;

-- Customers who spent more than $200
SELECT
    customer_id,
    SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 200;

-- Combine WHERE and HAVING:
-- Among 2024 orders, find customers with average order > $100
SELECT
    customer_id,
    AVG(amount) AS avg_order,
    COUNT(*)    AS order_count
FROM orders
WHERE order_date >= '2024-01-01'
  AND amount IS NOT NULL
GROUP BY customer_id
HAVING AVG(amount) > 100
ORDER BY avg_order DESC;

SQL execution order matters here:

-- Logical execution order:
-- 1. FROM     - pick the table
-- 2. WHERE    - filter rows
-- 3. GROUP BY - form groups
-- 4. HAVING   - filter groups
-- 5. SELECT   - compute output columns
-- 6. ORDER BY - sort results
-- 7. LIMIT    - restrict result count

4. Window Functions (MySQL 8.0+)

Window functions perform calculations across a set of rows related to the current row, without collapsing the result into a single row like GROUP BY does. They use the OVER() clause.

-- Syntax:
-- function_name() OVER (
--     [PARTITION BY column, ...]
--     [ORDER BY column [ASC|DESC], ...]
--     [frame_clause]
-- )

ROW_NUMBER(), RANK(), DENSE_RANK()

These three ranking functions number rows within each partition. The difference appears when there are ties:

-- Sample data for ranking demonstration
CREATE TABLE exam_scores (
    student VARCHAR(50),
    subject VARCHAR(50),
    score   INT
);

INSERT INTO exam_scores VALUES
('Alice', 'Math', 95), ('Bob',   'Math', 90),
('Carol', 'Math', 90), ('Dave',  'Math', 85),
('Alice', 'English', 88), ('Bob', 'English', 92),
('Carol', 'English', 92), ('Dave', 'English', 78);

-- Compare all three ranking functions
SELECT
    subject,
    student,
    score,
    ROW_NUMBER() OVER (PARTITION BY subject ORDER BY score DESC) AS row_num,
    RANK()       OVER (PARTITION BY subject ORDER BY score DESC) AS rnk,
    DENSE_RANK() OVER (PARTITION BY subject ORDER BY score DESC) AS dense_rnk
FROM exam_scores;

-- For Math subject:
-- student | score | row_num | rnk | dense_rnk
-- Alice   | 95    | 1       | 1   | 1
-- Bob     | 90    | 2       | 2   | 2
-- Carol   | 90    | 3       | 2   | 2     (tie!)
-- Dave    | 85    | 4       | 4   | 3     (rank skips 3; dense_rank doesn't)

SUM() OVER, AVG() OVER

Aggregate functions can be used as window functions to compute running totals, averages, etc.:

-- Running total by order date
SELECT
    id,
    order_date,
    amount,
    SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM orders
WHERE amount IS NOT NULL
ORDER BY order_date;

-- Per-customer running total
SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) AS customer_running_total
FROM orders
WHERE amount IS NOT NULL;

-- Percentage of total
SELECT
    product,
    amount,
    SUM(amount) OVER () AS grand_total,
    ROUND(amount / SUM(amount) OVER () * 100, 2) AS pct_of_total
FROM orders
WHERE amount IS NOT NULL;

PARTITION BY and ORDER BY

-- PARTITION BY divides rows into groups (like GROUP BY, but keeps all rows)
-- ORDER BY defines the order within each partition

SELECT
    customer_id,
    product,
    amount,
    -- Average per customer
    AVG(amount) OVER (PARTITION BY customer_id) AS customer_avg,
    -- Rank within customer by amount
    RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS price_rank,
    -- Overall rank
    RANK() OVER (ORDER BY amount DESC) AS overall_rank
FROM orders
WHERE amount IS NOT NULL;

LEAD() and LAG()

Access data from the next or previous row without self-joining:

-- Compare each order with the previous one
SELECT
    order_date,
    amount,
    LAG(amount, 1)  OVER (ORDER BY order_date) AS prev_amount,
    LEAD(amount, 1) OVER (ORDER BY order_date) AS next_amount,
    amount - LAG(amount, 1) OVER (ORDER BY order_date) AS diff_from_prev
FROM orders
WHERE amount IS NOT NULL
ORDER BY order_date;

-- With default values for first/last rows
SELECT
    order_date,
    amount,
    LAG(amount, 1, 0)  OVER (ORDER BY order_date) AS prev_amount,
    LEAD(amount, 1, 0) OVER (ORDER BY order_date) AS next_amount
FROM orders
WHERE amount IS NOT NULL;

-- FIRST_VALUE and LAST_VALUE
SELECT
    customer_id,
    product,
    amount,
    FIRST_VALUE(product) OVER (
        PARTITION BY customer_id ORDER BY order_date
    ) AS first_product_bought,
    LAST_VALUE(product) OVER (
        PARTITION BY customer_id ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_product_bought
FROM orders;

Frame Clause: ROWS BETWEEN

The frame clause defines exactly which rows relative to the current row are included in the window calculation:

-- Syntax:
-- ROWS BETWEEN <start> AND <end>
--
-- Options for start/end:
--   UNBOUNDED PRECEDING  (first row of partition)
--   n PRECEDING           (n rows before current)
--   CURRENT ROW
--   n FOLLOWING            (n rows after current)
--   UNBOUNDED FOLLOWING   (last row of partition)

-- 3-row moving average
SELECT
    order_date,
    amount,
    AVG(amount) OVER (
        ORDER BY order_date
        ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
    ) AS moving_avg_3
FROM orders
WHERE amount IS NOT NULL;

-- Cumulative sum (default when ORDER BY is present)
SELECT
    order_date,
    amount,
    SUM(amount) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_sum
FROM orders
WHERE amount IS NOT NULL;

-- Sum of current + next 2 rows
SELECT
    order_date,
    amount,
    SUM(amount) OVER (
        ORDER BY order_date
        ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
    ) AS forward_sum
FROM orders
WHERE amount IS NOT NULL;

Named Windows (WINDOW Clause)

-- Avoid repeating window definitions
SELECT
    customer_id,
    order_date,
    amount,
    SUM(amount)   OVER w AS running_total,
    AVG(amount)   OVER w AS running_avg,
    COUNT(*)      OVER w AS running_count
FROM orders
WHERE amount IS NOT NULL
WINDOW w AS (PARTITION BY customer_id ORDER BY order_date)
ORDER BY customer_id, order_date;

MySQL vs PostgreSQL: Window Functions

MySQL 8.0+: Supports all standard window functions. Added in 8.0 (2018). No support for RANGE with interval offsets (e.g., RANGE BETWEEN INTERVAL 7 DAY PRECEDING).

PostgreSQL: Full window function support since version 8.4 (2009). Supports RANGE with interval types, GROUPS frame mode, and FILTER (WHERE ...) clause on aggregates. Also supports EXCLUDE in frame specs.

Tip: If you need FILTER (WHERE ...) in MySQL, use SUM(CASE WHEN ... END) as a workaround.

5. Practical Example: Sales Report

Let's build a comprehensive sales reporting system using a realistic dataset:

-- Setup: products and sales tables
CREATE TABLE products (
    id          INT PRIMARY KEY AUTO_INCREMENT,
    name        VARCHAR(100) NOT NULL,
    category    VARCHAR(50)  NOT NULL,
    price       DECIMAL(10,2) NOT NULL
);

CREATE TABLE sales (
    id          INT PRIMARY KEY AUTO_INCREMENT,
    product_id  INT NOT NULL,
    quantity    INT NOT NULL,
    sale_date   DATE NOT NULL,
    FOREIGN KEY (product_id) REFERENCES products(id)
);

INSERT INTO products (name, category, price) VALUES
('Laptop Pro',    'Electronics', 1299.99),
('Wireless Mouse','Electronics',   29.99),
('Desk Lamp',    'Office',        45.99),
('Notebook Set', 'Office',        12.99),
('Headphones',   'Electronics',  199.99),
('Monitor 27"',  'Electronics',  449.99),
('Pen Pack',     'Office',         8.99),
('Keyboard',     'Electronics',   79.99);

INSERT INTO sales (product_id, quantity, sale_date) VALUES
(1, 2, '2024-01-05'), (2, 10, '2024-01-08'), (3, 5, '2024-01-12'),
(4, 20, '2024-01-15'), (5, 3, '2024-01-20'), (1, 1, '2024-02-02'),
(6, 4, '2024-02-10'), (2, 8, '2024-02-14'), (7, 30, '2024-02-18'),
(3, 7, '2024-02-22'), (8, 6, '2024-03-01'), (1, 3, '2024-03-05'),
(5, 5, '2024-03-10'), (4, 15, '2024-03-15'), (6, 2, '2024-03-20'),
(2, 12, '2024-03-25'), (7, 25, '2024-04-01'), (8, 4, '2024-04-05'),
(3, 8, '2024-04-10'), (1, 2, '2024-04-15');

Monthly Sales Summary

SELECT
    DATE_FORMAT(s.sale_date, '%Y-%m') AS month,
    COUNT(DISTINCT s.id)              AS total_transactions,
    SUM(s.quantity)                   AS units_sold,
    SUM(s.quantity * p.price)         AS revenue,
    ROUND(AVG(s.quantity * p.price), 2) AS avg_transaction_value
FROM sales s
JOIN products p ON s.product_id = p.id
GROUP BY DATE_FORMAT(s.sale_date, '%Y-%m')
ORDER BY month;

-- Add month-over-month growth using window functions
SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
    ROUND(
        (revenue - LAG(revenue) OVER (ORDER BY month))
        / LAG(revenue) OVER (ORDER BY month) * 100,
        2
    ) AS growth_pct
FROM (
    SELECT
        DATE_FORMAT(s.sale_date, '%Y-%m') AS month,
        SUM(s.quantity * p.price)         AS revenue
    FROM sales s
    JOIN products p ON s.product_id = p.id
    GROUP BY DATE_FORMAT(s.sale_date, '%Y-%m')
) monthly_data
ORDER BY month;

Top N Products per Category

-- Top 2 products by revenue in each category
SELECT category, name, total_revenue, rnk
FROM (
    SELECT
        p.category,
        p.name,
        SUM(s.quantity * p.price) AS total_revenue,
        ROW_NUMBER() OVER (
            PARTITION BY p.category
            ORDER BY SUM(s.quantity * p.price) DESC
        ) AS rnk
    FROM sales s
    JOIN products p ON s.product_id = p.id
    GROUP BY p.category, p.name
) ranked
WHERE rnk <= 2
ORDER BY category, rnk;

-- With DENSE_RANK to handle ties
SELECT category, name, total_revenue, rnk
FROM (
    SELECT
        p.category,
        p.name,
        SUM(s.quantity * p.price) AS total_revenue,
        DENSE_RANK() OVER (
            PARTITION BY p.category
            ORDER BY SUM(s.quantity * p.price) DESC
        ) AS rnk
    FROM sales s
    JOIN products p ON s.product_id = p.id
    GROUP BY p.category, p.name
) ranked
WHERE rnk <= 2
ORDER BY category, rnk;

Running Total and Moving Average

-- Daily running total and 3-day moving average
SELECT
    sale_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total,
    ROUND(
        AVG(daily_revenue) OVER (
            ORDER BY sale_date
            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
        ), 2
    ) AS moving_avg_3day
FROM (
    SELECT
        s.sale_date,
        SUM(s.quantity * p.price) AS daily_revenue
    FROM sales s
    JOIN products p ON s.product_id = p.id
    GROUP BY s.sale_date
) daily_data
ORDER BY sale_date;

-- Category-level cumulative revenue
SELECT
    p.category,
    s.sale_date,
    SUM(s.quantity * p.price) AS daily_revenue,
    SUM(SUM(s.quantity * p.price)) OVER (
        PARTITION BY p.category
        ORDER BY s.sale_date
    ) AS category_running_total
FROM sales s
JOIN products p ON s.product_id = p.id
GROUP BY p.category, s.sale_date
ORDER BY p.category, s.sale_date;

Complete Dashboard Query

-- A single query combining multiple analytics
SELECT
    p.category,
    p.name AS product,
    SUM(s.quantity)           AS units_sold,
    SUM(s.quantity * p.price) AS revenue,
    ROUND(
        SUM(s.quantity * p.price)
        / SUM(SUM(s.quantity * p.price)) OVER () * 100,
        2
    ) AS revenue_share_pct,
    RANK() OVER (ORDER BY SUM(s.quantity * p.price) DESC) AS overall_rank,
    RANK() OVER (
        PARTITION BY p.category
        ORDER BY SUM(s.quantity * p.price) DESC
    ) AS category_rank
FROM sales s
JOIN products p ON s.product_id = p.id
GROUP BY p.category, p.name
ORDER BY revenue DESC;

6. Chapter Summary

Aggregate Functions

  • COUNT, SUM, AVG, MIN, MAX are the five core aggregates
  • COUNT(*) counts all rows; COUNT(col) skips NULLs
  • Use DISTINCT inside aggregates for unique values

GROUP BY & HAVING

  • GROUP BY collapses rows into groups
  • WITH ROLLUP adds subtotals and a grand total
  • HAVING filters groups; WHERE filters rows

Window Functions

  • ROW_NUMBER, RANK, DENSE_RANK for ranking
  • LEAD, LAG to access adjacent rows
  • Frame clauses like ROWS BETWEEN control the window
  • WINDOW clause to name and reuse definitions

Best Practices

  • Filter with WHERE first to reduce grouping work
  • Use window functions to avoid correlated subqueries
  • Name your windows when reusing the same definition
  • Watch for NULLs in aggregate calculations