Understanding SQL Window Functions: A Comprehensive Guide

SQL Window Functions are powerful tools that allow users to perform calculations across a set of table rows related to the current row.

What is a Window Function?

A window function performs a calculation across a set of table rows that are somehow related to the current row. This set of rows is referred to as the “window” and is defined using the OVER clause. Window functions are versatile and can be used for a wide range of analytical tasks.

Unlike aggregate functions, which return a single result for a group of rows, window functions can return multiple rows for each partition. This makes them incredibly useful for running totals, moving averages, and rank calculations.

window functions

Syntax

The general syntax for a window function is:

<function_name>(<expression>) OVER (
    [PARTITION BY <expression>]
    [ORDER BY <expression>]
    [ROWS/RANGE BETWEEN <start> AND <end>]
)
SQL

  • <function_name>: The window function to be used (e.g., ROW_NUMBER, RANK, SUM, AVG).
  • <expression>: The column or expression to be used in the function.
  • PARTITION BY: Divides the result set into partitions to which the window function is applied.
  • ORDER BY: Defines the order of the rows within each partition.
  • ROWS/RANGE BETWEEN: Specifies the window frame for the function.

Common Window Functions

window functions

Sample Data to practice with

CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    department_id INT,
    salary INT,
    hire_date DATE
);



INSERT INTO employees (employee_id, first_name, last_name, department_id, salary, hire_date) VALUES
(1, 'John', 'Doe', 101, 70000, '2018-04-12'),
(2, 'Jane', 'Smith', 101, 85000, '2017-03-09'),
(3, 'Alice', 'Johnson', 102, 60000, '2020-06-01'),
(4, 'Bob', 'Lee', 101, 85000, '2021-11-15'),
(5, 'Charlie', 'Brown', 103, 95000, '2016-07-23'),
(6, 'Eva', 'Green', 102, 67000, '2019-08-11'),
(7, 'David', 'Clark', 103, 95000, '2015-05-30'),
(8, 'Fiona', 'White', 101, 72000, '2020-01-10'),
(9, 'George', 'Black', 102, 60000, '2022-02-22'),
(10, 'Hannah', 'Scott', 103, 90000, '2018-12-03');
SQL

Get Compete data set

Sample Table

employee_idfirst_namelast_namedepartment_idsalaryhire_date
1JohnDoe101700002018-04-12
2JaneSmith101850002017-03-09
3AliceJohnson102600002020-06-01
4BobLee101850002021-11-15
5CharlieBrown103950002016-07-23
6EvaGreen102670002019-08-11
7DavidClark103950002015-05-30
8FionaWhite101720002020-01-10
9GeorgeBlack102600002022-02-22
10HannahScott103900002018-12-03

ROW_NUMBER()

The ROW_NUMBER function assigns a unique sequential integer to rows within a partition of a result set, starting at 1 for the first row in each partition.

SELECT employee_id,first_name, last_name, department_id, salary, ROW_NUMBER() OVER 
(PARTITION BY department_id ORDER BY salary DESC) AS row_num FROM employees;

SQL

Result:

employee_idfirst_namelast_namedepartment_idsalaryrow_num
4BobLee101850001
2JaneSmith101850002
8FionaWhite101720003
1JohnDoe101700004
6EvaGreen102670001
9GeorgeBlack102600002
3AliceJohnson102600003
5CharlieBrown103950001
7DavidClark103950002
10HannahScott103900003

RANK() and DENSE_RANK()

The RANK function assigns a rank to each row within a partition, with gaps in the ranking values when there are ties. DENSE_RANK assigns ranks without gaps.

SELECT employee_id, first_name, last_name, department_id, salary, RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rank, DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS dense_rank FROM employees;
SQL

Result:

employee_idfirst_namelast_namedepartment_idsalaryrankdense_rank
4BobLee1018500011
2JaneSmith1018500011
8FionaWhite1017200032
1JohnDoe1017000043
6EvaGreen1026700011
3AliceJohnson1026000022
9GeorgeBlack1026000022
7DavidClark1039500011
5CharlieBrown1039500011
10HannahScott1039000032

Explanation:

  • RANK() leaves gaps when there are ties.
  • DENSE_RANK() gives the same rank to ties but does not skip ranks.

For example:

  • In department 101, both Jane and Bob earn 85000 → RANK = 1, next gets RANK = 3.
  • But DENSE_RANK = 1, and next gets DENSE_RANK = 2.

How Ties Are Determined:

A “tie” happens when two or more rows have identical values in the ORDER BY column(s) within their partition.

So here:

  • Rows are first partitioned by department_id, so employees are grouped by department.
  • Then within each department group, rows are ordered by salary DESC.
  • If multiple employees have the same salary in the same department, they are considered a tie.

SQL RANK() versus ROW_NUMBER()

  • RANK and DENSE_RANK are deterministic, all rows with the same value for both the ordering and partitioning columns will end up with an equal result, whereas ROW_NUMBER will arbitrarily (non-deterministically) assign an incrementing result to the tied
  • ROW_NUMBER: Returns a unique number for each row starting with 1. For rows that have duplicate values, numbers are arbitrarily assigned.
  • Rank: Assigns a unique number for each row starting with 1, except for rows that have duplicate values, in which case the same ranking is assigned and a gap appears in the sequence for each duplicate ranking.

LAG() and LEAD()

The LAG function provides access to a row at a specified physical offset before the current row within the result set. LEAD does the same but for the row after the current row.

LAG() Function

LAG() provides access to a previous row’s value in the result set without using self-joins.

Syntax:

LAG(column_name, offset, default_value) OVER (PARTITION BYORDER BY …)
SQL

  • column_name: The column you want to look back on.
  • offset: How many rows back you want to go (default is 1).
  • default_value: What to return if there’s no previous row (default is NULL).

Use case:

  • Compare each employee’s salary with the one ranked before them (e.g., to find salary differences).

LEAD() Function

LEAD() returns the value of a subsequent (next) row in the result set.

Syntax:

LEAD(column_name, offset, default_value) OVER (PARTITION BYORDER BY …)
SQL

Use case:

  • Compare each employee’s value with the next row (e.g., find the next salary in the department).

SELECT employee_id,first_name,last_name,department_id,salary, 
  LAG(salary, 1) OVER (PARTITION BY department_id ORDER BY salary DESC) AS previous_salary, 
  LEAD(salary, 1) OVER (PARTITION BY department_id ORDER BY salary DESC) AS next_salary
FROM employees;
SQL

Result:

employee_idfirst_namelast_namedepartment_idsalaryprevious_salarynext_salary
2JaneSmith10185000NULL85000
4BobLee101850008500072000
8FionaWhite101720008500070000
1JohnDoe1017000072000NULL
6EvaGreen10267000NULL60000
3AliceJohnson102600006700060000
9GeorgeBlack1026000060000NULL
5CharlieBrown10395000NULL95000
7DavidClark103950009500090000
10HannahScott1039000095000NULL

Explanation:

  • LAG(salary, 1) gives the salary of the previous row in the department (by descending salary).
  • LEAD(salary, 1) gives the salary of the next row in the department (by descending salary).
  • NULL appears when there’s no previous or next row in that group.

Application

Example 1: Trend Analysis (e.g., Salary Changes, Stock Prices)

  • Use: Compare current value with previous or next row.
  • Example: Check if an employee’s salary increased or decreased over time.

Data set

CREATE TABLE employee_salaries (
    employee_id INT,
    salary INT,
    effective_date DATE
);

INSERT INTO employee_salaries (employee_id, salary, effective_date) VALUES
(1, 60000, '2019-01-01'),
(1, 65000, '2020-01-01'),
(1, 70000, '2021-01-01'),
(2, 80000, '2019-01-01'),
(2, 85000, '2021-01-01'),
(3, 50000, '2020-06-01'),
(3, 55000, '2021-06-01'),
(3, 60000, '2022-06-01');
SQL

Query

SELECT
  employee_id,
  effective_date,
  salary,
  LAG(salary) OVER (PARTITION BY employee_id ORDER BY effective_date) AS previous_salary,
  salary - LAG(salary) OVER (PARTITION BY employee_id ORDER BY effective_date) AS salary_change
FROM employee_salaries;
SQL

Output

employee_ideffective_datesalaryprevious_salarysalary_change
12019-01-0160000NULLNULL
12020-01-0165000600005000
12021-01-0170000650005000
22019-01-0180000NULLNULL
22021-01-0185000800005000
32020-06-0150000NULLNULL
32021-06-0155000500005000
32022-06-0160000550005000

Example 2: Detecting Gaps or Anomalies

  • Use: Identify missing records or inconsistencies.
  • Example: In a time series, use LAG() to find skipped dates or missing transactions.

CREATE TABLE user_logins (
  user_id INT,
  login_date DATE
);


INSERT INTO user_logins (user_id, login_date) VALUES
(1, '2024-01-01'),
(1, '2024-01-02'),
(1, '2024-01-04'),  -- gap: missing 2024-01-03
(1, '2024-01-05'),
(1, '2024-01-08');  -- gap: missing 2024-01-06, 2024-01-07
SQL

Query

SELECT
  user_id,
  login_date,
  LAG(login_date) OVER (PARTITION BY user_id ORDER BY login_date) AS previous_login,
  DATEDIFF(login_date, LAG(login_date) OVER (PARTITION BY user_id ORDER BY login_date)) AS gap
FROM user_logins;
SQL

Output

user_idlogin_dateprevious_logingap
12024-01-01NULLNULL
12024-01-022024-01-011
12024-01-042024-01-022 ← gap (2024-01-03 missing)
12024-01-052024-01-041
12024-01-082024-01-053 ← gap (2024-01-06 & 07 missing)

Example 3: Customer Retention / Churn Analysis

  • Use: Find the time between customer interactions (purchases, logins, etc.).
  • Example: Use LAG() to calculate days since last purchase.

Data sample

CREATE TABLE purchases (
    customer_id INT,
    purchase_date DATE
);
INSERT INTO purchases (customer_id, purchase_date) VALUES
(1, '2023-01-05'),
(1, '2023-02-15'),
(1, '2023-03-22'),
(2, '2023-01-10'),
(2, '2023-06-25'),
(3, '2023-03-05');
SQL

Query

SELECT
  customer_id,
  purchase_date,
  LAG(purchase_date) OVER (
    PARTITION BY customer_id 
    ORDER BY purchase_date
  ) AS previous_purchase,
  DATEDIFF(
    purchase_date,
    LAG(purchase_date) OVER (
      PARTITION BY customer_id 
      ORDER BY purchase_date
    )
  ) AS days_since_last_purchase
FROM purchases;
SQL

Output

customer_idpurchase_dateprevious_purchasedays_since_last_purchase
12023-01-05NULLNULL
12023-02-152023-01-0541
12023-03-222023-02-1535
22023-01-10NULLNULL
22023-06-252023-01-10166
32023-03-05NULLNULL

Moving Average

A moving average calculates the average of a set of values over a sliding window of rows. It’s often used to smooth out short-term fluctuations and highlight trends.

Sample data

CREATE TABLE daily_sales (
    region VARCHAR(20),
    sale_date DATE,
    sales_amount INT
);

INSERT INTO daily_sales (region, sale_date, sales_amount) VALUES
('East', '2023-05-01', 100),
('East', '2023-05-02', 120),
('East', '2023-05-03', 90),
('East', '2023-05-04', 130),
('East', '2023-05-05', 110),
('West', '2023-05-01', 90),
('West', '2023-05-02', 130),
('West', '2023-05-03', 115),
('West', '2023-05-04', 140),
('West', '2023-05-05', 120);
SQL

Query

SELECT
  region,
  sale_date,
  sales_amount,
  AVG(sales_amount) OVER (
    PARTITION BY region
    ORDER BY sale_date
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS moving_avg_3_days
FROM daily_sales;
SQL

Output

sale_datesales_amountmoving_avg_3_days
2023-05-01100100.0
2023-05-02120110.0
2023-05-0390103.3
2023-05-04130113.3
2023-05-05110110.0

Cumulative Distribution

Calculating the cumulative distribution of salaries within each department.

SELECT employee_id, first_name, last_name, department_id, salary, CUME_DIST() OVER (PARTITION BY department_id ORDER BY salary) AS cum_dist FROM employees;
SQL

Result:

employee_idfirst_namelast_namedepartment_idsalarycum_dist
1JohnDoe101700000.2500000000
8FionaWhite101720000.5000000000
4BobLee101850001.0000000000
2JaneSmith101850001.0000000000
9GeorgeBlack102600000.6666666667
3AliceJohnson102600000.6666666667
6EvaGreen102670001.0000000000
10HannahScott103900000.3333333333
5CharlieBrown103950001.0000000000
7DavidClark103950001.0000000000

Explanation

Assume department 101 has 4 employees with salaries:

  • 70000
  • 72000
  • 85000
  • 85000

CUME_DIST() for each salary:

salaryCUME_DIST()Explanation
700001/4 = 0.25One out of four salaries ≤ 70000
720002/4 = 0.50Two out of four salaries ≤ 72000
850004/4 = 1.00Four out of four salaries ≤ 85000 (both tied rows)

What the output tells you:

  • The cum_dist value is a percentile rank of each employee’s salary within their department.
  • A value close to 0 means the employee’s salary is among the lowest in their department.
  • A value close to 1 means the employee’s salary is among the highest.

Why is this useful?

  • You can understand how an employee’s salary compares to peers in the same department.
  • Useful for salary benchmarking, compensation analysis, or percentile-based reports.

Common Window functions

PERCENT_RANK()

Calculates the relative rank of a row as a percentage between 0 and 1.

Formula:

Percent rank = rank−1/total_rows−1
SQL

  • Use case: To get percentile rank of a value.
  • Example: Employees’ salaries percentile within a department.

FIRST_VALUE()

Returns the first value in the window frame. To get the earliest or minimum value according to ordering.

Example:

FIRST_VALUE(salary) OVER (PARTITION BY department_id ORDER BY hire_date)
SQL

LAST_VALUE()

Returns the last value in the window frame.

Note: Be careful, default frame includes rows up to current row; to get last overall value, frame specification might be needed.

Use case: Get the latest or maximum value in a partition.

Example:

LAST_VALUE(salary) OVER (PARTITION BY department_id ORDER BY hire_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
SQL

NTH_VALUE()

Returns the nth value in the window frame.

Use case: Get a specific value from the sorted partition.

Example:

NTH_VALUE(salary, 2) OVER (PARTITION BY department_id ORDER BY hire_date)
SQL

Aggregate functions as window functions

You can also use regular aggregates as window functions, like:

  • SUM()
  • AVG()
  • MIN()
  • MAX()
  • COUNT()

Query

SUM(salary) OVER (PARTITION BY department_id ORDER BY hire_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
SQL

Calculates running total of salary within each department ordered by hire date.

Common use cases

  • Ranking and Leaderboards
    • Use: Show top performers, ranks by sales, scores, performance, etc.
    • Functions: RANK(), DENSE_RANK(), ROW_NUMBER()
    • Example: Top 3 earners per department.
  • Running Totals / Cumulative Sums
    • Cumulative sales over time per customer or region.
    • Running balance in a bank account.
    • Cumulative profit/loss in financial reports.
    • Total number of signups up to each day.
    • Tracking cumulative expenses in a budget.
    • Inventory tracking (stock added vs. sold over time).
    • User activity streaks (e.g., consecutive login days).
    • Points accumulated in games or reward systems.
    • Project progress tracking (e.g., tasks completed over time).
    • Employee work hours logged over time.
  • First / Last Value per Group : FIRST_VALUE(), LAST_VALUE()
    • Get the earliest hire date or join date of employees in each department
    • Find the first purchase date for each customer
    • Retrieve the latest transaction or activity date per user
    • Identify the initial or most recent status update in workflows
    • Extract the opening or closing price of a stock per trading day or period
  • De-duplication / Keeping Latest Entry :ROW_NUMBER() in a CTE
    • Selecting the most recent record per user, customer, or entity (e.g., latest login, last purchase, newest update) to avoid duplicates and keep only the latest data entry.

De duplication

Latest Entry

SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_date DESC) AS rn
    FROM user_events
) t
WHERE rn = 1;
SQL

WITH RankedEntries AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_date DESC) AS rn
    FROM user_events
)
DELETE FROM user_events
WHERE user_id IN (
    SELECT user_id FROM RankedEntries WHERE rn > 1
)
AND event_date IN (
    SELECT event_date FROM RankedEntries WHERE rn > 1
);
SQL

Explanation

Step 1

  • Creates a CTE (Common Table Expression) named RankedEntries.
  • For each user_id, it assigns a row number rn starting at 1 for the most recent event_date (because of ORDER BY event_date DESC).
  • So for each user, the latest event gets rn = 1, the second latest rn = 2, and so on.

Step 2

  • This part deletes rows from the original user_events table.
  • It deletes rows where the user_id and event_date match any rows in RankedEntries where rn > 1 — i.e., all but the latest event for each user.
  • So effectively, it keeps the latest entry (rn = 1) for each user and deletes the rest (duplicates or older entries).

Conclusion

SQL Window Functions are incredibly powerful tools for performing complex calculations across a set of table rows related to the current row. By using the OVER clause, these functions can provide insights and analytics that are difficult to achieve with standard SQL queries. Understanding and leveraging these functions can greatly enhance your ability to perform data analysis and gain meaningful insights from your datasets.

Use the examples provided in this article to experiment with window functions and see how they can be applied to your specific use cases. With practice, you’ll be able to harness the full power of SQL Window Functions to perform advanced data analysis with ease.

Resource

2 thoughts on “Understanding SQL Window Functions: A Comprehensive Guide”

Leave a Comment