Population Stability Index (PSI) Metrics
Population Stability Index (PSI) metrics let you quantify how much a feature’s distribution has shifted over time by comparing it either to a fixed reference dataset (like training data) or to a recent rolling baseline (e.g., last 30 days), with PSI scores computed per time bucket so you can detect both long-term drift and short-term anomalies in production.
Arthur supports implementing Population Stability Index (PSI) as custom metrics in two main ways:
- PSI vs Fixed Reference Dataset — compare your live data to a fixed baseline (e.g., training data).
- PSI vs 30-Day Rolling Baseline — compare each time bucket to a recent window of past data.
Both variants measure how much a feature’s distribution has shifted using PSI:
PSI is defined as:
PSI = Σ_b (cur_b − ref_b) · ln(cur_b / ref_b)
Where, for each bin b:
- ref_b = proportion of the reference distribution in that bin
- cur_b = proportion of the current distribution in that bin
PSI vs Fixed Reference Dataset (Numeric Feature)
Overview
This custom metric computes PSI per 5-minute bucket for a numeric feature by comparing:
- Reference: a fixed dataset (e.g., training or holdout)
- Current: your live / inference dataset in Arthur
It:
- Uses the reference dataset to define min/max and bin boundaries
- Computes reference bin proportions once
- Computes current bin proportions per 5-minute bucket
- Aggregates to PSI per bucket
This is ideal when you want a stable baseline for regulatory/compliance or long-term drift monitoring.
Step 1: Write the SQL
This SQL assumes:
{{reference_dataset}}— baseline dataset/table (e.g., training){{dataset}}— current/live dataset{{timestamp_col}}— timestamp column on the current dataset{{feature_col}}— numeric feature to monitor{{num_bins}}— number of bins (e.g., 10 or 20)
SELECT
bucket,
SUM((p_cur - p_ref) * LN(p_cur / p_ref)) AS psi_against_reference
FROM
(
----------------------------------------------------------------------------
-- 0) Compute ref_min and ref_max ONCE and reuse them everywhere
----------------------------------------------------------------------------
SELECT
bucket,
bin_id,
-- sanitized proportions
GREATEST(p_cur_raw, 1e-6) AS p_cur,
GREATEST(p_ref_raw, 1e-6) AS p_ref
FROM
(
----------------------------------------------------------------------------
-- Join reference + current distributions
----------------------------------------------------------------------------
SELECT
c.bucket,
c.bin_id,
COALESCE(c.p_cur, 0) AS p_cur_raw,
COALESCE(r.p_ref, 0) AS p_ref_raw
FROM
(
----------------------------------------------------------------------------
-- Current distribution (per 5-minute bucket)
----------------------------------------------------------------------------
SELECT
bucket,
bin_id,
cur_count / NULLIF(
SUM(cur_count) OVER (
PARTITION BY
bucket
),
0
) AS p_cur
FROM
(
SELECT
time_bucket (INTERVAL '5 minutes', d.{{timestamp_col}}) AS bucket,
CASE
WHEN ref_max = ref_min THEN 1
ELSE LEAST(
{{num_bins}},
GREATEST(
1,
CAST(
FLOOR(
({{feature_col}} - ref_min) / NULLIF(ref_max - ref_min, 0) * {{num_bins}}
) AS INTEGER
) + 1
)
)
END AS bin_id,
COUNT(*)::float AS cur_count
FROM
{{dataset}} AS d,
(
SELECT
MIN({{feature_col}})::float AS ref_min,
MAX({{feature_col}})::float AS ref_max
FROM
{{reference_dataset}}
) AS rs
GROUP BY
bucket,
bin_id,
ref_min,
ref_max
) cur_raw
) c
LEFT JOIN (
----------------------------------------------------------------------------
-- Reference distribution (global)
----------------------------------------------------------------------------
SELECT
bin_id,
ref_count / NULLIF(SUM(ref_count) OVER (), 0) AS p_ref
FROM
(
SELECT
CASE
WHEN ref_max = ref_min THEN 1
ELSE LEAST(
{{num_bins}},
GREATEST(
1,
CAST(
FLOOR(
({{feature_col}} - ref_min) / NULLIF(ref_max - ref_min, 0) * {{num_bins}}
) AS INTEGER
) + 1
)
)
END AS bin_id,
COUNT(*)::float AS ref_count
FROM
{{reference_dataset}} r,
(
SELECT
MIN({{feature_col}})::float AS ref_min,
MAX({{feature_col}})::float AS ref_max
FROM
{{reference_dataset}}
) AS rs
GROUP BY
bin_id,
ref_min,
ref_max
) ref_bins
) r USING (bin_id)
) psi_terms
) psi_final
GROUP BY
bucket
ORDER BY
bucket;What this query is doing
ref_stats— computeref_minandref_maxfrom the reference dataset to define bin edges.ref_bins/ref_dist— compute reference bin counts and convert to proportionsp_ref.cur_bins/cur_dist— compute current bin counts per 5-minute bucket and convert top_cur.psi_terms/psi_sanitized— join the two distributions and enforce a small epsilon to avoidlog(0)and division by zero.- Final
SELECT— aggregates PSI perbucketaspsi_against_reference.
Step 2: Fill Basic Information
In the Arthur UI, for this custom metric:
-
Name:
PSI_<feature_name>_vs_reference
(e.g.,PSI_income_vs_training_reference) -
Description:
Population Stability Index for {{feature_col}} comparing the current dataset against {{reference_dataset}} as a fixed baseline.
Step 3: Configure the Aggregate Arguments
You’ll define the following aggregate arguments.
Argument 1 — Timestamp Column
- Parameter Key:
timestamp_col - Friendly Name:
Timestamp Column - Description:
Timestamp column on the current dataset used for 5-minute bucketing. - Parameter Type:
Column - Source Dataset Parameter Key:
Dataset (dataset) - Allow Any Column Type:
No - Allowed Column Types (optional):
timestamp
Argument 2 — Feature Column
- Parameter Key:
feature_col - Friendly Name:
Feature Column - Description:
Numeric feature column to compute PSI for. - Parameter Type:
Column - Source Dataset Parameter Key:
Dataset (dataset) - Allow Any Column Type:
No - Allowed Column Types:
float,int
Argument 3 — Current Dataset
- Parameter Key:
dataset - Friendly Name:
Current Dataset - Description:
Current (live/log) dataset to compute PSI on. - Parameter Type:
Dataset
Argument 4 — Reference Dataset
- Parameter Key:
reference_dataset - Friendly Name:
Reference Dataset - Description:
Fixed baseline dataset (e.g., training or holdout) used as the PSI reference. - Parameter Type:
Dataset
Argument 5 — Number of Bins
- Parameter Key:
num_bins - Friendly Name:
Number of Bins - Description:
Number of equal-width bins between the reference min and max. - Parameter Type:
Literal - Data Type:
Integer
Typical values: 10 or 20.
Step 4: Configure the Reported Metrics
Reported Metric 1 — PSI Against Reference
- Metric Name:
PSI (vs Reference) - Description:
Population Stability Index for {{feature_col}} per 5-minute bucket, against the fixed reference dataset. - Value Column:
psi_against_reference - Timestamp Column:
bucket - Metric Kind:
Numeric
Interpreting PSI vs Fixed Reference
Common rules of thumb:
- PSI < 0.1
- Little to no shift. Distribution is stable vs training.
- 0.1 ≤ PSI < 0.25
- Moderate shift. Monitor closely; may indicate emerging drift.
- PSI ≥ 0.25
- Significant shift. Often used as a trigger for review/retraining.
Because the reference dataset is fixed, this metric is ideal for:
- Regulatory/Model Risk Management reporting
- Comparing production behavior to the original model design population
- Long-horizon stability analysis
Optional: Categorical PSI vs Reference
For categorical features, you can use categories as “bins” instead of numeric ranges. The logic is the same; only the bin definition changes.
High-level SQL pattern (categorical):
WITH ref_dist AS (
SELECT
{{feature_col}} AS category,
COUNT(*)::float AS ref_count
FROM {{reference_dataset}}
GROUP BY category
),
ref_norm AS (
SELECT
category,
ref_count / NULLIF(SUM(ref_count) OVER (), 0) AS p_ref
FROM ref_dist
),
cur_dist AS (
SELECT
time_bucket(INTERVAL '5 minutes', {{timestamp_col}}) AS bucket,
{{feature_col}} AS category,
COUNT(*)::float AS cur_count
FROM {{dataset}}
GROUP BY bucket, category
),
cur_norm AS (
SELECT
bucket,
category,
cur_count / NULLIF(SUM(cur_count) OVER (PARTITION BY bucket), 0) AS p_cur
FROM cur_dist
),
psi_terms AS (
SELECT
c.bucket,
COALESCE(c.p_cur, 0.0) AS p_cur_raw,
COALESCE(r.p_ref, 0.0) AS p_ref_raw
FROM cur_norm AS c
LEFT JOIN ref_norm AS r USING (category)
),
psi_sanitized AS (
SELECT
bucket,
GREATEST(p_cur_raw, 1e-6) AS p_cur,
GREATEST(p_ref_raw, 1e-6) AS p_ref
FROM psi_terms
)
SELECT
bucket,
SUM( (p_cur - p_ref) * LN(p_cur / p_ref) ) AS psi_against_reference
FROM psi_sanitized
GROUP BY bucket
ORDER BY bucket;The arguments and reported metric configuration are analogous to the numeric case, but feature_col must be a categorical type (e.g., string).
PSI vs 30-Day Rolling Baseline
Overview
The rolling PSI variant compares each 5-minute bucket to a recent 30-day window instead of a fixed dataset.
For each bucket b:
- Reference = all rows from
{{dataset}}withtimestampin[b - 30 days, b) - Current = rows in bucket
bitself
This:
- Highlights short-term changes in feature distribution
- Adapts to evolving populations
- Is useful for day-to-day operational monitoring
Note: This is more complex to compute than the fixed-reference version and may be more expensive for very large datasets. Treat this SQL as a pattern that you can adapt and optimize.
Step 1: Write the SQL (Numeric Feature, Rolling 30-Day Baseline)
Below is a conceptual SQL pattern for 5-minute buckets and a 30-day reference window:
WITH base AS (
SELECT
{{timestamp_col}} AS event_ts,
{{feature_col}} AS feature_value
FROM {{dataset}}
),
-- Distinct 5-minute buckets present in the data
buckets AS (
SELECT DISTINCT time_bucket(INTERVAL '5 minutes', event_ts) AS bucket
FROM base
),
-- Optional: global min/max to define stable bins (instead of per-window min/max)
global_stats AS (
SELECT
MIN(feature_value)::float AS global_min,
MAX(feature_value)::float AS global_max
FROM base
),
-- Current distribution per bucket
cur_bins AS (
SELECT
b.bucket,
WIDTH_BUCKET(bc.feature_value, gs.global_min, gs.global_max, {{num_bins}}) AS bin_id,
COUNT(*)::float AS cur_count
FROM buckets b
JOIN base bc
ON bc.event_ts >= b.bucket
AND bc.event_ts < b.bucket + INTERVAL '5 minutes'
CROSS JOIN global_stats AS gs
GROUP BY b.bucket, bin_id
),
cur_dist AS (
SELECT
bucket,
bin_id,
cur_count / NULLIF(SUM(cur_count) OVER (PARTITION BY bucket), 0) AS p_cur
FROM cur_bins
),
-- Reference distribution: last 30 days before each bucket
ref_bins AS (
SELECT
b.bucket,
WIDTH_BUCKET(br.feature_value, gs.global_min, gs.global_max, {{num_bins}}) AS bin_id,
COUNT(*)::float AS ref_count
FROM buckets b
JOIN base br
ON br.event_ts >= b.bucket - INTERVAL '30 days'
AND br.event_ts < b.bucket
CROSS JOIN global_stats AS gs
GROUP BY b.bucket, bin_id
),
ref_dist AS (
SELECT
bucket,
bin_id,
ref_count / NULLIF(SUM(ref_count) OVER (PARTITION BY bucket), 0) AS p_ref
FROM ref_bins
),
psi_terms AS (
SELECT
c.bucket,
c.bin_id,
COALESCE(c.p_cur, 0.0) AS p_cur_raw,
COALESCE(r.p_ref, 0.0) AS p_ref_raw
FROM cur_dist AS c
LEFT JOIN ref_dist AS r
ON c.bucket = r.bucket
AND c.bin_id = r.bin_id
),
psi_sanitized AS (
SELECT
bucket,
bin_id,
GREATEST(p_cur_raw, 1e-6) AS p_cur,
GREATEST(p_ref_raw, 1e-6) AS p_ref
FROM psi_terms
)
SELECT
bucket,
SUM( (p_cur - p_ref) * LN(p_cur / p_ref) ) AS psi_vs_30d_baseline
FROM psi_sanitized
GROUP BY bucket
ORDER BY bucket;What this query is doing
buckets— identifies all 5-minute buckets present in the data.global_stats— computes global min/max to define bin edges (keeps bins stable).cur_bins/cur_dist— compute per-bucket proportionsp_cur.ref_bins/ref_dist— for each bucket, compute the distribution over the preceding 30 days asp_ref.psi_terms/psi_sanitized— joinp_curandp_refper bucket/bin with epsilons.- Final
SELECT— returnspsi_vs_30d_baselineper 5-minute bucket.
Step 2: Basic Information
In Arthur:
-
Name:
PSI_<feature_name>_vs_30d_baseline -
Description:
Population Stability Index for {{feature_col}} comparing each 5-minute bucket to the preceding 30 days on the same dataset.
Step 3: Aggregate Arguments
Most arguments are shared with the fixed-reference version:
timestamp_col— timestamp columnfeature_col— numeric featuredataset— current datasetnum_bins— number of bins (Literal, Integer)
You don’t need a reference_dataset here because the reference is derived from the same dataset.
Step 4: Reported Metrics
- Metric Name:
PSI (vs 30-Day Baseline) - Description:
Population Stability Index per 5-minute bucket comparing to a rolling 30-day reference window. - Value Column:
psi_vs_30d_baseline - Timestamp Column:
bucket - Metric Kind:
Numeric
Interpreting PSI vs 30-Day Baseline
Interpretation thresholds (0.1 / 0.25) are similar, but now:
- The reference is moving with time (last 30 days).
- Large PSI spikes indicate short-term distribution changes relative to recent behavior.
This variant is ideal when:
- The overall population evolves over time (seasonality, product changes).
- You care about operational anomalies (“something weird happened this week”) rather than deviation from old training data.
Choosing Between the Two
You can (and often should) have both metrics in Arthur:
-
PSI_<feature>_vs_reference- Fixed baseline (e.g., training).
- Great for long-term stability and regulatory reporting.
-
PSI_<feature>_vs_30d_baseline- Rolling baseline (last 30 days).
- Great for day-to-day operational monitoring and alerting.
Used together, they let you answer both:
- “How far have we drifted from where we trained?”
- “How weird is today compared to what we’ve recently seen?”
Updated 1 day ago