Classification Metrics Overview

Arthur’s Classification Metrics framework provides a structured way to evaluate binary and multiclass models using families of related metric “buckets.”

Overview

Arthur supports comprehensive evaluation for classification models, including:

  • Binary classification (e.g., fraud vs. non-fraud, approve vs. decline)
  • Multiclass classification (e.g., topic labeling, intent detection, document type)

This page introduces how Arthur thinks about classification metrics and how to interpret the metric buckets that follow (Positive-Class Error Profile, Detection & Acceptance, Subgroup Rate Comparison, Curve-Based Discrimination, AUC Stability, Rank Association, etc.).

If you’re building or monitoring classification models, this is your starting point for understanding:

  • How Arthur expects your model outputs to be structured
  • How metrics differ between binary and multiclass
  • How the downstream bucket documentation fits together

For details on how to implement metrics via SQL and the Custom Metrics system, see Custom Metrics.


Supported Classification Problem Types

Arthur treats classification tasks in two main categories:

Binary Classification

Binary classification problems have two mutually exclusive classes, typically represented as:

  • Label set: {0, 1} or {negative, positive}
  • Model outputs:
    • A predicted probability for the positive class (e.g. p(y=1 | x))
    • A predicted label (0 or 1), usually obtained by thresholding the probability
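
For example, a minimal sketch of the thresholding step might look like the following query. The `inferences` table, `inference_id`, and `positive_class_score` names are hypothetical placeholders (not Arthur’s actual schema), and the 0.5 cutoff is purely illustrative:

```sql
-- Illustrative only: table and column names are hypothetical placeholders.
SELECT
  inference_id,
  positive_class_score,
  -- Threshold the positive-class probability at an illustrative 0.5 cutoff
  CASE WHEN positive_class_score >= 0.5 THEN 1 ELSE 0 END AS predicted_label
FROM inferences;
```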

Many of the buckets (e.g., error profiles, ROC-AUC, KS, detection/acceptance profiles) are most natural in the binary case and assume a single “positive class” of interest.


Multiclass Classification

Multiclass classification problems have more than two classes, e.g.:

  • Label set: {0, 1, 2, …, K-1} or {spam, promo, social, primary}
  • Model outputs:
    • Either a probability distribution over all classes
    • Or a single predicted label per example

In the multiclass setting, some metrics generalize cleanly (e.g., per-class recall, overall accuracy), while others require more care (e.g., AUC, KS, rank-based metrics). Arthur’s bucket docs will specify:

  • Whether a metric is binary-only or multiclass-compatible
  • How to compute per-class vs aggregated metrics (macro, micro, weighted)
  • How to interpret differences between classes

Model Outputs and Data Expectations

To use the classification metrics in these buckets, Arthur expects your data to contain:

  • Ground truth labels (encoded consistently across train/validation/production)
  • Predictions, at minimum:
    • Binary: probability or score for the positive class, and/or predicted label
    • Multiclass: per-class probabilities or a predicted label, ideally both
  • Optional but recommended:
    • Scores/logits for ranking-based metrics
    • Auxiliary features (for subgroup and fairness metrics)
    • Timestamps (for stability and drift metrics)
    • Sample weights (if your training or evaluation pipeline uses weighting)

Each bucket document will call out any additional requirements (e.g. need for per-class probabilities for multiclass ROC-like curves).
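
As a rough illustration of the kind of layout these buckets assume, the expected fields could be sketched as the hypothetical table below. Column names and types are placeholders, not Arthur’s actual schema:

```sql
-- Hypothetical layout only; names and types are placeholders.
CREATE TABLE inferences (
  inference_id          VARCHAR,    -- unique ID per prediction
  inference_timestamp   TIMESTAMP,  -- needed for stability / drift metrics
  ground_truth_label    INTEGER,    -- 0/1 for binary, class index for multiclass
  predicted_label       INTEGER,    -- thresholded (binary) or argmax (multiclass) prediction
  positive_class_score  DOUBLE,     -- binary: p(y=1 | x); used by curve/rank metrics
  segment               VARCHAR,    -- optional: subgroup for fairness comparisons
  sample_weight         DOUBLE      -- optional: weighting, if your pipeline uses it
);
```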


How Buckets Relate to Classification Evaluation

The classification metric buckets you’ll see in this documentation represent families of related metrics, not single numbers. They’re grouped by how they characterize model behavior.

For example:

  • Bucket 1 — Positive-Class Error Profile
    Focuses on false positives, false negatives, and how errors concentrate in different score regions. Most natural in the binary setting, but aspects can be adapted per-class for multiclass.

  • Bucket 3 — Detection & Acceptance Profile
    Covers recall, acceptance rate, precision/PPV, and how those trade off as you move the score threshold—useful for both binary and multiclass (per-class or “one-vs-rest” variants).

  • Bucket 4 — Subgroup Rate Comparison
    Evaluates fairness and performance across segments (e.g., geography, age band, channel). Works for both binary and multiclass by comparing rates per subgroup (see the SQL sketch after this list).

  • Bucket 5 — Curve-Based Discrimination
    Includes ROC curves, AUC, Gini, KS statistics, and related discrimination curves. These are naturally binary, but can be applied to multiclass via one-vs-rest or macro-averaged formulations.

  • Bucket 6 — AUC Stability Charts
    Tracks how discrimination quality (e.g., AUC) changes over time, highlighting stability or drift issues. Typically uses binary AUC or per-class AUC in multiclass scenarios.

  • Bucket 7 — Rank Association Profile
    Looks at rank-based correlation between scores and outcomes (e.g., Spearman’s ρ, Kendall’s τ). Applicable to both binary and multiclass, as long as you have a meaningful scalar score.

Each bucket doc will:

  • Specify whether a metric is Binary-only, Multiclass-only, or Both
  • Describe any additional steps needed for multiclass (e.g. per-class aggregation)
  • Provide SQL examples that match Arthur’s Custom Metrics engine
  • Show example plots and how to interpret them

Binary vs. Multiclass: How Metrics Differ

While many metric names overlap between binary and multiclass, their definitions and interpretations can differ.

Binary

  • Metrics like TPR, FPR, Precision, Recall, F1, AUC, KS are defined with respect to a single positive class.
  • Threshold-based profiles (Detection/Acceptance) are usually computed on the positive class score.
  • Confusion matrices are 2×2 and easy to interpret.
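
As a sketch of how these binary quantities derive from the 2×2 confusion matrix, the counts and the resulting precision (PPV) and recall could be computed as follows, again assuming the hypothetical `inferences` layout rather than Arthur’s actual schema:

```sql
-- Binary confusion-matrix sketch; table and columns are hypothetical.
WITH counts AS (
  SELECT
    SUM(CASE WHEN predicted_label = 1 AND ground_truth_label = 1 THEN 1 ELSE 0 END) AS tp,
    SUM(CASE WHEN predicted_label = 1 AND ground_truth_label = 0 THEN 1 ELSE 0 END) AS fp,
    SUM(CASE WHEN predicted_label = 0 AND ground_truth_label = 1 THEN 1 ELSE 0 END) AS fn,
    SUM(CASE WHEN predicted_label = 0 AND ground_truth_label = 0 THEN 1 ELSE 0 END) AS tn
  FROM inferences
)
SELECT
  tp, fp, fn, tn,
  tp * 1.0 / NULLIF(tp + fp, 0) AS ppv,     -- precision / positive predictive value
  tp * 1.0 / NULLIF(tp + fn, 0) AS recall   -- true positive rate
FROM counts;
```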

Multiclass

In the multiclass setting, binary-style metrics are typically computed per class and then combined. Common strategies Arthur supports conceptually (and via custom metrics):

  • Per-class metrics
    Compute binary-style metrics for each class via a one-vs-rest framing.

  • Macro averaging
    Average per-class metrics equally across all classes.

  • Micro averaging
    Aggregate across all instances and classes, then compute a single metric.

  • Weighted averaging
    Weight per-class metrics by class frequency or a custom weighting scheme.
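
For instance, per-class recall and its macro average could be sketched as below, assuming integer class labels in the hypothetical `inferences` table. Note that for single-label multiclass, micro-averaged recall reduces to overall accuracy:

```sql
-- Macro-averaged recall sketch; table and columns are hypothetical.
WITH per_class AS (
  SELECT
    ground_truth_label AS class_label,
    -- One-vs-rest recall for each class: correct predictions / actual members
    SUM(CASE WHEN predicted_label = ground_truth_label THEN 1 ELSE 0 END) * 1.0
      / COUNT(*) AS class_recall
  FROM inferences
  GROUP BY ground_truth_label
)
SELECT
  AVG(class_recall) AS macro_recall   -- every class weighted equally
FROM per_class;
```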

For each bucket, the documentation will call out:

  • Which averaging strategy is used (if applicable)
  • Whether you are looking at per-class plots or aggregated plots
  • Any caveats about highly imbalanced classes or rare labels

How Classification Metrics Are Implemented in Arthur

Most of the classification metrics in the buckets are implemented using Arthur’s Custom Metrics engine:

  • Metrics are expressed as SQL queries over Arthur datasets (inference logs, ML tables, GenAI outputs, etc.)
  • SQL queries output reported metrics (e.g., TP, FP, FN, acceptance rate, AUC contributions)
  • Arthur runs these queries in 5-minute time buckets for time-series support
  • Plots and dashboards are built using aggregations of these reported metrics
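
Arthur applies the 5-minute windowing itself, but the shape of the resulting time series is roughly what the following Postgres-flavored sketch would produce. The table, columns, and syntax are illustrative assumptions, not Arthur’s engine or schema:

```sql
-- Conceptual illustration of 5-minute bucketing; not Arthur's actual mechanics.
SELECT
  -- Round each timestamp down to the start of its 5-minute (300-second) window
  TO_TIMESTAMP(FLOOR(EXTRACT(EPOCH FROM inference_timestamp) / 300) * 300) AS bucket_start,
  COUNT(*)                                                  AS n_inferences,
  AVG(CASE WHEN predicted_label = 1 THEN 1.0 ELSE 0.0 END)  AS acceptance_rate
FROM inferences
GROUP BY 1
ORDER BY 1;
```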

To understand the underlying mechanics of how these metrics are constructed and executed, see Custom Metrics.

The bucket-specific pages focus on what the metric means, how to read the plots, and how to interpret them for binary vs. multiclass, rather than on the generic custom metrics plumbing.


Getting Started

If you are new to classification metrics in Arthur, a good path is:

  1. Read this overview to understand how binary vs. multiclass is treated.
  2. Skim the bucket list and decide which evaluation angles matter most:
    • Error concentration
    • Threshold tradeoffs
    • Fairness/subgroup comparisons
    • Discrimination curves (AUC, KS, Gini)
    • Stability over time
    • Rank-based behavior
  3. Dive into each relevant bucket page:
    • Start with the “When to use this bucket” and “Binary vs Multiclass” sections.
    • Look at example plots and interpretation notes.
  4. If you need to add or customize metrics:
    • Go to Custom Metrics for SQL patterns and configuration guidance.
  5. Use the Metrics Querying tools to validate metric outputs and explore time series.

Related Documentation

  • Custom Metrics – how to define metrics with SQL, reported metrics, and aggregations
    Custom Metrics

  • Metrics & Querying Overview – how metrics are stored, queried, and visualized in Arthur
    https://docs.arthur.ai/docs/metrics-querying-overview-1#/

  • Bucket Docs – detailed classification metric buckets:

    • Bucket 1 — Positive-Class Error Profile
    • Bucket 3 — Detection & Acceptance Profile
    • Bucket 4 — Subgroup Rate Comparison
    • Bucket 5 — Curve-Based Discrimination
    • Bucket 6 — AUC Stability Charts
    • Bucket 7 — Rank Association Profile

These bucket pages are all classification-focused, and this overview is intended to be the glue that ties them together for both binary and multiclass use cases.