Classification Metrics Overview

Arthur’s Classification Metrics framework provides a structured way to evaluate binary and multiclass models using families of related metric “buckets.”

Overview

Arthur supports comprehensive evaluation for classification models, including:

  • Binary classification (e.g., fraud vs. non-fraud, approve vs. decline)
  • Multiclass classification (e.g., topic labeling, intent detection, document type)

This page introduces how Arthur thinks about classification metrics and how to interpret the metric buckets that follow (Positive-Class Error Profile, Detection & Acceptance, Subgroup Rate Comparison, Curve-Based Discrimination, AUC Stability, Rank Association, etc.).

If you’re building or monitoring classification models, this is your starting point for understanding:

  • How Arthur expects your model outputs to be structured
  • How metrics differ between binary and multiclass
  • How the downstream bucket documentation fits together

For details on how to implement metrics via SQL and the Custom Metrics system, see Custom Metrics.


Supported Classification Problem Types

Arthur treats classification tasks in two main categories:

Binary Classification

Binary classification problems have two mutually exclusive classes, typically represented as:

  • Label set: {0, 1} or {negative, positive}
  • Model outputs:
    • A predicted probability for the positive class (e.g. p(y=1 | x))
    • A predicted label (0 or 1), usually obtained by thresholding the probability
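
For example, a minimal sketch of the thresholding step might look like the following query. The `inferences` table, `inference_id`, and `positive_class_score` names are hypothetical placeholders (not Arthur’s actual schema), and the 0.5 cutoff is purely illustrative:

```sql
-- Illustrative only: table and column names are hypothetical placeholders.
SELECT
  inference_id,
  positive_class_score,
  -- Threshold the positive-class probability at an illustrative 0.5 cutoff
  CASE WHEN positive_class_score >= 0.5 THEN 1 ELSE 0 END AS predicted_label
FROM inferences;
```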

Many of the buckets (e.g., error profiles, ROC-AUC, KS, detection/acceptance profiles) are most natural in the binary case and assume a single “positive class” of interest.


Multiclass Classification

Multiclass classification problems have more than two classes, e.g.:

  • Label set: {0, 1, 2, …, K-1} or {spam, promo, social, primary}
  • Model outputs:
    • Either a probability distribution over all classes
    • Or a single predicted label per example

In the multiclass setting, some metrics generalize cleanly (e.g., per-class recall, overall accuracy), while others require more care (e.g., AUC, KS, rank-based metrics). Arthur’s bucket docs will specify:

  • Whether a metric is binary-only or multiclass-compatible
  • How to compute per-class vs aggregated metrics (macro, micro, weighted)
  • How to interpret differences between classes

Model Outputs and Data Expectations

To use the classification metrics in these buckets, Arthur expects your data to contain:

  • Ground truth labels (encoded consistently across train/validation/production)
  • Predictions, at minimum:
    • Binary: probability or score for the positive class, and/or predicted label
    • Multiclass: per-class probabilities or a predicted label, ideally both
  • Optional but recommended:
    • Scores/logits for ranking-based metrics
    • Auxiliary features (for subgroup and fairness metrics)
    • Timestamps (for stability and drift metrics)
    • Sample weights (if your training or evaluation pipeline uses weighting)

Each bucket document will call out any additional requirements (e.g. need for per-class probabilities for multiclass ROC-like curves).
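
As a rough illustration of the kind of layout these buckets assume, the expected fields could be sketched as the hypothetical table below. Column names and types are placeholders, not Arthur’s actual schema:

```sql
-- Hypothetical layout only; names and types are placeholders.
CREATE TABLE inferences (
  inference_id          VARCHAR,    -- unique ID per prediction
  inference_timestamp   TIMESTAMP,  -- needed for stability / drift metrics
  ground_truth_label    INTEGER,    -- 0/1 for binary, class index for multiclass
  predicted_label       INTEGER,    -- thresholded (binary) or argmax (multiclass) prediction
  positive_class_score  DOUBLE,     -- binary: p(y=1 | x); used by curve/rank metrics
  segment               VARCHAR,    -- optional: subgroup for fairness comparisons
  sample_weight         DOUBLE      -- optional: weighting, if your pipeline uses it
);
```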


How Buckets Relate to Classification Evaluation

The classification metric buckets you’ll see in this documentation represent families of related metrics, not single numbers. They’re grouped by how they characterize model behavior.

For example:

  • Bucket 1 — Positive-Class Error Profile
    Focuses on false positives, false negatives, and how errors concentrate in different score regions. Most natural in the binary setting, but aspects can be adapted per-class for multiclass.

  • Bucket 3 — Detection & Acceptance Profile
    Covers recall, acceptance rate, precision/PPV, and how those trade off as you move the score threshold—useful for both binary and multiclass (per-class or “one-vs-rest” variants).

  • Bucket 4 — Subgroup Rate Comparison
    Evaluates fairness and performance across segments (e.g., geography, age band, channel). Works for both binary and multiclass by comparing rates per subgroup (see the SQL sketch after this list).

  • Bucket 5 — Curve-Based Discrimination
    Includes ROC curves, AUC, Gini, KS statistics, and related discrimination curves. These are naturally binary, but can be applied to multiclass via one-vs-rest or macro-averaged formulations.

  • Bucket 6 — AUC Stability Charts
    Tracks how discrimination quality (e.g., AUC) changes over time, highlighting stability or drift issues. Typically uses binary AUC or per-class AUC in multiclass scenarios.

  • Bucket 7 — Rank Association Profile
    Looks at rank-based correlation between scores and outcomes (e.g., Spearman’s ρ, Kendall’s τ). Applicable to both binary and multiclass, as long as you have a meaningful scalar score.

Each bucket doc will:

  • Specify whether a metric is Binary-only, Multiclass-only, or Both
  • Describe any additional steps needed for multiclass (e.g. per-class aggregation)
  • Provide SQL examples that match Arthur’s Custom Metrics engine
  • Show example plots and how to interpret them

Binary vs. Multiclass: How Metrics Differ

While many metric names overlap between binary and multiclass, their definitions and interpretations can differ.

Binary

  • Metrics like TPR, FPR, Precision, Recall, F1, AUC, KS are defined with respect to a single positive class.
  • Threshold-based profiles (Detection/Acceptance) are usually computed on the positive class score.
  • Confusion matrices are 2×2 and easy to interpret.
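
As a sketch of how these binary quantities derive from the 2×2 confusion matrix, the counts and the resulting precision (PPV) and recall could be computed as follows, again assuming the hypothetical `inferences` layout rather than Arthur’s actual schema:

```sql
-- Binary confusion-matrix sketch; table and columns are hypothetical.
WITH counts AS (
  SELECT
    SUM(CASE WHEN predicted_label = 1 AND ground_truth_label = 1 THEN 1 ELSE 0 END) AS tp,
    SUM(CASE WHEN predicted_label = 1 AND ground_truth_label = 0 THEN 1 ELSE 0 END) AS fp,
    SUM(CASE WHEN predicted_label = 0 AND ground_truth_label = 1 THEN 1 ELSE 0 END) AS fn,
    SUM(CASE WHEN predicted_label = 0 AND ground_truth_label = 0 THEN 1 ELSE 0 END) AS tn
  FROM inferences
)
SELECT
  tp, fp, fn, tn,
  tp * 1.0 / NULLIF(tp + fp, 0) AS ppv,     -- precision / positive predictive value
  tp * 1.0 / NULLIF(tp + fn, 0) AS recall   -- true positive rate
FROM counts;
```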

Multiclass

In the multiclass setting, binary-style metrics are typically computed per class and then combined. Common strategies Arthur supports conceptually (and via custom metrics):

  • Per-class metrics
    Compute binary-style metrics for each class via a one-vs-rest framing.

  • Macro averaging
    Average per-class metrics equally across all classes.

  • Micro averaging
    Aggregate across all instances and classes, then compute a single metric.

  • Weighted averaging
    Weight per-class metrics by class frequency or a custom weighting scheme.
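
For instance, per-class recall and its macro average could be sketched as below, assuming integer class labels in the hypothetical `inferences` table. Note that for single-label multiclass, micro-averaged recall reduces to overall accuracy:

```sql
-- Macro-averaged recall sketch; table and columns are hypothetical.
WITH per_class AS (
  SELECT
    ground_truth_label AS class_label,
    -- One-vs-rest recall for each class: correct predictions / actual members
    SUM(CASE WHEN predicted_label = ground_truth_label THEN 1 ELSE 0 END) * 1.0
      / COUNT(*) AS class_recall
  FROM inferences
  GROUP BY ground_truth_label
)
SELECT
  AVG(class_recall) AS macro_recall   -- every class weighted equally
FROM per_class;
```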

For each bucket, the documentation will call out:

  • Which averaging strategy is used (if applicable)
  • Whether you are looking at per-class plots or aggregated plots
  • Any caveats about highly imbalanced classes or rare labels

How Classification Metrics Are Implemented in Arthur

Most of the classification metrics in the buckets are implemented using Arthur’s Custom Metrics engine:

  • Metrics are expressed as SQL queries over Arthur datasets (inference logs, ML tables, GenAI outputs, etc.)
  • SQL queries output reported metrics (e.g., TP, FP, FN, acceptance rate, AUC contributions)
  • Arthur runs these queries in 5-minute time buckets for time-series support
  • Plots and dashboards are built using aggregations of these reported metrics
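
Arthur applies the 5-minute windowing itself, but the shape of the resulting time series is roughly what the following Postgres-flavored sketch would produce. The table, columns, and syntax are illustrative assumptions, not Arthur’s engine or schema:

```sql
-- Conceptual illustration of 5-minute bucketing; not Arthur's actual mechanics.
SELECT
  -- Round each timestamp down to the start of its 5-minute (300-second) window
  TO_TIMESTAMP(FLOOR(EXTRACT(EPOCH FROM inference_timestamp) / 300) * 300) AS bucket_start,
  COUNT(*)                                                  AS n_inferences,
  AVG(CASE WHEN predicted_label = 1 THEN 1.0 ELSE 0.0 END)  AS acceptance_rate
FROM inferences
GROUP BY 1
ORDER BY 1;
```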

To understand the underlying mechanics of how these metrics are constructed and executed, see Custom Metrics.

The bucket-specific pages focus on what the metric means, how to read the plots, and how to interpret them for binary vs. multiclass, rather than on the generic custom metrics plumbing.


Getting Started

If you are new to classification metrics in Arthur, a good path is:

  1. Read this overview to understand how binary vs. multiclass is treated.
  2. Skim the bucket list and decide which evaluation angles matter most:
    • Error concentration
    • Threshold tradeoffs
    • Fairness/subgroup comparisons
    • Discrimination curves (AUC, KS, Gini)
    • Stability over time
    • Rank-based behavior
  3. Dive into each relevant bucket page:
    • Start with the “When to use this bucket” and “Binary vs Multiclass” sections.
    • Look at example plots and interpretation notes.
  4. If you need to add or customize metrics:
    • Go to Custom Metrics for SQL patterns and configuration guidance.
  5. Use the Metrics Querying tools to validate metric outputs and explore time series.

Related Documentation

  • Custom Metrics – how to define metrics with SQL, reported metrics, and aggregations
    Custom Metrics

  • Metrics & Querying Overview – how metrics are stored, queried, and visualized in Arthur
    https://docs.arthur.ai/docs/metrics-querying-overview-1#/

  • Bucket Docs – detailed classification metric buckets:

    • Bucket 1 — Positive-Class Error Profile
    • Bucket 3 — Detection & Acceptance Profile
    • Bucket 4 — Subgroup Rate Comparison
    • Bucket 5 — Curve-Based Discrimination
    • Bucket 6 — AUC Stability Charts
    • Bucket 7 — Rank Association Profile

These bucket pages are all classification-focused, and this overview is intended to be the glue that ties them together for both binary and multiclass use cases.