Metrics & Querying Overview

This document provides an overview of the metrics in Arthur, as well as how to write queries against them to build
dashboards and alert rules.

Introduction to Arthur Metrics

Arthur reports pre-aggregated metrics on a 5-minute cadence, based on the source data for users' models. Each metric reported has the following format:

(Model ID, Metric Name, Timestamp, Value, Dimensions)

  1. Model ID - the UUID of the model
  2. Metric Name - the name of the metric, see the metric overview section below for a
    description of each
  3. Timestamp - the 5-minute-aligned timestamp of the metric that summarizes the 5-minute interval. A timestamp of
    05:00 covers the interval [05:00-05:05)
  4. Value - a floating point number (or sketch)
  5. Dimensions - a set of key-value pairs that provide metadata for the metric

Types of Metrics

Arthur supports two different types of Values in its metrics.

For the first type, Value is a floating point number. These are called "numeric" metrics, and they generally represent counts of observable values. Some examples include inference count, hallucination count, rule failure count, true positive count, etc.

For the second type, Value is a data sketch. Data sketches provide a way to calculate summary statistics of a distribution, like median, max, min, and quantiles, without saving the entire distribution. They are probabilistic summaries of the data with bounded error guarantees. Common examples of these metrics include latencies, score distributions, and histograms.

Metrics Versioning

Arthur supports versioned metrics to deduplicate and update metrics across multiple Metrics Calculation job runs. Each metric is stored with a metric_version integer, a monotonically increasing version number for each metric. Metrics are deduplicated on the tuple (metric name, timestamp, metric version), and by default queries return the latest version for each time point. Each time a new metrics computation runs, the metrics uploaded from that run are given a new version number, effectively creating a metric history that allows comparison across versions.

Metrics Tables and Views

At a high level, Arthur supports two different views for querying a model's metrics:

  1. versioned - these views allow you to query a specific version of your metrics
  2. latest_version - these views show only the latest version for each time point, so you can easily query the most
    recent data without specifying a version

Additionally, Arthur supports two views based on the type of metric:

  1. numeric metrics
  2. sketch metrics

Overall, this produces 4 different views that are available to query:

  1. metrics_numeric - a view of versioned, numeric metrics
  2. metrics_numeric_latest_version - a view of the numeric metrics with the latest version for each time point
  3. metrics_sketch - a view of the versioned, sketch metrics
  4. metrics_sketch_latest_version - a view of the sketch metrics with the latest version for each time point
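
For example, this minimal sketch pins a specific metric version in the versioned numeric view (the version number 1 here is illustrative):

select time_bucket(interval '1 day', timestamp) as bucket,
       sum(value)                               as total
from metrics_numeric
where metric_name = 'inference_count'
  AND metric_version = 1
group by bucket;

Dropping the metric_version filter and querying metrics_numeric_latest_version instead returns the same rollup over the latest version of each time point.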

Metrics Table Formats

All the metrics tables are organized similarly to the tuple above: (Model ID, Metric Name, Timestamp, Metric Version, Value, Dimensions).

An example of a numeric and sketch row in the table would look like:

Model ID                             | Metric Name     | Timestamp                         | Metric Version | Value         | Dimensions
27a69cb5-29e7-4058-b426-fc5294a0a059 | inference_count | 2024-07-30 14:00:00.000000 +00:00 | 1              | 146           | {"result": "Fail", "prompt_result": "Fail", "response_result": "Pass"}
27a69cb5-29e7-4058-b426-fc5294a0a059 | pii_score       | 2024-07-30 13:50:00.000000 +00:00 | 1              | (sketch type) | {"entity": "US_SSN", "result": "Fail", "location": "prompt"}
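
Dimensions are stored as JSON key-value pairs, so individual keys can be read with the ->> operator (as the later examples do). A minimal sketch that keeps only failing inferences, assuming the inference_count rows shown above:

select timestamp, value
from metrics_numeric_latest_version
where metric_name = 'inference_count'
  AND dimensions ->> 'result' = 'Fail';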

Query Language

Arthur exposes two places where users can write queries to interact with their metrics: dashboards and alert rules. Both accept PostgreSQL 16 SQL with the Timescale extension, which can be used to investigate and visualize metrics.

Considerations for Writing Queries

Since Arthur's data plane reports metrics on pre-aggregated 5-minute intervals, queries need to perform final calculations on top of the raw count and sketch metrics it reports. For example, the data plane does not report "rates": the division required to calculate a rate means it cannot be further aggregated into larger intervals. Performing the final rollup at query time ensures metrics remain accurate at all time window aggregations. In general, metrics like rates and averages need to be calculated at query time from the pre-aggregated metrics stored in the platform.

Callouts

  1. The queries below are taken from Arthur's standard dashboards. They contain syntax used by the dashboard to filter the range of data shown in the graph. When writing alert rule queries, the following model filter can be omitted because the query API adds it automatically:

    AND model_id = '{{model_id}}'

    The time range filter below should not be omitted in its entirety, but its surrounding [[ ]] brackets need to be removed; the query API fills in the time range automatically:

    [[ AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}' ]]
  2. The Arthur standard dashboards use dashboard-wide model ID filters, which means model filters are not specified in the
    queries themselves. Be careful when creating a new dashboard or removing that dashboard-wide filter, because the queries will no longer be filtered to a single model. A model ID filter can be added in the query SQL using the following dashboard query filter syntax.

    AND model_id = '{{model_id}}'

Example Queries

1. Querying a numeric metric

This query is an example of how to query a numeric metric and aggregate it into a daily rollup:

select time_bucket(interval '1 day', timestamp) as bucket,
       sum(value)                               as total
from metrics_numeric_latest_version
where metric_name = 'inference_count'
    AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}'
group by bucket

2. Hallucination Rate

This query is an example of a more advanced calculation finalized at query time. In this query, we select two metrics:
one for the count of hallucinations per interval, and one for the count of inferences in that interval. We join them on timestamp, then divide to get the rate, or percentage, of hallucinations per interval. The query includes divide-by-zero protection in case there were no inferences in the interval.

with hallucination_count as (select time_bucket(interval '1 day', timestamp)             as bucket,
                                    sum(value)                                           as total
                             from metrics_numeric_latest_version
                             where metric_name = 'hallucination_count'
                                   AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}'
                             group by bucket
                             order by bucket DESC),
     total_count as (select time_bucket(interval '1 day', timestamp)                 as bucket,
                            sum(value)                                               as total
                     from metrics_numeric_latest_version
                     where metric_name = 'inference_count'
                           AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}'
                     group by bucket
                     order by bucket DESC)
select hallucination_count.bucket as timestamp,
    CASE WHEN total_count.total = 0 THEN 0 ELSE hallucination_count.total / total_count.total END as hallucination_rate
from hallucination_count
         join total_count on hallucination_count.bucket = total_count.bucket
order by hallucination_count.bucket DESC

3. Basic Sketch Operations

Sketch metric types allow for querying properties of a distribution from the stored values. For aggregating on intervals larger than 5 minutes, they can be merged to generate sketches that represent the combined properties of all data that were summarized by the individual sketches before merging. Some helpful functions include:

  • kll_float_sketch_merge(sketch) - this function allows merging sketch values in a group into a single sketch
  • kll_float_sketch_get_quantile(sketch, quantile) - this function returns the value of the requested quantile from the
    distribution summarized by the sketch. The median is the 0.5 quantile, and the 95th percentile is the 0.95 quantile.
  • kll_float_sketch_get_n(sketch) - returns the number of values summarized by the sketch
  • kll_float_sketch_get_max_item(sketch) - returns the max item seen by the sketch
  • kll_float_sketch_get_min_item(sketch) - returns the min item seen by the sketch
  • kll_float_sketch_get_pmf(sketch, [points]) - returns the probability mass function evaluated at each of the
    points. Each returned value can be multiplied by the result of kll_float_sketch_get_n to obtain counts for a distribution.
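
As a quick illustration, this minimal sketch combines several of the functions above to summarize the rule_latency metric (described later in this document) over the selected time range:

select kll_float_sketch_get_min_item(kll_float_sketch_merge(value)) as min_latency,
       kll_float_sketch_get_max_item(kll_float_sketch_merge(value)) as max_latency,
       kll_float_sketch_get_n(kll_float_sketch_merge(value))        as observations
from metrics_sketch_latest_version
where metric_name = 'rule_latency'
  AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}';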

4. Querying a sketch metric

This query is an example of how to query a sketch-based metric. It returns the median latency on the rule_latency
metric per day, grouped by the rule_type dimension.

select time_bucket(interval '1 day', timestamp)                         as bucket,
      kll_float_sketch_get_quantile(kll_float_sketch_merge(value), 0.5) as median_latency,
      dimensions ->> 'rule_type'                                        as rule_type
from metrics_sketch_latest_version
where metric_name = 'rule_latency'
  AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}'
group by bucket, rule_type;

5. Creating a distribution

This query creates a table of the values of a distribution for a variable that varies between 0 and 1. It creates a
bucket at 0.05 intervals for all data within the time range, so it is not grouped by time buckets. It uses the get_pmf sketch function to obtain the percentage of values in each interval, then multiplies it by the total count of values seen by the sketch to get the number in each interval.

with merged
         as (select kll_float_sketch_get_pmf(kll_float_sketch_merge(value),
                                             ARRAY [0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,0.7,0.75,0.8,0.85,0.9,0.95]) as sketch,
                    kll_float_sketch_get_n(kll_float_sketch_merge(value))                                                                   as total
             from metrics_sketch_latest_version
             where metric_name = 'toxicity_score'
                 AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}'
             )
select ROUND(((ordinality) / 20.0), 2)::VARCHAR as score_bucket,
       val * merged.total                       as inf_count
from merged,
     unnest(merged.sketch) with ordinality as val;

6. Other helpful Timescale functions

  1. time_bucket_gapfill - this function can be used to fill in all time buckets in a range in case values are not present in all time intervals. A common place this is used is to visualize inference_count over time. If there weren't any inferences in a time bucket, there would be no count to show, so the graph would be empty. Using this function allows queries to fill in NULL values for missing time buckets, which can be converted to zero or other defaults. Here's an example query:
    select time_bucket_gapfill(interval '1 hour', timestamp)  as bucket,
      CASE WHEN sum(value) is NULL THEN 0 ELSE sum(value) END as total
    from metrics_numeric_latest_version
    where metric_name = 'inference_count'
      AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}'
    group by bucket;

List of Current Metrics

Shield Task Metrics

Numeric, Counter Type Metrics

These metrics track the number of occurrences of a measure every 5 minutes. To investigate larger intervals, they can be summed.

1. Inference Count

Name: inference_count

Dimensions:

  • result - the overall result of the shield check on the inferences, either "Pass" or "Fail"
  • prompt_result - the overall result of the shield check on the inference's prompt, either "Pass" or "Fail"
  • response_result - the overall result of the shield check on the inference's response, either "Pass", "Fail", or
    "null" if it was not checked

Description:

Summing the value of this metric gives a total count of inferences in the time range.

2. Rule Count

Name: rule_count

Dimensions:

  • location - either "prompt" or "response", describes which portion of the inference this rule was applied to
  • rule_type - the shield identifier for the type of rule, examples include "PIIDataRule",
    "ModelHallucinationRuleV2", "RegexRule", etc.
  • result - the overall result of the check, either "Pass" or "Fail"
  • name - the given name of this rule in Shield, not guaranteed to be unique across rules
  • id - the id of the rule in Shield, unique across all rules

Description:

Summing the value of this metric gives a total count of the rules checked across all inferences in the time range.
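
For example, this minimal sketch (daily bucketing is illustrative) uses the result and rule_type dimensions to count failed checks per rule type:

select time_bucket(interval '1 day', timestamp) as bucket,
       dimensions ->> 'rule_type'               as rule_type,
       sum(value)                               as failed_checks
from metrics_numeric_latest_version
where metric_name = 'rule_count'
  AND dimensions ->> 'result' = 'Fail'
  AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}'
group by bucket, rule_type;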

3. Hallucination Count

Name: hallucination_count

Dimensions:

N/A

Description:

Summing the value of this metric gives a total count of the inferences that had at least one ModelHallucinationRuleV2 or ModelHallucinationRule check fail in the response.

4. Token Count

Name: token_count

Dimensions:

  • location - either "prompt" or "response", describes which portion of the inference the token count occurred in.

Description:

This metric tracks the number of tokens in the prompt and response of an inference. Summing the value of this metric gives a total count of the tokens in the prompt and response of all inferences in the time range. Summing the value of this metric by location gives a count of tokens by prompt and response in the time range.

5. Token Cost

Name: token_cost.{model_name}

Dimensions:

  • location - either "prompt" or "response", describes which portion of the inference the token cost occurred in.

Description:

This metric tracks the estimated cost of the tokens in the prompt and response of an inference for a given model. Summing the value of this metric by location gives the estimated total cost incurred by the model's prompts and responses.
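
Because the model name is embedded in the metric name, a pattern match on the metric_name column can aggregate cost across all models. A minimal sketch, assuming the token_cost.{model_name} naming above:

select time_bucket(interval '1 day', timestamp) as bucket,
       metric_name,
       sum(value)                               as estimated_cost
from metrics_numeric_latest_version
where metric_name LIKE 'token_cost.%'
  AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}'
group by bucket, metric_name;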

Sketch Based Metrics

Sketch metrics can be used to query the min, max, quantile, probability density function, probability mass function, and number of observations seen in an interval. They are reported every 5 minutes, but merging two intervals' sketches together will return a sketch that has the same properties as if a single sketch had been generated for the entire interval.

1. Toxicity Score Distribution

Name: toxicity_score

Dimensions:

  • result - the overall result of the check, either "Pass" or "Fail"
  • location - either "prompt" or "response", describes which portion of the inference the toxicity score was scored on

Description:

This sketch tracks the distribution of the toxicity score for the collection of inferences in the interval.

2. PII Score Distribution

Name: pii_score

Dimensions:

  • result - the overall result of the check, either "Pass" or "Fail"
  • location - either "prompt" or "response", describes which portion of the inference the PII score was scored on
  • entity - the type of PII detected, examples include "PERSON", "US_SSN", "LOCATION", "US_DRIVER_LICENSE",
    "US_BANK_NUMBER", "PHONE_NUMBER", etc.

Description:

This sketch tracks the distribution of the PII rule score for the collection of inferences in the interval.

3. Claim Count Distribution

Name: claim_count

Dimensions:

  • result - the overall result of the check, either "Pass" or "Fail"

Description:

This sketch tracks the distribution of the number of claims identified in the response by the "ModelHallucinationRuleV2" per inference, for the collection of inferences in the interval.

4. Claim Valid Count Distribution

Name: claim_valid_count

Dimensions:

  • result - the overall result of the check, either "Pass" or "Fail"

Description:

This sketch tracks the distribution of the number of valid claims identified in the response by the
"ModelHallucinationRuleV2" per inference, for the collection of inferences in the interval.

5. Claim Invalid Count Distribution

Name: claim_invalid_count

Dimensions:

  • result - the overall result of the check, either "Pass" or "Fail"

Description:

This sketch tracks the distribution of the number of invalid claims identified in the response by the
"ModelHallucinationRuleV2" per inference, for the collection of inferences in the interval.

6. Rule Latency Distribution

Name: rule_latency

Dimensions:

  • location - either "prompt" or "response", describes which portion of the inference this rule was applied to
  • rule_type - the shield identifier for the type of rule, examples include "PIIDataRule",
    "ModelHallucinationRuleV2", "RegexRule", etc.
  • result - the overall result of the check, either "Pass" or "Fail"

Description:

This sketch tracks the distribution of the latency in milliseconds it took for the check to complete, for all the
checks in the interval.

General Data Metrics

Numeric, Counter Type Metrics

These metrics track the number of occurrences of a measure every 5 minutes. To investigate larger intervals, they can be summed.

1. Inference Count

Name: inference_count

Dimensions:

N/A

Description:

Summing the value of this metric gives a total count of inferences in the time range.

2. Column Null Count

Name: null_count

Dimensions:

  • column_name - the name of the column that this metric contains null counts for

Description:

Summing the value of this metric gives a total count of null values in the column_name column in the time range.
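
For example, assuming each inference contributes at most one value per column, dividing null_count by inference_count gives a per-column null rate. A minimal sketch following the hallucination rate pattern above:

with nulls as (select time_bucket(interval '1 day', timestamp) as bucket,
                      dimensions ->> 'column_name'             as column_name,
                      sum(value)                               as null_total
               from metrics_numeric_latest_version
               where metric_name = 'null_count'
                 AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}'
               group by bucket, column_name),
     inferences as (select time_bucket(interval '1 day', timestamp) as bucket,
                           sum(value)                               as total
                    from metrics_numeric_latest_version
                    where metric_name = 'inference_count'
                      AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}'
                    group by bucket)
select nulls.bucket                                                                       as timestamp,
       nulls.column_name,
       CASE WHEN inferences.total = 0 THEN 0 ELSE nulls.null_total / inferences.total END as null_rate
from nulls
         join inferences on nulls.bucket = inferences.bucket;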

3. Column Categorical Count

Name: categorical_count

Dimensions:

  • column_name - the name of the column that was counted
  • category - the name of the category that was counted in the column

Description:

Summing the value of this metric gives a total count of rows that had a value equal to category in the column_name column in the time range.

4. Column Numeric Sum

Name: numeric_sum

Dimensions:

  • column_name - the name of the column that was summed

Description:

Summing the value of this metric gives a total sum of the column_name column in the time range.

Sketch Based Metrics

Sketch metrics can be used to query the min, max, quantile, probability density function, probability mass function, and number of observations seen in an interval. They are reported every 5 minutes, but merging two intervals' sketches together will return a sketch that has the same properties as if a single sketch had been generated for the entire interval.

1. Column Numeric Statistics

Name: numeric_sketch

Dimensions:

  • column_name - the name of the column that is summarized by the sketch

Description:

This sketch tracks the distribution of the values of the numeric column.
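
For example, this minimal sketch summarizes each numeric column with the sketch functions described earlier (min, median, and max over the selected time range):

select dimensions ->> 'column_name'                                      as column_name,
       kll_float_sketch_get_min_item(kll_float_sketch_merge(value))      as min_value,
       kll_float_sketch_get_quantile(kll_float_sketch_merge(value), 0.5) as median_value,
       kll_float_sketch_get_max_item(kll_float_sketch_merge(value))      as max_value
from metrics_sketch_latest_version
where metric_name = 'numeric_sketch'
  AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}'
group by column_name;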

Regression Model Problem Type Metrics

Numeric, Counter Type Metrics

These metrics track the number of occurrences of a measure every 5 minutes. To investigate larger intervals, they can be summed.

1. Mean Absolute Error

This calculation is the combination of two different metrics. The first is the sum of the absolute errors, and the
second is the count of observations.

Sub-metric 1

Name: absolute_error_count

Dimensions:

N/A

Description:

Summing the value of this metric gives a total count of inferences in the time range that had ground truth values.

Sub-metric 2

Name: absolute_error_sum

Dimensions:

N/A

Description:

Summing the value of this metric gives a total sum of absolute errors in the time range.

Full metric

To compute the full Mean Absolute Error, divide absolute_error_sum by absolute_error_count for a time interval.
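
A minimal sketch of that division, following the same join pattern as the hallucination rate example above (the daily bucket is illustrative):

with error_sum as (select time_bucket(interval '1 day', timestamp) as bucket,
                          sum(value)                               as total
                   from metrics_numeric_latest_version
                   where metric_name = 'absolute_error_sum'
                     AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}'
                   group by bucket),
     error_count as (select time_bucket(interval '1 day', timestamp) as bucket,
                            sum(value)                               as total
                     from metrics_numeric_latest_version
                     where metric_name = 'absolute_error_count'
                       AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}'
                     group by bucket)
select error_sum.bucket                                                                    as timestamp,
       CASE WHEN error_count.total = 0 THEN 0 ELSE error_sum.total / error_count.total END as mean_absolute_error
from error_sum
         join error_count on error_sum.bucket = error_count.bucket;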

2. Mean Squared Error

This calculation is the combination of two different metrics. The first is the sum of the squared errors, and the
second is the count of observations.

Sub-metric 1

Name: squared_error_count

Dimensions:

N/A

Description:

Summing the value of this metric gives a total count of inferences in the time range that had ground truth values.

Sub-metric 2

Name: squared_error_sum

Dimensions:

N/A

Description:

Summing the value of this metric gives a total sum of squared errors in the time range.

Full metric

To compute the full Mean Squared Error, divide squared_error_sum by squared_error_count for a time interval.
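
The query shape is identical to the Mean Absolute Error example above. As a more compact alternative, aggregate FILTER clauses can replace the CTE join; this minimal sketch computes a single MSE over the whole time range (note that NULLIF yields NULL rather than 0 when there are no observations), and wrapping the division in sqrt() gives RMSE:

select sum(value) filter (where metric_name = 'squared_error_sum')
           / NULLIF(sum(value) filter (where metric_name = 'squared_error_count'), 0) as mean_squared_error
from metrics_numeric_latest_version
where metric_name in ('squared_error_sum', 'squared_error_count')
  AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}';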

Binary Classification Model Problem Type Metrics

Numeric, Counter Type Metrics

These metrics track the number of occurrences of a measure every 5 minutes. To investigate larger intervals, they can be summed.

1. Confusion Matrix

This aggregation reports four distinct metrics.

Name: confusion_matrix_true_positive_count, confusion_matrix_false_positive_count,
confusion_matrix_false_negative_count, and confusion_matrix_true_negative_count

Dimensions:

N/A

Description:

Summing the value of each metric gives the total count for that confusion matrix cell in the time range.
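
Derived metrics like precision and recall can be finalized at query time from these counts. A minimal sketch using aggregate FILTER clauses (daily bucketing is illustrative; NULLIF guards against empty buckets):

with cm as (select time_bucket(interval '1 day', timestamp)                                        as bucket,
                   sum(value) filter (where metric_name = 'confusion_matrix_true_positive_count')  as tp,
                   sum(value) filter (where metric_name = 'confusion_matrix_false_positive_count') as fp,
                   sum(value) filter (where metric_name = 'confusion_matrix_false_negative_count') as fn
            from metrics_numeric_latest_version
            where metric_name in ('confusion_matrix_true_positive_count',
                                  'confusion_matrix_false_positive_count',
                                  'confusion_matrix_false_negative_count')
              AND timestamp BETWEEN '{{dateStart}}' AND '{{dateEnd}}'
            group by bucket)
select bucket,
       tp / NULLIF(tp + fp, 0) as precision,
       tp / NULLIF(tp + fn, 0) as recall
from cm;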

2. Inference Count by Class

Name: binary_classifier_count_by_class

Dimensions:

  • prediction - the class that was predicted

Description:

Summing the value of this metric gives a total count of inferences per prediction class in the time range.

Multiclass Classification Model Problem Type Metrics

Numeric, Counter Type Metrics

These metrics track the number of occurrences of a measure every 5 minutes. To investigate larger intervals, they can be summed.

1. Single Class Confusion Matrix

This aggregation reports four distinct metrics.

Name: multiclass_confusion_matrix_single_class_true_positive_count,
multiclass_confusion_matrix_single_class_false_positive_count, multiclass_confusion_matrix_single_class_false_negative_count, and multiclass_confusion_matrix_single_class_true_negative_count

Dimensions:

  • class_label - the label of the class used as the positive class, all other classes are treated as negative classes

Description:

Summing the value of each metric gives the total count for that confusion matrix cell, per positive class label, in the time range.

2. Inference Count by Class

Name: multiclass_classifier_count_by_class

Dimensions:

  • prediction - the class that was predicted

Description:

Summing the value of this metric gives a total count of inferences per prediction class in the time range.