Custom Metrics

This document provides an overview of the custom metrics feature in Arthur. For additional information on querying metrics and the storage of metrics in Arthur, see the Metrics & Querying Overview guide.

Introduction to Arthur Custom Metrics

Arthur allows users to create custom metrics via SQL queries in workspaces. These custom metrics are designed to be reusable across multiple models and projects while being managed in one place.

The output of a custom metric will be time-series data with the following format:

(Model ID, Metric Name, Timestamp, Value, Dimensions)

See the Metrics & Querying Overview Introduction for more information about what these fields correspond to and what types of metrics Arthur supports.

Arthur also supports custom metric versioning. If you create a new version of your custom metric, models will not be automatically updated to use the new version. You will need to change the version of the metric used in any model metric configurations to pick up the new version. This ensures you know exactly which query and aggregation version produced any historical time-series data as you interpret trends across your charts, dashboards, and alerts.

Quickstart Guide

In this quickstart guide, we'll walk through an example configuration for an inference count custom metric with a filter on a specific categorical value in a column (a category count metric).

Step 1: Write the SQL

Here is the SQL we're interested in for this metric:

SELECT time_bucket(INTERVAL '5 minutes', {{timestamp_col}}) as ts,
	'{{categorical_value_literal}}' as categorical_value,
	COUNT(*) as count
FROM {{dataset}}
WHERE {{categorical_column}} = '{{categorical_value_literal}}'
GROUP BY ts, categorical_value

Callouts:

  1. Notice that every generic argument has been replaced with {{}} variable templating. These variables represent the aggregation arguments users will specify when they add our metric to their model.
  2. Notice we're including categorical_value in the SELECT and GROUP BY clause. This is because we want to include this metadata as a dimension in our output time series so we know what value generated a given count.
  3. Notice we're using the time_bucket function to aggregate our counts over 5-minute intervals. This means we'll produce one category count every 5 minutes from this aggregation. This column also has to be in our GROUP BY clause.

Step 2: Configure the Aggregate Arguments

Callouts:

  1. Every argument represents a templated variable indicated by {{}} in the SQL query. Arthur supports dataset arguments, literal arguments, and column arguments. You must create an argument for each templated variable.
  2. The timestamp and dataset arguments will be a part of every custom aggregation. They support templating for the column with the inference timestamp and the dataset the aggregation is being calculated over.
  3. The metadata for these arguments is used by Arthur to fill in the arguments, as well as by users who are configuring your metric for their model. We suggest you provide type hints, tag hints, and descriptions so users know exactly how your metric applies to their model.

Step 3: Configure the Reported Metrics

In this section, we configure the metadata for every aggregation the custom metric creates. Arthur will use this to interpret your query results as time-series data.

Callouts:

  1. The value column is the name of the column containing our aggregate value (in this case, count).
  2. The metric name will be the name of the metric for the generated time series.
  3. The dimension column includes categorical_value because we want this label to be included as a dimension in our output time series to know what category we're reporting the count of.
  4. The timestamp column is the one with our time bucket values.

Step 4: Add the Metric to Your Model Metric Config

Finally, we add our metric to our model's metric configuration so that it gets computed in future metrics calculation jobs:

We set the arguments according to our columns of interest. At this point, once you run your next metrics calculation job, category_count metrics will start showing up in the data for your model!

More Details

Basic Information

As part of a custom metric, Arthur lets you configure some basic information about your aggregation:

  1. Name: This is a user-friendly, readable name for your metric that will be displayed in the UI. Note that it is not the same as the metric_name field in the reported time series.
  2. Description.
  3. Model Problem Type: This is the type of model your custom metric applies to. Arthur uses it to suggest whether users should configure your custom aggregation for their model. For example, if you report that your aggregation applies to binary_classification models, we may suggest to users creating binary_classification models that the aggregation be included in their model's metrics.

Reported Metrics

Every custom metric may report a list of metrics from its SQL query. This is because for some metrics and queries, it makes sense to report multiple metrics from a single aggregation. For example, in a confusion matrix aggregation, you would want to report both a positive count and a negative count without needing to write multiple queries.
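For illustration, here is a rough sketch of what such a multi-metric query might look like; the {{prediction_col}} argument and the positive_count and negative_count aliases are assumptions for illustration, not part of Arthur's built-in metrics:

SELECT time_bucket(INTERVAL '5 minutes', {{timestamp_col}}) as ts,
    SUM(CASE WHEN {{prediction_col}} = 1 THEN 1 ELSE 0 END) as positive_count,
    SUM(CASE WHEN {{prediction_col}} = 0 THEN 1 ELSE 0 END) as negative_count
FROM {{dataset}}
GROUP BY ts

You would then configure two reported metrics: one with positive_count as its value column and one with negative_count, both sharing ts as the timestamp column.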

Each reported metric corresponds to a single aggregation that will be stored in the database when this metric is calculated for some model.

Every reported metric includes the following information:

  1. Metric name: This is the name of the metric in the time series output by the query. This metric_name will be used by users when they query metric data in their dashboards and alert rules.
  2. Description: This description is specific to the single aggregation represented by the reported metric.
  3. Value column: This is the name of the column in your SQL query that stores the aggregation value. For example, for an inference count metric this is the name of the column with the count of all inferences for some time bucket.
  4. Timestamp column: This is the name of the column in your SQL query that stores the timestamp buckets for your reported metric.
  5. Metric kind: This is either Numeric or Sketch. See the Metrics & Querying Overview for more information in the Types of Metrics section. Sketch metrics produce sketch data in the value column and Numeric metrics produce numeric data in the value column.
  6. Dimension column: These are the names of the columns in your SQL query that should be included as dimensions in the output time series. This will correspond to anything you included in a GROUP BY clause that wasn't the timestamp column, and it ensures the dimension gets included in the dimensions column of the output time series. Today, only one column can be configured as a dimension column per reported metric. These columns should be categorical, with a limited number of unique values, to avoid cardinality explosion.

Aggregate Arguments

Arthur supports generically configuring parameters for a custom metric so that your custom metric can be applied to multiple models, which may use different names for the columns your SQL query needs.

Every aggregate argument will be specified in the SQL query using a templating variable with double brackets: {{argument_name}}. These variables will be filled in when users configure your metric for their model. As a result, you want to make sure you use descriptive naming and descriptions so users know how to configure your metric.

Arthur supports three kinds of aggregate arguments:

Dataset Arguments

This argument will be configured for you automatically in the Arthur UI. It corresponds to the dataset used in the aggregation.

It will look as follows:

Here are the fields included in this aggregate argument:

  1. Parameter key: This is the name of the template variable for the dataset in the SQL query. As a result, your query's FROM clause should look like FROM {{dataset}}.
  2. Friendly name.
  3. Description.
  4. Parameter type: Dataset.

Literal Arguments

These arguments correspond to any scalar literal parameters to the query. For example, if you want users to be able to configure some threshold that indicates a passing score so you can report the count of passing inferences, you could introduce a literal argument as follows:

Here are the fields included in this aggregate argument:

  1. Parameter key: This is the name of the template variable in the SQL query. Your query will refer to this variable somewhere as {{<parameter_key>}}.
  2. Friendly name.
  3. Description.
  4. Parameter type: Literal.
  5. Data type: The data type of the literal argument.
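As a sketch of how such a threshold might be used in the query itself (the {{passing_threshold}} and {{score_column}} argument names are assumptions for illustration):

SELECT time_bucket(INTERVAL '5 minutes', {{timestamp_col}}) as ts,
    COUNT(*) as passing_count
FROM {{dataset}}
WHERE {{score_column}} >= {{passing_threshold}}
GROUP BY ts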

Column Arguments

These arguments correspond to any column parameters to the query. For example, if you want to calculate the null count of a specific column, you could introduce a column argument as follows, referencing it in your query with the {{}} variable templating:

Here are the fields included in this aggregate argument:

  1. Parameter key: This is the name of the template variable in the SQL query. Your query will refer to this variable somewhere as {{<parameter_key>}}.
  2. Friendly name.
  3. Description.
  4. Parameter type: Column.
  5. Source dataset parameter key: This is the parameter_key corresponding to the dataset argument this column will be sourced from. This will almost always be dataset, matching the out-of-the-box (OOTB) dataset argument. It indicates which dataset the column is expected to be a part of.
  6. Allow any column type: This is a boolean representing whether the configured column can have any data type or not.
  7. Allowed column types: If allow any column type is False, then this is a list of the data types that might be valid for the aggregation. For example, if the aggregation expects a timestamp type column, this would be set to timestamp.
  8. Tag hints: This is a list of relevant Arthur tag hints for the column. For example, if this aggregation only applies to categorical columns, you could add that hint here. It will help users know whether your metric applies to their models and help Arthur suggest relevant metrics to other users in your workspace.
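As a sketch, a null count aggregation built on a column argument might look like the following (the {{target_column}} argument name is an assumption for illustration):

SELECT time_bucket(INTERVAL '5 minutes', {{timestamp_col}}) as ts,
    SUM(CASE WHEN {{target_column}} IS NULL THEN 1 ELSE 0 END) as null_count
FROM {{dataset}}
GROUP BY ts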

SQL query

This query must be written as valid DuckDB SQL and produce your metric of interest. DuckDB SQL is very close to PostgreSQL, with a few differences called out in the linked documentation.

Any aggregate argument you've configured must appear in the SQL query using templating syntax {{<variable_name>}}.

All SQL queries are expected to output time-series data. The way to do this is to use the time_bucket function in your SELECT clause. time_bucket is a DuckDB timestamp function:

SELECT time_bucket(INTERVAL '5 minutes', {{timestamp_col}}) as ts

You'll need to make sure you group by this time bucket (in this case, ts) in your query's group by clause.

Arthur's default aggregations all report metrics at 5-minute rollup intervals. You can report aggregations at a larger interval if you want, but you will lose granularity in your metrics if you do so.
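For example, an hourly rollup would simply use a larger interval in the time_bucket call; this sketch assumes a basic count aggregation:

SELECT time_bucket(INTERVAL '1 hour', {{timestamp_col}}) as ts,
    COUNT(*) as count
FROM {{dataset}}
GROUP BY ts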

Configuring a custom aggregation for your model

Once you've created your aggregation, you likely want to configure it for your model. To do this, go to the model metrics configuration and add the custom aggregation you just created. You'll need to configure the arguments with appropriate column names, datasets, and literal values. These are the values that will replace any templated {{variables}} in the SQL query at execution time.
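As a sketch of what that substitution looks like, the Step 1 query from the quickstart, configured with a hypothetical dataset named my_inferences, a categorical column device_type, a timestamp column inference_timestamp, and the literal value mobile, would execute roughly as:

SELECT time_bucket(INTERVAL '5 minutes', inference_timestamp) as ts,
    'mobile' as categorical_value,
    COUNT(*) as count
FROM my_inferences
WHERE device_type = 'mobile'
GROUP BY ts, categorical_value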

Example - Inference Count Custom Aggregation

In this example, we're going to write a custom aggregation that calculates the inference count for a dataset. Arthur supports this as a default metric, but we're going to rewrite it as a custom metric.

Here is our SQL query:

Notice we left Model Problem Type blank in the basic information section. This is because this metric applies to models of any problem type. Also notice the templated variables {{dataset}} and {{timestamp}}. These will be configured as aggregate arguments.
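A minimal sketch of such a query, using only the {{dataset}} and {{timestamp}} arguments (the ts and inference_count aliases are assumptions):

SELECT time_bucket(INTERVAL '5 minutes', {{timestamp}}) as ts,
    COUNT(*) as inference_count
FROM {{dataset}}
GROUP BY ts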

We'll configure the following reported metric:

Notice we configured the metric_name to be a readable, easily understandable metric name we'll want to use in our chart queries.

The value and timestamp columns correspond to the aliases used for the columns that report the metric value and the time bucketing column, respectively.

Next, we'll configure the following aggregate arguments:

The OOTB dataset argument:

The timestamp argument:

Notice we expect the primary_timestamp tag hint here. This corresponds to the column with the timestamps representing the inference creation time.

Also, although it's not visible in this screenshot, we'll enforce that the allowed column type is timestamp. We wouldn't want someone to configure a UUID column for an argument we're using in the time_bucket function.

Finally, to configure this metric for an existing model, we would set the timestamp argument to the name of our inference_timestamp column. Then the aggregation will get picked up and computed in any future metrics calculation jobs.