Custom Metrics
This document provides an overview of the custom metrics feature in Arthur. For additional information on querying metrics and the storage of metrics in Arthur, see the Metrics & Querying Overview guide.
Introduction to Arthur Custom Metrics
Arthur allows users to create custom metrics via SQL queries in workspaces. These custom metrics are designed to be reusable across multiple models and projects while being managed in one place.
The output of a custom metric will be time-series data with the following format:
(Model ID, Metric Name, Timestamp, Value, Dimensions)
See the Metrics & Querying Overview Introduction for more information about what these fields correspond to and what types of metrics Arthur supports.
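For intuition, a single row of this time series might look like the following, expressed here as a DuckDB `SELECT` of literals (all values are hypothetical; the actual storage schema is described in the Metrics & Querying Overview):

```sql
-- A hypothetical row of custom metric output (illustrative values only).
SELECT
    '3fa85f64-5717-4562-b3fc-2c963f66afa6' AS model_id,
    'category_count'                       AS metric_name,
    TIMESTAMP '2024-01-01 00:05:00'        AS "timestamp",
    42.0                                   AS value,
    MAP {'categorical_value': 'approved'}  AS dimensions;
```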
Arthur also supports custom metric versioning. If you create a new version of your custom metric, models will not automatically pick up the new version; you will need to update the version of the metric used in any model metric configurations. This ensures you know exactly which query and aggregation version produced any historical time series data as you interpret trends across your charts, dashboards, and alerts.
Quickstart Guide
In this quickstart guide, we'll walk through an example configuration for an inference count custom metric with a filter on a specific categorical value in a column (a category count metric).
Step 1: Write the SQL
Here is the SQL we're interested in for this metric:
```sql
SELECT time_bucket(INTERVAL '5 minutes', {{timestamp_col}}) as ts,
       '{{categorical_value_literal}}' as categorical_value,
       COUNT(*) as count
FROM {{dataset}}
WHERE {{categorical_column}} = '{{categorical_value_literal}}'
GROUP BY ts, categorical_value
```
Callouts:
- Notice any generic argument has been replaced with `{{}}` variable templating. These represent aggregation arguments users will specify when they add our metric to their model.
- Notice we're including `categorical_value` in the `SELECT` and `GROUP BY` clauses. This is because we want to include this metadata as a dimension in our output time series so we know what value generated a given count.
- Notice we're using the `time_bucket` function to aggregate our counts over 5-minute intervals. This means we'll produce one category count every 5 minutes from this aggregation. This column also has to be in our `GROUP BY` clause.
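To make the templating concrete, here is what the query might look like once a user fills in the arguments for their model. The dataset and column names below are hypothetical:

```sql
-- The quickstart query with (hypothetical) arguments substituted:
--   dataset = loan_applications, timestamp_col = inference_timestamp,
--   categorical_column = decision, categorical_value_literal = approved
SELECT time_bucket(INTERVAL '5 minutes', inference_timestamp) as ts,
       'approved' as categorical_value,
       COUNT(*) as count
FROM loan_applications
WHERE decision = 'approved'
GROUP BY ts, categorical_value
```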
Step 2: Configure the Aggregate Arguments

Callouts:
- Every argument represents a templated variable indicated by `{{}}` in the SQL query. Arthur supports dataset arguments, literal arguments, and column arguments. You must create an argument for each templated variable.
- The timestamp and dataset arguments will be a part of every custom aggregation. They support templating for the column with the inference timestamp and the dataset the aggregation is being calculated over.
- The metadata for these arguments is used by Arthur to fill in the arguments, as well as by users who are configuring your metric for their model. We suggest you provide type hints, tag hints, and descriptions so users know exactly how your metric applies to their model.
Step 3: Configure the Reported Metrics

In this section, we configure the metadata for every aggregation the custom metric creates. Arthur will use this to interpret your query results as time-series data.
Callouts:
- The value column is the name of the column containing our aggregate value (in this case, `count`).
- The metric name will be the name of the metric for the generated time series.
- The dimension column includes `categorical_value` because we want this label to be included as a dimension in our output time series to know what category we're reporting the count of.
- The timestamp column is the one with our time bucket values.
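Putting the callouts together, here is the Step 1 query again, annotated with the role each alias plays in this configuration:

```sql
SELECT time_bucket(INTERVAL '5 minutes', {{timestamp_col}}) as ts,  -- timestamp column
       '{{categorical_value_literal}}' as categorical_value,        -- dimension column
       COUNT(*) as count                                            -- value column
FROM {{dataset}}
WHERE {{categorical_column}} = '{{categorical_value_literal}}'
GROUP BY ts, categorical_value
```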
Step 4: Add the Metric to Your Model Metric Config
Finally, we add our metric to our model's metric configuration so that it gets computed in future metrics calculation jobs:

We set the arguments according to our columns of interest. At this point, once you run your next metrics calculation job, `category_count` metrics will start showing up in the data for your model!
More Details
Basic Information
As part of a custom metric, Arthur supports you in configuring some basic information about your aggregation:
- Name: This is a user-friendly, readable name for your metric that will be displayed in the UI. It will not be the same as the `metric_name` field in the reported time series.
- Description.
- Model Problem Type: This is the type of model your custom metric applies to. It will be used by Arthur to suggest to users whether to configure your custom aggregation for their model. For example, if you report that your aggregation applies to `binary_classification` models, we may suggest to users creating `binary_classification` models that the aggregation be included in their model's metrics.
Reported Metrics
Every custom metric may have a list of metrics reported by the SQL query. This is because for some metrics and queries, it makes sense to report multiple metrics from a single aggregation. For example, in a confusion matrix aggregation, you would want to report both a positive count and a negative count without needing to write multiple queries.
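As a sketch of that confusion-matrix case, a single query can produce several value columns, each configured as its own reported metric. The `{{prediction_col}}` argument and the alias names here are assumptions for illustration:

```sql
-- One query reporting two metrics: a positive count and a negative count.
SELECT time_bucket(INTERVAL '5 minutes', {{timestamp_col}}) as ts,
       COUNT(*) FILTER (WHERE {{prediction_col}} = 1) as positive_count,
       COUNT(*) FILTER (WHERE {{prediction_col}} = 0) as negative_count
FROM {{dataset}}
GROUP BY ts
```

Here, `positive_count` and `negative_count` would each be configured as a separate reported metric pointing at its own value column.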
Each reported metric corresponds to a single aggregation that will be stored in the database when this metric is calculated for some model.
Every reported metric includes the following information:
- Metric name: This is the name of the metric in the time series output by the query. This `metric_name` will be used by users when they query metric data in their dashboards and alert rules.
- Description: This description is specific to the single aggregation represented by the reported metric.
- Value column: This is the name of the column in your SQL query that stores the aggregation value. For example, for an inference count metric this is the name of the column with the count of all inferences for some time bucket.
- Timestamp column: This is the name of the column in your SQL query that stores the timestamp buckets for your reported metric.
- Metric kind: This is either `Numeric` or `Sketch`. See the Types of Metrics section of the Metrics & Querying Overview for more information. `Sketch` metrics produce sketch data in the value column and `Numeric` metrics produce numeric data in the value column.
- Dimension column: These are the names of the columns in your SQL query that should be included as dimensions in the output time series. This will correspond to anything you included in a `GROUP BY` clause that wasn't the timestamp column. It ensures the dimension gets included in the output time series in the `dimensions` column. Today, only one column can be configured as a dimension column per reported metric. These columns should be categorical, with a limited number of unique values, to avoid cardinality explosion.
Aggregate Arguments
Arthur supports generically parameterizing a custom metric so that it can be applied to multiple models, which may use different column names for the fields your SQL query needs.
Every aggregate argument will be specified in the SQL query using a templating variable with double curly braces: `{{argument_name}}`. These variables will be filled in when users configure your metric for their model. As a result, make sure you use descriptive names and descriptions so users know how to configure your metric.
Arthur supports three kinds of aggregate arguments:
Dataset Arguments
This argument will be configured for you automatically in the Arthur UI. It corresponds to the dataset used in the aggregation.
It will look as follows:

Here are the fields included in this aggregate argument:
- Parameter key: This is the name of the template variable for the dataset in the query. As a result, your query's `FROM` clause should look like `FROM {{dataset}}`.
- Friendly name.
- Description.
- Parameter type: Dataset.
Literal Arguments
These arguments correspond to any scalar literal parameters to the query. For example, if you want users to be able to configure some threshold that indicates a passing score so you can report the count of passing inferences, you could introduce a literal argument as follows:

Here are the fields included in this aggregate argument:
- Parameter key: This is the name of the template variable in the SQL query. Your query will refer to this variable somewhere as `{{<parameter_key>}}`.
- Friendly name.
- Description.
- Parameter type: Literal.
- Data type: The data type of the literal argument.
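As a sketch of the passing-score example above, a literal argument might appear in a query like this (the `{{score_col}}` and `{{passing_threshold}}` argument names are assumptions):

```sql
-- Count inferences whose score meets a user-configured passing threshold.
-- {{passing_threshold}} is a literal argument; {{score_col}} is a column argument.
SELECT time_bucket(INTERVAL '5 minutes', {{timestamp_col}}) as ts,
       COUNT(*) as passing_count
FROM {{dataset}}
WHERE {{score_col}} >= {{passing_threshold}}
GROUP BY ts
```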
Column Arguments
These arguments correspond to any column parameters to the query. For example, if you want to calculate the null count of a specific column, you could introduce a column argument, referenced in your query with the `{{}}` templating syntax, as follows:

Here are the fields included in this aggregate argument:
- Parameter key: This is the name of the template variable in the SQL query. Your query will refer to this variable somewhere as `{{<parameter_key>}}`.
- Friendly name.
- Description.
- Parameter type: Column.
- Source dataset parameter key: This is the `parameter_key` of the dataset argument this column will be sourced from. This will almost always be `dataset`, matching the OOTB dataset argument. It indicates which dataset the column is expected to be a part of.
- Allow any column type: This is a boolean representing whether the configured column can have any data type or not.
- Allowed column types: If allow any column type is `False`, then this is a list of the data types that are valid for the aggregation. For example, if the aggregation expects a `timestamp` type column, this would be set to `timestamp`.
- Tag hints: This is a list of relevant Arthur tag hints for the column. For example, if this aggregation only applies to `categorical` columns, you could add that hint here. It will help users know whether your metric applies to their models and help Arthur suggest relevant metrics to other users in your workspace.
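As a sketch of the null count example above, a column argument might appear in a query like this (the `{{target_col}}` argument name is an assumption):

```sql
-- Null count for a user-configured column. {{target_col}} is a column
-- argument, likely with "allow any column type" set to True.
SELECT time_bucket(INTERVAL '5 minutes', {{timestamp_col}}) as ts,
       COUNT(*) FILTER (WHERE {{target_col}} IS NULL) as null_count
FROM {{dataset}}
GROUP BY ts
```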
SQL Query
This query must be written as valid DuckDB SQL and produce your metric of interest. DuckDB SQL is very close to PostgreSQL, with a few differences called out in the linked documentation.
Any aggregate argument you've configured must appear in the SQL query using the templating syntax `{{<variable_name>}}`.
All SQL queries are expected to output time series data. The way to do this is to use the `time_bucket` function, a DuckDB timestamp function, in your `SELECT` clause:

```sql
SELECT time_bucket(INTERVAL '5 minutes', {{timestamp_col}}) as ts
```

You'll need to make sure you group by this time bucket (in this case, `ts`) in your query's `GROUP BY` clause.
Arthur's default aggregations all report at 5-minute rollup intervals. You can report at a larger interval if you want, but you will lose granularity in your metrics if you do so.
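For example, a coarser hourly rollup is just a different interval passed to `time_bucket`. A sketch, reusing the count aggregation pattern from the quickstart:

```sql
-- Reporting an inference count at 1-hour granularity instead of the
-- default 5 minutes.
SELECT time_bucket(INTERVAL '1 hour', {{timestamp_col}}) as ts,
       COUNT(*) as count
FROM {{dataset}}
GROUP BY ts
```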
Configuring a Custom Aggregation for Your Model
Once you've created your aggregation, you likely want to configure it for your model. To do this, go to the model metrics configuration and add the custom aggregation you just created. You'll need to configure any arguments with appropriate column names, datasets, and literal values. These are the values that will replace the templated `{{variables}}` in the SQL query at query execution time.
Example - Inference Count Custom Aggregation
In this example, we're going to write a custom aggregation that calculates the inference count for a dataset. Arthur supports this as a default metric, but we're going to rewrite it as a custom metric.
Our SQL query follows the same pattern as the quickstart. Here is a sketch (the `ts` and `count` aliases are assumptions):
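```sql
-- A sketch of the inference count query: one COUNT(*) per 5-minute bucket.
SELECT time_bucket(INTERVAL '5 minutes', {{timestamp}}) as ts,
       COUNT(*) as count
FROM {{dataset}}
GROUP BY ts
```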

Notice we left `Model Problem Type` blank in the basic information section. This is because this metric applies to models of any problem type. Also notice the templated variables `{{dataset}}` and `{{timestamp}}`. These will be configured as aggregate arguments.
We'll configure the following reported metric:

Notice we configured the `metric_name` to be a readable, easily understandable metric name we'll want to use in our chart queries.
The `value` and `timestamp` columns correspond to the aliases used for the columns that report the metric value and the time bucketing column, respectively.
Next, we'll configure the following aggregate arguments:
The OOTB `dataset` argument:

The `timestamp` argument:

Notice we expect the `primary_timestamp` tag hint here. This corresponds to the column with the timestamps representing the inference creation time.
Also, although it's not visible in this screenshot, we'll enforce that the allowed column type is `timestamp`. We wouldn't want someone to configure a `UUID` column for an argument we're using in the `time_bucket` function.
Finally, to configure this metric for an existing model, we would set the `timestamp` argument to the name of our `inference_timestamp` column. Then the aggregation will get picked up and computed in any future metrics calculation jobs.
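After that substitution, the query Arthur executes would look roughly like this (the dataset name below is hypothetical):

```sql
-- The rendered query after argument substitution, assuming a hypothetical
-- dataset named my_model_inferences.
SELECT time_bucket(INTERVAL '5 minutes', inference_timestamp) as ts,
       COUNT(*) as count
FROM my_model_inferences
GROUP BY ts
```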