Default Evaluation Functions

Arthur provides common default metrics for all model input and output types. The default metrics available for a given model type are listed on that model type's page. This page gives an overview of all of them and shows how to query each one.

Regression

All regression evaluation metrics will follow the below request body structure.

Query Request:

{
    "select": [
        {
            "function": "[rmse|mae|rSquared]",
            "alias": "<alias_name> [optional string]",
            "parameters": {
                "ground_truth_property": "<attribute_name> [string]",
                "predicted_property": "<attribute_name> [string]"
            }
        }
    ]
}

Query Response:

{
    "query_result": [
        {
            "<function_name/alias_name>": "<evaluation_value> [float]"
        }
    ]
}

RMSE

Get the RMSE between a prediction attribute and a ground truth attribute.

Sample Request:

{
    "select": [
        {
            "function": "rmse",
            "alias": "error",
            "parameters": {
                "ground_truth_property": "FICO_actual",
                "predicted_property": "FICO_predicted"
            }
        }
    ]
}

Sample Response:

{
    "query_result": [
        {
            "error": 0.76
        }
    ]
}

MAE

Get the Mean Absolute Error between a prediction attribute and a ground truth attribute.
This function takes an optional aggregation parameter that switches the aggregation from "avg" to either "min" or "max". This can be helpful if you're looking for extremes, such as the lowest or highest absolute error. The function also supports the optional parameters normalizationMin and normalizationMax, which accept numbers; if both are provided, min/max normalization is applied to the values before aggregation.

Query Request:

{
    "select": [
        {
            "function": "mae",
            "alias": "<alias_name> [optional string]",
            "parameters": {
              "predicted_property": "<predicted_property_name> [string]",
              "ground_truth_property": "<ground_truth_property_name> [string]",
              "aggregation": "[avg|min|max] (default avg, optional)",
              "normalizationMin": "<value> [optional number]",
              "normalizationMax": "<value> [optional number]"
            }
        }
    ]
}

Sample Request:

{
    "select": [
        {
            "function": "mae",
            "alias": "error",
            "parameters": {
                "ground_truth_property": "FICO_actual",
                "predicted_property": "FICO_predicted"
            }
        }
    ]
}

Sample Response:

{
    "query_result": [
        {
            "error": 0.76
        }
    ]
}
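The optional MAE parameters can be assembled programmatically. The sketch below is a hypothetical helper (`build_mae_query` is not part of the Arthur SDK) that only builds the request body described above; the alias and the bounds 300 and 850 are illustrative values, not taken from the original samples. It rejects a half-specified normalization range, since normalization is applied only when both bounds are provided.

```python
import json


def build_mae_query(ground_truth_property, predicted_property, alias=None,
                    aggregation="avg", normalization_min=None,
                    normalization_max=None):
    """Build an "mae" query body (hypothetical helper, not an Arthur SDK call)."""
    if aggregation not in ("avg", "min", "max"):
        raise ValueError("aggregation must be 'avg', 'min', or 'max'")
    # Normalization only happens when both bounds are given, so a single
    # bound is almost certainly a mistake.
    if (normalization_min is None) != (normalization_max is None):
        raise ValueError("provide both normalizationMin and normalizationMax, or neither")

    parameters = {
        "ground_truth_property": ground_truth_property,
        "predicted_property": predicted_property,
    }
    if aggregation != "avg":  # "avg" is the documented default, so it can be omitted
        parameters["aggregation"] = aggregation
    if normalization_min is not None:
        parameters["normalizationMin"] = normalization_min
        parameters["normalizationMax"] = normalization_max

    select = {"function": "mae", "parameters": parameters}
    if alias is not None:
        select["alias"] = alias
    return {"select": [select]}


# Illustrative values: largest absolute error after normalizing to [300, 850].
query = build_mae_query("FICO_actual", "FICO_predicted", alias="max_error",
                        aggregation="max", normalization_min=300,
                        normalization_max=850)
print(json.dumps(query, indent=4))
```

Calling the helper with only the two property names reproduces the structure of the sample request above, minus the alias.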

R Squared

Get the R Squared value between a prediction attribute and a ground truth attribute.

Sample Request:

{
    "select": [
        {
            "function": "rSquared",
            "alias": "rsq",
            "parameters": {
                "ground_truth_property": "FICO_actual",
                "predicted_property": "FICO_predicted"
            }
        }
    ]
}

Sample Response:

{
    "query_result": [
        {
            "rsq": 0.94
        }
    ]
}
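For intuition, the three regression metrics can be reproduced locally on a small sample. This is a minimal sketch using the textbook definitions; it assumes rSquared is the standard coefficient of determination (1 minus residual sum of squares over total sum of squares), which this page does not spell out. The score values are made up for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error (the default "avg" aggregation)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up ground truth and predicted values.
actual = [700.0, 650.0, 720.0, 580.0]
predicted = [710.0, 640.0, 700.0, 600.0]

print(rmse(actual, predicted))       # square root of 250
print(mae(actual, predicted))        # 15.0
print(r_squared(actual, predicted))  # 1 - 1000 / 11675
```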

Binary Classification

When using binary classification evaluation functions with a multiclass model, outputs will be calculated assuming a one vs. all approach.

Confusion Matrix

Calculates the confusion matrix for a classification model. For binary classifiers, users must specify a probability threshold to count a prediction as a positive class.

Query Request:

{
    "select": [
        {
            "function": "confusionMatrix",
            "alias": "<alias_name> [optional string]",
            "parameters": {
                "threshold": "<value [float]> [required only for binary classifiers]"
            }
        }
    ]
}

Query Response:

{
    "query_result": [
        {
            "<function_name/alias_name>": {
                "true_positive": "<count> [int]",
                "false_positive": "<count> [int]",
                "true_negative": "<count> [int]",
                "false_negative": "<count> [int]"
            }
        }
    ]
}

Sample Request: Calculate the confusion matrix for a binary classifier with a threshold of 0.5 (the standard threshold for confusion matrix).

{
    "select": [
        {
            "function": "confusionMatrix",
            "parameters": {
                "threshold": 0.5
            }
        }
    ]
}

Sample Response:

{
    "query_result": [
        {
            "confusionMatrix": {
                "true_positive": 100480,
                "false_positive": 100076,
                "true_negative": 100302,
                "false_negative": 99142
            }
        }
    ]
}
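The role of the threshold can be illustrated with a small local computation. This sketch assumes a prediction counts as positive when its probability is greater than or equal to the threshold; whether the comparison is inclusive is not stated on this page, and the sample data is made up.

```python
def confusion_matrix(probabilities, labels, threshold):
    """Count TP/FP/TN/FN for a binary classifier at a probability threshold."""
    counts = {"true_positive": 0, "false_positive": 0,
              "true_negative": 0, "false_negative": 0}
    for probability, label in zip(probabilities, labels):
        predicted_positive = probability >= threshold  # assumption: inclusive
        if predicted_positive and label:
            counts["true_positive"] += 1
        elif predicted_positive:
            counts["false_positive"] += 1
        elif label:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts


# Made-up positive-class probabilities and ground truth labels (1 = positive).
probabilities = [0.9, 0.4, 0.65, 0.2, 0.5]
labels = [1, 1, 0, 0, 1]
matrix = confusion_matrix(probabilities, labels, threshold=0.5)
print(matrix)
```

Raising the threshold moves borderline predictions from the positive to the negative column, trading false positives for false negatives.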

Confusion Matrix Rate

Calculates the confusion matrix rates for a classification model. For binary classifiers, users must specify a probability threshold to count a prediction as a positive class.

Query Request:

{
    "select": [
        {
            "function": "confusionMatrixRate",
            "alias": "<alias_name> [optional string]",
            "parameters": {
                "threshold": "<value [float]> [required only for binary classifiers]"
            }
        }
    ]
}

Query Response:

{
    "query_result": [
        {
            "<function_name/alias_name>": {
                "true_positive_rate": "<rate> [float]",
                "false_positive_rate": "<rate> [float]",
                "true_negative_rate": "<rate> [float]",
                "false_negative_rate": "<rate> [float]",
                "accuracy_rate": "<rate> [float]"
            }
        }
    ]
}

Sample Request: Calculate the confusion matrix rates for a binary classifier with a threshold of 0.5 (the standard threshold for confusion matrix).

{
    "select": [
        {
            "function": "confusionMatrixRate",
            "parameters": {
                "threshold": 0.5
            }
        }
    ]
}

Response:

{
    "query_result": [
        {
            "confusionMatrixRate": {
                "true_positive_rate": 0.5033513340213003,
                "false_positive_rate": 0.49943606583557076,
                "true_negative_rate": 0.5005639341644292,
                "false_negative_rate": 0.4966486659786997
            }
        }
    ]
}
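The rates follow directly from the four confusion matrix counts. As a sketch using the standard definitions, the counts from the Confusion Matrix sample response reproduce the rates shown in this response; the accuracy rate is computed the same way, as correct predictions over all predictions.

```python
def confusion_matrix_rates(tp, fp, tn, fn):
    """Derive rates from confusion matrix counts using the standard definitions."""
    return {
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (fp + tn),
        "false_negative_rate": fn / (tp + fn),
        "accuracy_rate": (tp + tn) / (tp + fp + tn + fn),
    }


# Counts taken from the Confusion Matrix sample response on this page.
rates = confusion_matrix_rates(tp=100480, fp=100076, tn=100302, fn=99142)
for name, value in rates.items():
    print(name, value)
```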

Confusion Matrix Variants

If you only want a specific metric derived from a confusion matrix, you can use one of the following functions:

  • truePositiveRate
  • falsePositiveRate
  • trueNegativeRate
  • falseNegativeRate
  • accuracyRate
  • balancedAccuracyRate
  • f1
  • sensitivity
  • specificity
  • precision
  • recall

For example, to return the truePositiveRate:

{
    "select": [
        {
            "function": "truePositiveRate",
            "parameters": {
                "threshold": 0.5,
                "ground_truth_property": "ground_truth_a",
                "predicted_property": "class_a"
            }
        }
    ]
}

Response:

{
    "query_result": [
        {
            "truePositiveRate": 0.5033513340213003
        }
    ]
}

AUC

The Area Under the ROC Curve can also be computed for binary classifiers.

Sample Query:

{
    "select": [
        {
            "function": "auc",
            "parameters": {
                "ground_truth_property": "ground_truth_a",
                "predicted_property": "class_a"
            }
        }
    ]
}

Response:

{
    "query_result": [
        {
            "auc": 0.9192331426352897
        }
    ]
}
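One way to read an AUC value is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute-force pair counting, with ties counted as half a win. It illustrates the metric on made-up data and is not a description of how Arthur computes it internally.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ordered pair
    return wins / (len(positives) * len(negatives))


# Made-up positive-class scores and ground truth labels.
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
auc = roc_auc(scores, labels)
print(auc)  # 5 of the 6 positive/negative pairs are ordered correctly
```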

Multi-class Classification

Multi-class Accuracy Rate

Calculates the global accuracy rate.

Query Request:

{
    "select": [
        {
            "function": "accuracyRateMulticlass",
            "alias": "<alias_name> [optional string]"
        }
    ]
}

Query Response:

{
    "query_result": [
        {
            "accuracyRateMulticlass": "<rate> [float]"
        }
    ]
}

Example:

{
    "select": [
        {
            "function": "accuracyRateMulticlass"
        }
    ]
}

Response:

{
    "query_result": [
        {
            "accuracyRateMulticlass": 0.785
        }
    ]
}

Multi-class Confusion Matrix

Calculates the confusion matrix for a multi-class model with regard to a single class. The predicted attribute and ground truth attribute must be passed as parameters.

Query Request:

{
    "select": [
        {
            "function": "confusionMatrixMulticlass",
            "alias": "<alias_name> [optional string]",
            "parameters": {
                "predicted_property": "<predicted_property_name>",
                "ground_truth_property": "<ground_truth_property_name>"
            }
        }
    ]
}

Query Response:

{
    "query_result": [
        {
            "<function_name/alias_name>": {
                "true_positive": "<count> [int]",
                "false_positive": "<count> [int]",
                "true_negative": "<count> [int]",
                "false_negative": "<count> [int]"
            }
        }
    ]
}

Example:

{
    "select": [
        {
            "function": "confusionMatrixMulticlass",
            "parameters": {
                "predicted_property": "predicted_class_A",
                "ground_truth_property": "gt_predicted_class_A"
            }
        }
    ]
}

Response:

{
    "query_result": [
        {
            "confusionMatrixMulticlass": {
                "true_positive": 100480,
                "false_positive": 100076,
                "true_negative": 100302,
                "false_negative": 99142
            }
        }
    ]
}
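Computing a confusion matrix "with regard to a single class" amounts to a one vs. all reduction: the chosen class is treated as positive and every other class as negative. The sketch below assumes each inference has already been reduced to one predicted class label and one ground truth label; how Arthur derives those from the per-class attributes is not covered here, and the labels are made up.

```python
def one_vs_all_confusion_matrix(predicted_classes, actual_classes, target_class):
    """Confusion matrix counts treating target_class as positive, all others as negative."""
    counts = {"true_positive": 0, "false_positive": 0,
              "true_negative": 0, "false_negative": 0}
    for predicted, actual in zip(predicted_classes, actual_classes):
        predicted_positive = predicted == target_class
        actual_positive = actual == target_class
        if predicted_positive and actual_positive:
            counts["true_positive"] += 1
        elif predicted_positive:
            counts["false_positive"] += 1
        elif actual_positive:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts


# Made-up predicted and ground truth class labels for six inferences.
predicted_classes = ["A", "B", "A", "C", "B", "A"]
actual_classes = ["A", "A", "B", "C", "B", "A"]
matrix_a = one_vs_all_confusion_matrix(predicted_classes, actual_classes, "A")
print(matrix_a)
```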

Multi-class Confusion Matrix Rate

Calculates the confusion matrix rates for a multi-class classification model with regard to a single predicted class.

Query Request:

{
    "select": [
        {
            "function": "confusionMatrixRateMulticlass",
            "alias": "<alias_name> [optional string]",
            "parameters": {
                "predicted_property": "<predicted_property_name>",
                "ground_truth_property": "<ground_truth_property_name>"
            }
        }
    ]
}

Query Response:

{
    "query_result": [
        {
            "<function_name/alias_name>": {
                "true_positive_rate": "<rate> [float]",
                "false_positive_rate": "<rate> [float]",
                "true_negative_rate": "<rate> [float]",
                "false_negative_rate": "<rate> [float]",
                "accuracy_rate": "<rate> [float]",
                "balanced_accuracy_rate": "<rate> [float]",
                "precision": "<rate> [float]",
                "f1": "<rate> [float]"
            }
        }
    ]
}

Example calculating the confusion matrix rates:

{
    "select": [
        {
            "function": "confusionMatrixRateMulticlass",
            "parameters": {
                "predicted_property": "predicted_class_A",
                "ground_truth_property": "gt_predicted_class_A"
            }
        }
    ]
}

Response:

{
    "query_result": [
        {
            "confusionMatrixRateMulticlass": {
                "true_positive_rate": 0.6831683168316832,
                "false_positive_rate": 0.015653220951234198,
                "true_negative_rate": 0.9843467790487658,
                "false_negative_rate": 0.31683168316831684,
                "accuracy_rate": 0.9378818737270875,
                "balanced_accuracy_rate": 0.8337575479402245,
                "precision": 0.8884120171673819,
                "f1": 0.7723880597014925
            }
        }
    ]
}
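Several fields in this response are functions of the others. As a quick check using the standard definitions, F1 is the harmonic mean of precision and recall (recall being the true positive rate), and balanced accuracy is the mean of the true positive and true negative rates. The values below are copied from the sample response above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def balanced_accuracy(true_positive_rate, true_negative_rate):
    """Mean of sensitivity (TPR) and specificity (TNR)."""
    return (true_positive_rate + true_negative_rate) / 2


# Values copied from the sample response above.
rates = {
    "true_positive_rate": 0.6831683168316832,
    "true_negative_rate": 0.9843467790487658,
    "precision": 0.8884120171673819,
}

f1 = f1_score(rates["precision"], rates["true_positive_rate"])
balanced = balanced_accuracy(rates["true_positive_rate"], rates["true_negative_rate"])
print(f1)        # agrees with the "f1" field in the response
print(balanced)  # agrees with the "balanced_accuracy_rate" field
```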

If you only want a specific value from the confusion matrix rate function, you can use one of the following functions:

  • truePositiveRateMulticlass
  • falsePositiveRateMulticlass
  • trueNegativeRateMulticlass
  • falseNegativeRateMulticlass

For example, to return the truePositiveRateMulticlass:

{
    "select": [
        {
            "function": "truePositiveRateMulticlass",
            "parameters": {
                "predicted_property": "predicted_class_A",
                "ground_truth_property": "gt_predicted_class_A"
            }
        }
    ]
}

Response:

{
    "query_result": [
        {
            "truePositiveRateMulticlass": 0.5033513340213003
        }
    ]
}

Multi-class F1

Calculates the components needed to compute an F1 score for a multi-class model.

In this example, the model has 3 classes: class-1, class-2, class-3 and the corresponding ground truth labels class-1-gt, class-2-gt, class-3-gt.

Query Request:

{
  "select": [
    {
      "function": "count",
      "alias": "count"
    },
    {
      "function": "confusionMatrixRateMulticlass",
      "alias": "class-1",
      "parameters": {
        "predicted_property": "class-1",
        "ground_truth_property": "class-1-gt"
      }
    },
    {
      "function": "countIf",
      "alias": "class-1-gt",
      "parameters": {
        "property": "multiclass_model_ground_truth_class",
        "comparator": "eq",
        "value": "class-1-gt"
      },
      "stage": "GROUND_TRUTH"
    },
    {
      "function": "confusionMatrixRateMulticlass",
      "alias": "class-2",
      "parameters": {
        "predicted_property": "class-2",
        "ground_truth_property": "class-2-gt"
      }
    },
    {
      "function": "countIf",
      "alias": "class-2-gt",
      "parameters": {
        "property": "multiclass_model_ground_truth_class",
        "comparator": "eq",
        "value": "class-2-gt"
      },
      "stage": "GROUND_TRUTH"
    },
    {
      "function": "confusionMatrixRateMulticlass",
      "alias": "class-3",
      "parameters": {
        "predicted_property": "class-3",
        "ground_truth_property": "class-3-gt"
      }
    },
    {
      "function": "countIf",
      "alias": "class-3-gt",
      "parameters": {
        "property": "multiclass_model_ground_truth_class",
        "comparator": "eq",
        "value": "class-3-gt"
      },
      "stage": "GROUND_TRUTH"
    }
  ]
}

Query Response:

{
  "query_result": [
    {
      "count": 7044794,
      "class-1-gt": 2540963,
      "class-2-gt": 2263918,
      "class-3-gt": 2239913,
      "class-1": {
        "true_positive_rate": 0.4318807475748368,
        "false_positive_rate": 0.3060401245073361,
        "true_negative_rate": 0.6939598754926639,
        "false_negative_rate": 0.5681192524251633,
        "accuracy_rate": 0.5994314383074935,
        "balanced_accuracy_rate": 0.5629203115337503,
        "precision": 0.4432575070302042,
        "f1": 0.437495178612114
      },
      "class-2": {
        "true_positive_rate": 0.42177322676881407,
        "false_positive_rate": 0.3514795196528837,
        "true_negative_rate": 0.6485204803471163,
        "false_negative_rate": 0.578226773231186,
        "accuracy_rate": 0.5756528863725469,
        "balanced_accuracy_rate": 0.5351468535579652,
        "precision": 0.3623427088234848,
        "f1": 0.38980575845890253
      },
      "class-3": {
        "true_positive_rate": 0.26144274353512836,
        "false_positive_rate": 0.2805894672521546,
        "true_negative_rate": 0.7194105327478454,
        "false_negative_rate": 0.7385572564648716,
        "accuracy_rate": 0.5737983254017079,
        "balanced_accuracy_rate": 0.4904266381414869,
        "precision": 0.3028268576818381,
        "f1": 0.2806172238153916
      }
    }
  ]
}

With this result, you can calculate the weighted F1 score by multiplying each class's F1 score by that class's ground truth count, summing the products, and dividing by the total count.
In this example, that would be

(class-1.f1 * class-1-gt + class-2.f1 * class-2-gt + class-3.f1 * class-3-gt) / count

and with numbers:

(0.437495178612114 * 2540963 + 
    0.38980575845890253 * 2263918 + 
    0.2806172238153916 * 2239913) / 7044794

= 0.3722898785
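The same arithmetic can be scripted against the query result. This is a minimal sketch that assumes the aliases used in this example (each class alias paired with a "<class>-gt" count alias); only the fields needed for the calculation are copied from the response above.

```python
def weighted_f1(result, class_names):
    """Weight each class's F1 by its ground truth count, then divide by the total count."""
    weighted_sum = sum(
        result[name]["f1"] * result[name + "-gt"] for name in class_names
    )
    return weighted_sum / result["count"]


# Fields copied from the query response above.
result = {
    "count": 7044794,
    "class-1-gt": 2540963,
    "class-2-gt": 2263918,
    "class-3-gt": 2239913,
    "class-1": {"f1": 0.437495178612114},
    "class-2": {"f1": 0.38980575845890253},
    "class-3": {"f1": 0.2806172238153916},
}

score = weighted_f1(result, ["class-1", "class-2", "class-3"])
print(round(score, 10))
```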

Object Detection

Objects Detected

For multiclass, multilabel, and regression models, querying model performance works the same way for Arthur computer vision models as it does for tabular and NLP models. Object Detection computer vision models, however, have some special fields you can use when querying.

Example query fetching all bounding box fields:

{
    "select": [
        {
            "property": "inference_id"
        },
        {
            "property": "objects_detected"
        }
    ]
}

The response will have 1 object per bounding box.

{
  "query_result": [
    {
        "inference_id": "1",
        "objects_detected.class_id": 0,
        "objects_detected.confidence": 0.6,
        "objects_detected.top_left_x": 23,
        "objects_detected.top_left_y": 45,
        "objects_detected.width": 20,
        "objects_detected.height": 30
    },
    {
        "inference_id": "1",
        "objects_detected.class_id": 1,
        "objects_detected.confidence": 0.6,
        "objects_detected.top_left_x": 23,
        "objects_detected.top_left_y": 45,
        "objects_detected.width": 20,
        "objects_detected.height": 30
    },
    {"inference_id": 2,
     "...": "..."}
    ]
}
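Because each bounding box comes back as its own row, client code often needs to regroup the rows by inference. The sketch below shows one way to do that in Python; the `group_boxes` helper is hypothetical, and the rows mirror the first two objects of the response above.

```python
from collections import defaultdict

PREFIX = "objects_detected."


def group_boxes(rows):
    """Group flattened bounding box rows by inference_id, stripping the field prefix."""
    grouped = defaultdict(list)
    for row in rows:
        box = {
            key[len(PREFIX):]: value
            for key, value in row.items()
            if key.startswith(PREFIX)
        }
        grouped[row["inference_id"]].append(box)
    return dict(grouped)


# Rows mirroring the first two objects of the response above.
rows = [
    {"inference_id": "1", "objects_detected.class_id": 0,
     "objects_detected.confidence": 0.6, "objects_detected.top_left_x": 23,
     "objects_detected.top_left_y": 45, "objects_detected.width": 20,
     "objects_detected.height": 30},
    {"inference_id": "1", "objects_detected.class_id": 1,
     "objects_detected.confidence": 0.6, "objects_detected.top_left_x": 23,
     "objects_detected.top_left_y": 45, "objects_detected.width": 20,
     "objects_detected.height": 30},
]

boxes = group_boxes(rows)
print(len(boxes["1"]))  # two bounding boxes for inference "1"
```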

You can also specify only a single nested field:

{
    "select": [
        {
            "property": "inference_id"
        },
        {
            "property": "objects_detected.class_id"
        },
        {
            "property": "objects_detected.confidence"
        }
    ]
}

The response will have 1 object per bounding box.

{
  "query_result": [
    {
        "inference_id": "1",
        "objects_detected.class_id": 0,
        "objects_detected.confidence": 0.6
    },
    {
        "inference_id": "1",
        "objects_detected.class_id": 1,
        "objects_detected.confidence": 0.6
    },
    {"inference_id": 2, 
     "...":  "..."}
    ]
}

Mean Average Precision

Calculates Mean Average Precision for an object detection model. This is used as a measure of accuracy for object detection models.

The threshold parameter sets the minimum IoU (intersection over union) value for a predicted box to be considered a match for a label. predicted_property and ground_truth_property are optional parameters giving the names of the model's predicted and ground truth attributes. If they are not specified, they default to "objects_detected" and "label" respectively.

Query Request:

{
    "select": [
        {
            "function": "meanAveragePrecision",
            "alias": "<alias_name> [Optional]",
            "parameters": {
                "threshold": "<threshold> [float]",
                "predicted_property": "<predicted_property> [str]",
                "ground_truth_property": "<ground_truth_property> [str]"
            }
        }
    ]
}

Query Response:

{
    "query_result": [
        {
            "<function_name/alias_name>": "<result> [float]"
        }
    ]
}

Example:

{
    "select": [
        {
            "function": "meanAveragePrecision",
            "parameters": {
                "threshold": 0.5,
                "predicted_property": "objects_detected",
                "ground_truth_property": "label"
            }
        }
    ]
}

Query Response:

{
    "query_result": [
        {
            "meanAveragePrecision": 0.78
        }
    ]
}
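To make the threshold concrete, the sketch below computes IoU for two boxes expressed with the top_left_x, top_left_y, width, and height fields shown earlier. It assumes ground truth boxes use the same layout and that a pair matches when IoU is greater than or equal to the threshold; neither detail is stated on this page, and the label box is made up.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1 = box_a["top_left_x"], box_a["top_left_y"]
    ax2, ay2 = ax1 + box_a["width"], ay1 + box_a["height"]
    bx1, by1 = box_b["top_left_x"], box_b["top_left_y"]
    bx2, by2 = bx1 + box_b["width"], by1 + box_b["height"]

    # Overlap along each axis, clamped at zero when the boxes are disjoint.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = box_a["width"] * box_a["height"]
    area_b = box_b["width"] * box_b["height"]
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted_box = {"top_left_x": 23, "top_left_y": 45, "width": 20, "height": 30}
label_box = {"top_left_x": 28, "top_left_y": 45, "width": 20, "height": 30}  # made up

overlap = iou(predicted_box, label_box)
print(overlap)         # 450 / 750
print(overlap >= 0.5)  # counts as a match at threshold 0.5 (assumed inclusive)
```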

Generative Text

Token Likelihood

TokenLikelihoods attributes yield two queryable columns for that attribute with suffixes “_tokens” and “_likelihoods” appended to the attribute's name. For example, a model with a TokenLikelihoods attribute named summary_token_probs yields two queryable columns: summary_token_probs_tokens and summary_token_probs_likelihoods which represent an array of the selected tokens and an array of their corresponding likelihoods.

query = {
    "select": [
        {"property": "summary_token_probs_tokens"},
        {"property": "summary_token_probs_likelihoods"}
    ]
}

Response:

[
    {
        "summary_token_probs_likelihoods": [
            0.3758265972137451,
            0.6563436985015869,
            0.32000941038131714,
            0.5629857182502747
        ],
        "summary_token_probs_tokens": [
            "this",
            "is",
            "a",
            "summary"
        ]
    }
]
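Since the two columns are parallel arrays, a common first step is to pair each token with its likelihood. The sketch below does that for the sample row and picks out the least likely token. The analysis itself is only an illustration of working with the two columns, not an Arthur feature.

```python
# Row copied from the sample output above.
row = {
    "summary_token_probs_likelihoods": [
        0.3758265972137451,
        0.6563436985015869,
        0.32000941038131714,
        0.5629857182502747,
    ],
    "summary_token_probs_tokens": ["this", "is", "a", "summary"],
}

tokens = row["summary_token_probs_tokens"]
likelihoods = row["summary_token_probs_likelihoods"]

# The arrays are index-aligned: likelihoods[i] belongs to tokens[i].
pairs = list(zip(tokens, likelihoods))
least_likely = min(pairs, key=lambda pair: pair[1])
mean_likelihood = sum(likelihoods) / len(likelihoods)

print(least_likely)
print(mean_likelihood)
```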

Bias

Bias Mitigation

Calculates mitigated predictions based on conditional thresholds, returning 0/1 for each inference.

Query Request:

{
    "select":
    [
        {
            "function": "biasMitigatedPredictions",
            "alias": "<alias_name> [Optional]",
            "parameters":
            {
                "predicted_property": "<predicted_property> [str]",
                "thresholds":
                [
                    {
                        "conditions":
                        [
                            {
                                "property": "<attribute_name> [string or nested]",
                                "comparator": "<comparator> [string] Optional: default 'eq'",
                                "value": "<string or number to compare with property>"
                            }
                        ],
                        "threshold": "<threshold> [float]"
                    }
                ]
            }
        }
    ]
}

Query Response:

{
    "query_result": [
        {
            "<function_name/alias_name>": "<result> [int]"
        }
    ]
}

Example:

{
    "select":
    [
        {
            "function": "biasMitigatedPredictions",
            "parameters":
            {
                "predicted_property": "prediction_1",
                "thresholds":
                [
                    {
                        "conditions":
                        [
                            {
                                "property": "SEX",
                                "value": 1
                            }
                        ],
                        "threshold": 0.4
                    },
                    {
                        "conditions":
                        [
                            {
                                "property": "SEX",
                                "value": 2
                            }
                        ],
                        "threshold": 0.6
                    }
                ]
            }
        }
    ]
}

Response:

{
    "query_result":
    [
        {
            "SEX": 1,
            "biasMitigatedPredictions": 1
        },
        {
            "SEX": 2,
            "biasMitigatedPredictions": 0
        },
        {
            "SEX": 1,
            "biasMitigatedPredictions": 0
        }
    ]
}
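The idea behind conditional thresholds can be sketched locally: each inference is compared against the threshold of the first rule whose conditions it satisfies. This is an illustration under several assumptions, not Arthur's implementation: only the default "eq" comparator is handled, the comparison with the threshold is inclusive, and an inference that matches no rule yields None. The probabilities are made up.

```python
def mitigated_prediction(row, predicted_property, thresholds):
    """Return 0/1 using the threshold of the first rule whose conditions all match."""
    for rule in thresholds:
        # Assumption: only the default "eq" comparator is supported in this sketch.
        if all(row[c["property"]] == c["value"] for c in rule["conditions"]):
            # Assumption: a probability equal to the threshold counts as positive.
            return int(row[predicted_property] >= rule["threshold"])
    return None  # assumption: no matching rule means no mitigated prediction


# Thresholds copied from the example request above.
thresholds = [
    {"conditions": [{"property": "SEX", "value": 1}], "threshold": 0.4},
    {"conditions": [{"property": "SEX", "value": 2}], "threshold": 0.6},
]

# Made-up inferences: the same probability gets different outcomes per group.
rows = [
    {"SEX": 1, "prediction_1": 0.45},
    {"SEX": 2, "prediction_1": 0.45},
    {"SEX": 1, "prediction_1": 0.30},
]

predictions = [mitigated_prediction(r, "prediction_1", thresholds) for r in rows]
print(predictions)
```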

Ranked List Outputs

Performance Metrics

Precision at k

Precision is an indicator of the efficiency of a supervised machine learning model. If one model retrieves all the relevant items while recommending fewer items than another model, it has higher precision. For item recommendation models, precision at k measures the fraction of the top-k recommended items that are relevant.

Query Request

{
    "select": [
        {
            "function": "precisionAtK",
            "parameters": {
                "predicted_property": "predicted_items",
                "ground_truth_property": "relevant_items",
                "k": 5
            }
        }
    ]
}

Query Response

{
    "query_result": [
        {
            "precisionAtK": 0.26666666666666666
        }
    ]
}

Recall at k

Recall is an indicator of the effectiveness of a supervised machine learning model. A model that correctly identifies more of the positive instances gets a higher recall value. For recommendations, recall at k is the fraction of all relevant items that appear in the top-k recommendations.

Query Request

{
    "select": [
        {
            "function": "recallAtK",
            "parameters": {
                "predicted_property": "predicted_items",
                "ground_truth_property": "relevant_items",
                "k": 5
            }
        }
    ]
}

Query Response

{
    "query_result": [
        {
            "recallAtK": 0.26666666666666666
        }
    ]
}
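Precision at k and recall at k for a single inference can be sketched as follows. The sketch assumes binary relevance, a predicted list without duplicates, and a precision denominator of k even when fewer than k items were recommended. These are common conventions and are not confirmed by this page; the item ids are made up.

```python
def precision_at_k(predicted, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    relevant_set = set(relevant)
    hits = sum(1 for item in predicted[:k] if item in relevant_set)
    return hits / k  # assumption: divide by k, not by the number of items returned


def recall_at_k(predicted, relevant, k):
    """Fraction of all relevant items that appear in the top-k recommendations."""
    relevant_set = set(relevant)
    hits = sum(1 for item in predicted[:k] if item in relevant_set)
    return hits / len(relevant_set) if relevant_set else 0.0


# Made-up recommendation list (best first) and relevant items.
predicted_items = ["a", "b", "c", "d", "e", "f"]
relevant_items = ["b", "e", "x"]

p_at_5 = precision_at_k(predicted_items, relevant_items, k=5)
r_at_5 = recall_at_k(predicted_items, relevant_items, k=5)
print(p_at_5)  # 2 hits in the top 5
print(r_at_5)  # 2 of 3 relevant items recovered
```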

Mean Average Precision at k (MAP @ k)

MAP@k is one of the most commonly used metrics for evaluating recommender systems. It calculates the precision at every position from 1 through k that holds a relevant item. Average Precision is calculated per inference, and the per-inference values are then averaged across a group of inferences to produce Mean Average Precision.

Query Request

{
    "select": [
        {
            "function": "mapAtK",
            "parameters": {
                "predicted_property": "predicted_items",
                "ground_truth_property": "relevant_items",
                "k": 5
            }
        }
    ]
}

Query Response

{
    "query_result": [
        {
            "mapAtK": 0.26666666666666666
        }
    ]
}
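The two-step calculation (Average Precision per inference, then the mean across inferences) can be sketched as below. Definitions of AP@k differ in their normalizing denominator. This sketch assumes min(k, number of relevant items), which may not match Arthur's exact choice, and uses made-up data.

```python
def average_precision_at_k(predicted, relevant, k):
    """Average of precision@rank over the ranks (1..k) that hold a relevant item."""
    relevant_set = set(relevant)
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        if item in relevant_set:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: normalize by the best achievable number of hits within k.
    denominator = min(k, len(relevant_set))
    return precision_sum / denominator if denominator else 0.0


def map_at_k(predicted_lists, relevant_lists, k):
    """Mean of the per-inference Average Precision values."""
    scores = [
        average_precision_at_k(predicted, relevant, k)
        for predicted, relevant in zip(predicted_lists, relevant_lists)
    ]
    return sum(scores) / len(scores)


# Two made-up inferences.
predicted_lists = [["a", "b", "c", "d", "e"], ["x", "y", "z"]]
relevant_lists = [["b", "e", "x"], ["x"]]

score = map_at_k(predicted_lists, relevant_lists, k=5)
print(score)  # (0.3 + 1.0) / 2
```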

Normalized Discounted Cumulative Gain at k (nDCG @ k)

nDCG measures the overall reward at all positions that hold a relevant item. The reward is the inverse log of the position, so relevant items placed nearer the top of the list earn a larger reward, as desired.

As with MAP@k, this metric is calculated per inference and then averaged across inferences.

Query Request

{
    "select": [
        {
            "function": "nDCGAtK",
            "parameters": {
                "predicted_property": "predicted_items",
                "ground_truth_property": "relevant_items",
                "k": 5
            }
        }
    ]
}

Query Response

{
    "query_result": [
        {
            "nDCGAtK": 0.26666666666666666
        }
    ]
}
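A per-inference nDCG@k can be sketched with binary relevance and the common log base 2 discount. Both choices are assumptions (this page says only "inverse log"), and the data is made up. The discounted gain is divided by the gain of an ideal ordering, so a perfect ranking scores 1.

```python
import math


def ndcg_at_k(predicted, relevant, k):
    """Binary-relevance nDCG@k with a 1 / log2(rank + 1) discount (assumed form)."""
    relevant_set = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(predicted[:k], start=1)
        if item in relevant_set
    )
    # Ideal ordering: all relevant items (up to k of them) at the top of the list.
    ideal_hits = min(k, len(relevant_set))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


perfect = ndcg_at_k(["a", "b", "c"], ["a"], k=3)
second_place = ndcg_at_k(["b", "a", "c"], ["a"], k=3)
print(perfect)       # relevant item at rank 1
print(second_place)  # relevant item at rank 2 earns 1 / log2(3)
```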

AUC

For ranked list metrics, AUC measures the likelihood that a random relevant item is ranked higher than a random irrelevant item. A higher likelihood means a higher AUC score and a better recommendation system. We calculate this likelihood empirically from the ranks the algorithm gives to all items: out of all possible (relevant item, irrelevant item) pairs, AUC is the proportion of pairs in which the relevant item is ranked higher than the irrelevant item.

This metric is calculated per-inference, then aggregated as an average over a group of inferences.

Query Request

{
    "select": [
        {
            "function": "rankedListAUC",
            "parameters": {
                "predicted_property": "predicted_items",
                "ground_truth_property": "relevant_items"
            }
        }
    ]
}

Query Response

{
    "query_result": [
        {
            "rankedListAUC": 0.26666666666666666
        }
    ]
}
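The pair-counting definition above translates into a single pass over the ranked list: every irrelevant item sits below exactly the relevant items already seen. This sketch covers one inference and assumes that only items present in the predicted list take part in the pairs. How relevant items missing from the list are treated is not stated on this page, and the data is made up.

```python
def ranked_list_auc(predicted, relevant):
    """Proportion of (relevant, irrelevant) pairs in which the relevant item ranks higher."""
    relevant_set = set(relevant)
    relevant_seen = 0
    irrelevant_seen = 0
    correctly_ordered = 0
    for item in predicted:  # predicted is ordered best-first
        if item in relevant_set:
            relevant_seen += 1
        else:
            # This irrelevant item ranks below every relevant item seen so far.
            correctly_ordered += relevant_seen
            irrelevant_seen += 1
    total_pairs = relevant_seen * irrelevant_seen
    if total_pairs == 0:
        return None  # undefined without at least one item of each kind
    return correctly_ordered / total_pairs


auc = ranked_list_auc(["a", "b", "c", "d", "e"], ["a", "d"])
print(auc)  # 4 of the 6 pairs are ordered correctly
```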

Mean Reciprocal Rank (MRR)

Mean Reciprocal Rank quantifies the rank of the first relevant item found in the recommendation list. It takes the reciprocal of this “first relevant item rank”, meaning that if the first item is relevant (i.e. the ideal case) then MRR will be 1, otherwise it will be lower.

Query Request

{
    "select": [
        {
            "function": "meanReciprocalRank",
            "parameters": {
                "predicted_property": "predicted_items",
                "ground_truth_property": "relevant_items"
            }
        }
    ]
}

Query Response

{
    "query_result": [
        {
            "meanReciprocalRank": 0.26666666666666666
        }
    ]
}
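Reciprocal rank is computed per inference and then averaged. The sketch below assumes an inference with no relevant item in its list contributes 0 to the mean, a common convention that this page does not confirm; the data is made up.

```python
def reciprocal_rank(predicted, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is present (assumption)."""
    relevant_set = set(relevant)
    for rank, item in enumerate(predicted, start=1):
        if item in relevant_set:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(predicted_lists, relevant_lists):
    """Average of the per-inference reciprocal ranks."""
    ranks = [
        reciprocal_rank(predicted, relevant)
        for predicted, relevant in zip(predicted_lists, relevant_lists)
    ]
    return sum(ranks) / len(ranks)


# Three made-up inferences: first hit at rank 1, at rank 3, and no hit.
predicted_lists = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
relevant_lists = [["a"], ["z"], ["m"]]

mrr = mean_reciprocal_rank(predicted_lists, relevant_lists)
print(mrr)  # (1 + 1/3 + 0) / 3
```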
