Ranked List (Recommender Systems)
Ranked List output data is typically used in recommender systems: machine learning models that generate suggestions for "relevant" ranked items based on some input data. An example of a recommender system using ranked list data is a model that recommends relevant movies to a viewer based on metadata derived from their watch history.
Formatted Data in Arthur
Ranked List output models require the following data formatting:
```json
[
  { // first recommended item
    "item_id": "item1",
    "score": 0.324,
    "label": "apple"
  },
  { // second recommended item
    "item_id": "item2",
    "score": 0.024,
    "label": "banana"
  }
]
```
In this formatting, score must be a float, while item_id and label must be strings. label is an optional, human-readable version of item_id, and score is an optional score/probability for the item. If an optional field is specified for one item in an inference, it must be specified for every item in that inference.
Arthur expects the list of ranked items to be sorted in rank order, with the highest-ranked item first. Each ranked list output model in Arthur can have at most 1,000 unique recommended items in its reference dataset, and at most 100 recommendations per inference or ground truth record.
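To make these rules concrete, here is a minimal sketch of a pre-submission check written for this guide (a hypothetical helper, not part of the Arthur SDK) that validates a ranked-list payload against the constraints above:

```python
from typing import Any

MAX_RECOMMENDATIONS = 100  # per-inference limit noted above


def validate_ranked_list(items: list[dict[str, Any]]) -> None:
    """Check a ranked-list payload against the formatting rules above."""
    if len(items) > MAX_RECOMMENDATIONS:
        raise ValueError(f"at most {MAX_RECOMMENDATIONS} recommendations per inference")

    has_score = ["score" in item for item in items]
    has_label = ["label" in item for item in items]
    # Optional fields are all-or-none across the items in one inference.
    if any(has_score) and not all(has_score):
        raise ValueError("'score' must be set for all items if set for any")
    if any(has_label) and not all(has_label):
        raise ValueError("'label' must be set for all items if set for any")

    for item in items:
        if not isinstance(item.get("item_id"), str):
            raise TypeError("'item_id' is required and must be a string")
        if "score" in item and not isinstance(item["score"], float):
            raise TypeError("'score' must be a float")
        if "label" in item and not isinstance(item["label"], str):
            raise TypeError("'label' must be a string")

    # Items must arrive sorted highest-ranked first; when scores are
    # present, they should therefore be non-increasing.
    if all(has_score):
        scores = [item["score"] for item in items]
        if scores != sorted(scores, reverse=True):
            raise ValueError("items must be sorted in rank order (highest first)")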
Recommender Systems Ground Truth
The ground truth for ranked list output models is an array of strings representing the items that have been determined “relevant” for a given inference.
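Continuing the payload example above, the ground truth for one inference might look like the following ("item7" is a made-up identifier for illustration; ground truth may include relevant items the model did not recommend):

```json
["item1", "item7"]
```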
Available Metrics
When onboarding recommender system models, you have a number of default metrics available to you within the UI. You can learn more about each specific metric in the metrics section of the documentation.
Out-of-Box Metrics
The following metrics are automatically available in the UI (out-of-the-box) when teams onboard a ranked list model. Find out more about these metrics in the Performance Metrics section.
Metric | Metric Type |
---|---|
Precision at k | Performance |
Recall at k | Performance |
nDCG at k | Performance |
Mean Reciprocal Rank | Performance |
Ranked List AUC | Performance |
Inference Count | Ingestion |
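These rank-aware performance metrics follow their standard definitions. The sketch below illustrates the formulas with binary relevance for a single inference (an illustration written for this guide, not Arthur's internal implementation):

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k recommendations that are relevant."""
    return sum(item in relevant for item in ranked[:k]) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k."""
    return sum(item in relevant for item in ranked[:k]) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant recommendation (0 if none)."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


ranked = ["item1", "item2", "item3"]   # sorted highest-ranked first
relevant = {"item1", "item7"}          # ground truth for this inference
print(precision_at_k(ranked, relevant, k=3))  # 0.333...
print(recall_at_k(ranked, relevant, k=3))     # 0.5
print(reciprocal_rank(ranked, relevant))      # 1.0
print(ndcg_at_k(ranked, relevant, k=3))       # ~0.613
```

Mean Reciprocal Rank is then the average of the per-inference reciprocal ranks across a set of inferences.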
Drift Metrics
In the Arthur platform, drift metrics are calculated against a reference dataset, so once a reference dataset is onboarded for your model these metrics are available out of the box. Find out more about these metrics in the Drift and Anomaly section.
Note: Teams can also evaluate drift between inference data from different intervals with our Python SDK and query service (for example, data coming into the model now compared to a month ago).
Metric | Metric Type |
---|---|
PSI | Feature Drift |
Time Series Drift | Feature Drift |
Prediction Drift | Prediction Drift |
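As an illustration of what a drift score like PSI captures, the sketch below implements the standard PSI formula over histogram bins derived from the reference data; Arthur's exact binning strategy may differ:

```python
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference and a current sample.

    Bins come from the reference distribution; a small epsilon avoids
    division by zero for empty bins. Current values falling outside the
    reference range are not counted in any bin here.
    """
    eps = 1e-6
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    cur_pct = np.histogram(current, bins=edges)[0] / len(current) + eps
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


# Example: compare a reference feature distribution to a shifted one.
rng = np.random.default_rng(0)
print(psi(rng.normal(0, 1, 5_000), rng.normal(0.3, 1, 5_000)))
```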
User-Defined Metrics
Your team may use a different performance metric, want to track defined data segments, or need logical functions to create metrics for external stakeholders (like product or business metrics). Learn more about creating metrics from your data in Arthur in the User-Defined Metrics section.