Setting A Reference

To enable data drift and anomaly detection, you need to set your model's training data as the reference set, which serves as the baseline. All new inferences are compared against this baseline to quantify the drift and stability of incoming data streams. The reference set should include:

  • inputs

  • model predictions

  • ground truth [optional]
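Before uploading, it can help to confirm that a candidate reference set actually contains these columns. The following is a minimal sketch of such a check, not part of the Arthur SDK: the helper name `missing_reference_columns` and its parameters are hypothetical, and the caller supplies the column names expected for their own model.

```python
from typing import Iterable, List, Optional


def missing_reference_columns(
    columns: Iterable[str],
    input_columns: Iterable[str],
    prediction_columns: Iterable[str],
    ground_truth_columns: Optional[Iterable[str]] = None,
) -> List[str]:
    """Return the expected column names that are absent from `columns`.

    Hypothetical helper for illustration only. Ground truth is optional,
    so it is checked only when the caller passes it in.
    """
    present = set(columns)
    required = list(input_columns) + list(prediction_columns)
    if ground_truth_columns is not None:
        required += list(ground_truth_columns)
    # Preserve the caller's ordering so the report is easy to read.
    return [name for name in required if name not in present]
```

With a pandas DataFrame, the first argument would be its `columns` attribute. An empty list means every expected column is present.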

If you created your model using the ArthurModel.build() method, the DataFrame you pass into that method will be used as the reference data.

If you created your model another way (e.g. using ArthurModel.from_dataframe()), you can manually set the reference data:

# get all input columns
reference_set = df.copy()

# set ground truth labels
reference_set["consumer_credit_score_gt"] = Y_train

# get model predictions
preds = sklearn_model.predict_proba(X_train)
reference_set["consumer_credit_score_prediction"] = preds[:, 1]

Then set this DataFrame as the baseline data:

arthur_model.set_reference_data(data=reference_set)

A Note About Large Batches

If your reference set is too large to fit in memory as a pd.DataFrame, you can instead specify a directory containing Parquet files, which will be uploaded as the reference data.

arthur_model.set_reference_data(directory_path='./data/batch_reference_files/')
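One way to produce such a directory without holding the full dataset in memory is to write it out piece by piece. The sketch below assumes the data arrives as an iterable of pandas DataFrame chunks, each already containing the input, prediction, and (optionally) ground truth columns. The helper name `write_parquet_chunks` and the `part-00000.parquet` file naming are illustrative assumptions, not Arthur conventions, and `DataFrame.to_parquet` requires a Parquet engine such as pyarrow to be installed.

```python
from pathlib import Path
from typing import Iterable, List


def write_parquet_chunks(chunks: Iterable, directory: str) -> List[Path]:
    """Write each DataFrame chunk to its own Parquet file in `directory`.

    Hypothetical helper for illustration only. `chunks` is any iterable of
    objects exposing pandas' `to_parquet(path, index=False)` method.
    """
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, chunk in enumerate(chunks):
        # Zero-padded names keep the files in write order when listed.
        path = out_dir / f"part-{i:05d}.parquet"
        chunk.to_parquet(str(path), index=False)
        written.append(path)
    return written
```

For example, the chunks could come from `pd.read_csv(..., chunksize=100_000)`, with predictions added to each chunk before it is written. The resulting directory can then be passed as `directory_path`.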