Setting A Reference

For data drift and anomaly detection, you need to set a reference set, typically your model’s training data, to serve as the baseline. All new inferences are compared against this baseline to quantify drift and the stability of incoming data streams. The reference set should include:

  • inputs

  • model predictions

  • ground truth [optional]

# start from a copy of all input columns
# (df is assumed to hold the same rows, in the same order, as X_train)
reference_set = df.copy()

# set ground truth labels
reference_set["consumer_credit_score_gt"] = Y_train

# get model predictions
preds = sklearn_model.predict_proba(X_train)
reference_set["consumer_credit_score_prediction"] = preds[:, 1]
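Before uploading, it can be worth confirming that the reference set has the shape described above. The following is a minimal sketch, not part of the Arthur SDK: `check_reference_set` is a hypothetical helper, and the toy DataFrame and its `age` column are made up for illustration. It assumes the prediction column holds probabilities, as produced by `predict_proba` above.

```python
import pandas as pd


def check_reference_set(reference_set, prediction_col, ground_truth_col=None):
    """Hypothetical helper: return a list of problems found (empty if none)."""
    problems = []
    if reference_set.empty:
        problems.append("reference set has no rows")
    if prediction_col not in reference_set.columns:
        problems.append(f"missing prediction column: {prediction_col}")
    else:
        preds = reference_set[prediction_col]
        if preds.isna().any():
            problems.append(f"null values in {prediction_col}")
        elif not preds.between(0.0, 1.0).all():
            # predict_proba output should lie in [0, 1]
            problems.append(f"values outside [0, 1] in {prediction_col}")
    # ground truth is optional, so only check it when a column name is given
    if ground_truth_col is not None and ground_truth_col not in reference_set.columns:
        problems.append(f"missing ground truth column: {ground_truth_col}")
    return problems


# Toy data for illustration only; column names mirror the example above.
toy_reference_set = pd.DataFrame({
    "age": [34, 51, 27],
    "consumer_credit_score_gt": [1, 0, 1],
    "consumer_credit_score_prediction": [0.91, 0.12, 0.67],
})

problems = check_reference_set(
    toy_reference_set,
    prediction_col="consumer_credit_score_prediction",
    ground_truth_col="consumer_credit_score_gt",
)
print(problems or "no problems found")
```

An empty list means none of these checks failed; it does not guarantee the upload will be accepted.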

Now we set the baseline data.

arthur_model.set_reference_data(data=reference_set)

A Note About Large Batches

If your reference set is too large to fit in memory as a pd.DataFrame, you can instead specify a directory containing parquet files, and the reference set is uploaded from those files.

arthur_model.set_reference_data(directory_path='./data/batch_reference_files/')
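One way to produce such a directory without loading everything at once is to stream the data in chunks and write each chunk to its own parquet file. The sketch below assumes the reference data already exists as a CSV with the input, prediction, and ground truth columns; `iter_named_chunks`, `write_parquet_chunks`, and the `reference_NNNN.parquet` file naming are hypothetical, not part of the Arthur SDK. Writing parquet with pandas requires `pyarrow` or `fastparquet` to be installed.

```python
import io
import os

import pandas as pd


def iter_named_chunks(csv_source, rows_per_file):
    """Yield (file_name, chunk) pairs, reading at most rows_per_file rows at a time."""
    if rows_per_file <= 0:
        raise ValueError("rows_per_file must be positive")
    reader = pd.read_csv(csv_source, chunksize=rows_per_file)
    for i, chunk in enumerate(reader):
        # Hypothetical naming scheme; any distinct .parquet names would do.
        yield f"reference_{i:04d}.parquet", chunk


def write_parquet_chunks(csv_source, directory_path, rows_per_file=100_000):
    """Write each chunk to directory_path as a parquet file; return the paths."""
    os.makedirs(directory_path, exist_ok=True)
    paths = []
    for file_name, chunk in iter_named_chunks(csv_source, rows_per_file):
        path = os.path.join(directory_path, file_name)
        chunk.to_parquet(path, index=False)  # needs pyarrow or fastparquet
        paths.append(path)
    return paths


# Small in-memory CSV, for illustration only, showing how rows are split.
demo_csv = io.StringIO("age,consumer_credit_score_prediction\n34,0.91\n51,0.12\n27,0.67\n")
demo_chunks = [(name, len(chunk)) for name, chunk in iter_named_chunks(demo_csv, rows_per_file=2)]
print(demo_chunks)
```

With a real file, `write_parquet_chunks('reference.csv', './data/batch_reference_files/')` would fill the directory that is then passed as `directory_path`; the CSV file name here is a placeholder.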