Setting A Reference
For data drift and anomaly detection, you need to set your model’s training data to serve as the baseline. All new inferences are compared to this baseline set in order to quantify drift and stability of incoming data streams. The reference set should include:
ground truth [optional]
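To make "quantifying drift against a baseline" concrete, here is a minimal sketch of one common drift score, the Population Stability Index (PSI), for a single numeric feature. It is an illustration only: the function name `psi`, the equal-width binning, and the `eps` floor are assumptions made for this example, and it is not necessarily the metric Arthur computes.

```python
import math


def psi(reference, incoming, n_bins=10, eps=1e-6):
    """Population Stability Index of `incoming` measured against `reference`.

    Hypothetical helper for illustration; bins are equal-width and are
    derived from the reference (baseline) values only.
    """
    lo, hi = min(reference), max(reference)
    # Fall back to a width of 1.0 when every reference value is identical.
    width = (hi - lo) / n_bins or 1.0

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref, new = bin_shares(reference), bin_shares(incoming)
    return sum((n - r) * math.log(n / r) for r, n in zip(ref, new))
```

A score of zero means the incoming values fall into the reference bins in the same proportions as the baseline; larger scores indicate a larger shift.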
If you created your model using the ArthurModel.build() method, the DataFrame you pass into that method will be used as the reference data.
If you created your model another way (e.g. using ArthurModel.from_dataframe()), you can set the reference data manually:
# get all input columns
reference_set = df.copy()

# set ground truth labels
reference_set["consumer_credit_score_gt"] = Y_train

# get model predictions
preds = sklearn_model.predict_proba(X_train)
reference_set["consumer_credit_score_prediction"] = preds[:, 1]
Now we set the baseline data.
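As a minimal sketch of this step, the helper below forwards the assembled DataFrame to the model. It assumes that the ArthurModel object exposes a `set_reference_data` method that accepts a `data` keyword argument; check that against the SDK reference for your installed version. `set_baseline` is a hypothetical wrapper written for this example, not part of the SDK.

```python
def set_baseline(arthur_model, reference_set):
    """Register `reference_set` as the baseline for `arthur_model`.

    Assumption: `arthur_model` is an ArthurModel whose `set_reference_data`
    method takes the reference DataFrame through a `data` keyword argument.
    """
    # An empty baseline gives drift metrics nothing to compare against.
    if len(reference_set) == 0:
        raise ValueError("reference set is empty")
    return arthur_model.set_reference_data(data=reference_set)
```

With the objects from the snippet above, the call would be `set_baseline(arthur_model, reference_set)`.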
A Note About Large Batches
If your reference set is too large to fit in memory as a pd.DataFrame, you can instead specify a directory containing parquet files and upload it as a batch.
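One way to produce such a directory without loading everything at once is to write the data out in chunks. The sketch below assumes each chunk is a pandas DataFrame (for example, the pieces yielded by `pd.read_csv(..., chunksize=...)`) and that a parquet engine such as pyarrow is installed. The helper name `write_reference_chunks` and the file naming scheme are hypothetical.

```python
from pathlib import Path


def write_reference_chunks(chunks, directory):
    """Write each DataFrame chunk to its own parquet file under `directory`.

    Hypothetical helper: `chunks` is any iterable of pandas DataFrames.
    Returns the list of file paths written, in order.
    """
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(chunks):
        path = out_dir / f"reference_{i:05d}.parquet"
        # DataFrame.to_parquet needs pyarrow or fastparquet to be installed.
        chunk.to_parquet(path)
        paths.append(path)
    return paths
```

The resulting directory can then be handed to the SDK. In the arthurai SDK this is believed to be `arthur_model.set_reference_data(directory_path=...)`, but treat that keyword as an assumption and confirm it against the SDK reference.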