Registering Model Attributes Manually
This guide is useful for teams that want to manually register all their model attributes or register a model without a reference dataset.
Setting Up To Individually Onboard Model Attributes
Setting up your model to add each attribute individually is similar to the typical starting point for Creating Arthur Model Objects. Teams must define their Arthur Model Object:
model = arthur.model(partner_model_id=f"CreditRisk_Batch_QS-{datetime.now().strftime('%Y%m%d%H%M%S')}",
display_name="Credit Risk Batch",
input_type=InputType.Tabular,
output_type=OutputType.Multiclass,
is_batch=True)
More information on defining this and all the steps can be found on the Creating Arthur Model Objects page here.
Set up Arthur Input Attributes
Teams looking to manually onboard models can start by manually adding all PipelineInput
and NonInput
attributes to their Arthur Model.
Sending Null Values for Attributes
Unless otherwise specified below in different types, Null values are allowed for different input types. Null and NaN values are allowed for onboarding through the Arthur SDK. On the other hand, Null values are allowed when onboarding with the API, but Null values are not.
Numerical Attribute
Numerical attributes are input attributes meant to track continuous numerical values. They can be manually added to a model in Arthur using the add_attribute
function.
from arthurai.common.constants import Stage, ValueType
# adds a float input attribute directly to the model
arthur_model.add_attribute(
name="Num_Attr_Name",
value_type=ValueType.Float,
stage=Stage.ModelPipelineInput
)
Inferring Numerical Attributes as Categorical
When Arthur is inferring the model schema,
Float
andInteger
columns are assumed to be categorical if there are fewer than 20 unique values and ifFloat
values are all whole numbers. String and boolean columns are always assumed to be categorical for Tabular models.
## Ensure that numerical attributes are valued as numerical and not categorical
arthur_model.get_attribute("Num_Attr_Name", stage=Stage.ModelPipelineInput).categorical = False
Teams may also choose to specify the exact value type. The options are "Integer"
or "Float"
.
arthur_model.get_attribute("Num_Attr_Name", stage=Stage.ModelPipelineInput).value_type = 'INTEGER'
Categorical Attribute
Categorical Attributes are attributes that represent a finite group of values (or categories). They can be manually added to a model in Arthur using the add_attribute
function.
from arthurai.common.constants import Stage, ValueType
# adds a float input attribute directly to the model
arthur_model.add_attribute(
name="Cat_Attr_Name",
value_type=ValueType.String,
stage=Stage.ModelPipelineInput
)
Ensure All Possible Production Attributes Are Specified
Attributes that are set to categorical must have at least one column. In edge cases where this is not possible, the category list can be set using a single "dummy" category (e.g., ["n/a"]). While new categories will be taken in by the platform, they will not be utilized in drift calculations or segmented visualization in the UI. So, it is important to ensure that all potential categories are listed before onboarding.
Setting Possible Categories
Based on the callout above, teams may manually specify the potential categories.
## Set Categories to N/A
arthur_model.get_attribute("Cat_Attr_Name", stage=Stage.ModelPipelineInput).categories = ["n/a"]
## Set Categories to a List
arthur_model.get_attribute("Cat_Attr_Name", stage=Stage.ModelPipelineInput).categories = ["n/a", "bachelors","masters","highschool"]]]
Setting Attribute Labels
When teams have set up numerical encoding for their categorical variables, providing the mapping back to human understanding for the Arthur platform may be useful. This will make it easier for end users to utilize the UI to understand categorical attributes better.
# labels the value 0 for the attribute 'education_level'
# to have the label 'elementary', etc.
arthur_model.set_attribute_labels(
'education_level',
{0 : 'elementary', 1 : 'middle', 2 : 'high', 3 : 'university'}
)
Timestamp Attribute
Timestamp Attributes are model features that represent a date/time. These are frequently found in time series models.
from arthurai.common.constants import Stage, ValueType
# adds a timestamp input attribute directly to the model
arthur_model.add_attribute(
name="Timestamp_Attribute_Name",
value_type=ValueType.Timestamp,
stage=Stage.ModelPipelineInput
)
It is important to note that the DateTime object being put into Arthur, must be in a DateTime format and include a timezone. A common example of how to set up these transformations can be seen below:
## Example of Converting to Pandas DateTime
## This function will need to change depending on how your time strings are formatted
def get_timestamps(x):
new_time = x.split('.')[0]
return datetime.strptime(new_time, '%Y-%m-%d %H:%M:%S')
df['timestamps'] = df['timestamps'].apply(lambda x: get_timestamps(x))
## Ensure Appropriate tzinfo Timezone Added
df['timestamp'] = df['timestamp'].apply(lambda x: x.replace(tzinfo=pytz.UTC))
Null and NaN Values Are Not Allowed
For Timestamp attributes, Null and NaN values are not allowed within Arthur.
Time Series Attribute
Time Series Attributes are model features that represent a value over time.
from arthurai.common.constants import Stage, ValueType
# adds a time series input attribute directly to the model
arthur_model.add_attribute(
name="Time_Series_Attribute_Name",
value_type=ValueType.TimeSeries,
stage=Stage.ModelPipelineInput
)
It is important to note that the TimeSeries object being put into Arthur must be formatted as a list of dicts with "timestamp"
and "value"
keys. The timestamps must be formatted according to the restrictions on Timestamp attributes. The values must be floats.
There must be data for every timestamp on a regular time interval
Each time series attribute should be considered to have some regular time interval (eg. 1 day, 1 week, etc.) at which the value it is recording is polled. The value thus must be recorded on consistent time intervals, and if data is not recorded on a given timestamp in that consistent interval, a data point with a
Null
value must still be recorded.
Text (NLP) Attribute
Unstructured Text Attributes refer to input text for NLP models. They can be manually added to a model in Arthur using the add_attribute
function. The ArthurAttribute
type of UnstructuredText
is designed to be used for NLP models only.
from arthurai.common.constants import Stage, ValueType
# adds a float input attribute directly to the model
arthur_model.add_attribute(
name="NLP_Input_Attribute_Name",
value_type=ValueType.UnstructuredText,
stage=Stage.ModelPipelineInput
)
Generative Text Models
Teams wanting to monitor generative text models should refer to the Generative Text Model Onboarding Guide. This provides a step-by-step walkthrough in manually onboarding those models.
Image Attribute
Image Attributes refer to the images input to computer vision models. They can be manually added to a model in Arthur using the add_image_attribute
function.
model.add_image_attribute("ImageColumnName")
The ImageColumnName
string contains the column's name in your future reference or inference data frames containing the path to each image.
Unique Identifier
A Unique Identifier Attribute within Arthur is created to specify unique values within the platform. These String-type categorical attributes within Arthur specify unique values for every category.
from arthurai.common.constants import Stage, ValueType
# adds a float input attribute directly to the model
arthur_model.add_attribute(
name="Cat_Unique_Attr_Name",
value_type=ValueType.String,
stage=Stage.ModelPipelineInput
)
## Specify this category as unique
arthur_model.get_attribute("Cat_Unique_Attr_Name", stage=Stage.ModelPipelineInput).unique = True
## Specify no categories
arthur_model.get_attribute("Cat_Unique_Attr_Name", stage=Stage.ModelPipelineInput).categories = []
Do Not Onboard Your Partner Inference ID with Reference Data
While the Partner Inference ID (your internal teams inference identifier) is the most common unique identifier within Arthur, this is something that you can specify (or Arthur will create) when you send inferences onto the platform. This is not specified when building out a reference dataset.
Set up Arthur Predicted/Ground Truth Attributes
After sending all input attribute information, teams can specify their model's predicted and ground truth attributes. This will depend on the model task type they are trying to onboard.
Null Values Are Not Allowed for Predicted or Ground Truth Attributes
Null values are not supported for predicted or ground truth attributes.
Classification
To add output attributes for classification tasks, teams must first specify what type of classification model they want to onboard. They can choose between:
- Binary Classification with columns for each predicted attribute and ground truth value
- Multi-Class Classification with columns for each predicted attribute and ground truth value
- Either Binary or Multi-Class Classification with columns for each predicted attribute but only a single column for ground truth
The type of classification you choose should be based on the schema you expect for onboarding reference or inference data later.
Binary Classification
If you expect your inference schema to consist of two predictions and two ground truth columns for your binary classification task, then you should utilize the add_binary_classifier_output_attributes
function. In this function, you need to provide:
- Prediction to Ground Truth Mapping: Mapping of each predicted column to its corresponding ground truth column
- Positive Predicted Attribute: The positive class is the class that is related to your objective function. For example, if you want to classify whether the objects are present in a given scenario. So all the data samples where objects are predicted present will be considered positively predicted.
# map PredictedValue attributes to their corresponding GroundTruth attributes
PRED_TO_GROUND_TRUTH_MAP = {'pred_0' : 'gt_0',
'pred_1' : 'gt_1'}
# add the ground truth and predicted attributes to the model
# specifying that the `pred_1` attribute is the
# positive predicted attribute, which means it corresponds to the
# probability that the binary target attribute is 1
arthur_model.add_binary_classifier_output_attributes(positive_predicted_attr='pred_1',
pred_to_ground_truth_map=PRED_TO_GROUND_TRUTH_MAP)
Multi-Class Classification
If you expect your inference schema to consist of multiple predictions and their corresponding multiple ground truth columns for your classification task, then you should utilize the add_multiclass_classifier_output_attributes
function. In this function, you need to provide:
- Prediction to Ground Truth Mapping: Mapping of each predicted column to its corresponding ground truth column
- Positive Predicted Attribute: The positive class is the class that is related to your objective function. For example, if you want to classify whether the objects are present in a given scenario. So all the data samples where objects are predicted present will be considered positively predicted.
# map PredictedValue attributes to their corresponding GroundTruth attributes
PRED_TO_GROUND_TRUTH_MAP = {
"dog": "dog_gt",
"cat": "cat_gt",
"horse": "horse_gt"
}
# add the ground truth and predicted attributes to the model
arthur_model.add_multiclass_classifier_output_attributes(
pred_to_ground_truth_map = PRED_TO_GROUND_TRUTH_MAP
)
Single Column Classification
Single-column classification is very similar to previous techniques; however, there is only a single ground truth column in this technique.
- Prediction to Ground Truth Mapping: Mapping of each predicted column to its corresponding ground truth value
- Positive Predicted Attribute: The positive class is the class that is related to your objective function. For example, if you want to classify whether the objects are present in a given scenario. So all the data samples where objects are predicted present will be considered positively predicted.
- Ground_Truth_Column: You must specify the single-column ground truth
# Map PredictedValue attribute to its corresponding GroundTruth attribute value.
# This tells Arthur that the `pred_survived` column represents
# the probability that the ground truth column has the value 1
PRED_TO_GROUND_TRUTH_MAP = {
"pred_value": 1
}
# Add the ground truth and predicted attributes to the model,
# specifying which attribute represents ground truth and
# which attribute represents the predicted value.
arthur_model.add_classifier_output_attributes_gtclass(
positive_predicted_attr = 'pred_value',
pred_to_ground_truth_class_map = PRED_TO_GROUND_TRUTH_MAP,
ground_truth_column = 'gt_column'
)
Regression
To manually specify your regression output, teams need to specify a prediction to ground truth mapping with the following:
- Predicted Value: The column that contains your numerical predicted output
- Ground Truth Value: The column that contains the ground truth
from arthurai.common.constants import ValueType
# map PredictedValue attributes to their corresponding GroundTruth attributes
PRED_TO_GROUND_TRUTH_MAP = {
"pred_value": "gt_value",
}
# add the ground truth and predicted attributes to the model
arthur_model.add_regression_output_attributes(
pred_to_ground_truth_map = PRED_TO_GROUND_TRUTH_MAP,
value_type = ValueType.Float
)
Object Detection
To manually specify your object detection models, teams need to specify the following:
- Predicted Attribute Name: This is the column name that will store your predicted bounding boxes
- Ground Truth Attribute Name: This is the name of the column with the true labeled bounding boxes
- Class Labels: All potential object labels for that your model is detecting
predicted_attribute_name = "objects_detected"
ground_truth_attribute_name = "label"
class_labels = ['cat', 'dog', 'person']
arthur_model.add_object_detection_output_attributes(
predicted_attribute_name,
ground_truth_attribute_name,
class_labels)
Generative Text (LLM)
Teams wanting to monitor generative text models should refer to the Generative Text Model Onboarding Guide. This provides a step-by-step walkthrough in manually onboarding those models.
Setting Reference Data Later
For teams that have chosen to manually onboard all of their model attributes to ensure that they were inferred correctly but still want to include a reference dataset for drift calculations, they can! After manually creating the model schema above, this can be done by setting the reference dataset.
# reference dataframe of model inputs
reference_set = pd.DataFrame(....)
# produce model predictions on reference set
# in this example, the predictions are classification probabilities
preds = model.predict_proba(reference_set)
# assign the column corresponding to the positive class
# as the `pred` attribute in the reference data
reference_set["pred"] = preds[:, 1]
# set ground truth labels
reference_set["gt"] = ...
# configure the ArthurModel to use this dataframe as reference data
arthur_model.set_reference_data(data=reference_set)
Updated about 1 year ago