Explainability

Once enabled, Arthur automatically calculates explanations (feature importances) for every prediction your model makes. To make this possible, we package your model in a way that lets us call its predict function. We require a few things from you:

  • A Python script that wraps your model's predict function.

    • For Image models, a second function, load_image, is also required. Details are under the “Load Image Function” section in this document.

  • A directory containing the above file, along with any serialized model files and other supporting code.

  • A requirements.txt with the dependencies needed to support the above.

This guide will walk through setting everything up, and then using the SDK to enable explainability.

Setting up your project directory

Your project can be set up in any way you like, but there are a few things you need to let us know about.

Requirements File

Your project requirements and dependencies can be stored in any format you like, such as the typical requirements.txt file, or another form of dependency management.

This should contain all packages your model and predict function need to run.

Note: You do not need to include arthurai here; we supply it.

# example_requirements.txt

pandas==0.24.2
numpy==1.16.4
scikit-learn==0.21.3
torch==1.3.1
torchvision==0.4.2

It is advised to pin the specific versions your model requires. If no version is pinned we will use the latest version. This can cause issues if the latest version is not compatible with the version used to build your model.
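One way to produce pinned entries is to read the versions installed in the environment where the model was trained. The sketch below uses the standard-library importlib.metadata module. The helper name pin_requirements and its lookup parameter are illustrative only and are not part of the Arthur SDK.

```python
from importlib import metadata


def pin_requirements(package_names, lookup=metadata.version):
    """Return 'name==version' lines for the given distribution names.

    `lookup` maps a distribution name to its installed version. It defaults
    to importlib.metadata.version and is a parameter only so the helper can
    be exercised without the packages being installed.
    """
    return [f"{name}=={lookup(name)}" for name in package_names]
```

Calling pin_requirements(["pandas", "numpy"]) in the training environment would return one pinned line per package, which can then be written to your requirements file.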

Prediction Function

We need to be able to send new inferences to your model to get predictions and generate explanations. For us to have access to your model, you need to create an entrypoint file that defines a predict() method.

The exact name of the file isn’t strict, so long as you specify the correct name when you enable explainability (see below). The only thing that does matter is that this file implements a predict() method. In most cases, if you have a previously trained model, this predict() method will likely just invoke the prediction from your trained model.

# example_entrypoint.py
import joblib

sk_model = joblib.load("./serialized_model.pkl")

def predict(x):
    return sk_model.predict_proba(x)

See SparkML Integration Guide for an example using a SparkML model.

This predict method can be as simple or complicated as you need, so long as you can go from raw input data to a model output prediction. Specifically, in the case of a binary classifier, we expect a 2-d array where the first column indicates probability_0 for each input, and the second column indicates probability_1 for each input. In the case of a multiclass classifier with n possible labels, we expect a 2-d array with n columns, where column i holds the predicted probability that each input belongs to class i.
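The expected output shape can be checked locally before enabling explainability. The following is a minimal sketch, assuming the columns are mutually exclusive class probabilities so that each row sums to 1. The function name check_probabilities is hypothetical and not part of the SDK.

```python
import numpy as np


def check_probabilities(preds, n_classes, atol=1e-6):
    """Raise ValueError unless preds looks like per-class probabilities."""
    preds = np.asarray(preds, dtype=float)
    # One row per input, one column per class.
    if preds.ndim != 2 or preds.shape[1] != n_classes:
        raise ValueError(f"expected shape (n_rows, {n_classes}), got {preds.shape}")
    # Assumption: classes are mutually exclusive, so rows sum to 1.
    if not np.allclose(preds.sum(axis=1), 1.0, atol=atol):
        raise ValueError("each row should sum to 1")
    return preds
```

For a binary classifier you would call check_probabilities(predict(sample), n_classes=2) on a few rows of training data.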

Commonly, a fair amount of feature processing and transformation will need to happen prior to invoking your actual model.predict(). This might include normalizations, rescaling, one-hot encoding, embedding, and more. Whatever those transformations are, you can make them a part of this predict() method. Alternatively, you can wrap all those transformations into a helper function.

# example_entrypoint.py
import joblib
from utils import pipeline_transformations

sk_model = joblib.load("./serialized_model.pkl")

def predict(x):
    return sk_model.predict_proba(pipeline_transformations(x))
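As an illustration of what such a helper might contain, here is a minimal sketch of a utils.py that standardizes numeric features. It assumes the model was trained on standardized inputs. TRAIN_MEAN and TRAIN_STD are placeholder values standing in for statistics you would save at training time.

```python
# utils.py (hypothetical contents)
import numpy as np

# Assumed to have been computed on the training data and saved alongside
# the model; the values here are placeholders for illustration only.
TRAIN_MEAN = np.array([50.0, 0.5])
TRAIN_STD = np.array([10.0, 0.25])


def pipeline_transformations(x):
    """Standardize raw inputs the same way the training data was."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        # Treat a single row as a batch of one.
        x = x.reshape(1, -1)
    return (x - TRAIN_MEAN) / TRAIN_STD
```

Keeping the transformation in one importable function means the same code path runs at training time and inside predict().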

Load Image Function

NOTE: Explainability is not available for Object Detection Image models.

For Multiclass Image models, a second function is required: load_image(). This function takes a string, which is the path to an image file, and returns the image as a numpy array. Any image processing, such as converting to grayscale, should also happen in this function, because LIME (the explanation algorithm used behind the scenes) creates variations of this array to generate explanations. However, any transformation that results in a non-numpy array, such as converting to a Tensor, should happen in the predict function.

No image resizing is required, however. As part of onboarding an image model, pixel_height and pixel_width are set as metadata on the model. When ingesting, we automatically resize the image to the configured size and pass the path of this resized image to the load_image function.

This function should be in the same file as the predict function.

Below is a full example file for an Image model, with both load_image and predict defined. Imports and class definitions are omitted for brevity.

# example_entrypoint.py
import ...

class MedNet(nn.Module):
    ...

# load model using custom user defined class
net = MedNet()
path = pathlib.Path(__file__).parent.absolute()
net.load_state_dict(torch.load(f'{path}/pretrained_model'))

# helper function for transforming image
def quantize(np_array):
    return np_array + (np.random.random(np_array.shape) / 256)

def load_image(image_path):
    """Takes in single image path, and returns single image in format predict expects
    """
    return quantize(np.array(Image.open(image_path).convert('RGB')) / 256)

def predict(images_in):
    """Takes in numpy array of images, and returns predictions in numpy array.

    Can handle both single image in `numpy` array, or multiple images.
    """
    batch_size, pixdim1, pixdim2, channels = images_in.shape
    raw_tensor = torch.from_numpy(images_in)
    processed_images = torch.reshape(raw_tensor, (batch_size, channels, pixdim1, pixdim2)).float()
    net.eval()
    with torch.no_grad():
        return net(processed_images).numpy()
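Before enabling explainability, it can help to run a single image through both functions locally. The sketch below is one possible check. The smoke_test helper is hypothetical, and it assumes predict returns one row of n_classes scores per image.

```python
import numpy as np


def smoke_test(load_image, predict, image_path, n_classes):
    """Run one image through load_image and predict and check the shapes."""
    image = load_image(image_path)
    if not isinstance(image, np.ndarray):
        raise TypeError("load_image should return a numpy array")
    # predict accepts a batch of images, so add a leading batch axis.
    preds = np.asarray(predict(image[np.newaxis, ...]))
    if preds.shape != (1, n_classes):
        raise ValueError(f"expected shape (1, {n_classes}), got {preds.shape}")
    return preds
```

You would pass in the load_image and predict functions from your entrypoint file along with the path to any image from your dataset.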

Project Structure

Here is an example of what your project directory might look like.

|-- fraud_model/
|   |-- data/
|   |  |-- training_data.csv
|   |  |-- testing_data.csv
|   |-- example_requirements.txt
|   |-- example_entrypoint.py
|   |-- utils.py
|   |-- serialized_model.pkl
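A quick way to confirm the directory contains what the guide asks for is to check for the requirements file and the entrypoint file. This is a minimal sketch using pathlib. The helper name is hypothetical, and it assumes the entrypoint is a top-level module in the project directory.

```python
from pathlib import Path


def missing_project_files(project_directory, requirements_file, entrypoint_module):
    """Return the expected files that are absent from project_directory."""
    root = Path(project_directory)
    # Assumption: the entrypoint is a top-level module, e.g. example_entrypoint.py.
    expected = [requirements_file, f"{entrypoint_module}.py"]
    return [name for name in expected if not (root / name).is_file()]
```

An empty list means both files were found. Serialized model files and helper modules are not checked here because their names vary by project.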

Enabling Explainability

Any time after onboarding a model, you can enable explainability by letting us know about these details specific to your project. You also need to provide a sample of your training data (in a pandas DataFrame) so that we can set up the Explainer properly.

# optionally exclude directories within `project_dir` from being bundled with predict function
ignore_dirs = [
    'dir_to_ignore_within_project_directory',     
    'dir_within_project_dir_still_included/excluded_dir'
]

arthur_model.enable_explainability(
    df=X_train.head(50),
    project_directory="/path/to/fraud_model/",
    requirements_file="example_requirements.txt",
    user_predict_function_import_path="example_entrypoint",
    ignore_dirs=ignore_dirs
)

The above provides a simple example. For a list of all configuration options and details around them, see the Explainability entry in the Enrichment guide.

Notes about the above examples:

  1. joblib is a Python library that allows you to reconstruct your model from a serialized pickle file.

  2. X_train is the DataFrame of data your model was trained on.

  3. user_predict_function_import_path is the Python import path of the entrypoint file, written as you would import it from the Python program that runs enable_explainability (for example, example_entrypoint for example_entrypoint.py).
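To see how an import path maps to a file, the sketch below resolves a module name relative to a project directory using ordinary Python import machinery. It illustrates standard import resolution only and is not the SDK's actual loading code. The load_predict name is hypothetical.

```python
import importlib
import sys


def load_predict(project_directory, import_path):
    """Import `import_path` relative to project_directory and return predict."""
    # Putting the project directory first on sys.path makes
    # "example_entrypoint" resolve to <project_directory>/example_entrypoint.py.
    sys.path.insert(0, str(project_directory))
    try:
        importlib.invalidate_caches()
        module = importlib.import_module(import_path)
    finally:
        sys.path.remove(str(project_directory))
    predict = getattr(module, "predict", None)
    if not callable(predict):
        raise AttributeError(f"{import_path} does not define a callable predict()")
    return predict
```

If load_predict("/path/to/fraud_model/", "example_entrypoint") succeeds locally, the import path has the right form.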

Troubleshooting

AttributeError When Loading Predict Function

While this can be an issue with any model type, it is most common with scikit-learn objects that take in custom user functions. We will use TfidfVectorizer as an example: it is a commonly used vectorizer for NLP models and is often given custom user functions.

A TfidfVectorizer accepts a user-defined tokenize function, which is used to split a text string into tokens.

Problem

Say this code was used to create your model.

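# create_model.py
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# `nlp` (a loaded spaCy pipeline), `X_train`, and `y_train` are assumed
# to be defined elsewhere in this script

def tokenize(text):
    # tokenize and lemmatize
    doc = nlp(text)
    tokens = []
    for token in doc:
        if not token.is_stop and not token.is_punct \
        and not token.is_space and token.lemma_ != '-PRON-':
            tokens.append(token.lemma_)
    return tokens

def make_model():
    # here we pass a custom function to an sklearn object
    vectorizer = TfidfVectorizer(tokenizer=tokenize)
    vectorizer.fit(X_train)
    model = LogisticRegression()
    # fit() needs the training labels as well as the features
    model.fit(vectorizer.transform(X_train), y_train)

    pipeline = make_pipeline(vectorizer, model)
    joblib.dump(pipeline, 'model.pkl')

if __name__ == "__main__":
    make_model()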

Now you create this entrypoint file to enable explainability:

# entrypoint.py
import joblib

model = joblib.load("./model.pkl")

def predict(fv):
    return model.predict_proba(fv)

Now when the SDK imports entrypoint to test the function, the following error gets thrown:

AttributeError: module '__main__' has no attribute 'tokenize'

What happens is that Python does not serialize the custom function itself, only a reference to where it can be imported from. In this case the function was defined at the top level of the model creation script, hence __main__.tokenize in the error. When the model is loaded, that function does not exist in the running program's __main__ module, so the error is thrown.
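The by-reference behavior is easy to observe with the standard pickle module. The sketch below pickles a function from the standard library (textwrap.dedent, chosen only for illustration) and inspects the payload.

```python
import pickle
import textwrap

# Pickle a plain module-level function from the standard library.
payload = pickle.dumps(textwrap.dedent)

# The payload holds the module name and the function name, not the code.
print(b"textwrap" in payload, b"dedent" in payload)

# Unpickling looks the name up again and returns the very same object.
print(pickle.loads(payload) is textwrap.dedent)
```

Both checks print True. If the named module or function cannot be found at load time, the lookup fails, which is what produces the AttributeError above.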

Solution

To solve this, move tokenize into its own module that can be imported from both create_model.py and entrypoint.py.

# model_utils.py

# `nlp` is assumed to be a spaCy pipeline loaded elsewhere in this module
def tokenize(text):
    # tokenize and lemmatize
    doc = nlp(text)
    tokens = []
    for token in doc:
        if not token.is_stop and not token.is_punct \
        and not token.is_space and token.lemma_ != '-PRON-':
            tokens.append(token.lemma_)
    return tokens

# create_model.py
from model_utils import tokenize

def make_model():
    # here we pass a custom function to an sklearn object
    vectorizer = TfidfVectorizer(tokenizer=tokenize)
    vectorizer.fit(X_train)
    model = LogisticRegression()
    # fit() needs the training labels (assumed here to be y_train) as well as the features
    model.fit(vectorizer.transform(X_train), y_train)

    pipeline = make_pipeline(vectorizer, model)
    joblib.dump(pipeline, 'model.pkl')

if __name__ == "__main__":
    make_model()

# entrypoint.py
import joblib
from model_utils import tokenize

model = joblib.load("./model.pkl")

def predict(fv):
    return model.predict_proba(fv)

Now, when Python serializes the model, it stores the reference as model_utils.tokenize. Because model_utils can be imported when entrypoint.py loads the model, the reference resolves and no error is thrown.

Both model_utils.py and entrypoint.py must be included in the directory passed to enable_explainability().
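To confirm the fix locally, you can import the entrypoint in a fresh interpreter started from the project directory. This is a minimal sketch using subprocess. The helper name is hypothetical, and it only checks that the import succeeds, not that predictions are correct.

```python
import subprocess
import sys


def entrypoint_imports_cleanly(project_directory, module_name="entrypoint"):
    """Import module_name in a new interpreter started in project_directory.

    Returns (ok, stderr_text). A fresh process matters here: importing from
    the training script's own session can hide the error, because tokenize
    is still defined in that session's __main__.
    """
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        cwd=str(project_directory),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr
```

If ok is False, the returned stderr text contains the traceback, including any AttributeError raised while unpickling the model.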