Troubleshooting Explainability

Troubleshooting

AttributeError When Loading Predict Function

While this can be an issue with any model type, it is common to see when using sk-learn objects that take in custom user functions.
We will use TfidfVectorizer as an example, which is a commonly used vectorizer for NLP models, that often utilizes custom user functions.

A TfidfVectorizer accepts a user defined tokenize function, which is used to split a text string into tokens.

Problem

Say this code was used to create your model.

# make_model.py

def tokenize(text):
    # tokenize and lemmatize
    doc = nlp(txt)
    tokens = []
    for token in doc:
        if not token.is_stop and not token.is_punct \
        and not token.is_space and token.lemma_ != '-PRON-':
            tokens.append(token.lemma_)
    return tokens

def make_model():
    # here we pass a custom function to an sklearn object
    vectorizer = TfidfVectorizer(tokenizer=tokenize)
    vectorizer.fit(X_train)
    model = LogisticRegression()
    model.fit(vectorizer.transform(X_train))

    pipeline = make_pipeline(vectorizer, model)
    joblib.dump(pipeline, 'model.pkl')

if __name__ == "__main__":
    make_model()

Now you create this entrypoint file to enable explainability:

# entrypoint.py

model = joblib.load("./model.pkl")

def predict(fv):
    return model.predict_proba(fv)

Now when the SDK imports entrypoint to test the function, the following error gets thrown:

AttributeError: module '__main__' has no attribute 'tokenize'

What happens is that Python failed to serialize the custom function, only the reference to how it was imported. Which in this case, it was just top level in the model creation script (hence __main__.tokenize in the error).
This function doesn't exist in entrypoint, and so the error is thrown.

Solution

To solve, you need to pull out tokenize into its own module, that can be imported from both create_model.py
and also in entrypoint.py.

# model_utils.py

def tokenize(text):
    # tokenize and lemmatize
    doc = nlp(txt)
    tokens = []
    for token in doc:
        if not token.is_stop and not token.is_punct \
        and not token.is_space and token.lemma_ != '-PRON-':
            tokens.append(token.lemma_)
    return tokens

# create_model.py
from model_utils import tokenize

def make_model():
    # here we pass a custom function to an sklearn object
    vectorizer = TfidfVectorizer(tokenizer=tokenize)
    vectorizer.fit(X_train)
    model = LogisticRegression()
    model.fit(vectorizer.transform(X_train))

    pipeline = make_pipeline(vectorizer, model)
    joblib.dump(pipeline, 'model.pkl')

if __name__ == "__main__":
    make_model()

# entrypoint.py
from model_utils import tokenize

model = joblib.load("./model.pkl")

def predict(fv):
    return model.predict_proba(fv)

Now, when Python serializes the model, it stores the reference as model_utils.tokenize, which is also imported within entrypoint.py and therefore no error is thrown.

Now everything will work, but both model_utils.py AND entrypoint.py must be included in the directory passed to enable_explainability().