NLP Onboarding

This page walks through the basics of setting up a natural language processing (NLP) model and onboarding it to
Arthur Scope to monitor language-specific performance.

Getting Started

The first step is to import functions from the arthurai package and establish a connection with Arthur Scope.

# Arthur imports
from arthurai import ArthurAI
from arthurai.common.constants import InputType, OutputType, Stage

arthur = ArthurAI(url="https://app.arthur.ai", 
                  login="<YOUR_USERNAME_OR_EMAIL>")

Registering an NLP Model

Each NLP model is created with a name and with input_type = InputType.NLP. Here, we register a classification model on text specifying a text_delimiter of NOT_WORD:

arthur_nlp_model = arthur.model(name="NLPQuickstart",
                               input_type=InputType.NLP,
                               model_type=OutputType.Multiclass,
                               text_delimiter=TextDelimiter.NOT_WORD)

The different OutputType values currently supported for NLP models are classification, multi-labeling, and regression.

Text Delimiter

NLP models optionally allow specifying a text_delimiter, which specifies how a raw document is split into tokens.

If a text delimiter is not provided, a default text_delimiter will be TextDelimiter.NOT_WORD. This delimiter will ignore punctuation and tokenize text based only on the words present. However, suppose punctuation and non-word text needs to be considered by your model. In that case, you should consider using other options for a delimiter to ensure those other pieces of text are processed by your NLP model.

For a full list of available text delimiters with examples, see the
TextDelimiter constant documentation in our SDK reference.

Additionally, Arthur supports sending the pre-tokenized text. For steps on registering tokens with Arthur, see our generative text walkthrough.

Formatting Reference/Inference Data

Column names can contain only alphanumeric and underscore characters. The rest of the string values can have
additional characters as raw text.

    text_attr                      pred_value   ground_truth    non_input_1   
0   'Here-is some text'            0.1          0               0.2  
1   'saying a whole lot'           0.05         0              -0.3 
2   'of important things!'         0.02         1               0.7     
3   'With all kinds of chars?!'    0.2          0               0.1    ...
4   'But attribute/column names'   0.6          1              -0.6
5   'can only use underscore.'     0.9          1              -0.9
                                ...

Reviewing the Model Schema

Before you register your model with Arthur by calling arthur_model.save(), you can call arthur_model.review() the model schema to check that your data is parsed correctly.

For an NLP model, the model schema should look like this:

     name           stage                 value_type          categorical   is_unique  
0    text_attr      PIPELINE_INPUT        UNSTRUCTURED_TEXT   False         True   
1    pred_value     PREDICTED_VALUE       FLOAT               False         False      ...
2    ground_truth   GROUND_TRUTH          INTEGER             True          False   
3    non_input_1    NON_INPUT_DATA        FLOAT               False         False   
                                        ...

Finishing Onboarding

Once you have finished formatting your reference data and your model schema looks correct using arthur_model.review(), you are finished registering your model and its attributes - so you are ready to complete onboarding your model.

To finish onboarding your NLP model, the following steps apply, which is the same for NLP models as it is for models
of any InputType and OutputType:

Enrichments

For an overview of configuring enrichments for NLP models, see the {doc}/user-guide/walkthroughs/enrichments guide.

For a step-by-step walkthrough of setting up the explainability Enrichment for NLP models, see
{ref}nlp_explainability.