Generative Text
This page discusses the basics of setting up a generative text model and onboarding it to Arthur Scope to monitor generative performance.
Getting Started
The first step is to import functions from the arthurai
package and establish a connection with Arthur.
# Arthur imports
from arthurai import ArthurAI
from arthurai.common.constants import InputType, OutputType, Stage, TextDelimiter, ValueType
arthur = ArthurAI(url="https://app.arthur.ai",
                  login="<YOUR_USERNAME_OR_EMAIL>")
Preparing Data for Arthur
Arthur Scope does not need your model object itself to monitor performance - all you need to do is upload the predictions your model makes. Here's how to format
predictions for common generative text model schemas.
Use the Arthur data type TOKENS
for tokenized input and output texts. Arthur expects a list of strings as below for
tokenized data.
[
{
"input_text": "this is the raw input to my model",
"input_tokens": ["this", "is", "the", "raw", "input", "to", "my", "model"],
"output_text": "this is model generated text",
"output_tokens": ["this", "is", "model", "generated", "text"]
}
]
Use the Arthur data type TOKEN_LIKELIHOODS
for generated outputs of tokens and their likelihoods. Arthur expects this data type to be formatted as an array of maps from token strings to float likelihoods. Each array index should correspond to one token in the generated sequence. If supplying both TOKENS
and TOKEN_LIKELIHOODS
for predicted values, the two arrays must be equal in length.
[
{
"input_text": "this is the raw input to my model",
"input_tokens": ["this", "is", "the", "raw", "input", "to", "my", "model"],
"output_text": "this is model generated text",
"output_tokens": ["this", "is", "model", "generated", "text"],
"output_probs": [
{"this": 0.4, "the": 0.5, "a": 0.1},
{"is": 0.8, "could": 0.1, "may": 0.1},
{"model": 0.33, "human": 0.33, "robot": 0.33},
{"generated": 0.9, "written": 0.03, "dreamt": 0.07},
{"text": 0.7, "rant": 0.2, "story": 0.1}
]
}
]
Arthur supports maps of at most 5 token-to-float pairs.
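As a sanity check before uploading, you can verify these formatting constraints yourself. The sketch below is illustrative and not part of the SDK; `inference` stands in for one record in the payload above.

```python
# Illustrative validation of the formatting rules above (not an SDK function).
inference = {
    "output_tokens": ["this", "is", "model", "generated", "text"],
    "output_probs": [
        {"this": 0.4, "the": 0.5, "a": 0.1},
        {"is": 0.8, "could": 0.1, "may": 0.1},
        {"model": 0.33, "human": 0.33, "robot": 0.33},
        {"generated": 0.9, "written": 0.03, "dreamt": 0.07},
        {"text": 0.7, "rant": 0.2, "story": 0.1},
    ],
}

# TOKENS and TOKEN_LIKELIHOODS arrays must be equal in length.
assert len(inference["output_tokens"]) == len(inference["output_probs"])

# Each likelihood map may hold at most 5 token-to-float pairs.
assert all(len(probs) <= 5 for probs in inference["output_probs"])
```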
The Arthur SDK provides helper functions for mapping OpenAI response objects or log tensor arrays to Arthur format.
See the SDK reference for more guidance on usage.
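For intuition, a mapping of that kind can be sketched as below. This is a hedged illustration, not the SDK helper: `to_arthur_probs` is a hypothetical name, and the input shape assumes an OpenAI-style `top_logprobs` field of per-position token-to-logprob maps.

```python
import math

def to_arthur_probs(top_logprobs):
    """Hypothetical helper: convert per-position token->logprob maps
    (as in an OpenAI-style `logprobs.top_logprobs` field) into the
    token->likelihood maps Arthur expects."""
    return [
        {token: math.exp(logprob) for token, logprob in position.items()}
        for position in top_logprobs
    ]

# Example: two generated positions, two candidate tokens each
top_logprobs = [
    {"this": -0.1, "that": -2.5},
    {"is": -0.05, "was": -3.0},
]
output_probs = to_arthur_probs(top_logprobs)
```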
Registering a Generative Text Model
Each generative text model is created with a name and with output_type = OutputType.TokenSequence. We also need to specify an input type, which in this case will be InputType.NLP for a text-to-text model. Here, we register a token sequence model with NLP input, specifying a text_delimiter of NOT_WORD:
arthur_nlp_model = arthur.model(name="NLPQuickstart",
                                input_type=InputType.NLP,
                                output_type=OutputType.TokenSequence,
                                text_delimiter=TextDelimiter.NOT_WORD)
Arthur uses the text delimiter to tokenize model input texts and generated texts and to track derived insights like sequence length. You can also register your own pre-tokenized values with Arthur for more complex tokenizers; if your model uses a custom tokenizer, registering pre-tokenized values is the recommended approach, as outlined in the sections below.
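For intuition about delimiter-based tokenization: assuming NOT_WORD behaves like splitting on runs of non-word characters (a regex such as \W+ - an assumption here, check the SDK reference for the exact pattern), tokenization looks like:

```python
import re

# Assumption: TextDelimiter.NOT_WORD splits text on runs of non-word characters.
def tokenize_not_word(text):
    return [tok for tok in re.split(r"\W+", text) if tok]

tokenize_not_word("this is model generated text")
# → ['this', 'is', 'model', 'generated', 'text']
```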
Below, we show different ways of building a generative text model, depending on which attributes you want to monitor for your model.
Building a Generative Text Model
To build a generative text model in the Arthur SDK, use the build_token_sequence_model
method on the Arthur Model.
Here we add one attribute for the input text and one attribute for the model output or generated text.
Both of these attributes will have the UNSTRUCTURED_TEXT
value type in the ArthurModel after calling this method - this means that this data is saved as a string in each inference.
You should build your model this way if you will only monitor its input and output text and not monitor any of its token processing or likelihood scores.
arthur_nlp_model.build_token_sequence_model(input_column='input_text',
                                            output_text_column='generated_text')
Registering Pre-tokenized Text
Optionally, token sequence models also support adding token information. In the below example, the tokenized input text is specified in the input_token_column and the final tokens selected for the generated output are specified in the output_token_column.
This method builds a model with four attributes to monitor for your generative text model.
While the text attributes will still have the UNSTRUCTURED_TEXT value type, the token attributes will have the TOKENS value type, which means these attributes are represented as a list of tokens for each inference.
You should build your model this way if you are going to monitor the inferences in their tokenized form as well as in their text form - this may help distinguish performance behaviors due to the base model from performance behaviors due to the tokenization.
arthur_nlp_model.build_token_sequence_model(input_column='input_text',
                                            output_text_column='generated_text',
                                            input_token_column='input_tokens',
                                            output_token_column='output_tokens')
Registering Tokens With Likelihoods
You can attach likelihoods to the generated tokens by specifying the output_likelihood_column
:
arthur_nlp_model.build_token_sequence_model(input_column='input_text',
                                            output_text_column='generated_text',
                                            input_token_column='input_tokens',
                                            output_token_column='output_tokens',
                                            output_likelihood_column='output_probs')
It is not required to specify both an output_token_column and an output_likelihood_column - if only the output_likelihood_column is specified, greedy decoding will be assumed.
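To illustrate what the greedy-decoding assumption means: when only output_probs is supplied, the output token at each position is taken to be the highest-likelihood candidate in that position's map. A minimal sketch:

```python
output_probs = [
    {"this": 0.4, "the": 0.5, "a": 0.1},
    {"is": 0.8, "could": 0.1, "may": 0.1},
]

# Greedy decoding: pick the most likely token at each generated position.
output_tokens = [max(probs, key=probs.get) for probs in output_probs]
# → ['the', 'is']
```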
Registering a Ground Truth Sequence
Lastly, adding a ground truth sequence to the model is optional. Ground truth has the same tokenization support as model input and output texts.
arthur_nlp_model.build_token_sequence_model(input_column='input_text',
                                            output_text_column='generated_text',
                                            ground_truth_text_column='ground_truth_text')
Adding Inference Metadata
We now have a model schema with input, predicted value, and ground truth data defined. Additionally, we can add non-input data attributes to track other information associated with each inference but not necessarily part of the model pipeline.
For generative text models, tracking production signals as performance feedback is often of interest. Here,
we add one continuous attribute and one boolean attribute to measure the success of our model for a use case.
arthur_nlp_model.add_attribute(name='edit_duration', value_type=ValueType.Float, stage=Stage.NonInputData)
arthur_nlp_model.add_attribute(name='accepted_by_user', value_type=ValueType.Boolean, stage=Stage.NonInputData)
Reviewing the Model Schema
Before you register your model with Arthur by calling arthur_model.save(), you can call arthur_model.review() to check that the model schema is correct.
For a TokenSequence model with NLP input, the model schema should look similar to this:
   name         stage            value_type         categorical  is_unique
0  text_attr    PIPELINE_INPUT   UNSTRUCTURED_TEXT  False        True
1  pred_value   PREDICTED_VALUE  UNSTRUCTURED_TEXT  False        False
2  pred_tokens  PREDICTED_VALUE  TOKEN_LIKELIHOODS  False        False
3  non_input_1  NON_INPUT_DATA   FLOAT              False        False
...
Finishing Onboarding
Once you have finished formatting your reference data and your model schema looks correct in arthur_model.review(), you have registered your model and its attributes and are ready to complete onboarding.
To finish onboarding your TokenSequence model, the remaining steps are the same for NLP models as for models of any InputType and OutputType:
Sending Inferences
Since we've already formatted the data, we can use the send_inferences
method of the SDK to upload the inferences to Arthur. This functionality is also available directly through the API.
arthur_nlp_model.send_inferences([
{
"input_text": "this is the raw input to my model",
"input_tokens": ["this", "is", "the", "raw", "input", "to", "my", "model"],
"output_text": "this is model generated text",
"output_tokens": ["this", "is", "model", "generated", "text"],
"output_probs": [
{"this": 0.4, "the": 0.5, "a": 0.1},
{"is": 0.8, "could": 0.1, "may": 0.1},
{"model": 0.33, "human": 0.33, "robot": 0.33},
{"generated": 0.9, "written": 0.03, "dreamt": 0.07},
{"text": 0.7, "rant": 0.2, "story": 0.1}
]
}
])
Enrichments
For an overview of configuring enrichments for NLP models, see the Enrichments guide.
Explainability is not currently supported for TokenSequence models, but anomaly detection will be enabled by default.