Token Sequence (LLM)
Generative text models produce computer-generated text that mimics human language patterns and structure, based on patterns learned from a large training dataset. In Arthur, these models are onboarded under the Token Sequence model type.
Some common examples of generative text models are:
- Headline summarization
- Question answering (chatbots)
Formatted Data in Arthur
Depending on how you built your generative model and what you want to monitor, there are different types of data you can track in the platform. The table below shows example records; a sketch of the same data as a Python dictionary follows the table.
Attribute (User Text Input) | Output (Text Output) | Output Likelihood (Token Likelihood, Optional) | Non-Input Attribute (Numeric or Categorical) |
---|---|---|---|
Dulce est desipere in loco | Acta, non verba | [{"Acta": 0.7, ",": 0.3, "non": 0.8, "verba": 0.34}] | Political Topic |
Si vis amari ama | Castigat ridendo mores | [{"Cast": 0.56, "igat": 0.4, "ridendo": 0.24, "mores": 0.67}] | Entertainment Topic |
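As a concrete illustration, a single inference record matching the table above might be assembled as a Python dictionary before being sent to Arthur. The column names here (`user_input`, `output_text`, `token_likelihoods`, `topic`) are illustrative placeholders, not required names:

```python
# One inference record shaped like the table above.
# Column names are illustrative; use the names you register with Arthur.
inference = {
    "user_input": "Dulce est desipere in loco",  # attribute: user text input
    "output_text": "Acta, non verba",            # output: generated text
    "token_likelihoods": [                        # optional: per-token likelihoods
        {"Acta": 0.7, ",": 0.3, "non": 0.8, "verba": 0.34}
    ],
    "topic": "Political Topic",                   # non-input attribute (categorical)
}
```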
Predict Function and Mapping
To set up an Arthur model environment that calculates default performance metrics, teams need to specify the relationship between their model's input and output columns. Here is an example of what that registration might look like:
```python
# Registering columns
arthur_model.build_token_sequence_model(
    input_column="user_input",
    output_text_column="output_text",
    # optional for model types with token likelihood
    output_likelihood_column="token_likelihoods",
)
```
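Once the columns are registered and the model is saved, inference data can be logged with the same column names. The sketch below is a minimal example of that step; it reuses the hypothetical column names from the table above, and the exact `send_inferences` payload format may vary by SDK version:

```python
# Minimal sketch of logging one inference after the model is onboarded.
# Assumes the model has been saved and the column names match the
# registration call above.
arthur_model.send_inferences([
    {
        "user_input": "Si vis amari ama",
        "output_text": "Castigat ridendo mores",
        "token_likelihoods": [
            {"Cast": 0.56, "igat": 0.4, "ridendo": 0.24, "mores": 0.67}
        ],
        "topic": "Entertainment Topic",
    }
])
```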
Available Metrics
When onboarding Token Sequence models, you have a number of default metrics available to you within the UI. You can learn more about each specific metric in the metrics section of the documentation.
Out-of-the-Box Metrics
The following metrics are automatically available in the UI (out-of-the-box) when teams onboard a Token Sequence model. Find out more about how to use these metrics in the Performance Metrics section. For metric definitions, check out the Glossary.
Metric | Metric Type |
---|---|
Average Token Likelihood | Performance |
Likelihood Stability | Performance |
Average Sequence Length | Performance |
Inference Count | Ingestion |
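To make these metrics concrete, the sketch below shows one plausible way to compute Average Token Likelihood and Average Sequence Length from the token-likelihood format shown earlier. This is an illustration of the idea only, not Arthur's exact implementation:

```python
# Illustrative (not Arthur's exact) computation of two Token Sequence
# metrics from a batch of inferences, using the token_likelihoods format
# shown in the table above.
inferences = [
    [{"Acta": 0.7, ",": 0.3, "non": 0.8, "verba": 0.34}],
    [{"Cast": 0.56, "igat": 0.4, "ridendo": 0.24, "mores": 0.67}],
]

# Flatten each inference's token->likelihood mappings into one list per sequence
sequences = [
    [lik for mapping in inference for lik in mapping.values()]
    for inference in inferences
]

avg_sequence_length = sum(len(seq) for seq in sequences) / len(sequences)
avg_token_likelihood = sum(sum(seq) for seq in sequences) / sum(len(seq) for seq in sequences)

print(avg_sequence_length)   # 4.0
print(avg_token_likelihood)  # mean likelihood across all tokens
```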
Drift Metrics
In the platform, drift metrics are calculated relative to a reference dataset. Once a reference dataset is onboarded for your model, these metrics are available out of the box for comparison. Find out more about these metrics in the Drift and Anomaly section.
Two things to note: for unstructured data types (like text and image), feature drift is calculated for non-input attributes, and generative text models produce text inputs and outputs that can be tracked with multivariate drift.
Metric | Metric Type |
---|---|
PSI | Feature Drift |
KL Divergence | Feature Drift |
JS Divergence | Feature Drift |
Hellinger Distance | Feature Drift |
Hypothesis Test | Feature Drift |
Multivariate Drift for Prompts (Text Input) | Multivariate Drift |
Multivariate Drift for Predictions (Text Output) | Multivariate Drift |
Note: Teams can evaluate drift for inference data over different time intervals with our Python SDK and query service (for example, comparing data coming into the model now against data from a month ago).
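For intuition, a feature-drift metric like PSI from the table above compares the distribution of a feature in current inference data against the reference dataset. The sketch below is a generic, self-contained PSI computation over binned values, not Arthur's internal implementation:

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Generic Population Stability Index between two samples of one feature.

    Illustrative only; Arthur computes drift metrics server-side against
    the onboarded reference dataset.
    """
    # Bin edges are derived from the reference distribution
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    cur_counts, _ = np.histogram(current, bins=edges)

    # Convert counts to proportions, with a small floor to avoid log(0)
    eps = 1e-6
    ref_pct = np.maximum(ref_counts / ref_counts.sum(), eps)
    cur_pct = np.maximum(cur_counts / cur_counts.sum(), eps)

    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Example: compare last month's feature values to the reference window
rng = np.random.default_rng(0)
print(psi(rng.normal(0, 1, 5_000), rng.normal(0.3, 1, 5_000)))
```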
User-Defined Metrics
Whether your team uses a different performance metric, wants to track defined data segments, or needs logical functions to create metrics for external stakeholders (like product or business metrics), you can create your own metrics from your data in Arthur. Learn more in the User-Defined Metrics section.
Available Enrichments
The following enrichments can be enabled for this model type:
Anomaly Detection | Hot Spots | Explainability | Bias Mitigation |
---|---|---|---|
X | | | |