Text input models are machine learning models that operate on text data, such as the natural language found in documents, emails, social media posts, and other forms of written communication. These models learn the patterns and relationships between words and phrases in order to perform tasks such as sentiment analysis, language translation, and text classification. They can be built using a variety of techniques, including neural networks, decision trees, and support vector machines.

Embeddings in Arthur

Text models in Arthur take raw text as input, for example the documents or social media posts that a team's model makes predictions on. A few enrichments (namely Anomaly Detection) use model embeddings, but Arthur computes these embeddings internally. Arthur does not currently accept embeddings or vector inputs.

Tokenization in Arthur

Text inputs are tokenized in Arthur for both anomaly detection and explainability. To create these tokens, Arthur supports several text delimiters.

Text Delimiters

The following delimiters are available in Arthur. Each one is defined as a regular expression.

Name	Defined in Arthur	Description
COMMA	","	Splits on a single comma.
COMMA_PLUS	",+"	Splits on one or more commas.
NOT_WORD	"\W+"	Splits on one or more non-word characters (anything other than letters, digits, and underscores).
PIPE	"\|"	Splits on a single pipe.
PIPE_PLUS	"\|+"	Splits on one or more pipes.
WHITESPACE	"\s+"	Splits on one or more whitespace characters.
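
To make the difference between the delimiters concrete, here is a minimal sketch that applies the patterns from the table using Python's standard `re` module. It is only an illustration of how these regular expressions split text, not Arthur's internal tokenizer: the `DELIMITER_PATTERNS` dictionary and the `tokenize` helper are hypothetical names introduced for this example, and Arthur may post-process tokens differently (for instance, by dropping empty strings).

```python
import re

# Patterns copied from the delimiter table above.
# This dict and the helper below are hypothetical, not part of Arthur's SDK.
DELIMITER_PATTERNS = {
    "COMMA": r",",
    "COMMA_PLUS": r",+",
    "NOT_WORD": r"\W+",
    "PIPE": r"\|",
    "PIPE_PLUS": r"\|+",
    "WHITESPACE": r"\s+",
}


def tokenize(text, delimiter):
    """Split text on the regular expression registered under the given delimiter name."""
    return re.split(DELIMITER_PATTERNS[delimiter], text)


# A single-character delimiter keeps the empty token between repeated separators.
print(tokenize("red,,blue", "COMMA"))       # ['red', '', 'blue']
# The "_PLUS" variant treats a run of separators as one split point.
print(tokenize("red,,blue", "COMMA_PLUS"))  # ['red', 'blue']
print(tokenize("x|y||z", "PIPE"))           # ['x', 'y', '', 'z']
print(tokenize("x|y||z", "PIPE_PLUS"))      # ['x', 'y', 'z']
print(tokenize("hello   world", "WHITESPACE"))  # ['hello', 'world']
# Trailing punctuation leaves an empty final token with a plain re.split.
print(tokenize("Hello, world!", "NOT_WORD"))    # ['Hello', 'world', '']
```

In this sketch, the practical difference between COMMA and COMMA_PLUS (and between PIPE and PIPE_PLUS) is whether consecutive separators produce empty tokens, which matters when the input text contains blank fields.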