Concepts and Terminology

The following definitions are specific to the Arthur platform, though most of them also apply to ML more broadly.

Arthur Inference

Container class for inferences uploaded to the Arthur platform. An inference is composed of input features, prediction values, and (optionally) ground truth values and any Non-Input data.


ground_truth = {
    "Consumer Credit Score": 652.0
}
inference = arthur_model.get_inference(external_id)

Related terms: inference, ArthurModel

Arthur Model

Model object used for sending and retrieving data pertinent to a deployed ML system. The ArthurModel object is separate from the underlying model that is trained and which makes predictions; it serves as a wrapper for the underlying model to access Arthur platform functionality.

An ArthurModel contains at least a name, an InputType and a ModelType.


arthur_model = connection.model(name="New_Model",
arthur_model = connection.get(model_id)


Attribute

A variable associated with a model. It can be an input, a prediction, a ground truth value, or ancillary information (these groupings are known as Stages in the Arthur platform). It can be categorical or continuous. Example:

The attribute age is an input to the model, whereas the attribute creditworthy is the target for the model.

Synonyms: variable, {predictor, input}, {output, target}, prediction.

Related terms: input, stage, prediction, ground truth


Bias

While bias is an overloaded term in statistics and ML, we use it specifically to refer to cases where a model’s outcomes have the potential to differentially harm certain subgroups of a population.


This credit approval model tends to lead to biased outcomes: men are approved for loans at a rate 50% higher than women are.

Related terms: bias detection, bias mitigation, disparate impact

Bias Detection

The detection and quantification of algorithmic bias in an ML system, typically as evaluated on a model’s outputs (predictions) across different populations of a sensitive attribute. Many definitions of algorithmic bias have been proposed, including group fairness and individual fairness definitions.


Common metrics for group fairness include Demographic Parity, Equalized Odds, and Equality of Opportunity.
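As an illustration of a group fairness metric, the sketch below computes a demographic parity difference, i.e. the gap in positive-prediction rates between two groups. The function names and toy data are assumptions made for this example; they are not part of the Arthur SDK.

```python
def selection_rate(predictions):
    """Fraction of predictions equal to the positive class (1)."""
    return sum(predictions) / len(predictions)


def demographic_parity_difference(predictions, groups, group_a, group_b):
    """Selection rate of group_a minus selection rate of group_b."""
    preds_a = [p for p, g in zip(predictions, groups) if g == group_a]
    preds_b = [p for p, g in zip(predictions, groups) if g == group_b]
    return selection_rate(preds_a) - selection_rate(preds_b)


# Hypothetical toy data: one binary prediction and one group label per row.
predictions = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["m", "m", "m", "m", "f", "f", "f", "f"]

gap = demographic_parity_difference(predictions, groups, "m", "f")
```

A gap of zero would mean both groups receive positive predictions at the same rate. Equalized Odds and Equality of Opportunity additionally condition on the ground truth label, which this sketch does not do.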

Related terms: bias mitigation

Bias Mitigation

Automated techniques for mitigating bias in a discriminatory model. They can be characterized by where the technique sits in the model lifecycle:

  • Pre Processing: Techniques that analyze datasets and often modify/resample training datasets so that the learned classifier is less discriminatory.

  • In Processing: Techniques for training a fairness-aware classifier (or regressor) that explicitly trades off optimizing for accuracy against maintaining fairness across sensitive groups.

  • Post Processing: Techniques that only adjust the output predictions from a discriminatory classifier, without modifying the training data or the classifier.
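To make the post-processing category concrete, here is a minimal sketch that leaves a classifier's scores untouched and only changes how they are turned into decisions, using a separate threshold per group. The function name, scores, and thresholds are hypothetical and hand-picked for illustration; real post-processing methods typically choose thresholds by optimizing a fairness criterion, and this is not a description of Arthur's own mitigation technique.

```python
def apply_group_thresholds(scores, groups, thresholds):
    """Turn scores into 0/1 decisions using a per-group decision threshold."""
    return [1 if s >= thresholds[g] else 0 for s, g in zip(scores, groups)]


# Hypothetical model scores and group membership for six inferences.
scores = [0.9, 0.6, 0.4, 0.7, 0.5, 0.3]
groups = ["a", "a", "a", "b", "b", "b"]

# One shared threshold: group "a" is selected twice, group "b" once.
single = apply_group_thresholds(scores, groups, {"a": 0.55, "b": 0.55})

# Lowering the threshold for group "b" equalizes the selection rates.
adjusted = apply_group_thresholds(scores, groups, {"a": 0.55, "b": 0.45})
```

Neither the training data nor the model changes here; only the mapping from score to decision does.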

Related terms: bias detection

Binary Classification

A modeling task where the target variable belongs to a discrete set with two possible outcomes.


This binary classifier will predict whether or not a person is likely to default on their credit card.

Related terms: model-type, classification, multilabel classification

Categorical Attribute

An attribute whose value is taken from a discrete set of possibilities.


A person’s blood type is a categorical attribute: it can only be A, B, AB, or O.

Synonyms: discrete attribute

Related terms: attribute, continuous, classification

Continuous Attribute

An attribute whose value is taken from an ordered continuum, which can be bounded or unbounded.


A person’s height, weight, income, and IQ can all be thought of as continuous attributes.

Synonyms: numeric attribute

Related terms: attribute, categorical, regression


Classification

A modeling task where the target variable belongs to a discrete set with a fixed number of possible outcomes.


This classification model will determine whether an input image is of a cat, a dog, or a fish.

Related terms: model-type, binary classification, multilabel classification

Data Drift

Refers to the problem arising when, after a trained model is deployed, changes in the external world lead to degradation of model performance and the model becoming stale. Detecting data drift will provide a leading indicator about data stability and integrity.

Data drift can be quantified with respect to a specific reference set (e.g. the model’s training data), or more generally over any temporal shifts in a variable with respect to past time windows.

Your project can query data drift metrics through the ArthurAI API. This section provides an overview of the data drift metrics available in ArthurAI’s query service.

Related terms: out of distribution


P and Q

We establish some mathematical housekeeping for the metrics below. Let \(P\) be the reference distribution and \(Q\) be the target distribution. Both are probability distributions that can be approximated by binning the underlying reference and target sets. Generally, \(P\) is an older dataset and \(Q\) is a new dataset of interest. We’d like to quantify how much the distributions differ, to see whether the reference set has gone stale and algorithms trained on it should no longer be used to perform inferences on the target dataset.
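The binning step can be sketched as follows. This is an illustrative helper, not Arthur's implementation: the function name, the bin edges, and the sample values are all assumptions. The key point is that both sets are binned with the same edges so that \(P\) and \(Q\) are comparable bin by bin.

```python
def binned_distribution(values, edges):
    """Approximate a distribution by the fraction of values in each bin.

    Bins are [edges[i], edges[i+1]), except the last bin, which also
    includes its right edge. Values outside the edges are ignored.
    """
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    for v in values:
        for i in range(n_bins):
            in_bin = edges[i] <= v < edges[i + 1]
            on_last_edge = i == n_bins - 1 and v == edges[-1]
            if in_bin or on_last_edge:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical reference and target samples, binned with shared edges.
edges = [0, 25, 50, 75, 100]
reference = [10, 20, 30, 40, 60, 70, 80, 90]
target = [5, 15, 20, 22, 30, 60, 65, 95]

P = binned_distribution(reference, edges)
Q = binned_distribution(target, edges)
```

Here the reference sample is spread evenly across the four bins, while half of the target sample falls in the first bin.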


Entropy

Let \(H(P)\) be the entropy of distribution \(P\). It is interpreted as the expected (i.e. average) number of bits (if log base 2) or nats (if log base \(e\)) required to encode a datapoint drawn from distribution \(P\). ArthurAI applications use log base \(e\), so the interpretation will be in nats.

\[ \begin{align*} H(P) = -\sum_{i=1}^N P(x_i)*\text{log}P(x_i) = -E_P[\text{log}P(x_i)] \end{align*} \]
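A minimal sketch of the entropy formula in plain Python, using the natural log so the result is in nats. The function name is an assumption for this example; bins with zero probability are skipped, following the usual convention that they contribute nothing to the sum.

```python
import math


def entropy(p):
    """Entropy in nats of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


uniform = entropy([0.25, 0.25, 0.25, 0.25])  # maximal for four bins: log(4)
skewed = entropy([0.97, 0.01, 0.01, 0.01])   # close to zero: little uncertainty
```

A uniform distribution over \(N\) bins has the largest possible entropy, \(\log N\), and a distribution concentrated on one bin has entropy zero.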

KL Divergence

Let \(D(P \parallel Q)\) be the Kullback-Leibler (KL) Divergence from \(P\) to \(Q\). It is interpreted as the nats of information we expect to lose in using \(Q\) instead of \(P\) for modeling data \(X\). KL Divergence is not symmetrical, i.e. \(D(P \parallel Q) \neq D(Q \parallel P)\), and should not be used as a distance metric.

\[\begin{split} \begin{align*} D(P||Q) = \sum_{i=1}^N P(x_i)*(\text{log}P(x_i)-\text{log}Q(x_i)) \\ = E_P[\text{log}P(x)-\text{log}Q(x)] \end{align*} \end{split}\]
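The asymmetry is easy to see numerically. The sketch below implements the sum above in plain Python; the function name and the two example distributions are assumptions. It also assumes \(Q(x_i) > 0\) wherever \(P(x_i) > 0\), since otherwise the divergence is infinite (in practice, empty bins are often smoothed with a small constant).

```python
import math


def kl_divergence(p, q):
    """D(P || Q) in nats for two discrete distributions over the same bins."""
    return sum(
        pi * (math.log(pi) - math.log(qi))
        for pi, qi in zip(p, q)
        if pi > 0  # bins with P(x_i) = 0 contribute nothing
    )


P = [0.5, 0.5]
Q = [0.9, 0.1]

forward = kl_divergence(P, Q)   # roughly 0.511 nats
backward = kl_divergence(Q, P)  # roughly 0.368 nats
```

Swapping the arguments changes the result, which is why KL Divergence should not be treated as a distance.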

Population Stability Index (PSI)

Let \(PSI(P,Q)\) be the Population Stability Index (PSI) between \(P\) and \(Q\). It is interpreted as a round-trip loss: the nats of information we expect to lose going from \(P\) to \(Q\), plus the nats we expect to lose returning from \(Q\) to \(P\). Because the return trip is included, PSI is symmetric, unlike KL Divergence. This metric is popular in financial applications.

\[\begin{split} \begin{align*}PSI(P,Q) = D(P||Q) + D(Q||P) \\ = \sum_{i=1}^N (P(x_i)-Q(x_i))*(\text{log}P(x_i)-\text{log}Q(x_i)) \\ = E_P[\text{log}P(x)-\text{log}Q(x)]+E_Q[\text{log}Q(x)-\text{log}P(x)] \end{align*} \end{split}\]
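The following sketch checks numerically that the single-sum form of PSI agrees with the sum of the two KL divergences. Function names and example distributions are assumptions; the code assumes every bin has nonzero probability in both distributions, since the logarithm is otherwise undefined.

```python
import math


def kl_divergence(p, q):
    """D(P || Q) in nats; assumes all probabilities are strictly positive."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))


def psi(p, q):
    """Population Stability Index, computed with the single-sum form."""
    return sum((pi - qi) * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))


P = [0.5, 0.5]
Q = [0.9, 0.1]

direct = psi(P, Q)
round_trip = kl_divergence(P, Q) + kl_divergence(Q, P)
```

Both values come out to roughly 0.879 nats, and `psi(P, Q)` equals `psi(Q, P)`.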

JS Divergence

Let \(JSD(P,Q)\) be the Jensen-Shannon (JS) Divergence between \(P\) and \(Q\). It smooths out KL Divergence using a mixture of the reference and target distributions, and is interpreted as the entropy of the mixture \(M=\frac{P+Q}{2}\) minus the average of the entropies of the individual distributions.

\[\begin{split} \begin{align*}JSD(P,Q) = \frac{1}{2}D(P||M) + \frac{1}{2}D(Q||M) \\ = H(\frac{P+Q}{2})-\frac{H(P)+H(Q)}{2} \end{align*} \end{split}\]
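Both forms of the JS Divergence can be checked against each other with a short sketch. The helper names and example distributions are assumptions. Because the mixture \(M\) is positive wherever \(P\) or \(Q\) is, the computation stays finite even when the two distributions do not overlap.

```python
import math


def entropy(p):
    """Entropy in nats, skipping zero-probability bins."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """D(P || Q) in nats, skipping bins where P is zero."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence via the two KL terms against the mixture M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


P = [0.5, 0.5]
Q = [0.9, 0.1]
M = [(pi + qi) / 2 for pi, qi in zip(P, Q)]

via_kl = js_divergence(P, Q)
via_entropy = entropy(M) - (entropy(P) + entropy(Q)) / 2

# Completely disjoint distributions reach the upper bound of log(2) nats.
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```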

Hellinger Distance

Let \(HE(P,Q)\) be the Hellinger Distance between \(P\) and \(Q\). It is interpreted as the Euclidean norm of the difference of the square root distributions of \(P\) and \(Q\).

\[\begin{split} \begin{align*} HE(P,Q) = {\frac {1}{\sqrt {2}}}{\bigl \|}{\sqrt {P}}-{\sqrt {Q}}{\bigr \|}_{2} \\ = {\frac {1}{\sqrt {2}}}{\sqrt {\sum _{i=1}^{N}\left({\sqrt {P(x_i)}}-{\sqrt {Q(x_i)}}\right)^{2}}} \end{align*} \end{split}\]
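A minimal sketch of the Hellinger Distance in plain Python; the function name and example inputs are assumptions. The \(\frac{1}{\sqrt{2}}\) factor scales the result into the range from 0 (identical distributions) to 1 (no overlap).

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions over the same bins."""
    squared_gaps = sum(
        (math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q)
    )
    return math.sqrt(squared_gaps) / math.sqrt(2)


same = hellinger([0.5, 0.5], [0.5, 0.5])
disjoint = hellinger([1.0, 0.0], [0.0, 1.0])
partial = hellinger([0.5, 0.5], [0.9, 0.1])
```

Unlike KL Divergence, this metric is symmetric and remains well defined when some bins have zero probability.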

Hypothesis Test

Hypothesis testing uses different tests depending on whether a feature is categorical or continuous.

For categorical features, let \(\chi_K^2(P,Q)\) be the chi-squared test statistic for \(P\) and \(Q\), where \(K\) is the number of categories of the feature (the corresponding test has \(K-1\) degrees of freedom). Let \(N_{Pk}\) and \(N_{Qk}\) be the number of occurrences of category \(k\), with \(1\leq k \leq K\), in \(P\) and \(Q\) respectively. The chi-squared test statistic sums the squared differences between the counts in \(Q\) and the counts in \(P\) (which play the role of expected counts), each scaled by the expected count.

\[\begin{split} \begin{align*} \chi_K^2(P,Q) = \sum_{k=1}^K \frac{(N_{Qk}-N_{Pk})^2}{N_{Pk}}\\ \end{align*} \end{split}\]
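The statistic can be computed directly from per-category counts, as in the sketch below. The function name and counts are assumptions. The code follows the formula as written, so it assumes the reference and target counts are on the same scale (equal totals) and that every reference count is nonzero; converting the statistic to a p-value is not shown.

```python
def chi_squared_statistic(counts_p, counts_q):
    """Sum over categories of (N_Qk - N_Pk)^2 / N_Pk."""
    return sum(
        (n_q - n_p) ** 2 / n_p for n_p, n_q in zip(counts_p, counts_q)
    )


# Hypothetical counts for a three-category feature, 100 rows in each set.
reference_counts = [50, 30, 20]
target_counts = [40, 30, 30]

statistic = chi_squared_statistic(reference_counts, target_counts)
```

Here the first category contributes 100 / 50 = 2, the second contributes 0, and the third contributes 100 / 20 = 5, for a statistic of 7.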

For continuous features, let \(KS(P, Q)\) be the Kolmogorov-Smirnov test statistic for \(P\) and \(Q\). Let \(F_P\) and \(F_Q\) be the empirical cumulative distribution functions of \(P\) and \(Q\) respectively. The Kolmogorov-Smirnov test is a nonparametric, i.e. distribution-free, test that compares these two functions; its statistic is the largest absolute gap between them.

\[\begin{split} \begin{align*} KS(P,Q) = \sup_x |F_P(x) - F_Q(x)| \\ \end{align*} \end{split}\]
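A small sketch of the two-sample statistic on raw (unbinned) samples, taking the largest absolute gap between the two empirical cumulative distribution functions, as is standard for the two-sample test. The function name and samples are assumptions. Since both functions are step functions that only change at observed values, it is enough to evaluate the gap at every observed point.

```python
from bisect import bisect_right


def ks_statistic(sample_p, sample_q):
    """Largest absolute gap between the two empirical CDFs."""
    sorted_p = sorted(sample_p)
    sorted_q = sorted(sample_q)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is less than or equal to x.
        return bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(
        abs(ecdf(sorted_p, x) - ecdf(sorted_q, x)) for x in sorted_p + sorted_q
    )


identical = ks_statistic([1, 2, 3], [1, 2, 3])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
```

The statistic ranges from 0 for identical samples to 1 for samples with no overlap; the partially shifted pair above gives 0.5.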

The returned test statistic is then compared to cutoffs for significance. A higher test statistic indicates more data drift. We’ve abstracted the calculations away for you in our query endpoint.

For HypothesisTest, the returned value is transformed as -log_10(P_value) to maintain directional parity with the other data drift metrics. That is, lower P_value is more significant and implies data drift, reflected in a higher -log_10(P_value).
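The transformation itself is a one-liner, sketched below with a hypothetical function name. It only illustrates the direction of the mapping and does not reproduce how the query endpoint computes the underlying p-value.

```python
import math


def drift_score_from_p_value(p_value):
    """Map a p-value to -log10(p) so that larger values mean stronger drift."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in the interval (0, 1]")
    return -math.log10(p_value)


weak_evidence = drift_score_from_p_value(0.5)      # about 0.3
strong_evidence = drift_score_from_p_value(0.001)  # about 3
```

Each additional unit in the transformed value corresponds to a p-value that is ten times smaller.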


Arthur also offers a multivariate Anomaly Score through the Anomaly Detection Enrichment. These scores are computed by training a model on the reference set you provide to Arthur, and using that model to assign an Anomaly Score to each inference you send to Arthur. Scores of 0.5 are given to “typical” examples from your reference set, while higher scores are given to more anomalous inferences and lower scores are given to instances that the model judges as similar to the reference data with high confidence.

Disparate Impact

Legal terminology originally from Fair Lending case law. This constraint is strictly harder to satisfy than Disparate Treatment and asserts that model outcomes must not be discriminatory across protected groups. That is, the outcome of a decisioning process should not be substantially higher (or lower) for one group of a protected class over another.

While there does not exist a single threshold for establishing the presence or absence of disparate impact, the so-called “80% rule” is commonly referenced. However, we strongly recommend against adopting this rule-of-thumb, as these analyses should be grounded in use-case specific analysis and the legal framework pertinent to a given industry.


Even though the model didn’t take gender as input, it still results in disparate impact when we compare outcomes for males and females.
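Comparing outcomes across groups usually starts with the ratio of favorable-outcome rates, which is the quantity the “80% rule” refers to. The sketch below computes that ratio for hypothetical counts; the function name and numbers are assumptions, and no threshold is applied, since how to read the ratio depends on the use case and the applicable legal framework.

```python
def selection_rate_ratio(favorable_a, total_a, favorable_b, total_b):
    """Favorable-outcome rate of group A divided by that of group B."""
    rate_a = favorable_a / total_a
    rate_b = favorable_b / total_b
    return rate_a / rate_b


# Hypothetical approvals: 45 of 100 applicants in group A, 60 of 100 in group B.
ratio = selection_rate_ratio(45, 100, 60, 100)
```

A ratio of 1 means both groups receive the favorable outcome at the same rate; here the ratio is about 0.75.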

Related terms: bias, disparate treatment

Disparate Treatment

Legal terminology originally from Fair Lending case law. Disparate Treatment asserts that you are not allowed to consider protected variables (e.g. race, age, gender) when approving or denying an applicant for a credit card loan. In practical terms, this means that a data scientist cannot include these attributes as inputs to a credit decisioning model.

Adherence to Disparate Treatment is not a sufficient condition for actually achieving a fair model (see proxy and bias detection). “Fairness through unawareness” is not good enough.

Related terms: bias, disparate impact


Enrichment

Generally used to describe data or metrics added to raw data after ingestion. Arthur provides various enrichments such as Anomaly Detection and Explainability. See Enrichments for details on using enrichments within Arthur.


Feature

An individual attribute that is an input to a model.


The credit scoring model has features like “home_value”, “zip_code”, “height”.

Ground Truth

The true label or target-variable (Y) corresponding to inputs (X) for a dataset.


pred = sklearn_model.predict_proba(X)
  predicted_values={1:pred, 0: 1-pred})

Related terms: prediction

Image Data

Imagery data commonly used for computer vision models.

Related terms: attribute, model type, Stage


Inference

One row of a dataset. An inference refers to passing a single input into a model and computing the model’s prediction. Data associated with that inference might include (1) input data, (2) the model’s prediction, and (3) the corresponding ground truth. With respect to the Arthur platform, the term inference denotes any and all of those related components of data for a single input and prediction.

Related terms: ArthurInference, stage


Input

A single instance of data, upon which a model can calculate an output prediction. The input consists of all relevant features together.


The input features for the credit scoring model consist of “home_value”, “zip_code”, “height”.

Related terms: feature, model

Input Type

For an ArthurModel, this field declares what kind of input datatype will be flowing into the system.

Allowable values are defined in the InputType enum:


arthur_model = connection.model(name="New_Model",

Related terms: model type, tabular data, nlp data

Model Type

For an ArthurModel, this field declares what kind of output predictions will be flowing out of the system.

Allowable values are defined in the ModelType enum:

  • Regression

    • appropriate for continuous-valued targets

  • Multiclass

    • appropriate for both binary classifiers and multiclass classifiers

  • Multilabel

    • appropriate for multilabel classifiers


arthur_model = connection.model(name="New_Model",

Related terms: input type

Multilabel Classification

A modeling task where each input is associated with two or more labels, from a fixed set of possible labels.


This computer vision model can detect common road signs seen on US highways. The model is trained on example images, each of which may contain any of 250 different road signs.

Related terms: model-type, multiclass classification

NLP Data

Unstructured text sequences commonly used for Natural Language Processing models.

Related terms: attribute, model type, Stage

Out of Distribution Detection

Refers to the challenge of detecting when an input (or set of inputs) is substantially different from the distribution of a larger set of reference inferences. This term commonly arises in the context of data drift, where we want to detect if new inputs are different from the training data (and distribution thereof) for a particular model. OOD Detection is a relevant challenge for Tabular data as well as unstructured data such as images and sequences.

Related terms: data drift


Prediction

The output prediction (y_hat) of a trained model for any input.


pred = sklearn_model.predict_proba(X)
  predicted_values={1:pred, 0: 1-pred})

Related terms: ground truth

Protected Attribute

An attribute of an inference that is considered sensitive with respect to model bias. Common examples include race, age, and gender. The term “protected” comes from the Civil Rights Act of 1964.

Synonyms: sensitive attribute

Related terms: bias, proxy


Proxy

An input attribute in a model (or combination thereof) that is highly correlated with a protected attribute such as race, age, or gender. The presence of proxies in a dataset makes it difficult to rely only on Disparate Treatment as a standard for fair ML.


In most US cities, zip code is a strong proxy for race. Therefore, one must be cautious when using zip code as an input to a model.

Related terms: bias, disparate impact, disparate treatment


Regression

A modeling task (or model) where the target variable is a continuous variable.


This regression model predicts what the stock price of $AAPL will be tomorrow.

Related terms: model type


Stage

Taxonomy used by the Arthur platform to delineate how attributes contribute to the model computations. Allowable values are defined in the Stage enum:

  • ModelPipelineInput : Input to the entire model pipeline. This will most commonly be the Stage used to represent all model inputs. Will contain base input features that are familiar to the data scientist: categorical and continuous columns of a tabular dataset.

  • PredictFunctionInput: Potential alternative input source, representing direct input into model’s predict() method. Therefore, data here will have already undergone all relevant transformations including scaling, one-hot-encoding, or embedding.

  • PredictedValue: The predictions coming out of the model.

  • GroundTruth: The ground truth (or target) attribute for a model.

  • NonInput: Ancillary data that can be associated with each inference but is not necessarily a direct input to the model. For example, sensitive attributes like age, sex, or race might not be direct model inputs, but will be useful to associate with each prediction.
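The difference between the two input stages can be illustrated with a toy preprocessing step. The sketch below is not Arthur SDK code: the helper names, the feature names, and the scaling range are assumptions. It shows one raw row, as it would enter the model pipeline, next to the transformed vector that would be passed to the model's predict() method.

```python
def one_hot(value, categories):
    """Encode a categorical value as a list with a single 1.0 entry."""
    return [1.0 if value == c else 0.0 for c in categories]


def min_max_scale(value, low, high):
    """Rescale a continuous value from [low, high] to [0, 1]."""
    return (value - low) / (high - low)


# Pipeline-level input: raw columns familiar to the data scientist.
raw_row = {"age": 40, "country": "CA"}

# Predict-function-level input: the same row after scaling and encoding.
encoded_row = [min_max_scale(raw_row["age"], 20, 60)] + one_hot(
    raw_row["country"], ["US", "CA", "MX"]
)
```

The raw row keeps its human-readable columns, while the encoded row is the purely numeric vector `[0.5, 0.0, 1.0, 0.0]`.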

Tabular Data

Data type for model inputs where the data can be thought of as a table (or spreadsheet) composed of rows and columns. Each column represents an input attribute for the model and each row represents a separate record that composes the training data. In supervised learning, exactly one of the columns acts as the target.


This credit scoring model is trained on tabular data. The input attributes are income, country, and age and the target is FICO score.

Related terms: attribute, model type, Stage

Sensitive Attribute

See protected attribute