Supported Connectors

The Arthur platform supports the following connectors:

Google Cloud:

  1. BigQuery
  2. Google Cloud Storage Bucket (GCS)

AWS:

  1. S3 Bucket

Arthur:

  1. Arthur Shield Instance

Connector Details

This section describes how to configure each connector and the permissions it requires in the external system. It also documents each connector's dataset locator, which specifies how a dataset is identified within that connector.

BigQuery

The BigQuery connector allows users to monitor models whose data resides in BigQuery datasets.

Permissions

The connector requires the following roles in the GCP project:

  1. BigQuery Data Viewer - allows the connector to list and read the project's datasets and tables
  2. BigQuery Job User - allows the connector to run the query jobs that read data from those tables

Configuration

The connector supports the following configuration:

  1. Project ID (required) - the ID of the GCP project
  2. Credentials (optional) - users can upload a JSON GCP service account key for the connector to use.
    If none is provided, the connector looks up credentials from its runtime environment (see the sketch after this list).
  3. Location (optional) - the GCP location to use when communicating with the BigQuery API
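To make the options concrete, here is a minimal sketch of how they map onto a BigQuery client, assuming the google-cloud-bigquery and google-auth libraries; the project ID, key file path, and location are placeholders, and this is an illustration rather than the connector's actual implementation.

import bigquery and service-account helpers
from google.cloud import bigquery
from google.oauth2 import service_account

PROJECT_ID = "my-gcp-project"              # Project ID (required)
CREDENTIALS_FILE = "service-account.json"  # Credentials (optional) - an uploaded key file
LOCATION = "US"                            # Location (optional)

# With an uploaded service account key, the client authenticates with it explicitly.
credentials = service_account.Credentials.from_service_account_file(CREDENTIALS_FILE)
client = bigquery.Client(project=PROJECT_ID, credentials=credentials, location=LOCATION)

# Without a key, Application Default Credentials from the runtime environment are used.
default_client = bigquery.Client(project=PROJECT_ID, location=LOCATION)

# Listing datasets exercises the BigQuery Data Viewer role; running a query job
# exercises the BigQuery Job User role.
print([d.dataset_id for d in client.list_datasets()])
print(list(client.query("SELECT 1 AS ok").result()))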

Dataset Locator

  1. Dataset ID (required) - ID of the dataset in BigQuery. Do not include the project ID as that is already set in the
    connector configuration.
  2. Table Name (required) - Name of the table in the BigQuery dataset (see the example after this list).
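For example, the connector's project ID combined with the locator's dataset ID and table name fully qualifies a table. Below is a small sketch using the google-cloud-bigquery client; the identifiers are placeholders.

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # project ID comes from the connector configuration

DATASET_ID = "model_inferences"  # Dataset ID from the locator (no project prefix)
TABLE_NAME = "credit_risk_v2"    # Table Name from the locator

# A fully qualified reference has the form project.dataset.table.
table = client.get_table(f"{client.project}.{DATASET_ID}.{TABLE_NAME}")
print(f"{table.full_table_id} has {table.num_rows} rows")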

Google Cloud Storage Bucket (GCS)

The GCS connector allows users to monitor models whose data resides in GCS buckets. Today it supports both Parquet and JSON file formats.

Permissions

The connector requires the following role on the bucket:

  1. Storage Object Viewer - allows the connector to list and read the objects in the bucket

Configuration

The connector supports the following configuration:

  1. Project ID (required) - the ID of the GCP project
  2. Bucket (required) - the name of the GCS bucket
  3. Credentials (optional) - users can upload a JSON GCP service account key for the connector to use.
    If none is provided, the connector looks up credentials from its runtime environment (see the sketch after this list).
  4. Location (optional) - the GCP location to use when communicating with the GCS API
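As with BigQuery, here is a short sketch of how the configuration maps onto a client, assuming the google-cloud-storage library; names and paths are placeholders, and the optional Location setting is omitted.

from google.cloud import storage
from google.oauth2 import service_account

PROJECT_ID = "my-gcp-project"              # Project ID (required)
BUCKET = "my-model-data"                   # Bucket (required)
CREDENTIALS_FILE = "service-account.json"  # Credentials (optional) - an uploaded key file

# With an uploaded key the client authenticates explicitly; otherwise Application
# Default Credentials from the runtime environment are used.
credentials = service_account.Credentials.from_service_account_file(CREDENTIALS_FILE)
client = storage.Client(project=PROJECT_ID, credentials=credentials)

# Listing objects requires the Storage Object Viewer role on the bucket.
print([blob.name for blob in client.list_blobs(BUCKET, max_results=5)])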

Dataset Locator

  1. File Prefix (required) - The prefix of the dataset's files in the bucket. This prefix should
    include strftime placeholders for year, month, day, and optionally, hour. The Arthur platform uses this format string to efficiently look up data for specific time ranges when calculating metrics (see the expansion sketch after this list). An example value might be: /year=%Y/month=%m/day=%d/hour=%H/. Note: do not include the bucket name in the prefix, as it is already set in the connector configuration.
  2. File Type (required) - One of json or parquet. Specifies the format of the data files in the dataset.
  3. File Suffix (optional) - If there are multiple kinds of files under the prefix, this option can be used to select
    only the files whose names match a suffix regex. For example, to limit results to files ending in .json, set this value to .*\.json. The regex syntax for this option follows Python's re library. If not set, no filtering is applied.
  4. Timestamp time zone (optional) - The time zone to use when populating the file prefix time placeholders.
    Defaults to UTC.
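To illustrate how the locator fields are used, the sketch below expands the strftime placeholders for a single hour, lists matching objects, and applies the suffix regex. It assumes the google-cloud-storage library and placeholder bucket, prefix, and suffix values; the platform's internal lookup may differ.

import re
from datetime import datetime, timezone
from google.cloud import storage

FILE_PREFIX = "inferences/year=%Y/month=%m/day=%d/hour=%H/"  # File Prefix with strftime placeholders
FILE_SUFFIX = r".*\.json"                                    # File Suffix (optional regex)

# Expand the prefix for the hour of data being looked up (UTC by default).
hour = datetime(2024, 5, 1, 13, tzinfo=timezone.utc)
prefix = hour.strftime(FILE_PREFIX)  # "inferences/year=2024/month=05/day=01/hour=13/"

client = storage.Client(project="my-gcp-project")
blobs = client.list_blobs("my-model-data", prefix=prefix)

# Keep only the objects whose names match the suffix regex (Python re syntax).
print([b.name for b in blobs if re.fullmatch(FILE_SUFFIX, b.name)])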

S3 Bucket

The S3 connector allows users to monitor models whose data resides in S3 buckets. Today it supports both Parquet and JSON file formats.

Permissions

The connector requires the following permissions policy on the bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SID",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket>/*",
        "arn:aws:s3:::<bucket>"
      ]
    }
  ]
}

Configuration

The connector supports the following configuration:

  1. Bucket (required) - the name of the S3 bucket
  2. Access Key ID (optional) - the AWS Access Key ID. Only needed if using access key authentication with AWS. If this
    is not set, the connector will attempt to load credentials from its runtime environment.
  3. Secret Access Key (optional) - the AWS Secret Access Key. Only needed if using access key authentication with AWS.
    If this is not set, the connector will attempt to load credentials from its runtime environment.
  4. Role ARN (optional) - set if accessing the bucket requires assuming a role. If this is not set, the connector will
    use access keys, or attempt to load credentials from its runtime environment (see the sketch after this list).
  5. External ID (optional) - if using the assume role option, it is recommended to set an External ID in the role's
    trust policy to prevent the confused deputy problem.
  6. Role Duration Seconds (optional) - if using the assume role option, this specifies how long the assumed-role session is valid.
    It defaults to 3600 seconds (one hour), but some role policies require a smaller value.
  7. AWS Region (optional) - the AWS region where the bucket resides
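Below is a sketch of how these settings map onto an AWS session for the assume-role path, using boto3; the ARN, external ID, region, and bucket names are placeholders. If no role ARN is configured, the connector instead uses the access keys, or falls back to credentials from its runtime environment.

import boto3

ROLE_ARN = "arn:aws:iam::123456789012:role/arthur-connector-role"  # Role ARN (optional)
EXTERNAL_ID = "my-external-id"                                     # External ID (optional)
DURATION_SECONDS = 3600                                            # Role Duration Seconds (optional)
REGION = "us-east-1"                                               # AWS Region (optional)
BUCKET = "my-model-data"                                           # Bucket (required)

# Assume the role, passing the external ID so the role's trust policy can verify the caller.
sts = boto3.client("sts", region_name=REGION)
creds = sts.assume_role(
    RoleArn=ROLE_ARN,
    RoleSessionName="arthur-connector",
    ExternalId=EXTERNAL_ID,
    DurationSeconds=DURATION_SECONDS,
)["Credentials"]

# Use the temporary credentials to read from the bucket.
s3 = boto3.client(
    "s3",
    region_name=REGION,
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(s3.list_objects_v2(Bucket=BUCKET, MaxKeys=5).get("KeyCount"))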

Dataset Locator

  1. File Prefix (required) - The prefix of the dataset's files in the bucket. This prefix should
    include strftime placeholders for year, month, day, and optionally, hour. The Arthur platform uses this format string to efficiently look up data for specific time ranges when calculating metrics (see the listing sketch after this list). An example value might be: /year=%Y/month=%m/day=%d/hour=%H/. Note: do not include the bucket name in the prefix, as it is already set in the connector configuration.
  2. File Type (required) - One of json or parquet. Specifies the format of the data files in the dataset.
  3. File Suffix (optional) - If there are multiple kinds of files under the prefix, this option can be used to select
    only the files whose names match a suffix regex. For example, to limit results to files ending in .json, set this value to .*\.json. The regex syntax for this option follows Python's re library. If not set, no filtering is applied.
  4. Timestamp time zone (optional) - The time zone to use when populating the file prefix time placeholders.
    Defaults to UTC.
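The locator behaves the same way as the GCS one. The sketch below expands the placeholders for a single day and filters the listed keys by the suffix regex, assuming boto3 and placeholder names.

import re
from datetime import datetime, timezone

import boto3

FILE_PREFIX = "inferences/year=%Y/month=%m/day=%d/"  # File Prefix with strftime placeholders
FILE_SUFFIX = r".*\.parquet"                         # File Suffix (optional regex)

# Expand the prefix for the day of data being looked up (UTC by default).
day = datetime(2024, 5, 1, tzinfo=timezone.utc)
prefix = day.strftime(FILE_PREFIX)  # "inferences/year=2024/month=05/day=01/"

s3 = boto3.client("s3", region_name="us-east-1")
paginator = s3.get_paginator("list_objects_v2")
keys = [
    obj["Key"]
    for page in paginator.paginate(Bucket="my-model-data", Prefix=prefix)
    for obj in page.get("Contents", [])
    if re.fullmatch(FILE_SUFFIX, obj["Key"])
]
print(keys)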

Arthur Shield Instance

The Arthur Shield connector allows users to monitor Generative AI models as tasks in an Arthur Shield instance.

Permissions

The connector requires a Shield API key with the following role:

  1. ORG-AUDITOR - allows the connector to list tasks and read task inferences

Configuration

The connector supports the following configuration:

  1. Endpoint (required) - the base URL of the Arthur Shield instance, e.g. https://shield.arthur.ai
  2. API Key (required) - a Shield API key with at least the ORG-AUDITOR role

Dataset Locator

  1. Task ID (required) - the UUID of the task to be monitored in Shield