January 2026 Release Notes
by Pranav Shikarpur

Whether you're shipping your first agent or scaling an entire AI ecosystem, this release gives you even more tools to go from prototype to production — with confidence and control.
- [New] Agent Development Toolkit: An end-to-end toolkit for building, debugging, evaluating, and shipping AI agents—designed to move seamlessly from prototype to production.
- Getting Started & Observability
  - Configure your model providers with full control over sourcing and access
  - Create and manage tasks that mirror real-world agent behavior
  - Capture OpenTelemetry-based traces across agent runs (see the sketch after this list)
  - Inspect executions in the Trace Viewer, including step-by-step agent actions
  - Search and filter traces to quickly identify errors, failures, and regressions
  - View sessions and chat threads, with deep linking from external applications
  - Track token usage and cost by agent, user, session, or conversation
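
Because traces are OpenTelemetry-based, any OTLP-capable instrumentation can emit them. The sketch below is illustrative only: the collector endpoint, tracer name, and attribute keys (user and session IDs, token counts) are placeholder assumptions, not Arthur-specific conventions, so check the platform docs for the expected endpoint and attribute names.

```python
# Minimal sketch: emit OpenTelemetry traces from an agent run.
# The endpoint, tracer name, and attribute keys below are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://collector.example.com/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-agent")

with tracer.start_as_current_span("agent.run") as span:
    # Attributes like these are what make per-user, per-session, and
    # per-conversation filtering and segmentation possible downstream.
    span.set_attribute("user.id", "user-123")
    span.set_attribute("session.id", "sess-456")
    with tracer.start_as_current_span("llm.call") as llm_span:
        # ... call your model provider here, then record usage from its response ...
        llm_span.set_attribute("llm.usage.prompt_tokens", 812)
        llm_span.set_attribute("llm.usage.completion_tokens", 97)
```
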
- Advanced Agent & RAG Workflows
  - Configure connections to Weaviate vector stores (see the sketch after this list)
  - Run RAG notebooks and RAG experiments with supervised evals
  - Execute end-to-end agent experiments and notebooks with evaluation built in
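
Connections to Weaviate are configured in-platform; for orientation, this is roughly what the equivalent looks like with the Weaviate v4 Python client. The cluster URL, API key, collection name, and query are placeholders, and `near_text` assumes the collection was created with a vectorizer.

```python
import weaviate
from weaviate.classes.init import Auth

# Placeholder cluster URL and API key.
client = weaviate.connect_to_weaviate_cloud(
    cluster_url="https://my-cluster.weaviate.network",
    auth_credentials=Auth.api_key("YOUR_WEAVIATE_API_KEY"),
)
try:
    docs = client.collections.get("Documents")
    # Fetch the three chunks most similar to the query as RAG context
    # (requires a vectorizer on the collection).
    results = docs.query.near_text(query="refund policy", limit=3)
    for obj in results.objects:
        print(obj.properties)
finally:
    client.close()
```
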
- Prompt-Centric Workflows
  - Manage prompts with versioning, tagging, promotion, and audit history, with full traceability
  - Quickly test ideas with the prompt playground
  - Run structured comparisons with Prompt Experiments for regression testing and bulk assessment
  - Iterate collaboratively with interactive prompt notebooks
  - Promote prompts into production with a single step
  - Run completions through the Arthur Engine using streaming and batch APIs (see the sketch after this list)
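
To make the streaming bullet concrete, the sketch below assumes an OpenAI-compatible chat-completions surface, a common pattern for engine gateways; the base URL, API key, and model name are placeholders, and the Arthur Engine's actual endpoints and SDK are documented separately.

```python
from openai import OpenAI

# Placeholder base URL, key, and model; any OpenAI-compatible gateway
# that supports streaming chat completions accepts the same calls.
client = OpenAI(
    base_url="https://engine.example.com/v1",
    api_key="YOUR_API_KEY",
)

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
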
- Unified Evaluation for Online + Offline
  - Run online evals continuously on live traces in production
  - Upload datasets for offline evaluation before deployment
  - Add and manage datasets directly in-platform
  - Collect traces directly into datasets for test case generation
  - Create and manage custom evaluators for supervised and automated testing (see the sketch after this list)
  - Provide human feedback on traces to enrich evaluation signals
  - Explore eval results seamlessly in the Trace Viewer and dashboards
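
A custom evaluator is, at its core, logic that scores an output against some expectation. The sketch below is a plain-Python illustration of that shape (a crude token-overlap groundedness check), not Arthur's evaluator interface; the names, fields, and threshold are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    score: float   # 0.0 to 1.0
    passed: bool
    reason: str

def grounded_in_context(answer: str, context: str, threshold: float = 0.5) -> EvalResult:
    """Crude groundedness check: share of answer tokens that appear in the context."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return EvalResult(0.0, False, "empty answer")
    overlap = len(answer_tokens & context_tokens) / len(answer_tokens)
    return EvalResult(overlap, overlap >= threshold, f"token overlap = {overlap:.2f}")

print(grounded_in_context(
    answer="Refunds are processed within 14 days.",
    context="Our policy: refunds are processed within 14 days of purchase.",
))
```
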
- Arthur x Google Cloud
  - Arthur's ADG platform is now live on the Google Cloud Marketplace, making it easier than ever to discover, govern, monitor, and evaluate AI systems — all within your existing GCP environment.
- Arthur Engine OSS Enhancements
  - Model Source Control: Configure GenAI models to be pulled from secure, customer-managed repositories instead of public sources like Hugging Face (see the first sketch after this list).
  - Advanced Metric Segmentation: Segment metrics by user ID, conversation ID, and more for deeper analysis.
  - Improved ODBC Connector Support: Better database view handling, more reliable primary key detection, and configurable connection/login timeouts (see the second sketch after this list).
  - Bootstrapping Reliability: Improved performance and resilience for GenAI model setup and execution.
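
For a sense of what model source control enables, the sketch below shows one general technique: pointing the huggingface_hub client at an internal mirror via the HF_ENDPOINT environment variable. The mirror URL, repo ID, and token variable are placeholders, and Arthur's own configuration mechanism is separate from this illustration.

```python
import os

# Must be set before huggingface_hub is imported, since the client reads
# HF_ENDPOINT at import time. URL and repo ID are placeholders.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.internal.example.com"

from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="acme/guardrail-classifier",
    token=os.environ.get("INTERNAL_HUB_TOKEN"),  # hypothetical credential
)
print("model files cached at:", local_dir)
```
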
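On the ODBC timeouts, a generic pyodbc session distinguishes the two settings: the `timeout` keyword to `connect` bounds the login attempt, while the connection's `timeout` attribute bounds individual queries. The connection string and query below are placeholders.

```python
import pyodbc

# Placeholder DSN details.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=db.internal.example.com;"
    "DATABASE=analytics;UID=reader;PWD=...",
    timeout=15,      # login timeout, in seconds
)
conn.timeout = 30    # per-query timeout, in seconds

cursor = conn.cursor()
cursor.execute("SELECT TOP 5 * FROM usage_events")
for row in cursor.fetchall():
    print(row)
conn.close()
```
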


