
Monitoring Best Practices

The Arthur Scope product is used to monitor machine learning models. It runs on Kubernetes and can scale on demand. Several components should be monitored to keep the platform healthy.

Some recommended best practices for monitoring the various Scope components are as follows:

Kubernetes

  • Pods
    • CPU/Memory utilization
      • Pods are the smallest building blocks in Kubernetes. Always ensure pods have sufficient resources (CPU and memory) available to run.
    • Number of Restarts
      • Pods restarting frequently is a sign of buggy code or bad configuration.
    • Pods in Pending/Unknown/Unavailable/CrashLoopBackOff state
      • Pods not in a Ready state are a sign of hardware degradation or connectivity failures to external systems; a sketch for surfacing these pods follows this list.
  • Persistent Volumes
    • IOPS
      • Ensure the storage backing Persistent Volumes has enough throughput provisioned and is not being throttled.
    • Available Disk Space
      • Ensure attached Persistent Volumes have enough disk space.
    • VolumeAttachment Errors
      • Ensure there are no VolumeAttachment errors observed in Persistent Volumes. This is particularly critical in multi-AZ deployments.
  • Nodes
    • Sufficient nodes in each AZ
      • Ensure the required number of nodes per AZ is available for each deployment.
    • Max nodes per cluster
      • Monitoring the total number of nodes a cluster has scaled to helps keep performance and costs optimal.
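
As a concrete starting point for the pod checks above, here is a minimal sketch using the official Kubernetes Python client to surface pods that are not in a healthy phase or that have restarted frequently. The kubeconfig handling and restart threshold are illustrative assumptions, not part of the Scope platform.

```python
# Minimal sketch: flag unhealthy or frequently restarting pods.
# Assumes the official `kubernetes` Python client and a reachable cluster;
# the restart threshold is an illustrative choice.
from kubernetes import client, config

RESTART_THRESHOLD = 5  # illustrative; tune for your environment

def find_unhealthy_pods():
    config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    problems = []
    for pod in v1.list_pod_for_all_namespaces(watch=False).items:
        phase = pod.status.phase  # Pending / Running / Succeeded / Failed / Unknown
        restarts = sum(
            cs.restart_count for cs in (pod.status.container_statuses or [])
        )
        if phase not in ("Running", "Succeeded") or restarts > RESTART_THRESHOLD:
            problems.append((pod.metadata.namespace, pod.metadata.name, phase, restarts))
    return problems

if __name__ == "__main__":
    for ns, name, phase, restarts in find_unhealthy_pods():
        print(f"{ns}/{name}: phase={phase}, restarts={restarts}")
```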

Datastores

  • Meta Database (External)
    • Disk Space
      • Ensure there is enough disk space for the database.
    • IOPS
      • Monitor for any throttling of the database disk and adjust provisioned IOPS accordingly.
    • CPU
      • Monitor for any throttling of the database CPU and adjust it accordingly.
  • OLAP Database
    • Replication Lag
      • The OLAP database is usually deployed in a three-node setup whose nodes are kept in sync via replication. Lag occurs when data is not yet consistent across all nodes; a sketch for checking replica lag follows this list.
    • Delayed/Rejected Inserts
      • This usually happens when a large number of INSERTS are sent too quickly. This can lead to data loss or corruption.
    • ZooKeeper Exceptions
      • These should generally not happen; when they do, they are sometimes an indication of bad hardware.
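
The documentation does not name the OLAP engine, but replication lag, delayed inserts, and ZooKeeper exceptions are all surfaced by ClickHouse-style system tables. Assuming a ClickHouse-compatible database reachable with the clickhouse-driver package (an assumption, as are the host and thresholds), a minimal lag check could look like this:

```python
# Hedged sketch: report lagging replicas on a ClickHouse-compatible OLAP database.
# The host, credentials, and thresholds are placeholders; the system.replicas
# columns used here (absolute_delay, queue_size) are standard in ClickHouse.
from clickhouse_driver import Client

MAX_DELAY_SECONDS = 30   # illustrative threshold
MAX_QUEUE_SIZE = 100     # illustrative threshold

def check_replication_lag(host="olap.example.internal"):
    ch = Client(host=host)  # add user/password/secure=True as needed
    rows = ch.execute(
        """
        SELECT database, table, absolute_delay, queue_size
        FROM system.replicas
        WHERE absolute_delay > %(max_delay)s OR queue_size > %(max_queue)s
        """,
        {"max_delay": MAX_DELAY_SECONDS, "max_queue": MAX_QUEUE_SIZE},
    )
    for database, table, delay, queue in rows:
        print(f"{database}.{table}: delay={delay}s, queue_size={queue}")
    return rows
```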

Messaging Middleware

  • Kafka
    • Consumer Lag
      • Producers write data to the messaging middleware and consumers read data from it. If consumers cannot keep up with the producers, lag builds up, which can mean poor performance for the platform; a sketch for measuring consumer lag follows this list.
    • Under Replicated partitions
      • Follower replicas get data from Leader replicas using replication. Due to resource exhaustion or Leader failure, it is possible the Follower replicas don’t keep up with the Leader replicas.
  • Kafka Connect
    • Connector failures
      • These failures mean data is not being written to data stores, which can lead to data loss.
    • Task failures
      • These failures mean data does not match the connector configuration, which can lead to data loss or corruption.
  • ZooKeeper
    • Outstanding Requests
      • This is the number of requests waiting to be processed by ZooKeeper.
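
To make the consumer lag check concrete, here is a minimal sketch using the kafka-python client that compares each partition's latest offset with the consumer group's committed offset. The broker addresses, group id, and topic are placeholders.

```python
# Minimal sketch: compute per-partition consumer lag for one consumer group.
# Assumes the `kafka-python` package; brokers, group id, and topic are placeholders.
from kafka import KafkaConsumer, TopicPartition

def consumer_lag(bootstrap_servers, group_id, topic):
    consumer = KafkaConsumer(
        bootstrap_servers=bootstrap_servers,
        group_id=group_id,
        enable_auto_commit=False,  # read-only: do not move the group's offsets
    )
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    end_offsets = consumer.end_offsets(partitions)  # latest offset per partition
    lag = {}
    for tp in partitions:
        committed = consumer.committed(tp)  # None if the group never committed
        lag[tp.partition] = end_offsets[tp] - (committed or 0)
    consumer.close()
    return lag

if __name__ == "__main__":
    print(consumer_lag(["kafka:9092"], "example-consumer-group", "example-topic"))
```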

Workflow Scheduler

  • Failed Steps
    • This usually implies a bad configuration or an inability to communicate with external systems.
  • Failed Workflows
    • Failed Steps or bad configurations could lead to failed workflows.
  • Queued Workflows
    • Workflows being queued could mean there is a lack of resources on the cluster; the sketch below counts failed and queued workflows.
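
The workflow scheduler is not named above. If it is a Kubernetes-native scheduler such as Argo Workflows (an assumption), failed and queued workflows can be counted from its Workflow custom resources, as in this hedged sketch:

```python
# Hedged sketch: count workflows by phase, assuming an Argo Workflows-style
# scheduler whose Workflow custom resources live under the argoproj.io group.
# Adjust group/version/plural if your scheduler differs.
from collections import Counter
from kubernetes import client, config

def workflow_phase_counts():
    config.load_kube_config()  # or config.load_incluster_config()
    api = client.CustomObjectsApi()
    workflows = api.list_cluster_custom_object(
        group="argoproj.io", version="v1alpha1", plural="workflows"
    )
    return Counter(
        (wf.get("status", {}) or {}).get("phase", "Unknown")
        for wf in workflows.get("items", [])
    )

if __name__ == "__main__":
    counts = workflow_phase_counts()
    print("failed:", counts.get("Failed", 0), "queued:", counts.get("Pending", 0))
```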

Microservices

  • Rate of 4XX/5XX HTTP response status
    • Bad HTTP status codes can occur for various reasons (bugs, pod restarts, invalid credentials, access issues, etc.).
  • Response times
    • Elevated response times can happen for various reasons (bugs, pod restarts, etc.); the sketch below polls both the error rate and latency.
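
If request metrics for the microservices are scraped into Prometheus (an assumption, as are the metric and label names below), both signals can be polled through the Prometheus HTTP query API, as in this sketch:

```python
# Hedged sketch: poll Prometheus for the 5xx error ratio and p95 latency.
# The Prometheus URL and the metric/label names (http_requests_total,
# http_request_duration_seconds_bucket, status) are placeholders that must
# match whatever your services actually export.
import requests

PROM_URL = "http://prometheus.example.internal:9090"

ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{status=~"5.."}[5m])) '
    "/ sum(rate(http_requests_total[5m]))"
)
P95_LATENCY_QUERY = (
    "histogram_quantile(0.95, "
    "sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
)

def instant_query(query):
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # An instant vector result holds [timestamp, value] pairs; the value is a string.
    return float(result[0]["value"][1]) if result else None

if __name__ == "__main__":
    print("5xx ratio:", instant_query(ERROR_RATE_QUERY))
    print("p95 latency (s):", instant_query(P95_LATENCY_QUERY))
```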