Monitoring Best Practices
The Arthur Scope product monitors machine learning models. It runs on Kubernetes and scales on demand. Several components should be monitored to keep the platform healthy.
Some recommended best practices for monitoring the various Scope components are as follows:
Kubernetes
- Pods
- CPU/Memory utilization
- Pods are the smallest deployable units in Kubernetes. Always ensure pods have sufficient resources (CPU and memory) available to run.
- Number of Restarts
- Pods restarting frequently is a sign of buggy code or a bad configuration.
- Pods in Pending/Unknown/Unavailable/CrashLoopBackOff (CLBO) state
- Pods not in a Ready state are a sign of hardware degradation or of connectivity failures to external systems (the sketch after this list checks pod state, restart counts, VolumeAttachment errors, and Ready node counts per zone).
- Persistent Volumes
- IOPS
- Ensure the storage backing Persistent Volumes has enough throughput provisioned and is not being throttled.
- Available Disk Space
- Ensure attached Persistent Volumes have enough disk space.
- VolumeAttachment Errors
- Ensure no VolumeAttachment errors are observed for Persistent Volumes. This is particularly critical in multi-AZ deployments.
- Nodes
- Sufficient nodes in each AZ
- Ensure there is the required number of nodes per AZ for each deployment.
- Max nodes per cluster
- Monitor the total number of nodes the cluster scales to, so that both performance and cost stay optimal.
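As a concrete starting point, the checks above can be scripted against the Kubernetes API. The sketch below uses the official `kubernetes` Python client to flag unhealthy or frequently restarting pods, VolumeAttachment errors, and zones with too few Ready nodes; the thresholds are example values, not Arthur defaults.

```python
# Minimal sketch of the Kubernetes checks above, using the official `kubernetes`
# Python client. The thresholds below are example values, not Arthur defaults.
from collections import Counter
from kubernetes import client, config

RESTART_THRESHOLD = 5     # example: flag containers restarted 5+ times
MIN_NODES_PER_ZONE = 2    # example: required Ready nodes per availability zone


def check_pods(v1: client.CoreV1Api) -> None:
    """Report pods that are not Running/Succeeded, are in CrashLoopBackOff,
    or are restarting frequently."""
    for pod in v1.list_pod_for_all_namespaces().items:
        name = f"{pod.metadata.namespace}/{pod.metadata.name}"
        phase = pod.status.phase or "Unknown"
        if phase not in ("Running", "Succeeded"):
            print(f"{name}: phase={phase}")
        for cs in pod.status.container_statuses or []:
            waiting = cs.state.waiting
            if waiting and waiting.reason == "CrashLoopBackOff":
                print(f"{name}: container {cs.name} is in CrashLoopBackOff")
            if cs.restart_count >= RESTART_THRESHOLD:
                print(f"{name}: container {cs.name} restarted {cs.restart_count} times")


def check_volume_attachments(storage: client.StorageV1Api) -> None:
    """Report VolumeAttachment objects with attach/detach errors."""
    for va in storage.list_volume_attachment().items:
        if va.status and (va.status.attach_error or va.status.detach_error):
            print(f"VolumeAttachment {va.metadata.name}: attach/detach error")


def check_nodes_per_zone(v1: client.CoreV1Api) -> None:
    """Count Ready nodes per zone using the topology.kubernetes.io/zone label."""
    ready_per_zone = Counter()
    for node in v1.list_node().items:
        ready = any(c.type == "Ready" and c.status == "True"
                    for c in node.status.conditions or [])
        if ready:
            zone = (node.metadata.labels or {}).get("topology.kubernetes.io/zone", "unknown")
            ready_per_zone[zone] += 1
    for zone, count in ready_per_zone.items():
        if count < MIN_NODES_PER_ZONE:
            print(f"zone {zone}: only {count} Ready node(s)")


if __name__ == "__main__":
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    check_pods(client.CoreV1Api())
    check_volume_attachments(client.StorageV1Api())
    check_nodes_per_zone(client.CoreV1Api())
```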
Datastores
- Meta Database (External)
- Disk Space
- Ensure there is enough disk space for the database (a size check is sketched after this list).
- IOPS
- Monitor for any throttling of the database disk and adjust provisioned IOPS accordingly.
- CPU
- Monitor for any throttling of the database CPU and adjust it accordingly.
- OLAP Database
- Replication Lag
- The OLAP database is usually deployed as a 3-node setup kept in sync via replication. Lag occurs when data is not yet consistent across all nodes (a lag check is sketched after this list).
- Delayed/Rejected Inserts
- This usually happens when a large number of INSERTs are sent too quickly, which can lead to data loss or corruption.
- ZooKeeper Exceptions
- These should generally not happen and are sometimes an indication of bad hardware.
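For the external meta database, disk, IOPS, and CPU are usually watched through the cloud provider's own database metrics. As a complement, the sketch below tracks how large the database itself has grown. It assumes the meta database is PostgreSQL and that `psycopg2` is installed (neither is stated above); the connection string is a hypothetical placeholder.

```python
# Minimal sketch, assuming a PostgreSQL meta database and psycopg2 (both assumptions);
# the DSN is a hypothetical placeholder.
import psycopg2

DSN = "host=meta-db.internal dbname=arthur user=monitor"  # hypothetical connection string


def database_size_gib(dsn: str = DSN) -> float:
    """Return the size of the current database in GiB."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_database_size(current_database())")
            size_bytes = cur.fetchone()[0]
    return size_bytes / 1024**3


if __name__ == "__main__":
    print(f"meta database size: {database_size_gib():.1f} GiB")
```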
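For the OLAP database, the ZooKeeper reference above suggests ClickHouse, but that is an assumption here. Under that assumption, the sketch below uses `clickhouse-driver` to surface replicas with high delay or long replication queues from the `system.replicas` table; the host and thresholds are example values.

```python
# Minimal sketch, assuming the OLAP database is ClickHouse and clickhouse-driver is
# installed (both assumptions); the host and thresholds are example values.
from clickhouse_driver import Client

MAX_DELAY_SECONDS = 60
MAX_QUEUE_SIZE = 100


def lagging_replicas(host: str = "olap-db.internal"):
    client = Client(host=host)
    rows = client.execute(
        "SELECT database, table, absolute_delay, queue_size "
        "FROM system.replicas "
        "WHERE absolute_delay > %(delay)s OR queue_size > %(queue)s",
        {"delay": MAX_DELAY_SECONDS, "queue": MAX_QUEUE_SIZE},
    )
    for database, table, delay, queue in rows:
        print(f"{database}.{table}: delay={delay}s queue_size={queue}")
    return rows
```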
Messaging Middleware
- Kafka
- Consumer Lag
- Producers write data to, and consumers read data from, the messaging middleware. If consumers cannot keep up with the producers, lag builds up, which can mean poor performance for the platform (a lag check is sketched after this list).
- Under-replicated partitions
- Follower replicas copy data from leader replicas via replication. Due to resource exhaustion or leader failure, follower replicas can fall behind the leaders.
- Kafka Connect
- Connector failures
- These failures mean data is not being written to the data stores, which can lead to data loss (a status check against the Kafka Connect REST API is sketched after this list).
- Task failures
- These failures usually mean incoming data does not match the connector configuration, which can lead to data loss or corruption.
- ZooKeeper
- Outstanding Requests
- This is the number of requests waiting to be processed by ZooKeeper; a persistently high value means ZooKeeper cannot keep up (a check using the mntr command is sketched after this list).
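Consumer lag is typically exported by the brokers or a dedicated exporter, but it can also be computed directly. The sketch below uses `kafka-python` to compare a consumer group's committed offsets against the latest offsets on the brokers; the bootstrap address and group name are example values.

```python
# Minimal sketch of a consumer-lag check using kafka-python; the bootstrap address
# and consumer group name are example values.
from kafka import KafkaAdminClient, KafkaConsumer

BOOTSTRAP = "kafka:9092"        # example broker address
GROUP_ID = "scope-consumers"    # hypothetical consumer group name


def consumer_lag(bootstrap: str = BOOTSTRAP, group_id: str = GROUP_ID) -> dict:
    admin = KafkaAdminClient(bootstrap_servers=bootstrap)
    committed = admin.list_consumer_group_offsets(group_id)  # TopicPartition -> OffsetAndMetadata
    consumer = KafkaConsumer(bootstrap_servers=bootstrap)
    latest = consumer.end_offsets(list(committed.keys()))    # TopicPartition -> end offset
    lag = {tp: latest[tp] - meta.offset for tp, meta in committed.items()}
    for tp, value in sorted(lag.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{tp.topic}[{tp.partition}]: lag={value}")
    return lag
```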
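Kafka Connect exposes connector and task state over its REST API, which the sketch below polls with `requests`; the Connect URL is an example value.

```python
# Minimal sketch that reports non-RUNNING connectors and tasks via the Kafka Connect
# REST API; the Connect URL is an example value.
import requests

CONNECT_URL = "http://kafka-connect:8083"  # example address


def failed_connectors(base_url: str = CONNECT_URL) -> list:
    failures = []
    for name in requests.get(f"{base_url}/connectors", timeout=10).json():
        status = requests.get(f"{base_url}/connectors/{name}/status", timeout=10).json()
        if status["connector"]["state"] != "RUNNING":
            failures.append((name, "connector", status["connector"]["state"]))
        for task in status.get("tasks", []):
            if task["state"] != "RUNNING":
                failures.append((name, f"task {task['id']}", task["state"]))
    for name, component, state in failures:
        print(f"{name}: {component} is {state}")
    return failures
```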
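ZooKeeper's mntr four-letter command reports `zk_outstanding_requests` along with other health counters; the sketch below reads it over a plain socket. On ZooKeeper 3.5+, mntr must be enabled via `4lw.commands.whitelist`, and the host is an example value.

```python
# Minimal sketch that reads ZooKeeper's `mntr` output over a plain socket; the host is
# an example value, and `mntr` must be whitelisted on ZooKeeper 3.5+.
import socket


def zk_outstanding_requests(host: str = "zookeeper", port: int = 2181) -> int:
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"mntr")
        data = b""
        while chunk := sock.recv(4096):
            data += chunk
    stats = dict(line.split("\t", 1) for line in data.decode().splitlines() if "\t" in line)
    return int(stats.get("zk_outstanding_requests", 0))


if __name__ == "__main__":
    print(f"outstanding requests: {zk_outstanding_requests()}")
```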
Workflow Scheduler
- Failed Steps
- This usually implies a bad configuration or an inability to communicate with external systems.
- Failed Workflows
- Failed Steps or bad configurations could lead to failed workflows.
- Queued Workflows
- Workflows being queued could mean there is a lack of resources on the cluster (a status check is sketched after this list).
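The workflow scheduler is not named above; assuming it is Argo Workflows (a common choice on Kubernetes, but an assumption here), workflow state can be read from the Workflow custom resources with the `kubernetes` client, as in the sketch below. The namespace is an example value.

```python
# Minimal sketch, assuming the workflow scheduler is Argo Workflows (an assumption) and
# using the official kubernetes Python client; the namespace is an example value.
from kubernetes import client, config


def workflow_phases(namespace: str = "arthur"):
    config.load_kube_config()
    api = client.CustomObjectsApi()
    workflows = api.list_namespaced_custom_object(
        group="argoproj.io", version="v1alpha1",
        namespace=namespace, plural="workflows",
    )
    for wf in workflows.get("items", []):
        phase = wf.get("status", {}).get("phase", "Unknown")
        # "Failed"/"Error" point at failed steps or configuration problems;
        # "Pending" workflows are queued, often for lack of cluster resources.
        if phase in ("Failed", "Error", "Pending"):
            print(f"{wf['metadata']['name']}: {phase}")
```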
Microservices
- Rate of 4XX/5XX HTTP responses
- Error status codes can occur for various reasons (bugs, pod restarts, invalid credentials, access issues, etc.).
- Response times
- Elevated response times can also occur for various reasons (bugs, pod restarts, etc.); a simple probe covering both checks is sketched below.
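Error rates and latency are best tracked from the services' own metrics or the ingress layer, but a simple external probe is a useful backstop. The sketch below times requests to a list of endpoints and flags 4XX/5XX responses; the URLs and the latency threshold are hypothetical example values.

```python
# Minimal sketch of an external HTTP probe; the endpoint URLs and the latency
# threshold are hypothetical example values.
import requests

ENDPOINTS = [
    "https://scope.example.internal/health",   # hypothetical endpoints
    "https://scope.example.internal/api/v1/models",
]
MAX_LATENCY_SECONDS = 1.0


def probe(endpoints=ENDPOINTS):
    for url in endpoints:
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException as exc:
            print(f"{url}: request failed ({exc})")
            continue
        elapsed = response.elapsed.total_seconds()
        if response.status_code >= 400:
            print(f"{url}: HTTP {response.status_code}")
        if elapsed > MAX_LATENCY_SECONDS:
            print(f"{url}: slow response ({elapsed:.2f}s)")


if __name__ == "__main__":
    probe()
```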