Monitoring Best Practices
The Arthur Scope product is used to monitor machine learning models. It runs on Kubernetes and can scale on demand. Several components should be monitored so that the platform stays healthy.
Some recommended best practices for monitoring the various Scope components are as follows:
Kubernetes
- Pods
- CPU/Memory utilization
- Pods are the smallest building blocks in Kubernetes. Ensure pods always have sufficient resources (CPU and memory) available to run.
- Number of Restarts
- Frequent pod restarts are a sign of buggy code or a bad configuration.
- Pods in Pending/Unknown/Unavailable/CrashLoopBackOff (CLBO) state
- Pods that are not in a Ready state can be a sign of hardware degradation or connectivity failures to external systems.
- Persistent Volumes
- IOPS
- Ensure the storage backing Persistent Volumes has enough throughput provisioned and that no throttling is occurring.
- Available Disk Space
- Ensure attached Persistent Volumes have enough disk space.
- VolumeAttachment Errors
- Ensure there are no VolumeAttachment errors observed in Persistent Volumes. This is particularly critical in multi-AZ deployments.
- Nodes
- Sufficient nodes in each AZ
- Ensure there is the required number of nodes per AZ for each deployment.
- Max nodes per cluster
- Monitoring the total number of nodes a cluster has scaled to helps keep performance and costs optimal.
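The pod checks above (restart counts and pods that are not Ready) can be sketched as a small filter. This is a minimal illustration, not part of Scope: `PodStatus`, `unhealthy_pods`, and the `max_restarts` threshold of 5 are hypothetical names and values, and in practice the input would come from whatever Kubernetes metrics source you already use.

```python
from dataclasses import dataclass


@dataclass
class PodStatus:
    """Hypothetical snapshot of one pod's state."""
    name: str
    phase: str          # e.g. "Running", "Pending", "Unknown"
    ready: bool
    restart_count: int


def unhealthy_pods(pods, max_restarts=5):
    """Return {pod name: [reasons]} for pods that need attention."""
    findings = {}
    for pod in pods:
        reasons = []
        # A high restart count suggests buggy code or bad configuration.
        if pod.restart_count > max_restarts:
            reasons.append("frequent restarts")
        # Pods stuck outside a healthy phase, or running but not Ready.
        if pod.phase in ("Pending", "Unknown"):
            reasons.append(f"phase {pod.phase}")
        elif not pod.ready:
            reasons.append("not ready")
        if reasons:
            findings[pod.name] = reasons
    return findings
```

The threshold is an assumption and should be tuned to how often restarts are expected in a given deployment.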
Datastores
- Meta Database (External)
- Disk Space
- Ensure there is enough disk space for the database.
- IOPS
- Monitor for any throttling of performance for the database disk and adjust IOPS accordingly.
- CPU
- Monitor for any throttling of database CPU performance and adjust CPU capacity accordingly.
- OLAP Database
- Replication Lag
- The OLAP database is usually deployed in a 3-node setup, with the nodes kept in sync via replication. Lag occurs when data is not yet consistent across all nodes.
- Delayed/Rejected Inserts
- This usually happens when a large number of INSERT statements are sent too quickly. This can lead to data loss or corruption.
- ZooKeeper Exceptions
- These should generally not happen and are sometimes an indication of bad hardware.
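The disk space checks listed for the datastores can be expressed as a simple threshold rule. The sketch below is illustrative only: `disk_alert_level` and the 80% and 90% thresholds are assumptions, not values recommended by the product.

```python
def disk_alert_level(used_bytes, total_bytes, warn_at=0.80, critical_at=0.90):
    """Classify disk utilization as "ok", "warning", or "critical".

    The thresholds are hypothetical defaults; pick values that leave
    enough headroom for your database's growth rate.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    if ratio >= critical_at:
        return "critical"
    if ratio >= warn_at:
        return "warning"
    return "ok"
```

The same rule could be applied to attached Persistent Volumes as well as to the database disk.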
Messaging Middleware
- Kafka
- Consumer Lag
- Producers write data to the messaging middleware and Consumers read data from it. If consumers cannot keep up with the producers, lag builds up, which can mean poor performance for the platform.
- Under-replicated partitions
- Follower replicas copy data from Leader replicas. Due to resource exhaustion or Leader failure, the Follower replicas may not keep up with the Leader replicas.
- Kafka Connect
- Connector failures
- These failures mean data is not being written to data stores, which can lead to data loss.
- Task failures
- These failures mean the data does not match the configuration, which can lead to data loss or corruption.
- ZooKeeper
- Outstanding Requests
- This is the number of requests waiting to be processed by ZooKeeper.
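Consumer lag, as described for Kafka above, is commonly computed per partition as the latest offset written by producers minus the offset the consumer group has committed. The following sketch shows that arithmetic on plain dictionaries; `consumer_lag` and its inputs are hypothetical, and fetching the real offsets from the brokers is left out.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Return per-partition lag for one topic and consumer group.

    end_offsets:       {partition: latest offset written by producers}
    committed_offsets: {partition: last offset committed by the consumers}
    A partition with no committed offset is treated as fully unread,
    which is an assumption made for this sketch.
    """
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero in case the two readings were taken at different times.
        lag[partition] = max(end - committed, 0)
    return lag


def total_lag(end_offsets, committed_offsets):
    """Sum of lag across all partitions."""
    return sum(consumer_lag(end_offsets, committed_offsets).values())
```

A lag value that keeps growing over time is usually more meaningful than any single reading, so it helps to alert on the trend rather than on one sample.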
Workflow Scheduler
- Failed Steps
- This usually implies a bad configuration or an inability to communicate with external systems.
- Failed Workflows
- Failed Steps or bad configurations could lead to failed workflows.
- Queued Workflows
- Workflows being queued could mean there is a lack of resources on the cluster.
Microservices
- Rate of 4XX/5XX HTTP response status
- Bad HTTP status codes can occur for various reasons (bugs, pod restarts, invalid credentials, access issues, etc.).
- Response times
- Elevated response times can occur for various reasons (bugs, pod restarts, etc.).
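The rate of 4XX/5XX responses mentioned above is the share of requests in a time window that returned a status code in each class. A minimal sketch, assuming the status codes for the window have already been collected into a list (`status_class_rates` is a hypothetical helper, not a Scope API):

```python
from collections import Counter


def status_class_rates(status_codes):
    """Return the fraction of responses that were 4XX and 5XX."""
    total = len(status_codes)
    if total == 0:
        # No traffic in the window: report zero rather than dividing by zero.
        return {"4xx": 0.0, "5xx": 0.0}
    # Group codes by their leading digit, e.g. 404 -> 4, 503 -> 5.
    counts = Counter(code // 100 for code in status_codes)
    return {"4xx": counts[4] / total, "5xx": counts[5] / total}
```

Tracking the two classes separately is useful because 4XX errors often point to client or credential problems, while 5XX errors point to the service itself.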