Monitor Domino and infrastructure

Monitoring Domino involves tracking several key application metrics. These metrics reveal the health of the application and can provide advance warning of any issues or failures of Domino components.

This list is not exhaustive.

Domino deployments include a pre-configured Grafana instance that you can use for monitoring. You can also use other application monitoring tools in addition to Grafana to track these metrics.

Domino runs in Kubernetes, which is an orchestration framework for containerized applications. In this model, the following are the distinct layers with their own relevant metrics:

Domino application: This is the top layer, representing Domino application components running in containers that are deployed and managed by Kubernetes. The content in this Admin guide focuses on operations in this layer.
Kubernetes cluster: This is the Kubernetes software-defined hardware abstraction and orchestration system that manages the deployment and lifecycle of Domino application components. Cluster operations are handled a layer below Domino, but they must account for the Domino architecture and cluster requirements. For guidance about general cluster administration, see the official Kubernetes documentation.
Host infrastructure: This is the bottom layer, which represents the virtual or physical host machines that do work as nodes in the Kubernetes cluster. IT owners of the infrastructure are responsible for operations in this layer, including management of compute and storage resources, as well as OS patching. Domino does not have any unique or unusual requirements in this layer.

The following tables start from the underlying infrastructure and build up in layers to the Domino core services. Each table also includes descriptions with considerations. However, every cluster is different, so you must monitor and adjust the thresholds for your environment. For example, consider how long it might take you to respond when storage usage is increasing. You might want to warn at 50% and escalate at 80%.

Also, remember that Kubernetes manages itself, so momentary bursts can trigger alerts that might not be a concern.
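
One way to keep momentary bursts from raising alerts is to fire only when a threshold is exceeded for an entire window, which is how the "for N minutes" thresholds in this guide are phrased. The following is a minimal sketch in Python, assuming one sample per minute. The function name and the sample data are illustrative and are not part of Domino or Grafana.

```python
def sustained_breach(samples, threshold, window):
    """Return True only if the last `window` samples all exceed `threshold`.

    Assumes `samples` is ordered oldest to newest with one value per minute.
    """
    if window <= 0:
        raise ValueError("window must be a positive number of samples")
    if len(samples) < window:
        return False  # not enough history to call the breach sustained
    return all(value > threshold for value in samples[-window:])


# Hypothetical node CPU percentages: a short burst above 80% followed by recovery.
burst = [40, 45, 95, 97, 50, 42, 41, 44, 43, 40, 39, 41, 45, 44, 42]
# Hypothetical node CPU percentages: fifteen consecutive minutes above 80%.
sustained = [85] * 15
```

With these inputs, `sustained_breach(burst, 80, 15)` does not fire, while `sustained_breach(sustained, 80, 15)` does.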

Domino recommends tracking these metrics in priority order:

Metric	Suggested threshold	Description
Latency to `/health`	1000ms	Measures the time to receive a response to a request to the Domino API server. If the response time is too high, this suggests that the system is unhealthy and that user experience might be impacted. This can be measured by calls to the Domino application at a path of `/health`.
Dispatcher pod availability from metrics server	`nucleus-dispatcher` pods available = 0 for >10 minutes	If the number of pods in the `nucleus-dispatcher` deployment is 0 for greater than 10 minutes, it’s an indication of critical issues that Domino will not automatically recover from, and functionality will be degraded.
Frontend pod availability from metrics server	`nucleus-front-end` pods available < 2 for >10 minutes	If the number of pods in the `nucleus-front-end` deployment is less than two for greater than 10 minutes, it’s an indication of critical issues that Domino will not automatically recover from, and functionality will be degraded.
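
The `/health` latency check in the first row can be approximated with a small probe. This is a sketch only, using the Python standard library. The base URL in the usage note is a placeholder, the function names are illustrative, and treating any latency above 1000 ms as an alert is an assumption about how the suggested threshold is applied.

```python
import time
import urllib.request

HEALTH_PATH = "/health"
LATENCY_THRESHOLD_MS = 1000  # suggested threshold from the table above


def measure_latency_ms(base_url, timeout_s=5.0, opener=urllib.request.urlopen):
    """Return the milliseconds taken to receive a response from <base_url>/health.

    `opener` is injectable so the probe can be exercised without a network.
    """
    url = base_url.rstrip("/") + HEALTH_PATH
    start = time.monotonic()
    with opener(url, timeout=timeout_s) as response:
        response.read()  # wait for the full response body
    return (time.monotonic() - start) * 1000.0


def latency_alert(latency_ms, threshold_ms=LATENCY_THRESHOLD_MS):
    """Return True when the measured latency exceeds the threshold."""
    return latency_ms > threshold_ms
```

For example, `latency_alert(measure_latency_ms("https://domino.example.com"))` would report whether a single probe exceeded the threshold, where `domino.example.com` stands in for your own Domino hostname. A failed or timed-out request raises an exception, which a real check would also treat as unhealthy.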

Infrastructure

Observe and monitor the following for each node in your Kubernetes cluster.

Metric	Suggested threshold	Description
Average CPU usage	>80% for 15 minutes	Average node CPU usage must not stay high for long periods of time.
Average memory usage	>90% for 15 minutes	Average node memory usage must not stay high for long periods of time.
Disk usage	FS >85% for 15 minutes	Local disk is used by the underlying operating system as well as by Kubernetes and the containers running on it. Usage might spike while many containers are running and then dip. This is normal behavior, but usage should not be consistently high.
Node not ready status	>0 for 30 minutes	If a node is in a not ready state, it cannot accept containers, so your Kubernetes platform will not run at full capacity.
Shared file system sizes	FS >75% (Warn) FS >90% (Critical)	Domino uses shared file systems to back a number of its persistent volumes. Monitor these and increase their size as workloads and volumes grow.
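
The Warn and Critical file system thresholds in the table above could be evaluated as in the following sketch. It uses `shutil.disk_usage` from the Python standard library. The function names and the mount path in the usage note are illustrative, and the default thresholds are the suggested shared file system values, which you would adjust for your environment.

```python
import shutil


def usage_percent(path):
    """Return the percentage of the file system containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def classify(percent, warn=75.0, critical=90.0):
    """Map a usage percentage to "ok", "warn", or "critical"."""
    if percent > critical:
        return "critical"
    if percent > warn:
        return "warn"
    return "ok"
```

For example, `classify(usage_percent("/mnt/shared"))` would classify a hypothetical shared mount at `/mnt/shared`. Passing `warn=50, critical=80` gives more time to respond when storage grows quickly.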

General Kubernetes

Observe the following metrics across the entire Kubernetes platform. If a threshold is hit, it might be an early warning sign of a platform issue that can affect Domino users.

Metric	Suggested threshold	Description
Failed pods count	Dependent on cluster (some failed pods in a development environment might be expected).	Observe the number of pods in a failed state. Depending on the type of environment you are in and what else runs on your platform it might be normal to have a few failed pods. Configure the threshold accordingly.
Containers running out of disk space	FS >75% for 5 minutes (Warn) FS >90% for 5 minutes (Critical)	In addition to the underlying operating system, the containers running on your platform use disk, both ephemeral and persistent. Significant increases in disk usage, or long periods of high usage, may impact service.
Container memory usage	>90% for 5 minutes (Warn) >95% for 5 minutes (Critical)	Containers consume memory from the underlying operating system. They are typically configured with requests and limits to prevent one container from consuming all the memory on the system. As workloads grow, the limits might be reached and need to be adjusted. If limits aren’t set, ensure that containers are not consuming too much node memory.
Container CPU usage	>90% for 5 minutes (Warn) >95% for 5 minutes (Critical)	Container CPU works on the same basis as container memory, described in the previous row.
Pods unschedulable	>0 for 7 minutes	If a pod can’t be scheduled, there might be issues on the cluster. You might not have enough nodes so there isn’t enough capacity. There might also be an issue with a specific type of node or a constraint not being met for the pod deployment, such as storage availability. Check these because it can be an early warning sign of an issue.
Pods not ready	>0 for 10 minutes	Pods must be ready and available in a reasonable time frame. If they are taking significant time to become ready, this can be a sign that something is not running as expected.
OOM Killed events	>0 for 15 minutes	If a pod consumes too much memory and surpasses its quotas and limits, or significantly impacts the node, an out of memory error can kill it. If this happens, review the application to see if the memory must be adjusted. Also, adjust quotas and limits accordingly. It also might be an underlying issue with the application.
Pod evictions	>0 for 10 minutes	These occur when a node is resource starved. It might be Kubernetes rebalancing itself and scaling up nodes or shifting workload to another node that is not at capacity. However, it might be an indication that you must manually scale your cluster.
Replicaset count	>0 pods missing from replicaset	A ReplicaSet specifies a set number of pods that must be running. If the number of running replica pods is less than the count that the ReplicaSet specifies, something is likely wrong.
ImagePullBackOff	>10 count for 5 minutes	All pods run an image that comes from a registry, either directly from an upstream Domino registry or from some form of proxied internal registry. ImagePullBackOff failures might indicate an issue with the network connection to the registry, with the registry itself, or with authentication to the registry.
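
Several of the conditions in the table above are visible in pod status as reported by the Kubernetes API. The following sketch counts them from the JSON that `kubectl get pods --all-namespaces -o json` produces. It is an illustration under the assumption that a point-in-time count is enough. A monitoring tool would normally track these values over time, and the function names are not part of any Domino tooling.

```python
import json


def summarize_pods(pod_list):
    """Count pods that match some of the conditions in the table above."""
    counts = {
        "failed": 0,
        "evicted": 0,
        "unschedulable": 0,
        "oom_killed": 0,
        "image_pull_backoff": 0,
    }
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Failed":
            counts["failed"] += 1
        if status.get("reason") == "Evicted":
            counts["evicted"] += 1
        # The scheduler marks the PodScheduled condition False when it cannot place a pod.
        if any(
            c.get("type") == "PodScheduled"
            and c.get("status") == "False"
            and c.get("reason") == "Unschedulable"
            for c in status.get("conditions", [])
        ):
            counts["unschedulable"] += 1
        containers = status.get("containerStatuses", [])
        # A container restarted after an out-of-memory kill records it in lastState.
        if any(
            (c.get("lastState", {}).get("terminated") or {}).get("reason") == "OOMKilled"
            for c in containers
        ):
            counts["oom_killed"] += 1
        if any(
            (c.get("state", {}).get("waiting") or {}).get("reason") == "ImagePullBackOff"
            for c in containers
        ):
            counts["image_pull_backoff"] += 1
    return counts


def summarize_pods_json(text):
    """Parse kubectl JSON output and summarize it."""
    return summarize_pods(json.loads(text))
```

Each pod is counted at most once per category, even if several of its containers are affected.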

Domino services

Many of the metrics and suggested alert thresholds that follow duplicate the overall Kubernetes metrics. However, to attribute an issue to a specific Domino core service and to ensure the health of Domino itself, it’s worth monitoring specific events for some of the core services.

To learn more about what each service is responsible for, see Architecture.

Nucleus frontend

Metric	Suggested threshold	Description
Evicted pods	>0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical)	Pods being evicted
Frontend Pods not ready	>0 for 5 minutes	Pods in a not ready state
High GC CPU usage	>15% for 15 minutes	The nucleus frontend is a Java application, so in addition to the standard container metrics, it’s important to monitor JVM health. To do this, use the high garbage collection CPU usage metric.

Nucleus dispatcher

Metric	Suggested threshold	Description
Evicted pods	>0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical)	Pods being evicted
Pods not ready	>0 for 5 minutes	Dispatcher pods in a not ready state
High GC CPU usage	>15% for 15 minutes	Dispatcher, much like the frontend, is a Java-based application, so you must use the Garbage Collection metric to observe the Java application health.

MongoDB

Metric	Suggested threshold	Description
Evicted pods	>0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical)	See notes on Kubernetes
Pods not ready	>0 for 5 minutes	See notes on Kubernetes
Replica Set degraded	>80% for 5 minutes	See notes on Kubernetes
MongoDB high CPU usage	>85% count for 10 minutes (Warn) >100% count for 10 minutes (Critical)	High CPU usage of Mongo might indicate that it is not behaving as expected
High PVC usage	>75% count for 15 minutes (Warn) >80% count for 15 minutes (Critical)	Mongo uses persistent storage and, as the database grows, this will fill the storage. This might have to be increased over time.
High PVC inode usage	>80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical)	In addition to filling space, Mongo continually reads from and writes to disk. High inode usage can lead to a degradation of performance.
`mongo.mongod.queryexecutor.scannedPerSecond` / `mongo.mongod.document.returnedPerSecond`	<1	A value >1 means that queries scan more documents than they return, which indicates an issue with indexing on the collection.
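
The scanned-to-returned ratio in the last row is a simple division of two rates. The sketch below shows the arithmetic with made-up numbers. The function name is illustrative, and how you obtain the two rates depends on your monitoring tool.

```python
def scanned_to_returned_ratio(scanned_per_second, returned_per_second):
    """Return scanned/returned, or None when no documents were returned."""
    if returned_per_second == 0:
        return None  # ratio is undefined for an idle collection
    return scanned_per_second / returned_per_second


# Hypothetical rates: 500 index entries scanned per second for 100 documents returned.
ratio = scanned_to_returned_ratio(500, 100)
needs_index_review = ratio is not None and ratio > 1
```

Here the ratio is 5.0, so each returned document costs five scans on average and the collection's indexes are worth reviewing.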

Git

Metric	Suggested threshold	Description
Evicted pods	>0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical)	See notes on Kubernetes
Pods not ready	>0 for 5 minutes	See notes on Kubernetes
Replica Set degraded	>80% for 5 minutes	See notes on Kubernetes
Git high CPU usage	>85% count for 10 minutes (Warn) >100% count for 10 minutes (Critical)	High CPU usage of Git can be an indicator that it is not behaving as expected
High PVC usage	>75% count for 15 minutes (Warn) >80% count for 15 minutes (Critical)	Git uses persistent storage and, as the number of commits grows, this will fill the storage. This might have to be increased over time.
High PVC inode usage	>80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical)	In addition to filling space, Git continually reads from and writes to disk. High inode usage can lead to a degradation of performance.
Git Error rates	>1 count for 5 minutes	Git performs functions such as init, download, and upload as part of its service. Errors from these events indicate that users are experiencing issues with version control.

Docker registry

Metric	Suggested threshold	Description
Evicted pods	>0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical)	See notes on Kubernetes
Not ready	>0 for 5 minutes	See notes on Kubernetes
Replica Set degraded	>80% for 5 minutes	See notes on Kubernetes
High CPU usage	>85% count for 10 minutes (Warn) >100% count for 10 minutes (Critical)	If you use the deployed Docker registry, monitor its CPU usage because sustained high usage can be an indicator that it is not behaving as expected.
Docker registry error rates	>1 unit for 15 minutes	The Docker registry is exposed as an HTTP/HTTPS service. Connection failures to the service indicate there might be an issue with images being stored or pulled.
Docker registry high latency	>80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical)	High latency to the service will impact pull and push times for images and lead to a degradation of service.
Evicted pods	>1 count for 5 minutes	See notes on Kubernetes

RabbitMQ

Metric	Suggested threshold	Description
Evicted pods	>0 count for 5 minutes (Warn) >5 count for 5 minutes (Critical)	See notes on Kubernetes
Pods not ready	>0 for 5 minutes	See notes on Kubernetes
Replica Set degraded	>80% for 5 minutes	See notes on Kubernetes
High pod memory usage	>75% count for 15 minutes (Warn) >90% count for 15 minutes (Critical)	See notes on Kubernetes
High queue rate	>1000 count for 10 minutes	Rabbit must be continuously sending messages. An increased queue count indicates it cannot send messages and a service is not behaving as expected.
RabbitMQ low memory	>90 for 10 minutes	Rabbit is a high-memory consuming application. Its memory usage will be constantly high. A drop in memory usage might indicate that it’s not functioning as expected.
Available TCP sockets	>90% for 10 minutes	Rabbit is the message distributor for all services in Domino. It must be connected to all the services to be able to communicate. If the number of free TCP sockets is significantly low, it might struggle to create those connections.
High PVC usage	>75% count for 15 minutes (Warn) >85% count for 15 minutes (Critical)	Rabbit uses persistent storage. This might have to be increased over time.
High PVC inode usage	>80% count for 15 minutes (Warn) >90% count for 15 minutes (Critical)	In addition to filling space, Rabbit continually reads from and writes to disk. High inode usage can lead to a degradation of performance.
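
The queue threshold in the table above can be checked against a list of queue objects. The sketch below assumes the JSON shape returned by the RabbitMQ management plugin's `GET /api/queues` endpoint, where each queue object carries a `name` and a `messages` count. Whether that plugin is enabled and reachable in your deployment is an assumption, and the function name and queue names are illustrative.

```python
import json

QUEUE_DEPTH_LIMIT = 1000  # suggested "High queue rate" threshold from the table above


def backed_up_queues(queues, limit=QUEUE_DEPTH_LIMIT):
    """Return sorted (name, depth) pairs for queues holding more than `limit` messages."""
    return sorted(
        (queue.get("name", ""), queue.get("messages", 0))
        for queue in queues
        if queue.get("messages", 0) > limit
    )


# Hypothetical payload in the assumed /api/queues shape.
payload = '[{"name": "runs", "messages": 1500}, {"name": "events", "messages": 10}]'
deep = backed_up_queues(json.loads(payload))
```

A queue that stays in this list for 10 minutes suggests that a consumer is not keeping up or has stopped.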

Execution layer

Metric	Suggested threshold	Description
Model pods scheduled	>0 for 15 minutes	If model pods remain in the scheduled state for a significant amount of time, it might indicate that they will fail to start. Investigate these pods.
Zombie Runs	>0 for 15 minutes (Warn)	When a Run completes, its pod must shut itself down. If the pod continues to run as a zombie pod, this can lead to excess workload on your cluster. Investigate to identify why the Run did not terminate upon completion.
