Understanding Node Problem Detector
Jan 4 2024

Monitoring your Kubernetes cluster is as important as containerizing the application code you run on it. In other words, it's indispensable. To understand why it can't be ignored, think about what we would be missing if there were no monitoring for the cluster at all. There are modern monitoring solutions like Prometheus, Dynatrace, New Relic, and Datadog, but at some point these can end up processing a lot of information we don't actively need, and storing all those metrics and logs adds overhead and cost over the longer term. But what if we could use a tool native to the K8s ecosystem that helps analyse node-related troubles in a more focused way? We can do that with the help of node-problem-detector. Let's talk about it in a little more detail in this post.
node-problem-detector, or npd, is not a new tool and has been around for a couple of years now. It has been adopted in different ways, for example as an add-on to a K8s cluster or as a systemd service running on the host. node-problem-detector mainly focuses on making node problems visible to the upstream layers (the kube-apiserver or any external monitoring solution). It runs as a daemon on each worker node and reports problems back to the kube-apiserver. node-problem-detector has three main components: the Problem API, the Problem Daemon, and the Exporter. We'll discuss these in a bit more detail to understand their roles.
The above image is a representation of an npd pod deployed as a DaemonSet inside a K8s cluster, with each worker node (host) running one such pod. The Problem API is how npd communicates with the kube-apiserver to report different node problems; it is classified into two types: Event and NodeCondition. Whenever we run kubectl describe node for a given node in the cluster, the Events appear at the bottom of the output, and a little above them we also get a few NodeCondition entries. The Problem Daemon is classified into four types: SystemLogMonitor, SystemStatsMonitor, CustomPluginMonitor, and HealthChecker. And the Exporter, the other subprocess of npd, connects to external backends such as the Kubernetes Exporter, Prometheus Exporter, or Stackdriver Exporter. Let's discuss all these components in a little more detail below.
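To make that concrete, here is roughly where npd's reports show up in kubectl describe node. The node name is hypothetical and the commented output is a trimmed, illustrative excerpt, using condition and reason names from npd's default kernel monitor rules:

```sh
# Inspect the conditions and events reported for a node
# (node name is hypothetical; commented output is an illustrative excerpt)
kubectl describe node worker-node-1

# Conditions:
#   Type                Status  Reason                   Message
#   ----                ------  ------                   -------
#   KernelDeadlock      False   KernelHasNoDeadlock      kernel has no deadlock
#   ReadonlyFilesystem  False   FilesystemIsNotReadOnly  Filesystem is not read-only
#   Ready               True    KubeletReady             kubelet is posting ready status
# Events:
#   Type     Reason      Age  From            Message
#   ----     ------      ---  ----            -------
#   Warning  OOMKilling  2m   kernel-monitor  Killed process 1234 (stress) ...
```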
The Problem Daemon is a subprocess of npd designed to monitor specific kinds of problems and report them back to the npd service. The following are the different types of problem daemons:
- SystemLogMonitor: monitors the system logs and reports problems and metrics according to predefined rules. Log sources here can be filelog, kmsg, abrt, or systemd (journald) logs.
- SystemStatsMonitor: collects various health-related system stats as metrics only.
- CustomPluginMonitor: invokes user-defined scripts to check for various node problems (a minimal plugin sketch follows this list).
- HealthChecker: checks the health of the kubelet and the container runtime on a node.
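A custom plugin is just an executable whose exit code npd interprets. The script below is a hypothetical example (the service being checked and the file name are ours, not part of npd); it follows the custom plugin convention of exit 0 for OK, 1 for a detected problem, and anything else for an unknown state, with whatever the script prints to stdout surfaced as the message:

```sh
#!/bin/bash
# Hypothetical custom plugin, wired in via a CustomPluginMonitor config.
# Exit code convention: 0 -> OK, 1 -> problem detected, other -> unknown.
# The message printed to stdout is surfaced with the result.
if systemctl --quiet is-active ntpd; then
  echo "ntpd service is running"
  exit 0
else
  echo "ntpd service is not running"
  exit 1
fi
```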
One can access all the related config (JSON) files from this location.
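As a flavour of what those files contain, here is a minimal SystemLogMonitor sketch in the style of the project's kernel-monitor.json. The rule patterns and the output path are simplified illustrations, not the exact upstream values:

```sh
# Minimal SystemLogMonitor config sketch (illustrative, simplified);
# npd loads files like this via flags such as --config.system-log-monitor
cat <<'EOF' > /tmp/kmsg-monitor-example.json
{
  "plugin": "kmsg",
  "logPath": "/dev/kmsg",
  "lookback": "5m",
  "source": "kernel-monitor",
  "conditions": [
    {
      "type": "KernelDeadlock",
      "reason": "KernelHasNoDeadlock",
      "message": "kernel has no deadlock"
    }
  ],
  "rules": [
    {
      "type": "temporary",
      "reason": "OOMKilling",
      "pattern": "Killed process \\d+ .*"
    },
    {
      "type": "permanent",
      "condition": "KernelDeadlock",
      "reason": "DockerHung",
      "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
    }
  ]
}
EOF
```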
Now coming to the Exporter. The Exporter is the component used to report node problems to certain backends. The following exporters are supported:
- Kubernetes Exporter: reports node problems to the K8s kube-apiserver. Temporary problems are reported as Events and permanent problems are reported as NodeConditions.
- Prometheus Exporter: reports node problems and metrics locally as Prometheus metrics. We can specify the IP address and port for the exporter using command-line arguments (see the sketch after this list).
- Stackdriver Exporter: reports node problems and metrics to the Stackdriver Monitoring API.
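For the Prometheus exporter, those command-line arguments look roughly like this. The config path is a placeholder, and the port shown is the commonly documented default for npd's metrics endpoint (adjust both as needed):

```sh
# Run npd with the Prometheus exporter bound to a chosen address/port
# (config path is a placeholder; 20257 is the commonly documented default port)
node-problem-detector \
  --config.system-log-monitor=/config/kernel-monitor.json \
  --prometheus-address=0.0.0.0 \
  --prometheus-port=20257

# Scrape the metrics endpoint locally
curl http://localhost:20257/metrics
```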
Moving on to the installation part. There are three main ways to install and run node-problem-detector as a part of your cluster:
- The easiest is a helm install. The charts are available in this repo (see the command sketch after this list).
- There are also a couple of YAML files available here which can be applied manually using the kubectl apply command.
- Another way to run npd is to install it as a systemd service on the host. This comes by default on GKE clusters and works well without the hassle of maintaining separate K8s resource configurations.
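For the first two options, the commands look roughly as follows. The helm repo name and URL are placeholders for the chart repo linked above, and the manifest URLs assume the files shipped in the upstream project's deployment/ directory:

```sh
# Option 1: helm install (repo name/URL are placeholders for the chart repo linked above)
helm repo add <npd-chart-repo> <npd-chart-repo-url>
helm install npd <npd-chart-repo>/node-problem-detector

# Option 2: apply the upstream manifests with kubectl
kubectl apply -f https://raw.githubusercontent.com/kubernetes/node-problem-detector/master/deployment/node-problem-detector-config.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/node-problem-detector/master/deployment/node-problem-detector.yaml
```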
Remedy Systems
Remedy systems are tools that integrate with node-problem-detector and act as the next step once problems or events have been reported. node-problem-detector should always be seen as the first step towards tackling any node-related issue that is bound to cause further damage if ignored. It is not an ultimate solution for all node-related troubles in itself, but once integrated with a remedy system it can help our cluster recover to some extent. A few such systems are the Descheduler, MediK8s (Poison Pill), and Draino. These help issue corrective actions like node reboots/reloads, or restarting a particular service such as the container runtime or the kubelet running on a host. When used effectively, such remedy systems can definitely make a difference in mitigating node-related issues with quick actions before manual intervention is needed.
References
- Project Link: https://github.com/kubernetes/node-problem-detector
- NPD with GKE: https://cloud.google.com/anthos/clusters/docs/on-prem/latest/troubleshooting/toolbox