
Monitoring With Prometheus (Part-II)

In my last post, we discussed what Prometheus is, how it works and some generic configuration around it. In this post, we'll take a detailed look at Instrumentation, Dashboards & Grafana, Labels, Node Exporter and a few other components of Prometheus.

There are various benefits to adding Prometheus instrumentation to our applications. For example, we can instrument a Python 3 application with the official client library; a simple pip install prometheus_client is all it takes. Another benefit is that metrics are automatically registered with the client library in the default registry, so we don't need to pass each metric to start_http_server ourselves. Also, if a transitive dependency includes Prometheus instrumentation, its metrics will appear automatically as well. More info on the Prometheus Python client can be found here. Prometheus client libraries also provide functionality to count exceptions, and can even take care of details like thread-safety and bookkeeping for us. A very basic question - what should I instrument? - instrument either services or libraries.
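To make this concrete, here is a minimal sketch of a Python service instrumented with prometheus_client. The metric names, ports and handler below are illustrative, not from the original post:

from prometheus_client import Counter, start_http_server
import http.server

# Metrics are registered in the default registry as soon as they are created.
REQUESTS = Counter('hello_worlds_total', 'Hello Worlds requested.')
EXCEPTIONS = Counter('hello_world_exceptions_total', 'Exceptions serving Hello World.')

class MyHandler(http.server.BaseHTTPRequestHandler):
    @EXCEPTIONS.count_exceptions()  # counts exceptions raised while handling a request
    def do_GET(self):
        REQUESTS.inc()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'Hello World')

if __name__ == '__main__':
    start_http_server(8000)  # metrics exposed on http://localhost:8000/metrics
    http.server.HTTPServer(('localhost', 8001), MyHandler).serve_forever()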

Let's talk about labels, as they are an integral part of Prometheus monitoring configuration. Labels are key-value pairs associated with time series that, in addition to the metric name, identify them uniquely. Labels come from two sources: Instrumentation Labels, which come from the instrumentation itself, and Target Labels, which identify a monitoring target that Prometheus scrapes. Instrumentation Labels are about things like the type of HTTP request in our code or which database a call talks to, whereas Target Labels relate more to our architecture, such as which application it is or which exact instance of the application it is. Let's look at an example of labels below:

Labels

In the above image, if we take a look at the query process_resident_memory_bytes{instance="localhost:9090", job="prometheus"}, instance & job denote two labels. One more interesting fact about labels is that names beginning with dunders (double underscores) are reserved. And, in Prometheus, the metric name is just another label called __name__. For example, the expression up is roughly equivalent to {__name__="up"}. Another fact: removing any label from instrumentation is always a breaking change, because removing a label removes a distinction a user may have been depending upon. There are exceptions to this as well, like info metrics.
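As a quick sketch of instrumentation labels in the Python client (the metric and label names here are illustrative):

from prometheus_client import Counter

# A counter with an instrumentation label; label values are supplied at call time.
REQUESTS = Counter('http_requests_total', 'Total HTTP requests.', labelnames=['path'])

def handle_request(path):
    REQUESTS.labels(path).inc()  # produces e.g. http_requests_total{path="/home"}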

One important component of the Prometheus stack is the Node Exporter. It exposes system-level metrics like CPU, memory, disk space, disk I/O, network bandwidth and even motherboard temperatures. The Node Exporter is meant to be used only with Unix-based systems; Windows users should use the wmi_exporter instead. Also, the Node Exporter is intended to monitor the machine itself, not individual processes or services running on it. Unlike Prometheus, some other monitoring systems have a component called an uberagent, a single monitoring entity for services, processes or applications. In Prometheus, each service exposes its own metrics, using an exporter if needed, and these are then scraped by Prometheus. That is why an uberagent is not required in Prometheus, and this also helps the performance of scraping metrics. The Node Exporter is designed to be run as a non-root user and should run on the machine directly, in the same way we run cron or sshd. It is not recommended to run the Node Exporter in Docker, since Docker tries to isolate a container from the inner workings of the machine, which doesn't work well with the Node Exporter.
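Assuming a Node Exporter is running locally on its default port of 9100, a scrape config for it could look like this:

scrape_configs:
  - job_name: node
    static_configs:
    - targets:
      - localhost:9100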

Now let's talk about how Prometheus does service discovery, focusing mainly on targets. Service discovery enables us to provide Prometheus with all the relevant information about our services and machines, the key part being knowing what our monitoring targets are and what should be scraped. A good service discovery mechanism provides us with metadata, which could be the name of a service, its description, who owns it, structured tags about it, or any other info that might be useful. Broadly, there are two categories of service discovery: bottom-up and top-down. Mechanisms where service instances register themselves with service discovery, such as Consul, are bottom-up; those where service discovery already knows what should be there, such as EC2 instances, are top-down. Let's look at examples of service discovery configurations:

scrape_configs:
  - job_name: prometheus
    static_configs:
    - targets:
      - localhost:9090

Above is an example of a static config, where targets are provided directly in prometheus.yaml. We can even mix and match service discovery mechanisms within a scrape config, although that is unlikely to result in an understandable configuration:

scrape_configs:
- job_name: node
  static_configs:
  - targets:
    - host1:9100
  - targets:
    - host2:9100

Apart from the above, we can also define scrape targets in a separate file, in either .json or .yaml format. This is called file service discovery, or file SD. A minimal sketch is shown below.
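Here the scrape config points at file SD, and the filename targets.json and the team label are illustrative:

scrape_configs:
  - job_name: node
    file_sd_configs:
    - files:
      - '*.json'

And an example targets file, targets.json:

[
  {
    "targets": [ "host1:9100", "host2:9100" ],
    "labels": { "team": "monitoring" }
  }
]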

The Prometheus UI is quite limited in terms of graphs, dashboards and managing alerts, but when coupled with a popular dashboarding tool like Grafana, the two make a beautiful pair. Are monitoring dashboards really that necessary? Yes, definitely. If we need to check the performance of our system, dashboards are the first port of call. A dashboard is a set of graphs, tables and other visual aids for your system. There could be a dashboard for global traffic, showing how much traffic is being served and with what latency, as well as dashboards for CPU usage, memory usage and service-specific metrics. Originally, Prometheus had its own dashboarding tool called Promdash. Even though it was a good fit at the time, the Prometheus developers decided to go with Grafana rather than keep enhancing Promdash. A basic dashboard with just a few panels could look like the one below:

grafana

One can decorate the dashboard with numerous panels as per usage or requirements.
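To get Grafana talking to Prometheus, the Prometheus server just needs to be added as a data source. A minimal provisioning sketch, assuming both run on the same host (the file path and data source name are illustrative):

# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090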

Prometheus is intended to run on the same network as what it is monitoring, as this reduces the ways in which things can fail and provides low-latency, high-bandwidth access to the targets it is scraping. Also, it is common to have separate Prometheus servers for network, infrastructure and application monitoring. This concept is known as vertical sharding and is one of the best ways to scale Prometheus. We can even have a federation of multiple servers, wherein a global Prometheus pulls aggregated metrics from our datacenter Prometheus servers. We can also look at tools like Thanos or Cortex, which are proven solutions for scaling and long-term storage of Prometheus.
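As a rough sketch of federation, the global Prometheus could scrape a datacenter Prometheus's /federate endpoint like this (the target name and the match[] selector are illustrative):

scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="node"}'
    static_configs:
    - targets:
      - dc1-prometheus:9090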

There is a lot we can leverage by effectively using Prometheus for our Infrastructure or Application monitoring.