Cluster design – monitoring

This is the first post in a series about how to design a cluster: a group of servers working together to support a service and allow it to scale horizontally.

This post is about monitoring, which allows us to detect issues and solve them before they become a problem and cost the business money.

Introduction

Imagine your company has spent months preparing for a big sales day like Black Friday, your applications fail on the big day, and your competitors pick up the sales you missed. Or a release doesn’t go as expected and your customers can no longer buy your key products. You surely want to be alerted as soon as this happens (or even before) so you can promptly correct the issue, keep your customers happy and get the sales you expected.

Monitoring and alerting help with this, and there are many aspects to consider. The most relevant ones are covered below:

Types of monitoring

There are different ways to monitor your applications. Some tools analyse the processes running on the server and detect changes (performance issues, stopped processes, increases in failure rates…). Others monitor applications from the outside, checking health endpoints every few minutes to confirm the applications are still working. And others look at the logs and detect error patterns that may not be noticed by other tools.

Which of them to use depends on the application to monitor, but it is usually good to combine them, thinking about how to cover each error scenario. E.g. using Dynatrace or Prometheus to get real-time metrics of the system and cover most errors, Kibana alerts to detect errors in the logs, and an external tool to check that endpoints are accessible through the CDN.
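
As an illustration of the “from the outside” approach, here is a minimal sketch of an external health check in Python. The endpoint URLs are placeholders and the alerting step is only hinted at; the real check would run on a schedule and call your alerting tool.

    import requests

    # hypothetical endpoints; replace with your own services
    HEALTH_ENDPOINTS = [
        "https://shop.example.com/actuator/health",
        "https://api.example.com/healthz",
    ]

    def check_endpoints():
        failures = []
        for url in HEALTH_ENDPOINTS:
            try:
                response = requests.get(url, timeout=5)
                if response.status_code != 200:
                    failures.append((url, f"HTTP {response.status_code}"))
            except requests.RequestException as error:
                failures.append((url, str(error)))
        return failures

    if __name__ == "__main__":
        for url, reason in check_endpoints():
            # in a real setup this would raise an alert instead of printing
            print(f"Health check failed for {url}: {reason}")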

Types of alerting

Some errors are more critical than others and require different levels of attention. E.g. a cluster going down is not the same as a minor connectivity issue, or disk usage crossing each threshold (warning, critical…). Depending on the severity we may want a soft alert (e.g. an email or a notification on a chat application) or an automated call-out to the engineers on rota. We should identify the error scenarios, who should be notified for each of them and the process to escalate issues using applications such as PagerDuty.
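
As a sketch of how an automated call-out could be triggered, here is a minimal example assuming PagerDuty’s Events API v2; the routing key is a placeholder and how the severity is decided depends on your own thresholds.

    import requests

    PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
    ROUTING_KEY = "your-integration-routing-key"  # placeholder from the PagerDuty service

    def raise_alert(summary, source, severity):
        # severity drives the escalation: "warning" may only send an email,
        # while "critical" pages the engineer on rota
        event = {
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": summary,
                "source": source,
                "severity": severity,
            },
        }
        response = requests.post(PAGERDUTY_EVENTS_URL, json=event, timeout=10)
        response.raise_for_status()

    # example: a critical alert when a cluster becomes unreachable
    # raise_alert("Cluster api-cluster-01 is unreachable", "monitoring-script", "critical")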

Types of errors to consider

  • Increases in memory/disk/CPU usage: this is key as processes will stop when they reach certain thresholds or if they don’t have enough space to write logs
  • Stopped processes: most applications recover automatically after an error (e.g. Kubernetes restarts unhealthy pods) but others may not have this mechanism and may have to be restarted manually or via a monitoring script. And there are scenarios when an app cannot recover automatically and requires manual intervention, e.g. if an LDAP or database account gets locked.
  • Sudden increases in the number of errors: we should monitor this as a new release may have introduced bugs and have to be rolled back, or new authentication failures could be caused by hackers trying to break into an application
  • Messages in error queues: many applications use failure and dead-letter queues to handle different types of issues (remote queues not reachable, bad message format…). We can use queue depth threshold detection for them, as in the sketch after this list.
  • Performance issues: applications may start having worse performance after a release (e.g. due to a bad SQL query) or due to infrastructure issues (e.g. an overloaded database or Elasticsearch instance). We should detect and correct it as soon as possible to prevent an impact on the user experience that could make the company lose sales.
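
As an example of queue depth threshold detection, here is a minimal sketch assuming a RabbitMQ broker with the management plugin enabled; the host, credentials, queue name and threshold are all placeholders and the idea applies to any broker that exposes queue depths.

    import requests

    # %2F is the default "/" vhost URL-encoded; queue name and threshold are placeholders
    RABBITMQ_QUEUE_API = "http://rabbitmq.internal:15672/api/queues/%2F/orders.dead-letter"
    QUEUE_DEPTH_THRESHOLD = 100

    def check_dead_letter_queue():
        response = requests.get(RABBITMQ_QUEUE_API, auth=("monitoring", "secret"), timeout=10)
        response.raise_for_status()
        depth = response.json()["messages"]
        if depth > QUEUE_DEPTH_THRESHOLD:
            # in a real setup this would call the alerting tool instead of printing
            print(f"Dead-letter queue depth is {depth}, above threshold {QUEUE_DEPTH_THRESHOLD}")

    if __name__ == "__main__":
        check_dead_letter_queue()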

Dashboards to configure

Tools like Dynatrace or Grafana allow us to configure dashboards that give an easy overview of the status of the system:

  • System overview: to get an idea about how the services are working and detect trends and unexpected changes. E.g. an unexpected memory increase after a release, disk space usage growing faster than expected, etc.
  • Errors: processes restarting frequently, services with most errors, error queues with messages, …
  • Business dashboards: e.g. regions with most/least sales, times when call centres receive most calls in each country, user navigation steps where sales drop, partners driving most income…
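
Dashboards can also be created programmatically, which helps keep them consistent across environments. Below is a minimal sketch assuming Grafana’s HTTP API and a placeholder API key; the dashboard definition is heavily simplified and the panels would be filled in with your own metrics.

    import requests

    GRAFANA_URL = "https://grafana.internal"       # placeholder
    GRAFANA_API_KEY = "your-grafana-api-key"       # placeholder

    def create_overview_dashboard():
        payload = {
            "dashboard": {
                "id": None,
                "title": "System overview",
                "panels": [],  # panels (memory, disk, error rates...) would be defined here
            },
            "overwrite": True,
        }
        response = requests.post(
            f"{GRAFANA_URL}/api/dashboards/db",
            json=payload,
            headers={"Authorization": f"Bearer {GRAFANA_API_KEY}"},
            timeout=10,
        )
        response.raise_for_status()

    if __name__ == "__main__":
        create_overview_dashboard()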

Real vs synthetic monitoring

We should focus our monitoring efforts on the general behaviour of the systems under normal usage. However, we can also run a set of predefined tests that simulate typical user journeys and check how the applications respond. This is quite useful before and after releases to identify components whose performance may have changed or that are not working as expected. This is called synthetic monitoring; it can be done with automated end-to-end tests and configured in the monitoring tool so it can be distinguished from real customer usage.
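
Here is a minimal sketch of a synthetic check for a simplified user journey. The URLs and the “X-Synthetic-Test” header are assumptions; the real tagging mechanism depends on how your monitoring tool separates synthetic traffic from real customers.

    import requests
    import time

    JOURNEY = [
        "https://shop.example.com/",
        "https://shop.example.com/search?q=laptop",
        "https://shop.example.com/basket",
    ]
    SYNTHETIC_HEADERS = {"X-Synthetic-Test": "checkout-journey"}

    def run_journey():
        for url in JOURNEY:
            start = time.monotonic()
            response = requests.get(url, headers=SYNTHETIC_HEADERS, timeout=10)
            elapsed_ms = (time.monotonic() - start) * 1000
            # response times and status codes would be pushed to the monitoring tool
            print(f"{url} -> {response.status_code} in {elapsed_ms:.0f} ms")

    if __name__ == "__main__":
        run_journey()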

Automation

There are certain recurring tasks that can be automated to reduce the manual work. For example:

  • Stopping alerting for a set of services just before they are released and resuming it just afterwards, to ensure you are not flooded with emails during the release. This can be done by adding steps to your pipelines that interact with the alerting API (see the sketch after this list)
  • Configuring maintenance windows on some boxes ahead of planned maintenance: server patching, server reboots, etc.
  • Configuring monitoring or alerting for thousands of services or queues: these tools allow different monitoring and alerting settings for each service, but it is not practical to set them manually as the number of services grows, and most of them will have similar settings. We can have a script or a pipeline that interacts with the monitoring API to configure them at scale.
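
As a sketch of the first point, here is how pipeline steps could mute and unmute alerting around a release. The API below is hypothetical (placeholder URL and payload); the real calls depend entirely on the alerting tool you use.

    import requests

    # hypothetical alerting API; endpoints and payloads are placeholders
    ALERTING_API = "https://alerting.internal/api"
    SERVICES = ["checkout-service", "payment-service"]

    def set_alerting(service, enabled):
        response = requests.put(
            f"{ALERTING_API}/services/{service}/alerting",
            json={"enabled": enabled},
            timeout=10,
        )
        response.raise_for_status()

    # pipeline step just before the deployment
    def before_release():
        for service in SERVICES:
            set_alerting(service, enabled=False)

    # pipeline step just after the deployment
    def after_release():
        for service in SERVICES:
            set_alerting(service, enabled=True)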

Configuration needed to support monitoring

  • Plugins to install: we may have to install an agent to collect metrics and send them to the monitoring tool. Depending on the technology it may run inside the cluster, inside a container or as a sidecar container. This may require specific permissions and configuration, plus automated restarts using systemd or similar so the agent starts again when the server or the pod is rebooted.
  • Log ingestion: we may want to send logs to a centralised tool like Kibana. In this case, we may have to configure log ingestion pipelines to get data from multiple sources (e.g. in Logstash) and components to pick up new logs and ship them (e.g. Filebeat to read from files or a JMS plugin to read from queues), etc.
  • Log/data rotation: applications can generate millions of metric and log entries per hour and it may not be feasible to keep them forever in memory or on disk. We have to decide when data automatically moves from hot (actively written) to warm (read-only), then cold (stored but only accessible on demand) and is finally deleted (see the sketch below).
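
As an example of that hot/warm/cold lifecycle, here is a minimal sketch of an index lifecycle policy, assuming the logs are stored in Elasticsearch; the ages and sizes are placeholders and would depend on your retention requirements.

    import requests

    ELASTICSEARCH_URL = "http://elasticsearch.internal:9200"  # placeholder

    # ages and sizes below are placeholders; tune them to your retention needs
    lifecycle_policy = {
        "policy": {
            "phases": {
                "hot": {"actions": {"rollover": {"max_size": "50gb", "max_age": "1d"}}},
                "warm": {"min_age": "7d", "actions": {"readonly": {}}},
                "cold": {"min_age": "30d", "actions": {}},
                "delete": {"min_age": "90d", "actions": {"delete": {}}},
            }
        }
    }

    response = requests.put(
        f"{ELASTICSEARCH_URL}/_ilm/policy/application-logs",
        json=lifecycle_policy,
        timeout=10,
    )
    response.raise_for_status()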

This covers the basics of monitoring and alerting, and I am sure you may be using other tools or approaches. Feel free to share them in the comment section below :)

Rafael Borrego

Consultant and security champion specialised in Java, with experience in architecture and team management in both startups and big corporations.

Disclaimer: the posts are based on my own experience and may not reflect the views of my current or any previous employer
