
Book: Real-World SRE
Author: Nat Welch
Last updated: 2025-02-18 08:55:20

Why monitoring?

Everyone tells you to go to the doctor regularly, but why should you? What are the benefits? My parents would say that you should go to the doctor to catch signs of things that you do not necessarily pay attention to, or notice, by yourself. These could be things like cholesterol levels, blood pressure, and skin cancer. I also like to use doctor visits as a time to think about and talk about changes that I have noticed in my body. For example, I might mention that I have frequently had an upset stomach.

These examples work as good comparisons of the two separate types of monitoring that software often needs. The first type is metrics and the second is logs. Metrics, in this case, are numeric measurements. Traditionally, metrics focused on performance numbers, like the percentage of disk space used, the number of packets received, or the CPU load. These days, they can be used to represent just about anything that can be defined as a numerical value, and they can come from any piece of software in your system. Logs are events, which can have numbers and other data attached, but are often less structured. Some logs are complete JSON blobs of data, while others are just human-formatted strings of text. They can also be anything in between.
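To make the distinction concrete, here is a minimal Python sketch of one metric sample and the same event logged two ways. The `MetricSample` class, the metric name, the host name, and the log fields are all hypothetical, chosen only for illustration; real monitoring systems define their own formats.

```python
import json
from dataclasses import dataclass, field


@dataclass
class MetricSample:
    """One numeric measurement taken at a point in time (hypothetical shape)."""
    name: str
    value: float
    timestamp: float  # seconds since the Unix epoch
    tags: dict = field(default_factory=dict)


# A metric: a named number, easy to graph and compare over time.
disk_used = MetricSample(
    name="disk.used_percent",
    value=71.5,
    timestamp=1700000000.0,
    tags={"host": "web-1"},
)

# A structured log: an event serialized as a JSON blob.
structured_log = json.dumps({
    "timestamp": 1700000000.0,
    "level": "error",
    "message": "upstream timed out",
    "path": "/checkout",
})

# An unstructured log: a human-formatted string describing the same event.
unstructured_log = "2023-11-14T22:13:20Z ERROR upstream timed out on /checkout"
```

The metric is trivially comparable across time because it is just a number with a name. The two log lines carry the same information, but only the JSON version can be queried by field without parsing free text.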

Let's say that I go to see the doctor and my blood pressure is recorded as 120/70 mmHg. My cholesterol is 190 mg/dL. There is also a new mole on my back and my stomach has been feeling upset. The metrics, in this case, are my blood pressure and cholesterol. My doctor collects them every time I visit and there is a documented history of them. They also have simple ranges for the human adult that are considered safe. This fact is not too relevant right now but will be useful later, when we think about alerting in the next chapter. The mole and stomach issues are closer to events.

We are stretching the metaphor a bit thin here, but the mole has data around it and the doctor is making a gut decision based on its size, location, and time present to decide if the hospital should biopsy it or not. My doctor does not have regular data, but he has a one-off measurement. For the stomach issue, the doctor has a few statistics that are partially remembered by me. These are things that I have eaten and approximately how long, how intense, and how frequent the pain is.

Image of a human with monitoring labels. Each label is an example metric you might collect about a person's body.

For the metrics, the doctor records them and, if they are abnormal or out of bounds, may recommend some changes to my lifestyle. For the log data, the doctor will probably start by collecting more data, either by sending the mole to a lab or by asking me to keep a record of what I am eating.

As much as we might want them to take care of themselves, applications are not humans. So, we measure applications slightly differently. For a web application, the most common metrics are error counts, request counts, and request duration. The most common logs are error stack traces.

Note

If you want help remembering these three metrics (errors, requests, and duration), you can think of them as ERD or RED. Some people also call them REL (requests, errors, latency).

So, why are these metrics often the starting point? We can increment a counter every time an error occurs and write that error out to a log, with a timestamp. A counter is one of the most fundamental forms of monitoring. We just count the number of times something has happened. Some services let you store metadata with your counter increment, tying the log to the counter, but often you just write the logs and the counter increment with the same timestamp. This counter is useful because you want to know when you are serving an error to a user, and the matching logs let you evaluate what the errors are. You increment the counter so that you can calculate what percentage of the requests that you serve are errors, and so that you can quickly view long-term error trends.
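As a rough sketch of the idea, the following Python keeps a request counter and an error counter in memory, logs each error with a timestamp, and derives the error percentage. The logger name, the function names, and the use of `collections.Counter` as a stand-in for a metrics service are assumptions made for this example.

```python
import logging
import time
from collections import Counter

logger = logging.getLogger("webapp")  # hypothetical logger name
counters = Counter()                  # in-memory stand-in for a metrics service


def record_request():
    """Count every request served."""
    counters["requests"] += 1


def record_error(exc, now=None):
    """Increment the error counter and log the error with a timestamp."""
    now = time.time() if now is None else now
    counters["errors"] += 1
    logger.error("ts=%.3f error=%r", now, exc)


def error_rate():
    """Percentage of served requests that were errors."""
    if counters["requests"] == 0:
        return 0.0
    return 100.0 * counters["errors"] / counters["requests"]


# Simulate 20 requests, one of which fails.
for i in range(20):
    record_request()
    if i == 7:
        record_error(ValueError("bad input"))

print(f"error rate: {error_rate():.1f}%")  # error rate: 5.0%
```

The counter answers "how many and how often", while the log line answers "what went wrong"; the shared timestamp is what lets you move from one to the other.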

A total request count is useful because it tells us how often our application is being used. I am using the example of a basic HTTP 1.1 web application, so the total request count is an accurate view of how much work a server is doing. If the server is a streaming server, a team often counts bytes or packets instead of requests, because in their case a single request can represent more than a single unit of work. Counting bytes or packets gives them a view of how usage changes over time.

Note

See Chapter 9, Networking Foundations for more on how HTTP 1.1 works and how it differs from other versions of HTTP. We also cover what a packet is in that chapter.

In this example, request duration is just the length of a single HTTP request. We are measuring how long it takes the server to process the request, from when it receives the full request until it has sent out the full response. Request duration is used for a bunch of things. Firstly, you can use it to figure out whether certain types of requests are taking longer than others. If you tag each duration recording not just with the time it took, but also with the URL hit, the method (GET, POST, HEAD, and so on), and the status code you returned, then you could dig into the metrics.
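One way to record a tagged duration is sketched below in Python. The `timed_request` context manager, the `durations` list (standing in for a metrics backend), and the `handle_index` handler are hypothetical names for this example, not part of any particular framework.

```python
import time
from contextlib import contextmanager

durations = []  # in-memory stand-in for a metrics backend


@contextmanager
def timed_request(method, url):
    """Time one request and record it with method, URL, and status tags."""
    tags = {"method": method, "url": url, "status": None}
    start = time.perf_counter()
    try:
        yield tags  # the caller fills in tags["status"]
    finally:
        tags["seconds"] = time.perf_counter() - start
        durations.append(tags)


def handle_index():
    """Hypothetical handler that pretends to do a little work."""
    time.sleep(0.01)
    return 200


with timed_request("GET", "/") as tags:
    tags["status"] = handle_index()

print(durations[0])
```

Because the recording happens in a `finally` block, a duration is stored even when the handler raises, which is usually when you most want the measurement.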

For example, you might see that, on average, all requests that returned code 404 took one second longer than requests that returned code 200. Secondly, you could use this method to see how similar requests change over time. For instance, you could compare how requests to https://example.com/ performed in November with how they performed in December.
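Both kinds of comparison amount to grouping recorded durations by a tag and averaging each group. The Python sketch below uses made-up sample values and a hypothetical `average_by` helper; for simplicity it groups all requests by month rather than filtering to a single URL first.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical recorded samples: (status code, month, duration in seconds).
samples = [
    (200, "2023-11", 0.25),
    (200, "2023-11", 0.75),
    (404, "2023-11", 1.5),
    (200, "2023-12", 0.5),
    (404, "2023-12", 1.5),
]


def average_by(samples, key_index):
    """Average duration grouped by the field at key_index."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key_index]].append(sample[2])
    return {key: mean(values) for key, values in groups.items()}


by_status = average_by(samples, 0)  # compare 404s against 200s
by_month = average_by(samples, 1)   # compare November against December
```

With this invented data, `by_status` shows 404 responses averaging 1.5 seconds against 0.5 seconds for 200 responses, the kind of one-second gap described above.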

A graph of a service's total request count for the months of November and December. The December line shows traffic was slightly higher than November's traffic for most days. The days where this is not true are a large spike in the beginning of both months, and a slight slump in traffic near the end of the month in December.

We have mainly talked about how monitoring is useful and not necessarily why it is important. I propose some questions for you: how do you know a service is working? How do you know it is not working? How do you define what working is? This is what we are trying to solve with monitoring. Monitoring is important because it provides us with a data-driven view of our application and shows us whether it is working, without us having to sit there and constantly check the application every minute of our lives. With that in mind, I believe the best approach is always the practical approach. Let's try creating a simple application and instrumenting it with monitoring.