- Real-World SRE
- Nat Welch
Collecting and saving monitoring data
Once you have instrumented your application, you will need to store your data somewhere. As I mentioned in the Instrumenting an application section of this chapter, there are many possible tools you can use. I will be talking about some of the tools, but be aware that there are many others. Talk to your friends, do research online, and try different tools to figure out what is best for you, your team, and your organization.
I tend to organize monitoring tools into two buckets. This is often a simplification of these systems, but it helps me to think about how they work. These two buckets are polling applications and push applications.
Polling applications
Polling (also known as pull) applications scrape data from a service and then store and display the data. One common complaint about polling applications is that you need to keep a record of all of the services you want to scrape. There's nothing wrong with polling applications, but this is just something to think about: how do you know what services are running on your infrastructure? Polling applications are used by all sorts of companies and individuals around the world, from Google to people who just want to monitor their home Wi-Fi router.
A very simple example is the following Go program:
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "math/rand"
    "net/http"
    "time"
)

func main() {
    rand.Seed(time.Now().Unix())
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        data := map[string]float64{
            "hello": 1.0,
            "now":   float64(time.Now().Unix()),
            "rand":  rand.Float64(),
        }
        json, err := json.Marshal(data)
        if err != nil {
            log.Fatal(err)
        }
        w.Header().Set("Content-Type", "application/json; charset=utf-8")
        fmt.Fprintf(w, "%s", json)
    })
    log.Println("Starting server...")
    log.Println("Running on http://localhost:8080")
    log.Fatal(http.ListenAndServe(":8080", http.DefaultServeMux))
}
The preceding code outputs a JSON object that looks like the following:
$ curl http://localhost:8080/
{"hello":1,"now":1527602257,"rand":0.7835047124910645}
If you weren't using a monitoring framework or anything similar, you could scrape this data with another service and save it to a database. It wouldn't provide you with much valuable data (as knowing random numbers is not too useful in most cases), but it would serve as the basis for all polling applications. Go actually publishes a package in the standard library, called expvar, which publishes memory information about the current application on the path /debug/vars in JSON format, for you to scrape and collect if you want to. It also lets you add arbitrary data as needed.
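To make that concrete, here is a minimal sketch of publishing a custom counter with expvar; the page_hits metric name and the port are arbitrary choices for illustration:

package main

import (
    "expvar"
    "fmt"
    "log"
    "net/http"
)

// pageHits is published at /debug/vars alongside the memory and
// command-line data that expvar exposes by default. The name is
// just an example.
var pageHits = expvar.NewInt("page_hits")

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        pageHits.Add(1)
        fmt.Fprintln(w, "hello")
    })
    // Importing expvar registers its handler on http.DefaultServeMux,
    // so /debug/vars is served automatically by this server.
    log.Fatal(http.ListenAndServe(":8080", http.DefaultServeMux))
}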
Now that you have a rough idea of how polling systems work, let's walk through some open-source polling services.
Nagios
Nagios is one of the oldest monitoring tools. Many of the complaints about polling services come from people who have had bad experiences with Nagios. That being said, Nagios is incredibly extensible and very popular. It runs on just about everything and is very well documented. It is written in C and supports plugins in many languages. There are also a bunch of forks of Nagios out there that have attempted to modernize the system. One of the more popular forks is Icinga, which uses different data storage and a more modern API.
Nagios also stores data in an RRD database. RRD stands for round robin database; it uses a circular buffer to keep the database a constant size. RRDTool, the most prevalent implementation of RRD, transforms an RRD into a viewable graph. Like other older systems, Nagios has some built-in scraping tools, but more often than not, people end up writing their own scrapers to get the data they want into Nagios.
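The core idea behind a round robin database can be sketched with a fixed-size ring buffer. The following toy Go type only illustrates why the storage never grows; RRDTool's actual on-disk format is more involved:

package main

import "fmt"

// ringBuffer is a toy fixed-size buffer: once it is full, each new
// sample overwrites the oldest one, so storage stays a constant size.
type ringBuffer struct {
    samples []float64
    next    int
}

func newRingBuffer(size int) *ringBuffer {
    return &ringBuffer{samples: make([]float64, size)}
}

func (r *ringBuffer) Add(v float64) {
    r.samples[r.next] = v
    r.next = (r.next + 1) % len(r.samples)
}

func main() {
    rb := newRingBuffer(3)
    for i := 1; i <= 5; i++ {
        rb.Add(float64(i))
    }
    fmt.Println(rb.samples) // [4 5 3]: the two oldest samples were overwritten
}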
Prometheus
We gave an example of Prometheus and described its basic attributes in the Instrumenting an application section. It is maintained by the Cloud Native Computing Foundation (CNCF). The CNCF is an open-source foundation promoting software focused on improving infrastructure in the cloud. PromQL (Prometheus's query language) takes vectors of data and uses filters and functions to simplify the data. For example:
rate(http_requests_total{environment=~"staging|development",method!="GET"}[5m])
In this example, the metric is http_requests_total, and the filter is {environment=~"staging|development",method!="GET"}. [5m] is bunching the data into five-minute intervals, and rate() is taking the rate of change between each of those bunches.
If you visit the /metrics page for the application, it will output data that looks something like the following:
# HELP http_request_count_total HTTP request counts
# TYPE http_request_count_total counter
http_request_count_total{method="GET",path="/"} 1
This is then scraped by Prometheus and saved to the datastore. When you write a query, you are querying the datastore, not the application you are getting the metrics from.
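As a rough sketch of how that scraping gets configured, a minimal scrape_configs entry in prometheus.yml might look like the following; the job name and target address are placeholders for wherever your application's /metrics endpoint is actually served:

scrape_configs:
  - job_name: 'example-app'          # placeholder name for the service being scraped
    scrape_interval: 15s             # how often Prometheus polls the target
    static_configs:
      - targets: ['localhost:8080']  # host:port exposing /metrics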
Some find Prometheus's query language to have a bit of a learning curve, but it is very powerful. Prometheus is still young but has a large and active community behind it. It implements its own datastore but has tools for exporting and backing up data to external datasources. It also has its own alert manager for sending notifications to people who are on call.
Cacti
Cacti is built with PHP and MySQL, but tends to need a decent amount of coding to get up and running. You will need to write scripts to do most of your data aggregation, something that comes for free in many other systems. That being said, it is very popular in smaller setups and in the Internet of Things community. Like Nagios, Cacti stores data in an RRD database.
Sensu
Sensu is often seen as a modern version of Nagios. It is written in Ruby and has a dedicated company behind it, which also sells an enterprise-hosted version with additional features and integrations. It stores its state in Redis but integrates with lots of other backends for actual metric data.
Push applications
Push applications are the inverse of pull applications. Instead of the monitoring application getting metrics from services, services write to the monitoring application. Often, there is an intermediary service, which translates or aggregates the metrics before sending them to the central monitoring application. One of the complaints about push applications is that they can have issues if lots of services are writing to them at the same time. There is nothing wrong with push applications, though; they just have a different architecture. Like polling applications, many very large and also very small companies use push applications, and they are just as effective.
An example push monitoring system is just a simple application that creates a new row of metric data in a database every time you want to record something. I've created that in the following Go program:
package main

import (
    "database/sql"
    "fmt"
    "log"
    "net/http"
    "time"

    _ "github.com/lib/pq"
)

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        db, err := sql.Open("postgres", "host=localhost dbname=sretest sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()
        stmt, err := db.Prepare("INSERT INTO data (metric, value, created) VALUES ($1, $2, $3)")
        if err != nil {
            log.Fatal(err)
        }
        defer stmt.Close()
        res, err := stmt.Exec("GET /", 1, time.Now())
        if err != nil || res == nil {
            log.Fatal(err)
        }
        w.Header().Set("Content-Type", "application/json; charset=utf-8")
        fmt.Fprintf(w, "{\"hello\": \"world\"}")
    })
    log.Println("Starting server...")
    log.Println("Running on http://localhost:8080")
    log.Fatal(http.ListenAndServe(":8080", http.DefaultServeMux))
}
For this to work, you'll need to create a PostgreSQL database named sretest locally and run create table data (metric text, value float, created timestamp); to create the table to insert data into. Then, after loading the page a few times, you can query the database with the following query to get all of the metrics logged today:
sretest=# select metric, sum(value) from data where created >= now()::date group by metric;
 metric | sum
--------+-----
 GET /  |   6
(1 row)
However, instead of implementing this yourself, there are lots of great push-based applications you can use that are much more robust than just shoving time series data into a PostgreSQL database.
StatsD
We gave an example of StatsD and described its basic attributes in the Instrumenting an application section. It was written in JavaScript, but its specification has since been reimplemented in many languages to add new features and improve performance. Many companies sell StatsD-based monitoring infrastructure. Traditionally, people used Graphite as the backend storage for StatsD, but as its popularity has grown, people have begun using many different datastores with StatsD.
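To make the wire protocol concrete, the following is a minimal sketch of sending StatsD metrics by hand over UDP; the metric names and the localhost:8125 address are assumptions for illustration, and in practice you would normally use a StatsD client library:

package main

import (
    "fmt"
    "log"
    "net"
)

func main() {
    // StatsD listens for plain-text metrics over UDP, commonly on port 8125.
    conn, err := net.Dial("udp", "localhost:8125")
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    // The line protocol is "<bucket>:<value>|<type>"; "c" is a counter and
    // "ms" is a timer. The metric names here are made up.
    fmt.Fprintf(conn, "page.views:1|c")
    fmt.Fprintf(conn, "request.latency:320|ms")
}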
Telegraf
Telegraf is built by the company behind InfluxDB. It is built to write to InfluxDB and be the T in their TICK monitoring stack. TICK stands for Telegraf, InfluxDB, Chronograf (their graphing and web UI), and Kapacitor (their alerting framework). That being said, it supports many different backends and inputs for data. It is written in Go.
ELK
ELK is an acronym for Elasticsearch, Logstash, and Kibana. In recent years, this stack has been called the Elastic Stack, as it now includes replacements for Logstash, called Beats. All of these products (and many others) are maintained by the Elastic company.
Elasticsearch is a datastore (it's technically a search engine with a JSON API and a built-in datastore), Logstash is a log processor, and Kibana is a web UI for accessing Elasticsearch's search functionality and turning the output into graphs and tables. I include this because some people enjoy writing log lines for their metrics and then writing Logstash parsing rules to turn log lines into metrics. There are lots of ways to implement this, as Logstash has a bunch of competitors, and you can configure Logstash, or tools like Google's mtail, to write to other metrics services. I personally find log parsing to be fragile, but that feeling comes from the days of having to write parsing logic for lots of different log files that would change without warning and break metric ingestion. Now that people write logs out in various types of formatted lines (such as JSON or Protocol buffers), this is much less of an issue. I still would much rather just use client libraries to write out metrics, but many organizations enjoy log-based metrics, and there can be lots of benefits to having all of the data fidelity logs can provide.
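As a small illustration of the structured-logging approach, the following sketch emits one JSON object per line; the field names are made up, and a log processor such as Logstash would need matching parsing or filter rules to turn them into metrics:

package main

import (
    "encoding/json"
    "log"
    "os"
    "time"
)

// logEntry is a hypothetical structured log line; the field names are
// illustrative, not a schema any log processor requires.
type logEntry struct {
    Time       time.Time `json:"time"`
    Level      string    `json:"level"`
    Method     string    `json:"method"`
    Path       string    `json:"path"`
    DurationMS int64     `json:"duration_ms"`
}

func main() {
    enc := json.NewEncoder(os.Stdout)
    entry := logEntry{
        Time:       time.Now(),
        Level:      "info",
        Method:     "GET",
        Path:       "/",
        DurationMS: 42,
    }
    // One JSON object per line is easy for a log shipper to parse and
    // aggregate into metrics (for example, a request count or a latency
    // histogram).
    if err := enc.Encode(entry); err != nil {
        log.Fatal(err)
    }
}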