Tuesday, April 1, 2014

Graphite, capacity planning and cheap monitoring

Over the past year, I have worked a lot with Graphite. It is an awesome tool, and even if the frontend is a bit ugly, it can easily be replaced by another one such as Graphiti or the Modern-UI-like Dashing.

Graphite installation

Installing Graphite is a real pain, so be prepared to suffer if you want to run your own instance. Otherwise, look for a pre-configured instance, such as a Docker image or another virtual appliance.

File format

Graphite uses a custom file format named Whisper. It is similar to RRD: a data file is allocated up front to hold every possible value, given your resolution and your retention time.

Basically, you have to decide how often you will store a value and how long you want to keep it. With these two pieces of information, the backend (Carbon) can calculate exactly how much data you are going to store and optimise everything accordingly.

For instance, suppose you want to store one value per second and keep these values for one week. Your dataset will then consist of 1 * 3600 * 24 * 7 = 604 800 values over the week. Therefore, you could allocate an array of 604 800 elements and use a pointer to reference the next index to write at. Once you reach the last index, you just start again from the beginning and overwrite the previous values.

This means that you cannot have more than one value per second. If you send two values for the same time frame, only the last one is kept. This is an important point.
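
To make the idea concrete, here is a minimal sketch of such a fixed-size circular archive. It is not Whisper's actual code: the class name `RingArchive` and its methods are made up for illustration, and instead of a moving pointer the slot is derived from the timestamp, which gives the same wrap-around behaviour.

```python
class RingArchive:
    """Fixed-size archive: one slot per time frame, oldest data overwritten."""

    def __init__(self, seconds_per_point, retention_seconds):
        self.step = seconds_per_point
        self.size = retention_seconds // seconds_per_point
        # Every slot is allocated up front, as (timestamp, value) pairs.
        self.slots = [(0, None)] * self.size

    def write(self, timestamp, value):
        # Align the timestamp to its time frame, then map it onto a slot.
        aligned = timestamp - (timestamp % self.step)
        index = (aligned // self.step) % self.size
        self.slots[index] = (aligned, value)

    def read(self, timestamp):
        aligned = timestamp - (timestamp % self.step)
        stored_at, value = self.slots[(aligned // self.step) % self.size]
        # A slot may hold a point written during another lap around the ring.
        return value if stored_at == aligned else None


# One value per second, kept for one week: 604 800 slots.
week = RingArchive(seconds_per_point=1, retention_seconds=7 * 24 * 3600)
week.write(1396310400, 120.0)
week.write(1396310400, 95.0)  # same second: replaces the first value
```

After the two writes, `week.read(1396310400)` gives back only the second value, and a write one week later lands in the same slot and erases it.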

Differences with the combo Elasticsearch + Kibana

Now, this should make you think about Elasticsearch + Kibana. They are increasingly used for monitoring purposes and are very good tools. The point is that you should not try to compare Elasticsearch + Kibana with Graphite, since they do not address the same problem.

The idea with Elasticsearch + Kibana is that, since you do not know which queries you are going to run, you store all the raw data and do the aggregation at render time. For instance, if 5 queries are received during the same second, you store all 5 of them, not just the last one. With this philosophy, you can store an unlimited number of data points per time frame.

With Graphite, you do the aggregation before inserting the data, so you have to identify the questions you want to answer before doing anything else. You should not think of Carbon (the Graphite backend) as the storage engine for your raw data. It just cannot do that. If you really want to keep all your raw data and use Graphite as a frontend, you should consider storing it in HDFS (or just on a plain old hard disk) and using scripts to process your data and feed the results into Graphite.
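
As a rough illustration of such a script, the sketch below counts raw events per minute and renders the result in Carbon's plaintext format, one "path value timestamp" line per data point. The function names, the sample timestamps and the metric path `api.get_account.queries_1min` are hypothetical.

```python
from collections import Counter


def count_per_minute(events):
    """Count events per minute; `events` holds Unix timestamps in seconds."""
    counts = Counter()
    for timestamp in events:
        counts[timestamp - (timestamp % 60)] += 1
    return counts


def to_plaintext(metric_path, counts):
    """Render one 'path value timestamp' line per minute."""
    return [
        f"{metric_path} {value} {minute}"
        for minute, value in sorted(counts.items())
    ]


# Hypothetical raw data: five queries in one minute, one in the next.
raw_events = [1396310400, 1396310400, 1396310401,
              1396310430, 1396310459, 1396310460]
lines = to_plaintext("api.get_account.queries_1min",
                     count_per_minute(raw_events))
```

Each line, terminated by a newline, can then be written to Carbon's plaintext listener (TCP port 2003 in the default configuration, which may differ on your setup). Only two data points reach Graphite here, whatever the number of raw events.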

Graphite is ops-friendly

If your operations team already has storage for log files, they will probably not like your suggestion of adding Elasticsearch + Kibana, since it duplicates the raw data. I have been in this situation, and I have to agree with ops. If you have a repository (NFS mount, HDFS cluster, whatever) with all of your raw data, then duplicating it is not a good option.

This is where Graphite comes in and saves the day. Since you are only going to store the processed data, you will need very little disk space, and they will like you for being savvy. Plus, they are probably already using Graphite for their own monitoring.

Capacity planning

Now let’s say that you develop an API with 150 different endpoints (for example, GET /account and POST /account count as two different endpoints). You want to store the following information for three years:

  • Every minute
    • Number of queries during the last minute
    • 90th percentile response time over 1 minute
    • 99th percentile response time over 1 minute
    • Maximum response time over 1 minute
  • Every hour
    • Number of queries during the last hour
    • 90th percentile response time over 1 hour
    • 99th percentile response time over 1 hour
    • Maximum response time over 1 hour
  • Every day
    • Number of queries during the last day
    • 90th percentile response time over 1 day
    • 99th percentile response time over 1 day
    • Maximum response time over 1 day
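
For one endpoint and one window, the four values above could be computed along these lines. This is only a sketch: it uses the simple nearest-rank definition of a percentile, and the helper names and sample response times are invented for the example.

```python
import math


def percentile(sorted_values, percent):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(percent * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(response_times_ms):
    """Reduce the raw response times of one window to the four stored values."""
    ordered = sorted(response_times_ms)
    return {
        "count": len(ordered),
        "p90": percentile(ordered, 90),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
    }


# Hypothetical response times (ms) for one endpoint during one minute.
window = [12, 15, 11, 240, 18, 14, 13, 16, 19, 17]
summary = summarize(window)
```

Note that an hourly percentile cannot be rebuilt from sixty per-minute percentiles; it has to be computed from the raw response times of the whole hour, which is why the hourly and daily values are stored as metrics of their own.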

Think about it: how much space do you need?

Easy: about 11 GB. Yep, that’s it, 11 GB for all these metrics on 150 endpoints for three years.

This is a very simple calculation:

  • Storing one value per minute for three years takes ~19MB (~18480KB)
  • Storing one value per hour for three years takes ~308KB
  • Storing one value per day for three years takes ~16KB

Total space = 150 * ( 4 * 18480 + 4 * 308 + 4 * 16 ) = 11 282 400 KB ~= 11 GB
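
The same estimate can be reproduced with a few lines of Python, assuming that a Whisper data point takes 12 bytes (a 4-byte timestamp plus an 8-byte value) and ignoring the small header of each file. The helper `archive_kib` is made up for this sketch.

```python
POINT_BYTES = 12  # assumed size of one Whisper point: timestamp + value
SECONDS_PER_YEAR = 365 * 24 * 3600


def archive_kib(seconds_per_point, years=3):
    """Size in KiB of one metric stored at the given resolution."""
    points = years * SECONDS_PER_YEAR // seconds_per_point
    return points * POINT_BYTES / 1024


per_minute = archive_kib(60)     # one value per minute
per_hour = archive_kib(3600)     # one value per hour
per_day = archive_kib(86400)     # one value per day

# 150 endpoints, 4 metrics per resolution.
total_kib = 150 * 4 * (per_minute + per_hour + per_day)
total_gib = total_kib / 1024 ** 2
```

With these assumptions, the per-minute and per-hour sizes land on the figures above, and the total comes out a little under 11 GiB. The daily archive comes out closer to 13 KiB than 16 KB; the difference may simply be the file size rounded up to whole filesystem blocks, and it does not change the total in any meaningful way.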

Monitoring is cheap

This is the part that I like the most about Graphite: it is definitely cheap. I know the questions I want to answer and the metrics I want to look at. Therefore, I can keep only a few months of raw data and store the aggregated information in Graphite for years.

In my previous example, I could store 3 years of aggregated data on a 16GB SD card in a Raspberry Pi. This is what I mean by "cheap": so cheap that no one would argue against it. Usually, project managers tend to approve three-digit expenses without even looking at them. Here we are talking about two-digit expenses.

Do you want to see how the response time evolves on that big enterprise project that ships only one new release to production every year? Simple: think about your queries and store the answers in Graphite!

Another alternative? Different thoughts about Graphite? Let’s talk about it in the comments!