2014-02-04

Better Latency Graphs

Nearly every tool I've used for graphing system metrics focuses on lines. If you want to know about the latency of your site, you often graph average, 50th percentile, and 90th percentile lines representing your load time. This gives you a good idea of what most people are experiencing.

For example:

The 50th percentile is the bottommost line. It's perfectly flat, showing that half of your page loads take less than 0.15sec or so. The average and the 90th percentile both have bumps in the same place, which means that at least 10% of your customers are seeing significantly longer page load times (longer than ~0.4sec). Imagine instead that the 90th percentile line was flat while the average still had bumps: that would mean fewer than 10% of your page loads were much slower. This is all good information, but it is still pretty sparse.
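To make that reasoning concrete, here's a minimal sketch of how the three lines behave for a single time slice. The sample load times are made up for illustration, and `percentile` uses the nearest-rank definition, which is an assumption; real graphing systems may interpolate differently.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(samples):
    """The three numbers a typical latency graph plots for one time slice."""
    return {
        "avg": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p90": percentile(samples, 90),
    }

# Hypothetical load times in seconds for three different time slices.
quiet = [0.15] * 10                 # every request is fast
bumpy = [0.15] * 8 + [0.9] * 2      # 20% of requests are slow
rare = [0.15] * 19 + [0.9]          # only 5% of requests are slow

for name, samples in (("quiet", quiet), ("bumpy", bumpy), ("rare", rare)):
    print(name, summarize(samples))
```

In the `bumpy` slice both the average and the 90th percentile rise. In the `rare` slice the average rises but the 90th percentile stays flat, because fewer than 10% of requests were slow.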

I want to see a system that lets me graph a histogram of page load times for every time slice.

Imagine two scenarios, one with a very simple page load profile. The following graph represents one time slice (say, a 60s window) of how many requests took how long.


Most requests take about 100ms. There's a long tail, but it's a pretty normal curve. This type of traffic is well represented by the avg/50th/90th type graph like above.

Ok, now imagine a different type of page profile. One of the pages in your site loads pretty quickly, but there's a second page that's also frequently loaded that takes about 3x longer.

Here's that site's load time profile:



If you graph the average, 50th%, and 90th% of these requests, it's nearly indistinguishable from the previous graph, despite the two profiles being very different.
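Here's a small, contrived sketch of that effect. The two sample sets are invented (they are not the data behind the graphs in this post) and were picked by hand so that the average, 50th, and 90th percentile come out identical, while a histogram with 100ms buckets (an arbitrary choice) tells them apart right away.

```python
import math
from collections import Counter

def percentile(samples, pct):
    """Nearest-rank percentile (an assumed definition; other tools may interpolate)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summary(samples):
    """(average, 50th percentile, 90th percentile) for one time slice."""
    return (sum(samples) / len(samples), percentile(samples, 50), percentile(samples, 90))

def histogram(samples, bucket_ms=100):
    """Sorted (bucket_start_ms, count) pairs for one time slice."""
    counts = Counter((s // bucket_ms) * bucket_ms for s in samples)
    return sorted(counts.items())

# Hypothetical load times in milliseconds for one time slice.
long_tail = [60, 80, 100, 100, 100, 160, 220, 280, 300, 600]   # one peak plus a tail
two_pages = [100] * 5 + [300] * 5                               # fast page and a 3x slower page

print(summary(long_tail), summary(two_pages))   # the same three numbers
print(histogram(long_tail))                     # one peak with a tail
print(histogram(two_pages))                     # two separate peaks
```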

I would like to see a framework for capturing this kind of difference. Instead of graphing each time slice as a 2D graph of count against latency like the one above, give each count a color (from light to dark) and draw each time slice as a vertical bar, similar to the ganglia graph posted up top. The result might look something like these:
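As a rough sketch of the data structure behind such a graph: bucket every request by time slice and by latency, then map each count to a shade. Everything named here (`SLICE_SECONDS`, `BUCKET_MS`, `build_grid`, `render`, and the sample data) is hypothetical, and the text rendering is only a stand-in for real colors.

```python
from collections import Counter

SLICE_SECONDS = 60   # width of one time slice (assumed)
BUCKET_MS = 100      # height of one latency bucket (assumed)
SHADES = " .:*#"     # light to dark

def build_grid(samples, n_slices, n_buckets):
    """Return grid[bucket][slice] = request count.

    samples are (seconds_since_start, latency_ms) pairs.
    """
    counts = Counter()
    for ts, latency_ms in samples:
        col = int(ts // SLICE_SECONDS)
        # Clamp the long tail into the top bucket so no request is dropped.
        row = min(int(latency_ms // BUCKET_MS), n_buckets - 1)
        if 0 <= col < n_slices:
            counts[(row, col)] += 1
    return [[counts[(r, c)] for c in range(n_slices)] for r in range(n_buckets)]

def render(grid):
    """Draw one character per cell: one column per time slice, slowest bucket on top."""
    peak = max(max(row) for row in grid) or 1
    top = len(SHADES) - 1
    lines = []
    for r in range(len(grid) - 1, -1, -1):
        # Round up so any nonzero count gets a visible shade.
        cells = "".join(SHADES[(count * top + peak - 1) // peak] for count in grid[r])
        lines.append(f"{r * BUCKET_MS:>5}ms |{cells}|")
    return "\n".join(lines)

# Hypothetical traffic: one fast page for two minutes, then a slower page shows up.
samples = (
    [(t, 120) for t in (5, 10, 20, 30)]
    + [(t, 130) for t in (65, 70, 80, 100)]
    + [(t, 110) for t in (125, 130)]
    + [(t, 340) for t in (140, 150)]
)
grid = build_grid(samples, n_slices=3, n_buckets=5)
print(render(grid))
```

With real data, the same `grid` could be handed to a plotting library to get actual colors, for example matplotlib's `imshow` or `pcolormesh`.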




These graphs make it easy to identify traffic patterns, but more importantly, they make changes in traffic patterns much easier to detect than an average/50th/90th graph does.

Do any of you know how to make these using open source tools?

[UPDATE] So far from Twitter: two suggestions that I use R, one that I look at statsite (a fork of statsd which supports histograms, though it appears statsd may also support histograms), and one expression of boredom, ennui, and Limn. I'm mostly ignoring the suggestions that I use R because they feel about as helpful as saying "you could use Python!" (though I know that's probably just because I don't know R). The statsd/statsite stuff looks interesting and bears further investigation. Last time I looked at statsd it couldn't do histograms. I'm pretty sure I won't find out anything about Limn until I go buy David some drinks.
