2014-02-26

Ganglia Web UI Performance Improvement for Large Clusters

There are many transitions people tend to make over time as their Ganglia cluster grows in order to maintain performance. Many of them have been covered in other blogs, but here is a brief summary of the scaling steps I usually take, from the initial cluster up through thousands of hosts and hundreds of thousands of metrics.

All notes in this post apply to gmond and gmetad version 3.5.0 and ganglia-webfrontend version 3.3.5.

The cluster starts out with gmond installed everywhere (configured for unicast) sending all data to the monitoring host (usually also running nagios). gmond listens there and collects all the metrics for my single cluster. gmetad runs there and writes this data to local disk.
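
For reference, the gmond.conf pieces behind that unicast setup look roughly like this (the hostname and port are placeholders, not from my actual config):

    # on every monitored host: send all metrics to the monitoring box
    udp_send_channel {
      host = monitor.example.com
      port = 8649
    }

    # on the monitoring host: receive the metrics and answer gmetad's TCP polls
    udp_recv_channel {
      port = 8649
    }
    tcp_accept_channel {
      port = 8649
    }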

Soon I outgrow a single cluster and so spin up multiple gmond processes, each configured to listen on a separate port. gmetad lists each gmond as a separate data_source and I have multiple clusters in the web UI.
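
In gmetad.conf that is just one data_source line per cluster, each pointing at the port its gmond listens on (cluster names and ports here are examples):

    data_source "web servers" localhost:8649
    data_source "databases"   localhost:8650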

Eventually the monitoring host becomes a bit overloaded, so ganglia (both gmetad and gmond) moves to a separate host, leaving nagios and all the other cruft to themselves.

The next step is moving the gmond collector processes to their own host. At this point I usually set up two gmond collector servers for redundancy. Each server has the same configuration - one gmond process per ganglia cluster, each listening on a separate port. The ganglia web UI and gmetad live on the same server, and both gmond collectors are listed on each data_source line. I also create two gmetad/web UI hosts at this point, again for redundancy: partly to preserve data if the disk dies on one, and partly to separate the host nagios talks to from the one people use as the web UI. Distributing traffic in this way helps the web UI stay snappy while letting nagios hammer the other host.
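
With two collectors, each data_source line simply lists both; gmetad polls the first and falls back to the second if it stops answering (hostnames here are placeholders):

    data_source "web servers" gmond-a.example.com:8649 gmond-b.example.com:8649
    data_source "databases"   gmond-a.example.com:8650 gmond-b.example.com:8650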

As the grid grows and the number of metrics increases, local disk on the gmetad/web UI host starts to fail to keep up. I think this happens around 70k metrics, but it depends on your disk. The solution at this point is either to install rrdcached or to switch to a ramdisk. rrdcached will get you a long way, but I think that past ~350k metrics local disk can't keep up even with rrdcached buffering the writes. There are undoubtedly knobs to tweak to push rrdcached much further, but just using a ramdisk works pretty well.
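
The ramdisk can be as simple as a tmpfs mount over the RRD directory (the path and size below are placeholders; remember tmpfs is wiped on reboot, so sync the RRDs back to real disk periodically):

    # /etc/fstab
    tmpfs  /var/lib/ganglia/rrds  tmpfs  rw,size=8g  0  0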

All of these steps scale the ganglia core very well. With redundant gmond processes, redundant gmetads, and some sort of RAM buffer (rrdcached) or ramdisk for the RRDs, you can get up to ~300k metrics pretty easily.

On to the meat of this post.

Somewhere between 250k and 350k metrics, the ganglia web UI started to slow down dramatically. By the time we got to ~330k metrics, it would take over 2 minutes to load the front page. gmetad claimed it was using ~130% CPU (over 100% because it's multithreaded and uses multiple cores), but there was still plenty of CPU available. Because the RRDs were on a ramdisk, there was no iowait slowing everything down. It wasn't clear where the bottleneck was.

Individual host pages would load relatively quickly (2-3s). Cluster views would load acceptably (~10s) but the front page and the Views page were unusable.

Vladimir suggested that even though there was CPU available, the sheer number of RRDs gmetad was managing was making it slow to respond to requests to its ports. He suggested running two gmetad processes, one configured to write out RRDs and the other available for the web UI.

A brief interlude - gmetad has two duties: manage the RRDs and respond to queries about the current state of the grid, clusters, and individual hosts. It listens on two TCP ports: the xml_port (default 8651) returns a dump of all current state, and the interactive_port (default 8652) answers queries about specific resources. The web UI queries gmetad for the current values of metrics and to determine which hosts are up, then reads the RRDs for the up hosts and clusters to render graphs for the end user.
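
You can poke at both ports by hand (ports are the gmetad defaults; the cluster and host names are placeholders):

    # dump everything gmetad currently knows, as one big XML document
    nc localhost 8651

    # ask the interactive port for just one cluster/host subtree
    echo "/web/web01.example.com" | nc localhost 8652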

By separating the two duties gmetad performs into two separate processes, there is no longer contention between managing the files on disk and responding to the web UI. Making this separation dropped the time needed to load the front page from 2m45s to 0m02s.

Here are the changes necessary to run two gmetad processes in this configuration (a sketch of the resulting files follows the list).
  • duplicate gmetad.conf (e.g. as gmetad-norrds.conf) and make the following changes:
    • use a new xml_port and interactive_port. I use 8661 and 8662 to mirror the defaults 8651 and 8652
    • specify a new rrd_rootdir. The directory must exist and be owned by the ganglia user (or whatever user runs gmetad), but it will remain empty. (This isn't strictly necessary, but it is good protection against mistakes.)
    • add the option 'write_rrds off'
  • duplicate the start script, or use your config management, to make sure two gmetad processes are running, one with each config file
  • edit /etc/ganglia-webfrontend/conf.php (or your local config file) and override the port used to talk to gmetad; specify 8661 instead of the default.
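
Put together, the second config and the web UI override might look like the sketch below. The rrd_rootdir path and the conf.php setting name are my best guesses for this version; everything else in gmetad-norrds.conf, including the data_source lines, is copied unchanged from gmetad.conf.

    # /etc/ganglia/gmetad-norrds.conf - a copy of gmetad.conf plus these overrides
    xml_port 8661
    interactive_port 8662
    rrd_rootdir "/var/lib/ganglia/rrds-empty"   # must exist, owned by the ganglia user; stays empty
    write_rrds off

    # /etc/ganglia-webfrontend/conf.php - point the web UI at the non-writing gmetad
    # (the exact setting name may differ between web UI versions)
    $conf['ganglia_port'] = 8661;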

That's it!

p.s. There are some claims that the current pre-release gmetad has many improvements in locking and in how it hashes hosts, so the performance problems that led to this solution may vanish with future releases.

2014-02-04

Better Latency Graphs

Nearly all the systems I've worked with for graphing metrics focus on lines. If you want to know about the latency of your site, you often graph average, 50th percentile, and 90th percentile lines representing your load times. This gives you a good idea of what most people are experiencing.

For example:

The 50th percentile is the bottom-most line. It's perfectly flat, showing that half of your page load times are less than 0.15sec or so. The average and 90th% both have bumps at the same place, which means that the slowest 10% of your page loads are taking significantly longer (more than ~0.4sec). Imagine instead that the 90th% line was flat while the average still had bumps - that would mean fewer than 10% of your page load times were much higher. This is all good information, but it is still pretty sparse.

I want to see a system that lets me graph a histogram of page load times for every time slice.

Imagine two scenarios, the first with a very simple page load profile. The following graph represents one time slice (say, a 60s window), showing how many requests took how long.


Most requests take about 100ms. There's a long tail, but it's a pretty normal curve. This type of traffic is well represented by an avg/50th/90th graph like the one above.

Ok, now imagine a different type of page profile. One of the pages on your site loads pretty quickly, but a second, also frequently loaded, page takes about 3x longer.

Here's that site's load time profile:



If you graph the average, 50th%, and 90th% of these requests, it's nearly indistinguishable from the previous graph, despite the two profiles being very different.

I would like to see a framework for capturing this kind of difference. Instead of graphing each time slice as count against latency in a 2D graph like the one above, give each count a color (from light to dark) and draw each time slice as a vertical bar, similar to the ganglia graph posted up top. The result might look something like these:




These graphs make it easy to identify traffic patterns, but more importantly, they let you detect changes in those patterns much more easily than an average/50th/90th graph does.
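
Here's a rough sketch of the kind of rendering I mean, using matplotlib and made-up data - the two-page workload, bin sizes, and greyscale color map below are all invented for the example, and this is a one-off script rather than the reusable framework I'm after:

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic traffic: 70% of requests hit a fast page (~100ms),
    # 30% hit a second page that is about 3x slower.
    rng = np.random.default_rng(42)
    n = 50_000
    timestamps = rng.uniform(0, 3600, n)          # one hour of requests
    latencies = np.where(rng.random(n) < 0.7,
                         rng.normal(0.10, 0.03, n),
                         rng.normal(0.30, 0.05, n))
    latencies = np.clip(latencies, 0.001, None)

    # Bucket into 60s time slices (x) and latency bins (y);
    # each cell's request count becomes its color.
    time_edges = np.arange(0, 3601, 60)
    latency_edges = np.linspace(0, 0.6, 61)
    counts, _, _ = np.histogram2d(timestamps, latencies,
                                  bins=[time_edges, latency_edges])

    plt.imshow(counts.T, origin="lower", aspect="auto", cmap="Greys",
               extent=[0, 3600, 0, 0.6])
    plt.xlabel("time (s)")
    plt.ylabel("page load time (s)")
    plt.title("latency histogram per 60s slice")
    plt.colorbar(label="requests")
    plt.show()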

Do any of you know how to make these using open source tools?

[UPDATE] So far from twitter, two suggestions that I use R, one that I look at statsite (a fork of statsd which supports histograms, though it appears maybe statsd also supports histograms) and one expression of boredom, ennui, and Limn. I'm mostly ignoring the suggestions that I use R because it feels like they're about the same level of helpfulness as saying "you could use Python!" (though I know that's probably just because I don't know R). The statsd/statsite stuff looks interesting and bears further investigation. Last time I looked at statsd it couldn't do histograms. I'm pretty sure I won't find out anything about Limn until I go buy David some drinks.