2014-02-26

Ganglia Web UI Performance Improvement for Large Clusters

There are a number of transitions people tend to make over time, as their Ganglia installation grows, in order to maintain performance. Many of them have been covered in other blogs, but here is a brief summary of the scaling steps I usually take, from the initial cluster up through thousands of hosts and hundreds of thousands of metrics.

All notes in this post apply to gmond and gmetad version 3.5.0 and ganglia-webfrontend version 3.3.5.

The cluster starts out with gmond installed everywhere (configured for unicast), sending all data to the monitoring host (usually also running nagios). The gmond on that host listens for and collects all the metrics for my single cluster; gmetad also runs there and writes the data out to RRDs on local disk.
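
A minimal sketch of the unicast gmond.conf at this stage. The hostname, cluster name, and port are placeholders; the send channel is what every node needs, while only the collector on the monitoring host needs the receive and accept channels:

    cluster {
      name = "my-cluster"
    }

    /* every node sends its metrics to the monitoring host */
    udp_send_channel {
      host = mon01.example.com
      port = 8649
    }

    /* the collector listens here and answers gmetad's TCP polls */
    udp_recv_channel {
      port = 8649
    }

    tcp_accept_channel {
      port = 8649
    }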

Soon I outgrow a single cluster, so I spin up multiple gmond processes, each configured to listen on a separate port. gmetad lists each gmond as a separate data_source, and I get multiple clusters in the web UI.
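
At that point, gmetad.conf on the monitoring host looks something like this sketch (cluster names and ports are made up; each port matches the udp_recv_channel and tcp_accept_channel of the gmond listener for that cluster):

    data_source "webservers" localhost:8649
    data_source "databases"  localhost:8650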

Eventually the monitoring host becomes a bit overloaded, so ganglia (both gmetad and gmond) moves to a separate host, leaving nagios and all the other cruft there to itself.

The next step is moving the gmond collector processes to their own hosts. At this point I usually set up two gmond collector servers for redundancy. Each server has the same configuration: one gmond process per ganglia cluster, each listening on a separate port. The ganglia web UI and gmetad live on the same server, and both gmond collectors are listed on each data_source line. I also create two gmetad/web UI hosts at this point, again for redundancy: partly to preserve the data if the disk dies on one, and partly to separate the instance nagios talks to from the one I use as the web UI. Distributing traffic in this way helps the web UI stay snappy for people while letting nagios hammer the other host.
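
With two dedicated collectors, each data_source line lists both of them, so gmetad falls back to the second if the first stops answering (hostnames here are placeholders):

    data_source "webservers" gmond-collector01:8649 gmond-collector02:8649
    data_source "databases"  gmond-collector01:8650 gmond-collector02:8650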

As the grid grows and the number of metrics increases, local disk on the gmetad/web UI host starts failing to keep up. I think this happens around 70k metrics, but it depends on your disk. The solution at this point is either to install rrdcached or to switch to a ramdisk. rrdcached will get you a long way, but I think that above ~350k metrics local disk can't keep up even with rrdcached batching the writes. There are undoubtedly knobs to tweak to push rrdcached much further, but just using a ramdisk works pretty well.
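
If you go the ramdisk route, a tmpfs mount over the RRD directory is all it takes. Here is a sketch of an /etc/fstab entry, assuming the common default rrd_rootdir of /var/lib/ganglia/rrds (the size is a guess for your metric count; remember tmpfs contents vanish on reboot, so copy the RRDs somewhere durable if you care about history):

    tmpfs  /var/lib/ganglia/rrds  tmpfs  size=8g,mode=0755  0  0

After mounting, chown the directory to whatever user gmetad runs as so it can create the RRDs.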

All of these steps scale the ganglia core very well. With redundant gmond collectors, redundant gmetads, and some sort of RAM buffer or fast disk for the RRDs, you can get up to ~300k metrics pretty easily.

On to the meat of this post.

Somewhere between 250k and 350k metrics, the ganglia web UI started to slow down dramatically. By the time we got to ~330k metrics, it took over 2 minutes to load the front page. gmetad claimed to be using ~130% CPU (over 100% because it is multithreaded and uses multiple cores), but there was still plenty of CPU available. Because the RRDs were on a ramdisk, there was no iowait slowing everything down. It wasn't clear where the bottleneck was.

Individual host pages would load relatively quickly (2-3s), and cluster views would load acceptably (~10s), but the front page and the Views page were unusable.

Vladimir suggested that even though there was CPU available, the sheer number of RRDs gmetad was managing was making it slow to respond to requests on its ports. He suggested running two gmetad processes: one configured to write out the RRDs, and the other dedicated to answering the web UI.

A brief interlude: gmetad has two duties, managing the RRDs and responding to queries about the current status of the grid, clusters, and individual hosts. It listens on two TCP ports: one (xml_port, default 8651) gives a dump of all current state, and the other (interactive_port, default 8652) lets a client query for specific resources. The web UI queries gmetad for the current values of metrics and to determine which hosts are up, then reads the RRDs for the up hosts and clusters to present to the end user.
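
You can poke at both ports directly with nc: the xml_port returns a dump of everything gmetad knows, while the interactive_port takes a path-style request for a specific cluster or host (the cluster and host names below are placeholders):

    # full XML dump of the whole grid
    nc localhost 8651

    # XML for just one cluster, or for one host within it
    echo "/webservers" | nc localhost 8652
    echo "/webservers/web01.example.com" | nc localhost 8652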

Splitting gmetad's two duties into two separate processes removes the contention between managing the files on disk and responding to the web UI. Making this separation dropped the time needed to load the front page from 2m45s to 0m02s.

Here are the changes necessary to run two gmetad processes in this configuration; a sketch of the resulting config follows the list.
  • duplicate gmetad.conf (e.g. gmetad-norrds.conf) and make the following changes:
    • use a new xml_port and interactive_port. I use 8661 and 8662 to mirror the defaults 8651 and 8652.
    • specify a new rrd_rootdir. The dir must exist and be owned by the ganglia user (or whatever user runs gmetad), but it will remain empty. (This isn't strictly necessary, but it's good protection against mistakes.)
    • add the option 'write_rrds off'
  • duplicate the start script or use your config management to make sure two gmetad processes are running, one with each config file
  • edit /etc/ganglia-webfrontend/conf.php (or your local config file) and override the port used to talk to gmetad; specify 8661 instead of the default.
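
Putting that list together, here is a sketch of the second config and the web UI override. The filename, rrd_rootdir path, and ports are just the examples used above, and the conf.php key shown ($conf['ganglia_port']) is an assumption that may differ between web frontend versions:

    # /etc/ganglia/gmetad-norrds.conf -- same data_source lines as gmetad.conf, plus:
    xml_port 8661
    interactive_port 8662
    rrd_rootdir "/var/lib/ganglia/rrds-unused"
    write_rrds off

    # start the second process against the new config
    # (or duplicate the packaged init script to do the same thing)
    gmetad -c /etc/ganglia/gmetad-norrds.conf

    # /etc/ganglia-webfrontend/conf.php -- point the web UI at the RRD-less gmetad
    # (one commenter below reports needing 8662, the interactive_port, with gweb2)
    $conf['ganglia_port'] = 8661;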

That's it!

p.s. There are some claims that the current pre-release gmetad has many improvements in locking and in how it hashes hosts, so the performance problems that led to this solution may vanish in future releases.

5 comments:

Unknown said...

Thank you for the improvement. I've applied it and it works fine.

My question is: Shall we duplicate the data_source directives in the new gmetad instance? Because I am noticing a duplication of network traffic in the gmetad box after the change. Aren't we polling each gmond twice with this improvement?

I tried not to duplicate those directives, but the web UI home page doesn't show the clusters. Maybe the network traffic overhead is a necessary evil.

ben said...

Yes, it does duplicate the network traffic to the gmond collectors. While this is unfortunate, network bandwidth (at least in my infrastructure) is not a bottleneck for ganglia data.

You're correct that if you don't duplicate the data_source directives in each gmetad.conf file, it won't work. Both gmetad instances need all the data for this trick to work.

I've heard rumors that the latest ganglia builds have a number of improvements that might make this approach unnecessary, though I haven't tested them myself.

Unknown said...

Thanks for the clarification.

I've been running gmetad 3.7.0 (pre-release) for a week, and yes, there are fewer gaps in my big clusters. But the slow front page loading is still an issue. In fact, that's the main reason I applied this improvement.

Nova Leo said...

Thanks for this cool article. And I have one question:
>>edit /etc/ganglia-webfrontend/conf.php (or your local config file) and override the port used to talk to gmetad; specify 8661 instead of the default.

In gweb2's conf, it uses port 8652 by default, not 8651. So we should use 8662 instead of 8661 in your case?

If I use 8661, the ganglia web UI will not show summary values like CPUs Total / Hosts Up / Hosts Down.
8662 seems OK.

Unknown said...

Thank you very much, it is really helpful for us.