Friday, May 15, 2009

Zenoss zencommand daemon overload

It looks like the zencommand daemon in Zenoss Core has easily reachable limits on the number of commands it can process. We have been increasing our command monitoring lately and got to a point where zencommand, the daemon responsible for running these types of monitoring functions, had 1,200 commands in each cycle. I noticed that some of the performance templates that graph the counters fetched by our commands had gaps--sometimes large gaps--and some had quit graphing entirely. The odd thing was that we didn't get any warnings (our VP went looking for graphed data and, ummm, didn't find it).

If I took one of the devices whose templates weren't graphing and manually ran zencommand against just that device, it worked perfectly and fetched all the counters from the various commands. Running it under strace didn't show any errors either. But under the load we were putting on it, the daemon was definitely dropping commands silently.

Zencommand seems to be a rather single-threaded beast. Its ability to get everything done is a function of the following:
  • The number of data sources it is processing
  • The number of monitored devices
  • The cycle time of the data sources it is processing
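
As a rough illustration of how those three factors interact, here is a back-of-the-envelope sketch in Python. It is not how zencommand schedules work internally; it assumes a simple model in which commands run one after another (or across a hypothetical `concurrency` number of slots), and the average command duration is a made-up input you would have to measure yourself.

```python
def required_rate(commands, cycle_seconds):
    """Commands per second the daemon must sustain to finish every cycle."""
    return commands / cycle_seconds


def cycle_utilization(commands, avg_seconds_per_command, cycle_seconds,
                      concurrency=1):
    """Fraction of one cycle spent running commands.

    Assumes a simplified model: every command takes the same time and
    work is spread evenly over `concurrency` slots. A value above 1.0
    means the cycle cannot finish before the next one is due.
    """
    busy_seconds = commands * avg_seconds_per_command / concurrency
    return busy_seconds / cycle_seconds


if __name__ == "__main__":
    # 1,200 and 560 are the command counts from this post; the 60-second
    # cycle and the 0.1 s average duration are illustrative assumptions.
    for n in (1200, 560):
        print(n, "commands:",
              round(required_rate(n, 60), 2), "commands/sec,",
              "utilization", round(cycle_utilization(n, 0.1, 60), 2))
```

With numbers like these, anything that pushes utilization past 1.0 means some commands cannot run in their cycle, which would look a lot like the gaps described above.
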
I was able to collapse some of our data sources so that instead of 1,200 commands I got down to around 560, and voila--the graphs that had not been painting suddenly began working correctly. To avoid this issue, I recommend the following:
  • The native Device template uses SNMP (and SNMP Informant on the Windows side) to read the base CPU, memory, and paging counters. To avoid deploying SNMP Informant everywhere, some time ago we switched to a different template that reads these counters through zencommand and remote WMI calls. I am going to switch back to Device, which will take quite a bit of load off zencommand.
  • Watch the cycle time. Does anyone have QoS recommendations for the resolution of performance counters? 60-second cycle times are a bit aggressive, but what works well--3 minutes? 5 minutes?
  • Always validate that the graphs are painting after making changes affecting zencommand. If you roll out a new template, don't just look at that template to make sure it's working--look at other things zencommand is handling after you roll it out. As we discovered, adding too much load can silently break other things.
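
One way to automate that last check is to look for RRD files that have stopped being updated. The sketch below is only an assumption about how you might do it: it treats a file's modification time as a proxy for "still being written", and the directory you point it at (wherever your install keeps its performance RRD files) is something you would need to supply. The function name and the `missed_cycles` threshold are hypothetical.

```python
import os
import time


def find_stale_rrds(root, cycle_seconds, missed_cycles=3, now=None):
    """Return sorted paths of .rrd files not modified recently.

    A file counts as stale when its mtime is older than
    missed_cycles * cycle_seconds. The mtime is only a proxy for
    "data is still arriving"; it does not inspect the RRD contents.
    """
    now = time.time() if now is None else now
    cutoff = now - missed_cycles * cycle_seconds
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".rrd"):
                path = os.path.join(dirpath, name)
                if os.path.getmtime(path) < cutoff:
                    stale.append(path)
    return sorted(stale)


# Example (the path is a placeholder, not a known Zenoss location):
# for path in find_stale_rrds("/path/to/perf", cycle_seconds=60):
#     print("stale:", path)
```

Running something like this after each template rollout would flag data sources that quietly stopped collecting, including ones unrelated to the template you just changed.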

1 comment:

  1. The Zenoss daemons are indeed single-threaded, a limitation you can get around with multiple collectors. Typically you hit the limitations by only using 1 or 2 protocols. The default resolution for the RRD graphs is set by the collection interval; for example, SNMP collects every 5 minutes (set at the collector), so the RRD step is 300 seconds. zencommand's cycle time is set via a zProperty, so you may want to bump it up to 300 from 60. You'll have to remove your existing RRD files to get the new step, or use one of the rrdtools to update it manually.

    Matt Ray
    Zenoss Community Manager