I think we have found that the zencommand daemon in Zenoss Core has some very reachable limits with regard to the number of commands it will process. We have been increasing our command monitoring lately and got to a point where zencommand, the daemon responsible for running these types of monitoring functions, had 1,200 commands in each cycle. I noticed that some of the performance templates that graphed the counters fetched by our commands had gaps--sometimes large gaps--and some had simply quit entirely. The odd thing was that we didn't really get any warnings (our VP went looking for graphed data and, ummm, didn't find it).
If I took one of the devices with templates that weren't graphing and manually ran zencommand against that device, it worked perfectly and fetched all the counters from the various commands. Stracing it didn't show any errors either. But with the amount of load we were providing, it was definitely silently dropping commands.
Zencommand seems to be a rather single-threaded beast. Its ability to get everything done is a function of the following:
If I took one of the devices with templates that weren't graphing and manually ran zencommand against that device, it worked perfectly and fetched all the counters from the various commands. Stracing it didn't show any errors either. But with the amount of load we were providing, it was definitely silently dropping commands.
Zencommand seems to be a rather single-threaded beast. Its ability to get everything done is a function of the following:
- The number of data sources it is processing
- The number of monitored devices
- The cycle time of the data sources it is processing
- The native Device template uses SNMP (and SNMP Informant on the Windows side) to read the base CPU, memory, and paging counters. To avoid deploying SNMP Informant everywhere, some time ago we had changed to using a different template that used zencommand and remote WMI calls to read these in. I am going to change this back to using Device, which will take quite a bit of load off of zencommand.
- Watch the cycle time. Does anyone have QoS recommendations for the resolution of performance counters? 60 second cycle times are a bit aggressive, but what works well--3 minutes? 5 minutes?
- Always validate that the graphs are painting after making changes affecting zencommand. If you roll out a new template, don't just look at that template to make sure it's working--look at other things zencommand is handling after you roll it out. As we discovered, adding too much load can silently break other things.