/dev/arthur: zenoss

Wednesday, March 17, 2010

Error moving the Zenoss database to a new MySQL server

We recently grew our Zenoss 2.3.3 installation by installing a dedicated MySQL server and moving the events database to that server. Events were being collected and written, but we could not move any events to history, and the zenactions log showed the following:

2010-03-17 05:31:26 ERROR zen.Events: (1449, "The user specified as a definer ('zenoss'@'localhost') does not exist")

It turns out this error was coming from MySQL. There is a trigger, status_delete, on the events.status table that controls moving events to history; apparently, the application code is designed to simply delete the status table rows when instructed, and the trigger then moves the rows to the history table. The trigger has a definer specified that requires the zenoss user to be @localhost (yes, the DEFINER clause sits inside a comment):

DELIMITER $$

CREATE
    /*!50017 DEFINER = 'zenoss'@'localhost' */
    TRIGGER `status_delete` BEFORE DELETE ON `status`
    FOR EACH ROW INSERT INTO history SET
            dedupid=OLD.dedupid,
            evid=OLD.evid,
            device=OLD.device,
            component=OLD.component,
            eventClass=OLD.eventClass,
            eventKey=OLD.eventKey,
            summary=OLD.summary,
            message=OLD.message,
            severity=OLD.severity,
            eventState=OLD.eventState,
            eventClassKey=OLD.eventClassKey,
            eventGroup=OLD.eventGroup,
            stateChange=OLD.stateChange,
            firstTime=OLD.firstTime,
            lastTime=OLD.lastTime,
            COUNT=OLD.count,
            prodState=OLD.prodState,
            suppid=OLD.suppid,
            manager=OLD.manager,
            agent=OLD.agent,
            DeviceCLass=OLD.DeviceClass,
            Location=OLD.Location,
            Systems=OLD.Systems,
            DeviceGroups=OLD.DeviceGroups,
            ipAddress=OLD.ipAddress,
            facility=OLD.facility,
            priority=OLD.priority,
            ntevid=OLD.ntevid,
            ownerid=OLD.ownerid,
            deletedTime=NULL,
            clearid=OLD.clearid,
            DevicePriority=OLD.DevicePriority,
            eventClassMapping=OLD.eventClassMapping,
            monitor=OLD.monitor;
$$

DELIMITER ;

I fixed it by changing it to the following, and then dropping and recreating the trigger:

DELIMITER $$

CREATE
    TRIGGER `status_delete` BEFORE DELETE ON `status`
    FOR EACH ROW INSERT INTO history SET
            dedupid=OLD.dedupid,
            evid=OLD.evid,
            device=OLD.device,
            component=OLD.component,
            eventClass=OLD.eventClass,
            eventKey=OLD.eventKey,
            summary=OLD.summary,
            message=OLD.message,
            severity=OLD.severity,
            eventState=OLD.eventState,
            eventClassKey=OLD.eventClassKey,
            eventGroup=OLD.eventGroup,
            stateChange=OLD.stateChange,
            firstTime=OLD.firstTime,
            lastTime=OLD.lastTime,
            COUNT=OLD.count,
            prodState=OLD.prodState,
            suppid=OLD.suppid,
            manager=OLD.manager,
            agent=OLD.agent,
            DeviceCLass=OLD.DeviceClass,
            Location=OLD.Location,
            Systems=OLD.Systems,
            DeviceGroups=OLD.DeviceGroups,
            ipAddress=OLD.ipAddress,
            facility=OLD.facility,
            priority=OLD.priority,
            ntevid=OLD.ntevid,
            ownerid=OLD.ownerid,
            deletedTime=NULL,
            clearid=OLD.clearid,
            DevicePriority=OLD.DevicePriority,
            eventClassMapping=OLD.eventClassMapping,
            monitor=OLD.monitor;
$$

DELIMITER ;

I did try just dropping and recreating the trigger without the change, but the "commented out" definer was still responsible for our error: MySQL executes the contents of a /*!50017 ... */ comment on server versions 5.0.17 and later, so the DEFINER clause was applied anyway. I duplicated the same error when trying to delete events from the mysql client directly as user zenoss. I then removed the comment completely and recreated the trigger, and the “move to history” function began working.

Note that the age_events stored procedure also has a localhost definer. If you script it out and change the definer specification to the IP of the Zenoss server, that too will function.
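
If you script out several objects, the definer edit can be applied to the dump text before you reload it. Here is a minimal Python sketch of that idea. It is not part of Zenoss or MySQL; the function names are my own, the IP address in the usage note is made up, and it assumes the DEFINER clause uses the quoted 'user'@'host' form shown in the trigger above.

```python
import re

# Matches a clause such as  DEFINER = 'zenoss'@'localhost'
# (single quotes or backticks around the user and the host).
_DEFINER = r"DEFINER\s*=\s*(['`])(?P<user>[^'`]+)\1@(['`])(?P<host>[^'`]+)\3"

# The same clause wrapped in a version-conditional comment: /*!50017 ... */
_WRAPPED = re.compile(r"/\*!\d{5}\s+" + _DEFINER + r"\s*\*/\s*", re.IGNORECASE)
_BARE = re.compile(_DEFINER, re.IGNORECASE)


def strip_definer(sql):
    """Remove version-commented DEFINER clauses, as in the fixed trigger."""
    return _WRAPPED.sub("", sql)


def set_definer_host(sql, new_host):
    """Keep the DEFINER user but point it at new_host (hypothetical helper)."""
    def _swap(match):
        return "DEFINER = '%s'@'%s'" % (match.group("user"), new_host)
    return _BARE.sub(_swap, sql)
```

For example, set_definer_host(dump_text, "10.0.0.5") would rewrite every 'zenoss'@'localhost' definer to 'zenoss'@'10.0.0.5', while strip_definer(dump_text) drops the commented clause entirely. Review the result before loading it into the server.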

Monday, March 08, 2010

Deleting all devices from Zenoss with zendmd

We upgraded our Zenoss 2.3.3 installation to 2.5.2 in a test environment, but wanted to retain only a few of the hundreds of devices that were monitored by the system. We tried deleting these from the web UI, but it would not complete--it always timed out and never seemed to delete anything. I then turned to a zendmd script, which worked wonders:

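# Run inside zendmd as the zenoss user. This deletes EVERY device.
for dev in dmd.Devices.getSubDevices():
    print dev.id
    dev.deleteDevice()
commit()   # persist the deletions
reindex()  # reindex the catalogs
commit()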

This ran quickly and took out all of our devices. We then added back in the ones we wanted. Note: this does not delete performance graphs or events for the affected devices, so you can add them back in and still have this information available.
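
If you would rather keep a handful of devices than add them back afterwards, the loop can skip a keep-list. This is a sketch under assumptions, not something I ran against our installation: the function name and the keep-list are hypothetical, and it relies only on the .id attribute and deleteDevice() method used in the script above.

```python
def delete_devices_except(devices, keep_ids):
    """Delete every device whose id is not in keep_ids.

    devices  -- iterable of objects exposing .id and .deleteDevice(),
                for example dmd.Devices.getSubDevices() inside zendmd
    keep_ids -- ids of the devices to retain (hypothetical names)
    Returns the list of ids that were deleted.
    """
    keep = set(keep_ids)
    deleted = []
    # Copy to a list first so deletions cannot disturb the iteration.
    for dev in list(devices):
        dev_id = dev.id
        if dev_id in keep:
            continue
        dev.deleteDevice()
        deleted.append(dev_id)
    return deleted
```

In zendmd you would call it as delete_devices_except(dmd.Devices.getSubDevices(), ["device1", "device2"]) and then run commit(), reindex(), and commit() exactly as in the original script. The function itself never commits.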

Monday, September 07, 2009

Best practice around Zenoss template bindings

After working with Zenoss Core for a few years now, I wanted to pass along a tip that would have saved me a lot of time had I known it in the beginning: do not bind performance templates to individual devices, or make local modified copies of templates on devices. Once you do this, the device no longer inherits template bindings from the parent device class, but you have no good way to know that this condition exists unless you drill into the device and examine its template bindings.

When you move a device from one class to another, if it lacks local bindings or overridden templates, it will inherit the bindings of the new class. However, this does not occur if you have made any local changes as above. Also, if you bind new performance templates to its parent, it ignores these new bindings if you have made any local changes. I highly recommend that if you need local changes to a device, make a new device class and make the template binding changes to the class. Then, move the device to the new class, and it will pick up the new bindings.

To clear this condition on a device once you have made local changes, you must reset bindings to be those of its container. On the device, go to More | Templates, then use the drop-down arrow next to "Performance templates for device," and finally Reset Bindings.

Tuesday, July 28, 2009

Zenoss 2.3.3 error moving device organizers

I encountered an interesting error today with Zenoss 2.3.3 (yes, 2.4 has been out for some time, but we've been having trouble getting a clean upgrade process). I had the following device classes with devices:

/Server/Windows/Sync/DatacenterA
- device1
- device2
- device3
/Server/Windows/Sync/DatacenterB
- device4
- device5
- device6

As these were all production machines, I wanted to differentiate them from the other dev and QA machines. I created the organizer /Server/Windows/Sync/Production and moved the ../DatacenterA and ../DatacenterB organizers with their devices under the new /Server/Windows/Sync/Production and...

...everything broke for the moved devices. All zencommand and zenperfsnmp-based monitoring just stopped cold. When starting zenperfsnmp in debug mode, I noticed errors on its first run like:

2009-07-27 17:15:45 WARNING zen.zenperfsnmp: Error loading config for devices ['device1', 'device2']
2009-07-27 17:15:45 WARNING zen.zenperfsnmp: Error loading config for devices ['device3', 'device4']
2009-07-27 17:15:45 WARNING zen.zenperfsnmp: Error loading config for devices ['device5', 'device6']

To correct this, I ended up moving the organizers back to their original locations, creating new organizers with the structure I wanted, and moving the individual devices (not the organizers) to the new organizer structure.

Now if I could just get this issue fixed I'd have no more zenperfsnmp/zencommand mysteries!

Friday, May 15, 2009

Zenoss zencommand daemon overload

I think we have found that the zencommand daemon in Zenoss Core has some very reachable limits with regard to the number of commands it will process. We have been increasing our command monitoring lately and got to a point where zencommand, the daemon responsible for running these types of monitoring functions, had 1,200 commands in each cycle. I noticed that some of the performance templates that graphed the counters fetched by our commands had gaps--sometimes large gaps--and some had simply quit entirely. The odd thing was that we didn't really get any warnings (our VP went looking for graphed data and, ummm, didn't find it).

If I took one of the devices with templates that weren't graphing and manually ran zencommand against that device, it worked perfectly and fetched all the counters from the various commands. Stracing it didn't show any errors either. But with the amount of load we were providing, it was definitely silently dropping commands.

Zencommand seems to be a rather single-threaded beast. Its ability to get everything done is a function of the following:

The number of data sources it is processing
The number of monitored devices
The cycle time of the data sources it is processing

I was able to collapse some of our data sources so that instead of 1,200 commands I got down to around 560, and voila--the graphs that had not been painting suddenly began working correctly. To avoid this issue, I recommend the following:

The native Device template uses SNMP (and SNMP Informant on the Windows side) to read the base CPU, memory, and paging counters. To avoid deploying SNMP Informant everywhere, some time ago we had changed to using a different template that used zencommand and remote WMI calls to read these in. I am going to change this back to using Device, which will take quite a bit of load off of zencommand.
Watch the cycle time. Does anyone have QoS recommendations for the resolution of performance counters? 60 second cycle times are a bit aggressive, but what works well--3 minutes? 5 minutes?
Always validate that the graphs are painting after making changes affecting zencommand. If you roll out a new template, don't just look at that template to make sure it's working--look at other things zencommand is handling after you roll it out. As we discovered, adding too much load can silently break other things.
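
To get a feel for how command count and cycle time interact, here is a rough back-of-the-envelope sketch. It assumes a strictly serial collector, which is only my impression of zencommand and not something confirmed from its source, and the average command duration is a number you would have to measure yourself.

```python
def cycle_utilization(num_commands, avg_seconds_per_command, cycle_seconds):
    """Fraction of one cycle a strictly serial collector spends running commands.

    A value above 1.0 means the commands cannot all finish before the
    next cycle starts, so some would have to be skipped or delayed.
    This is a simplified model, not zencommand's actual scheduling.
    """
    return (num_commands * avg_seconds_per_command) / float(cycle_seconds)
```

With a hypothetical 0.1 seconds per command and a 60-second cycle, 1,200 commands come out at roughly 2.0 (twice what fits in a cycle), while 560 commands come out at roughly 0.93. Those inputs are illustrative only; we never measured them.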

Friday, December 19, 2008

Zenoss 2.3.2 LDAP authentication with Ubuntu 8.04 and the stack installer

I was able to get the Active Directory authentication module loaded for our Ubuntu Server 8.04 stack installer-based Zenoss 2.3.2 installation. There is a bit of confusion about how to do this, as the wiki instructions for setup assume you are using the RPM-based installer or have installed from source. It turned out not to be too difficult, given that the Ubuntu 8.04 distribution comes with the python-ldap package. In summary, you need to link the distribution's installed python-ldap components into the site packages path for Zenoss's local Python 2.4 runtime and compile them. Here are the steps (these assume you have already downloaded and placed the LDAPUserFolder and LDAPMultiPlugins packages in the path identified in the wiki instructions):

Install python-ldap
(As root)

aptitude install python-ldap

Link python-ldap components to Zenoss's site packages path
We need the _ldap.so binary compiled against Python 2.4 and the source files. As the zenoss user:

#The Zenoss local Python site package path is $ZENHOME/lib/python!
cd $ZENHOME/lib/python
mkdir ldap
mkdir ldap/schema
ln -s /usr/share/pyshared/ldif.py
ln -s /usr/share/pyshared/ldapurl.py
ln -s /usr/lib/python2.4/site-packages/_ldap.so
cd ldap
ln -s /usr/share/pyshared/ldap/async.py
ln -s /usr/share/pyshared/ldap/controls.py
ln -s /usr/share/pyshared/ldap/filter.py
ln -s /usr/share/pyshared/ldap/__init__.py
ln -s /usr/share/pyshared/ldap/modlist.py
ln -s /usr/share/pyshared/ldap/cidict.py
ln -s /usr/share/pyshared/ldap/dn.py
ln -s /usr/share/pyshared/ldap/functions.py
ln -s /usr/share/pyshared/ldap/ldapobject.py
ln -s /usr/share/pyshared/ldap/sasl.py
cd schema
ln -s /usr/share/pyshared/ldap/schema/__init__.py
ln -s /usr/share/pyshared/ldap/schema/models.py
ln -s /usr/share/pyshared/ldap/schema/subentry.py
ln -s /usr/share/pyshared/ldap/schema/tokenizer.py
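
If you need to repeat the linking on more than one server, the same thing can be scripted. The following is a minimal sketch rather than what I actually ran: link_files is a hypothetical helper name, and the source and target paths in the usage note are simply the ones from the commands above.

```python
import os


def link_files(source_dir, target_dir, filenames):
    """Symlink each named file from source_dir into target_dir.

    Creates target_dir if needed and skips names that already exist there.
    Returns the list of link paths that were created.
    """
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    created = []
    for name in filenames:
        source = os.path.join(source_dir, name)
        link = os.path.join(target_dir, name)
        if os.path.lexists(link):
            continue  # leave an existing file or link untouched
        os.symlink(source, link)
        created.append(link)
    return created
```

For the schema directory, the call would look like link_files("/usr/share/pyshared/ldap/schema", os.path.join(os.environ["ZENHOME"], "lib/python/ldap/schema"), ["__init__.py", "models.py", "subentry.py", "tokenizer.py"]), run as the zenoss user.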

Compile .py files
Now that we have the files linked in from the global shared Python path (where the python-ldap deb installer put them), we need to compile all of the .py files using Zenoss's local python 2.4 installation:

cd $ZENHOME/lib/python
python /usr/local/zenoss/python/lib/python2.4/py_compile.py ldif.py
python /usr/local/zenoss/python/lib/python2.4/py_compile.py ldapurl.py
cd ldap
python /usr/local/zenoss/python/lib/python2.4/py_compile.py *.py
cd schema
python /usr/local/zenoss/python/lib/python2.4/py_compile.py *.py

Now that everything is compiled, restart zope (as zenoss, zopectl restart) and you can proceed with the rest of the instructions in the above wiki article. You will now see the ActiveDirectory Multi Plugin in the plugin list on the http://zenoss-installation:8080/zport/acl_users/manage_workspace page.

Tuesday, April 15, 2008

Make zenwin and zenwinmodeler ignore WMI errors

(This tip is also on the Zenoss wiki.)

At least in version 2.1.1, zenwin, zenwinmodeler, and zeneventlog have (IMO) a critical defect: if there are any /Status/WMI/Conn issues not in history for the device, they ignore the device. On our network, for some reason we end up with a lot of these events ('timegenerated' errors, various intermittent failures to connect, etc.). This causes the monitoring of our Windows servers to dramatically fall off as the system runs, and we miss critical issues.

I changed the behavior of these three systems to go ahead and attempt monitoring even if WMI issues are encountered. I learned that most of the time these WMI issues are spurious and successful monitoring CAN still be attempted. If you use this code, I recommend combining it with event commands to restart the zenoss daemons when it finds them dead.

Also, in zenwin, I added/improved the exception handling; a failure to create the watcher object occurs outside of a try block. Much of this code is an attempt to keep zenwin from crashing if it tries to monitor a Windows Server 2008 machine (Zenoss is not compatible with WS 2008 or Vista's WMI interface, and zenwin cannot monitor services on these devices). I ended up adding a hardcoded exclusion list so I can otherwise monitor the machine but have zenwin skip it. For some reason, zeneventlog seems to not crash, although it is not able to retrieve events from the WS 2008 machine either.

Please see the Zenoss wiki for the zenwin and zenwinmodeler diffs.

Sunday, April 06, 2008

Find Zenoss event classes with transforms

(This tip is also on the Zenoss wiki.)

If you have entered transforms but can't remember where you entered them, type the following in zendmd (run zendmd from the command line on the Zenoss server as the zenoss user):

>>> for ec in dmd.Events.getSubOrganizers():
...     if ec.transform:
...         print ec.getOrganizerName()
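
The same loop can be wrapped in a small function that returns the names instead of printing them. This is a sketch: the function name is my own, and it uses only the getSubOrganizers(), transform, and getOrganizerName() calls from the snippet above.

```python
def organizers_with_transforms(root):
    """Return the names of sub-organizers under root that have a transform.

    root is expected to behave like dmd.Events: getSubOrganizers() yields
    organizers with a .transform attribute and a getOrganizerName() method.
    """
    names = []
    for organizer in root.getSubOrganizers():
        # An empty string or None means no transform was entered.
        if getattr(organizer, "transform", None):
            names.append(organizer.getOrganizerName())
    return sorted(names)
```

In zendmd, organizers_with_transforms(dmd.Events) gives you a sorted list to work through. Note that it only looks at event classes; transforms entered on individual event class mappings are not covered.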

Saturday, April 05, 2008

Custom Zenoss graph based on multiple data points

(This tip is also on the Zenoss Wiki.)

If you want to make a custom graph in Zenoss based on more than one data point (such as a ratio or other calculation), you will need to enter a custom graph definition for RRDTool to use. I found some good guides on how to define graphs with RRDTool (such as this tutorial on CDEF and others at that site), but it took me a while to put this together with the available data points and variables in Zenoss so the graph would work.

Edit the performance template to which you wish to add the graph. Click the drop-down arrow next to Graph Definitions and choose "Add Graph..." and name it.

Click on the Graph Custom Definition tab and you are presented with a blank slate for your new graph's definition. It may be easiest to start with an example. I entered the following custom graph definition:

DEF:BusyThreads-raw=${here/fullRRDPath}/appThreads_BusyThreads.rrd:ds0:AVERAGE
DEF:RequestsPerSecond-raw=${here/fullRRDPath}/appThreads_RequestsPerSecond.rrd:ds0:AVERAGE
DEF:AppCurrentConnections-raw=${here/fullRRDPath}/currentConnections_appCurrentConnections.rrd:ds0:AVERAGE
CDEF:connectionsToThreads=AppCurrentConnections-raw,1,RequestsPerSecond-raw,BusyThreads-raw,+,+,/
LINE:connectionsToThreads#00cc00:"Connections to Threads/Thread Activity Ratio"
GPRINT:connectionsToThreads:LAST:cur\:%5.2lf%s
GPRINT:connectionsToThreads:AVERAGE:avg\:%5.2lf%s
GPRINT:connectionsToThreads:MAX:max\:%5.2lf%s\j

To break this apart, I have two data sources and three data points involved in my ratio that are part of the performance template with this graph:

Data source: appThreads
This data source has two data points, BusyThreads and RequestsPerSecond.

Data source: currentConnections
This data source has one data point, appCurrentConnections.

What I want to graph is a ratio based on these data points as follows:

appCurrentConnections / (BusyThreads + RequestsPerSecond + 1)

Basically, I want a measure of the amount of work in the queue (current connections) divided by the amount of work output my application is producing (a combination of the busy threads and requests per second it is handling, plus one to avoid the possibility of a divide-by-zero error).

With that established, we need to define the RRD DEFs (variables) used in the graph, one for each of the variables in the above calculation. Here's the one for the busy threads variable. I supplied BusyThreads-raw as the name that is used in the graph line:

DEF:BusyThreads-raw=${here/fullRRDPath}/appThreads_BusyThreads.rrd:ds0:AVERAGE

The key above is the TALES expression to get the variable from the Zenoss performance template into our RRDTool DEF variable: ${here/fullRRDPath}/dataSourceName_dataPointName

Regarding the :AVERAGE at the end: this is the RRDTool consolidation function (CF). There are several (AVERAGE, MIN, MAX, and LAST), but the most common one I've seen used is AVERAGE, which reads the archive whose stored values are averages of the samples collected over each consolidation interval. Please consult the RRDTool documentation for going deeper with this.
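
Since every DEF line follows the same pattern, it can be generated from the data source and data point names. Here is a small sketch; rrd_def is a hypothetical helper, and the "-raw" suffix is just the naming habit from my example, not something Zenoss requires.

```python
def rrd_def(data_source, data_point, cf="AVERAGE", suffix="-raw"):
    """Build a DEF line for one data point of a Zenoss performance template.

    The RRD file name follows the dataSourceName_dataPointName pattern and
    the variable name is the data point name plus a suffix.
    """
    return "DEF:%s%s=${here/fullRRDPath}/%s_%s.rrd:ds0:%s" % (
        data_point, suffix, data_source, data_point, cf)
```

Calling rrd_def("appThreads", "BusyThreads") reproduces the BusyThreads-raw line shown above.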

After providing DEF lines for each variable in my calculation, I need a CDEF line (calculated definition) for the actual calculation that puts the calculation together:

CDEF:connectionsToThreads=AppCurrentConnections-raw,1,RequestsPerSecond-raw,BusyThreads-raw,+,+,/

The calculation uses reverse Polish notation and the CDEF tutorial above has an excellent guide to understanding it, but basically you can think of it as a stack: the variables and constants are pushed onto the stack in order from left to right, and when the first operator (the leftmost plus sign) hits the stack, the top two items (in this case, the BusyThreads-raw and RequestsPerSecond-raw variables) are popped off the stack and added together (the operator is applied). The result is pushed back onto the stack. The next plus sign adds this sum with 1, and finally the division operator divides the AppCurrentConnections-raw variable by the topmost stack item (1 + RequestsPerSecond-raw + BusyThreads-raw).
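
To check an RPN expression before pasting it into a graph definition, you can walk the stack in a few lines of Python. This is a minimal sketch that handles only the four arithmetic operators, not the full RRDTool RPN language, and the sample values in the usage note are made up.

```python
def eval_rpn(expression, values):
    """Evaluate a comma-separated RPN expression such as a CDEF body.

    Only +, -, * and / are handled; any other token is looked up in
    values or parsed as a number.
    """
    operators = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a / b,
    }
    stack = []
    for token in expression.split(","):
        if token in operators:
            right = stack.pop()  # topmost item
            left = stack.pop()   # item beneath it
            stack.append(operators[token](left, right))
        elif token in values:
            stack.append(float(values[token]))
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed RPN expression: %r" % expression)
    return stack[0]
```

With 30 current connections, 4 requests per second, and 5 busy threads, evaluating "AppCurrentConnections-raw,1,RequestsPerSecond-raw,BusyThreads-raw,+,+,/" returns 3.0, which is 30 / (1 + 4 + 5).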

Once we have our connectionsToThreads variable, we can graph it. The next line defines the one line on our graph:

LINE:connectionsToThreads#00cc00:"Connections to Threads/Thread Activity Ratio"

It refers to the connectionsToThreads variable, defines a color in hex notation, and defines a label. Finally, we can print some additional information on the graph:

GPRINT:connectionsToThreads:LAST:cur\:%5.2lf%s
GPRINT:connectionsToThreads:AVERAGE:avg\:%5.2lf%s
GPRINT:connectionsToThreads:MAX:max\:%5.2lf%s\j

Here we print the last, average, and maximum values of our graph line on the currently-viewed graph section.

Limitation: Thresholding
One thing I could not get working was to define a threshold based on my calculated value above. It seems that the thresholds are only valid on the values of the data points themselves, and I couldn't get a threshold working on my derived value above.

Moving a Zenoss event to history via the Transform expression

(This tip is also on the Zenoss Wiki.)

There are cases where certain events are just noise and you want them moved to history automatically, but perhaps without having ALL of the events in that event class moved to history. For example, you may wish to move certain events from one event class to another based on matching text and at the same time have these go straight to history.

To do this, enter the following in Transform of the event class mapping:

evt._action="history"
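
If only some events in the mapping are noise, the same assignment can be made conditional on the message text. Here is a sketch of that logic as a standalone function so it can be tried outside Zenoss; the function name is hypothetical, and inside a real transform you would use just the two lines shown in the comment, since Zenoss supplies evt.

```python
def noise_to_history(evt, noise_text):
    """Send an event straight to history when its message contains noise_text.

    Equivalent transform text (evt is supplied by Zenoss):
        if evt.message.find("some noise text") >= 0:
            evt._action = "history"
    """
    message = getattr(evt, "message", "") or ""
    if noise_text in message:
        evt._action = "history"
    return evt
```

Events whose message does not contain the text are left alone and keep whatever action they already had.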

Move an event in Zenoss from one event class to another based on event text

(This tip is also on the Zenoss Wiki.)

Many events map to the /App/Failed event class, most notably the Windows Application Error_1000 error (http://<your Zenoss server>:8080/zport/dmd/Events/App/Failed/instances/Application%20Error_1000). I wanted to move some of these Application Error_1000 events to other event classes based on matching particular applications, but to leave the rest in /App/Failed. How does one do this?

To begin, confirm that you have an existing event class to receive the events. If not, create a new one by navigating through the "Events" tree from the left navigation to get to the desired parent class, and once there, click the drop-down arrow next to Subclasses and choose "Add New Organizer..." Enter the name for the new event class, e.g. "MyApplication."

Second, map an additional event class mapping to Application Error_1000. In /zport/dmd/Events/App/Failed, click the drop-down arrow to the left of EventClass Mappings and choose "Add Mapping..." For the ID of the mapping, type Application Error_1000_<name of the application to handle differently>, e.g. "Application Error_1000_MyApplication." (This event class mapping doesn't have to be named this way, but it helps to have the application name as the suffix, so that the mapping gets grouped with any other Application Error_1000 mappings in the list.)

Once you have done this, edit the properties of the new mapping. There are three key things you need to set:

Event Class Key: Set this to: Application Error_1000
Regex: I'm sure you can put in much more complicated regular expressions, but all that is necessary is to type some text from the event message, which will usually contain the application's executable name. If this is the case, all you need to enter is something like: MyApplication.exe
Transform: Here, you need to key in the Python expression that will re-map the event to a different event class, e.g.: evt.eventClass="/App/Failed/MyApplication"

Save your changes to this new event class mapping. Now you need to sequence all the Application Error_1000* events so that this custom entry is matched first. Edit the new mapping and click on the Sequence tab. Make sure that your new mapping (Application Error_1000_MyApplication) has a lower sequence number than the generic Application Error_1000 entry. I'm not sure if the sequence numbers need to start at zero, but I've done it that way. So, make your new class sequence 0, and the generic Application Error_1000 class sequence 1. Don't forget to save your changes.

That's it--the events matching your custom event class mapping will be moved to the target event class, and all the others will be left in the original class.
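
The reason the sequence matters is easier to see in a toy model. The sketch below is my simplified picture of first-match-wins ordering, not Zenoss's actual matching code; the tuple layout, the function name, and the sample message are all made up for illustration.

```python
import re


def classify(message, mappings, default_class="/App/Failed"):
    """Return the event class chosen by the first mapping whose regex matches.

    mappings is a list of (sequence, regex, target_class) tuples; a regex
    of None matches every message. Mappings are tried in ascending
    sequence order, which is why the specific one needs the lower number.
    """
    for _sequence, regex, target_class in sorted(mappings, key=lambda m: m[0]):
        if regex is None or re.search(regex, message):
            return target_class
    return default_class


# The custom mapping is sequence 0; the generic one is sequence 1.
MAPPINGS = [
    (1, None, "/App/Failed"),
    (0, r"MyApplication\.exe", "/App/Failed/MyApplication"),
]
```

With MAPPINGS as above, a message mentioning MyApplication.exe lands in /App/Failed/MyApplication and everything else stays in /App/Failed. Swap the two sequence numbers and the generic mapping catches every event first.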

Change Zenoss event severity based on message text

(This tip is also on the Zenoss Wiki.)

If you have events being mapped to a particular event class, generally one event severity gets applied to all of those events. If, however, you want to change the event severity of certain events based on the contents of the event message, do the following:

Navigate to the event class (under "Classes" in the left navigation, click Events, and then navigate to the event class containing the events you wish to conditionally map).
Using the drop-down arrow in the tab bar, choose More | Transform
In the Transform entry area, enter the following:

if evt.message.find("text to find") >= 0:
    evt.severity = <desired severity>

For example:

if evt.message.find("timegenerated") >= 0:
    evt.severity = 3

The above will change the severity of any events containing the text "timegenerated" from the default for the event class to 3 (warning). For your convenience, the event severity values are as follows:

Severity	Description
5	Critical
4	Error
3	Warning
2	Info
1	Debug
0	Clear
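
If you prefer names to numbers while experimenting, the table can be kept as a dictionary. This is a sketch for trying the logic outside Zenoss: the SEVERITIES dictionary and the function name are my own, and a real transform would still assign the number directly as in the example above.

```python
# Severity names and values from the table above.
SEVERITIES = {
    "Critical": 5,
    "Error": 4,
    "Warning": 3,
    "Info": 2,
    "Debug": 1,
    "Clear": 0,
}


def set_severity_on_match(evt, text, severity_name):
    """Set evt.severity by name when evt.message contains text."""
    if evt.message.find(text) >= 0:
        evt.severity = SEVERITIES[severity_name]
    return evt
```

For instance, set_severity_on_match(evt, "timegenerated", "Warning") has the same effect as the timegenerated example above.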

Thursday, January 10, 2008

Zenoss Core web site (Zope application server) crash

We had two Zenoss winexe processes go out of control yesterday. They consumed all available CPU and RAM and caused the rest of the daemons to crash/slow down. When we restarted Zenoss ("zenoss stop" followed by "zenoss start"), all Zenoss daemons came up, but zopectl (the Zope application server daemon) immediately died.

We found several of the following errors in $ZENHOME/log/event.log that appeared to be related:

2008-01-09T10:15:36 ERROR Zope.SiteErrorLog http://server.domain.local:8080/zport/RenderServer/render
Traceback (most recent call last):
File "usr/local/zenoss/lib/python/Zope2/App/startup.py", line 167, in zpublisher_exception_hook
File "usr/local/zenoss/lib/python/ZPublisher/Publish.py", line 120, in publish
File "usr/local/zenoss/lib/python/Zope2/App/startup.py", line 233, in commit
File "usr/local/zenoss/lib/python/transaction/_manager.py", line 84, in commit
File "usr/local/zenoss/lib/python/transaction/_transaction.py", line 381, in commit
File "usr/local/zenoss/lib/python/transaction/_transaction.py", line 379, in commit
File "usr/local/zenoss/lib/python/transaction/_transaction.py", line 424, in _commitResources
File "usr/local/zenoss/lib/python/ZODB/Connection.py", line 462, in commit
File "usr/local/zenoss/lib/python/ZODB/Connection.py", line 495, in _commit
ConflictError: database conflict error (oid 0x3b, class Products.ZenUtils.PObjectCache.PObjectCache)

Remediation:

Make sure zeoctl is started (as zenoss, run "zeoctl start", pause a few seconds, and then run "zenoss status" to confirm it has a PID and is running).
cd $ZENHOME/var
rm *.zec (this deletes invalid cache files that are causing the above error)
zopectl start
Wait a few seconds, then check if Zope stays running (use "zenoss status" or just hit the website to confirm).
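
The cache cleanup step can also be scripted so that you get a record of what was removed. This is a minimal sketch, assuming the cache files sit directly in the var directory as they did for us; the function name is hypothetical.

```python
import glob
import os


def remove_zeo_cache_files(var_dir):
    """Delete every *.zec file directly inside var_dir; return the removed paths."""
    removed = []
    for path in sorted(glob.glob(os.path.join(var_dir, "*.zec"))):
        os.remove(path)
        removed.append(path)
    return removed
```

Run it as the zenoss user while Zope is stopped, for example remove_zeo_cache_files(os.path.join(os.environ["ZENHOME"], "var")), and then continue with "zopectl start".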
