Troubleshooting plugin output

Active checks are run by plugins that return textual output. This page lists some of the most common or cryptic messages, with suggestions for troubleshooting.

Basic technique

First, get the actual command that is run. As an admin user, search for the host in the list pages, open its contextual menu and click the number of service checks. This lists all the services associated with the host and the arguments used.

From the monitoring server for the host (which could be a slave), switch to the nagios user and run the command on the command line. If the result is different from what Opsview displays, then there may be something wrong with the way Nagios® Core is invoking the command - escalate to Opsview Ltd for support.

If the result is similar and you are using check_nrpe, then log in to the remote host and run the plugin there.

Beware: some plugins, such as check_logs or check_sec, store state information. This means that running the plugin manually may cause some state information to be lost between invocations.
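
The manual run described above can be made a little easier to read with a small wrapper. This is only a sketch: run_check is a hypothetical helper, not part of Opsview, and the plugin path and arguments in the example are assumptions. It relies on the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
# Hypothetical helper: run a plugin command line, print its output, and
# report the status that corresponds to its exit code.
run_check() {
    output=$("$@" 2>&1)
    code=$?
    case "$code" in
        0) status=OK ;;
        1) status=WARNING ;;
        2) status=CRITICAL ;;
        3) status=UNKNOWN ;;
        *) status="INVALID($code)" ;;
    esac
    printf '%s\n' "$output"
    printf 'exit code: %s (%s)\n' "$code" "$status"
    return "$code"
}

# Example (path and arguments are assumptions; use the command listed for
# your service check):
#   su - nagios
#   run_check /usr/local/nagios/libexec/check_nrpe -H myhost -c check_load
```

Comparing both the text and the exit code with what Opsview displays helps to show whether the difference lies in the plugin itself or in how it is invoked.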


Service check Timed Out

Nagios Core reports this output for an active check that has taken too long to run. If you want to increase the timeout value, you can override the service_check_timeout value in opsview.conf.

However, it is probably better to understand why the active check has taken too long to run. Active checks should execute and finish very quickly (usually under 30 seconds, mostly under 10 seconds).

If you have a check that will take a long time, such as a performance test run, consider redesigning it so that the task runs independently of Nagios Core and the plugin only checks whether it succeeded or failed.
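
One possible shape for such a redesign, as a sketch only: the long-running job (started from cron, for example) writes OK or FAILED to a status file when it finishes, and the plugin just reads that file. The function name, the file format and the age threshold are all assumptions made for illustration. The -mmin test is supported by GNU and BSD find.

```shell
# Hypothetical plugin body: read the result that a long-running job left in
# a status file, so that the check itself returns almost immediately.
check_job_status() {
    status_file=$1
    max_age_minutes=$2

    if [ ! -f "$status_file" ]; then
        echo "UNKNOWN - status file $status_file not found"
        return 3
    fi

    # find prints the file name only if it was last modified more than
    # max_age_minutes ago, meaning the job has not reported recently.
    if [ -n "$(find "$status_file" -mmin +"$max_age_minutes")" ]; then
        echo "WARNING - no result written in the last $max_age_minutes minutes"
        return 1
    fi

    result=$(head -n 1 "$status_file")
    if [ "$result" = "OK" ]; then
        echo "OK - last run succeeded"
        return 0
    fi
    echo "CRITICAL - last run reported: $result"
    return 2
}

# Example (the path is an assumption):
#   check_job_status /var/tmp/perftest.status 1440
```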

Connection refused by host

This can come from check_nrpe, but also from other plugins. It usually means that the remote service is not running.

CHECK_NRPE: Socket timeout after 10 seconds

This means that NRPE on the remote host has accepted the network request but hasn't responded yet. This could be because:

  • the remote plugin has taken a long time to run
  • the remote host is so busy it cannot respond

You can tell check_nrpe to use a different timeout value (-t option), but beware that Nagios Core has its own, separate timeout value (default is 30 seconds).
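
The relationship between the two timeouts can be sketched as a small guard. nrpe_check_command is a hypothetical helper that only prints the check_nrpe command line, using the -H, -c and -t options; the host and command names in the example are assumptions. It rejects an NRPE timeout that is not shorter than the Nagios Core timeout, on the assumption that Nagios Core would otherwise stop the check first.

```shell
# Hypothetical helper: build a check_nrpe command line, refusing a timeout
# that is not below the Nagios Core service check timeout (fourth argument,
# defaulting to the 30 seconds mentioned above).
nrpe_check_command() {
    host=$1
    command=$2
    nrpe_timeout=$3
    nagios_timeout=${4:-30}

    if [ "$nrpe_timeout" -ge "$nagios_timeout" ]; then
        echo "timeout $nrpe_timeout must be below Nagios Core timeout $nagios_timeout" >&2
        return 1
    fi
    echo "check_nrpe -H $host -c $command -t $nrpe_timeout"
}

# Example:
#   nrpe_check_command myhost check_load 20
```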

On some Windows servers running NSClient++, WMI statistics collection has been seen to exceed this timeout.

NDO CRITICAL - ndo last imported 8277 seconds ago

This means that the Runtime database is X seconds behind the current time (8277 seconds in the example above). This can occur if:

  • there is heavy database activity so logs are not imported fast enough, for instance, if you have a very large configuration and a reload has just occurred
  • a mysql optimise is being run, which locks tables so the importing halts
  • import_ndologsd is not running - a full filesystem for /var/log/opsview could cause the daemon to crash. Restart Opsview (/etc/init.d/opsview restart) or run /usr/local/nagios/bin/import_ndologsd as the nagios user

As the mysql root user, you can run show full processlist to see what, if anything, is holding up a table update.

import_ndologsd is a daemon that pushes the log files from /usr/local/nagios/var/ndologs into the Runtime database, thus allowing asynchronous updates to the database.
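
A quick way to see whether the importer is keeping up is to count the files waiting in that directory and to check that the daemon is present. This is a sketch under the assumption that files are removed from the directory once they have been imported; both function names are hypothetical.

```shell
# Count the NDO log files still waiting to be imported. The default
# directory is the one named above; pass another as the first argument.
ndo_backlog() {
    dir=${1:-/usr/local/nagios/var/ndologs}
    find "$dir" -type f | wc -l | tr -d ' '
}

# Succeed if a process whose command line mentions import_ndologsd exists.
# pgrep -f matches against the full command line.
ndo_importer_running() {
    pgrep -f import_ndologsd >/dev/null
}

# Example:
#   ndo_backlog
#   ndo_importer_running || echo "import_ndologsd is not running"
```

A count that keeps growing between runs suggests that the import is stalled or too slow, while a count that falls after a reload suggests that the backlog is clearing on its own.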

If you are getting this error frequently and have a large system, consider moving your database onto a different server - this single change will speed up the importing time immensely.

opsviewd error: No such file or directory

This is from the check_opsview_master plugin. If you get this, then the plugin was not able to speak to opsviewd's pipe process. This means that reloads from the web interface will not work. Check if opsviewd is working. Escalate to Opsview Ltd for support.

opsviewd error: CRITICAL - Socket timeout after 10 seconds

This could occur if the opsviewd process is busy with other requests. Escalate to Opsview Ltd for support.

Time is out of sync by X seconds

This comes from the check_opsview_slave_node check, which confirms that the slave has the same time as the master.

Check that a time synchronisation tool is running on both the master and the slave so that their clocks stay synchronised.

We have also seen this problem when an ssh login takes a long time to connect. The plugin gets the current time on the master and then runs on the slave to get its time, so a slow connection appears as a time difference. In the case we saw, the ssh connection was slow because of badly configured DNS resolution - check that /etc/resolv.conf on the master and the slave has been set up correctly. You should be able to reproduce this by running on the Opsview master:

su - nagios
date; ssh {slave} date

Usually there is very little difference between the two date values. If there is a delay due to name resolution, the two times will differ noticeably.
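
To put a number on the difference, the two times can be compared as epoch seconds. This is a minimal sketch: time_skew is a hypothetical helper, and date +%s is assumed to be available on both hosts (it is supported by GNU and BSD date).

```shell
# Print the absolute difference in seconds between two epoch timestamps.
time_skew() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# Example, run on the Opsview master:
#   su - nagios
#   local_time=$(date +%s); remote_time=$(ssh {slave} date +%s)
#   time_skew "$local_time" "$remote_time"
```

The result includes the time ssh takes to connect, which is the same effect that the plugin is exposed to.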

Another possibility, if you are using virtual machines, is that I/O activity is constrained. One process that can cause a high load is Opsview's housekeeping routines, which are invoked from cron daily at 03:11 local time. This housekeeping includes pruning of some database tables - see the mysql tuning section for information about tuning mysql.

CHECK_NRPE: Error - Could not complete SSL handshake

This can happen if there is a daemon listening at the remote end which is not NRPE.

This can also be seen if allowed_hosts in nrpe.cfg is set to specific IP addresses and check_nrpe is run from an IP address that is not listed.
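
The allowed_hosts case can be checked on the remote host with a short sketch like the one below. nrpe_allows is a hypothetical helper and the location of nrpe.cfg is an assumption that varies between installations. It compares exact addresses only, so network ranges or host names in the list are not interpreted, and it fails if the file has no allowed_hosts line at all.

```shell
# Hypothetical helper: succeed if the given IP address is listed exactly in
# the allowed_hosts line of the given nrpe.cfg file.
nrpe_allows() {
    cfg=$1
    ip=$2
    grep '^[[:space:]]*allowed_hosts[[:space:]]*=' "$cfg" |
        sed 's/^[^=]*=//' |   # keep only the value after the equals sign
        tr ',' '\n' |         # one address per line
        tr -d ' \t' |         # drop surrounding whitespace
        grep -F -x -- "$ip" >/dev/null
}

# Example (the path and address are assumptions):
#   nrpe_allows /usr/local/nagios/etc/nrpe.cfg 192.168.1.10 || echo "not listed"
```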

Capturing Plugin Output

There may be times when Nagios Core returns (null) because the plugin displayed no output. From Opsview 3.9.0, there is a plugin called check_plugin_output. You can use this as the plugin and then change the arguments to name the plugin that would normally be executed. For instance, if you want to diagnose a service check called Interface, edit the service check and:

  • change the plugin to check_plugin_output
  • add the original plugin name to the beginning of the argument field, in this case check_snmp_linkstatus
  • reload

There will be a log file generated in /tmp/check_plugin_output. A new entry will appear for every invocation of the plugin.

Remember to revert the plugin to its original setting when you have finished troubleshooting.