
Troubleshooting Stale Results

Stale results occur in a distributed environment when the slaves are not sending results back to the master.

Results are not considered stale until they are more than 30 minutes older than the check interval allows.
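As a sketch of that staleness rule (the epochs and interval below are made-up example values, not from a live system):

```shell
# A result is stale once its age exceeds the check interval plus 30 minutes (1800s).
now=1295360023                 # example "current" time on the master
last_result=1295357523         # example: result received 2500s ago
check_interval=300             # example: 5-minute check interval
threshold=$(( check_interval + 1800 ))
age=$(( now - last_result ))
if [ "$age" -gt "$threshold" ]; then
  echo "stale: age ${age}s exceeds threshold ${threshold}s"
else
  echo "fresh: age ${age}s within threshold ${threshold}s"
fi
```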

Checking the slave node via a service

Slave node services are generated automatically on the master server, using the check_opsview_slave_node plugin.

Check slave is sending results back: NRD

From Opsview 3.11.0, NRD is the default mechanism for transferring data from a slave to the master.

Log on to the slave via the command line as the nagios user and run /usr/local/nagios/bin/retrieve_opsview_info. This returns output like:

OK
slaveresults_backlog=5
slaveresults_maxage=34
nsca=2
now=1295360023
status=1295360023
slaveresults_error=Cannot connect

This means there are 5 files backlogged (in /usr/local/nagios/var/slaveresults), of which the oldest is 34 seconds old. slaveresults_error shows the last error message.
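A quick way to inspect that backlog directly on the slave (the directory path is taken from above; the SLAVERESULTS_DIR override is just for convenience when testing elsewhere):

```shell
# Count queued NRD result files and report the age of the oldest one.
dir="${SLAVERESULTS_DIR:-/usr/local/nagios/var/slaveresults}"
if [ ! -d "$dir" ]; then
  echo "$dir does not exist"
else
  count=$(find "$dir" -type f | wc -l)
  echo "$count file(s) queued in $dir"
  # GNU find: print each file's mtime as epoch seconds, oldest first
  oldest=$(find "$dir" -type f -printf '%T@ %p\n' | sort -n | head -1)
  [ -n "$oldest" ] && echo "oldest: age $(( $(date +%s) - ${oldest%%.*} ))s (${oldest#* })"
fi
```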

Check time on the slave versus the master

A bug has been seen where if the time on the slave is ahead of the master (regardless of “actual” time), then NRD writes the results but Nagios® Core drops them because the timestamp of the result from the slave is “in the future”. Opsview 3.11.2 will include a fix for this, where if the slave timestamp is ahead, it is passed to Nagios Core as current time on the master.
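To look for this skew, compare epoch seconds on both machines. The ssh command in the comment is one way to fetch the master's clock (the hostname is an assumption); the epochs below are made-up example values:

```shell
# On a live system you might fetch the two clocks like this:
#   slave_now=$(date +%s)
#   master_now=$(ssh nagios@master date +%s)
slave_now=1295360100           # example slave clock
master_now=1295360023          # example master clock
skew=$(( slave_now - master_now ))
if [ "$skew" -gt 0 ]; then
  echo "slave is ${skew}s ahead of the master - results may be dropped as 'in the future'"
else
  echo "slave clock is not ahead of the master (skew: ${skew}s)"
fi
```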

Check slave can send a result back to Master

On the slave as the nagios user, run:

printf "host1\t0\tdummy result" | /usr/local/nagios/bin/send_nrd -c /usr/local/nagios/etc/send_nrd.cfg

This sends a dummy result back to the master server and proves connectivity. You should get no output. Then check on the master in /usr/local/nagios/var/nagios.log that the result has been received; the warning below is expected, since host1 is not a configured host, and confirms the result reached Nagios Core:

[1295360563] Warning: Check result queue contained results for host 'host1', but the host could not be found!  Perhaps you forgot to define the host in your config files?

Check ssh tunnels

The master speaks to the slaves via the SSH tunnels. Check that a process exists:

$ ps -ef | grep ssh
ssh -n -N -T -2 -o TCPKeepAlive=yes -o ServerAliveCountMax=3 -o ServerAliveInterval=10 -R 5669:localhost:5669 -R 5667:localhost:5667 -R 2345:localhost:2345 nagios@slave

It is possible that the ssh process is running, but the tunnel does not work. Kill the process for the ssh tunnel and a new one will be initiated automatically by the opsviewd daemon.
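One way to do that kill step, assuming the tunnel's command line matches the ps output above (the [s]sh bracket trick stops pgrep from matching its own command line):

```shell
# Find the forwarding ssh process by a distinctive part of its command line
# and kill it; opsviewd then re-establishes the tunnel.
pid=$(pgrep -f '[s]sh .*-R 5669:localhost:5669')
if [ -n "$pid" ]; then
  kill $pid && echo "killed tunnel process $pid; opsviewd will start a new one"
else
  echo "no matching ssh tunnel process found"
fi
```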

Note: If you use slave-initiated (reverse) SSH connections, you will need to run the ps command on the slave. To restart the reverse SSH connection, run /etc/init.d/opsview-slave restart.

Check slave is sending results back: NSCA

Log on to the slave via the command line and check the contents of /usr/local/nagios/var/cache.log. This shows the number of messages sent via send_nsca and whether they were sent successfully:

Last run time: Tue Jan  8 10:29:41 2008
cache_service.log: collected 9 items
9 data packet(s) sent to host successfully.

Check slave can send a result back to Master

On the slave as the nagios user, run:

echo "" | /usr/local/nagios/bin/send_nsca -H localhost -c /usr/local/nagios/etc/send_nsca.cfg

This sends a dummy packet back to the master server and proves connectivity. You should get the result:

0 data packet(s) sent to host successfully.

If not, then there could be a problem with the SSH tunnels.

You can also run:

printf "temphost\ttempservice\t0\tFake result from NSCA\n" | /usr/local/nagios/bin/send_nsca -H localhost -c /usr/local/nagios/etc/send_nsca.cfg

This sends a packet to the master server for a host called temphost and a service called tempservice. It should return:

1 data packet(s) sent to host successfully.

To confirm this reached Nagios Core, check /usr/local/nagios/var/nagios.log for the entry:

[1281521419] Warning:  Passive check result was received for service 'tempservice' on host 'temphost', but the host could not be found!

This means that the result has travelled from the slave to the master and has been processed by Nagios Core.

You can also test as if Nagios Core is pushing the result to the master:

printf "temphost\ttempservice\t0\tFake result from commandline\n" >> /usr/local/nagios/var/cache_service.log
/usr/local/nagios/bin/process-cache-data cache_service.log

This should give output like:

cache_service.log: collected 2 items
1 data packet(s) sent to host successfully.

This means that the mechanism for pushing results back to Opsview master works as Nagios Core expects.

Check ssh tunnels

The master speaks to the slaves via the SSH tunnels. Check that a process exists:

$ ps -ef | grep ssh
ssh -n -N -T -2 -o TCPKeepAlive=yes -o ServerAliveCountMax=3 -o ServerAliveInterval=10 -R 5667:localhost:5667 -R 2345:localhost:2345 nagios@slave

It is possible that the ssh process is running, but the tunnel does not work. Kill the process for the ssh tunnel and a new one will be initiated automatically by the opsviewd daemon.

Note: If you use slave-initiated (reverse) SSH connections, you will need to run the ps command on the slave. To restart the reverse SSH connection, run /etc/init.d/opsview-slave restart.

Check NSCA on master

If the master is receiving results from other slaves, then NSCA must be working correctly on the master. If you have only one slave, find the nsca process and run strace -p {pid}; this shows whether data is being received.

You should see some output like:

write(6, "[1199781970] PROCESS_SERVICE_CHE"..., 114) = 114

Run “lsof -p {pid}” to see what that file descriptor is.

If the file descriptor points at /dev/null, the results are being discarded; this bug has been fixed and the fix is included in the 2.10.2 release.

Restart nsca.

If only a few services go into an UNKNOWN state

This could be due to a bug in Nagios Core. Opsview sets max_concurrent_checks to 50; if this limit is reached, services that are scheduled to execute get pushed back.

This has been fixed in the core Nagios Core code by Opsera and is available from Opsview 3.2.0 or above.

The workaround is to set max_concurrent_checks in opsview.conf to a higher value, or to 0, which removes the upper limit.
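A sketch of that workaround as a config fragment; the Perl-variable syntax is an assumption based on other opsview.conf settings, so match the style of your existing file:

```perl
# /usr/local/nagios/etc/opsview.conf (fragment, syntax assumed)
$max_concurrent_checks = 0;    # 0 removes the limit; alternatively raise the default of 50
```

Reload Opsview after changing the file so the new value is written into the Nagios Core configuration.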

Restart opsview on master

Sometimes you may need to restart Opsview. This restarts the nagios, nsca and nrd processes and re-initialises the SSH tunnels:

/etc/init.d/opsview stop
/etc/init.d/opsview start

Troubleshooting

Error processing data [Cannot connect [Connection refused] at ... ] - retry in 5 seconds

This message could appear in /usr/local/nagios/var/log/opsview-slave.log. It means that the tunnels to send results from the slave to the master have not been set up.

In a reverse SSH tunnel setup, the slave may not have the tunnels set up correctly, or the tunnel process may not be running. Check for an SSH process on the slave like:

/usr/bin/ssh -N -T -2 -o TCPKeepAlive=yes -o ServerAliveCountMax=3 -o ServerAliveInterval=10 -R 25801:127.0.0.1:22 -L 5669:127.0.0.1:5669 -L 4125:127.0.0.1:4125 -L 5667:127.0.0.1:5667 -L 2345:127.0.0.1:2345 nagios@192.168.101.32
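You can also probe the forwarded NRD port from the slave to see whether the "Connection refused" comes from a missing tunnel. A sketch assuming nc (netcat) is installed:

```shell
# 127.0.0.1:5669 should be the slave-side end of the tunnel to the master's
# NRD daemon; a refused connection matches the error above.
if ! command -v nc >/dev/null 2>&1; then
  echo "nc not installed - try telnet 127.0.0.1 5669 instead"
elif nc -z 127.0.0.1 5669 2>/dev/null; then
  echo "127.0.0.1:5669 accepts connections - tunnel endpoint is up"
else
  echo "127.0.0.1:5669 refused - the SSH tunnel is down or incomplete"
fi
```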

Couldn't unserialize a request

If you get an error in /var/log/opsview/opsviewd.log like:

[2011/02/01 10:16:55] [nrd] [ERROR] Couldn't unserialize a request: malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "\x{2d244ba3}\x{d244}...") at /usr/local/nagios/bin/../perl/lib/NRD/Serialize/plain.pm line 28, <GEN12> line 37.

The important part is plain.pm. The default serialisation method is crypt, so if plain.pm is being picked up, the configuration file /usr/local/nagios/etc/nrd.conf is either missing or incorrect.

Recreate this file by reloading Opsview, then restart the nrd daemon for the change to take effect:

/etc/init.d/opsview restart