
Differences

This shows you the differences between two versions of the page.

opsview4.6:staleresults [2014/09/09 12:19] (current)
Line 1: Line 1:
 +====== Troubleshooting Stale Results ======
 +Stale results occur in a distributed environment when the slaves are not sending results back to the master.
 +A result is only considered stale once it is more than 30 minutes older than the check interval.
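The staleness rule above comes down to simple arithmetic on epoch timestamps. A minimal sketch, assuming the check interval is known in seconds; ''is_stale'' and its arguments are hypothetical names for illustration, not part of Opsview:

```shell
# is_stale LAST_RESULT_EPOCH NOW_EPOCH CHECK_INTERVAL_SECONDS
# Succeeds (exit 0) when the result is older than the check interval
# plus the 30 minute allowance described above.
is_stale() {
    last=$1
    now=$2
    interval=$3
    grace=1800   # 30 minutes in seconds
    [ $(( now - last )) -gt $(( interval + grace )) ]
}

# Example: a 5 minute check whose last result is 40 minutes old
# is_stale "$(( $(date +%s) - 2400 ))" "$(date +%s)" 300 && echo "stale"
```

With a 300 second interval the threshold is 2100 seconds, so a result that is 2400 seconds old counts as stale.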
 +
 +===== Checking the slave node via a service =====
 +Slave node services are generated automatically on the master server. They use the [[opsview4.6:slavesetup#monitoring_slaves|check_opsview_slave_node]] plugin.
 +
 +===== Check slave is sending results back: NRD =====
 +From Opsview 3.11.0, NRD is the default mechanism for transferring data from a slave to the master.
 +
 +Log on to the slave via the command line as the nagios user and run ''/usr/local/nagios/bin/retrieve_opsview_info''. This will return output like:
 +<code>
 +OK
 +slaveresults_backlog=5
 +slaveresults_maxage=34
 +nsca=2
 +now=1295360023
 +status=1295360023
 +slaveresults_error=Cannot connect
 +</code>
 +
 +This means there are 5 files backlogged (in ''/usr/local/nagios/var/slaveresults''), the oldest of which is 34 seconds old. The //slaveresults_error// line shows the most recent error message.
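The fields described above can be pulled out of the output with a small filter. This is a sketch that assumes the ''key=value'' layout shown in the sample output; ''backlog_summary'' is a hypothetical helper name:

```shell
# backlog_summary: read retrieve_opsview_info output on stdin and print
# the backlog count, the age of the oldest file and any error on one line.
backlog_summary() {
    awk -F= '
        $1 == "slaveresults_backlog" { backlog = $2 }
        $1 == "slaveresults_maxage"  { maxage = $2 }
        # Keep everything after the first "=" in case the message contains one
        $1 == "slaveresults_error"   { err = substr($0, index($0, "=") + 1) }
        END {
            printf "backlog=%s maxage=%ss error=%s\n", backlog, maxage, (err == "" ? "none" : err)
        }'
}

# Usage on the slave, as the nagios user:
# /usr/local/nagios/bin/retrieve_opsview_info | backlog_summary
```

A backlog or maximum age that keeps growing between runs suggests the slave cannot deliver results to the master.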
 +
 +==== Check time on the slave versus the master ====
 +A bug has been seen where, if the clock on the slave is ahead of the master (regardless of the "actual" time), NRD writes the results but Nagios® Core drops them because the timestamp of the result from the slave is "in the future". Opsview 3.11.2 will include a fix for this: if the slave timestamp is ahead, the result is passed to Nagios Core with the current time on the master.
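One way to compare the two clocks is to collect an epoch timestamp from each side and subtract. A rough sketch; the function names and the host name ''slave'' are placeholders, and the SSH round trip adds a small delay, so only a skew of more than a second or two is meaningful:

```shell
# clock_skew SLAVE_EPOCH MASTER_EPOCH
# Prints how many seconds the slave is ahead of the master (negative if behind).
clock_skew() {
    echo $(( $1 - $2 ))
}

# slave_is_ahead SLAVE_EPOCH MASTER_EPOCH
# Succeeds when the slave clock is ahead, the case that leads to dropped results.
slave_is_ahead() {
    [ "$(clock_skew "$1" "$2")" -gt 0 ]
}

# Usage from the master:
# slave_now=$(ssh nagios@slave date +%s); master_now=$(date +%s)
# slave_is_ahead "$slave_now" "$master_now" && echo "slave ahead by $(clock_skew "$slave_now" "$master_now")s"
```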
 +
 +==== Check slave can send a result back to Master ====
 +On the slave as the nagios user, run:
 +<code>
 +printf "host1\t0\tdummy result" | /usr/local/nagios/bin/send_nrd -c /usr/local/nagios/etc/send_nrd.cfg
 +</code>
 +
 +This sends a dummy result back to the master server and proves connectivity. You should get no output. Check on the master in ''/usr/local/nagios/var/nagios.log'' that this has been received:
 +<code>
 +[1295360563] Warning: Check result queue contained results for host 'host1', but the host could not be found!  Perhaps you forgot to define the host in your config files?
 +</code>
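Searching the log by hand can be replaced with a small check. A sketch, assuming the warning text matches the sample above; ''result_received'' is a hypothetical helper name:

```shell
# result_received LOGFILE HOSTNAME
# Succeeds if the log contains a check result warning naming the dummy host.
result_received() {
    grep -F "results for host '$2'" "$1" >/dev/null
}

# Usage on the master:
# result_received /usr/local/nagios/var/nagios.log host1 && echo "dummy result arrived"
```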
 +
 +
 +==== Check ssh tunnels ====
 +The master speaks to the slaves via the SSH tunnels. Check that a process exists:
 +<code>
 +$ ps -ef | grep ssh
 +ssh -n -N -T -2 -o TCPKeepAlive=yes -o ServerAliveCountMax=3 -o ServerAliveInterval=10 -R 5669:localhost:5669 -R 5667:localhost:5667 -R 2345:localhost:2345 nagios@slave
 +</code>
 +
 +It is possible that the ssh process is running, but the tunnel does not work. Kill the process for the ssh tunnel and a new one will be initiated automatically by the opsviewd daemon.
 +
 +**Note**: If you have slave initiated / reverse SSH connections, you will need to run the ps command on the slave. To restart the reverse SSH connection, run ''/etc/init.d/opsview-slave restart''.
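To confirm that the running tunnel actually carries a given forward, the ''ps'' output can be filtered for the port. A sketch that assumes the ''-R port:host:port'' layout shown in the output above; ''tunnel_forwards_port'' is a hypothetical helper name, and 5669 is used only because it appears in that output:

```shell
# tunnel_forwards_port PORT
# Reads "ps" output on stdin; succeeds if an ssh process carries a -R forward
# that maps PORT to the same PORT on the far side.
tunnel_forwards_port() {
    grep -E "ssh .*-R $1:[^ ]+:$1( |\$)" >/dev/null
}

# Usage on the master:
# ps -ef | tunnel_forwards_port 5669 || echo "no ssh tunnel forwarding port 5669"
```

A match only shows that the process exists with the expected options. As noted above, the tunnel itself may still be broken.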
 +
 +
 +
 +
 +===== Check slave is sending results back: NSCA =====
 +Log on to the slave via the command line and check the contents of ''/usr/local/nagios/var/cache.log''. This shows the number of messages sent via send_nsca and whether they were sent successfully:
 +<code>
 +Last run time: Tue Jan  8 10:29:41 2008
 +cache_service.log: collected 9 items
 +9 data packet(s) sent to host successfully.
 +</code>
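The packet count can be extracted from the log for use in a script. A sketch, assuming the log lines match the sample above; ''nsca_sent_count'' is a hypothetical helper name:

```shell
# nsca_sent_count: read cache.log on stdin and print the packet count from the
# most recent "data packet(s) sent to host successfully." line.
# Prints nothing if no such line is present.
nsca_sent_count() {
    sed -n 's/^\([0-9][0-9]*\) data packet(s) sent to host successfully\.$/\1/p' | tail -n 1
}

# Usage on the slave:
# nsca_sent_count < /usr/local/nagios/var/cache.log
```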
 +
 +==== Check slave can send a result back to Master ====
 +On the slave as the nagios user, run:
 +<code>
 +echo "" | /usr/local/nagios/bin/send_nsca -H localhost -c /usr/local/nagios/etc/send_nsca.cfg
 +</code>
 +
 +This sends a dummy packet back to the master server and proves connectivity. You should get the result:
 +<code>
 +0 data packet(s) sent to host successfully.
 +</code>
 +
 +If not, then there could be a problem with the SSH tunnels.
 +
 +You can also run:
 +<code>
 +printf "temphost\ttempservice\t0\tFake result from NSCA\n" | /usr/local/nagios/bin/send_nsca -H localhost -c /usr/local/nagios/etc/send_nsca.cfg
 +</code>
 +
 +This will send a packet to the master server for a host called ''temphost'' and a service called ''tempservice''. This should return:
 +<code>
 +1 data packet(s) sent to host successfully.
 +</code>
 +
 +To confirm this reached Nagios Core, check ''/usr/local/nagios/var/nagios.log'' for the entry:
 +<code>
 +[1281521419] Warning:  Passive check result was received for service 'tempservice' on host 'temphost', but the host could not be found!
 +</code>
 +
 +This means that the result has travelled from the slave to the master and has been processed by Nagios Core.
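The test packet used above is a single tab-separated line: host, service, status code and output text. A small helper makes it easier to send other values; ''nsca_service_result'' is a hypothetical name, and the status codes follow the usual Nagios convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN):

```shell
# nsca_service_result HOST SERVICE STATUS TEXT
# Prints one tab-separated passive service result in the layout used above.
nsca_service_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Usage on the slave, as the nagios user:
# nsca_service_result temphost tempservice 0 "Fake result from NSCA" | \
#     /usr/local/nagios/bin/send_nsca -H localhost -c /usr/local/nagios/etc/send_nsca.cfg
```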
 +
 +You can also test as if Nagios Core is pushing the result to the master:
 +<code>
 +printf "temphost\ttempservice\t0\tFake result from commandline\n" >> /usr/local/nagios/var/cache_service.log
 +/usr/local/nagios/bin/process-cache-data cache_service.log
 +</code>
 +
 +This should give output like:
 +<code>
 +cache_service.log: collected 2 items
 +1 data packet(s) sent to host successfully.
 +</code>
 +
 +This means that the mechanism for pushing results back to Opsview master works as Nagios Core expects.
 +
 +==== Check ssh tunnels ====
 +The master speaks to the slaves via the SSH tunnels. Check that a process exists:
 +<code>
 +$ ps -ef | grep ssh
 +ssh -n -N -T -2 -o TCPKeepAlive=yes -o ServerAliveCountMax=3 -o ServerAliveInterval=10 -R 5667:localhost:5667 -R 2345:localhost:2345 nagios@slave
 +</code>
 +
 +It is possible that the ssh process is running, but the tunnel does not work. Kill the process for the ssh tunnel and a new one will be initiated automatically by the opsviewd daemon.
 +
 +**Note**: If you have slave initiated / reverse SSH connections, you will need to run the ps command on the slave. To restart the reverse SSH connection, run ''/etc/init.d/opsview-slave restart''.
 +
 +==== Check NSCA on master ====
 +If the master is receiving results from other slaves, then NSCA must be working correctly on the master. If you only have one slave, find the nsca process and run ''strace -p {pid}''. This should show whether data is being received.
 +
 +You should see some output like:
 +<code>
 +write(6, "[1199781970] PROCESS_SERVICE_CHE"..., 114) = 114
 +</code>
 +
 +Run ''lsof -p {pid}'' to see which file that descriptor refers to.
 +
 +If it refers to ''/dev/null'', you are seeing a known problem that has been fixed; the fix will be in the 2.10.2 release.
 +
 +Restart nsca.
 +
 +===== If only a few services go into an UNKNOWN state =====
 +This could be due to a bug in Nagios Core. Opsview sets the max_concurrent_checks value to 50. However, if this is reached, then services that are scheduled to execute will get pushed back.
 +
 +This has been fixed in the Nagios Core code by Opsera, and the fix is available from Opsview 3.2.0 onwards.
 +
 +The workaround is to set ''max_concurrent_checks'' in ''opsview.conf'' to a higher value, or to 0, which removes the upper limit.
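After changing the value it can help to confirm which limit Nagios Core is actually running with. The sketch below assumes the limit appears as a ''max_concurrent_checks=N'' line in a nagios.cfg-style file; the path in the usage comment is an assumption, and ''concurrent_check_limit'' is a hypothetical helper name:

```shell
# concurrent_check_limit CFGFILE
# Prints the max_concurrent_checks value from a nagios.cfg-style file.
# A value of 0 means there is no upper limit.
concurrent_check_limit() {
    sed -n 's/^[[:space:]]*max_concurrent_checks[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1
}

# Usage on the master (path is an assumption):
# concurrent_check_limit /usr/local/nagios/etc/nagios.cfg
```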
 +
 +===== Restart opsview on master =====
 +Sometimes you may need to restart Opsview. This restarts the nagios, nsca and nrd processes and re-initialises the SSH tunnels:
 +<code>
 +/etc/init.d/opsview stop
 +/etc/init.d/opsview start
 +</code>
 +===== Troubleshooting =====
 +==== Error processing data [Cannot connect [Connection refused] at ... ] - retry in 5 seconds ====
 +This message can appear in ''/usr/local/nagios/var/log/opsview-slave.log''. It means that the tunnels used to send results from the slave to the master have not been set up.
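To see how often the slave is hitting this error, the log can be counted. A sketch, assuming the message contains the text ''Cannot connect'' as in the heading above; ''connect_error_count'' is a hypothetical helper name:

```shell
# connect_error_count LOGFILE
# Prints how many "Cannot connect" lines the slave has logged.
connect_error_count() {
    # grep -c prints 0 but exits non-zero when nothing matches, so mask the status
    grep -c -F "Cannot connect" "$1" || true
}

# Usage on the slave:
# connect_error_count /usr/local/nagios/var/log/opsview-slave.log
```

A count that keeps rising between runs indicates the slave is still retrying and the tunnel has not come back.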
 +
 +In a reverse SSH tunnel setup, the tunnels on the slave may be set up incorrectly, or the SSH process may not be running. Check for an SSH process on the slave like:
 +<code>
 +/usr/bin/ssh -N -T -2 -o TCPKeepAlive=yes -o ServerAliveCountMax=3 -o ServerAliveInterval=10 -R 25801:127.0.0.1:22 -L 5669:127.0.0.1:5669 -L 4125:127.0.0.1:4125 -L 5667:127.0.0.1:5667 -L 2345:127.0.0.1:2345 nagios@192.168.101.32
 +</code>
 +==== Couldn't unserialize a request ====
 +If you get an error in ''/var/log/opsview/opsviewd.log'' like:
 +<code>
 +[2011/02/01 10:16:55] [nrd] [ERROR] Couldn't unserialize a request: malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "\x{2d244ba3}\x{d244}...") at /usr/local/nagios/bin/../perl/lib/NRD/Serialize/plain.pm line 28, <GEN12> line 37.
 +</code>
 +
 +The important part is the reference to ''plain.pm''. The default is to use crypt, so if ''plain.pm'' is being picked up, the configuration file ''/usr/local/nagios/etc/nrd.conf'' is either missing or incorrect.
 +
 +Recreate this file by reloading Opsview and then restart the nrd daemon for it to take effect:
 +<code>
 +/etc/init.d/opsview restart
 +</code>
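The symptom described in this section can be detected with a simple log search. A sketch, assuming the error text matches the sample above; ''nrd_serializer_mismatch'' is a hypothetical helper name:

```shell
# nrd_serializer_mismatch LOGFILE
# Succeeds if the log shows an unserialize error raised from the plain serializer,
# which suggests nrd.conf is missing or incorrect.
nrd_serializer_mismatch() {
    grep -F "Couldn't unserialize a request" "$1" | grep -F "NRD/Serialize/plain.pm" >/dev/null
}

# Usage on the master:
# nrd_serializer_mismatch /var/log/opsview/opsviewd.log && echo "check /usr/local/nagios/etc/nrd.conf"
```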