opsview4.6:slavesynchronisation [2014/09/09 12:19] (current)
====== Slave State Synchronisation ======
A major feature of Opsview is the simple way in which you can set up distributed monitoring. This works by having separate Nagios® Core instances on slave systems.
It is possible for the slaves to have synchronised status of hosts and services sent from the master or from other cluster nodes.
This affects users who have a distributed environment.
===== When Synchronised Status Matters =====
Each Nagios Core instance has retention information describing the last state, output, acknowledgements, comments and downtimes of each host and service.
==== Opsview Reload Time ====
If a host is moved from being monitored by the master to being monitored by a slave, or from one slave to another, the new slave does not know the last status of that host and its services. This means that if you have notifications configured to be sent from a slave, you could get notifications for states that the Opsview master already knows about.
Alternatively, if a host is assigned to a slave system with multiple nodes, Opsview decides which node actively monitors the host and its services. This could be a different cluster node from last time.
==== Cluster Node Take Over Time ====
When a cluster node takes over from another cluster node, the new node will not know the current state and could send out notifications if it considers state changes to have occurred.
==== Cluster Node Recovery ====
When a cluster node recovers, it will start monitoring its own hosts and services as usual. However, it may hold stale state information, as the state of the hosts and services may have changed while it was down.
The Opsview master's //Slave-node: {name}// check will notice that a slave node has recovered and will send the latest state information to that slave node for synchronisation.
===== How Opsview Synchronises States =====
At Opsview reload time, Opsview takes the current state information from the master Nagios Core instance (from the status.dat file) and constructs a //sync.dat// file for each slave system. This file is sent to each slave and is loaded when Nagios Core reloads. As the master knows about all states, acknowledgements and downtimes, the slave will also have the latest information before it starts doing its monitoring.
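The construction step can be sketched as follows. This is purely illustrative, not Opsview's actual code; the function names are invented, but the block layout (//hoststatus {//, //key=value// lines, closing //}//) follows the standard Nagios status.dat convention:

```python
def parse_status_blocks(path):
    """Yield (block_type, fields) for each block in a status.dat file."""
    block_type, fields = None, {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.endswith("{"):
                block_type, fields = line[:-1].strip(), {}
            elif line == "}":
                yield block_type, fields
                block_type = None
            elif block_type and "=" in line:
                key, _, value = line.partition("=")
                fields[key] = value

def build_sync_file(status_path, sync_path, slave_hosts):
    """Copy only the blocks for hosts monitored by one slave into sync_path."""
    wanted = {"hoststatus", "servicestatus", "hostcomment",
              "servicecomment", "hostdowntime", "servicedowntime"}
    with open(sync_path, "w") as out:
        for block_type, fields in parse_status_blocks(status_path):
            if block_type in wanted and fields.get("host_name") in slave_hosts:
                out.write(block_type + " {\n")
                for key, value in fields.items():
                    out.write(f"\t{key}={value}\n")
                out.write("\t}\n")
```

Filtering per slave keeps each //sync.dat// small: a slave only receives state for the hosts it is responsible for, not the master's entire view.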
Additionally, every 15 minutes, via a cron job run as the nagios user, each slave node creates a //sync.dat.node.{nodename}// file to send to all the other nodes in its cluster. When a take over occurs, this state information is read into Nagios Core before the hosts are set to be actively monitored.
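The peer exchange might look like the following sketch. The function name, paths and the use of scp are all assumptions for illustration; the actual Opsview cron job and transport differ:

```python
import shutil
import subprocess

def push_state_to_peers(status_path, nodename, peers, var_dir):
    """Snapshot this node's state as sync.dat.node.{nodename} and copy
    the snapshot to every peer node in the cluster (scp is used here
    purely for illustration)."""
    snapshot = f"{var_dir}/sync.dat.node.{nodename}"
    shutil.copy(status_path, snapshot)  # freeze the current state
    for peer in peers:
        subprocess.run(["scp", snapshot, f"{peer}:{var_dir}/"], check=True)
    return snapshot
```

Because each node pushes its snapshot proactively, a peer that later takes over already has state that is at most one cron interval (15 minutes) old.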
In all cases, the slave uses the information in the synchronisation file. However, if the //last check// time on an object is newer than the data from the master, the slave ignores that state information. Downtimes and comments are added only if no similar downtime or comment already exists.
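This acceptance rule can be summed up in a few lines. The field name //last_check// follows the Nagios status.dat convention, but the functions themselves are a sketch, not the real implementation:

```python
def merge_synced_state(local, synced):
    """Return the record to keep for one host or service."""
    if int(synced.get("last_check", 0)) > int(local.get("last_check", 0)):
        return synced   # master's data is fresher: accept it
    return local        # slave checked more recently: keep its own data

def add_if_absent(existing, new_entry):
    """Add a synced downtime or comment only if no similar one exists."""
    if new_entry not in existing:
        existing.append(new_entry)
    return existing
```

The timestamp comparison is why clock synchronisation between master and slaves matters, as the Limitations section below explains.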
===== Limitations =====
It is critical that time is synchronised between the Opsview master and its slaves, as state information is only processed if it is newer than the slave's current information for each host/service.
There are windows where the state information may not be completely up to date:
  * if a state change occurs, or an acknowledgement or downtime is set, between Opsview reload time and the slave starting to monitor
  * the state information from the failing slave could be up to 15 minutes stale at take over time
  * the state information sent from the master when it notices a slave has recovered may arrive late, as there is a latency before the master notices the recovery. If the slave checks a host or service before the master sends the latest information, the slave will have more recent information and ignore the sync request for that object