opsview4.6:importantconcepts [2014/09/09 12:19] (current)
 +====== Important Concepts ======
 +This page describes the important concepts of how Opsview, using Nagios® Core, monitors your environment.
 +===== Host and Services =====
 +A service is a check of some sort. A service is the most granular thing in Nagios Core (a service can either be active or passive, but we assume active for the moment).
 +These active checks will have a status, one line of output and optionally some performance data.
 +
 +Hosts are containers for a set of services. A host is a logical grouping of all the services together. Services have to be associated with a host - they cannot exist without a host as their "parent container".
 +
 +Services are regularly checked, based upon their frequency or check interval. Each service is checked independently of other services and regardless of the state of their associated host.
 +
 +Hosts are also checked, but only "on demand" when one of their services has changed state. Note, however, that it is possible to have a host checked regularly by changing its [[opsview4.6:host#check_interval|host check interval]] to a non-zero value.
 +
 +We recommend that every host always has at least one service check (which is very similar to the host check) because:
 +  * services are checked regularly - hosts are usually checked only on demand
 +  * Opsview's [[opsview4.6:access#selection_of_objects|access control object selection]] requires a service to be on a host group
 +  * reports are against services
 +  * you get performance graphs against services (not hosts)
 +
 +**Note**: Hosts without any services will **not** be shown in the Host Group Hierarchy status pages.
 +
 +==== States ====
 +
 +Services have one of 4 possible states:
 +  * OK - Everything fine
 +  * CRITICAL - Something is wrong
 +  * WARNING - Something may be wrong
 +  * UNKNOWN - There is some internal error with the check such as incorrect parameters
 +
 +The last 3 states are collectively called //problem states//.
 +
 +Hosts have one of 3 possible states:
 +  * UP - Host is okay
 +  * DOWN - Host has a problem
 +  * UNREACHABLE - All the parents of this host are in a failure state. This is a calculated state based on the parent/child relationship dependency of a host
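
The way the three host states relate can be sketched in a few lines of Python. This is only an illustration of the rule described above, not Nagios Core's actual implementation; the function name ''host_state'' and its arguments are hypothetical.

```python
from typing import List

# Host states that count as a failure when they appear on a parent.
FAILURE_STATES = {"DOWN", "UNREACHABLE"}


def host_state(check_ok: bool, parent_states: List[str]) -> str:
    """Return UP, DOWN or UNREACHABLE for a host.

    check_ok      -- whether the host's own check succeeded
    parent_states -- current states of the host's parents (may be empty)
    """
    if check_ok:
        return "UP"
    # A failed host is UNREACHABLE only if it has parents and all of them have failed.
    if parent_states and all(s in FAILURE_STATES for s in parent_states):
        return "UNREACHABLE"
    return "DOWN"
```

For example, a failed host behind two parents is reported as DOWN while at least one parent is still UP, and as UNREACHABLE once both parents have failed.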
 +
 +It is possible that if a host is marked as //down// or //unreachable//, not all of its services will be in a failure state. This is because:
 +  * the service is checked only once a day. If the host fails during the day, then that service will only fail when it is next attempted, at its next polling cycle
 +  * the service is a passive check and no new results will change its state
 +  * the service is "misconfigured", so it is associated with the wrong host
 +
 +
 +===== Plugins =====
 +All active checks use a plugin. This plugin will have the actual logic to know if something is working or not. Plugins know how to communicate with a DNS server, or how to interrogate for free filesystem space, or how to get a web page.
 +
 +The same plugin can be used many times for different services. It takes parameters to determine how to test something or what the threshold levels are.
 +
 +The parameters available are dependent on the plugin used.
 +
 +Examples of plugins include:
 +
 +==== check_disk ====
 +  * Check 1, many or all disks for their free space
 +  * Check for available inode space
 +  * Can filter disks by type or by regular expression of the name
 +
 +==== check_http ====
 +  * Check that a URL returns a response
 +  * Check that the contents of the response contain a certain string
 +  * Check that the response is returned within a certain amount of time
 +  * Check that the response has the expected status code
 +
 +After a plugin has run, it must return a status code to Opsview - which maps to one of the OK, WARNING, CRITICAL or UNKNOWN statuses.
 +
 +The plugin may also return some optional performance data, which Opsview will record and can later be used in performance graphing.
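
As a rough illustration of what a plugin does, here is a minimal Python sketch of a disk usage check. It follows the usual Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN), with performance data after a ''|'' on the output line. The function names, the thresholds and the ''used_pct'' label are assumptions made for this example; this is not the real ''check_disk'' plugin.

```python
import shutil

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_pct: float, warn_pct: float, crit_pct: float) -> int:
    """Map a disk usage percentage onto a plugin exit code."""
    if warn_pct > crit_pct:
        return UNKNOWN  # incorrect parameters: an internal error with the check
    if used_pct >= crit_pct:
        return CRITICAL
    if used_pct >= warn_pct:
        return WARNING
    return OK


def format_output(code: int, used_pct: float, warn_pct: float, crit_pct: float) -> str:
    """Build the one line of output, followed by '|' and the performance data."""
    return (f"DISK {STATE_NAMES[code]} - {used_pct:.1f}% used"
            f" | used_pct={used_pct:.1f}%;{warn_pct:g};{crit_pct:g}")


def main(path: str = "/", warn_pct: float = 80.0, crit_pct: float = 90.0) -> int:
    """Check one filesystem, print the result line and return the exit code."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    code = evaluate(used_pct, warn_pct, crit_pct)
    print(format_output(code, used_pct, warn_pct, crit_pct))
    return code

# A real plugin would end with: sys.exit(main())
```

The same code can back many services simply by passing a different path or different thresholds, which is the point of plugin parameters.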
 +
 +===== Check: Active versus Passive =====
 +A check can be either:
 +  * Active - run on a periodic basis to check that the thing is ok
 +  * Passive - where results arrive into the system, such as log alerts, backups started or SNMP traps that are received
 +
 +The main difference is that when the service goes into a failure state, active checks will automatically go back into an OK state when the problem is resolved, whereas passive services need to have their state manually changed back to OK.
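
In Nagios Core, a passive result is usually delivered as a ''PROCESS_SERVICE_CHECK_RESULT'' external command. The sketch below only builds such a command line, for instance to set a passive service back to OK; where the line has to be written, and how Opsview expects passive results to be submitted, is not covered on this page. The function name and the host and service names in the example are hypothetical.

```python
import time
from typing import Optional


def passive_result_command(host: str, service: str, return_code: int,
                           output: str, timestamp: Optional[int] = None) -> str:
    """Format a Nagios Core external command that submits a passive service result."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    if timestamp is None:
        timestamp = int(time.time())  # default to the current time
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{output}"
```

Calling ''passive_result_command("web01", "Backup", 0, "OK - reset by operator")'' produces a line that reports the ''Backup'' service on ''web01'' as OK.
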
 +===== Frequency and State Changes =====
 +An active check will have a //frequency// value which determines how often to run a check. This is the normal check interval.
 +
 +A //state change// is when a service goes from one state to another state.
 +
 +There are two //state types//: HARD and SOFT. The idea is that you can have a soft failure which means that a service is ''likely to be a problem soon''.
 +
 +There are two important parameters to determine the soft and hard state changes:
 +  * retry interval - this is usually a smaller value than the frequency
 +  * maximum check attempts - the number of times the check stays in a failure state before it becomes a hard state change
 +
 +A hard state can also occur if the host of a service is in a DOWN or UNREACHABLE state - this means that a service could show a check attempt of 1/3 and still be in a HARD state.
 +
 +**Note**: If a service transitions from a failure state to a different failure state, the check attempt will still increment. This means that a service in a CRITICAL HARD state will still be in HARD state if it goes into an UNKNOWN state. Only an OK state will reset the check attempt value.
 +
 +The main reason for distinguishing the two state types is that notifications are sent only on hard state changes, which provides a way to avoid sending a notification for a transient problem.
 +
 +
 +==== Example: Soft and Hard State Changes ====
 +Assuming the default Opsview configuration, checks normally run at an interval of 5 minutes, retries run every minute, and a check enters a HARD state after 3 tries. So in a failure scenario, the following sequence will occur:
 +  * time=0 minutes: Check runs and returns OK
 +  * time=5 minutes: check enters a soft critical state
 +  * time=6 minutes: the check is retried, state remains critical but soft
 +  * time=7 minutes: the check is retried, the state is now hard and notifications will be generated if appropriate
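
The sequence above can be reproduced with a small simulation. This is a simplified sketch, not Nagios Core's scheduler: the function name ''simulate'' is hypothetical, an OK result is always reported as HARD, and the case where a host in a DOWN or UNREACHABLE state forces a HARD state is ignored.

```python
from typing import List, Tuple


def simulate(results: List[str], interval: int = 5, retry_interval: int = 1,
             max_attempts: int = 3) -> List[Tuple[int, str, str, int]]:
    """Return (time in minutes, state, state type, check attempt) for each check."""
    timeline = []
    now = 0
    failures = 0  # consecutive non-OK results, capped at max_attempts
    for state in results:
        if state == "OK":
            failures = 0  # only an OK result resets the check attempt
            state_type = "HARD"
        else:
            # Any failure state increments the attempt, even if the state differs.
            failures = min(failures + 1, max_attempts)
            state_type = "HARD" if failures == max_attempts else "SOFT"
        timeline.append((now, state, state_type, max(failures, 1)))
        # Retry quickly while the problem is still soft, else use the normal interval.
        now += retry_interval if state_type == "SOFT" else interval
    return timeline
```

With the defaults, ''simulate(["OK", "CRITICAL", "CRITICAL", "CRITICAL"])'' yields checks at 0, 5, 6 and 7 minutes, and only the last one is CRITICAL HARD with check attempt 3.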
 +
 +
 +===== Notifications =====
 +In Opsview, a contact can have multiple [[opsview4.6:notificationprofile|notification profiles]]. Each profile defines which hosts or services will send alerts and which [[opsview4.6:notificationmethods|notification methods]] they use.
 +
 +If you want to temporarily stop notifications, set [[opsview4.6:downtime|downtime]] for the relevant objects. While you can disable notifications through Nagios Core CGI screens, we recommend downtime instead because:
 +  * views will update appropriately and consider the item as //handled//
 +  * you can list all downtimes to see what is outside of normal monitoring state
 +  * [[opsview4.6:odw|ODW]] statistics will distinguish between scheduled and unscheduled availability
 +  * an Opsview reload will remove any notification settings for objects to ensure consistency
 +
 +If you really do not want notifications for a host or service, you can filter at the host or service level which will affect it permanently.
 +
 +===== References =====
 +Further information can be found in the [[http://nagios.sourceforge.net/docs/3_0/toc.html|Nagios Core]] documentation under //The Basics//.
 +
 +
 +====== Advanced Concepts ======
 +This page describes the advanced concepts of how Opsview, using Nagios Core, monitors your environment.
 +
 +===== Flap Detection =====
 +//Flapping// is a special condition which a host or service could be in if it changes states "too often". During this flapping period, all notifications will cease, thus avoiding notification storms.
 +
 +Nagios Core calculates the //state change rate// for a host or a service by storing the last 21 states and calculating the amount of change between those states.
 +
 +If the //state change rate// is above a high water mark, then the host or service is considered to be //in a flapping state// (or flapping start). The //state change rate// needs to drop below the low water mark before it is considered to be //out of flapping// (or flapping stop).
 +
 +These water marks are hard coded in Opsview's configuration generation script:
 +  * low_service_flap_threshold=5.0
 +  * high_service_flap_threshold=20.0
 +  * low_host_flap_threshold=5.0
 +  * high_host_flap_threshold=20.0
 +
 +These can be overridden in [[opsview4.6:configuration_files#overrides|the overrides section of opsview.conf]].
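
The two water marks form a hysteresis, which the following sketch illustrates. It is a simplification: the rate here is an unweighted percentage of state changes across the stored history, whereas Nagios Core gives recent changes more weight than older ones. The function names are hypothetical, and the default thresholds are the values listed above.

```python
from typing import List


def state_change_rate(history: List[str]) -> float:
    """Percentage of adjacent pairs in the history whose states differ."""
    if len(history) < 2:
        return 0.0
    changes = sum(1 for old, new in zip(history, history[1:]) if old != new)
    return 100.0 * changes / (len(history) - 1)


def update_flapping(is_flapping: bool, rate: float,
                    low: float = 5.0, high: float = 20.0) -> bool:
    """Apply the low and high water marks to decide whether flapping continues."""
    if not is_flapping and rate > high:
        return True   # flapping start
    if is_flapping and rate < low:
        return False  # flapping stop
    return is_flapping  # between the two marks, the previous condition is kept
```

With 21 stored states there are 20 possible state changes, so 5 changes give a rate of 25%, which is above the high water mark of 20% and starts flapping. The rate must then fall below 5% before flapping stops.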
 +
 +Further information about flap detection can be found in the [[http://nagios.sourceforge.net/docs/3_0/flapping.html|Nagios Core]] documentation about Flapping.