
Important Concepts

This page describes the important concepts of how Opsview, using Nagios® Core, monitors your environment.

Host and Services

A service is a single check, and it is the most granular object in Nagios Core. A service can be either active or passive; this section assumes active checks unless stated otherwise.

These active checks will have a status, one line of output and optionally some performance data.

Hosts are containers for a set of services: a host is a logical grouping of all its services. Every service has to be associated with a host and cannot exist without a host as its “parent container”.

Services are regularly checked, based upon their frequency or check interval. Each service is checked independently of other services and regardless of the state of their associated host.

Hosts are also checked, but only “on demand” when one of their services changes state. It is possible, however, to have a host checked regularly by setting its host check interval to a non-zero value.

We recommend that every host always has at least one service check (it can be very similar to the host check) because:

  • services are checked regularly - hosts are usually checked only on demand
  • Opsview's access control object selection requires a service to be on a host group
  • reports are against services
  • you get performance graphs against services (not hosts)

Note: Hosts without any services will not be shown in the Host Group Hierarchy status pages.

States

Services have one of 4 possible states:

  • OK - Everything fine
  • CRITICAL - Something is wrong
  • WARNING - Something may be wrong
  • UNKNOWN - There is some internal error with the check such as incorrect parameters

The last 3 states are collectively called problem states.

Hosts have one of 3 possible states:

  • UP - Host is okay
  • DOWN - Host has a problem
  • UNREACHABLE - All the parents of this host are in a failure state. This is a calculated state based on the parent/child relationship dependency of a host
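The difference between DOWN and UNREACHABLE can be illustrated with a minimal sketch. The function name and arguments are hypothetical and are not part of Opsview or Nagios Core; the sketch only assumes the rule stated above, that a failing host is UNREACHABLE when every one of its parents is itself in a failure state.

```python
def host_state(check_succeeded, parent_states):
    """Derive a host state from its own check result and its parents' states.

    check_succeeded: True if the host check itself passed.
    parent_states: list of parent host states ("UP", "DOWN" or "UNREACHABLE").
    """
    if check_succeeded:
        return "UP"
    # The host failed its check. If it has parents and none of them is UP,
    # the failure is attributed to the path to the host, not the host itself.
    if parent_states and all(state != "UP" for state in parent_states):
        return "UNREACHABLE"
    return "DOWN"
```

For example, a failing host behind two routers is reported as DOWN while at least one router is UP, and as UNREACHABLE once both routers have failed.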

It is possible that if a host is marked as down or unreachable, not all of its services will be in a failure state. This is because:

  • the service is checked only once a day. If the host fails during the day, then that service will only fail when it is attempted at its next polling cycle
  • the service is a passive check and no new results will change its state
  • the service is “misconfigured”, so it is associated with the wrong host

Plugins

All active checks use a plugin. This plugin will have the actual logic to know if something is working or not. Plugins know how to communicate with a DNS server, or how to interrogate for free filesystem space, or how to get a web page.

The same plugin can be used many times for different services. It takes parameters to determine how to test something or what the threshold levels are.

The parameters available are dependent on the plugin used.

Examples of plugins include:

check_disk

  • Check 1, many or all disks for their free space
  • Check for available inode space
  • Can filter disks by type or by regular expression of the name

check_http

  • Check that a URL returns a response
  • Check that the contents of the request contains a certain string
  • Check that the request returns within a certain amount of time
  • Check if the request gives the expected status code

After a plugin has run, it must return a status code to Opsview - which maps to one of the OK, WARNING, CRITICAL or UNKNOWN statuses.

The plugin may also return some optional performance data, which Opsview will record and can later be used in performance graphing.
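As an illustration of the plugin contract described above, here is a minimal sketch of a free-space check written in Python. It assumes the conventional Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and the convention that performance data follows a "|" on the output line. The function names and default thresholds are hypothetical and do not reflect the options of the real check_disk plugin.

```python
import shutil

# Conventional plugin exit codes and the states they map to.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_free_space(free_pct, warn_pct, crit_pct):
    """Map a free-space percentage to a (status code, output line) pair."""
    if not 0 <= crit_pct <= warn_pct <= 100:
        # Incorrect parameters are an internal error of the check itself.
        return UNKNOWN, "DISK UNKNOWN - thresholds must satisfy 0 <= crit <= warn <= 100"
    if free_pct < crit_pct:
        code = CRITICAL
    elif free_pct < warn_pct:
        code = WARNING
    else:
        code = OK
    # Text before "|" is the one line of output; text after it is performance data.
    return code, f"DISK {STATE_NAMES[code]} - {free_pct:.1f}% free | free_pct={free_pct:.1f}%"


def main(path="/", warn_pct=20.0, crit_pct=10.0):
    """Check one filesystem, print the output line and return the status code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - cannot read {path}: {exc}")
        return UNKNOWN
    if usage.total == 0:
        print(f"DISK UNKNOWN - {path} reports a total size of zero")
        return UNKNOWN
    code, line = evaluate_free_space(100.0 * usage.free / usage.total, warn_pct, crit_pct)
    print(line)
    return code
```

A real plugin would finish by passing the return value of main() to sys.exit() so that the status code reaches the monitoring system as the process exit code.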

Check: Active versus Passive

A check can be either:

  • Active - run on a periodic basis to check that the thing is ok
  • Passive - where results arrive into the system, such as log alerts, backups started or SNMP traps that are received

The main difference is that when the service goes into a failure state, active checks will automatically go back into an OK state when the problem is resolved, whereas passive services need to have their state manually changed back to OK.

Frequency and State Changes

An active check will have a frequency value which determines how often to run a check. This is the normal check interval.

A state change is when a service goes from one state to another state.

There are two state types: HARD and SOFT. The idea is that you can have a soft failure which means that a service is likely to be a problem soon.

There are two important parameters to determine the soft and hard state changes:

  • retry interval - this is usually a smaller value than the frequency
  • maximum check attempts - the number of times the check stays in a failure state before it becomes a hard state change

A hard state can also occur if the host of a service is in a DOWN or UNREACHABLE state - this means that a service could show a check attempt of 1/3 and still be in a HARD state.

Note: If a service transitions from a failure state to a different failure state, the check attempt will still increment. This means that a service in a CRITICAL HARD state will still be in HARD state if it goes into an UNKNOWN state. Only an OK state will reset the check attempt value.

The main reason for distinguishing soft and hard states is that notifications are sent only on hard state changes, which avoids sending a notification for a transient problem.

Example: Soft and Hard State Changes

Assuming the default Opsview configuration, checks normally run at an interval of 5 minutes, retries run every minute, and a check needs 3 attempts before it enters a HARD state. So in a failure scenario, the following sequence will occur:

  • time=0 minutes: Check runs and returns OK
  • time=5 minutes: check enters a soft critical state
  • time=6 minutes: the check is retried, state remains critical but soft
  • time=7 minutes: the check is retried, the state is now hard and notifications will be generated if appropriate

Notifications

In Opsview, a contact can have multiple notification profiles. Each profile defines which hosts or services will send alerts, and through which notification methods.

If you want to temporarily stop notifications, set downtime for the relevant objects. While you can disable notifications through Nagios Core CGI screens, we recommend downtime instead because:

  • views will update appropriately and consider the item as handled
  • you can list all downtimes to see what is outside of normal monitoring state
  • ODW statistics will distinguish between scheduled and unscheduled availability
  • an Opsview reload will remove any notification settings for objects to ensure consistency

If you really do not want notifications for a host or service, you can filter at the host or service level which will affect it permanently.

References

Further information can be found in the Nagios Core documentation under The Basics.

Advanced Concepts

This page describes the advanced concepts of how Opsview, using Nagios Core, monitors your environment.

Flap Detection

Flapping is a special condition which a host or service could be in if it changes states “too often”. During this flapping period, all notifications will cease, thus avoiding notification storms.

Nagios Core calculates the state change rate for a host or a service by storing the last 21 states and calculating the amount of change between those states.

If the state change rate is above a high water mark, then the host or service is considered to be in a flapping state (or flapping start). The state change rate needs to drop below the low water mark before it is considered to be out of flapping (or flapping stop).

These water marks are hard coded in Opsview's configuration generation script:

  • low_service_flap_threshold=5.0
  • high_service_flap_threshold=20.0
  • low_host_flap_threshold=5.0
  • high_host_flap_threshold=20.0

These can be overridden in the overrides section of opsview.conf.
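The high and low water mark behaviour can be sketched as follows. This is a simplified illustration with hypothetical function names: it counts every state change equally, whereas Nagios Core, as far as its documentation describes, gives more weight to recent changes. The default thresholds are the service values listed above.

```python
def percent_state_change(history):
    """Unweighted percentage of adjacent state pairs in history that differ."""
    if len(history) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return 100.0 * changes / (len(history) - 1)


def update_flapping(is_flapping, history, low=5.0, high=20.0):
    """Return the new flapping status given the most recent states.

    Only the last 21 states are considered. Between the two water marks
    the previous status is kept, which prevents rapid toggling.
    """
    pct = percent_state_change(history[-21:])
    if not is_flapping and pct > high:
        return True    # flapping start
    if is_flapping and pct < low:
        return False   # flapping stop
    return is_flapping
```

With 21 stored states there are 20 possible changes, so in this unweighted sketch five changes (25%) start flapping, while a flapping service must fall to zero changes (0%) to get below the 5% low water mark.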

Further information about flap detection can be found in the Nagios Core documentation about Flapping.
