
Business Service Monitoring

Business Service Monitoring, or BSM, is an Opsview feature that groups related service checks together to give a high-level overview of your systems.

A business service is an important part of your business, such as your public website. It consists of multiple components, which are groups of hosts with similar service checks. Opsview calculates the state of each component, taking into account redundancy and failover, so that you get up-to-date information on each layer. This is then aggregated to the business service level to give an overall health status for business and technical owners.

Access Control

BSM requires the Business Service Monitoring feature.

Users require the BSM access control. Without this access, the BSM menu options will be hidden and the dashlets will not be visible to add.

Business Services

You can control which Business Services are available for a role using either a “top down” or a “bottom up” approach.

Top Down

Choose the specific business service in the Authorised for Business Services field. If you select View All, then all Business Services will be available including any new Business Services created in future. Access to the business service will allow visibility of all components of the business service.

Bottom Up

Components will be automatically selected based on your existing status object permissions (based on host group / service group intersection or keywords). See below in the Components section for more information.

If you select the “Grant permissions to Business Services” tickbox, then the business services associated with all your components will be visible.

Editing Permission

Editing permission is only available from the top down approach.

You need to have CONFIGUREBSM access to configure Business Services. You can then select which specific business services can be edited by the role. When editing a business service, all its existing components will be listed. However, the list of components available to add to the business service is limited to the components that the role is able to view.

Components

Permissions to components are automatically granted based on the existing status object definitions. If VIEWALL is specified, then all components are visible. Otherwise, visibility is based on the host group/service group intersection and keywords. You need to be able to see all hosts of a component for the component to be visible.

Note that if the component consists of 20 service checks, then the user will need to have permission to all 20 service checks on that component for the component to be visible.

If CONFIGUREBSMCOMPONENTS is given to a user, then the user will be able to edit and create new components.

Configuration

Components

This screen requires CONFIGUREBSMCOMPONENTS access.

Choose the component from the left hand side. You can edit the name. You can select the host template - the drop down list will only show host templates in use.

The hosts list will populate based on the hosts using the selected host template and your access permissions. You can select multiple hosts.

The Hosts Count will be automatically populated and you can drag the sliders for the Allowed to fail and Required Online fields.

Note that when you change the operational zone percent, it can take some time before the state is recalculated for display in any dashlets.

We recommend you clone any host templates delivered with Opsview rather than updating them, as future Opsview upgrades may change the base host templates.

Business Services

This screen requires CONFIGUREBSM access.

Clicking on a business service allows you to edit the name and the list of components. Drag any new components from the Components Drawer to the “Components selected for BSM” grid. The order is retained and is used as the initial order of components in some views.

The operational zone percent and number of hosts are filled in if the user has permission to see this information, otherwise they will show “Unavailable”.

The Components Drawer will only show components that you have permission to see. The Components Selected may include components that you do not have permission for - if these are deleted from the business service, you cannot re-add them.

Status Logic

Component Hosts

Note that for the purposes of BSM, the host only consists of the service checks related to the host template used by the component - the host state is not taken into account.

The host can be one of three calculated states:

  • Operational - if no service checks are in a CRITICAL state
  • Failed - at least 1 service check is in a CRITICAL state. Service checks in downtime are ignored
  • Downtime - all CRITICAL service checks are in a downtime state

Additionally, there is one calculated flag:

  • Acknowledged - if all the CRITICAL service checks are acknowledged, then the component host is acknowledged

Note: The soft or hard state of the service check is not considered - the latest state is always used.

Note: If you set downtime and there are no failed services, then an operational state is used. This is to cover scenarios where downtime of 2 hours is recorded, but only 15 minutes is used. This allows the host to be marked as downtime only during the time there were actual failures.

Note: It is possible that for a host, the service checks are UNKNOWN yet the host is DOWN. From a BSM perspective, the host is considered to be OPERATIONAL because there are no CRITICAL service checks. This would be an error in configuration as the service check should be CRITICAL to show a severe error.
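The component host rules above can be sketched in a few lines of Python. This is only an illustration of the documented logic, not Opsview's implementation: the ServiceCheck class and the component_host_state function are hypothetical names, and the sketch assumes that only the latest state of each service check is supplied.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class ServiceCheck:
    """Hypothetical record for one service check of the component's host template."""
    state: str                  # latest state: "OK", "WARNING", "CRITICAL" or "UNKNOWN"
    in_downtime: bool = False
    acknowledged: bool = False


def component_host_state(checks: Iterable[ServiceCheck]) -> Tuple[str, bool]:
    """Return (state, acknowledged flag) for one component host."""
    critical = [c for c in checks if c.state == "CRITICAL"]
    if not critical:
        # No CRITICAL checks: operational, even if downtime is scheduled
        # or the checks are UNKNOWN
        return "OPERATIONAL", False
    # The flag is set only when every CRITICAL check is acknowledged
    acknowledged = all(c.acknowledged for c in critical)
    if all(c.in_downtime for c in critical):
        return "DOWNTIME", acknowledged
    # At least one CRITICAL check is not covered by downtime
    return "FAILED", acknowledged
```

For example, a host whose only CRITICAL check is in downtime evaluates to DOWNTIME, while a host with scheduled downtime but no CRITICAL checks stays OPERATIONAL.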

Components

This is calculated from the component host states and can be one of three calculated states:

  • Operational - if no hosts are failed or there are enough hosts to satisfy the operational level, then the component is operational
  • Downtime - means all hosts are in a downtime state
  • Failed - otherwise component is failed

Additionally, there are two calculated flags:

  • Acknowledged - if all the failed hosts are acknowledged, then the component is acknowledged
  • Impacted - if the state is operational and there is at least 1 host failed, then the component is impacted

The operational zone percent is calculated as hosts_required_online / hosts_total * 100. If there are not enough operational hosts, then the component is failed. Hosts in downtime are not counted, but have the effect of making failed hosts more important.

Note: Due to the operational zone percent, it is possible that a component is in an operational state with failed hosts. If those failed hosts are acknowledged, then the component will also be acknowledged, so you could have an acknowledged icon on an operational component.
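As a rough sketch of the component rules and the operational zone formula, the Python below may help. The function names are hypothetical, and the order in which the rules are evaluated (all hosts in downtime is checked first) is an assumption, since the text above does not state a precedence.

```python
from typing import List, Tuple


def operational_zone_percent(hosts_required_online: int, hosts_total: int) -> float:
    """hosts_required_online / hosts_total * 100, as given in the text."""
    return hosts_required_online / hosts_total * 100


def component_state(host_states: List[str], hosts_required_online: int) -> Tuple[str, bool]:
    """Return (state, impacted flag) from a list of component host states.

    host_states holds "OPERATIONAL", "FAILED" or "DOWNTIME" for each host.
    """
    failed = host_states.count("FAILED")
    operational = host_states.count("OPERATIONAL")
    # Assumption: the all-downtime rule is checked before the operational rule
    if host_states and all(s == "DOWNTIME" for s in host_states):
        return "DOWNTIME", False
    # Hosts in downtime are not counted towards the required online hosts
    if failed == 0 or operational >= hosts_required_online:
        # Operational with at least one failed host means impacted
        return "OPERATIONAL", failed > 0
    return "FAILED", False
```

With 4 hosts and 3 required online, the operational zone percent is 75. One failed host leaves the component operational but impacted, while two failed hosts make it failed.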

Business Services

This can be one of three calculated states:

  • Offline - means at least 1 component has failed
  • Downtime - means at least 1 component is in a downtime state
  • Operational - otherwise, it means everything is working fine

Additionally, there are two calculated flags:

  • Acknowledged - if all the impacted and failed components are acknowledged then the business service is acknowledged
  • Impacted - the business service is impacted if any component is impacted. This means it is possible to have a business service in an OFFLINE state and be impacted
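The business service rules can be illustrated with the following Python sketch. The Component class and the function name are hypothetical. Two points are assumptions rather than documented behaviour: the states are evaluated in the order listed above, and the acknowledged flag is only set when there is at least one impacted or failed component.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Component:
    """Hypothetical summary of one component within a business service."""
    state: str                  # "OPERATIONAL", "FAILED" or "DOWNTIME"
    impacted: bool = False
    acknowledged: bool = False


def business_service_state(components: List[Component]) -> Tuple[str, bool, bool]:
    """Return (state, impacted flag, acknowledged flag) for a business service."""
    if any(c.state == "FAILED" for c in components):
        state = "OFFLINE"
    elif any(c.state == "DOWNTIME" for c in components):
        state = "DOWNTIME"
    else:
        state = "OPERATIONAL"
    # Impacted is independent of the state, so OFFLINE and impacted can coexist
    impacted = any(c.impacted for c in components)
    problems = [c for c in components if c.impacted or c.state == "FAILED"]
    # Assumption: with no problem components, the flag stays unset
    acknowledged = bool(problems) and all(c.acknowledged for c in problems)
    return state, impacted, acknowledged
```

A business service with one failed component and one impacted component is therefore OFFLINE and impacted, and it is only acknowledged once both components are acknowledged.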

Views

Availability is calculated as the % of time that the host, component or business service is in an OPERATIONAL state over the last 24 hours. The initial state is considered to be OPERATIONAL, so availability will start at 100% but will be recalculated as soon as the object is created.
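To make the availability figure concrete, here is a minimal Python sketch of a rolling 24 hour calculation. How Opsview stores state history is not described here, so the list of (epoch seconds, new state) transitions is a hypothetical input and the function name is illustrative only.

```python
from typing import List, Tuple

WINDOW_SECONDS = 24 * 60 * 60  # the last 24 hours


def availability_percent(transitions: List[Tuple[int, str]], now: int) -> float:
    """Percentage of the last 24 hours spent in the OPERATIONAL state.

    transitions is a time-ordered list of (epoch_seconds, new_state) pairs.
    """
    window_start = now - WINDOW_SECONDS
    state = "OPERATIONAL"       # the initial state is considered OPERATIONAL
    cursor = window_start       # start of the interval currently being measured
    operational_seconds = 0
    for when, new_state in transitions:
        if when > now:
            break
        if when > window_start:
            # Close the interval that ends at this transition
            if state == "OPERATIONAL":
                operational_seconds += when - cursor
            cursor = when
        state = new_state
    # Account for the interval from the last transition up to now
    if state == "OPERATIONAL":
        operational_seconds += now - cursor
    return operational_seconds / WINDOW_SECONDS * 100
```

An object that failed 6 hours ago and recovered 3 hours ago was operational for 21 of the last 24 hours, which gives an availability of 87.5.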

BSM Summary

This displays a summarised state of all business services. It is ordered by status so problems float to the top.

The business services will resize to fit the available space in the dashlet. Clicking on a grid header will sort by that field. You can hide the grid by clicking on the control.

The business service will show an indicator in the top right corner if it is acknowledged.

The business service will show an indicator in the bottom left corner if it is impacted.

You can click on any BSM object to get a contextual menu.

You can then go into an Investigate mode to see details of any object.

You can add the investigate view as a dashlet by clicking the Pin button in the top right.

BSM Service

Displays the business service and all its components and the state of the hosts.

The outline around the hosts is the operational zone, which is a visual indicator to denote how many hosts are required to consider the component as operational. The state of the component can be summarised as “the worst state of the hosts in the operational zone”.

Note that the order of hosts in the component layer is OPERATIONAL, followed by FAILED, followed by DOWNTIME. This is so any failed hosts that may occur during a downtime are considered to be more important and get pushed into the operational zone and thus possibly affect the component status.

If a host is not visible based on your access permissions, you will see a grey box with a text message of “You do not have access to the hosts”.

BSM Component

Displays the component and the related business services and the state of the hosts.

This dashlet also shows all the related business services.

Note that the order of the hosts is FAILED, followed by DOWNTIME, followed by OPERATIONAL. This is so the most important host states are listed first.

Notifications

Notifications are configured by using Opsview's personal notification profiles and shared notification profiles.

Note that BSM notifications are handled on a per profile basis. This means that if you have multiple profiles with overlapping BSM objects, you will get a notification from each profile. This differs from Nagios Core (R), where notifications are collapsed to only one per host/service.

Triggers

Notification profiles have various triggers:

  • offline - if a business service changes to offline. Only applicable to business services
  • failed - if a component changes to failed. Only applicable to components
  • impacted - if a business service or component changes from being not impacted to being impacted
  • availability - if a business service or component drops below the percent threshold

Availability is calculated every minute and is the percentage of time that the business service or component was operational over the last 24 hours. This applies to availability breaches and recoveries.

You can also request notifications on recovery, which will trigger when a business service or component returns to operational from one of the other states or when the availability goes back above the threshold. If you have a combination of state change and availability triggers, then you will get a recovery notification when the state recovers, followed by a new notification if availability is below the threshold.

When a new business service or a component is created, it is assumed to be operational with 100% availability.

If Opsview is stopped, the information about re-notifications of all the business services and components will be saved and restored when Opsview is restarted. The state of the business service or component is assumed to be the same during an Opsview offline period.

Generally, problem notifications will be prioritised over availability notifications so a problem recovery will be required before availability breaches are sent. If you require availability notifications to be always sent, we recommend you create one notification profile for problem states and another one for availability breaches.

Re-notifications

For all notifications that are sent, if a renotification period is specified, then Opsview will do a notification check after that renotification period has elapsed. If the business service is still in a condition that requires notification - because it is in the same state or is still in breach of the availability threshold - a subsequent notification will be sent. The notification number will be incremented, to distinguish it from an initial notification.

If “alert from” is set to greater than 1, then only notifications starting from this number will be sent. For instance, if the re-notification interval is set to 30 minutes and alert from is set to 3, then the first notification to be sent will be number 3, which is 1 hour after the initial notification was due. Note that recovery notifications will respect this value.
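The arithmetic behind the “alert from” example can be written as a small Python helper. The function name is hypothetical, and the sketch assumes that notification numbers advance by one at every re-notification interval.

```python
def first_sent_delay_minutes(renotification_interval_minutes: int, alert_from: int) -> int:
    """Minutes between the initial notification and the first one actually sent.

    Notification 1 is due at 0 minutes, notification 2 after one interval,
    and so on, so notification N is due after (N - 1) intervals.
    """
    first_number = max(alert_from, 1)   # an alert from of 1 sends the initial notification
    return (first_number - 1) * renotification_interval_minutes
```

With an interval of 30 minutes and alert from set to 3, the helper returns 60, matching the 1 hour delay described above.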

If “stop alert after” is set to greater than 0, then there will not be any more renotifications after this number of alerts has been sent. However, the notification number will continue to increment and recovery notifications will still get sent regardless of the “stop alert after” value.

Notifications are only sent within the specified time period. Note that if the notification is blocked due to the timeperiod, the notification number will continue to increment.

The notification number will get reset to 1 whenever there is a change of status for an object - this is because it is considered to be a different kind of issue to investigate. Note, this differs from Nagios which keeps incrementing the notification number until a recovery.

Notification Methods

All notification methods are run on the master.

When Opsview sends a notification, environment variables will be set so that the notification method can customise the output.

Variables that will be set are:

  • OPSVIEW_OBJECTTYPE - One of “BSMService” or “BSMComponent”
  • OPSVIEW_NAME - Name of the business service or component
  • OPSVIEW_ID - ID of the business service or component
  • OPSVIEW_NOTES - A text output of the notes of this object. Any HTML will be stripped out leaving just the text. If the Include Notes Information is unselected, notes will be set to the empty string
  • OPSVIEW_TIMET - The time of the notification in epoch seconds
  • OPSVIEW_SHORTDATETIME - The time of the notification in strict ISO8601 format in the current timezone, eg: 2014-01-20T07:10:23
  • OPSVIEW_LONGDATETIME - The time of the notification in the form: “day_abbr
  • OPSVIEW_NOTIFICATIONTYPE - One of “PROBLEM”, “RECOVERY”, “AVAILABILITY_BREACH”, “AVAILABILITY_RECOVERY”
  • OPSVIEW_AVAILABILITY - Value of the availability for the object over the last 24 hours. Percent sign not included
  • OPSVIEW_AVAILABILITY_THRESHOLD - Value of the availability threshold
  • OPSVIEW_STATE - State of the object. One of “OFFLINE”, “FAILED”, “IMPACTED”, “DOWNTIME”, “OPERATIONAL”
  • OPSVIEW_STATETEXT - Text of the state of the object. You should not code against this as it could be internationalised in future
  • OPSVIEW_NOTIFICATIONNUMBER - The number of the notification, starting with 1
  • OPSVIEW_OUTPUT - Holds a string like this format to use directly in notification scripts: “NOTIFICATIONTYPE: SUMMARY: OBJECTTYPE NAME is STATETEXT, AVAILABILITY% availability”, where SUMMARY is a summary of the object such as “All components operational” or “2 of 3 components failed”
  • OPSVIEW_DETAIL - Multi line output of the state of sub objects. See example below

There are also contact specific variables that will get set:

  • OPSVIEW_CONTACTNAME - notification profile owner's username
  • OPSVIEW_CONTACTEMAIL - notification profile owner's email. Could be empty
  • OPSVIEW_CONTACTPAGER - notification profile owner's pager value. Could be empty
  • OPSVIEW_CONTACT_{NAME}=VALUE, where name/value pairs are on the contact's configuration page. This requires the notification method to state the required contact variables

And notification method specific variables:

  • OPSVIEW_NM_{NAME}=VALUE, where name/value pairs are on the notification method's configuration page

If you have any custom notification scripts, you can check for the existence of the OPSVIEW_NAME environment variable and then switch to using the Opsview information.
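A custom notification script could follow that pattern along these lines. This is a minimal sketch: the build_notification function and the dictionary it returns are hypothetical, and only environment variables listed above are read. Delivery of the message (email, pager and so on) is left out.

```python
import os
from typing import Mapping, Optional


def build_notification(env: Mapping[str, str]) -> Optional[dict]:
    """Build a message from the Opsview BSM variables, or None if they are absent."""
    if "OPSVIEW_NAME" not in env:
        # Not invoked for a BSM object: let the caller use its existing logic
        return None
    notification_type = env.get("OPSVIEW_NOTIFICATIONTYPE", "")
    return {
        "to": env.get("OPSVIEW_CONTACTEMAIL", ""),    # could be empty
        "subject": env.get("OPSVIEW_OUTPUT", ""),     # ready-made summary line
        "body": env.get("OPSVIEW_DETAIL", ""),        # multi line state of sub objects
        "is_recovery": notification_type in ("RECOVERY", "AVAILABILITY_RECOVERY"),
    }


if __name__ == "__main__":
    message = build_notification(os.environ)
    if message is None:
        print("OPSVIEW_NAME is not set; using the existing non-BSM code path")
    else:
        print(message["subject"])
        print(message["body"])
```

Keeping the message construction in a function that takes a mapping makes it easy to test the script with a plain dictionary instead of the real environment.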

If you have existing Nagios Core compatible notification scripts and do not want to change them, you can use the wrapper script “notify_via_nagios” which will convert the variables above into the traditional Nagios environment variables. Set the notification method's command as: “notify_via_nagios notify_by_pigeons” - this will allow the notification script to work as executed from Nagios or from Opsview. However, not all variables will be available and output may not be entirely appropriate. See the conversion script, notify_via_nagios, for details.

Architecture

opsviewhd is the main process. It listens for events from the import_ndologsd daemon and then calculates the status of all the component hosts, components and business services. It is responsible for adding status and history information and invoking any notifications.

The REST API is responsible for checking access control and returning the appropriate data for status displays.
