
Designing Your System

Scalability Considerations

Variables affecting how many devices can be monitored

  • Number of service checks per host
  • Median interval for service checks
  • Type of checks being executed (quality of plugin code, local execution versus agent queries, etc.)
  • Network latency
  • Stability of system being monitored (number of host / service state changes per minute)

We recommend 250 hosts as a comfortable limit for a single Opsview server, based on the following assumptions:

  1. 10 service checks per host (median)
  2. 5 minute interval per service check (median)
  3. Majority of service checks are against remote agents (NRPE / SNMP)
  4. Majority of monitored hosts are on the same LAN
  5. System is stable with little flux

This results in approximately 8 service checks per second being executed by the monitoring server (250 hosts * 10 checks / 300 seconds ≈ 8.3).

How to achieve scalability

Opsview's distributed architecture can be used to monitor large numbers of hosts. The Opsview system can be split into the following components, with each running on a dedicated server:

  • Web Server (Apache proxy) and Master Server (Application server and monitoring engine)
  • Database Server #1: Opsview and Runtime
  • Database Server #2: Opsview Data Warehouse
  • Slave Cluster:
    • Slave #1
    • Slave #2
    • Slave #n

A slave cluster contains two or more Opsview slaves; load is balanced across all nodes in the cluster.

Service Checks Per Second

At Opsview we're concerned with the number of 'checks per second' each monitoring server will be performing. Opsview recommends a supported figure of around 20-25 checks per second per server.

Let us use an example. Say we have 2000 hosts, with 10 checks per host, using a 5 minute (300 second) interval.

2000 (hosts) * 10 (service checks) / 300 (seconds) ≈ 66 service checks per second

This is over our figure of 20-25, so we would need to attach 3 slaves to the master, bringing the rate down to 22 checks per second per slave - well within our guidelines.

Remembering we can utilise each core to handle a separate worker thread, we can divide our figure of 66 by the total number of cores across the slave servers. For example, if each of our 3 slave servers has a dual-core CPU, that gives 6 cores in total, bringing the number of service checks per second on each core to 11.

 66 service checks per second / 6 (cores across 3 slave servers) = 11

How to achieve resilience

Scalability and resilience are not mutually exclusive; the same distributed architecture serves both. Resilience can be added by doubling up on components, e.g.:

  • Web Cluster:
    • Web #1
    • Web #2
  • Master Server (active)
  • Master Server (standby)
  • Database Server #1: Opsview and Runtime, replica of Opsview Data Warehouse
  • Database Server #2: Opsview Data Warehouse, replica of Opsview and Runtime
  • Slave Cluster:
    • Slave #1
    • Slave #2
    • Slave #3
    • Slave #n

Considerations:

  1. Population of Opsview Data Warehouse (ODW) can be disabled for short periods of time. Reporting against ODW can also be disabled to reduce load.
  2. When sizing a slave cluster, you should cater for at least one node failing; if the cluster is near capacity, the failure of one node may push the remaining nodes over capacity.
  3. It is not possible to run two Opsview master servers simultaneously (active / active).

Databases

There are 4 databases used by Opsview:

  • Opsview - main configuration and access control
  • Runtime - status data and short term history
  • ODW - Opsview Data Warehouse for long term retention of data
  • Dashboard - used to hold state information about the Dashboard application

You can get a major performance improvement by moving these databases to a dedicated database server. See the process documentation.

Opsview and Runtime must be on the same server, but ODW and Dashboard can be separated.
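
As a hedged illustration, the database connection settings live in the master's opsview.conf; the variable names below follow the file's usual Perl-style layout but are illustrative, so confirm them against your own installation before editing:

# /usr/local/nagios/etc/opsview.conf (variable names are illustrative)
$dbhost     = "db1.example.com";   # Opsview and Runtime must share a server
$odw_dbhost = "db2.example.com";   # ODW can live on a separate server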

More information is available in the MySQL performance tuning section.

MySQL Clustered Databases

In MySQL it is possible to run clustered databases. If you do so, ensure that the innodb_autoinc_lock_mode variable is set to 1 (in MySQL 5.1); otherwise there will be issues with host and service status data.
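
For example, a minimal my.cnf fragment (the file location varies by distribution), followed by a shell check of the running value. This variable is not dynamic, so mysqld must be restarted for the change to take effect:

# /etc/my.cnf - consecutive lock mode for InnoDB auto-increment columns
[mysqld]
innodb_autoinc_lock_mode = 1

# verify the running value after restarting mysqld
mysql -e "SHOW VARIABLES LIKE 'innodb_autoinc_lock_mode';"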

Disk partitioning

Opsview system

One large / partition is fine, but we recommend a partitioning scheme of at least /, /boot and /usr/local:

  • / - Sufficient space for Operating System and upgrades
  • /boot - Recommend separate boot partition of 256MB
  • /usr/local - Opsview software is installed here. Recommend > 10GB

Temporary directory

Opsview will use a temporary directory when running opsview-web and other programs. By default, this is the /tmp directory.

You can set a system-level environment variable, TMPDIR=/newtmp, if you want to use a different directory.
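
A minimal sketch, assuming you want to use /newtmp (the profile.d filename is illustrative); the directory needs the usual sticky-bit tmp permissions:

# create the new temporary directory with standard tmp permissions
mkdir -p /newtmp
chmod 1777 /newtmp

# export TMPDIR for all users at login
echo 'export TMPDIR=/newtmp' > /etc/profile.d/tmpdir.sh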

Database system

The databases can either be stored on the master server or on a separate server. In either case:

  • /var - Opsview database and backups are stored here. Recommend > 100GB if using the data warehouse (50GB otherwise)

Backups

By default, the nightly backups are written to /usr/local/nagios/var/backups. We recommend moving these to a partition on a different physical disk for redundancy.
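
One common approach, assuming a separate disk is mounted at /backup, is to relocate the directory and symlink the old path to it:

# move existing backups to the other disk and symlink the old location
mv /usr/local/nagios/var/backups /backup/opsview-backups
ln -s /backup/opsview-backups /usr/local/nagios/var/backups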

Security

Opsview Web Authentication

Opsview's web authentication uses an authentication ticket with a shared secret. Ensure this is set to a unique value for your instance.
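
For example, you could generate a suitably random value with openssl; the configuration key shown below is illustrative, so check your opsview.conf for the exact variable name:

# generate a random shared secret
openssl rand -base64 32

# /usr/local/nagios/etc/opsview.conf (variable name is illustrative)
$authtkt_shared_secret = "paste-the-generated-value-here";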

Network

Opsview servers should be located in a secure area of your network. If Opsview must be reachable from a public network, we recommend a firewall restricting access to the required ports. See this page for the ports used by Opsview.
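
A minimal iptables sketch, assuming only the HTTPS web interface should be publicly reachable (adjust the ports to match the ports page):

# permit replies to established connections and HTTPS to the web interface
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
# drop everything else arriving from the public network
iptables -A INPUT -j DROP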

Agents

Opsview agents provide monitoring capabilities on the end host. They are contacted by the Opsview master or slave system (or clients) using an anonymous cipher for encrypted communication, and only strong ciphers (ADH-128 and ADH-256) are accepted.

However, anonymous ciphers provide no authentication.

For security, we recommend you configure a firewall between the Opsview servers and the end host so that only the relevant Opsview system can communicate with it. Alternatively, you can set the agent's "allowed_hosts" variable to the specific IP addresses of the Opsview system, as in the sketch below.
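
For example, in the agent's NRPE configuration file (the path varies by installation; the address is a placeholder):

# nrpe.cfg on the monitored host - accept connections only from the
# Opsview master or slave responsible for monitoring it
allowed_hosts=192.168.10.5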

Credentials For Remote Services

If you need to connect to remote services (such as to an SNMP agent, a database, a VMware system or a Windows machine), you should use credentials with the minimum access required to achieve the desired monitoring.
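
As a hedged example for database monitoring, a MySQL account created from the shell with only the privileges most status checks need (the username, host and password are placeholders):

# minimal monitoring account: may connect and read server status only
mysql -u root -p -e "CREATE USER 'opsmon'@'192.168.10.5' IDENTIFIED BY 'strong-password';
  GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'opsmon'@'192.168.10.5';"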

SELinux

Opsview is not currently compatible with the Security-Enhanced Linux (SELinux) extensions - SELinux must be disabled.

Edit /etc/selinux/config and restart the system:

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
#       enforcing - SELinux security policy is enforced.
#       permissive - SELinux prints warnings instead of enforcing.
#       disabled - SELinux is fully disabled.
SELINUX=disabled
# SELINUXTYPE= type of policy in use. Possible values are:
#       targeted - Only targeted network daemons are protected.
#       strict - Full SELinux protection.
# SELINUXTYPE=targeted
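
To check the current mode, and to switch SELinux off for the running session while a reboot is arranged:

# report the current SELinux mode
getenforce

# disable enforcement for the running session only; the config file
# edit above is still required to make the change permanent
setenforce 0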

System Settings

Time Zone

The Unix server can be set to any time zone. This will affect the times displayed in the web interface.

Most of Opsview will display times based on the server time zone, but note that the auto-discovery application uses the browser time zone when displaying time values.

All data stored in databases and files is stored in UTC for consistency.

If you change the time zone of the Unix server, reboot the server to ensure that all services are aware of the time zone change.
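
For example, on many Linux distributions the time zone is a symlink into the zoneinfo database (Europe/London is a placeholder; choose your own zone):

# point /etc/localtime at the desired zone, then reboot
ln -sf /usr/share/zoneinfo/Europe/London /etc/localtime
reboot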
