Welcome to docs.opsview.com

Frequently Asked Questions

Installation and Initial Configuration

I'm having trouble with the Opsview installation - can I get any help?

If you have issues installing, sign up to the forums and post a question in the relevant topic. Try to put in as much information as you can, especially:

  • Which platform?
  • How are you installing?
  • What errors are you getting? Please include any relevant logs and output.

The forums are frequently viewed by our engineers, and others in the community may help.

We want everyone to be able to install Opsview, so we'll do our best to fix any issues and get your system up and running.

Why do I keep getting asked to authenticate?

Please check that the time is correct on the Opsview server, as an incorrect system time can cause session cookies to expire.

It is strongly recommended to use NTP as time synchronisation is critical to accurate monitoring. If you are using a virtual machine, check the virtual machine host information for time synchronisation as it may conflict with NTP.

See also the next question.

Why is the time in the UI different to that on the system?

Opsview is designed so that the UI displays time based on the Opsview server's timezone. All timestamps stored in the database will be in UTC.

If the date and time on the server are correct, but parts of the UI show an incorrect time, this can be because the user that started opsview-web has the TZ environment variable set incorrectly. To fix this, run the following as the nagios user:

unset TZ
opsview-web restart

and then investigate where the TZ variable may have been incorrectly set.
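
One way to start that investigation is to search the usual shell start-up files for TZ assignments. The following is a minimal sketch only: the function name find_tz_settings is hypothetical and not part of Opsview, and the list of files worth checking is an assumption that depends on your platform and shell.

```shell
# Hypothetical helper: print every line that assigns TZ in the given files.
# Files that do not exist are skipped silently.
find_tz_settings() {
    for f in "$@"; do
        [ -f "$f" ] || continue
        # Passing /dev/null as a second file makes grep print the file name
        grep -n -E '^[[:space:]]*(export[[:space:]]+)?TZ=' "$f" /dev/null
    done
    return 0
}
```

For example, as the nagios user you might run find_tz_settings /etc/profile /etc/environment "$HOME/.profile" "$HOME/.bashrc" and review any lines it prints.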

Why do I get "Error retrieving update from Opsview. Will continue to retry"?

If there is an error getting an AJAX refresh, the main content area will show this message. This could be due to a network problem or the Opsview Web application not responding.

Your browser will continue to poll for status updates and will refresh accordingly when it gets a successful response.

Why can't I login using the Atom feed?

Are you an LDAP user? If so, this is a limitation in using LDAP.

How do I change the name of the top level of Host Group Hierarchy?

By default, the name of the top level hostgroup is “Opsview”. This can be changed in the web UI. On the Host Group Hierarchy configuration page, click on the = sign and a box will appear at the bottom where you can change the name.

Why do I get error "The Opsview Web Server is not running or is not responding to requests!"?

When connecting to Opsview, the web browser gives the following error:

Opsview Error

The Opsview Web Server is not running or is not responding to requests!

This can be resolved by starting the Opsview Web server:

/etc/init.d/opsview-web start

How do I change MySQL passwords post installation?

There will be a short outage to Opsview while the passwords are changed.

1. Stop opsview and opsview-web daemons
2. Edit passwords in opsview.conf
3. Change passwords in MySQL

Note: Passwords should avoid characters such as @, ! and $ due to handling by perl or by the shell. You can see how the variables would be calculated by running /usr/local/nagios/bin/opsview.sh.

mysql -p -u <root user>
USE mysql;
UPDATE user SET password=PASSWORD("newpass") WHERE user="opsview"; 
UPDATE user SET password=PASSWORD("newpass") WHERE user="odw";
UPDATE user SET password=PASSWORD("newpass") WHERE user="nagios";
UPDATE user SET password=PASSWORD("newpass") WHERE user="reports";
FLUSH PRIVILEGES;
4. Run /etc/init.d/rc.opsview gen_config (which will regenerate the Opsview configuration files and then restart opsview daemons automatically)
5. Restart the opsview-web daemon
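
The note about awkward characters can be checked before you edit opsview.conf. The following is a minimal sketch: the function name password_is_safe is hypothetical, and it only tests for the three characters named in the note (@, ! and $), so other characters may still need care.

```shell
# Hypothetical helper: succeed only if the candidate password contains
# none of the characters @, ! or $ mentioned in the note above.
password_is_safe() {
    case "$1" in
        *@*|*!*|*'$'*) return 1 ;;
        *)             return 0 ;;
    esac
}
```

For example, password_is_safe 'newpass' && echo "OK to use" prints the message, while a password containing @ would not.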

How do I change the "admin" UI password post installation?

To change the admin user password via the Opsview GUI:

1. Login as the admin user
2. Click on the name in the top right (Logged in as admin)
3. Change your password and submit

Your current session will remain, but you will need to use the new password for the next login.

How do I recover the "admin" UI password?

In the event the admin password has been lost, the following commands will reset it to the installation default.

# mysql -u root -popsview
Password: <not shown>
mysql> use opsview;
mysql> update contacts set encrypted_password='$apr1$SUR3Kcd8$CkJfpqvqy3r.6rzawNwCS.' where name='admin';

You should now be able to access the UI again using the username admin and the password initial. Remember to change it!

Why does Nagios® Core die after a reload signal is sent?

This has been seen on slave systems which are 64bit where the master is 32bit. Running nagios in the foreground and sending a -HUP signal gives this error:

libgcc_s.so.1 must be installed for pthread_cancel to work

You have to install the 32bit version of libgcc.

This has only been seen on Red Hat systems.

I've manually changed the state of a passive service to OK, but it keeps changing back. Why?

This happens in a distributed environment where the state of a service is inconsistent between the master and the slave.

If the slave receives a passive check, it gets propagated to the master. If you then change the state of that service on the master, the slave thinks it is still in the old state, but the master thinks it is in the new manually chosen state.

If this service has renotification enabled, then the slave will send its state back to the master, causing the state to revert back.

The workaround is to either:

  • manually change the state on the slave (which then propagates back to the master)
  • or disable renotifications for this service

How do I change the TCP port that Opsview expects my slave to connect on?

This only applies when the slave is configured to initiate connections to the master.

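To change the port from 25807 to 25802, run the following SQL commands using the MySQL admin tool. In this example the port appears to follow the node id, so changing id 7 to id 2 moves the port from 25807 to 25802: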

use opsview;
update monitoringclusternodes set id=2 where id=7;

Why isn't MRTG generating graphs for some devices?

This assumes that the MRTG option is enabled for the hosts in question.

As the nagios user, run the following commands:

/usr/local/nagios/bin/mrtgconfgen.pl full
/usr/local/nagios/bin/mrtg_genstats.sh

These regenerate the configuration and run a test. If these are successful, then check the nagios user crontab contains a line similar to:

0,5,10,15,20,25,30,35,40,45,50,55 * * * * /usr/local/nagios/bin/mrtg_genstats.sh > /dev/null 2>&1

and that cron is running OK.

You can also check the log file for errors:

/usr/local/nagios/var/log/mrtg_genstats.log
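
The crontab check can be scripted. The following is a minimal sketch: has_mrtg_cron_entry is a hypothetical name, and the helper only looks for an uncommented line mentioning mrtg_genstats.sh, without validating the schedule fields.

```shell
# Hypothetical helper: read a crontab on stdin and succeed if it has an
# active (non-commented) entry that runs mrtg_genstats.sh.
has_mrtg_cron_entry() {
    grep -v '^[[:space:]]*#' | grep -q 'mrtg_genstats\.sh'
}
```

As the nagios user, you could run crontab -l | has_mrtg_cron_entry || echo "no mrtg_genstats.sh entry found".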

How do I change Opsview to use seconds instead of minutes for check interval periods?

Amend the opsview.conf file to include the following configuration within the overrides section:

$nagios_interval_length_in_seconds = 1;

You then need to restart Opsview Web: /etc/init.d/opsview-web restart.

NOTE: all service check and host configuration pages will have to be reconfigured to change the units from minutes to seconds (i.e. take all the values and multiply by 60) for check interval, retry check interval, and notification interval.
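
The conversion itself is simple arithmetic. As a small sketch (the function name minutes_to_seconds is hypothetical, and it handles whole numbers only):

```shell
# Hypothetical helper: convert an interval in whole minutes to seconds.
minutes_to_seconds() {
    echo $(( $1 * 60 ))
}
```

So a check interval of 5 minutes becomes 300, and a retry interval of 1 minute becomes 60.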

I see warning messages when restarting Opsview Web

If you see messages such as:

[root@opsview-test ~]# /etc/init.d/opsview-web restart 
Stopping opsview-web: done 
Starting opsview-web: INFO - Starting Opsview Web 4.5.0.16151 
/usr/local/nagios/libexec/check_tcp: /usr/lib64/libcrypto.so.10: no version information available (required by /usr/local/nagios/libexec/check_tcp) 
/usr/local/nagios/libexec/check_tcp: /usr/lib64/libssl.so.10: no version information available (required by /usr/local/nagios/libexec/check_tcp) 
/usr/local/nagios/libexec/check_tcp: /usr/lib64/libcrypto.so.10: no version information available (required by /usr/local/nagios/libexec/check_tcp) 
/usr/local/nagios/libexec/check_tcp: /usr/lib64/libssl.so.10: no version information available (required by /usr/local/nagios/libexec/check_tcp) 
done 

This is likely because you have an older version of openssl installed on your system. Upgrading to the latest openssl should fix this warning message.

SMTP configuration

Changing the sender address for email alerts

To change the SMTP sender address for your Opsview alerts:

For everything to the right of the @ symbol

Update '/etc/mailname' with your preferred host or domain name. Restart your SMTP MTA.

For the user name to the left of the @ symbol

  • In Postfix you can adjust this using the 'sender_canonical_maps' configuration parameter

There is a Mini HOWTO for Ubuntu users, which may be useful for users of other platforms, on the Ubuntu Forums

Send all messages to your company email system

Update the 'relayhost' parameter in the Postfix configuration with the hostname of your SMTP MTA, e.g.:

 relayhost = mail.opsview.org
 

Configure Postfix to Use Gmail SMTP on Ubuntu

If you want to use a Gmail account as a free SMTP server on your Ubuntu-Linux server, you will find this article useful

http://rtcamp.com/tutorials/linux/ubuntu-postfix-gmail-smtp/

MySQL Database

Use MySQL client to execute SQL statements

How do I back up Opsview databases and configuration?

  • Edit etc/opsview.conf to set the correct backup destination, then run:
 su - nagios
 /usr/local/nagios/bin/rc.opsview backup

How do I restore from database backup?

  • Identify the required image to restore from (location is held in $backup_dir variable within the opsview.conf file if using a full backup rather than database only).
su - nagios
gunzip -c {/path/to/opsview-db-{date}.sql.gz} | /usr/local/nagios/bin/db_opsview db_restore
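
Before piping a backup into db_restore, it can be worth confirming that the compressed file is intact. The following is a minimal sketch: backup_is_readable is a hypothetical name, and the check only verifies the gzip container, not the SQL inside it.

```shell
# Hypothetical helper: succeed only if the file exists and passes gzip's
# built-in integrity test. Nothing is written or restored.
backup_is_readable() {
    [ -f "$1" ] && gzip -t "$1" 2>/dev/null
}
```

You could run backup_is_readable on the chosen opsview-db file first and only continue with the restore if it succeeds.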

I get a database upgrade error during an upgrade

If you get an error which says:

Upgrading Opsview part of Runtime database
DB at version 2.7.0
DBD::mysql::db do failed: Table 'opsview_contact_services' already exists [for Statement "CREATE TABLE opsview_contact_services (

There was an error in the runtime nightly backup which missed out the opsview_database_version table. The upgrade script then tries to apply changes from Opsview 2.7 and 2.8 into the database. To fix, run this in mysql:

mysql -u{user} -p{password} runtime
mysql> update opsview_database_version set version='2.8.6';

And then run:

su - nagios
/usr/local/nagios/installer/upgradedb_runtime.pl

From Opsview 3.3, the upgrade tasks associated with Opsview 2.7 and 2.8 will be removed.

How do I fix damaged database tables?

If the database is damaged, run the following commands:

 mysqlcheck -p -u <user> <database>

To repair a table (from the MySQL client):

 use <database name>;
 REPAIR TABLE <tablename>;

To check all databases you can use the following as the MySQL root user:

mysqlcheck -A -r -u root -p

A common cause of corrupted database files is that the system ran out of space on the /var partition.
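
To keep an eye on this, you can check how full the partition is. The following is a minimal sketch: disk_used_percent is a hypothetical name, and the parsing assumes the standard df -P layout with a filesystem name that contains no spaces.

```shell
# Hypothetical helper: print the percentage of space used on the
# filesystem holding the given path (default /var), without the % sign.
disk_used_percent() {
    df -P "${1:-/var}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example, [ "$(disk_used_percent /var)" -lt 90 ] || echo "/var is nearly full" prints a warning once usage reaches 90%. The threshold is only an example.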

Using Opsview

How do I change my password?

In the side navigation, select User → Preferences. From here, you can reset your password.

Depending on your access levels, some options may be dimmed.

Note: Versions prior to 3.0.2 did not allow changing of passwords for non-admin users due to a bug.

Why do I have "Host assumed UP - no results received" as the host output?

Hosts are checked on demand - only when a service comes in with a failure state will the host be checked.

If a service has failed, then the host will be checked and the host output will be changed to reflect this.

If you submit a result in the user interface for a service with a failed state, then the host will not be checked. This is due to a change we made to the core Nagios Core 2 engine - hosts were being checked for every failed passive result coming in which was causing problems on systems with snmp traps.

Opsview says that a host is down, but I can ping it - what is going on?

There are several possibilities:

  • Is the check command for the host a ping? For example, if it is an ssh check (which you may want to set for firewall reasons) and ssh hasn't started, then as far as Opsview is concerned, the host is down
  • As Opsview can only check from its perspective, perhaps a check from a different network location could give different results
  • Because hosts are checked “on demand” (see the important concepts page), if all services fail to recover from a host failure, then the host is not checked again

For the last possibility, the recommended approach is to always have a check which is similar to the host check (usually TCP/IP), so that this regular service check will reveal when the host has recovered. This also means you can get performance data about this service (performance data is not available for host checks due to the irregularity of them).

Why do I have a stale result straight after I have submitted a result?

This could happen in a distributed environment where the time is not synchronised between master and slaves.

When you submit a result, there is a time associated with it. If the slave's clock is ahead of the master's, the result appears to the slave to be in the past, and the slave immediately marks it as stale.
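
A quick way to see whether clocks have drifted is to compare Unix timestamps taken on both machines. The following is a minimal sketch: clock_skew_seconds is a hypothetical name, and how you collect the slave's timestamp (for example over ssh) is an assumption about your environment.

```shell
# Hypothetical helper: print the absolute difference, in seconds, between
# two Unix timestamps (for example the master's and a slave's clock).
clock_skew_seconds() {
    d=$(( $1 - $2 ))
    if [ "$d" -lt 0 ]; then
        d=$(( 0 - d ))
    fi
    echo "$d"
}
```

On the master, something like clock_skew_seconds "$(date +%s)" "$(ssh <slave name> date +%s)" gives a rough figure. The ssh round trip adds a little to the result, so treat small values as noise.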

How can I use the Nagios Checker Firefox plugin?

More information can be found here:

NMIS pages don't look right, as though they are missing a stylesheet

This is due to a line missing from the Apache configuration file - add the line

Alias /static/nmis/ "/usr/local/nagios/nmis/htdocs/"

to the Opsview configuration section and restart Apache.

Icons are missing from my Solaris master server GUI

This is due to a bug in the SunFreeware GD package used when creating the Opsview packages (which meant gd2 images were not created correctly). This can be fixed by ensuring SFWgd is installed (from the companion DVD) and then running the following as the nagios user:

cd /usr/local/nagios/share/images/logos/
for arg in *.png; do /opt/sfw/bin/pngtogd2 "$arg" "${arg%.png}.gd2" 0 1; done

The mass acknowledgements sometimes shows me more items than I had in the prior view

Mass acknowledgements filters the current view to only show you unhandled services and unhandled hosts. Due to the way the URLs are constructed, it was better to force showing all the unhandled hosts, otherwise from some views, you wouldn't get any items to acknowledge.

The Parent Tree link doesn't work

In Internet Explorer 7, the Parent Tree link just gives me a broken image and an error saying:

The page you are viewing uses Java. More information on Java support is available from the Microsoft website.

This means you need to install Java on your machine. Go to http://java.com and install Java.

The Parent Tree diagram overwrites the contents of the web page underneath

The Parent Tree functionality uses the Hypergraph Java applet, but Java applets do not respect the z-index and thus overwrite all other HTML content. This is a limitation in Java applets. See http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4858528 for the bug report sent to Sun.

How long after a failure can I expect to receive a notification?

Nagios Core has a concept of SOFT and HARD states and notifications are only sent on hard states. See the important concepts guide for more information.

How can I work out what plugin arguments were actually used?

Unfortunately, there is no UI method of finding this, but you can query the database to get the information:

  • Connect to the runtime database (use the connection information in opsview.conf)
  • Get the service object id:
select * from nagios_objects where name1='hostname' and name2='servicename'
  • The service object id is the value in the object_id column
  • Get the last 5 results:
select state,output,command_line from nagios_servicechecks where service_object_id={object_id} order by start_time desc limit 5
  • The command line column has the full command executed by Nagios Core
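
The two lookups can be combined into a single query. The following is a sketch under assumptions: recent_checks_sql is a hypothetical name, the join is our own combination of the two statements above using the same tables and columns, and the host and service names are inserted into the SQL as-is, so it is only suitable for names without quote characters.

```shell
# Hypothetical helper: print the query for the most recent check results
# of one service. The limit defaults to 5, as in the steps above.
recent_checks_sql() {
    host=$1 service=$2 limit=${3:-5}
    cat <<EOF
select sc.state, sc.output, sc.command_line
from nagios_servicechecks sc
join nagios_objects o on o.object_id = sc.service_object_id
where o.name1 = '$host' and o.name2 = '$service'
order by sc.start_time desc
limit $limit;
EOF
}
```

The output can be pasted into the MySQL client, or piped to it, for example recent_checks_sql opsview SSH | mysql -u <user> -p runtime.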

On Safari 4, I get layout errors after I've selected some options on some pages, such as the contacts edit page

This is a bug in Webkit, which is used by Safari 4, Google Chrome and other browsers.

We have raised this with the Webkit team:

We are also tracking this internally in our Jira system.

Administering Opsview

I get access denied and I cannot see any of the /admin pages

Since Opsview 3.1 introduced granular access controls, your role probably has had the ADMINACCESS access rights removed.

As access to the user interface is not possible, the issue will need to be addressed via the database. Connect to MySQL as the opsview user (configured in opsview.conf) and run:

mysql> select 
 contacts.name as contact_name, 
 roles.id as roleid,
 access.id as accessid,
 access.name 
 from contacts,roles,roles_access,access 
where accessid=access.id 
 and roleid=roles.id 
 and contacts.role=roles.id 
 and contacts.name='{contactname}';

Change the {contactname} as appropriate.

This should return output like:

+--------------+--------+----------+----------------+
| contact_name | roleid | accessid | name           |
+--------------+--------+----------+----------------+
| admin        |     10 |        1 | VIEWALL        | 
| admin        |     10 |        3 | ACTIONALL      | 
| admin        |     10 |        6 | NOTIFYSOME     | 
| admin        |     10 |        7 | CONFIGUREHOSTS | 
| admin        |     10 |        8 | RELOADACCESS   | 
| admin        |     10 |        9 | ADMINACCESS    | 
+--------------+--------+----------+----------------+
6 rows in set (0.00 sec)

If ADMINACCESS is missing, then it will need to be added to the role (this will take effect for all contacts using this role).

mysql> insert into roles_access values (10,9);
Query OK, 1 row affected (0.03 sec)

Rerunning the above select statement should show the ADMINACCESS in the list now.

You can then log in and check the audit logs to work out who made the change.

Another possibility that has been reported is that you have an incorrect cookie. Try deleting all cookies related to Opsview and re-login. Please let us know about this on the mailing lists if you get this problem as we are trying to track down the root cause.

Reloading gives me an error about nagvis.ini.php

When reloading, I get an error which says:

Can't close to nagvis.ini.php: Bad file descriptor at /usr/local/nagios/bin/nagconfgen.pl

Check the permissions of /usr/local/nagios/nagvis/etc/nagvis.ini.php. This should be owned by the nagios user and the nagcmd group.

Also ensure that your apache user is a member of the group nagcmd.
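
The ownership check can be scripted. The following is a minimal sketch: has_owner is a hypothetical name, and it relies on the owner and group appearing as the third and fourth fields of ls -ld output.

```shell
# Hypothetical helper: succeed if FILE ($1) is owned by USER ($2) and
# GROUP ($3), based on the long listing from ls.
has_owner() {
    [ "$(ls -ld "$1" 2>/dev/null | awk '{ print $3 ":" $4 }')" = "$2:$3" ]
}
```

For example, has_owner /usr/local/nagios/nagvis/etc/nagvis.ini.php nagios nagcmd || echo "ownership needs fixing". Group membership of the apache user can be listed with id -Gn followed by that user's name.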

Committing Changes gives me an error "Authorisation Mismatch"

When attempting to acknowledge an error or submit a change, Opsview responds with

Authorisation Mismatch

1. Clear your browser cache.

2. If clearing browser cache fails, it could be that you have a forms.js file that is pre Opsview 4.4.1. Contact Opsview Support for assistance.

Nagios Plugins

What does this output message mean?

See our list of common plugin outputs.

I would like check XXX to return a WARNING instead of CRITICAL alert

Some checks allow you to amend the warning and critical limits to achieve what you want, but other checks do not have enough fine grained controls.

For example, if you want check_icmp to return a WARNING instead of CRITICAL when it picks up “rta nan, lost 100%”, then clone the original service check and amend the arguments from

/usr/local/nagios/libexec/check_icmp -H 192.168.1.100 -w 200,100% -c 11000,200%

to

/usr/local/nagios/libexec/negate --critical=1 --substitute /usr/local/nagios/libexec/check_icmp -H 192.168.1.100 -w 200,100% -c 11000,200%

This will amend the return from CRITICAL to WARNING.

negate has other uses too - please see its help output for more details.
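
To illustrate the effect on the exit code, the sketch below maps a CRITICAL result (exit code 2) to WARNING (exit code 1) and passes every other code through. It is an illustration only, not how negate is implemented: critical_as_warning is a hypothetical name, and unlike negate with --substitute it does not touch the plugin's output text.

```shell
# Illustration only: run a command and report WARNING (1) whenever it
# exits CRITICAL (2). Other exit codes pass through unchanged.
critical_as_warning() {
    status=0
    "$@" || status=$?
    if [ "$status" -eq 2 ]; then
        status=1
    fi
    return "$status"
}
```

Running critical_as_warning followed by a plugin command line returns 1 where the plugin alone would have returned 2.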

I get 'CRD CRITICAL - oldest checkresult file is XXXXXX seconds old, YYYYYY checkresult files backlogged'

In the checkresults directory (default: /usr/local/nagios/var/spool/checkresults), occasionally .ok files will be left without being cleaned up. If you see 'CRD CRITICAL' and find that you have one or more .ok files that are stale, it is safe to remove these files.

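To automate this process, add this command to your nagios crontab. In a crontab entry the % must be written as \%, because cron treats an unescaped % as a newline:

find /usr/local/nagios/var/spool/checkresults -type f -name "*.ok" -mtime +1 -exec sh -c 'f="$1"; [ -e "$f" ] && [ ! -e "${f%.ok}" ] && rm "$f"' sh {} \;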
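
If you prefer to keep the logic in a script and call that from cron, the same clean-up can be written as a function. The following is a minimal sketch: clean_stale_ok_files is a hypothetical name, and it uses the same find test (-mtime +1) as the command above.

```shell
# Hypothetical helper: remove .ok files matched by find's -mtime +1 whose
# companion check result file (same name without .ok) no longer exists.
clean_stale_ok_files() {
    dir=${1:-/usr/local/nagios/var/spool/checkresults}
    find "$dir" -type f -name '*.ok' -mtime +1 | while IFS= read -r f; do
        [ -e "${f%.ok}" ] || rm -f "$f"
    done
}
```

Keeping the % inside a script file also avoids having to escape it in the crontab entry.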

Distributed Monitoring

On the master, it says "Next check: N/A" but I know this is a regularly scheduled check

This is because, on the master, that service is a passive check rather than an active check. The Nagios Core CGIs just display the results from the slave's last check and have no idea when the next scheduled check will be. This is a limitation in the Nagios Core CGIs.

Nagios Core is not running on one of my slaves, how do I fix it?

The simple way of starting the nagios daemon on a slave is via the Master server.

When logged in as the 'nagios' user on the Master server:

/usr/local/nagios/bin/rc.opsview gen_config

This will re-generate the configuration and (re)start the nagios daemon on each slave server.

If this command fails, go to /admin/reload to see any error messages that may have been captured.

Why are there large numbers of service checks in 'UNKNOWN' state?

This occurs in a distributed environment if a slave has not sent results back to the master for longer than 30 minutes.

A slave server problem should be alerted via the check_opsview_slave plugin. But if it is not resolved, then services monitored by this slave will start to go into UNKNOWN states after 30 minutes (note, the host will not be set into an UNKNOWN state). You need to fix the connectivity issues.

Other things it could be:

  • NSCA not running on the master
  • Nagios Core command pipe not setup correctly
  • The slave server has very high latency - check nagios.log and nagiostats output on the slave for the Active Service Latency values

I get a "HostKeyVerification Failed." error - how do I fix it?

This is usually only seen in environments with reverse SSH tunnels either on the initial setup or when a slave has changed IP address.

In Opsview version 3.0, the fix is to run the following as the nagios user:

dosh -i -s <slave name>

In previous versions of Opsview use the following steps:

send2slaves -t

For each slave with an error, a line similar to the following will be returned:

ssh -o ConnectTimeout=10 -p <port> -o BatchMode=yes -o HostKeyAlias=<slave>-ssh localhost hostname

Rerun the command as nagios but change the BatchMode=yes to BatchMode=no.

How do I use NSCA with Opsview

SNMP

Why does my SNMPv3 host not have MRTG graphs generated?

This is a known limitation. Only hosts with SNMPv1 or v2 will have MRTG graphs generated.

How can I load extra MIBs into Opsview?

There are two directories on the Opsview master used for MIBs:

  • /usr/local/nagios/snmp/all to hold all MIBs
  • /usr/local/nagios/snmp/load to hold MIBs that are actually loaded by Net-SNMP

Put the MIB files in there and ensure that the files are readable by the nagios user.

As per the SNMP setup documentation, the /etc/snmp/snmpd.conf (or /etc/sma/snmp/snmp.conf on Solaris) should exist and at least contain the line:

mibdirs +/usr/local/nagios/snmp/load

Please check your configuration against our configuration notes if this file does not exist.

You will need to restart the Net-SNMP daemon for this to take effect.

Run /usr/local/nagios/bin/send2slaves -s to send the snmp/load directory to all slaves and restart SNMP.
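
The copy step can be scripted. The following is a sketch under assumptions: install_mib is a hypothetical name, and it assumes you want the MIB placed in both the all and load directories and made world-readable so that the nagios user can read it.

```shell
# Hypothetical helper: copy a MIB file into the "all" and "load"
# directories and make it world-readable so the nagios user can read it.
install_mib() {
    mib=$1
    base=${2:-/usr/local/nagios/snmp}
    for d in "$base/all" "$base/load"; do
        cp "$mib" "$d/" || return 1
        chmod a+r "$d/$(basename "$mib")" || return 1
    done
}
```

After running install_mib with the path to your MIB file, continue with the daemon restart and the send2slaves step described above.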

My SNMP 5.3 system isn't accepting traps correctly

By default, SNMP version 5.3 and newer requires authentication to be set up - this configuration can be disabled by setting

disableAuthorization yes

in the /etc/snmp/snmptrapd.conf file.

Network Protocols

Monitoring Agents

I can't find an agent package for my Unix OS / Architecture, what now?

If an installation package isn't already available you can contact us to see if it is something we are working on.

Alternatively, you can build it from sources:

1) Download source code and pre-requisites

Download NRPE and Nagios Plugins packages from the Nagios Core website.

2) Compile NRPE and Nagios Plugins

You will need at least GCC and OpenSSL software installed.

Steps for compiling software:

  • Start console session (bash, ksh, csh, etc)
  • Create 'nagios' user and group
  • Unpack tarball
  • Change into software directory (cd nrpe-2.2.5)
  • Run command './configure' and check for any missing dependencies
  • Run command 'make all' to build software
  • Optionally run command 'make install' to install software in relevant locations

3) Assemble additional functionality

Add additional monitoring plugins from Opsview source repository and Nagios Exchange web site.

4) Update configuration

Update NRPE configuration files to suit your requirements (/usr/local/nagios/etc/nrpe.cfg).

5) Create packages (optional)

To deploy software to multiple systems, we recommend you create a software package for your OS platform. For Linux this may be using RPM or Debian packaging tools. Solaris and AIX include their own tools.

FAQ: Log messages

There are various locations for log messages, depending on the subsystem

Audit logs

Why do I keep getting "Username 'user' logged in via auth tkt" in my audit logs?

By default, the authentication ticket is set to expire after 1 day. If the clock on your Opsview server is running behind, the browser could expire the authentication ticket while the session still exists on the Opsview server.

Make sure the Opsview server has its time synchronised correctly.

/usr/local/nagios/var/nagios.log

These are the Nagios Core log files. These are rotated every week and the archives are put into /usr/local/nagios/var/archives.

This is medium volume.

/var/log/opsviewd.log

This is the main log destination for Opsview programs. It is possible to split the locations if you wish by changing the Log4perl.conf file in /usr/local/nagios/etc.

By default, this acts as a bucket for all log messages.

Invalid offset

[2008/04/04 03:11:45] [run_scheduled_reports] [FATAL] Died: Invalid offset: New_York

This appears to be a bug in DateTime: http://search.cpan.org/src/DROLSKY/DateTime-TimeZone-0.6902/Changes

It is harmless and will be fixed when we next update DateTime.

Import took > 5 seconds

[2008/11/10 18:16:54] [import_ndologsd] [WARN] Import of 1225937390.604976, size=768003, took 10.0321450233459 seconds > 5 seconds

This means that importing the file into the Runtime database took more than 5 seconds. As imports are mostly run every 5 seconds, the database will fall behind the status reported by Nagios Core. This is not a problem during a reload, as the initial load can take some time. But if you are getting this message frequently, or when the size value is small, you should investigate why importing is so slow.

See the prerequisite information about MySQL for some recommendations.
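
To judge how often this happens and whether small files are affected, you can pull the size and duration out of the log. The following is a minimal sketch: slow_imports is a hypothetical name, and the pattern assumes each warning is on a single line in the format shown above.

```shell
# Hypothetical helper: read opsviewd.log lines on stdin and print
# "<size> <seconds>" for every import_ndologsd slow-import warning.
slow_imports() {
    sed -n 's/.*\[import_ndologsd\] \[WARN\] Import of [^,]*, size *=\([0-9]*\), took \([0-9.]*\) seconds.*/\1 \2/p'
}
```

For example, slow_imports < /var/log/opsviewd.log | sort -n | head lists the smallest imports that were still slow.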

/var/log/opsview-web.log

This holds logging information for the Opsview Web application.

You can change the configuration via /usr/local/opsview-web/etc/Log4perl.conf.

Low volume (by default).

For example, if you wanted logging for the Root.pm controller, you would add this line:

log4perl.logger.Opsview.Web.Controller.Root=DEBUG, LOGFILE

You do not have to restart opsview-web. This takes effect within a few seconds.

syslog

Some 3rd party software, such as NRPE, ndo2db and NSCA, will log to syslog. The destination depends on your syslog configuration.

Performance Data and Graphing

Where's the graphing icon? I'm sure there's performance data

You need to reload Opsview by clicking “Navbar > Settings > Reload Configuration” in the UI. When a new service is created, Opsview doesn't know if there is performance data until data is returned (the first reload activates the new check - wait for the first status update and then reload again). The 2nd reload will see the graph data files exist and then display the graph icon in the UI.

If you still don't get a performance graph icon, check whether the Nagios Core status screen for that service shows the graphing icon - the url is something like: http://opsviewserver/cgi-bin/extinfo.cgi?type=2&service=SSH&host=opsview

If it does, then the macro has been set correctly in the configuration. Check that the runtime.opsview_host_services table has the perfdata_available flag set for this service:

select hostname, servicename, perfdata_available
from opsview_host_services
where hostname='{host}' and servicename = '{servicename}'

This flag controls whether to draw the graphing icon or not. This table is updated by ndoutils_configdumpend, which is run when the Nagios Core configuration has been updated following a reload.

Another possibility is that the RRDs are being migrated from Opsview 2 style to Opsview 3 style. The graphing icon will only appear when the RRDs have migrated for a particular service. See RRD migration notes for more information about the migration process.

It has been seen in the migration process that some RRD files can be left behind because the migration process cannot work out what the host name or service name is. If you don't mind losing performance history, you can delete the existing RRD file and Opsview will automatically create a new one:

cd /usr/local/nagios/var/rrd
ls *.rrd
rm *.rrd

Graphing is not currently working in Opsview - how does it work?

There are 2 separate and distinct areas of graphing within Opsview;

Performance data returned by service checks

Any service check that returns valid and consistent performance data as per the guidelines will have graphs automatically created. The process is:

  • A new service check is added in and a reload performed to activate it
  • As soon as the check is run against a server and performance data is returned the graphs are created
  • The graph icons and links in the UI will be created on the next reload

The service check performance data is continually logged into nagios/var/perfdata.log and processed regularly to create the rrd files, which are stored in nagios/var/rrd. The log file for this process is in nagios/var/log/nagiosgraph.log.

Performance data isn't being generated or it isn't interpreted correctly

Some service checks do not provide performance data. Other checks do provide it but the data isn't interpreted correctly. In both cases a change to the provided map file, /usr/local/nagios/etc/map, is probably required. This file is used to read the service check output and, if a match is found, generate performance data for use in the service graphs. If no match is found, performance data is used if it is present in the service check output.

This file is a series of Perl regular expressions which match service check output. From Opsview 2.12.7, if you need to change one of the entries you can create a map.local file. This overrides the entries in the map file and will not be overwritten on an upgrade.

Any changes to the map.local file should be followed by a

perl -c map.local

to ensure there are no syntax errors in the file.

Changes to map.local will be used when processing the next set of performance data, which is every 30 seconds.

There is some more information about the map file and its format on the nagiosgraph web site.

Performance graph combines several pieces of data, how do I graph individually?

It is possible to generate graphs for specific performance data by appending an argument to the URL.

For a graph based on the check_http plugin, we get four pieces of data:

  • Response time (time)
  • Page size (size)
  • Warning threshold (time_warn)
  • Critical threshold (time_crit)

Standard URL will be similar to:

/cgi-bin/show.cgi?fixedscale=1&service=HTTP&host=www.opsview.org

Appending '&db=,time' will generate graph based on 'time' value, eg:

/cgi-bin/show.cgi?fixedscale=1&service=HTTP&host=www.opsview.org&db=,time

Parameters can be combined so to display response time and critical threshold on same graph simply append '&db=,time,time_crit'
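
Building these URLs by hand is error-prone, so here is a small sketch that assembles one. The function name graph_url is hypothetical, and it does no URL-encoding, so it is only suitable for host, service and series names made of plain characters.

```shell
# Hypothetical helper: print a show.cgi URL path for one host and service,
# limited to the named data series (for example: time time_crit).
graph_url() {
    host=$1 service=$2
    shift 2
    url="/cgi-bin/show.cgi?fixedscale=1&service=$service&host=$host"
    if [ "$#" -gt 0 ]; then
        url="$url&db="
        for series in "$@"; do
            url="$url,$series"
        done
    fi
    echo "$url"
}
```

For example, graph_url www.opsview.org HTTP time time_crit prints the path for a graph showing only response time and the critical threshold.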

Performance graphs show an arbitrary range

Some performance checks do not show minimum and maximum ranges and an arbitrary range is used.

To force the use of a specific range the following can be added to the URL

&rrdopts=%2Dl%200%20%2Du%2016384%20%2Dr

This adds the following rrd options to the command generating the graph:

-l 0 -u 16384 -r

where -l <num> provides the minimum range, -u <num> the maximum range and -r specifies the ranges are rigid (prevents autoscaling). See man rrdgraph for further options that are available.
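The rrdopts value is simply the option string "-l 0 -u 16384 -r" in percent-encoded form. The sketch below shows one way to produce and check such a value in Python. The function names are hypothetical, and encoding the hyphen as %2D is done only to match the form shown in this entry.

```python
from urllib.parse import quote, unquote


def encode_rrdopts(options):
    """Percent-encode an rrdgraph option string for the rrdopts parameter."""
    # quote() leaves '-' untouched, so encode it explicitly to match this entry.
    return quote(options, safe="").replace("-", "%2D")


def decode_rrdopts(value):
    """Recover the plain option string from an encoded rrdopts value."""
    return unquote(value)


print(encode_rrdopts("-l 0 -u 16384 -r"))
print(decode_rrdopts("%2Dl%200%20%2Du%2016384%20%2Dr"))
```

Decoding an existing value is a quick way to confirm which options a saved graph URL applies.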

MRTG graphs

Any host that has the 'MRTG Graphs' check-box ticked on the host configuration page will be scanned by MRTG for data every 5 minutes. Any data gathered is converted into a graph (stored in nagios/var/mrtg), and access to the graphs is provided by the Reporting → Network Traffic link in the sidenav. See also the Installation FAQ entry and the SNMP FAQ entry about MRTG graphing.

The MRTG configuration for all hosts will be rescanned at Opsview reload time if any host with MRTG has had some configuration change or if a host template with MRTG has been changed. You can force an MRTG rescan by running as the nagios user on the master server:

/usr/local/nagios/bin/mrtgconfgen.pl full

Note: If a user has view access for the host, then they can view the MRTG graphs for that host.

Note: MRTG uses the host address to identify the host, not the unique Nagios Core host identifier. If you have multiple hosts with the same host address, Opsview checks that a user has access to all the hosts with this host address.

Note: In a slave cluster, only the first node in the cluster will poll the device.

Note: It is not possible to disable specific interfaces for MRTG to monitor. You could use the Opsview host interfaces functionality to select specific interfaces.

Why doesn't graphing work on Solaris 10 sparc?

This seems to be due to a problem in the SunFreeware rrdtool package (the Intel version doesn't have this issue). Downloading, compiling and installing your own version should fix the issue.

How can I delete all the graphing history for a service?

Just remove the RRD file on the Opsview master server. In /usr/local/nagios/var/rrd, locate the appropriate RRD files. They take the format:

{hostname}/{servicename}/{metric}/value.rrd

The names are urlencoded, so for example may contain %20 to signify a space.

After you have removed the file, the next performance data that arrives will automatically create a new RRD.
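To locate the file for a given metric, the path can be derived from the names. The Python sketch below assumes standard percent-encoding of every character outside letters, digits and "_.-~". The exact encoding Opsview applies is not specified here, so check the result against the directory listing before removing anything. The function name rrd_path and the sample names are hypothetical.

```python
from posixpath import join
from urllib.parse import quote

RRD_ROOT = "/usr/local/nagios/var/rrd"


def rrd_path(hostname, servicename, metric):
    """Return the expected RRD path for one metric of a service."""
    # Assumption: each path component is percent-encoded, e.g. space -> %20.
    parts = [quote(part, safe="") for part in (hostname, servicename, metric)]
    return join(RRD_ROOT, *parts, "value.rrd")


print(rrd_path("www.opsview.org", "Disk usage", "used"))
```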

Why do some of my RRD files appear to be losing history / getting deleted?

There is a housekeeping job, /usr/local/nagios/bin/opsview_cronjobs, which deletes any RRD files that have not been modified within the number of days set for audit log retention. As RRD files are updated with new results, this should only affect RRD files which belong to old hosts or services.
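To see which files would fall outside a given retention period, you can compare modification times yourself. The following Python sketch only lists candidates and deletes nothing. It is not the logic of opsview_cronjobs, and the function name and parameters are hypothetical.

```python
import os
import time


def stale_rrd_files(root, max_age_days, now=None):
    """Return paths of .rrd files under root not modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds in a day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only the modification time is considered, not the file contents.
            if name.endswith(".rrd") and os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)
```

For example, stale_rrd_files("/usr/local/nagios/var/rrd", 30) would return the RRD files untouched for more than 30 days, where 30 stands in for your own retention setting.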

It has been seen on a Debian Lenny system which was not completely upgraded (the kernel remained as the Etch version), that RRD Tool updated the rrd files but did not change the modification time. Thus the RRD files were deleted by the housekeeping and the graphs were then recreated. The fix for this scenario was to upgrade the kernel appropriately. See http://www.mail-archive.com/rrd-users@lists.oetiker.ch/msg14481.html for a discussion about this issue.

As an alternative to upgrading your system kernel, you can compile RRD Tool from source, as discussed here.

Opsview Development FAQs

How do I troubleshoot Nagios Core CGIs?

To simulate a Nagios Core CGI, switch to the nagios user on the Opsview master:

su - nagios
cd /usr/local/nagios/sbin
REQUEST_METHOD=GET REMOTE_USER=admin QUERY_STRING="host=all&createimage" ./statusmap.cgi > /tmp/cgi.out

This will run the CGI as if the web server had served the request.

Upgrading issues

I have duplicated services for hosts

If you have upgraded to Opsview 3.3.2 and hosts are showing duplicate services with different check times, this is due to a bug in the upgrade script.

The problem is due to an attempt to fix an NDOutils bug where nagios instance ids could be created multiple times. This affects Nagvis, amongst other issues. However, the upgrade script in Opsview 3.3.2 fails to take into account when the nagios_instance id is set to a number other than 1.

This only affects a small number of users.

To fix this, you have to disable the information in the Runtime database about the other instances. The immediate fix is:

mysql> use runtime
mysql> select * from nagios_instances;

This should return only 1 row. If you have more than 1 row here, this is an unsupported configuration (you have more than just Opsview pointing to this runtime database).

The instance_id will be > 1. The instructions below assume this value is set to 3 - please change appropriately.

mysql> update nagios_objects set is_active=0 where instance_id != 3;
mysql> delete from nagios_programstatus where instance_id != 3;

This disables all the objects for the other instance_ids and removes some metadata about the instance connections. Now reload Opsview and the additional services will disappear.

Differences to Nagios Core

Service Groups

In Opsview, a service group is a group of service checks, not associated with any hosts.

In Nagios Core, a service group is a list of services (each associated with a host). The closest concept in Opsview is a keyword, which consists of a group of services through the use of tagging.

Opsview does not generate any configuration for a Nagios Core service group.

WMI Agentless Monitoring

Segfault Issue

If you are seeing the following while running "check_wmi_plus.pl" based service checks using "-H hostname" instead of "-H ipaddress":

UNKNOWN - The WMI query had problems. The error text from wmic is: [lib/util/fault.c:163:fault_report()] =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= [lib/util/fault.c:164:fault_report()] INTERNAL ERROR: Signal 6 in pid 5044 (4.0.0tp4-SVN-build-UNKNOWN)[lib/util/fault.c:165:fault_report()] Please read the file BUGS.txt in the distribution [lib/util/fault.c:166:fault_report()] =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Then first test whether the service checks complete successfully with the IP address instead. If so, check the DNS search domains in resolv.conf on the Opsview server.

Password with special characters

If you use a password with special characters, you will need to escape the special character. For example, 'Password!' would become 'Password\!'.

Domain Passwords

If your username is part of a domain, you will need to escape your login, like so: "DOMAIN\\\\username"
