High Availability Master

This document covers the theory on setting up high availability for the Opsview master.

There are different ways of achieving high availability. You could run the Opsview master in a virtual machine and provide availability at the virtual machine level - that setup is outside of the scope of this document.

This document assumes you have distinct hardware which provides failover capability.

This document also assumes CentOS, though the theory is applicable to other OSes.

Two Node Solution

This configuration involves two near-identical servers which work in active-standby mode. This assumes both servers will be running the MySQL databases locally.

Some form of shared disk will be required. The test setup with Opsview was completed using a simple NFS mount on each node; however, a more robust solution such as a fibre-attached or iSCSI volume should probably be used, in which case a clustering filesystem (GFS/OCFS) will be necessary.
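
For a quick proof of concept, the NFS mounts might look like the following sketch. The export paths are illustrative assumptions; 192.168.10.30 is the NFS server pinged in the ha.cf example below.

# Illustrative only - mount the shared exports on both nodes
mount -t nfs 192.168.10.30:/export/nagios /usr/local/nagios
mount -t nfs 192.168.10.30:/export/mysql  /var/lib/mysql
# or make them persistent in /etc/fstab:
# 192.168.10.30:/export/nagios  /usr/local/nagios  nfs  defaults  0 0
# 192.168.10.30:/export/mysql   /var/lib/mysql     nfs  defaults  0 0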

Example ha.cf

debugfile /var/log/ha-debug
logfile	/var/log/ha-log
logfacility	local0
keepalive 5
bcast	eth0
auto_failback on
ping 192.168.10.30   # NFS server
debug 1
use_logd no
node ha-centos4-a
node ha-centos4-b

Example haresources

# ha-centos4-b is specified as the 'preferred' node, i.e. this is where Opsview will usually run
ha-centos4-b opsview opsview-web IPaddr::192.168.10.130/24

Note that the above files are based on the 'old' and very basic Linux-HA configuration files. From version 2 of this project it's preferable to use XML config files to specify a much more detailed configuration. I've used the old method because it's quick and simple and useful for proof of concept. The above configuration can of course be reproduced using the new configuration files.
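
For reference, a roughly equivalent resource definition under the newer CRM-based configuration might look like the sketch below, written for the crm shell. The names are taken from the examples above; treat the agent and script names as assumptions to verify against your Heartbeat/Pacemaker version.

# Hypothetical CRM equivalent of the haresources line above (sketch only)
crm configure primitive opsview_ip ocf:heartbeat:IPaddr2 params ip=192.168.10.130 cidr_netmask=24
crm configure primitive opsview_core lsb:opsview
crm configure primitive opsview_web lsb:opsview-web
crm configure group opsview_grp opsview_ip opsview_core opsview_web
crm configure location prefer_b opsview_grp 100: ha-centos4-b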

Installation process

The process for the test setup was as follows:

  • Install two identical CentOS4 machines
  • Create nagios user/group and nagcmd group, ensuring the UIDs and GIDs match on both machines (see the sketch after this list)
  • Mount /usr/local/nagios on each machine using a shared disk/filesystem
  • Mount /var/lib/mysql (or whichever is the location for mysql's data directories) using a shared disk/filesystem
  • Install the Opsview RPMs on both machines
  • Install the database using the Opsview scripts on one machine
  • Set up Linux-HA as described above
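
As a sketch of the user and group creation step above (the numeric IDs are arbitrary; the important point is that they are identical on both machines):

# Run on both machines with the same IDs - values are illustrative
groupadd -g 3000 nagios
groupadd -g 3001 nagcmd
useradd -u 3000 -g nagios -G nagcmd nagios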

Four Node Solution

This solution uses two database servers running in active-standby, and two Opsview application servers which also run as active-standby. This allows the database to be separated from the application server (as is recommended on larger installations anyway), whilst also providing a highly available configuration.

The MySQL database servers can be set up with shared disks or in master-master replication (http://www.onlamp.com/pub/a/onlamp/2006/04/20/advanced-mysql-replication.html). Either way, there should be a virtual IP address through which Opsview communicates with MySQL; this address will switch to whichever database server is the active node.
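
If the replication route is chosen, a minimal master-master sketch of the relevant my.cnf settings might look like this (server IDs and offsets are illustrative; see the linked article for a full walkthrough):

# my.cnf on database server 1 (sketch)
[mysqld]
server-id=1
log-bin=mysql-bin
auto_increment_increment=2
auto_increment_offset=1

# my.cnf on database server 2 (sketch)
[mysqld]
server-id=2
log-bin=mysql-bin
auto_increment_increment=2
auto_increment_offset=2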

[Diagram: Opsview HA Setup, 4 Node]

Considerations in a distributed environment

If you have a distributed setup, ensure that:

  • The same SSH private key for the nagios user is used by both masters, so that communications to the slaves will still work following a switch over
  • The host used as the master server has a host address of localhost, so that checks run for the master server work regardless of which is the primary one

If you are using the normal forward SSH, the Opsview master will automatically re-establish connections. If you are using reverse SSH, the slave must connect to the Opsview master via the virtual IP. Due to SSH host key checking, you will need to use the same SSH host key on the standby master. Test by connecting to both masters from the command line on each slave node.
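
A quick way to verify this from each slave (hostnames are illustrative; the virtual IP is the one from the haresources example above):

# Run as the nagios user on each slave - hostnames are illustrative
ssh nagios@master-a true
ssh nagios@master-b true
ssh nagios@192.168.10.130 true   # virtual IP, relevant for reverse SSH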

Two Node with Mysql Replication Solution

Replication

For good fault tolerance and recovery in MySQL, the storage engine should be InnoDB. InnoDB can recover from crashes, rolling back half-done transactions and committing pending transactions to disk in case of failure. MyISAM is not capable of such operations and needs table repair, with possible loss of information, if tables are not properly flushed to disk. On the other hand, InnoDB has a bigger on-disk footprint than MyISAM for the same amount of data.

Opsview uses a combination of MyISAM and InnoDB, exploiting the best features of each storage engine. ODW is best kept in MyISAM (for long-term storage, MyISAM has a smaller storage footprint), while the Opsview and Runtime databases use InnoDB (providing constraints, cascading updates and deletes, transactions, etc.).

Shared Storage / DRBD

Opsview does not support active/active master monitoring due to architectural restrictions, so HA has to be achieved via an active/passive setup. Only one server will monitor and provide interface services, while the other will have all monitoring services stopped. In case of a master failure, the services need to be started on the passive node.

If the Opsview services need to be started on another master due to a hardware failure, the newly started daemons will need the same view of the data that the failed node left behind. This can be achieved via an iSCSI device that the active node mounts, or via DRBD (a block device that synchronously mirrors its contents to another block device on another server, providing RAID-1 over the network).

Note: Sharing the data via NFS is not recommended

General Diagram

[Diagram: Opsview HA Setup, 2 Node]

Installation Process

The two nodes will share the Nagios installation (/usr/local/nagios) and the nagios user's home directory (/var/log/nagios); if you use Apache, its configuration files (/etc/httpd) can be shared as well.

The Opsview master will always connect to the database running on localhost. That way, if the master fails and the slave takes control, Opsview will start using the database instance which was replicating the data.
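
A simple way to check that the active node is answering on its local database instance (the user name is an assumption; substitute your own credentials):

# Illustrative check only - confirm the local MySQL instance responds
mysql -h localhost -u opsview -p -e "SHOW DATABASES;"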

Shared Storage/DRBD

DRBD should be installed on both nodes. If Red Hat is used, you can add the CentOS extras repository to yum:

cat > /etc/yum.repos.d/centos-extras.repo <<EOF
[extras]
name=CentOS-5 - Extras
#mirrorlist=http://mirrorlist.centos.org/?release=5&arch=\$basearch&repo=extras
baseurl=http://mirror.centos.org/centos/5/extras/\$basearch/
gpgcheck=1
gpgkey=http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-5
enabled=0
EOF

To install DRBD:

yum --enablerepo=extras install drbd83.x86_64 kmod-drbd83.x86_64

Configuration:

cat > /etc/drbd.conf <<EOF
# syncer rate
common { syncer { rate 10M; } }
# resource
resource r0 {
        protocol C;

        on opsview1 {
                device    /dev/drbd0;
                disk      /dev/sdXX;
                address   IP1:7789;
                meta-disk internal;
        }
        on opsview2 {
                device    /dev/drbd0;
                disk      /dev/sdXX;
                address   IP2:7789;
                meta-disk internal;
        }
}
EOF

Set up the resources:

modprobe drbd
drbdadm create-md r0
drbdadm up r0

Now, both nodes should be in Secondary/Inconsistent state:

# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by mockbuild@v20z-x86-64.home.local, 2009-08-29 14:07:55
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:754774856

Next, force one of the nodes to become the primary:

drbdadm -- --overwrite-data-of-peer primary r0

DRBD is now set up and you should format the DRBD partition with whatever filesystem you want to use (ext3 in the example):

mkfs.ext3 /dev/drbd0
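
Optionally, you can verify the new filesystem manually on the primary node before handing control to Heartbeat; /mnt/shared is the mount point used in the haresources example below.

# Sanity check on the primary node only - Heartbeat will normally manage this mount
mkdir -p /mnt/shared
mount /dev/drbd0 /mnt/shared
df -h /mnt/shared
umount /mnt/shared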

Heartbeat

Heartbeat is necessary in order to detect when the master fails and let the slave take control.

First of all, install it on both nodes:

yum --enablerepo=extras install heartbeat.x86_64

Configuration on both nodes:

cat > /etc/ha.d/ha.cf <<EOF
logfacility     daemon
keepalive 1
deadtime 10
warntime 5
initdead 120 # depends on your hardware
udpport 694
bcast ethX
auto_failback off
node    opsview1
node    opsview2
respawn hacluster /usr/lib64/heartbeat/ipfail
EOF

mkdir /mnt/shared   # mount point for the DRBD filesystem

cat >> /etc/ha.d/haresources <<EOF
opsview1.example.com IPaddr2::${IPADDR}/255.255.255.240/ethY drbddisk::r0 Filesystem::/dev/drbd0::/mnt/shared::ext3 opsview opsview-web httpd
EOF

cat > /etc/ha.d/authkeys <<EOF
auth 1
1 crc
EOF

chmod 600 /etc/ha.d/authkeys

Both nodes should have the same haresources file.
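
Once the resources referenced in haresources are in place (see the following sections), Heartbeat can be enabled and started on both nodes using the standard CentOS service tools:

chkconfig heartbeat on
service heartbeat start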

Opsview

Once the DRBD and Heartbeat installation is finished, the shared components should be symlinked on both nodes to the shared partition:

mkdir /mnt/shared/nagios
ln -s /mnt/shared/nagios /usr/local/nagios
mkdir /mnt/shared/nagios-home
ln -s /mnt/shared/nagios-home /var/log/nagios

Now you can proceed with installing Opsview:

yum install opsview

MySQL

To set up the database, take into account that the second node will be replicating the first, and after a failure the first will start replicating the second, so both nodes should be able to act as a replication master (more info on this here: Replication HowTo).

In both nodes, create users for replication:

mysql> GRANT REPLICATION SLAVE ON *.*
    -> TO 'repl'@'%.example.com' IDENTIFIED BY 'slavepass';

In both my.cnf configuration files (be sure to set a different server ID on each node):

[mysqld]
log-bin=mysql-bin
server-id=DIFFERENT_ID_IN_EACH_NODE
innodb_flush_log_at_trx_commit=1
innodb_autoinc_lock_mode=0 # Required for MySQL 5.1 or later
sync_binlog=1

Get the master status:

mysql> FLUSH TABLES WITH READ LOCK;
mysql> SHOW MASTER STATUS;
+---------------+----------+--------------+------------------+
| File          | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+---------------+----------+--------------+------------------+
| mysql-bin.003 | 73       | test         | manual,mysql     |
+---------------+----------+--------------+------------------+

Start replication in the slave:

CHANGE MASTER TO
  MASTER_HOST='master_host_name',
  MASTER_USER='replication_user_name',
  MASTER_PASSWORD='replication_password',
  MASTER_LOG_FILE='recorded_log_file_name',
  MASTER_LOG_POS=recorded_log_position;

START SLAVE;

When Opsview is installed and the database replication is working, Opsview should be configured to connect to the database running on localhost. Since the Opsview configuration files are on the shared storage and the database is replicating to the slave node, the Opsview configuration only has to be performed on the master.

To test that failover works as expected, the hb_takeover script (installed with Heartbeat) can be run on the slave.
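
For example (the script location varies between distributions; it may live under /usr/share/heartbeat/ or /usr/lib64/heartbeat/):

# On the standby node - ask Heartbeat to take over the resources
/usr/lib64/heartbeat/hb_takeover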

Managing a failure

In case of a failure, the monitoring services will be started on the slave by Heartbeat. The only thing that needs to be done on the slave is to stop the MySQL replication:

mysql> stop slave;

Now it is time to diagnose and repair the failure on the first node. Before bringing the first node online again, it is REALLY important to check that the auto_failback option in Heartbeat is off. Once the first node is repaired and online again, the MySQL replication should be reconfigured: the old master will now be the slave and vice versa.

Before starting the replication, the second node's database should be imported into the first node. On the second node:

mysql> FLUSH TABLES WITH READ LOCK;

in another session:

$ mysqldump --all-databases --master-data > opsview_dump.sql

The --master-data option automatically includes the CHANGE MASTER TO statement required on the slave to start the replication process.

Again in the MySQL session:

mysql> UNLOCK TABLES;

This process can take a considerable amount of time if the ODW database is in use. If so, it is recommended to first export the runtime and opsview databases with the tables locked, and then export the odw database without locking.
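
To complete the role reversal, the dump is then loaded on the repaired first node and replication is started there. A minimal sketch, assuming the dump file has been copied across and reusing the replication user created earlier; run the CHANGE MASTER TO before the import so that the log file and position recorded by --master-data are not reset:

# On the repaired first node - illustrative only
mysql -u root -p -e "CHANGE MASTER TO MASTER_HOST='opsview2', MASTER_USER='repl', MASTER_PASSWORD='slavepass';"
mysql -u root -p < opsview_dump.sql
mysql -u root -p -e "START SLAVE;"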

Managing Opsview updates

To update Opsview when a new version is available, execute the following command on the active node:

yum update

After executing it on the active node, you can execute it on the passive node.
