====== High Availability Master ======

This document covers the theory of setting up high availability for the Opsview master.

There are different ways of achieving high availability. You could run the Opsview master in a virtual machine and provide availability at the virtual machine level; that setup is outside the scope of this document.

This document assumes you have distinct hardware providing failover capability.

It also assumes CentOS, though the theory is applicable to other operating systems.

===== Two Node Solution =====

This configuration involves two near-identical servers working in active-standby mode. It assumes both servers run the MySQL databases locally.

Some form of shared disk is required. The test setup with Opsview used a simple NFS mount on each node, but a more robust solution such as a fibre-attached or iSCSI volume should be used in production; in that case a clustering filesystem (GFS/OCFS) will be necessary.

Example ''ha.cf'':

  debugfile /var/log/ha-debug
  logfile /var/log/ha-log
  logfacility local0
  keepalive 5
  bcast eth0
  auto_failback on
  ping 192.168.10.30   # NFS server
  debug 1
  use_logd no
  node ha-centos4-a
  node ha-centos4-b

Example ''haresources'':

  # ha-centos4-b is specified as the 'preferred' node, i.e. this is where Opsview will usually run
  ha-centos4-b opsview opsview-web IPaddr::192.168.10.130/24

Note that the above files are based on the 'old' and very basic Linux-HA configuration files. From version 2 of Linux-HA it is preferable to use XML configuration files, which allow a much more detailed configuration. The old method is used here because it is quick, simple and useful for a proof of concept; the above configuration can of course be reproduced in the new format.


==== Installation process ====
The process for the test setup was as follows:

  * Install two identical CentOS4 machines
  * Create the nagios user/group and nagcmd group, ensuring the UIDs and GIDs match on both machines
  * Mount /usr/local/nagios on each machine using a shared disk/filesystem
  * Mount /var/lib/mysql (or wherever MySQL's data directory is located) using a shared disk/filesystem
  * Install the Opsview RPMs on both machines
  * Install the database using the Opsview scripts on one machine
  * Set up Linux-HA as described above

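The UID/GID step above matters because files on the shared disk are owned by numeric IDs, not names. A small hedged sketch of the consistency check (the values below are placeholders standing in for the output of ''id -u nagios'' captured from each node, e.g. over ssh):

<code>
# Placeholder values standing in for 'id -u nagios' captured from
# each node (hostnames and IDs are examples, not from this document).
UID_A=3000
UID_B=3000

if [ "$UID_A" = "$UID_B" ]; then
    echo "nagios UID consistent"
else
    echo "UID mismatch: shared-disk files will have wrong ownership after failover" >&2
fi
</code>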

===== Four Node Solution =====
This solution uses two database servers running in active-standby, and two Opsview application servers which also run in active-standby. This allows the database to be separated from the application server (as is recommended on larger installations anyway), while also providing a highly available configuration.

The MySQL database servers can be set up with shared disks or in master-master replication (http://www.onlamp.com/pub/a/onlamp/2006/04/20/advanced-mysql-replication.html). Either way, there should be a virtual IP address through which Opsview communicates with MySQL, which switches between the database servers depending on which is the active node.
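For the master-master approach, a commonly used pair of ''my.cnf'' additions keeps the two nodes from ever allocating the same auto-increment key. A hedged sketch (written to a temporary file here for illustration; on a real node this belongs in ''/etc/my.cnf'', and the values shown are examples, not from this document):

<code>
# my.cnf fragment for master-master replication: interleave
# auto-increment values so the two masters never collide.
cnf=$(mktemp)
cat > "$cnf" <<EOF
[mysqld]
log-bin=mysql-bin
server-id=1
# server-id must be 2 on the other node
auto_increment_increment=2
auto_increment_offset=1
# auto_increment_offset must be 2 on the other node
EOF
</code>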

{{:opsview4.6:opsview_ha_setup_4_node.png|Opsview HA Setup 4 Node}}


===== Considerations in a distributed environment =====
If you have a distributed setup, ensure that:
  * The same SSH private key for the nagios user is used by both masters, so that communications to the slaves still work following a switchover
  * The host used as the master server has a host address of ''localhost'', so that checks run for the master server work regardless of which node is the primary

If you are using normal forward SSH, the Opsview master will automatically re-establish connections. If you are using [[opsview4.1:monitoringserver#ssh_tunnel|reverse SSH]], the slave must connect to the Opsview master via the virtual IP. Due to SSH host key checking, you will need to use the same SSH host key on the standby master. Test by connecting to both masters from the command line on each slave node.
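A quick hedged way to confirm the host keys really are identical is to compare checksums of the key files from both masters. The snippet below illustrates the comparison with temporary stand-in files; in practice you would compare the real files under ''/etc/ssh'' fetched from each master:

<code>
# Two temporary files stand in for the host key file as fetched from
# each master; real usage compares e.g. /etc/ssh/ssh_host_rsa_key.
key_a=$(mktemp); key_b=$(mktemp)
echo "example host key material" > "$key_a"
cp "$key_a" "$key_b"

sum_a=$(md5sum "$key_a" | cut -d' ' -f1)
sum_b=$(md5sum "$key_b" | cut -d' ' -f1)
[ "$sum_a" = "$sum_b" ] && echo "host keys identical"
</code>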

===== Two Node with MySQL Replication Solution =====

==== Replication ====

For good fault tolerance and recovery in MySQL, the storage engine should be InnoDB. InnoDB can recover from crashes, rolling back half-done transactions and committing pending transactions to disk in case of failure. MyISAM is not capable of such operations and needs table repair, with possible loss of information, if tables are not properly flushed to disk. InnoDB, on the other hand, has a bigger on-disk footprint than MyISAM for the same amount of data.

Opsview uses a combination of MyISAM and InnoDB, exploiting the best features of each storage engine. ODW is best in MyISAM (for long-term storage, MyISAM has the smaller storage footprint), while the Opsview and Runtime databases use InnoDB (providing constraints, cascading updates and deletes, transactionality, etc.).

==== Shared Storage / DRBD ====

Opsview does not support active/active master monitoring due to architectural restrictions, so HA has to be achieved via an active/passive setup. Only one server will monitor and provide interface services, while the other has all monitoring services stopped. In case of a master failure, the services need to be started on the passive node.

If the Opsview services need to be started on another master due to a hardware failure, the newly spawned daemons need the same view of the data the failed node left behind. This can be achieved via an iSCSI device that the active node mounts, or via DRBD (a block device that synchronously mirrors its contents to a block device on another server, providing RAID-1 over the network).

Note: sharing the data via NFS is not recommended.

==== General Diagram ====

{{:opsview4.6:opsview_ha_setup_2_node.png|Opsview HA Setup 2 Node}}

==== Installation Process ====

The two nodes will share the nagios installation (''/usr/local/nagios''), the nagios user home (''/var/log/nagios'') and, if you use Apache, the configuration files can be shared as well (''/etc/httpd'').

The Opsview master will always connect to the database running on localhost. That way, if the master fails and the slave takes control, Opsview will start using the database instance which was replicating the data.

=== Shared Storage/DRBD ===

DRBD should be installed on both nodes. If RedHat is used, you can add the CentOS extras repository to ''yum'':
<code>
cat > /etc/yum.repos.d/centos-extras.repo <<EOF
[extras]
name=CentOS-5 - Extras
#mirrorlist=http://mirrorlist.centos.org/?release=5&arch=\$basearch&repo=extras
baseurl=http://mirror.centos.org/centos/5/extras/\$basearch/
gpgcheck=1
gpgkey=http://mirror.centos.org/centos/RPM-GPG-KEY-CentOS-5
enabled=0
EOF
</code>

To install DRBD:
<code>
yum --enablerepo=extras install drbd83.x86_64 kmod-drbd83.x86_64
</code>

Configuration:
<code>
cat > /etc/drbd.conf <<EOF
# syncer rate
common { syncer { rate 10M; } }
# resource
resource r0 {
        protocol C;

        on opsview1 {
                device    /dev/drbd0;
                disk      /dev/sdXX;
                address   IP1:7789;
                meta-disk internal;
        }
        on opsview2 {
                device    /dev/drbd0;
                disk      /dev/sdXX;
                address   IP2:7789;
                meta-disk internal;
        }
}
EOF
</code>

Set up the resources:
<code>
modprobe drbd
drbdadm create-md r0
drbdadm up r0
</code>

Now both nodes should be in Secondary/Inconsistent state:
<code>
# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by mockbuild@v20z-x86-64.home.local, 2009-08-29 14:07:55
 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:754774856
</code>
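The ''cs:'' (connection state) and ''ds:'' (disk state) fields are the ones to watch. A small hedged helper that extracts them from a ''/proc/drbd''-style status line (a captured sample is used here so the snippet runs without a real DRBD device; on a live node you would read ''/proc/drbd'' itself):

<code>
# Sample status line in the same format as /proc/drbd.
status=' 0: cs:Connected ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----'

cs=$(echo "$status" | sed -n 's|.*cs:\([A-Za-z]*\).*|\1|p')
ds=$(echo "$status" | sed -n 's|.*ds:\([A-Za-z/]*\) .*|\1|p')
echo "connection=$cs disk=$ds"
</code>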

Now force one of the nodes to become the primary:
<code>
drbdadm -- --overwrite-data-of-peer primary r0
</code>

DRBD is now set up; format the DRBD partition with whichever filesystem you want to use (ext3 in this example):
<code>
mkfs.ext3 /dev/drbd0
</code>

=== Heartbeat ===

Heartbeat is necessary in order to detect when the master fails and let the slave take control.

First of all, install it on both nodes:
<code>
yum --enablerepo=extras install heartbeat.x86_64
</code>

Configuration on both nodes:
<code>
cat > /etc/ha.d/ha.cf <<EOF
logfacility     daemon
keepalive 1
deadtime 10
warntime 5
initdead 120 # depends on your hardware
udpport 694
bcast ethX
auto_failback off
node    opsview1
node    opsview2
respawn hacluster /usr/lib64/heartbeat/ipfail
EOF

mkdir /mnt/shared

cat >> /etc/ha.d/haresources <<EOF
opsview1.example.com IPaddr2::${IPADDR}/255.255.255.240/ethY drbddisk::r0 Filesystem::/dev/drbd0::/mnt/shared::ext3 opsview opsview-web httpd
EOF

cat > /etc/ha.d/authkeys <<EOF
auth 1
1 crc
EOF

chmod 600 /etc/ha.d/authkeys
</code>

Both nodes must have identical ''haresources'' files.
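Note that the ''crc'' method in the ''authkeys'' example provides integrity checking only, with no authentication. If the heartbeat link is not a fully trusted network, ''sha1'' with a random shared key is generally preferable. A hedged sketch (written to a temporary path here; the real target is ''/etc/ha.d/authkeys'', and the same file must be copied to both nodes):

<code>
# Generate an authkeys file using sha1 with a random 128-bit key.
authkeys=$(mktemp)
secret=$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')
cat > "$authkeys" <<EOF
auth 1
1 sha1 $secret
EOF
chmod 600 "$authkeys"
</code>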

=== Opsview ===

Once the DRBD and Heartbeat installation is finished, the shared components should be symlinked on both nodes to the shared partition:
<code>
mkdir /mnt/shared/nagios
ln -s /mnt/shared/nagios /usr/local/nagios
mkdir /mnt/shared/nagios-home
ln -s /mnt/shared/nagios-home /var/log/nagios
</code>

Now you can proceed with installing Opsview:
<code>
yum install opsview
</code>
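A quick sanity check that the symlinks resolve where Opsview expects them; illustrated under a scratch directory here, so substitute the real ''/mnt/shared'' and ''/usr/local/nagios'' paths on an actual node:

<code>
# Re-create the layout under a temporary directory for illustration.
root=$(mktemp -d)
mkdir -p "$root/mnt/shared/nagios" "$root/usr/local"
ln -s "$root/mnt/shared/nagios" "$root/usr/local/nagios"

# The link should resolve to the shared-storage copy.
[ "$(readlink "$root/usr/local/nagios")" = "$root/mnt/shared/nagios" ] && echo "symlink ok"
</code>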

=== MySQL ===

When setting up the database, bear in mind that the second node will replicate the first and, after a failure, the first will start replicating the second, so both nodes should be able to act as a replication master (more information here: [[http://dev.mysql.com/doc/refman/5.0/en/replication-howto.html|Replication HowTo]]).

On both nodes, create users for replication:
<code>
mysql> GRANT REPLICATION SLAVE ON *.*
    -> TO 'repl'@'%.example.com' IDENTIFIED BY 'slavepass';
</code>

In both ''my.cnf'' configuration files (be sure to set a different server id on each node):
<code>
[mysqld]
log-bin=mysql-bin
server-id=DIFFERENT_ID_IN_EACH_NODE
innodb_flush_log_at_trx_commit=1
innodb_autoinc_lock_mode=0 # Required for MySQL 5.1 or later
sync_binlog=1
</code>

Get the master status:
<code>
mysql> FLUSH TABLES WITH READ LOCK;
mysql> SHOW MASTER STATUS;
+---------------+----------+--------------+------------------+
| File          | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+---------------+----------+--------------+------------------+
| mysql-bin.003 | 73       | test         | manual,mysql     |
+---------------+----------+--------------+------------------+
</code>
Start replication on the slave:
<code>
CHANGE MASTER TO
  MASTER_HOST='master_host_name',
  MASTER_USER='replication_user_name',
  MASTER_PASSWORD='replication_password',
  MASTER_LOG_FILE='recorded_log_file_name',
  MASTER_LOG_POS=recorded_log_position;

START SLAVE;
</code>
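Filled in with the example values used elsewhere in this document (the ''repl'' user created above and the ''SHOW MASTER STATUS'' output shown earlier), the statement would look like this; the host name follows the node naming used in the ''haresources'' example:

<code>
CHANGE MASTER TO
  MASTER_HOST='opsview1.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='slavepass',
  MASTER_LOG_FILE='mysql-bin.003',
  MASTER_LOG_POS=73;

START SLAVE;
</code>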

Once Opsview is installed and database replication is working, Opsview should be configured to connect to the database running on localhost. Since the Opsview configuration files are on the shared storage, and the database is replicated on the slave node, the Opsview configuration only has to be performed on the master.

To test whether failover works as expected, the hb_takeover script (installed with Heartbeat) can be used on the slave.

==== Managing a failure ====

In case of a failure, the monitoring services will be started on the slave by the Heartbeat system. The only thing that needs to be done on the slave is to stop the MySQL replication:
<code>
mysql> stop slave;
</code>

Now it is time to diagnose and repair the first node's failure. It is REALLY important here, before bringing the first node online, to check that the ''auto_failback'' option in Heartbeat is OFF. Once the first node is repaired and online again, the MySQL replication should be reconfigured: the old master will now be the slave and vice versa.

Before starting the replication, the second node's database should be imported into the first node. On the second node:
<code>
mysql> FLUSH TABLES WITH READ LOCK;
</code>

In another session:
<code>
$ mysqldump --all-databases --master-data > opsview_dump.sql
</code>
The ''--master-data'' option automatically appends the statements required on the slave to start the replication process.

Again in the mysql session:
<code>
mysql> UNLOCK TABLES;
</code>

This process can take a long time if the ODW database is being used. If so, it is recommended to first export the ''runtime'' and ''opsview'' databases with the tables locked, and then, without locking, export ''odw''.


==== Managing Opsview updates ====

When a new version is available, update Opsview by executing the following command on the active node:
<code>
yum update
</code>

After executing it on the active node, you can execute it on the passive node.