 +====== Opsview Data Warehouse (ODW) ======
 +The Opsview Data Warehouse, or ODW, is the long term storage for monitoring data. It converts the data from the Runtime database into an OLAP datawarehouse format.
 +A core requirement for a data warehouse is to avoid normalising the data too heavily while still allowing queries to be created easily. Hopefully, as you'll see, we have reached this target.
 +
 +===== Overview =====
 +The datawarehouse pulls data in once an hour. It works out if there have been any configuration changes and stores those if required. It then saves various Nagios® Core information from the runtime database before calculating the summary statistics.
 +
 +
 +==== Distributed Monitoring ====
 +ODW calculates information based on the Opsview Master server. Any slaves monitoring remote sites will automatically send their information to the master, and hence will be captured in ODW as well.
 +
 +If you have a connection to slaves which is not consistently available, you can set parameters to temporarily delay the ODW import if all slaves are not OK. This increases the integrity of the data in ODW. You can enable this functionality by setting a [[opsview4.6:configuration_files#detect_slave_status_on_import|configuration variable]].
 +
 +
 +==== Limitations ====
 +  * Only service information is stored
 +
 +==== References ====
 +The datawarehouse has been created based on guidelines in Ralph Kimball's book, [[http://www.amazon.co.uk/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247|The Data Warehouse Toolkit]].
 +
 +
 +===== Setup =====
 +In [[opsview4.6:systempreference#odw|System Preferences in the ODW tab]], enable ODW importing.
 +
 +A cron job runs at 4 minutes past each hour to import data into ODW.
 +
 +**Important**: There is a default servicecheck called //Opsview Data Warehouse Status// which will alert if the ODW database is not up to date. Set this up so you will be alerted if the imports fall behind by more than 2 hours.
 +
 +
 +===== Available data =====
 +==== Schema ====
 +A picture can say a thousand words, so the ODW relationships are shown in this {{opsview4.6:odw_schema.pdf|schema diagram}}.
 +
 +
 +
 +==== Dimensions ====
 +The dimensions are primarily hosts and servicechecks. As we also store configuration information about hosts and services, there may be more than one row for each host/service. When designing your SQL query, take into consideration that multiple rows could exist and join your tables appropriately; a sketch is shown below.
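 +
 +For example, a sketch (using only the hostname and name columns seen in the example queries later on this page) that lists services with more than one configuration row in the servicechecks dimension:
 +<code>
 +select hostname, name, count(*) as configuration_rows
 +from servicechecks
 +group by hostname, name
 +having count(*) > 1
 +</code>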
 +
 +There is also a performance label dimension. This allows quick querying to find out what performance information is available for a given host.
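 +
 +As a sketch (reusing the joins from the example queries later on this page, with 'opsview' as an assumed hostname), the performance labels available for one host can be listed with:
 +<code>
 +select servicechecks.name, performance_labels.name, performance_labels.units
 +from servicechecks, performance_labels
 +where performance_labels.servicecheck = servicechecks.id
 +and servicechecks.hostname = 'opsview'
 +</code>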
 +
 +
 +==== Facts ====
 +These hold the core information. There are five tables:
 +  * state_history - this shows all the state changes that have occurred for a given service
 +  * servicecheck_results - this shows the result of every check from every service
 +  * performance_data - this shows every performance result from every service. If you are only using averages, then use the [[opsview4.6:odw#performance_hourly_summary|performance_hourly_summary]] table instead
 +  * downtime_*_history - contains all downtimes for all hosts and services
 +  * acknowledgement_* - contains all acknowledgements for all hosts and services
 +
 +**Note**: The downtime_*_history tables use nagios_object_id as the key for the host/service. This is because a downtime is a range of time, so it is possible that a host or service's configuration could change during it, which would change the servicecheck key. However, since nagios_object_id is fixed, we key against that instead. The acknowledgement_* tables do not use nagios_object_id because an acknowledgement is at a point in time, so it does not have the same problem.
 +
 +==== Summary facts ====
 +There are two tables:
 +  * service_availability_hourly_summary
 +  * performance_hourly_summary
 +
 +=== service_availability_hourly_summary ===
 +This is an aggregated table, taking information from state_history, downtime_*_history and acknowledgement_* to provide numerical information about the number of seconds that a service is in a particular state. This allows quick calculations to discover any problems you might have.
 +
 +The columns in this table are:
 +  * start_datetime - this is the top of the hour for the summary information
 +  * seconds_ok - this is the total number of seconds that this service was okay
 +  * seconds_not_ok - normally seconds_ok + seconds_not_ok = 3600 (not true if a service started in the middle of an hour)
 +  * seconds_warning - amount of time in a warning state
 +  * seconds_critical - amount of time in a critical state
 +  * seconds_unknown - amount of time in an unknown state
 +  * seconds_not_ok_hard - equivalent to seconds_not_ok, but based on a hard state only. Always <= seconds_not_ok
 +  * seconds_warning_hard - amount of time in a hard warning state. Always <= seconds_warning
 +  * seconds_critical_hard - amount of time in a hard critical state
 +  * seconds_unknown_hard - amount of time in a hard unknown state
 +  * seconds_not_ok_scheduled - amount of time in a scheduled downtime for host/service and service not ok
 +  * seconds_warning_scheduled
 +  * seconds_critical_scheduled
 +  * seconds_unknown_scheduled - the summation of warning, critical and unknown will give seconds_not_ok_scheduled
 +  * seconds_unacknowledged - amount of time after a hard state change before an acknowledgement is received. See notes below about the calculation of seconds_unacknowledged
 +  * seconds_unhandled - amount of time service was in a failure state and not acknowledged and not in downtime and host was up. See notes below about the calculation of seconds_unhandled. Available from Opsview 4.1
 +  * seconds_unhandled_hard - amount of time service was in a hard failure state and not acknowledged and not in downtime and host was up. Available from Opsview 4.1
 +
 +== Notes ==
 +  * When first importing, all failures are considered acknowledged
 +  * Downtimes that start before the first import are ignored
 +  * acknowledged and scheduled downtimes are not linked to state because it shouldn't matter
 +  * "*_soft" has not been added. This can be calculated as, eg, seconds_warning - seconds_warning_hard
 +  * Nagios Core has a concept of a "soft OK". In ODW, all OK states are hard
 +  * seconds_not_ok_scheduled is incremented if a downtime is set for a service that is already in a failed state
 +  * If the host for a service is in a downtime state, then the service will be considered to be in downtime too
 +  * seconds_unacknowledged starts from the time of a failed hard state
 +  * If a host is acknowledged, all services on that host will be acknowledged too (if they are in a failed state)
 +  * If a service is acknowledged before it goes into a hard state, seconds_unacknowledged will be = 0 for that particular outage
 +  * seconds_unacknowledged will not be incremented if a service was in downtime at the time of the hard failure. However, if a service is already in a failed state and is unacknowledged, the seconds_unacknowledged will continue to increment during this downtime period
 +  * seconds_unacknowledged will start incrementing when a downtime finishes
 +  * If a service is already in a failed state without being acknowledged and downtime is set, seconds_unacknowledged will include the seconds elapsed during this downtime period. However, this would not be usual operational behaviour
 +  * To find the amount of time a service has been in critical without a scheduled downtime, use seconds_critical - seconds_critical_scheduled
 +  * seconds_unhandled counts its time based on events of the service, so acknowledgements, downtime start and downtime stop events on the service will count the time appropriately. However, host state events for the particular service are not necessarily recorded, so the unhandled calculation may be too high until a subsequent event occurs for the service
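 +
 +As a sketch, the soft warning time over a day can be derived from the total and hard columns like this:
 +<code>
 +select sum(seconds_warning - seconds_warning_hard) as seconds_warning_soft
 +from service_availability_hourly_summary
 +where start_datetime between '2007-11-06 00:00:00' and '2007-11-06 23:00:00'
 +</code>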
 +
 +=== performance_hourly_summary ===
 +
 +This is an aggregated table of the performance data for each performance plot. This allows you to retrieve the information over time simply. The maximum granularity is one hour.
 +
 +The columns in this table are:
 +  * start_datetime - this is the top of the hour for the summary information
 +  * performance_label - this joins with the performance_labels table
 +  * average - the average value over the hour
 +  * max - the maximum value over the hour
 +  * min - the minimum value over the hour
 +  * count - the number of performance plots in this hour. If there were no plots, there would be no rows
 +  * stddev - the standard deviation of a sample within the hour. This is the same as Excel's STDEV function
 +  * stddevp - the standard deviation of the population within the hour. This is the same as Excel's STDEVP function
 +  * first - the first value in the hour
 +  * sum - the total of all values in that hour
 +
 +**Hint**: If you want an average over a day, do not use ''AVG(average)''. Use ''SUM(sum)/SUM(count)'' instead.
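 +
 +For example, a minimal sketch of a correct daily average, grouped per performance label:
 +<code>
 +select performance_label, date(start_datetime) as day, sum(sum)/sum(count) as daily_average
 +from performance_hourly_summary
 +group by performance_label, date(start_datetime)
 +</code>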
 +
 +
 +===== Example queries =====
 +
 +==== Percentage availability for a given day for all services ====
 +<code>
 +select sum(seconds_ok)/sum(seconds_ok+seconds_not_ok)*100
 +from service_availability_hourly_summary
 +where start_datetime between '2007-11-06 00:00:00' and '2007-11-06 23:00:00'
 +</code>
 +
 +==== Percentage of time in error due to hard state ====
 +<code>
 +select sum(seconds_not_ok_hard)/sum(seconds_ok+seconds_not_ok)*100
 +from service_availability_hourly_summary
 +where start_datetime between '2007-11-06 00:00:00' and '2007-11-06 23:00:00'
 +</code>
 +
 +==== Percentage of time in critical state ====
 +<code>
 +select sum(seconds_critical)/sum(seconds_ok+seconds_not_ok)*100
 +from service_availability_hourly_summary
 +where start_datetime between '2007-11-06 00:00:00' and '2007-11-06 23:00:00'
 +</code>
 +
 +
 +==== Percentage availability (excluding scheduled downtimes) ====
 +<code>
 +select sum(seconds_not_ok-seconds_not_ok_scheduled)/sum(seconds_ok+seconds_not_ok)*100
 +from service_availability_hourly_summary
 +where start_datetime between '2007-11-06 00:00:00' and '2007-11-06 23:00:00'
 +</code>
 +
 +==== Percentage availability for all HTTP services during working hours (8am-6pm) ====
 +<code>
 +select sum(seconds_ok)/sum(seconds_ok+seconds_not_ok)*100
 +from service_availability_hourly_summary,servicechecks
 +where start_datetime between '2007-11-06' and '2007-11-12'
 +and service_availability_hourly_summary.servicecheck=servicechecks.id
 +and servicechecks.name = 'HTTP'
 +and time(start_datetime) between '08:00:00' and '17:00:00'
 +</code>
 +
 +==== Average response time to acknowledge failures ====
 +<code>
 +select avg(seconds_unacknowledged)
 +from service_availability_hourly_summary
 +where start_datetime between '2007-11-06 00:00:00' and '2007-11-06 23:00:00'
 +and seconds_unacknowledged > 0
 +</code>
 +**Note**: you have to ask for seconds_unacknowledged > 0 (which means a failure actually occurred), otherwise the rows with zero values will bring the average down. Also, if a problem automatically recovers, then the seconds_unacknowledged counter stops, so this will not always reflect a human response time.
 +
 +==== List all services that failed during a scheduled downtime for a given week ====
 +<code>
 +select distinct(servicechecks.id),hostname,name
 +from service_availability_hourly_summary, servicechecks
 +where service_availability_hourly_summary.servicecheck=servicechecks.id
 +and start_datetime between '2007-11-06' and '2007-11-12'
 +and seconds_not_ok_scheduled > 0
 +</code>
 +==== List all performance data for a host ====
 +<code>
 +select
 + performance_data.datetime,
 + servicechecks.hostname,
 + servicechecks.name,
 + performance_labels.name,
 + performance_labels.units,
 + performance_data.value
 +from
 + servicechecks,
 + performance_labels,
 + performance_data
 +where
 + servicechecks.id = performance_labels.servicecheck
 + and performance_labels.id = performance_data.performance_label
 + and performance_data.datetime between '2009-01-01' and '2009-01-02'
 + and servicechecks.hostname='opsview'
 +</code>
 +
 +**Note**: It is necessary to join against the servicechecks table because the servicechecks.id number will change when there has been a configuration change to the service or its associated host. Searching the servicechecks table via the hostname will ensure that this host's information will always be returned.
 +==== List daily average values for all performance data on a host ====
 +There are two ways of calculating this:
 +  * using summary data
 +  * using raw data
 +
 +The summary data query takes about a tenth of the time of the raw data query, so we recommend you use the summary tables.
 +
 +Note also that there may be slight differences in the values returned between the two queries, but these are rounding errors after 13 significant digits.
 +
 +=== Summary data query ===
 +Note that the BETWEEN statement is '2013-09-11 00:00:00' AND '2013-09-11 23:00:00' because you will want to include the last hour's summary data, but not the hour in the following day.
 +
 +<code>
 +SELECT
 + DATE(performance_hourly_summary.start_datetime) AS date,
 + servicechecks.hostname,
 + servicechecks.name,
 + performance_labels.name,
 + performance_labels.units,
 + SUM(performance_hourly_summary.sum)/SUM(performance_hourly_summary.count) as value
 +FROM
 + servicechecks,
 + performance_labels,
 + performance_hourly_summary
 +WHERE
 + servicechecks.id = performance_labels.servicecheck
 + AND performance_labels.id = performance_hourly_summary.performance_label
 + AND performance_hourly_summary.start_datetime BETWEEN '2013-09-11 00:00:00' AND '2013-09-11 23:00:00'
 + AND servicechecks.hostname='opsview'
 +GROUP BY
 + servicechecks.hostname,
 + servicechecks.name,
 + performance_labels.name,
 + performance_labels.units,
 + DATE(performance_hourly_summary.start_datetime)
 +</code>
 +
 +=== Raw data query ===
 +**Note**: This may include a data point that is in the following day due to the behaviour of BETWEEN, eg 2010-09-12 00:00:00.
 +<code>
 +SELECT
 + DATE(performance_data.datetime) AS date,
 + servicechecks.hostname,  
 + servicechecks.name,  
 + performance_labels.name,  
 + performance_labels.units,
 + AVG(performance_data.value)
 +FROM
 + servicechecks,  
 + performance_labels,  
 + performance_data
 +WHERE
 + servicechecks.id = performance_labels.servicecheck
 + AND performance_labels.id = performance_data.performance_label
 + AND performance_data.datetime BETWEEN '2010-09-11' AND '2010-09-12'
 + AND servicechecks.hostname='opsview'
 +GROUP BY
 + servicechecks.hostname,
 + servicechecks.name,
 + performance_labels.name,
 + performance_labels.units,
 + DATE(performance_data.datetime)
 +</code>
 +
 +==== List daily average and max values for the HTTP response time across all hosts in the last month ====
 +<code>
 +SELECT
 + DATE(performance_hourly_summary.start_datetime) AS date,
 + servicechecks.hostname,
 + performance_labels.name,
 + SUM(sum)/SUM(count) AS average,
 + MAX(max) AS max
 +FROM
 + servicechecks,
 + performance_labels,
 + performance_hourly_summary
 +WHERE
 + servicechecks.id = performance_labels.servicecheck
 + AND performance_labels.id = performance_hourly_summary.performance_label
 + AND performance_hourly_summary.start_datetime BETWEEN DATE(NOW()-INTERVAL 1 MONTH) AND DATE(NOW())-INTERVAL 1 HOUR
 + AND servicechecks.name='HTTP'
 + AND performance_labels.name='time'
 +GROUP BY
 + servicechecks.hostname,
 + performance_labels.units,
 + DATE(performance_hourly_summary.start_datetime)
 +</code>
 +
 +Note that this query uses the summarised performance data table which is much faster (and can be held for much longer) than the raw data table.
 +
 +==== List all service state changes for a particular servicegroup that occurred yesterday ====
 +Order by hostname, servicename, so you can see state changes for that service over the time period.
 +<code>
 +select
 + servicechecks.hostname,
 + servicechecks.name,
 + state_history.datetime,
 + state_history.status,
 + state_history.output
 +from
 + state_history,
 + servicechecks
 +where
 + state_history.datetime between '2008-12-02 00:00:00' and '2008-12-03 00:00:00'
 + and state_history.servicecheck = servicechecks.id
 + and servicechecks.servicegroup = 'Operations'
 +order by
 + servicechecks.hostname,
 + servicechecks.name
 +</code>
 +
 +==== Unix disk usage for a specific host ====
 +
 +List disks and show MB used and percent used for each filesystem:
 +<code>
 +SELECT
 + start_datetime,
 + performance_labels.name,
 + AVG(case performance_labels.units when 'MB' then average else null end) 'MB used',
 + AVG(case performance_labels.units when '' then average else null end) '% used'
 +FROM
 + performance_hourly_summary,
 + performance_labels,
 + servicechecks
 +WHERE
 + performance_labels.servicecheck=servicechecks.id
 + AND performance_labels.id=performance_hourly_summary.performance_label
 + AND servicechecks.hostname='opsview'
 + AND servicechecks.name LIKE 'Unix disk:%'
 +GROUP BY
 + start_datetime,
 + servicechecks.name
 +ORDER BY
 + start_datetime DESC
 +LIMIT 10;
 ++---------------------+----------------+------------------+------------------+
 +| start_datetime      | name           | MB used          | % used           |
 ++---------------------+----------------+------------------+------------------+
 +| 2013-01-03 19:00:00 | /              | 3548.08333333333 | 88.0191666666666 |
 +| 2013-01-03 19:00:00 | /mnt/tmpfs     |               24 |               96 |
 +| 2013-01-03 19:00:00 | /var/lib/mysql |            39526 |            92.48 |
 +| 2013-01-03 18:00:00 | /              |             3546 |          87.9675 |
 +| 2013-01-03 18:00:00 | /mnt/tmpfs     |               24 |               96 |
 +| 2013-01-03 18:00:00 | /var/lib/mysql | 39525.0833333333 |          92.4775 |
 +| 2013-01-03 17:00:00 | /              | 3543.91666666667 |           87.915 |
 +| 2013-01-03 17:00:00 | /mnt/tmpfs     |               24 |               96 |
 +| 2013-01-03 17:00:00 | /var/lib/mysql |            39524 |            92.47 |
 +| 2013-01-03 16:00:00 | /              |           3541.5 |           87.855 |
 ++---------------------+----------------+------------------+------------------+
 +10 rows in set (0.50 sec)
 +</code>
 +
 +
 +==== Windows disk usage for a specific host ====
 +
 +List disks and show GB used and percent used for each filesystem:
 +<code>
 +
 +SELECT
 + start_datetime,
 + performance_labels.name,
 + AVG(case performance_labels.units when 'G' then average else null end) 'GB used',
 + AVG(case performance_labels.units when '%' then average else null end) '% used'
 +FROM
 + performance_hourly_summary,
 + performance_labels,
 + servicechecks
 +WHERE
 + performance_labels.servicecheck=servicechecks.id
 + AND performance_labels.id=performance_hourly_summary.performance_label
 + AND servicechecks.hostname='windows'
 + AND servicechecks.name LIKE 'C Drive'
 +GROUP BY
 + start_datetime,
 + servicechecks.name
 +ORDER BY
 + start_datetime DESC
 +LIMIT 10;
 +
 ++---------------------+------+---------+--------+
 +| start_datetime      | name | GB used | % used |
 ++---------------------+------+---------+--------+
 +| 2012-11-23 14:00:00 | C: % |   32.31 |     54 |
 +| 2012-11-23 13:00:00 | C: % |   32.31 |     54 |
 +| 2012-11-23 12:00:00 | C: % |   32.31 |     54 |
 +| 2012-11-23 11:00:00 | C: % |   32.31 |     54 |
 +| 2012-11-23 10:00:00 | C: % |   32.31 |     54 |
 +| 2012-11-23 09:00:00 | C: % |   32.31 |     54 |
 +| 2012-11-23 08:00:00 | C: % |   32.31 |     54 |
 +| 2012-11-23 07:00:00 | C: % |   32.31 |     54 |
 +| 2012-11-23 06:00:00 | C: % |   32.31 |     54 |
 +| 2012-11-23 05:00:00 | C: % |   32.31 |     54 |
 ++---------------------+------+---------+--------+
 +10 rows in set (0.04 sec)
 +</code>
 +
 +===== Frequently asked questions =====
 +==== Why aren't you using NDO? ====
 +We are. However, NDO has not really been designed for long term storage (by default, service results are set to expire after a week). By pulling the data from NDO into our own schema, we can format the data so that it can be held for a longer period of time. We can also add indexes where it is most appropriate, without worrying if new updates to NDO will break other software that relies on ODW.
 +
 +==== How come it is sometimes "servicecheck" and other times "service"? ====
 +Unfortunately, this is a historical problem and would be difficult to correct without aliasing table and column names. We have now made sure our [[opsview4.6:terminology|terminology]] is correct, so new work should be consistently named.
 +==== How are timezones stored? ====
 +All data in ODW is stored using UTC, not the local time of the server, so no timezone information is stored.
 +
 +
 +==== How can I ensure timezones are calculated correctly? ====
 +If you access ODW using the Opsview libraries (Odw.pm or Odw::Schema), then the timezone for the session is switched to UTC automatically.
 +
 +If you access ODW from a mysql command line, you can either set the session timezone to be UTC with: ''set time_zone="+00:00"'' or make the conversions as part of the input:
 +  * use CONVERT_TZ(datetime, @@session.time_zone, '+00:00') to convert to UTC when inputting date ranges
 +  * use CONVERT_TZ(datetime, '+00:00', @@session.time_zone) to convert from UTC when displaying times
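 +
 +As a sketch, reusing the first availability query on this page and treating the literal dates as session local time:
 +<code>
 +select sum(seconds_ok)/sum(seconds_ok+seconds_not_ok)*100
 +from service_availability_hourly_summary
 +where start_datetime between
 + convert_tz('2007-11-06 00:00:00', @@session.time_zone, '+00:00')
 + and convert_tz('2007-11-06 23:00:00', @@session.time_zone, '+00:00')
 +</code>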
 +
 +==== How long does a dataload take? ====
 +This information is stored in the dataloads table. You can see when the data load for a particular hour period was made and how long it took:
 +<code>
 +select
 + from_unixtime(period_start_timev) as period_top_of_hour,
 + from_unixtime(load_start_timev) as load_start_time,
 + load_end_timev-load_start_timev as duration
 +from dataloads
 +order by id desc
 +limit 10
 +</code>
 +
 +==== Can I see the trends of previous data loads? ====
 +Run this query:
 +<code>
 +select
 + from_unixtime(period_start_timev),
 + from_unixtime(load_start_timev),
 + duration,
 + num_serviceresults
 +from dataloads
 +order by id desc
 +limit 100
 +</code>
 +This will show the start of the hour period being imported, when the import took place, the duration and the number of serviceresults that were imported.
 +
 +==== I get a service result with "ODW_STATUS WARNING - No update since: ...." ====
 +This is from the check_odw_status plugin. It means the ODW import job is falling behind.
 +
 +If there was a failure, see the next section.
 +
 +If this is because your data loads are taking more than an hour to run (see above for a query to show the time taken), then one thing you can do is change the cron entry for import_runtime so it only runs once per day. A single run then does the lookups only once, during the first hour's import, and the subsequent hours import much quicker as the lookups are not required.
 +
 +Also there is a configuration option ''$detect_slave_status_on_import'', disabled by default, which enables checking the status of slave servers when running ''import_runtime''. If enabled, all active slave servers have to be in ''OK'' status for at least 5 minutes before the import will proceed.
 +
 +Note:
 +  * the crontab will be overwritten as part of an upgrade
 +  * you should change the parameters to the check_odw_status plugin so the threshold for alerting on ODW imports is 25 hours
 +  * configuration changes for hosts and services will only be recorded when the import runs
 +
 +==== I get an error "There are running dataloads existing" from import_runtime ====
 +This means that a new dataload has run, but the last one hasn't completed yet. This can also occur if, say, a server shutdown was initiated during the last dataload causing an incomplete dataload to occur.
 +
 +There needs to be manual intervention to check:
 +  * that there isn't an existing dataload running
 +  * that there is a row in the dataloads table with a status of "running"
 +
 +The fix depends on how far back the last successful import was. Use the check_odw_status plugin to print the last import time, or run this query:
 +
 +<code>
 +mysql> use odw;
 +mysql> select from_unixtime(period_start_timev),status from odw.dataloads order by id desc limit 1;
 ++-----------------------------------+---------+
 +| from_unixtime(period_start_timev) | status  |
 ++-----------------------------------+---------+
 +| 2009-02-05 10:00:00               | running |
 ++-----------------------------------+---------+
 +1 row in set (0.00 sec)
 +</code>
 +
 +If there is definitely no import_runtime process running, then continue with the cleanup. This could have occurred if, for instance, the server was rebooted during the import process. Check for the import_runtime process with:
 +<code>
 +ps -ef | grep import_runtime | grep -v grep
 +</code>
 +
 +**Note**: When you cleanup the last import (detailed below), be aware that some data cannot be reverted completely, such as the end time for downtime_service_history and downtime_host_history.
 +
 +**Note**: If the pid of the last import no longer exists, then it is assumed that the last import failed and an automatic cleanup of the database is run and the import should continue.
 +
 +=== Last import was within a week ===
 +As Runtime holds servicecheck results for a week, you can just get the import to continue from the last point:
 +  * manually run: ''/usr/local/nagios/bin/cleanup_import''. This will delete rows related to the failed dataload
 +  * manually run: ''/usr/local/nagios/bin/import_runtime -i 1''. This will import one hour's worth of data - there could be other errors, such as [[opsview4.6:mysql#common_tasks|crashed tables]]
 +  * manually run: ''/usr/local/nagios/bin/import_runtime'' or leave the cron job to run. This will import up to current time
 +
 +=== Last import was over a week ago ===
 +If the last import was a very long time ago, but you want the statistical data, you can follow the steps above, but note:
 +  * service check results are discarded after a week, so the import script will not be able to import every check result
 +  * performance data is stored within the service check results in runtime, so those will be lost too
 +  * other data (notifications, downtimes, service state changes) are retained in runtime up to the audit log retention period (default 365 days), so will still be imported correctly as long as that data hasn't been pruned. The state changes tables determine the calculations for the service_availability_hourly_summary table
 +
 +If you do not want statistical data between the last import and a recent date, you can restart the import from a different point.
 +  * Run the cleanup script manually: ''/usr/local/nagios/bin/cleanup_import''
 +  * Run a single import, giving a restart time: ''/usr/local/nagios/bin/import_runtime -i 1 -r "2008-10-31 09"''
 +
 +If this works successfully, then you can run import_runtime without the -i option, which will cause it to catch up to the current time (or you could leave it for the next cron job).
 +
 +==== I get an error 'Last update to DB is YYYY-MM-DD HH:MM:SS. Cannot run until after YYYY-MM-DD HH:MM:SS' but the run after date has already passed ====
 +
 +This has been seen in testing where the runtime.nagios_programstatus table was holding multiple rows. This table should only ever have 1 row. To fix it, ensure the Opsview daemons are shut down
 +
 +  /etc/init.d/opsview stop
 +
 +and then run the following MySQL queries against the runtime database
 +
 +<code>
 +mysql> delete from runtime.nagios_instances where instance_id != 1;
 +mysql> delete from runtime.nagios_programstatus where instance_id != 1;
 +</code>
 +
 +Then restart Opsview
 +
 +  /etc/init.d/opsview start
 +
 +Please let us [[:support|know]] if you get this error.
 +==== I get the error message "Upgrade in progress" ====
 +This is due to a lock file that is added during a package installation to stop the ODW imports from running. If you are sure that an upgrade is not in progress, then you can remove the lock file ''/usr/local/nagios/var/upgrade.lock''.
 +
 + 
 +==== How do I change the data in ODW? ====
 +If state change data has not made it into the Runtime database, then ODW will not know about these state changes either. If you need to change the state of a service to, say, an OK state, you need to follow this process:
 +  * Identify the service id: ''select id from servicechecks where name='...' and hostname='...' ''
 +  * Change the appropriate values in service_availability_hourly_summary (a sketch is shown after this list)
 +  * Update the service_saved_state table so that the last recorded state is correct. This ensures the right state is calculated for the next hour
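 +
 +As a sketch (the servicecheck id of 123 and the hour are assumed example values; the columns are those documented for service_availability_hourly_summary above), reclassifying one hour of a critical outage as OK might look like:
 +<code>
 +update service_availability_hourly_summary
 +set seconds_ok = 3600,
 + seconds_not_ok = 0,
 + seconds_not_ok_hard = 0,
 + seconds_critical = 0,
 + seconds_critical_hard = 0
 +where servicecheck = 123
 +and start_datetime = '2014-09-09 10:00:00'
 +</code>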