Configuring SNMP Trap processing

Introduction

The SNMP Trap mechanism allows devices on a network to send information back to a management host using the SNMP protocol. This is particularly useful on large networks as it can be used in place of active SNMP (where a device is polled for status information), thus reducing network load. Devices can usually be configured to send specific types of trap such as link status changes, BGP, HSRP and many others, making this a flexible monitoring option. How to configure specific devices to send traps is beyond the scope of this document, so it is assumed that at least one device is already configured to send traps to the Opsview host.

Design

Each service check configured to accept SNMP traps has an ordered list of rules. Each rule is evaluated in turn. If a rule is false, then the next rule is evaluated. If a rule matches as true, the specified action is taken and no more rules for that service check are evaluated.

An action could be either:

  • submit a passive check result to Opsview with an appropriate message, or
  • do nothing, and thus stop processing of any further rules

If a trap falls off the end of all rules because all of the rules were false, then this trap becomes an exception and will appear on the SNMP Trap Exceptions page. This is required so that an administrator is aware that the rules need tuning to cater for this particular trap.
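The evaluation order described above can be sketched as follows. This is a minimal illustration in Python, not Opsview's implementation: the `Rule` type, the `evaluate_service_check` function and the trap dictionary layout are hypothetical names chosen for the example, and real rules are Perl snippets rather than Python callables.

```python
from dataclasses import dataclass
from typing import Callable, Optional


# Hypothetical types for illustration only; Opsview's real rules are Perl snippets.
@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # returns True when the trap matches
    action: str                      # "alert" or "stop"
    level: Optional[str] = None      # e.g. "WARNING"; only used for alerts
    message: str = ""


def evaluate_service_check(rules, trap):
    """Evaluate rules in order; the first matching rule decides the outcome.

    Returns ("alert", level, message), ("stop", None, None), or
    ("exception", None, None) when no rule matched.
    """
    for rule in rules:
        if rule.matches(trap):
            if rule.action == "alert":
                return ("alert", rule.level, rule.message)
            return ("stop", None, None)  # matched, but deliberately do nothing
    return ("exception", None, None)     # fell off the end of the rule list


rules = [
    Rule("Link down", lambda t: t["TRAPNAME"] == "IF-MIB::linkDown",
         "alert", "WARNING", "Link is down"),
    Rule("Ignore link up", lambda t: t["TRAPNAME"] == "IF-MIB::linkUp", "stop"),
]

print(evaluate_service_check(rules, {"TRAPNAME": "IF-MIB::linkDown"}))
print(evaluate_service_check(rules, {"TRAPNAME": "SNMPv2-MIB::coldStart"}))
```

The second call matches neither rule, so in this sketch it is reported as an exception, which corresponds to the trap appearing on the SNMP Trap Exceptions page.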

When a trap is received, it contains information about the source IP. This is associated with a host. A host can have more than one SNMP trap service check defined. In this case, each service check is evaluated independently of the others. The figure below shows a trap being evaluated in four service checks, represented as columns.

These columns are not ordered, so there is no guarantee which service check column will be evaluated first.

A consequence of multiple service checks is that a single trap could raise multiple alerts to Opsview. However, there will only ever be one SNMP Trap Exception.

One example of using multiple service checks is if you wanted a service check to show interface status, with another service check alerting on error log messages.
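A rough sketch of how one trap might be evaluated against several service checks is given below. It is an illustration only: the function names, the dictionary layout and the `EXAMPLE-MIB::logMessage` trap name are hypothetical, and the sketch assumes that a single exception is recorded when at least one service check fails to match.

```python
def first_match(rules, trap):
    """Return the first rule whose predicate is true, or None."""
    for rule in rules:
        if rule["matches"](trap):
            return rule
    return None


def process_trap(service_checks, trap):
    """Evaluate every service check independently of the others.

    Returns (alerts, is_exception). At most one exception is flagged,
    however many service checks failed to match.
    """
    alerts = []
    is_exception = False
    for check_name, rules in service_checks.items():
        rule = first_match(rules, trap)
        if rule is None:
            is_exception = True  # still only one exception per trap
        elif rule["action"] == "alert":
            alerts.append((check_name, rule["level"]))
    return alerts, is_exception


# Hypothetical service checks; the names and trap names are examples only.
service_checks = {
    "Interface status": [
        {"matches": lambda t: t["TRAPNAME"] in ("IF-MIB::linkUp", "IF-MIB::linkDown"),
         "action": "alert", "level": "WARNING"},
    ],
    "Uplink watch": [
        {"matches": lambda t: t["TRAPNAME"] == "IF-MIB::linkDown",
         "action": "alert", "level": "CRITICAL"},
    ],
    "Error log messages": [
        {"matches": lambda t: t["TRAPNAME"] == "EXAMPLE-MIB::logMessage",
         "action": "alert", "level": "CRITICAL"},
    ],
}

alerts, is_exception = process_trap(service_checks, {"TRAPNAME": "IF-MIB::linkDown"})
print(alerts, is_exception)
```

Here a single linkDown trap raises two alerts, and because the third service check has no matching rule, the trap is also flagged once as an exception.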

Note: Traps received will have the SNMP community value hidden so that passwords are not stored on the filesystem.

Creating a Service Check (Basic)

To demonstrate the best way to use the SNMP Trap processing support in Opsview, we'll create a service check that generates alerts for a Cisco router when a link state change is detected.

(Image 2)

Firstly, create a new service check with Service Checks → Create new service check and configure it as shown in Image 2. In particular, select SNMP trap under Type of Check. Currently it is necessary to save the service check before it is possible to edit the trap matching rules.


Once saved, you will be taken to the rules screen for that service check, which should look similar to that displayed in Image 3.

(Image 3)

Choose Create New Rule in Options to create a new rule.

You can also get to this rules list view by clicking on the Edit SNMP Trap Rules: X link on the service checks list page.


The 'Name' field is functionally insignificant, so it should simply be set such that its purpose is clear if the list of rules is later reviewed. The following field accepts snippets of Perl code, which are used to match the contents of a trap and hence trigger the action. Image 4 shows the configuration used to match the link status change that we are attempting to monitor in this instance.

(Image 4)

The rule will match when the name of the trap (in the variable ${TRAPNAME}) is set to either “IF-MIB::linkUp” or “IF-MIB::linkDown”, since the code must return 'true' in order for the trap to be considered a match.


Another important field is the message field, as this determines the text of the alert that will be generated if the rule matches. The macros listed below the field allow us to create a message which contains useful and specific information about the event that has occurred. The syntax ${Px} and ${Vx} allows you to refer to specific lines in the trap contents, and we can use the information obtained from the SNMP Trap Exceptions page to find out what is contained on each line of the trap (image 1). Here we have selected line 6 – the name of the interface, and line 8 – the new state of that interface, in order to create a useful message.
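The macro substitution can be pictured with the short sketch below. It is not Opsview's code: `expand_macros` is a hypothetical helper, and it assumes that ${Pn} stands for the parameter name on line n, ${Vn} for the value on line n, and that lines are counted from 1. The sample trap lines are made up for the example, so the line numbers differ from those used in the rule above.

```python
import re


def expand_macros(template, trapname, lines):
    """Replace ${TRAPNAME}, ${Pn} and ${Vn} in a message template.

    `lines` is a list of (name, value) pairs; n counts from 1 (an assumption).
    Unknown macros are left untouched so that mistakes stay visible.
    """
    def substitute(match):
        macro = match.group(1)
        if macro == "TRAPNAME":
            return trapname
        numbered = re.fullmatch(r"([PV])(\d+)", macro)
        if numbered:
            index = int(numbered.group(2)) - 1
            if 0 <= index < len(lines):
                name, value = lines[index]
                return name if numbered.group(1) == "P" else value
        return match.group(0)  # leave anything unrecognised as-is

    return re.sub(r"\$\{([^}]+)\}", substitute, template)


# Hypothetical trap contents, shortened for the example.
lines = [
    ("IF-MIB::ifIndex.1", "1"),
    ("IF-MIB::ifDescr.1", "Serial0/0"),
    ("OLD-CISCO-INTERFACES-MIB::locIfReason.1", "administratively down"),
]

message = expand_macros(
    'Interface ${V2} has changed state to "${V3}"', "IF-MIB::linkDown", lines
)
print(message)
```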


Once the rule has been configured and the changes submitted, the service check can then be applied to a host so that alerts will be generated if this service check's rules match a trap received from that host. Finally the configuration needs to be saved and Opsview reloaded from the Status and Reload page.


After applying the new service check to a router called TestRouter, and setting the interface Serial0/0 to 'down', the following service alert is generated:

[13-11-2006 16:44:47] SERVICE ALERT: TestRouter;Cisco - Link State Change;WARNING;HARD;1;Interface Serial0/0 has changed state to "administratively down"

Creating a Service Check (Advanced)

Often a more complex set of requirements exists, which makes it necessary to use more than one rule to establish whether or not an SNMP trap should match a service check and generate an alert. Similarly, it may also be a requirement to generate different levels of alerts depending on the contents of the trap, such as a warning for a normal switchport changing state to 'down' but a critical alert for an uplink port doing the same thing. Equally, some ports may be used temporarily by engineers working on site and these might be ignored entirely.


This particular example is illustrated in the rules given below, which would be entered in a service check and applied to a host. We'll assume that interfaces 1 and 2 on a given switch are gigabit uplink ports, interfaces 3 to 12 (inclusive) are permanently connected to nodes, and interfaces 13-26 are available for use by visiting engineers.


Field         Value
Name          Uplink ports are critical
Rule          "${TRAPNAME}" eq "IF-MIB::linkDown" && "${IF-MIB::ifIndex}" < 3
Action        Send Alert
Alert Level   Critical
Message       Uplink port ${V6} has changed state to ${V8}

Note that here we test for IF-MIB::linkDown, and it would be fair to assume that the message need not use the variable ${V8} to obtain the word 'down' and that this could instead be written as text. The variable is used, however, because Cisco devices (and others) report 'administratively down' rather than simply 'down' if the interface has been taken down by a configuration action, which could be helpful to someone diagnosing the cause.


Field         Value
Name          Normally-up switchports raise warnings
Rule          "${TRAPNAME}" eq "IF-MIB::linkDown" && "${IF-MIB::ifIndex}" < 13
Action        Send Alert
Alert Level   Warning
Message       Switchport ${V6} has changed state to ${V8}

Here we are performing a check for an interface index less than 13 as these are ports for which warnings should be raised. Note that we don't need to specify that the interface number must also be greater than 2. This is because the rules are matched in order for a service check, and the previous rule (Rule 1) would have matched if the interface number was 1 or 2.


At this point it would appear that the original requirements for the service check have been satisfied, as no alerts will be raised for the remaining switchports (> 12). However, this is not the case. If a received trap does not match a rule in the service check, it will 'fall through' and appear in the SNMP Trap Exceptions page. To prevent this from happening, another rule is used to create a match which has no resulting alert associated with it.


Field         Value
Name          Ignore engineer switchports
Rule          "${TRAPNAME}" eq "IF-MIB::linkDown" && "${IF-MIB::ifIndex}" > 12
Action        Stop Processing
Alert Level   N/A
Message       N/A

This should match the linkDown traps for all remaining interfaces, which were not caught by the first two rules.
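The combined effect of the three rules can be checked with a small sketch. The Python below only mirrors the logic of the Perl rules shown above; `classify_link_down` is a hypothetical name, and the outcome labels are chosen for the example.

```python
def classify_link_down(trapname, if_index):
    """Mirror the three ordered rules for one service check.

    Returns "CRITICAL", "WARNING", "ignored", or "exception" when no rule matched.
    """
    rules = [
        # Rule 1: uplink ports (interfaces 1 and 2)
        ("CRITICAL", lambda: trapname == "IF-MIB::linkDown" and if_index < 3),
        # Rule 2: permanently connected ports; 1 and 2 were already caught above
        ("WARNING", lambda: trapname == "IF-MIB::linkDown" and if_index < 13),
        # Rule 3: engineer ports; match but raise no alert
        ("ignored", lambda: trapname == "IF-MIB::linkDown" and if_index > 12),
    ]
    for outcome, matches in rules:
        if matches():
            return outcome  # first match wins, later rules are not evaluated
    return "exception"


for index in (1, 5, 20):
    print(index, classify_link_down("IF-MIB::linkDown", index))
print(classify_link_down("IF-MIB::linkUp", 1))
```

Swapping the first two rules in this sketch would make interfaces 1 and 2 return WARNING, which is why the order matters. Note also that a linkUp trap matches none of the three rules, since each one tests for IF-MIB::linkDown, so it would still be reported as an exception.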


As the order of the rules is crucial, the page with the list of rules lets you click on Reorder and then drag and drop the rules to change the order in which they are executed.

SNMP Trap Exceptions

SNMP traps that are received by Opsview but do not match a rule in a service check column appear in the SNMP Trap Exceptions page. This is useful because it displays the contents of the traps that have been received: to create an appropriate rule to match a trap, it is necessary to know what is inside it. The image below shows a trap which was generated by a Cisco router when its serial link changed state to 'up'.

The name of the trap is displayed here, as well as the reason that the trap was not processed and the full contents of the trap itself.

Possible reasons for traps not getting matched successfully are:

No trapname

The trap is not complete or cannot be recognised.

Trap name not fully translated

The trap has not been fully translated - this will appear if the trapname contains more than 1 digit at the end. It usually means that the MIB file has not been loaded for this device. You will need to manually add your device specific MIB file into Opsview. Locate the file from your device vendor and copy it into /usr/local/nagios/snmp/mib/load.

You will need to restart the snmpd daemon when the MIB has been copied so that it is live. Run as the nagios user:

/usr/local/nagios/bin/snmpd reload

If you use slaves, then you will need to send the MIBs to all slaves. As the nagios user, run:

/usr/local/nagios/bin/send2slaves -s

It is possible that various portions of an SNMP trap will have untranslated OID values. Opsview will not automatically mark those as exceptions because it could be a valid value (for example, some devices encode IP addresses into the OID section).

Not monitoring this IP

A trap has been received from a host not currently being monitored - this can be because the device has not been configured for monitoring (i.e. no host configuration or no SNMP trap service checks assigned to the host) or the device that sent the trap used a secondary IP address (and it is not configured in the 'Other Interfaces' section on the Host Configuration page).
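The lookup described above can be illustrated with the sketch below. It is an assumption-laden model, not the contents of Opsview's configuration: the host dictionaries, the `other_addresses` and `trap_checks` fields, and the secondary address 192.168.10.21 are all hypothetical.

```python
def build_ip_lookup(hosts):
    """Map every known IP address to a host name.

    Only hosts with at least one SNMP trap service check are included,
    and secondary addresses are registered alongside the primary one.
    """
    lookup = {}
    for host in hosts:
        if not host.get("trap_checks"):
            continue  # no SNMP trap service checks assigned to this host
        for ip in [host["ip"], *host.get("other_addresses", [])]:
            lookup[ip] = host["name"]
    return lookup


def host_for_trap(lookup, source_ip):
    """Return the host name for a trap source, or None if it is not monitored."""
    return lookup.get(source_ip)


# Hypothetical host configuration for the example.
hosts = [
    {"name": "TestRouter", "ip": "192.168.10.20",
     "other_addresses": ["192.168.10.21"],
     "trap_checks": ["Cisco - Link State Change"]},
    {"name": "NoTrapChecks", "ip": "192.168.10.30", "trap_checks": []},
]

lookup = build_ip_lookup(hosts)
print(host_for_trap(lookup, "192.168.10.21"))
print(host_for_trap(lookup, "192.168.10.30"))
```

In this model, a trap arriving from an address that is absent from the lookup corresponds to the 'Not monitoring this IP' reason.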

At least 1 servicecheck failed to match a rule

The configured rules have not matched the trap.

Rule failed to evaluate - check log file

There is an error within a rule related to this hostname. Check the log file /usr/local/nagios/var/log/snmptrap2nagios.log for the exact error.

Reprocessing

If you have amended the rules for trap processing and want to see if traps are no longer considered exceptions, you can hit the Re-process button. That will cause all the traps in the exception list to be re-evaluated using the current configuration. However, there are three main differences from normal trap processing:

  • If the rules define an alert to occur through Nagios® Core, these will not be raised
  • The traps will not be traced (because this could loop around)
  • Any traps with a reason of Trapname not fully translated will be deleted as it is not possible to use any new MIBs to get a translation - you will need to get the device to send a new trap before the translation applies

Troubleshooting

To help with diagnostics of SNMP trap rules processing, you can switch on trap debugging. Go to the SNMP Trap summary page, click on the 'Hosts with tracing enable' link and add in the hostname that you want to debug traps for. The Add field will suggest names of hosts with SNMP trap services assigned to them as you type.

You must reload Opsview - any new traps from these hosts will now be traced.

Note: snmptrap2nagios works off IP addresses. It only collects traps for IPs that are known, and are expected to be collected. See the config file /usr/local/nagios/etc/snmptraps.cfg in the $hostip_lookup variable to see if the IP is known. If the IP is not known, then an entry will go into the SNMP Trap Exceptions page (can take 5 minutes to receive).

After a number of traps have been received, revisit this page, select the hostname for the traps to debug and select 'Collect'. This step pulls the debugging information to the master server in a distributed setup and parses the traps received.

A new option will now appear on the 'SNMP Trap Exceptions' page for the host: View debug. Alternatively, you can find all current debugged traps by clicking on the SNMP Trap link on the left hand navigator and following 'SNMP trap debugging rows'.

If you click View debug, you will then get a breakdown of the rules that were evaluated for that packet.

If you click on the rules for a specific trap, you will get a list of the debug output for the rules:

This screen will then show the results of each rule. The Expanded Rule column is the rule, but with all the macros expanded out. This allows you to see the perl code that is run. The Result column shows the result of running that piece of perl code. There are 3 possible results:

  • Match - the perl code executed and returned a true value (any value except 0 or the null string)
  • No match - the perl code executed and returned either 0 or the null string
  • Error - the perl code failed to execute, for example, due to a syntax error

The SNMP trap rule engine considers a No match and an Error to be a rule failure, but it is useful to split these two conditions out for diagnostics.
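The three results can be modelled with the following sketch. It uses Python expressions as stand-ins for the expanded Perl rules, so it only approximates Perl's behaviour; `classify_result` is a hypothetical helper, and `eval` is used purely for illustration on trusted strings.

```python
def classify_result(expanded_rule):
    """Classify an expanded rule as "Match", "No match" or "Error".

    `expanded_rule` is a Python expression standing in for the Perl code
    that Opsview would run after macro expansion.
    """
    try:
        # Illustration only: never eval untrusted input in real code.
        value = eval(expanded_rule, {"__builtins__": {}}, {})
    except Exception:
        return "Error"  # e.g. a syntax error in the rule
    # Approximate Perl truthiness: 0, "0", the empty string and
    # undefined values count as false.
    if value is None or value == 0 or value == "" or value == "0":
        return "No match"
    return "Match"


print(classify_result('"IF-MIB::linkDown" == "IF-MIB::linkDown"'))
print(classify_result('"IF-MIB::linkUp" == "IF-MIB::linkDown"'))
print(classify_result('"IF-MIB::linkDown" =='))
```

Both "No match" and "Error" would cause the rule engine to move on to the next rule; the distinction only matters when diagnosing why a rule did not fire.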

In the situation above, we can see that there were no matches for the first two rules for the Cisco SNMP - Link State Change service check, but the 3rd catch-all matched.


Another advantage of using debugging is that you can use the saved trap for reference when defining your rules. If you click through to the service check and the rules, you will get the debugged trap so you can use that to help write your rule.

No traps received?

If you are not seeing any traps being collected, there may be an error with the interaction between snmptrapd and the Opsview script /usr/local/nagios/bin/snmptrap2nagios.

You can enable a log of all traps in snmptrapd. Add the option -Lf /tmp/snmptrapd.log to snmptrapd to log all traps to the file. Restart snmpd for this to take effect. When traps arrive, you should see something like this in /tmp/snmptrapd.log:

2008-10-23 18:17:09 192.168.10.20(via UDP: [192.168.10.20]:3606) TRAP, SNMP v1, community public
        SNMPv2-SMI::enterprises.7367 Link Up Trap (0) Uptime: 0:21:54.14
        RFC1213-MIB::ifIndex.4 = INTEGER: 4     RFC1213-MIB::ifDescr.4 = STRING: "eth0"

If the traps are being received in the file, check that snmptrap2nagios has been invoked: ls -lu /usr/local/nagios/bin/snmptrap2nagios. The access time shown should match the time of the trap reception in the log file.

Send a test trap to snmptrap2nagios. Put the following into a temporary file, say /tmp/testtrap:

192.168.10.20
192.168.10.20
SNMPv2-MIB::sysUpTime.0 4:20:49:47.73
SNMPv2-MIB::snmpTrapOID.0 SNMPv2-SMI::enterprises.9.9.43.2.0.1
SNMPv2-SMI::enterprises.9.9.43.1.1.6.1.3.45 1
SNMPv2-SMI::enterprises.9.9.43.1.1.6.1.4.45 2
SNMPv2-SMI::enterprises.9.9.43.1.1.6.1.5.45 3
SNMP-COMMUNITY-MIB::snmpTrapAddress.0 192.168.10.20
SNMP-COMMUNITY-MIB::snmpTrapCommunity.0 "*****"

Then send this to snmptrap2nagios:

cat /tmp/testtrap | /usr/local/nagios/bin/snmptrap2nagios -d -e /tmp/snmptrap.exception

Unless you happen to have a configuration with this IP address, you should get an exception file in /tmp/snmptrap.exception. This will have the contents of:

1224785290
5
192.168.10.20
192.168.10.20
SNMPv2-MIB::sysUpTime.0 4:20:49:47.73
SNMPv2-MIB::snmpTrapOID.0 SNMPv2-SMI::enterprises.9.9.43.2.0.1
SNMPv2-SMI::enterprises.9.9.43.1.1.6.1.3.45 1
SNMPv2-SMI::enterprises.9.9.43.1.1.6.1.4.45 2
SNMPv2-SMI::enterprises.9.9.43.1.1.6.1.5.45 3
SNMP-COMMUNITY-MIB::snmpTrapAddress.0 192.168.10.20
SNMP-COMMUNITY-MIB::snmpTrapCommunity.0 "*****"#---next trap---#

The first line is a timestamp, so could be different. The 2nd line is an error code - 5 means not defined.
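If you want to inspect an exception file programmatically, a parser along the following lines may help. It is a sketch based only on the sample output above: `parse_exceptions` is a hypothetical helper, and it assumes that every record starts with a timestamp line and an error code line and ends with the `#---next trap---#` marker.

```python
SEPARATOR = "#---next trap---#"  # taken from the sample output above


def parse_exceptions(text):
    """Split exception-file text into a list of records.

    Each record holds the timestamp (int), the error code (int) and the
    remaining trap lines as a list of strings.
    """
    records = []
    for chunk in text.split(SEPARATOR):
        lines = [line for line in chunk.splitlines() if line.strip()]
        if len(lines) < 2:
            continue  # skip the empty remainder after the last marker
        records.append({
            "timestamp": int(lines[0]),
            "code": int(lines[1]),
            "trap": lines[2:],
        })
    return records


# Shortened copy of the sample exception shown above.
sample = (
    "1224785290\n"
    "5\n"
    "192.168.10.20\n"
    "192.168.10.20\n"
    "SNMPv2-MIB::sysUpTime.0 4:20:49:47.73\n"
    'SNMP-COMMUNITY-MIB::snmpTrapCommunity.0 "*****"#---next trap---#\n'
)

records = parse_exceptions(sample)
print(records[0]["code"], len(records[0]["trap"]))
```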

If all these pass, then snmptrap2nagios is saving its information correctly.

Distributed SNMP Trap processing

In a distributed setup, the master server could be swamped with SNMP traps from all devices. You can reduce the load by using slave servers:

  • Ensure the monitored device is configured to forward SNMP traps to a slave server
  • Ensure the monitored device is configured in Opsview with the slave server as the server that monitors it
  • In a slave cluster, the device has to send SNMP traps to all the cluster nodes because it won't know which cluster node is the active one for a particular host

Opsview will make sure that the SNMP trap configuration file is sent to the slave on the next reload. Any MIBs promoted via the web interface will also be sent to the slaves.

At 1 minute intervals, the master server will poll the slave servers for any snmptrap exceptions or debug information and then import them for display in the web interface.
