As we saw in our previous article, failing to manage false positives can have a significant impact on the operating costs of an IT system.
The number one cause of false positives: a poor configuration that cannot absorb the usual jolts of an information system.
A few examples:
- A few ping packets get "lost" and put a device into a non-OK state.
- The RTA (round-trip average) of a ping to a device runs slightly too long because of a network overload, which puts the device into a non-OK state.
- A network share stops responding for a few minutes and puts a service into the Unknown state.
- A disk fills up for a few minutes with a backup waiting to be copied to tape. The disk space check goes to Alert or Critical.
Parameters to adjust to limit false positives
Thresholds
First step in limiting false positives: setting the thresholds.
A disk space threshold should not be set the same way for a 200 GB disk as for a 2 TB disk.
If a device sits in a DMZ or is reached over a VPN, the alert thresholds on the RTA must be raised. 10 ms, the default value in ServiceNav, may not be appropriate.
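To illustrate why one percentage does not suit every disk, here is a minimal Python sketch. It is not ServiceNav code: the function name `used_pct_threshold` and the 20 GB free-space reserve are assumptions chosen for the example. It converts "I want at least N GB free" into the used-space percentage at which a check should alert.

```python
def used_pct_threshold(disk_size_gb: float, min_free_gb: float) -> float:
    """Return the used-space percentage at which only min_free_gb remains free."""
    if disk_size_gb <= 0:
        raise ValueError("disk_size_gb must be positive")
    if not 0 <= min_free_gb <= disk_size_gb:
        raise ValueError("min_free_gb must be between 0 and disk_size_gb")
    return 100.0 * (disk_size_gb - min_free_gb) / disk_size_gb


# Hypothetical reserve: we want to be alerted when fewer than 20 GB remain.
for size_gb in (200, 2000):
    pct = used_pct_threshold(size_gb, min_free_gb=20)
    print(f"{size_gb} GB disk: alert at {pct:.1f}% used")
```

With the same 20 GB reserve, the 200 GB disk alerts at 90% used while the 2 TB disk alerts at 99% used. A single 90% threshold applied to both would fire on the large disk while 200 GB are still free.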
Second step in reducing false positives: additional checks.
What could be more frustrating than receiving a DOWN notification for a server, rushing to it, and, in the time it takes to connect, receiving an UP notification for the same server? All this because 2 ping packets got "lost" on the network.
The point of additional checks is therefore to ask ServiceNav to verify X times, at Y-minute intervals, whether the Alert/Critical/Unknown state still holds before putting the item (device or service) into a non-OK state and launching the complete alert processing chain.
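The confirmation logic described above can be sketched in a few lines of Python. This is an illustration under assumptions, not ServiceNav's implementation: the class `ConfirmedState` and its fields are hypothetical names, and the Y-minute scheduling between checks is left out so that only the counting is shown.

```python
from dataclasses import dataclass


@dataclass
class ConfirmedState:
    """Confirm a non-OK result with extra checks before alerting (hypothetical model)."""
    extra_checks: int        # X: additional checks required before alerting
    bad_streak: int = 0      # consecutive non-OK results seen so far
    alerting: bool = False   # True once the non-OK state has been confirmed

    def record(self, is_ok: bool) -> bool:
        """Feed one check result; return True only when a new alert should fire."""
        if is_ok:
            # Any OK result cancels the pending confirmation.
            self.bad_streak = 0
            self.alerting = False
            return False
        self.bad_streak += 1
        # The first failure plus extra_checks confirmations are all required.
        if not self.alerting and self.bad_streak > self.extra_checks:
            self.alerting = True
            return True
        return False


# Two "lost" pings followed by a recovery: no alert is raised.
blip = ConfirmedState(extra_checks=3)
print([blip.record(ok) for ok in (False, False, True)])

# A sustained failure: the alert fires on the third additional check.
outage = ConfirmedState(extra_checks=3)
print([outage.record(False) for _ in range(4)])
```

The first run never alerts because the recovery arrives before the confirmation count is reached. The second run alerts exactly once, on the fourth consecutive non-OK result.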
Here we monitor the RTA of a ping and alert if it exceeds the critical threshold (in red).
As can be seen, the RTA regularly exceeds the threshold for a few moments, then quickly returns to normal. The connection between the ServiceNavBox and the remote equipment probably needs work, but there is no need to open a ticket every time the check goes Critical.
We therefore decided to configure 3 additional checks at 1-minute intervals for this equipment. In other words, the RTA of the equipment must stay above the threshold for 3 minutes in a row before alerts are raised and the item appears on the operating dashboards.
Result: a single switch to Critical (the "hole" around 4 pm) instead of several dozen.
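The before/after effect can be reproduced on a made-up series. In the sketch below, the RTA samples, the 100 ms critical threshold, and the function name `count_alerts` are all assumptions for illustration, not data from the graph above. It also simplifies by treating every sample as one minute apart, whether or not a confirmation is in progress.

```python
def count_alerts(rta_ms, critical_ms, extra_checks):
    """Count alerts raised when a breach must survive extra_checks re-checks."""
    alerts = 0
    streak = 0  # consecutive samples above the critical threshold
    for rta in rta_ms:
        if rta > critical_ms:
            streak += 1
            # Alert once per episode: first breach plus extra_checks confirmations.
            if streak == extra_checks + 1:
                alerts += 1
        else:
            streak = 0
    return alerts


# Hypothetical one-minute RTA samples in milliseconds (not real measurements).
rta_samples_ms = [20, 150, 30, 25, 160, 170, 40, 180, 190, 200, 210, 220, 35, 140, 30]

print(count_alerts(rta_samples_ms, critical_ms=100, extra_checks=0))  # 4
print(count_alerts(rta_samples_ms, critical_ms=100, extra_checks=3))  # 1
```

Without additional checks, each of the four excursions above 100 ms raises an alert. With 3 additional checks, only the excursion that lasts long enough to be confirmed does.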
ServiceNav therefore offers simple and effective ways to reduce false positives: adjusting thresholds and configuring additional checks. Given the cost of processing a false positive, it is well worth spending a few minutes adjusting your configuration.
And beyond that, does ServiceNav help me identify the items that need to be handled first? Of course it does! We will cover that in our next article, to be published soon on our blog.