As we saw in our previous article, failing to manage false positives can have a significant impact on the operating costs of an IT estate.
The number one cause of false positives: a poorly tuned configuration exposed to the usual hiccups of an information system.
Some examples:
- A few ping packets get "lost" and a device is reported as being in error.
- The RTA of a device ping is slightly too long because of network overload, and the device is reported as being in error.
- A network share is unresponsive for a few minutes and a service goes to Unknown.
- A backup fills the disk for a few minutes before being copied to tape, and the disk space check service goes to Alert or Critical.
Parameters to be adjusted to limit false positives
Thresholds, thresholds and thresholds
The first step in limiting false positives is setting thresholds.
The threshold for a hard disk should not be set the same way for a 200 GB disk as for a 2 TB disk.
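To make the disk example concrete, here is a minimal Python sketch of what one and the same percentage threshold means in absolute terms on the two disk sizes. The 10% value and the function name are assumptions chosen for illustration, not ServiceNav defaults.

```python
def free_space_at_threshold(disk_size_gb: float, critical_free_pct: float) -> float:
    """Return the free space (in GB) still available when the alert fires."""
    return disk_size_gb * critical_free_pct / 100


# Hypothetical threshold: alert when less than 10% of the disk is free.
for size_gb in (200, 2000):  # 200 GB and 2 TB (decimal units)
    remaining = free_space_at_threshold(size_gb, critical_free_pct=10)
    print(f"{size_gb} GB disk: alert fires with {remaining:.0f} GB still free")
```

With these assumed values, the small disk alerts with 20 GB left while the large one alerts with 200 GB left, which may be far too early. That gap is why the threshold deserves to be set per disk.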
If a device is in a DMZ or is reached over a VPN, the alert thresholds on the RTA must be raised: 10 ms, the default value in ServiceNav, may not be appropriate.
The second step in reducing false positives is additional checks.
What could be more frustrating than receiving a notification that a server is DOWN, rushing to it and, in the time it takes to connect, receiving a notification that the server is UP again? All this because 2 ping packets got "lost" on the network.
The point of additional checks is to ask ServiceNav to check X times, at Y-minute intervals, whether the Alert/Critical/Unknown state still holds before switching the element (equipment or service) to a non-OK state, and thus launching the complete alert-processing chain.
Here we monitor the RTA of a ping and alert if it exceeds the critical threshold (in red).
As can be seen, the RTA regularly exceeds the threshold for a few moments, but then quickly returns to normal. The connection between the ServiceNavBox and the remote equipment probably needs some work, but there is no need to open a ticket every time the check goes to Critical.
We therefore set 3 additional checks at 1-minute intervals for this equipment. In other words, before any alert is raised or displayed on the operations dashboards, the equipment's RTA must stay above the threshold for 3 minutes in a row.
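The confirmation logic described above can be sketched in a few lines of Python. This is an illustration of the principle, not ServiceNav's actual implementation: the function name, the 100 ms threshold and the sample values are all made up, and the sketch assumes that "3 additional checks" means the first failing check plus 3 re-checks, one per minute.

```python
def count_alerts(rta_samples_ms, critical_ms, additional_checks):
    """Count how many times the element enters a confirmed Critical state.

    rta_samples_ms holds one RTA measurement per check (assumed: one per minute).
    """
    alerts = 0
    failures_in_a_row = 0
    for rta in rta_samples_ms:
        if rta > critical_ms:
            failures_in_a_row += 1
            # Alert once, at the moment the last additional check also fails.
            if failures_in_a_row == additional_checks + 1:
                alerts += 1
        else:
            # Back under the threshold: the pending confirmation is dropped.
            failures_in_a_row = 0
    return alerts


# Made-up measurements: three short spikes and one sustained outage.
samples = [20, 150, 30, 25, 180, 160, 40, 30, 200, 210, 190, 220, 230, 35, 170, 20]

print("without additional checks:", count_alerts(samples, 100, 0))
print("with 3 additional checks:", count_alerts(samples, 100, 3))
```

On this sample series, the version without additional checks raises 4 alerts, while the version with 3 additional checks raises only 1, for the sustained outage. The short spikes never survive the confirmation window.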
Result: only 1 transition to Critical (the "hole" at 4 pm) instead of several dozen.
In ServiceNav, there are simple and effective ways to reduce false positives: adjusting thresholds and setting additional checks. Once you know the cost of handling a false positive, it is easy to understand why it is worth spending a few minutes adjusting your configuration.
And does ServiceNav help you identify the items that need to be addressed first? Of course it does! We'll look at that in our next blog post, coming soon.