Interpreting test failure rates can be tricky

Robert Hayes
Published: September 14, 2013

If you were to rely on a test that was 95% accurate, you might think it a reasonably good test. The figure echoes the 95% confidence level commonly used by statisticians, engineers and scientists worldwide, but interpreting the result may leave you a bit snookered. To start with, 95% may or may not address all failure modes of a test. Take, for example, a medical test for a specific undesirable condition. If that test is 95% accurate, that may only mean that if you have the condition, then in 95 out of 100 such cases the test will correctly indicate that you have it. Missing the correct indication in the other 5% of such cases is known as a false negative outcome. On the other hand, one must ask what the probability is that the test will say you have the condition when in truth you don’t. This latter possibility is known as a false positive outcome. Some tests can be run independently a second time to improve their reliability. Others, if they give a wrong answer, may be more likely to give the same wrong answer when run again, and for those kinds of tests a redo would not be an independent check.

Just for simplicity, let’s assume that a test advertised as being 95% accurate has a 5% false negative rate and a 5% false positive rate (again, this means that in 5% of the cases where you do have the condition it will say you don’t, and in 5% of the cases where you don’t have the condition it will say you do). Even so, interpretation can be difficult: if there remains a possibility that the test is wrong, how do you interpret any single result?

To provide an example, consider a case where the condition is known to occur in 1% of the cases to be tested. If 10,000 tests are administered, the condition is present in only 100 cases and absent in the other 9,900. With a 5% false positive rate, 495 (roughly 500) of those 9,900 tests will give a positive indication when in fact the answer should be negative. Similarly, with a 5% false negative rate, 5 of the 100 true positives can be expected to give a negative test result when in fact the condition is present, leaving 95 correctly identified.
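The arithmetic above can be checked with a short script. This is only a sketch under the article's stated assumptions (10,000 tests, 1% incidence, 5% false positive and 5% false negative rates); the function name `expected_counts` and its parameter names are made up for illustration.

```python
def expected_counts(n_tests, incidence, false_positive_rate, false_negative_rate):
    """Expected number of tests falling into each of the four outcomes."""
    have_condition = n_tests * incidence
    lack_condition = n_tests - have_condition
    false_negatives = have_condition * false_negative_rate
    false_positives = lack_condition * false_positive_rate
    return {
        "true_positives": have_condition - false_negatives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        "true_negatives": lack_condition - false_positives,
    }


# Hypothetical inputs taken from the example in the text.
counts = expected_counts(10_000, 0.01, 0.05, 0.05)

# Share of positive results that are actually correct.
all_positives = counts["true_positives"] + counts["false_positives"]
fraction_correct = counts["true_positives"] / all_positives

for outcome, count in counts.items():
    print(f"{outcome}: {count:.0f}")
print(f"positive results that are correct: {fraction_correct:.1%}")
```

Under these assumptions, only about 16% of the positive results (95 of 590) correspond to a real case of the condition.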

This contrived situation resulted in roughly 5 times as many false positives as there were actual positives, and yet even among the true positives, 5% were incorrectly evaluated as negatives! How then does the statistician, scientist, engineer, clinician, etc. decide how to use this result? A test that is 95% accurate is actually a high-quality test and can be considered very reliable. What practitioners often do is use additional information to make the test far more reliable. Independent indicators, for example, can change the probabilities drastically. If the test is only applied to cases already known to be likely to have the condition being tested for, then the test is no longer being applied to a random population sample. It would be like saying you are testing only 100 cases which were already expected to have a 95% incidence of a positive result. In that setting the flood of false positives seen previously almost completely disappears: only about 5 of the 100 cases are expected to be truly negative, so a false positive becomes rare and a positive result is very likely correct. False negatives still occur in about 5% of the true positives, but in many (not all) cases the false conclusions can be considered negligible in such instances.

If the test represents quality control measurements for consumer products, such error rates can be acceptable for those things which have no effect on public safety but would almost certainly not be acceptable if people’s lives depended on a correct result. If the test is of the strength of materials, for example, a single test will probably be acceptable for the metal in paper clips but certainly not for the metal in a suspension bridge or the structural girders of a multistory building.
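How much additional information changes the meaning of a positive result can be sketched with Bayes' rule. The snippet below is an illustration, not a method from the article: it assumes the same 5% false positive and 5% false negative rates, and the function name `chance_condition_is_present` is hypothetical. It compares a single positive in a random sample with 1% incidence, a second positive from a truly independent repeat, and a single positive in a group pre-selected to have a 95% incidence.

```python
def chance_condition_is_present(prior, false_positive_rate=0.05, false_negative_rate=0.05):
    """Bayes' rule: probability the condition is present given a positive result."""
    true_positive = prior * (1 - false_negative_rate)
    false_positive = (1 - prior) * false_positive_rate
    return true_positive / (true_positive + false_positive)


# Random population sample: 1% incidence, one positive result.
after_one_test = chance_condition_is_present(0.01)

# A second positive from a truly independent repeat of the test
# (assumption: the two results share no common cause of error).
after_two_tests = chance_condition_is_present(after_one_test)

# Pre-selected group already expected to have a 95% incidence.
preselected = chance_condition_is_present(0.95)

print(f"one positive, 1% incidence:  {after_one_test:.1%}")
print(f"two independent positives:   {after_two_tests:.1%}")
print(f"one positive, 95% incidence: {preselected:.1%}")
```

With these assumed rates, a positive result is correct roughly 16% of the time in the random sample, about 78% of the time after two independent positives, and well over 99% of the time in the pre-selected group. The sketch covers positive results only; it says nothing about how often a negative result is wrong.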

This does not make the situation simpler, but it does provide a grading scale for interpreting test results. By knowing how critical a correct answer is, additional rigor can be applied to attain higher confidence in the accuracy of the test. In a simplified picture you can ask: is it OK to accept a test result if it is wrong some percentage of the time? DNA testing for crimes is a good example, and most people expect false positive probabilities in the millionths of a percent, but not all tests carried out in industry really need that much confidence. Generally, much like capitalism, the product rises to meet the demand.