By David Harley
There has been a certain amount of excitement and irritation in anti-virus research circles about a not-very-good comparative test of antivirus scanners that was conducted at LinuxWorld on 8th August, 2007. I was so exercised personally that I sat down and wrote a long white paper (free, gratis and unpaid by anyone) on Untangling the Wheat from the Chaff in Comparative Anti-Virus Reviews.
Here, though, is a less irascible summary that might give you some pointers on assessing how good a comparative test is likely to be.
1) Small sample sets (the Untangle test used 18 discrete samples) tell you how a given scanner performed against a small set of presumed malware, in a specific "snapshot" context:
* according to the testing conditions
* according to the way the scanner was configured for the test

They tell you nothing about how the scanner will perform against any other sample set. If you want to test detection of In the Wild (ItW) viruses meaningfully, you have to use a full and fully validated test set that meets an acceptable definition of "In the Wild," not a few objects that may be viruses and may be ItW. How many is a full set? Well, the current WildList at the time of writing consists of 525 viruses on the main list (1,958 if you count the supplementary list). See the WildList website for an explanation of how these lists work, and Sarah Gordon's article "What is Wild?" for a consideration of what we mean by ItW.
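To get a feel for how little a handful of samples can tell you, here is a back-of-the-envelope sketch of my own (it is an illustration, not part of any formal testing methodology): a simple Wilson score interval around an observed detection rate shows how wide the plausible range of "true" performance remains when the sample set is tiny.

    import math

    def wilson_interval(detected, total, z=1.96):
        # Wilson score interval: an approximate 95% confidence range
        # for the underlying detection rate behind an observed result.
        p = detected / total
        denom = 1 + z * z / total
        centre = (p + z * z / (2 * total)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
        return max(0.0, centre - half), min(1.0, centre + half)

    # 17 of 18 samples detected looks like "94%"...
    print("17/18 -> %.0f%% to %.0f%%" % tuple(100 * x for x in wilson_interval(17, 18)))
    # ...but the same observed rate over ~2,000 validated samples is far more informative.
    print("1849/1958 -> %.0f%% to %.0f%%" % tuple(100 * x for x in wilson_interval(1849, 1958)))

With 18 samples, the plausible range runs from roughly 74% to 99%; with nearly 2,000, it narrows to a couple of percentage points. No malware expertise is needed to see that an 18-sample "score" barely constrains how the scanner would fare against a realistic corpus.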
Why should you believe what the WildList Organization tells you? Well, there are problems. WLO only tracks viruses (at the moment: that is changing), and the list is always months out of date, because of the time it takes to process and cross-match samples, and so on. But that's the point. WildCore, the WLO collection, gives a tester a sound, pre-verified collection to work from (though testers with access to that collection are still expected to generate and validate their own samples from it, not just run it against some scanners). It doesn't, and can't, include all the viruses (let alone other malware) currently in the wild in a less technical sense, but it does give you a dependable baseline for a valid sample set. Of course, professional testing organizations don't necessarily only test detection of malware in the wild. They may also test for zoo viruses and other malware that isn't known to be in the wild. This is important, because they cannot assume that a customer will never need to detect or protect against these. They may also test heuristic detection, time to update, and some forms of usability testing, but I won't go into detail on these interesting but complicated methodologies on this occasion.
2) Unvalidated samples invalidate a test that uses them. A collection needs care and maintenance, as well as a significant test corpus, and sound validation is a critical factor in professional testing. Assuming that samples from blackhat web sites or your own mailbox are (a) malware and (b) specific malware variants because your favourite scanner tells you so is not validation, and offers no protection against false rejections (false positives). If that scanner happens to be one of the scanners you're testing, you introduce an unacceptable degree of bias in favour of that product.
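Proper validation needs analyst expertise, and nothing below substitutes for it. But even the purely mechanical side of sample management is often skipped. As a minimal sketch (my own illustration, assuming a local "samples" directory), a tester can at least publish a cryptographic manifest of exactly what was in the test set, so that others can check and reproduce it:

    import hashlib
    from pathlib import Path

    def manifest(sample_dir):
        # Record a SHA-256 digest for every file in the sample directory,
        # so the exact corpus used in a test can be published and re-checked later.
        entries = {}
        for path in sorted(Path(sample_dir).rglob("*")):
            if path.is_file():
                entries[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
        return entries

    for name, digest in manifest("samples").items():
        print(digest, name)

A manifest like this doesn't tell you whether the samples are genuine, working malware, but it does make it possible for someone who knows what they're doing to find out.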
3) While a voluntary community resource can make a significant contribution to the common weal, even in security (ClamAV, Snort, and so on), it can't match a full-strength industrial solution in all respects (contractual support, for example). When people find a positive attribute in an object, such as a $0 price tag, they're tempted to overestimate its other positive attributes and capabilities (this is sometimes referred to as the "halo effect"). That's understandable, but it has no place in a rigorous testing program, and to be less than rigorous when you're making recommendations that affect the security and well-being of others is reprehensible.
4) Other concepts you should be aware of are ultracrepidarianism and False Authority Syndrome, which can be informally defined as a tendency for those with a platform to speak from to overestimate their own competence in subjects in which they have no specialist expertise. When looking at a test, you are advised to take into account the expertise and experience of the individual conducting the test. The widespread popular distrust of the anti-virus community extends not only to attributing malicious behaviour to AV vendors ("they write the viruses") but to assuming their essential incompetence. Strangely enough, there are some pretty bright people in anti-virus research. Scepticism is healthy, but apply it to people outside that community, not just those within it! As a rule of thumb, if you think that anyone can do a comparative test, that suggests that you don't know what the issues are. One of the politer comments directed at me after I published the paper was "You don't need to be a cook to tell if a meal tastes good." Perfectly true. But you do need to know something about nutrition to know whether something is good for you, however good it tastes.
5) In the Wild is a pretty fluid concept. In fact, it's not altogether meaningful at all in these days when worms that spread fast and far are in decline, and malware is distributed in short bursts of many, many variants. Come to that, viruses (and worms) are much less of an issue than they were. Anti-virus isn't restricted to that arena (though you may have been told otherwise by instant experts), but it can't be as effective in all areas as it was when viruses were public enemy number one. On the other hand, no solution is totally effective in all areas. The AV research community is (slowly and painfully) coming to terms with the fact that the test landscape has to change. Mistrust any test that doesn't even recognize that the problems exist.
6) There are a whole raft of problems connected with the types of object used for testing by non-professional testers: non-viral test files, garbage files from poorly maintained collections, unvalidated samples from malware generators, simulated viruses and virus fragments, and so on.
* If you don't know anything about the test set, assume the worst.
* If you don't know where it came from, mistrust it.
* If you don't know how or if it was validated, mistrust it.
* If you suspect that it came from one of the vendors under test, or that the only validation carried out was to identify samples with one of the scanners being tested, mistrust it. With extreme prejudice.
* If you don't know what the samples were or how many, mistrust them.
* If you're offered the chance to test the same set for yourself, be aware that unless you're an expert on malware and testing, or have reliable contacts in the community who can do it for you, you'll probably reproduce faulty methodology, so the results will be compromised.
7) Sites like VirusTotal are not intended to conduct any sort of comparative testing; they're for trying to identify a possibly malicious object at a given moment in time. Unless you know exactly what you're doing, any results you get from such a site are useless for testing purposes, and if you ask the guys who run these sites, they'll usually agree that comparative detection testing is an inappropriate use of the facility.
8) The EICAR test file is not a virus, and doesn't belong in a virus sample test set. It's perfectly reasonable to test how scanners process the EICAR test file, but the fact that a scanner recognizes it doesn't prove that test-bed scanner apps have been configured properly.
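For the record, the EICAR test file is nothing more than the 68-byte ASCII string below, published precisely so that people don't have to handle live malware just to check that a scanner is installed and watching the right places. Here is a minimal sketch of writing it out (my own example; expect any resident scanner to intercept or quarantine the file the moment it's created):

    # The standard 68-byte EICAR test string (plain ASCII, no malicious code).
    EICAR = (
        r"X5O!P%@AP[4\PZX54(P^)7CC)7}"
        r"$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*"
    )

    # Detecting this confirms only that a scanner is running and recognizes the
    # test string; it says nothing about how well the scanner detects real malware.
    with open("eicar.com", "w") as handle:
        handle.write(EICAR)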
9) Your test-bed apps have to be similar in functionality and platform, and should be configured carefully so that no single product has an unfair advantage. One particularly memorable example some years ago was a test (nothing to do with Untangle) that reviewed several known-virus scanners and a single generic application. The latter was given the "editor's choice" accolade because it stopped the entire test set from executing. This sounds fair enough unless you realize that many people and organizations still prefer virus-specific detection because generic products can't distinguish between real threats and "innocent" objects: this usually means that either all objects are blocked (executable email attachments, for instance) or else that the end user has to make the decision about whether an object is malicious, which, for most people, defeats the object. By failing to acknowledge this issue, that particular tester invalidated his conclusions by imposing his own preferences on what should have been an impartial decision. In other words, apples and oranges look nice in the same fruit bowl, but you need to know the difference between them before you choose one to eat. You also need to be clear about what it is you're testing.
The Untangle test was presented as a test of known viruses in the wild. In fact, because of the methodology used, it effectively tried to test several things at once. There seems to have been no separation between desktop scanners, appliances or gateway scanners, or between platforms, or between command-line and GUI interfaces. The tester failed to recognize that he was actually trying to conduct four tests at once: recognition of the EICAR test file, recognition of presumed "wild" malware, recognition of presumed "zoo" malware (known, but not necessarily in the wild), and recognition of unknown presumed malware (essentially a test of heuristic detection). Even if he'd got everything else right, the test would have been let down by the muddled targeting.
10) There are reasons why some tests are generally considered valid by the AV community. Some of those reasons may be self-serving - the vendor community is notoriously conservative - but they do derive from a very real need to implement a stringent and impartial baseline set of methodologies. Unfortunately, to do so requires considerable investment of time and expertise, and that's expensive. (That's one of the reasons that many first-class tests are not available to all-comers.) To understand what makes a test valid, look at the sites listed below, find out how they conduct tests, and learn from it. You don't have to accept everything they (or I) say, but you'll be in a better position to assess comparative reviews in the future.
None of the following sites has the universal, unquestioning approbation of the entire anti-virus research community, but they are taken seriously:
The paper I mentioned earlier includes a number of references and further reading resources, for those who want to know more about this difficult but fascinating area.