Users today are offered a choice among many security products, a number of which are sufficient and none of which is perfect.  Alongside these products are myriad product test results and certifications, all there to help you make a better, more informed decision about which product to use.  And as product developers, we’ll point to the tests and reviews that best represent our product. (Like this recent report on the just-released Microsoft Security Essentials Beta, and the most recent AV-Comparatives test showing Windows Live OneCare (OneCare) reaching the vaunted status of Advanced+.)

But are the tests doing what they ought to do?

I would like to take this opportunity to present a case for advancing the methodology of testing security products.

For as long as this industry has existed, product testing has been conducted by throwing huge numbers of malware samples at a product and seeing how much of that malware the product detects.  "Improvement" in testing was measured by increasing the number of samples.  "Comprehensiveness" meant having millions instead of thousands, and covering the many types of threats instead of just lots of malware.  Only recently has consideration of false positives (FPs) finally begun to influence the interpretation of test results.

(An example: it is this consideration of false positives that allowed OneCare to win the latest AV-Comparatives test.  There were two other comparable products, one with a higher detection rate than OneCare and one with the same.  But because they were also among the highest in FPs (more than 15 each), both fell to Advanced. OneCare had only 0-2 false positives, the lowest of all tested products, and it was the only one in that lowest category.)

Because false positives cause unnecessary upheaval that may result in nonfunctioning machines, and because a high detection rate often correlates directly with a propensity to FP, we would like to recognize AV-Comparatives, and all the other testers and certifications that do not blindly judge detection capability without considering false positives.  And hats off to Virus Bulletin for having maintained a no-FP requirement for its VB100 Award for so long.

So, the recognition that false positives are an important consideration in interpreting test results is now becoming standard.  What comes next to make tests more meaningful for the real user?

As I mentioned before, the standard way of testing is to throw lots and lots of malware at the products and report a detection percentage, which is then presented as a measure of the quality of the product.

But does that really represent quality for the average user?  These tests do not simulate the likely scenario on our machines at home or at the office, so how is the result meaningful?  If a product misses 1% more than another, are those 10,000 samples out of a million meaningful to you?  Maybe they are 10,000 distinct samples of a single server-side polymorphic trojan from one site that your browser already warns you not to visit.  Or they might consist mostly of a set of targeted attacks: important to the targeted entity and the products it uses, but to you or me?

How do we fix this?

One of the best advances in the security industry in recent years is our ability to capture telemetry about the malware cases we encounter.  The data associated with malware infections enables us to produce the semiannual Security Intelligence Report.  Selective use of prevalence reports enables us to decide each month how the MSRT can best protect the ecosystem.  Others in the industry use their telemetry to produce reports of their own, as well as free tools to clean up the most prevalent malware affecting the ecosystem.

What we need to do is incorporate this data into the tests.  To accomplish this, the Microsoft Malware Protection Center (that’s us), through the arrangements that give other security vendors access to the malware we collect, has started to also provide normalized prevalence data to those security vendors, to security industry testers, and to the WildList Organization.

Tony Lee manages our collection of malware and its distribution to our partner security vendors who care to participate in the Microsoft Virus Information Alliance (VIA).  He will contribute the next section of this blog…

Malware continues to evolve in its ability to distribute, mutate and update itself at an increasingly fast pace; we are often talking about hours and days here. Malware also targets groups of the population of varying sizes. These infection characteristics pose challenges to AV product testing, in both the demographic and the chronological sense. To meaningfully reflect a product's ability to protect its users, a testing methodology needs an up-to-date and accurate view of the threat landscape.

Through the telemetry collected by our various antimalware products, we are able to observe, in near real time, what is statistically significant about the state of threat activity in the wild. For example, by observing a threat's first-seen and last-seen dates, and its occurrences during various periods of time, we can assess its age, severity and activity trend at both the file and the threat level.

Recently, I established an experimental program to share this prevalence data with our security partners, and we have received very positive feedback and suggestions. At the core of this program is an automated process that monitors noticeable new threat activity as it takes place in the field. The process then aggregates, analyzes and publishes this data to security partners over an encrypted channel, on a daily basis. Recipients of this information can assimilate the data over time and construct a view similar to the example below:

SHA1: 18375FD78CDE1E1B7291FBC37831CB36013895FD
MD5: 9FFCA5614A1032B0709ECAB67DF10F49
Total Reports: 17,052
File Size: 96,047
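
Purely as an illustration, here is a minimal sketch in Python of how a recipient might fold each day's records into such a running view. The field and function names here are hypothetical; this is not the actual feed format.

from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional

@dataclass
class DailyRecord:
    """One day's prevalence report for one file (hypothetical fields)."""
    sha1: str
    md5: str
    reports: int          # "Total Reports" observed that day
    file_size: int        # in bytes
    seen_on: date

@dataclass
class FileView:
    """The view a recipient accumulates per file, keyed by SHA1."""
    sha1: str
    md5: str
    file_size: int
    total_reports: int = 0
    first_seen: Optional[date] = None
    last_seen: Optional[date] = None

def assimilate(views: Dict[str, FileView], rec: DailyRecord) -> None:
    # Fold one daily record into the running view for that file.
    view = views.setdefault(rec.sha1, FileView(rec.sha1, rec.md5, rec.file_size))
    view.total_reports += rec.reports
    view.first_seen = rec.seen_on if view.first_seen is None else min(view.first_seen, rec.seen_on)
    view.last_seen = rec.seen_on if view.last_seen is None else max(view.last_seen, rec.seen_on)

From the accumulated total_reports, first_seen and last_seen values, a recipient can then derive a file's age and its activity trend over time.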
We also share weekly information in a Top 100 list; the top 20 in the report generated July 10th are shown here:

Threat Name (ranked by ITW Index*):

Worm:Win32/Koobface.gen!D [generic]
VirTool:WinNT/Koobface.gen!B [generic]
Worm:Win32/Koobface.gen!D [generic] [non_writable_container]
TrojanProxy:Win32/Koobface.gen!C [generic] [non_writable_container]
Trojan:Win32/Liften.A [non_writable_container]
TrojanDownloader:Win32/Small.gen!B [generic] [non_writable_container]
Trojan:Win32/Matcash.gen!M [generic]
Backdoor:Win32/Delf.B [non_writable_container]
Trojan:Win32/Tibs.gen!lds [generic]
Trojan:Win32/Vundo.gen!AN [generic]
PWS:Win32/Daurso.gen!A [generic] [non_writable_container]
PWS:Win32/Daurso.gen!A [generic] [non_writable_container]
PWS:Win32/Daurso.gen!A [generic] [non_writable_container]
Trojan:Win32/C2Lop.gen!B [generic]
Trojan:Win32/Killav.gen!A [generic]
VirTool:Win32/Injector.gen!G [generic]
Trojan:Win32/Vundo.gen!AN [generic]
VirTool:WinNT/Koobface.gen!B [generic]

* ITW Index is an abstract representation of one element against another; it does not represent an actual count.
We hope that sharing this type of information can help security vendors prioritize their resources to combat malicious threats in the wild. It is also our goal to encourage, by example, other security vendors to share data with AV product testers; the testers can then analyze and aggregate this data to better assess the relevance of threats and weigh them meaningfully in their tests.

[- Tony]

The examples above stand in sharp contrast to a password stealer that I encountered when someone passed me a spam message from within an MMORPG I was playing (SHA1: 3BC300E799D57601004692D3E1282637535257FA, MD5: A662DF230142E1E10DB4E8A2865E3AB7). I downloaded it and submitted it so our products would be able to protect against it.  And to this day, there has been no outside telemetry for this piece of malware.  But from a tester’s perspective, my password stealer and each of the examples Tony showed above count the same, despite the fact that all of Tony’s samples have been seen on users’ machines significantly more often.

So, to get people thinking along these lines, I would propose the following test strategy:

  1. Test samples are gathered along with accompanying telemetry.
  2. The statistics are normalized per contributor, so that a larger company with more seats does not overwhelm another contributor’s telemetry (a brief sketch of this step follows the list).
  3. A function is applied to rate the significance of each individual test sample, and each sample is granted a value.
  4. Detecting a sample earns points corresponding to that sample’s granted value.
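
To make step 2 concrete, here is one possible sketch in Python (the exact normalization a tester chooses may well differ): each contributor's raw counts are converted into shares of that contributor's own total before being combined.

def normalize_per_contributor(reports_by_contributor):
    """reports_by_contributor: {contributor: {sample_id: raw_count}}.
    Returns {sample_id: summed per-contributor shares}."""
    combined = {}
    for counts in reports_by_contributor.values():
        total = sum(counts.values())
        if total == 0:
            continue
        for sample_id, count in counts.items():
            combined[sample_id] = combined.get(sample_id, 0.0) + count / total
    return combined

# Example: a contributor reporting 900 sightings of sample A out of 1,000
# total reports and a contributor reporting 9 out of 10 each contribute a
# 0.9 share toward sample A, regardless of how many seats they have.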

Here is an example:

Sample A:            50
Sample B:            25
Sample C:            15
Sample D:            5
Sample E:            2
Next 100 samples:    3 (combined)

If a product misses only E, it scores 98.
If a product detects only A, B, C, D and E, it scores 97.
If a product misses only A and detects everything else, it scores 50.
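
The same arithmetic, written out as a small Python script using the granted values above:

# Granted values from the example above; the next 100 samples are worth
# 3 points combined.
granted = {"A": 50, "B": 25, "C": 15, "D": 5, "E": 2, "next 100": 3}

def score(detected):
    """Sum the granted values of everything the product detected."""
    return sum(value for sample, value in granted.items() if sample in detected)

print(score({"A", "B", "C", "D", "next 100"}))   # misses only E -> 98
print(score({"A", "B", "C", "D", "E"}))          # detects only A through E -> 97
print(score({"B", "C", "D", "E", "next 100"}))   # misses only A -> 50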
I’ve simplified the example greatly.  But you should be able to see that, basically, the product is being rated on its ability to detect what users are likely to encounter (and have encountered).  The significance is that it is far more important to be protected against sample A, because it is so much more prevalent; it alone accounted for half of all infections!

It will take some time, just as it took time for most testers to fully recognize that detection scores cannot disregard the accompanying false positives.  And even if testers don’t fully embrace this type of testing, we hope we have opened their minds to shaping their test sets into something more meaningful to their constituents, the computer-using public.

-- Jimmy Kuo and Tony Lee