Users today are offered choices among many security products, many of which are sufficient and none of which is perfect.  Along with these products come myriad product test results and certifications, all there to help you make a better, more informed decision about which product to use.  And as product developers, we’ll point to the tests and reviews that best represent our product. (Like this recent report on the just-released Microsoft Security Essentials Beta and the most current AV-Comparatives test showing Windows Live OneCare (OneCare) reaching the vaunted status of Advanced+.)

But are the tests doing what they ought to do?

I would like to take this opportunity to present a case for advancing the methodology of testing security products.

For as long as this industry has existed, product testing has been conducted by throwing huge numbers of malware samples at a product and seeing how many of them it can detect.  "Improvement" in testing was measured by increasing the number of samples.  "Comprehensiveness" meant having millions of samples instead of thousands, and covering the many types of malware rather than just large quantities of it.  Only recently has consideration of false positives (FPs) finally begun to influence the interpretation of test results.

(An example: it is this consideration of false positives that allowed OneCare to win the latest AV-Comparatives test.  Two other comparable products took part, one scoring a higher detection rate than OneCare and one the same.  But because both were also among the highest in FPs (more than 15 each), they fell to Advanced. OneCare had only 0-2 false positives, the lowest of all tested products, and the only one in that lowest category.)

Because false positives cause unnecessary upheaval that may result in nonfunctioning machines, and because a high detection rate often correlates directly with a propensity to FP, we would like to recognize AV-Comparatives, and all the other testers and certifications that do not blindly judge detection capability without considering false positives.  And our hats are off to Virus Bulletin for having maintained a no-FP requirement for its VB100 Award for the longest time.

So, the recognition that false positives are an important consideration in the interpretation of test results is now becoming standard.  What next to make tests more meaningful for the real user? 

As I mentioned before, the standard way of testing is to throw lots and lots of malware at the products and report a detection percentage.  This is then presented as a measure of the product's quality.

But does that really represent quality for the average user?  The tests do not simulate the likely scenario on our machines at home or at the office.  So how meaningful is the result?  If a product misses 1% more than another, are those 10,000 samples out of a million meaningful to you?  Maybe they are 10,000 distinct samples of a single server-side polymorphic trojan from one site that your browser happens to warn you not to visit.  Or they might consist mostly of a set of targeted attacks: important to the targeted entity and the products it uses, but to you or me?

How do we fix this?

One of the best advances in the security industry in recent years is our ability to capture telemetry about the malware cases we encounter.  The data associated with malware infections enables us to produce the semiannual Security Intelligence Report.  And selective use of prevalence reports enables us to decide each month how the MSRT can best protect the ecosystem.  Others in the industry use their telemetry to produce their own reports, as well as free tools to clean up the most prevalent malware affecting the ecosystem.

What we need to do is incorporate this data into the tests.  To accomplish this, the Microsoft Malware Protection Center (that’s us), through the arrangements that give other security vendors access to the malware we collect, has started to also provide normalized prevalence data to other security vendors, security industry testers, and the WildList Organization.

Tony Lee manages our collection of malware and its distribution to our partner security vendors who care to participate in the Microsoft Virus Information Alliance (VIA).  He will contribute the next section of this blog…

Malware continues to evolve in its ability to distribute, mutate and update itself at an increasingly fast pace – we’re often talking about hours and days here. Malware also targets groups of the population of varying sizes. These infection characteristics pose challenges to AV product testing, in both the demographic and the chronological sense. To meaningfully reflect a product's ability to protect its users, the testing methodology employed needs an up-to-date and accurate view of the threat landscape.

Through telemetry collected by our various antimalware products, we are able to observe, in near real time, what is statistically significant in reflecting the state of threat activity in the wild. For example, by observing a threat's first-seen and last-seen dates and its occurrences during various periods of time, we can assess its age, severity and activity trend at both the file and the threat level.
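
Purely as an illustration (this is not a description of our actual pipeline), here is a minimal sketch of how per-file telemetry reports could be folded into first-seen and last-seen dates and a recent-activity count, from which age and trend can be estimated. The record layout, dates and truncated hashes are assumptions made for the example.

```python
# Hypothetical sketch only: fold raw per-file telemetry reports into
# first-seen/last-seen dates and a recent-activity count, from which
# a file's age and activity trend can be estimated.
from collections import defaultdict
from datetime import date, timedelta

# Assumed report layout: (sha1, report_date). The hashes are truncated
# placeholders, not real identifiers.
reports = [
    ("57fba4d1...", date(2009, 7, 1)),
    ("57fba4d1...", date(2009, 7, 8)),
    ("7017d9cc...", date(2009, 6, 20)),
]

today = date(2009, 7, 10)
stats = defaultdict(lambda: {"first_seen": None, "last_seen": None,
                             "total": 0, "last_30_days": 0})

for sha1, seen in reports:
    s = stats[sha1]
    s["first_seen"] = seen if s["first_seen"] is None else min(s["first_seen"], seen)
    s["last_seen"] = seen if s["last_seen"] is None else max(s["last_seen"], seen)
    s["total"] += 1
    if today - seen <= timedelta(days=30):
        s["last_30_days"] += 1

for sha1, s in stats.items():
    age_days = (today - s["first_seen"]).days
    print(f"{sha1}: age {age_days} days, {s['last_30_days']} reports in the last 30 days")
```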

Recently, I established an experimental program to share this prevalence data with our security partners, and we have received very positive feedback and suggestions. At the core of this program is an automated process that monitors notable new threat activity as it takes place in the field. The process then aggregates, analyzes and publishes this data to security partners over an encrypted channel on a daily basis. Recipients of this information can assimilate the data over time and construct a view similar to the example below:

SHA1: 18375FD78CDE1E1B7291FBC37831CB36013895FD
MD5: 9FFCA5614A1032B0709ECAB67DF10F49
Total Reports: 17,052
File Size: 96,047
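
For illustration only, a recipient might assimilate such daily records into a cumulative per-file view keyed by SHA1, along the following lines; the field names and the two example deliveries are assumptions made for this sketch, not the actual feed format.

```python
# Hypothetical sketch only: merge daily prevalence deliveries into a
# cumulative per-file view keyed by SHA1 (field names are assumed).
cumulative = {}

def assimilate(daily_records):
    """Fold one day's feed into the cumulative view."""
    for rec in daily_records:
        entry = cumulative.setdefault(rec["sha1"], {
            "md5": rec["md5"], "file_size": rec["file_size"], "total_reports": 0,
        })
        entry["total_reports"] += rec["reports"]

# Two example deliveries for the same file; the split of the report count
# across days is invented for illustration.
assimilate([{"sha1": "18375FD78CDE1E1B7291FBC37831CB36013895FD",
             "md5": "9FFCA5614A1032B0709ECAB67DF10F49",
             "file_size": 96047, "reports": 9000}])
assimilate([{"sha1": "18375FD78CDE1E1B7291FBC37831CB36013895FD",
             "md5": "9FFCA5614A1032B0709ECAB67DF10F49",
             "file_size": 96047, "reports": 8052}])

print(cumulative)  # total_reports accumulates to 17,052 for this file
```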
 
We also share weekly information in a Top 100 list; the top 20 in the report generated July 10th are shown here:

Rank  SHA1                                      Threat Name                                                            ITW Index*
1     57fba4d10135c316676b9ad6c0c01c36dc63203a  Worm:Win32/Koobface.gen!D [generic]                                    56
2     52c9b8405ba34081e64482cdc843bc4c86201e03  VirTool:WinNT/Koobface.gen!B [generic]                                 50
3     0a7499954d78214189824f8c5cda0b8267882921  Worm:Win32/Koobface.gen!D [generic] [non_writable_container]           43
4     8fc4a8c85c97b1094014fab96fc1135e79e6a41a  TrojanProxy:Win32/Koobface.gen!C [generic] [non_writable_container]    38
5     7017d9cc703d195240679158e4f4bb229c25db5d  Trojan:Win32/Liften.A [non_writable_container]                         37
6     93afca82dc4e0e78a61740dd21cfa1e13ef638ab  TrojanDownloader:Win32/Small.gen!B [generic] [non_writable_container]  36
7     51dd6f7bea5c1f8bcac756e34da0964af1193a36  Trojan:Win32/Matcash.gen!M [generic]                                   34
8     04cb20e91195126351fdd8ec472e663bfed5b452  Backdoor:Win32/Delf.B [non_writable_container]                         33
9     db9d18d257df0bb2ef894e3c25dbe42fb787ed34  Trojan:Win32/Tibs.gen!lds [generic]                                    25
10    85589f11ab008a9954acb9a80d97836d40c8d464  Trojan:Win32/Vundo.gen!AN [generic]                                    25
11    e28580d1d635e7e4702b5975a00ceb61762d6a11  TrojanDownloader:Win32/VB.XR                                           23
12    3ed104ed15396c6a45d12621b577211700193179  PWS:Win32/Daurso.gen!A [generic] [non_writable_container]              23
13    245bfc230c2f93304dcd741000e4c53197b081cc  PWS:Win32/Daurso.gen!A [generic] [non_writable_container]              22
14    b2268207ea777d07620f983f96f51da34c7bb3bf  PWS:Win32/Daurso.gen!A [generic] [non_writable_container]              22
15    2160b1794492f332ded96514785265ce4d21e8ef  Trojan:Win32/C2Lop.gen!B [generic]                                     22
16    0890ff9aa1b4330561f53bb11a3fb00446515477  Trojan:Win32/Killav.gen!A [generic]                                    19
17    8579da5efc66348179bd9ea9985478887e2a5946  Trojan:Win32/Ertfor.A                                                  17
18    948f6e13e36170a94f32edabb71c1e5b45324724  VirTool:Win32/Injector.gen!G [generic]                                 17
19    a07938f44a443026ace653e8181518910fb3d103  Trojan:Win32/Vundo.gen!AN [generic]                                    17
20    3ce19165aeb97e92d4e55ba0fbe73c0aeea51d51  VirTool:WinNT/Koobface.gen!B [generic]                                 16

* ITW Index is an abstract representation of one element against another; it does not represent actual count.

We hope that sharing this type of information can help security vendors prioritize resources to combat malicious threats in the wild. It is also our goal to encourage, by example, other security vendors to share data with AV product testers; the testers can then analyze and aggregate this data to better assess the relevance of threats and weight them meaningfully in their tests.

[- Tony]

The examples above contrast sharply with a password stealer I encountered when someone passed me a spam message within an MMORPG I was playing (SHA1: 3BC300E799D57601004692D3E1282637535257FA, MD5: A662DF230142E1E10DB4E8A2865E3AB7). I downloaded it and submitted it so our products would be able to protect against it.  To this day, there has been no outside telemetry for this piece of malware.  Yet from a tester’s perspective, my password stealer and each of the examples shown by Tony count exactly the same, despite the fact that Tony’s samples have been seen on users’ machines many more times.

So, to get people thinking along these lines, I would propose this method as a test strategy:

  1. Test samples are gathered with accompanying telemetry.
  2. Statistics are then normalized per contributor, so that a larger company with more seats does not overwhelm another contributor’s telemetry.
  3. Each sample is then assigned a value by applying a function that rates its significance.
  4. Detection of a sample earns points corresponding to that sample’s assigned value.

Here is an example:

                               Prevalence
Sample A:                              50
Sample B:                              25
Sample C:                              15
Sample D:                               5
Sample E:                               2
Next 100 samples (combined):            3
 
If a product misses only E, it scores 98.
If a product detects only A, B, C, D, and E, it scores 97.
If a product misses only A and detects everything else, it scores 50.
 
I’ve simplified the example greatly, but you should be able to see that the product is being rated on its ability to detect what users are likely to encounter (and have encountered).  The significance is that it is far more important to be protected against sample A, because it is so much more prevalent: it alone accounted for half of all infections!
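
To make the arithmetic concrete, here is a minimal sketch of how such a prevalence-weighted score could be computed. It assumes steps 1 and 2 (telemetry gathering and per-contributor normalization) have already produced the prevalence values above, uses prevalence itself as the significance function, and models the "next 100 samples" as sharing a combined weight of 3; none of these choices is a prescribed formula.

```python
# Minimal sketch of prevalence-weighted scoring for the example above.
# Assumption: prevalence itself is the significance function, and the
# "next 100 samples" share a combined weight of 3 (0.03 each).
weights = {"A": 50, "B": 25, "C": 15, "D": 5, "E": 2}
weights.update({f"other{i}": 3 / 100 for i in range(100)})
total = sum(weights.values())  # 100

def score(detected):
    """Percentage of total prevalence weight covered by the detected samples."""
    return round(sum(weights[s] for s in detected) / total * 100, 2)

all_samples = set(weights)
print(score(all_samples - {"E"}))        # misses only E            -> 98.0
print(score({"A", "B", "C", "D", "E"}))  # detects only A through E -> 97.0
print(score(all_samples - {"A"}))        # misses only A            -> 50.0
```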

It will take some time, just as it took time for most testers to fully recognize that detection scores cannot disregard the accompanying false positives.  And even if testers don’t fully embrace this type of testing, we hope we have opened their minds to shaping their test sets into a better representation of what is meaningful to their constituents, the computer-using public.

-- Jimmy Kuo and Tony Lee