One person's code is another person's data

A number of people have heard this rant from me, especially during the development of BitLocker. The point was reinforced by the announcement of buffer overrun vulnerabilities in image processing code discovered back in 2005 (http://www.us-cert.gov/cas/techalerts/TA05-312Apr.html). The processing of data became an important threat consideration for BitLocker and shaped part of its final design.

When planning a feature, and repeatedly throughout its design, it's important to understand what problem is being solved so that the solution can be validated against it. When I was originally given the task of creating what eventually became BitLocker, the problem that needed solving was simple:

SysKey (System Key) is stored obfuscated allowing automated discovery. SysKey is used to encrypt multiple keys.

I was part of NGSCB (Next-Generation Secure Computing Base), looking at using a TPM 1.2 feature known as DRTM (Dynamic Root of Trust Measurement) that was being designed by multiple parties (https://www.trustedcomputinggroup.org/). The idea of using the TPM to protect SysKey was too good to pass up. At the same time, some colleagues were working on code integrity, and the initial design thinking was to combine the two to protect SysKey. When I joined NGSCB, this idea was presented to me, and everyone thought it would be a simple side project to the intended primary task of working on the secure kernel, requiring no more than ½ a developer.

One aspect of the BitLocker design process that I want to cover in this blog is code and data integrity. As any good engineer would do, I took a step back and performed a "does this make sense" pass... this was a security solution, but was it actually secure? Very quickly, I was convinced that the answer, for this early design, was no. Code integrity was intended to check the integrity of code... but nothing was being done about data. I described this at the time as a wrought iron gate with no walls. It got the point across.

When does data become code?

When it's parsed.

Let's ignore buffer overruns and parser bugs for a second and consider how a script interpreter works. There are multiple variations, but ultimately a text file, full of strings, is converted into an internal data structure (in some cases this data structure is very compact, often called p-code or byte-code). This sequence of data is parsed, resulting in actions based on the data. As an aside, I've enjoyed writing a few interpreters, including "TKTSLogo".
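
To make that concrete, here is a minimal sketch of such an interpreter in Python. It is not TKTSLogo or any real product; the four-instruction language (PUSH, ADD, MUL, PRINT) and the function names are invented for illustration. The text is first converted into a compact list of (opcode, argument) pairs, and that list is then walked to produce actions.

```python
def compile_script(text):
    """Convert a text file's contents into a compact internal form:
    a list of (opcode, argument) pairs. The language is hypothetical."""
    program = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        op, _, arg = line.partition(" ")
        program.append((op.upper(), int(arg) if arg else None))
    return program


def run(program):
    """Walk the internal form and act on it. Every action taken here is
    chosen by the data, not by the interpreter's own code."""
    stack = []
    output = []
    for op, arg in program:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "PRINT":
            output.append(stack[-1])
        else:
            raise ValueError(f"unknown instruction: {op}")
    return output


script = "push 2\npush 3\nadd\nprint"
print(run(compile_script(script)))
```

The interpreter itself never changes, yet swapping "add" for "mul" in the text changes what the machine does. From the interpreter's point of view the script is just data; from the author's point of view it is code.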

Now consider what happens with many data files. A program processes a data file (or retrieves data from the registry, or from a database) and ultimately makes decisions based on that data. These decisions might be as simple as what font to use or the color of the text in an error dialog. They may be as dangerous as what DLLs to load, or what programs to execute. Ah, but (again, in the absence of security bugs) the programs are signed, so what's the problem if the data causes a program to be called?

Many programs are intrinsically unsafe. For example, one program can be invoked to disable the firewall, and another can be invoked to copy all the files to a network location. In many cases all it takes is the right parameters or a script... And what's a script? It's data parsed by a specific program.
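
A small sketch of the problem, again in Python. Everything here is hypothetical: the tool names, the JSON layout, and the allow-list standing in for a signature check are assumptions made for illustration, and nothing is actually executed. The point is that a code integrity check can pass while the data alone decides whether the outcome is harmless or harmful.

```python
import json

# Hypothetical stand-in for a code integrity check: the set of tools
# whose binaries are "signed". These names are invented.
SIGNED_TOOLS = {"fwtool", "copytool"}


def plan_command(config_text):
    """Read a data file and decide which command line to launch.
    Only the program is checked; its arguments come straight from data."""
    config = json.loads(config_text)
    tool = config["tool"]
    if tool not in SIGNED_TOOLS:
        raise PermissionError(f"{tool} is not a signed program")
    return [tool] + list(config.get("args", []))


benign = '{"tool": "fwtool", "args": ["--status"]}'
tampered = '{"tool": "fwtool", "args": ["--disable"]}'

# Both pass the "signed program" check; only the data differs.
print(plan_command(benign))
print(plan_command(tampered))
```

The check on the program is the wrought iron gate. The unprotected data file is the missing wall.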

It is important therefore that data should be included in any threat model analysis, and not ignored.

The Rest Is History

For BitLocker, this was mitigated by using a combination of diffusion and encryption. In coming up with this solution, we arrived at BitLocker's biggest selling feature: encryption of the entire volume rather than just SysKey. Niels wrote about why diffusion and encryption mitigate data modification in his white paper, so I won't repeat it here.

My ending thought for the day is what I started with:

One person's data is another person's code.