Hello world!  Remember Mad Libs?  How about Scrabble, when you'd make up words that sounded legit, only to have a friend call your bluff?  Games like these provide endless hours of fun with words and letters.  In software and on the Internet, words, letters, and text are everything.  Whether you're up in the cloud, down in the code, or consuming the content, written language is the information that's central to it all.

Unicode provides a set of standards for representing most of the world's languages and scripts within a single framework.  It's pretty awesome, really: the ability to capture the world's scripts past, present, and future.  Where else would you find a character set that encodes everything from ASCII (Latin) to the symbols of the ancient Phaistos Disc, such as the PLUMED HEAD sign (U+101D1)?

Unicode has become the de facto system for representing and encoding characters across computing platforms.  It's central to most modern operating systems, programming languages, and applications.  But, as with a networking protocol stack, most software developers don't want to wrangle with the details.  It should be good enough to know that your strings are handled as Unicode, so you can build your software without sorting out the complex details of charset transcoding, normalization, and so on.

Still, there are attacks and countermeasures that developers should know about.  In my BlueHat presentation I intend to cover two broad categories: visual perception attacks and character transformations.  In the cloud, URLs rule.  Okay, URI has superseded URL and, with Unicode, we should really be talking about the IRI (Internationalized Resource Identifier).  With the growth of Internationalized Domain Names (IDNs), IRIs have just as much of a place as URIs.  What I'm really concerned with are the domain names, the IDNs.  We saw visual spoofing attacks as early as 2002, and again with Eric Johanson's PayPal spoof in 2005.  Times have changed since then, and the browser vendors and registrars have gotten smarter about IDN.
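To make the visual spoofing idea concrete, here is a rough Python sketch of a mixed-script check on a domain label.  The helper names (`scripts_in_label`, `looks_mixed_script`) and the sample labels are illustrative assumptions, not something from the talk, and the "script" is only approximated from the first word of each character's Unicode name, since the standard library does not expose the Unicode Script property.

```python
import unicodedata


def scripts_in_label(label: str) -> set[str]:
    """Return a rough set of script names for the letters in a domain label.

    Heuristic (an assumption of this sketch): the first word of a
    character's Unicode name, such as "LATIN" or "CYRILLIC", stands in
    for its script.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed_script(label: str) -> bool:
    """Flag labels whose letters come from more than one script."""
    return len(scripts_in_label(label)) > 1


genuine = "paypal"
spoofed = "p\u0430ypal"  # second letter is U+0430 CYRILLIC SMALL LETTER A

print(genuine == spoofed)            # the strings differ even if they render alike
print(scripts_in_label(genuine))
print(scripts_in_label(spoofed))
print(looks_mixed_script(spoofed))
```

A check like this only catches labels that mix scripts.  A label built entirely from look-alike characters of a single script would pass, so treat it as one signal rather than a complete defense.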

However, new attack vectors continue to emerge.  I plan to demo some of these and describe the current landscape of IDN, especially as it relates to the IDN revisions that are soon to be standardized.  These revisions, dubbed IDNA 2008, bring important changes, both good and dangerous.  On the one hand, we've moved from an exclusion-based model to an inclusion-based model for allowed characters.  On the other hand, we'll have edge cases where a single domain name could resolve to two different IP addresses, depending on whether the client follows the new or the old IDN standard.  Can your cloud-based services be spoofed?
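One commonly cited edge case of this kind involves the German sharp s (U+00DF).  The sketch below is only an illustration under stated assumptions: the label "faß" is my choice of example, Python's built-in `idna` codec stands in for the older IDNA 2003 behavior (it follows RFC 3490 and applies Nameprep), and the IDNA 2008 side is approximated by Punycode-encoding the label unchanged, without any of the validity checks a real IDNA 2008 implementation performs.

```python
label = "fa\u00df"  # "faß", containing U+00DF LATIN SMALL LETTER SHARP S

# IDNA 2003 path: Nameprep maps the sharp s to "ss" before encoding,
# so the label ends up as plain ASCII.
idna2003 = (label + ".de").encode("idna")

# IDNA 2008-style path (sketch only): the sharp s is kept as a valid
# character, so the label is Punycode-encoded as is and given the ACE prefix.
idna2008 = b"xn--" + label.encode("punycode") + b".de"

print(idna2003)
print(idna2008)
print(idna2003 == idna2008)  # two different names on the wire
```

The same string typed by a user can therefore be looked up under two different ASCII names, and nothing guarantees that both names belong to the same owner.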

Moving along, we'll take a closer look at how character transformations can be used to exploit software.  Some characters really do have split personalities, much like Dr. Jekyll and Mr. Hyde.  That affects you whether your product parses text and needs to prevent buffer overflows, or it's a Web app looking to defend against XSS attacks.  Through subtle manipulations, attackers could send you strings that expand by factors of up to 18x when normalized.  In attempts to evade XSS filters, an attacker could inject characters such as U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE which, when lowercased, changes to U+0069 LATIN SMALL LETTER I.
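Both effects are easy to reproduce.  The following Python sketch is a minimal illustration under a few assumptions of mine: U+FDFA is used as the expanding character, the filter `naive_filter_allows` is hypothetical, and the filter bypass uses U+017F LATIN SMALL LETTER LONG S rather than U+0130, because Python applies full case mapping and lowercases U+0130 to "i" followed by a combining dot (U+0307), whereas runtimes that use simple case mapping produce U+0069 alone.

```python
import unicodedata

# 1. Expansion under normalization: one code point in, many out.
ligature = "\uFDFA"  # ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
expanded = unicodedata.normalize("NFKC", ligature)
expansion_factor = len(expanded) // len(ligature)  # measured in code points
print(len(ligature), len(expanded), expansion_factor)


# 2. Case mapping applied after a filter has already run.
def naive_filter_allows(text: str) -> bool:
    """Hypothetical filter: reject input containing 'script' or 'SCRIPT'."""
    return "script" not in text and "SCRIPT" not in text


payload = "<\u017Fcript>"      # long s in place of "s"
passed = naive_filter_allows(payload)
after_upper = payload.upper()  # a later stage that upper-cases the text
print(passed, after_upper)

# 3. How this Python runtime lowercases U+0130 (full case mapping).
dotted_i_lower = "\u0130".lower()
print([hex(ord(c)) for c in dotted_i_lower])
```

A buffer sized from the pre-normalization length would be too small for the expanded string, and a filter that runs before case mapping never sees the string the next stage actually receives.  Doing the transformation first and validating afterward avoids both problems.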

In other situations, the processing of special Unicode characters such as the byte order mark (BOM) might also open up exploits.  Because many assigned characters have special meanings and properties, their use outside of their intended scope may require closer attention.
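As a small illustration of how the BOM can cause trouble, the Python sketch below shows two things: the same bytes decoding to different strings depending on the codec, and a BOM hidden inside a keyword slipping past a check.  The filter `naive_block` and the "downstream" step that strips U+FEFF are hypothetical stand-ins, not a description of any particular product.

```python
import codecs

# The same bytes, decoded two ways.
raw = codecs.BOM_UTF8 + b"admin"
plain = raw.decode("utf-8")         # keeps the BOM as U+FEFF
stripped = raw.decode("utf-8-sig")  # drops one leading BOM
print(repr(plain), repr(stripped), plain == stripped)


def naive_block(text: str) -> bool:
    """Hypothetical filter: block input containing 'script'."""
    return "script" in text


# A BOM in the middle of a keyword hides it from the filter.
payload = "<scr\uFEFFipt>"
blocked = naive_block(payload)

# Hypothetical later stage that silently discards U+FEFF.
downstream = payload.replace("\uFEFF", "")
print(blocked, downstream)
```

If one component compares the BOM-prefixed string and another compares the stripped one, they can disagree about whether two inputs are equal, which is the kind of mismatch an attacker looks for.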

I'm happy to be going over these issues with you and the BlueHat crowd at my talk, "Character Transformations: Finding Hidden Vulnerabilities," aimed at developers and testers.  I want developers to see some of the issues, and I want testers to see some new inputs and test cases.

-Chris Weber

Co-Founder, Casaba Security