Ben Ari's UAG and IAG Blog

Plenty of useful and fun info on UAG, Microsoft's remote access and reverse-proxy product.

Ain’t nuthin regular about regular expression!


One of the wonderful things about UAG is its ability to accept many of its configuration options using wildcards. This means, for example, that if an application you are publishing uses 10 different servers, you may be able to key in one expression that describes all of them, rather than having to feed in all the server names manually. This is also useful when you want to define access rules for your applications: you can define access to a complex, multi-URL application with just one or a few expressions, rather than having to type in all possible URLs and variations.

To accomplish that, UAG supports the “Regular Expression” text-matching mechanism, also called RegEx for short.

Regular Expression comes from Computer Science, and has been around for a long time. Many systems and development environments are built to use it, and it has been making life easier for everyone. Many of UAG’s configuration options are set to accept input written in the Regular Expression format. For example, when you publish an application, you may describe the name of the server using a regular expression:

[Image: clip_image002]

Naturally, not every text field in UAG allows this, but many of them do. Here are a few examples:

1. Search and replace using content types

2. Server and URLs for skip/do not skip body parsing

3. Global URL Character rules

4. URL identification for Download, upload, restricted zone and ignore requests in timeout calculations

5. URLs and URL parameters in the URL set configuration

6. Most configurations done in XML files, like the Application Wrapper configuration

To be able to use RegEx, you need to understand its formatting, and that may not be so simple. Being as old as it is, it’s extremely well documented, but not every product implements all the RegEx options exactly the same way. Naturally, the basic things are common to all implementations, but some of the more complex structures may be different, or not implemented at all in UAG’s RegEx library.

Before I go into the actual formatting of RegEx, let’s discuss the process of using it. First, you need to understand the commonalities in your data set, so you can clearly define the wildcard. For example, let’s say your servers are:

- Engineering001

- Engineering002

- EngDB01

- EngDB2

Would you be able to describe all four in one expression in plain English? Not so sure…but that would be a good starting point. In this example, all servers begin with “Eng”, continue with a middle string of “ineering” or “DB”, and end with one, two or three digits. If you needed to define just the first two, it would be simpler, of course…both servers begin with Engineering00 and end with one character that is either 1 or 2. RegEx does allow you to cover either the two servers, or all four, but will require a more complex expression to cover the four.

The next step is to pull up the RegEx documentation (some links at the bottom of the post), and be creative. At the most basic level, you could just list all 4 with pipes (Engineering001|Engineering002|EngDB01|EngDB2), but you would probably want to get a little cleverer. Naturally, the more servers you have, the harder this becomes, and unfortunately, there’s no magic trick to solve this for you…just experience and brainstorming.

The last step, once you have your mind around this, is to validate the expression. For this, there are websites like http://tools.netshiftmedia.com/regexlibrary/ (more links at the bottom of the post). Here, you feed in your regular expression, and the string it’s supposed to match (your server names, in our example), and it will light up in green if it matches, and red if it doesn’t. In our case, you would feed in the server names, one by one, and hope to get a green for all of them. If one comes back red, it means your expression is not matching it, so you need to go back to the drawing board.
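If you would rather check an expression locally, here is a minimal sketch of the same green/red test in Python. Keep in mind that Python’s re module is only a stand-in here: UAG uses its own RegEx library, which may behave differently. The helper name check is made up for this example, and requiring the whole name to match (re.fullmatch) is my assumption.

```python
import re

def check(pattern, names):
    """Return {name: True/False}: does the WHOLE name match the pattern?

    Hypothetical helper for illustration; not part of UAG.
    """
    compiled = re.compile(pattern)
    return {name: compiled.fullmatch(name) is not None for name in names}

servers = ["Engineering001", "Engineering002", "EngDB01", "EngDB2"]

# The "list everything with pipes" expression from earlier in the post.
results = check(r"Engineering001|Engineering002|EngDB01|EngDB2", servers)

for name, ok in results.items():
    print("GREEN" if ok else "RED  ", name)
```

Swap in your own expression and server list; any name that comes back RED means the expression needs more work.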

The basic concept of Regular Expression is the NON-LITERALS. These are characters that can be used within the expression as “commands”. For example, the most common non-literal is the dot (.), which means “any single character”. If you are familiar with DOS wildcards, this is equivalent to the question-mark character, where “dir file?.txt” would show any file that starts with the word ‘file’ and has one additional character in the name. To clear up a common mistake…this is NOT like the asterisk character in DOS, which is for a string of any length. If we go back to our example from before, then you could describe the first 2 servers using “Engineering00.”, because both the 1 and 2 are covered by that dot. For that matter, you could also use “Engineering0..”, “Engineering...”, “Eng...........”, or if you’re really crazy, even “..............”. Yup, that’s fourteen dots, and that’s not a good idea. Can you think of a reason why?

The answer is that this expression would cover ANY server that has 14 characters in its name. In the UAG world, this would mean that if your application refers to some other server that has 14 characters in the name, UAG will treat it as a possible internal server, and scan every page it delivers for servers with that characteristic. It could lead to real havoc, as many servers on the internet would match that…including www.office.com and others.
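To see the difference between a tight expression and the fourteen-dot one, here is a small sketch in Python. Again, this uses Python’s re module as a stand-in for UAG’s RegEx library, and it assumes the whole name has to match.

```python
import re

narrow = re.compile(r"Engineering00.")  # 13 literal characters + one wildcard
broad = re.compile("." * 14)            # fourteen dots: any 14 characters

for name in ["Engineering001", "Engineering002", "www.office.com"]:
    print(name,
          "narrow:", narrow.fullmatch(name) is not None,
          "broad:", broad.fullmatch(name) is not None)
```

Both expressions cover the two Engineering servers, but only the fourteen-dot one also lets www.office.com through.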

Another useful non-literal is the asterisk (star), which means “repeat the previous token zero or more times”. The token is just the single character (or group) right before the star, so the expression “server*” matches “serve”, “server”, “serverr”, “serverrr” and so on. To repeat the whole word, you would wrap it in parentheses, so “(server)*” could match “server”, “serverserver”, “serverserverserver” and so on. It’s very common to use the dot and the star together as a generic wildcard, so “server.*” would mean “something that starts with server, followed by any characters, as many as there are”. This makes it equivalent to the old DOS asterisk char. This combination can also be used like this: “.*server.*”, which means “any text that has the word server somewhere in it”.
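The star is easy to get wrong, so here is a quick sketch in Python of what it actually binds to. The helper whole is a made-up name for this example, and Python’s re module may not behave exactly like UAG’s RegEx library.

```python
import re

def whole(pattern, text):
    """True if the entire text matches the pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

# The star applies only to the token right before it: the final "r".
print(whole(r"server*", "serve"))           # True (zero copies of "r")
print(whole(r"server*", "serverrr"))        # True
print(whole(r"server*", "serverserver"))    # False

# Parentheses turn the whole word into one token.
print(whole(r"(server)*", "serverserver"))  # True

# Dot + star is the generic "anything" wildcard.
print(whole(r"server.*", "server01.corp"))  # True
print(whole(r".*server.*", "myserver"))     # True (".*" may match nothing)
```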

Another useful structure is the square brackets, which allow you to define sets of characters (a.k.a. Character Class). For example, if you use the expression “Engineering00[0-9]”, that would mean that the last character can be any of the digits zero to nine. If you had to accommodate servers “Engineering000” to “Engineering999”, you could use “Engineering[0-9][0-9][0-9]”. You could save some space with repetition, so instead of specifying the same thing 3 times, you could use “Engineering[0-9]{3}”. In this case, the curly brackets refer to the token before them, which is the [0-9]. You can even specify a repeat range, like this {1,5}, which means the preceding token can repeat between one and five times, so in our case, “Engineering[0-9]{1,5}” would cover anything from Engineering0 to Engineering99999.
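Here is a short sketch of the digit class with a repeat count and a repeat range, written in Python for illustration. UAG’s RegEx library may differ in details, and whole-name matching is assumed.

```python
import re

three_long = re.compile(r"Engineering[0-9][0-9][0-9]")
three_short = re.compile(r"Engineering[0-9]{3}")     # same thing, shorter
one_to_five = re.compile(r"Engineering[0-9]{1,5}")

samples = ["Engineering7", "Engineering42", "Engineering042",
           "Engineering99999", "Engineering123456"]

for name in samples:
    print(name,
          "{3}:", three_short.fullmatch(name) is not None,
          "{1,5}:", one_to_five.fullmatch(name) is not None)
```

Only Engineering042 satisfies the {3} form, while the {1,5} form accepts everything in the list except the six-digit name.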

A useful variation of the character class is [0-9a-z], which is “all digits, all lower-case letters” (whether upper-case letters are covered too depends on whether the matching is case-sensitive). Yet another variation is [0-9a-z_-], which also includes the dash and underscore in the allowed characters. A character class always matches exactly ONE character, so if you define the expression “b[ioa]ng”, it would match ‘bing’, ‘bong’ and ‘bang’, but not ‘biong’, ‘boang’ or others.
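A small sketch of the “one character per class” rule, plus the case question, using Python’s re module as a stand-in. How UAG handles upper and lower case is not something this example can tell you; the IGNORECASE flag shown here is a Python feature.

```python
import re

words = ["bing", "bong", "bang", "biong", "boang"]
matched = [w for w in words if re.fullmatch(r"b[ioa]ng", w)]
print(matched)  # ['bing', 'bong', 'bang']

# Digits, lower-case letters, underscore and dash, 1 to 20 characters long.
name_chars = r"[0-9a-z_-]{1,20}"
lower_ok = re.fullmatch(name_chars, "eng-db_01") is not None
mixed_strict = re.fullmatch(name_chars, "EngDB01") is not None
mixed_relaxed = re.fullmatch(name_chars, "EngDB01", re.IGNORECASE) is not None
print(lower_ok, mixed_strict, mixed_relaxed)  # True False True
```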

Sometimes, you just can’t find a good combination of wildcards, and for that, we have the pipe…or “alternation”. The pipe serves to configure an expression that has several alternatives. Like the example I gave above, the expression “server1|server2” simply means that either server1 or server2 is covered.

Lastly, one thing to keep in mind is that the various non-literal characters that we covered here, as well as others, cannot be used as-is to represent an actual character in a server name. If your server’s name is really server.local, that could be a problem, as the dot is the wildcard. If you type it as-is into the expression, it will also cover servers named server_local, server9local, or a name with any other character in place of the dot, which could potentially lead to a problem. The answer is the backslash character, which signals to the RegEx processor to treat the next character as a literal. We refer to this as “backslashing” the name, so you would simply type “server\.local” instead. Not that hard…right?

And finally, let’s go back to our example. Knowing what you know now…can you think of a good way to represent all 4 servers simply and efficiently? There’s no single right answer here. One way I would have done it is:
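Here is a sketch of the difference the backslash makes, in Python. The re.escape call is a Python convenience that adds the backslashes for you; I am not aware of an equivalent inside UAG, so there you would type them by hand.

```python
import re

unescaped = re.compile(r"server.local")   # the dot is still a wildcard
escaped = re.compile(r"server\.local")    # the dot is a literal dot

names = ["server.local", "server_local", "server9local"]
loose = [n for n in names if unescaped.fullmatch(n)]
strict = [n for n in names if escaped.fullmatch(n)]

print(loose)   # all three names
print(strict)  # ['server.local']

# Python can generate the backslashed form from a plain name.
print(re.escape("server.local"))
```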

Eng(ineering|DB)[0-9]{1,3}
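A quick way to sanity-check an answer like this is to run it against the names you want, plus a few you don’t. The sketch below does that in Python; the extra names are made up for the test, and Python’s re module is standing in for UAG’s RegEx library, which may differ.

```python
import re

pattern = re.compile(r"Eng(ineering|DB)[0-9]{1,3}")

wanted = ["Engineering001", "Engineering002", "EngDB01", "EngDB2"]
# Hypothetical names, used only to probe the edges of the expression.
extras = ["Engineering7", "EngDB999", "EngDB1234", "EngineeringDB01", "Eng01"]

covered = [n for n in wanted if pattern.fullmatch(n)]
also_covered = [n for n in extras if pattern.fullmatch(n)]

print(covered)       # all four wanted servers
print(also_covered)  # ['Engineering7', 'EngDB999']
```

Notice the trade-off: the compact expression covers all four servers, but it also accepts names like Engineering7 and EngDB999 that the plain pipe-separated list would reject. Whether that is acceptable depends on what else lives on your network.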

Here are some more links, in case you are interested in getting more familiar with some of the advanced structures of RegEx:

http://www.zytrax.com/tech/web/regex.htm

http://www.regular-expressions.info/

http://msdn.microsoft.com/en-us/library/ms974570.aspx

http://www.dmoz.org/Computers/Programming/Languages/Regular_Expressions/

http://doc.cat-v.org/bell_labs/structural_regexps/

http://perldoc.perl.org/perlre.html

RegEx checkers:

http://bokehman.com/regex_checker

http://www.fileformat.info/tool/regex.htm

Comments
  • Nice tutotial Ben!
