Monday, August 5, 2013

Get Regular (Expressions)

There's been a lot of debate about Unix since it was first introduced by Bell Labs.  Computer geeks engaged in their own Ford vs. Chevy debate: Unix vs. VMS, multiple Unix minicomputers vs.an IBM mainframe, System V vs. BSD, and so on.  For every proponent of Unix's case-sensitive commands and file system, there were an equal number of opponents who just found it an annoying nuisance. Ah, the Holy Wars of Computerdom.  It was fun stuff.

But whether your roots are in Unix, or VMS, or MVS, or any of the PC/Mac operating systems, I think most of us have found regular expressions to be one of the most useful things to come out of the Unix/Bell Labs world.

We all use simple pattern matching of one form or another.  The MS-DOS DIR command uses an asterisk (*) to match any number of characters and a question mark (?) to match a single character.  The SQL LIKE operator is another example.  The percent sign (%) matches any number of characters, and the underscore (_) matched a single character.  That's really great stuff, and it's good enough 75% of the time.

But there are some problems that simple pattern matching won't solve.  Suppose I want to select rows of data from a table that had valid email_addresses.  My query might look like this:

select email_address
from email_address_table
where email_address is not null;


Well, that query gets all rows that have data in email_address column.  But are the data any good?  Who knows? Let's try the LIKE operator:

select email_address
from email_address_table
where email_address like '%@%';


That's a bit better.  At least we'll see data that looks like username@hostname.  But a regular expression can do even better.

select email_address
from email_address_table
where regexp_like( email_address, 
           '^[a-zA-Z0-9\._-]+@[a-zA-Z0-9\.]+\.[a-zA-Z]+$');


OK, so the example with LIKE is simpler.  But, after we pull all the data using LIKE, we need to do some post-processing ( in C, C++, Java, PL/SQL) to get the data we really wanted.  This regular expression accepts that data that
  1. Has a username before the "@" symbol.  The username can be any combination of letters, numbers, periods, underscore, or dash.
  2. Has an @ symbol separating the username from the hostname.
  3. Has a hostname consisting of letters, numbers, and periods, and ends with a top-level domain consisting of just letters.
This is a really  big improvement.  Now we have a tool that works for 99% of the data.

The magic is in the regular expression metacharacters.   Metacharacters are part of the regular expression grammar; this list is just a start to describe the example:
  1. ^ -- matches the beginning of the string
  2. .  -- period matches any character
  3. [] -- describes a set of characters to match.  In our example, the set of characters includes the letters from a to z and A to Z, period, underscore, and dash
  4. + -- count operator, specifies that the string must have one or more of the bracketed characters.
  5. \ -- escape character. If we want to match on a metacharacter, we must escape it.  The period is a metacharacter, and if we want period to match a period and nothing else, we escape it.
  6. $ -- matches the end of the string.
     And anything that's not a metacharacter matches itself.  For example, the "@" symbol in our expression is not a metacharacter, so it only matches itself.

    Let's look at some sample email addresses.  One of these is a good email address, the others are not:

    Pebbles 
    Wilma_Flintstone@Bedrock
    Fred_Flinstone@SlateConstruction.com
    
    

    Now, lets see how the three queries do.

    The first query, not null, accepts all the data. 'Pebbles' clearly does not have an email address, and Wilma's address is not complete.

    The second query, using LIKE, does not accept Pebbles.  Wilma's address is still a problem.

    The third query, using a regular expression, accepts only Fred's address at SlateConstruction.com.

    Here's a test page at Email and Phone Test for phone numbers and email addresses.  The test page splits the phone number into an area code and exchange, and the page splits the email address into an username and hostname. How the page does that is another day's topic.

    This is a topic that's really worth a deep dive.  Mastering Regular Expressions is really deep dive into the topic.  Another tip:  try using a regular expression whenever you get a chance.  It's a bit like playing the piano:  the more you try regular expressions, the better you'll understand them.

    Unfortunately, there are several versions and many implementations of regular expressions, and some implementations are better than others.  The GNU/Linux egrep command  and the PERL language have excellent regular expression implementations. Both are good, but they differ, so take time to read the documentation that comes with your favorite editor or text processor.




    No comments:

    Post a Comment