Practical Regex - Part 1/2 (The Common Operators)

Background

Theories about regular expressions were already in existence even before the Beatles or Led Zeppelin became famous. Regular expression engines were eventually implemented on Unix machines during the '70s. Consequently, powerful regex tools are also bundled by default in GNU/Linux and Mac OS X systems: the search utilities (find and grep) and the search & replace utilities (sed, awk, and perl).

A regular expression (regex) is a rule or pattern that is usually used to check/validate text, to search for text within a larger body of text, and sometimes to modify the matched text. Regexes are used by word processors, text editors, compilers, interpreters, email validators, autocomplete widgets, search engines, etc. Regular expressions are magical creatures, but they can be really intimidating. As a programmer, it's really valuable to know their essential concepts: you don't have to master them, you just need to understand their most common and important constructs in order to be productive. Some problems are messy or hard to solve if you don't use regex or don't know how to leverage it.

There are only a handful of core ideas in regex:

  1. Predefined Groups: \d \D \w \W \s \S
  2. Alternatives: |
  3. Quantification: ? * + {}
  4. Grouping: ()
  5. Special Characters: . [] [-] [^]
  6. Anchors: ^ $

Still, from my experience, only the 8 operators in the table below are commonly used. If you master these 8 operators first, the rest can be easily learned as needed. Tip: memorize these 8 and really understand them, or at least pay particular attention to them, since they are the foundation of the other (less common) operators.

Regex Big Ideas

Concept               Usual Operators
Predefined Groups     \d  \w
Alternatives          |
Quantification        *  +  ?
Grouping              ()
Special Characters    .

The core ideas of regex are really simple on their own, but when you combine them like a team, their power grows dramatically (as you will see below), and they can handle fairly complex cases. Part 1 of this series deals with matching text, while part 2 will deal with modifying/manipulating the matched text.

Getting Hands Dirty

In JavaScript, a regex is usually written between 2 forward slashes, and quotes are not used to enclose it. The same construct appears in several other programming languages (Perl and Ruby, for example).

/regex/

The good thing about regular expressions is that the essential concepts/ideas are transferable across various programming languages (JavaScript, Python, Java, PHP, C++, or even SQL). In this post, I'll illustrate regex using JavaScript since you don't need to install or configure anything, and its interpreter is easy to access using just a web browser. In JavaScript, one of the ways to use regex is via the match string function: it returns an array of matches if there's a match, or null otherwise.

'The strings should be placed here.'.match(/regex/)
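To make the return value concrete, here is a small sketch of what match gives back. The sample strings and variable names are made up for illustration. For a regex written without flags, as in this article, the matched text sits at position 0 of the returned array, and the array also carries an index property.

```javascript
// A successful match returns an array; the matched text is at position 0.
const found = 'hello regex'.match(/regex/);
console.log(found[0]);     // 'regex'
console.log(found.index);  // 6 (zero-based position where the match starts)

// An unsuccessful match returns null, not an empty array.
const missing = 'hello world'.match(/regex/);
console.log(missing);      // null

// When only a yes/no answer is needed, the regex's own test() returns a boolean.
const hasRegex = /regex/.test('hello regex');
console.log(hasRegex);     // true
```

Because a failed match is null, reading found[0] without checking first would throw an error; that is why the practical examples in this article compare the result against null.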

The code samples in this article will run in any browser (Chrome, Firefox, Safari, etc.) through its Console window. In Chrome, the Console panel (see screenshot below) can be opened with the Command + Option + J (Mac) or Control + Shift + J (Windows/Linux) keyboard shortcut.

Chrome Console Panel

I. Predefined Groups:

These are sets of predefined classes/groups used for easy comparison. Note that each rule matches a single character only.

Digit \d matches any single digit from 0 to 9.

Word \w matches any letter (both lowercase and uppercase), digit (0-9), or the underscore (_). Hence, it has 63 possible values (26 lowercase + 26 uppercase + 10 digits + 1 underscore). A word character is similar to the notion of a character allowed in typical variable names in programming: tempVar42 and temp_var are valid, but temp-var is not.

The samples below will all return a match:

'0'.match(/\d/)
'7'.match(/\d/)
'A'.match(/\w/)
'w'.match(/\w/)
'_'.match(/\w/)

The samples below will have no match and will all return null. In line 1, d is a letter and not a digit; in line 2, $ is a symbol and not a letter, digit, or underscore.

'd'.match(/\d/)
'$'.match(/\w/)
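The count of 63 word characters can be checked with a short loop. This is just a sketch: it only scans the ASCII range (codes 0 to 127), and the variable names are my own.

```javascript
// Count the ASCII characters (codes 0 to 127) that \w accepts.
let wordCharCount = 0;
for (let code = 0; code < 128; code++) {
  if (String.fromCharCode(code).match(/\w/) !== null) {
    wordCharCount++;
  }
}
console.log(wordCharCount); // 63

// List the characters of a name that \w rejects, one character at a time.
const rejected = [...'temp-var'].filter(ch => ch.match(/\w/) === null);
console.log(rejected);      // ['-']
```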

II. Alternatives:

Vertical bar (|) is used to specify the allowed/alternate patterns.

The samples below will all return a match:

'Mr'.match(/Mr/)
'Mr'.match(/Mr|Mrs/)
'Mr'.match(/Mr|Mrs|Ms/)

The samples below will have no match and will all return null since Engr is not matched by any rule.

'Engr'.match(/Mr/)
'Engr'.match(/Mr|Mrs|Ms/)
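One detail worth knowing: in JavaScript, the alternatives are tried from left to right, and the first one that succeeds wins. A small sketch (the variable names are illustrative only):

```javascript
// 'Mr' is listed first, so it wins even though 'Mrs' would also fit.
const shortFirst = 'Mrs'.match(/Mr|Mrs/);
console.log(shortFirst[0]); // 'Mr'

// Listing the longer alternative first gives the longer match.
const longFirst = 'Mrs'.match(/Mrs|Mr/);
console.log(longFirst[0]);  // 'Mrs'
```

For a plain yes/no check the order does not matter, but it does once you care about which text was actually matched.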

Practical Example:

var status = 'in_progress';

if (status.match(/draft|in_progress|for_review/) !== null){
  // Add this article to the list of unpublished.
}
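Keep in mind that the check above passes whenever one of the words appears anywhere inside the status. If the status must be exactly one of the three values, one possible approach is to combine the alternatives with grouping and anchors, both described later in this article. The status value 'not_a_draft' below is a made-up example.

```javascript
const unpublished = /draft|in_progress|for_review/;
// Stricter variant: the whole string must be one of the alternatives.
const exactUnpublished = /^(draft|in_progress|for_review)$/;

const looseHit = 'not_a_draft'.match(unpublished) !== null;
const exactHit = 'not_a_draft'.match(exactUnpublished) !== null;
const exactOk = 'for_review'.match(exactUnpublished) !== null;

console.log(looseHit); // true ('draft' appears inside the string)
console.log(exactHit); // false
console.log(exactOk);  // true
```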

III. Quantification:

Plus (+), asterisk (*), and question mark (?) specify the number of times the pattern on their left may occur.

+ means one or more times (the pattern on its left occurs at least once).

* means zero or more times (like +, but the pattern is allowed to be absent). 

? means zero or one time only (the pattern is optional, and could only occur once). 

The samples below will all return a match:

'2015'.match(/\d+/)
'2015'.match(/\d*/)
'7'.match(/\d+/)
'7'.match(/\d*/)
'7'.match(/\d?/)
''.match(/\d*/)
''.match(/\d?/)

'admin42'.match(/\w+/)
'admin42'.match(/\w*/)
'__init__'.match(/\w+/)
'__init__'.match(/\w*/)
'Z'.match(/\w+/)
'Z'.match(/\w*/)
'Z'.match(/\w?/)
''.match(/\w*/)
''.match(/\w?/)
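All of the samples above return a match, but what they match differs. The sketch below prints the matched text to show the difference; the string 'abc' and the variable names are only for illustration.

```javascript
const plus = '2015'.match(/\d+/);      // takes as many digits as it can
const optional = '2015'.match(/\d?/);  // takes at most one digit
const starEmpty = 'abc'.match(/\d*/);  // zero digits still counts as a match
const plusNone = 'abc'.match(/\d+/);   // at least one digit is required

console.log(plus[0]);      // '2015'
console.log(optional[0]);  // '2'
console.log(starEmpty[0]); // '' (an empty match, which is not the same as null)
console.log(plusNone);     // null
```

This is why * and ? never return null on their own: an empty match is always available to them.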

IV. Grouping:

Parentheses () are used to group the patterns.

The samples below will all return a match; the s becomes optional because of the ? operator:

'Mr'.match(/(Mr)/)
'Mr'.match(/M(rs?|s)|Engr/)
'Mrs'.match(/M(rs?|s)|Engr/)
'Ms'.match(/M(rs?|s)|Engr/)
'Engr'.match(/M(rs?|s)|Engr/)
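Grouping matters most when combined with alternatives, because | otherwise splits the whole pattern. A minimal sketch comparing the two (the patterns and the word 'say' are my own examples):

```javascript
const ungrouped = /Mr|s/;  // reads as: 'Mr' OR 's'
const grouped = /M(r|s)/;  // reads as: 'M' followed by 'r' OR 's'

console.log('Ms'.match(ungrouped)[0]); // 's' (only the second alternative matched)
console.log('Ms'.match(grouped)[0]);   // 'Ms'

const ungroupedSay = 'say'.match(ungrouped);
const groupedSay = 'say'.match(grouped);
console.log(ungroupedSay !== null);    // true (any 's' is enough)
console.log(groupedSay);               // null (there is no 'M')
```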

Practical Example:

 // The dots are escaped (\.) so that they match literal dots only.
 var nonProdServers = /(local|stage|dev)\.travel\.cnn\.com/;

 // Check if the widget script is hosted in non-prod servers.
 if (CNNTravelMaps.HOST.match(nonProdServers) !== null) {
   // Do the non-prod code here.
 }

V. Special Characters:

Meta-characters like dot (.) and brackets ([]) serve special purposes. Dot is very powerful since it matches any single character, including the dot itself. Dot is only avoided when dealing with patterns spanning multiple lines, since it does not match line breaks. But for most cases, dot is very handy and safe to use.
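The behavior of the dot can be sketched with a 3-character pattern. The pattern /a.b/ and the sample strings are just illustrations.

```javascript
const anyMiddle = /a.b/; // 'a', then any single character, then 'b'

const dashOk = 'a-b'.match(anyMiddle) !== null;
const dotOk = 'a.b'.match(anyMiddle) !== null;
const newlineOk = 'a\nb'.match(anyMiddle) !== null;
const emptyOk = 'ab'.match(anyMiddle) !== null;

console.log(dashOk);    // true
console.log(dotOk);     // true (dot matches the dot itself)
console.log(newlineOk); // false (dot does not match a line break)
console.log(emptyOk);   // false (dot needs exactly one character)
```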

The samples below will all return a match. Note that lines 3-5 share the same rule, which also applies to lines 1-2. The regex in lines 3-5 applies to any chain of such terms (a single digit followed by a single word character), however long: it showcases the power of combining the concepts of predefined groups (\d\w), grouping (()), quantification (+ and ?), and special characters (.):

'2x+3y'.match(/\d\w.\d\w/)
'2x-3y'.match(/\d\w.\d\w/)
'2x-3y'.match(/(\d\w.?)+/)
'2x+3y-4z'.match(/(\d\w.?)+/)
'2x+3y-4z+5w-7z'.match(/(\d\w.?)+/)
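Printing the matched text shows how far the rule reaches, and also where it stops. The input '12x+3y' and the multi-digit variant below are my own additions, offered as a sketch rather than a complete polynomial parser.

```javascript
const terms = /(\d\w.?)+/;
console.log('2x+3y-4z'.match(terms)[0]);    // '2x+3y-4z'

// With a two-digit coefficient, \d takes '1', \w takes '2', and .? takes 'x',
// so the match stops before the '+'.
console.log('12x+3y'.match(terms)[0]);      // '12x'

// Allowing one or more digits (\d+) covers the whole expression.
const multiDigit = /(\d+\w.?)+/;
console.log('12x+3y'.match(multiDigit)[0]); // '12x+3y'
```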

Practical Example (simple email validator):

'lorem000ipsum@regex.edu'.match(/\w+@\w+.(com|org|edu)/)
'johny_bravo_2015@hellokitty.com'.match(/\w+@\w+.(com|org|edu)/)
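Since the dot matches any character, the simple validator above is more permissive than it looks. A somewhat stricter sketch escapes the dot and adds anchors (both covered in the Notes). The addresses below are made up, and this is still far from a complete email validator.

```javascript
const looseEmail = /\w+@\w+.(com|org|edu)/;
// Escaped dot (\.) plus anchors (^ and $) so the whole string must fit.
const stricterEmail = /^\w+@\w+\.(com|org|edu)$/;

const looseNoDot = 'user@hostXcom'.match(looseEmail) !== null;
const strictNoDot = 'user@hostXcom'.match(stricterEmail) !== null;
const looseLongTld = 'user@host.community'.match(looseEmail) !== null;
const strictLongTld = 'user@host.community'.match(stricterEmail) !== null;
const strictValid = 'lorem000ipsum@regex.edu'.match(stricterEmail) !== null;

console.log(looseNoDot);    // true (the dot matched 'X')
console.log(strictNoDot);   // false
console.log(looseLongTld);  // true ('com' is found inside 'community')
console.log(strictLongTld); // false
console.log(strictValid);   // true
```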

Notes

These are the other constructs that are sometimes needed for other (less common) cases. It is important that you are also aware of them so that you can dig deeper into them later as needed.

Anchors (for matching at the start or end of the text, and usually for efficiency).

^ checks if the string/line starts with the specified regex.

'https://www.ranelpadon.com'.match(/^https/)

$ checks if the string/line ends with the specified regex.

'ftp://ranelpadon.com/sites/default/files/sitemap.pdf'.match(/pdf$/)

Without anchors, a regex match is a "contains" operation: the pattern may be found anywhere in the string, including at its start or end.
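The difference between a "contains" check and an anchored check can be sketched as follows. The file names and variable names are illustrative only.

```javascript
const containsPdf = /pdf/;
const endsWithPdf = /pdf$/;

const containsHit = 'sitemap.pdf.bak'.match(containsPdf) !== null;
const endsHit = 'sitemap.pdf.bak'.match(endsWithPdf) !== null;
console.log(containsHit); // true ('pdf' appears somewhere)
console.log(endsHit);     // false (the string ends with 'bak')

// Using both anchors forces the whole string to fit the pattern.
const allDigits = /^\d+$/;
const yearOk = '2015'.match(allDigits) !== null;
const mixedOk = '2015abc'.match(allDigits) !== null;
console.log(yearOk);      // true
console.log(mixedOk);     // false
```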

Ranged/Precise Selectors 

{m,n} repeats the regex on its left m to n times. Similarly, {m} repeats the regex on its left exactly m times. In JavaScript, write the range without a space after the comma.

'12-19-2015'.match(/\d{2}-\d{2}-\d{4}/)
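A range is handy when the number of digits may vary. The sketch below accepts 1 or 2 digits for the month and day; the pattern name and inputs are my own. It also shows a common gotcha: in the JavaScript engines I am aware of, a space inside the braces makes them ordinary characters instead of a range.

```javascript
// 1 or 2 digits, dash, 1 or 2 digits, dash, exactly 4 digits.
const flexibleDate = /^\d{1,2}-\d{1,2}-\d{4}$/;

const fullDate = '12-19-2015'.match(flexibleDate) !== null;
const shortDate = '1-9-2015'.match(flexibleDate) !== null;
const badDate = '123-19-2015'.match(flexibleDate) !== null;
console.log(fullDate);  // true
console.log(shortDate); // true
console.log(badDate);   // false (3 digits exceed the range)

// With a space after the comma, {1, 2} is no longer treated as a range.
const spacedRange = '12'.match(/\d{1, 2}/);
console.log(spacedRange); // null
```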

Note that if you want to match the forward slash character literally, you must "escape" it with a backslash, since the forward slash is the delimiter of the regex pattern, as in /regex/. An unescaped forward slash in the middle of the pattern confuses the regex parser, because it reads the slash as the end of the pattern. The same applies if you want to match the usual regex operators literally (+, *, ?, etc.).

'12/19/2015'.match(/\d{2}\/\d{2}\/\d{4}/)
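The same escaping applies to the operators. Here is a small sketch with + and with a price pattern; the strings and names are made up for illustration.

```javascript
const unescapedPlus = /3+4/; // one or more 3s, followed by 4
const escapedPlus = /3\+4/;  // a 3, a literal plus sign, then a 4

const sumUnescaped = '3+4'.match(unescapedPlus);
const repeated = '334'.match(unescapedPlus);
const sumEscaped = '3+4'.match(escapedPlus);
console.log(sumUnescaped);  // null
console.log(repeated[0]);   // '334'
console.log(sumEscaped[0]); // '3+4'

// The dollar sign and the dot are operators too, so both are escaped here.
const price = 'Total: $19.99'.match(/\$\d+\.\d{2}/);
console.log(price[0]);      // '$19.99'
```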

[] checks if the character is in the "box". A dash (-) inside [] denotes a range of characters. This is useful if you want to match a subset of letters or digits.

'e'.match(/[aeiou]/)
'312312233311'.match(/[1-3]+/)
'THEBIGWORDSONLY'.match(/[A-Z]+/)
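The [^] form from the list of core ideas is the opposite of []: it matches any single character that is not in the box. A short sketch, with sample words of my own choosing:

```javascript
const vowel = /[aeiou]/;
const noVowel = 'rhythm'.match(vowel);
console.log(noVowel);        // null (no vowel in the box is present)

// A caret right after the opening bracket negates the box.
const nonVowel = /[^aeiou]/;
const firstConsonant = 'audio'.match(nonVowel);
console.log(firstConsonant[0]); // 'd' (the first character that is not a vowel)

// Ranges can be combined inside one box.
const hexDigits = /^[0-9a-f]+$/;
const hexOk = 'ff00aa'.match(hexDigits) !== null;
const hexBad = 'ff00zz'.match(hexDigits) !== null;
console.log(hexOk);          // true
console.log(hexBad);         // false
```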

Regex Visualizer

There is a cool website that will help you visualize a regex pattern. For example, this is the infographic for the simple email validator.

That's it for now. Take your time to review the 8 operators cited in this article one last time. In part 2, we'll use them again in the context of finding strings and manipulating them on the fly, which is also a very common regex workflow and a lot more fun.

Soon: Practical Regex - Part 2/2 (Capture and Manipulate)
