Regular Expressions

Before we look at the split(), match(), search(), and replace() methods of the String object again, we need to look at regular expressions and the RegExp object. Regular expressions provide a means of defining a pattern of characters, which we can then use to split, search, or replace characters in a string where they fit the defined pattern.

JavaScript's regular expression syntax borrows heavily from the regular expression syntax of Perl, another scripting language. The latest versions of other languages, such as VBScript, have also incorporated regular expressions, as have many application programs, such as Microsoft Word, in which the Find facility allows regular expressions to be used. You'll find that your regular expression knowledge will prove useful even outside JavaScript.

The use of regular expressions in JavaScript is through the RegExp object, which is a native JavaScript object, as are String, Array, and so on. There are two ways of creating a new RegExp object. The easier is with a regular expression literal, such as the following:

var myRegExp = /\b'|'\b/;

The forward slashes (/) mark the start and end of the regular expression. This is a special syntax that tells JavaScript that the code is a regular expression, much as quote marks define a string's start and end. Don't worry about the actual expression's syntax yet (the \b'|'\b)—we'll be explaining that in detail shortly.

Alternatively, we could use the RegExp object's constructor function RegExp() and type the following:

var myRegExp = new RegExp("\\b'|'\\b");

Either way of specifying a regular expression is fine, though the former method is a shorter, more efficient one for JavaScript to use, and therefore generally preferred. For much of the remainder of the chapter, we'll use the first method. The main reason for using the second method is that it allows the regular expression to be determined at runtime (as the code is executing and not when writing the code), for example, if we want to base it on user input.

Once we get familiar with regular expressions, we will come back to the second way of defining them using the RegExp() constructor. As you can see, the syntax of regular expressions is slightly different when using the second method, and we'll explain this in detail then.

While we'll be concentrating on the use of the RegExp object as a parameter for the String object's split(), replace(), match(), and search() methods, the RegExp object does have its own methods and properties. For example, the test() method allows you to test to see if the string passed to it as a parameter contains a pattern matching that defined in the RegExp object. We'll see the test() method in use in an example shortly.
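As a quick sketch of the ideas so far, the following lines create the same pattern in both ways and then try out test(). The variable names and sample strings are our own, chosen purely for illustration.

```javascript
// Two ways of creating the same pattern.
var literalRegExp = /\b'|'\b/;
var constructedRegExp = new RegExp("\\b'|'\\b");

// The source property holds the pattern text; it is the same for both objects.
var samePattern = (literalRegExp.source === constructedRegExp.source); // true

// test() returns true if the pattern is found anywhere in the string.
var hasQuote = literalRegExp.test("It's here");          // true
var noQuote = constructedRegExp.test("No quote marks");  // false

console.log(samePattern, hasQuote, noQuote);
```

Notice the doubled backslashes in the constructor version: inside a string, a backslash has to be escaped before it reaches the regular expression.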

Simple Regular Expressions

Defining patterns of characters using regular expression syntax can get fairly complex. In this section we'll explore just the basics of regular expression patterns. The best way to do this is through examples.

Let's start by looking at an example where we want to do a simple text replacement using the replace() method and a regular expression. Imagine we have the following string:

var myString = "Paul, Paula, Pauline, paul, Paul";

and we want to replace any occurrence of the name "Paul" with "Ringo."

Well, the pattern of text we need to look for is simply Paul. Representing this as a regular expression, we just have this:

var myRegExp = /Paul/;

As we saw earlier, the forward slash characters mark the start and end of the regular expression. Now let's use this with the replace() method.

myString = myString.replace(myRegExp, "Ringo");

You can see the replace() method takes two parameters: the RegExp object that defines the pattern to be searched and replaced, and the replacement text.

If we put this all together in an example, we have the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<body>
<script language="JavaScript" type="text/JavaScript">
  var myString = "Paul, Paula, Pauline, paul, Paul";
  var myRegExp = /Paul/;
  myString = myString.replace(myRegExp, "Ringo");
  alert(myString);
</script>
</body>
</html>

If you load this code into a browser, you will see the screen shown in Figure 8-3.

Figure 8-3

We can see that this has replaced the first occurrence of Paul in our string. But what if we wanted all the occurrences of Paul in the string to be replaced? The two at the far end of the string are still there, so what happened?

Well, by default the RegExp object only looks for the first matching pattern, in this case the first Paul, and then stops. This is common and important behavior for RegExp objects. Regular expressions tend to start at one end of a string and look through the characters until the first complete match is found, then stop.

What we want is a global match, which is a search for all possible matches to be made and replaced. To help us out, the RegExp object has three attributes we can define. You can see these listed in the following table.

Attribute Character	Description
g	Global match. This looks for all matches of the pattern rather than stopping after the first match is found.
i	Pattern is case-insensitive. For example, Paul and paul are considered the same pattern of characters.
m	Multi-line flag. Only available in IE 5.5+ and NN 6+, this specifies that the special characters ^ and $ can match the beginning and the end of lines as well as the beginning and end of the string. We'll learn about these characters later in the chapter.

If we change our RegExp object in the code to

var myRegExp = /Paul/gi;

a global case-insensitive match will be made. Running the code now produces the result shown in Figure 8-4.

Figure 8-4

This looks like it has all gone horribly wrong. The regular expression has matched the Paul substrings at the start and the end of the string, and the penultimate paul, just as we wanted. However, the Paul substrings inside Pauline and Paula have also been replaced.
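To compare the effect of the attributes side by side, here is a small sketch that runs the same replacement with no attributes, with g alone, and with both g and i. The variable names are ours and are not part of the example page.

```javascript
var names = "Paul, Paula, Pauline, paul, Paul";

// No attributes: only the first match is replaced.
var firstOnly = names.replace(/Paul/, "Ringo");
// "Ringo, Paula, Pauline, paul, Paul"

// g: every match is replaced, but the lowercase paul is left alone.
var allCaseSensitive = names.replace(/Paul/g, "Ringo");
// "Ringo, Ringoa, Ringoine, paul, Ringo"

// gi: every match is replaced, whatever its case.
var allAnyCase = names.replace(/Paul/gi, "Ringo");
// "Ringo, Ringoa, Ringoine, Ringo, Ringo"

console.log(firstOnly);
console.log(allCaseSensitive);
console.log(allAnyCase);
```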

The RegExp object has done its job correctly. We asked for all patterns of the characters Paul to be replaced and that's what we got. What we actually meant was for all occurrences of Paul, when it's a single word and not part of another word, such as Paula, to be replaced. The key to making regular expressions work is to define exactly the pattern of characters so that only that pattern can match and no other. So let's do that.

We want paul or Paul to be replaced.
We don't want it replaced when it's actually part of another word, as in Pauline.

How do we specify this second condition? How do we know when the word is joined to other characters, rather than just joined to spaces or punctuation or just the start or end of the string?

To see how we can achieve this with regular expressions, we need to enlist the help of regular expression special characters. We'll look at these in the next section, by the end of which we should be able to solve the problem.

Regular Expressions: Special Characters

We will be looking at three types of special characters in this section.

Text, Numbers, and Punctuation

The first group of special characters we'll look at are the character classes. By character classes, I mean special characters that stand for a whole category of characters, such as digits, letters, or white space characters. They are displayed in the following table.

Character Class	Characters It Matches	Example
\d	Any digit from 0 to 9	\d\d matches 72, but not aa or 7a
\D	Any character that is not a digit	\D\D\D matches abc, but not 123 or 8ef
\w	Any word character; that is A–Z, a–z, 0–9, and the underscore character	\w\w\w\w matches Ab_2, but not £$%* or Ab_@
\W	Any non-word character	\W matches @, but not a
\s	Any white space character, including tab, newline, carriage return, formfeed, and vertical tab	\s matches tab
\S	Any non-white space character	\S matches A, but not tab
.	Any single character other than the new line character \n	. matches a or 4 or @
[...]	Any one of the characters between the brackets	[abc] will match a or b or c, but nothing else; [a-z] will match any character in the range a to z
[^...]	Any one character, but not one of those inside the brackets	[^abc] will match any character except a or b or c; [^a-z] will match any character that is not in the range a to z

Note that uppercase and lowercase characters mean very different things, so you need to be extra careful with case when using regular expressions.
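The examples in the table can be checked with the test() method. The following is only a sketch; the variable names are ours, and each result depends on test() looking for the pattern anywhere inside the string.

```javascript
// \d\d needs two digits next to each other.
var twoDigitsOk = /\d\d/.test("72");      // true
var twoDigitsBad = /\d\d/.test("7a");     // false

// \w\w\w\w needs four word characters in a row.
var fourWordOk = /\w\w\w\w/.test("Ab_2");   // true
var fourWordBad = /\w\w\w\w/.test("Ab_@");  // false

// \s matches white space such as a tab; \S is the opposite.
var hasSpace = /\s/.test("a\tb");         // true
var onlySpace = /\S/.test("\t");          // false

// Square brackets list the allowed (or, with ^, the disallowed) characters.
var inSet = /[abc]/.test("b");            // true
var notInSet = /[^abc]/.test("b");        // false
var outsideRange = /[^a-z]/.test("abc4"); // true, because of the 4

console.log(twoDigitsOk, fourWordOk, hasSpace, inSet, outsideRange);
```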

Let's look at an example. To match a telephone number in the format 1-800-888-5474, the regular expression would be as follows:

\d-\d\d\d-\d\d\d-\d\d\d\d

You can see that there's a lot of repetition of characters here, which makes the expression quite unwieldy. To make this simpler, regular expressions have a way of defining repetition. We'll see this a little later in the chapter, but first let's look at another example.

Try It Out – Checking a Passphrase for Alphanumeric Characters

We'll use what we've learned so far about regular expressions in a full example in which we check that a passphrase contains only letters and numbers; that is, alphanumeric characters, and not punctuation or symbols like @, %, and so on.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Example</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<script language="JavaScript" type="text/JavaScript">
function regExpIs_valid(text)
{
   var myRegExp = /[^a-z\d ]/i;
   return !(myRegExp.test(text));
}
function butCheckValid_onclick()
{
   if (regExpIs_valid(document.form1.txtPhrase.value) == true)
   {
      alert("Your passphrase contains only valid characters");
   }
   else
   {
      alert("Your passphrase contains one or more invalid characters");
   }
}
</script>
<form name=form1>
Enter your passphrase:
<br>
<input type="text" name=txtPhrase>
<br>
<input type="button" value="Check Character Validity" name=butCheckValid
   onclick="butCheckValid_onclick()">
</form>
</body>
</html>

Save the page as ch8_examp2.htm, and then load it into your browser. Type just letters, numbers, and spaces into the text box; click the Check Character Validity button; and you'll be told that the phrase contains valid characters. Try putting punctuation or special characters like @, ^, $, and so on in the text box, and you'll be informed that it's invalid.

How It Works

Let's start by looking at the regExpIs_valid() function defined at the top of the script block. This does the validity checking of our passphrase using regular expressions.

function regExpIs_valid(text)
{
   var myRegExp = /[^a-z\d ]/i;
   return !(myRegExp.test(text));
}

The function takes just one parameter: the text we want to check for validity. We then declare a variable myRegExp and set it to a new regular expression, which implicitly creates a new RegExp object.

The regular expression itself is fairly simple, but first let's think about what pattern we are looking for. What we want to find out is whether our passphrase string contains any characters that are not letters between A–Z and a–z, numbers between 0–9, or a space. Let's see how this translates into a regular expression.

First we used square brackets with the ^ symbol.

[^]

This means we want to match any character that is not one of the characters specified inside the square brackets. Next we added our a-z, which specifies any character in the range a through to z.

[^a-z]

So far our regular expression matches any character that is not between a and z. Note that, because we added the i to the end of the expression definition, we've made the pattern case-insensitive. So our regular expression actually matches any character not between A and Z or a and z.

Next we added \d to indicate any digit character, that is, any character between 0 and 9.

[^a-z\d]

So our expression matches any character that is not between a and z, A and Z, or 0 and 9. Finally, we decided that a space is valid, so we add that inside the square brackets as shown in the following:

[^a-z\d ]

Putting this all together, we have a regular expression that will match any character that is not a letter, a digit, or a space.

On the second and final line of the function we use the RegExp object's test() method to return a value.

return !(myRegExp.test(text));

The test() method of the RegExp object checks the string passed as its parameter to see if the characters specified by the regular expression syntax match anything inside the string. If they do, true is returned; if not, false is returned. Our regular expression will match the first invalid character found, so if we get a result of true, we have an invalid passphrase. However, it's a bit illogical for an is_valid function to return true when it's invalid, so we reverse the result returned by adding the NOT operator (!).
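To see the function at work outside the page, we can call it with a few sample passphrases. This is just a sketch; the sample strings are ours. Notice that an empty string counts as valid, because it contains no invalid characters for the pattern to find.

```javascript
function regExpIs_valid(text)
{
   // Matches the first character that is not a letter, a digit, or a space.
   var myRegExp = /[^a-z\d ]/i;
   return !(myRegExp.test(text));
}

var plainPhrase = regExpIs_valid("Open sesame 123");  // true
var withSymbol = regExpIs_valid("p@ssphrase");        // false
var emptyPhrase = regExpIs_valid("");                 // true

console.log(plainPhrase, withSymbol, emptyPhrase);
```

If an empty passphrase should be rejected, that check would have to be added separately, for example by testing text.length first.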

Previously we saw the two-line validity checker function using regular expressions. Just to show how much more coding is required to do the same thing without regular expressions, I've created a second function that does the same as regExpIs_valid() but without regular expressions.

function is_valid(text)
{
   var isValid = true;
   var validChars = "abcdefghijklmnopqrstuvwxyz1234567890 ";
   var charIndex;
   for (charIndex = 0; charIndex < text.length;charIndex++)
   {
      if ( validChars.indexOf(text.charAt(charIndex).toLowerCase()) < 0)
      {
         isValid = false;
         break;
      }
   }
   return isValid;
}

This is probably as small as the non-regular expression version can be, and yet it's still 12 lines long. That's six times the amount of code for the regular expression version.

The principle of this function is similar to the regular expression version. We have a variable, validChars, that contains all the characters we consider to be valid. We then use the charAt() method in a for loop to get each character in the passphrase string and check whether it exists in our validChars string. If it doesn't, we know we have an invalid character.

In this example, the non-regular expression version of the function is 12 lines, but with a more complex problem we could find it takes 20 or 30 lines to do the same thing a regular expression can do in just a handful of lines of code.

Back to our actual code, the other function defined in the script block is butCheckValid_onclick(). As the name suggests, this is called when the butCheckValid button defined in the body of the page is clicked.

This function calls our regExpIs_valid() function in an if statement to check whether the passphrase entered by the user in the txtPhrase text box is valid. If it is, an alert box is used to inform the user.

function butCheckValid_onclick()
{
   if (regExpIs_valid(document.form1.txtPhrase.value) == true)
   {
      alert("Your passphrase contains only valid characters");
   }

If it isn't, another alert box is used to let the user know that the text was invalid.

   else
   {
      alert("Your passphrase contains one or more invalid characters");
   }
}

Repetition Characters

Regular expressions include something called repetition characters, which are a way of specifying how many of the last item or character we want to match. This proves very useful, for example, if we want to specify a phone number that repeats a character a specific number of times. The following table lists some of the most common repetition characters and what they do.

Special Character	Meaning	Example
{n}	Match n of the previous item	x{2} matches xx
{n,}	Match n or more of the previous item	x{2,} matches xx, xxx, xxxx, xxxxx, and so on
{n,m}	Match at least n and at most m of the preceding item	x{2,4} matches xx, xxx, and xxxx
?	Match the previous item zero or one times	x? matches nothing or x
+	Match the previous item one or more times	x+ matches x, xx, xxx, xxxx, xxxxx, and so on
*	Match the previous item zero or more times	x* matches nothing, or x, xx, xxx, xxxx, and so on
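Here is a small sketch that tries out each row of the table. To make sure the whole string has to fit the pattern, we have wrapped each expression in the position characters ^ and $, which are explained later in this section; that wrapping is our own addition for the purposes of the test.

```javascript
var exactlyTwo = /^x{2}$/.test("xx");          // true
var exactlyTwoBad = /^x{2}$/.test("xxx");      // false

var twoOrMore = /^x{2,}$/.test("xxxxx");       // true
var twoOrMoreBad = /^x{2,}$/.test("x");        // false

var twoToFour = /^x{2,4}$/.test("xxxx");       // true
var twoToFourBad = /^x{2,4}$/.test("xxxxx");   // false

var zeroOrOne = /^x?$/.test("");               // true
var oneOrMore = /^x+$/.test("");               // false
var zeroOrMore = /^x*$/.test("");              // true

console.log(exactlyTwo, twoOrMore, twoToFour, zeroOrOne, oneOrMore, zeroOrMore);
```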

We saw earlier that to match a telephone number in the format 1-800-888-5474, the regular expression would be \d-\d\d\d-\d\d\d-\d\d\d\d. Let's see how this would be simplified using the repetition characters.

The pattern we're looking for starts with one digit followed by a dash, so we need the following:

\d-

Next are three digits followed by a dash. This time we can use the repetition special characters: \d{3} will match exactly three occurrences of \d, the any-digit character.

\d-\d{3}-

Next there are three digits followed by a dash again, so now our regular expression looks like this:

\d-\d{3}-\d{3}-

Finally, the last part of the expression is four digits, which is \d{4}.

\d-\d{3}-\d{3}-\d{4}

We'd declare this regular expression like this:

var myRegExp = /\d-\d{3}-\d{3}-\d{4}/;

Remember that the first / and last / tell JavaScript that what is in between those characters is a regular expression. JavaScript creates a RegExp object based on this regular expression.
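As a rough check that the shortened expression behaves like the long one, we can test both against the same telephone number. The variable names and the extra sample strings are ours.

```javascript
var longRegExp = /\d-\d\d\d-\d\d\d-\d\d\d\d/;
var shortRegExp = /\d-\d{3}-\d{3}-\d{4}/;

var phone = "1-800-888-5474";

var longMatches = longRegExp.test(phone);                  // true
var shortMatches = shortRegExp.test(phone);                // true

// One digit short at the end, so the last \d{4} cannot be satisfied.
var tooShort = shortRegExp.test("1-800-888-547");          // false

// The pattern is found even when it sits inside a longer string.
var embedded = shortRegExp.test("Call 1-800-888-5474 now"); // true

console.log(longMatches, shortMatches, tooShort, embedded);
```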

As another example, what if we have the string "Paul Paula Pauline," and we want to replace Paul and Paula with George? To do this, we would need a regular expression that matches both Paul and Paula.

Let's break this down. We know we want the characters Paul, so our regular expression starts as

Paul

Now we also want to match Paula, but if we make our expression Paula, this will exclude a match on Paul. This is where the special character ? comes in. It allows us to specify that the previous character is optional—it must appear zero (not at all) or one times. So, the solution is

Paula?

which we'd declare as

var myRegExp = /Paula?/;
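The following sketch puts this expression to work with replace(). We have assumed the g attribute so that every match is replaced, not just the first; the variable names are ours.

```javascript
var threeNames = "Paul Paula Pauline";

// The a is optional, so both Paul and Paula fit the pattern.
var replaced = threeNames.replace(/Paula?/g, "George");
// "George George Georgeine"

console.log(replaced);
```

Paul and Paula are both replaced, but so is the Paul inside Pauline. The position characters described next give us a way to deal with that.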

Position Characters

The third group of special characters we'll look at are those that allow you to specify either where the match should start or end or what will be on either side of the character pattern. For example, we might want our pattern to exist at the start or end of a string or line, or we might want it to be between two words. The following table lists some of the most common position characters and what they do.

Position Character	Description
^	The pattern must be at the start of the string, or if it's a multi-line string, then at the beginning of a line. For multi-line text (a string that contains carriage returns), we need to set the multi-line flag when defining the regular expression using /myregex/m. Note that this is only applicable to IE 5.5 and later and NN 6 and later.
$	The pattern must be at the end of the string, or if it's a multi-line string, then at the end of a line. For multi-line text (a string that contains carriage returns), we need to set the multi-line flag when defining the regular expression using /myregex/m. Note that this is only applicable to IE 5.5 and later and NN 6 and later.
\b	This matches a word boundary, which is essentially the point between a word character and a non-word character.
\B	This matches a position that's not a word boundary.

For example, if we wanted to make sure our pattern was at the start of a line, we would type the following:

^myPattern

This would match an occurrence of myPattern if it was at the beginning of a line.

To match the same pattern, but at the end of a line, we would type the following:

myPattern$
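The difference the multi-line flag makes can be seen in a short sketch. The two sample strings are ours; each contains a newline character (\n) so that myPattern sits at the start or end of a line but not at the start or end of the whole string.

```javascript
var startsSecondLine = "first line\nmyPattern starts this line";
var endsFirstLine = "this line ends with myPattern\nlast line";

// Without m, ^ and $ only match at the start and end of the whole string.
var startNoFlag = /^myPattern/.test(startsSecondLine);   // false
var endNoFlag = /myPattern$/.test(endsFirstLine);        // false

// With m, they also match at the start and end of each line.
var startWithFlag = /^myPattern/m.test(startsSecondLine); // true
var endWithFlag = /myPattern$/m.test(endsFirstLine);      // true

console.log(startNoFlag, endNoFlag, startWithFlag, endWithFlag);
```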

The word boundary special characters \b and \B can cause confusion, because they do not match characters but the positions between characters.

Imagine we had the string "Hello world!, let's look at boundaries said 007." defined in the code as follows:

var myString = "Hello world!, let's look at boundaries said 007.";

To make the word boundaries (that is, the boundaries between the words) of this string stand out, let's convert them to the | character.

var myRegExp = /\b/g;
myString = myString.replace(myRegExp, "|");
alert(myString);

We've replaced all the word boundaries, \b, with a |, and our message box looks like the one in Figure 8-5.

Figure 8-5

Note

Note that this does not work on Netscape Navigator version 4.0–4.05.

You can see that the position between any word character (letters, numbers, or the underscore character) and any non-word character is a word boundary. You'll also notice that the boundary between the start or end of the string and a word character is considered to be a word boundary. The end of this string is a full stop. So the boundary between that and the end of the string is a non-word boundary, and therefore no | has been inserted.

If we change the regular expression in the example, so that it replaces non-word boundaries as follows:

var myRegExp = /\B/g;

we get the result shown in Figure 8-6.

Figure 8-6

Note

Note that this does not work on Netscape Navigator version 4.0–4.05. Also note that on NN version 4.06+ (but not version 6 or later), no | is inserted at the end of the string after the period.

Now the position between a letter, number, or underscore and another letter, number, or underscore is considered a non-word boundary and is replaced by an | in our example. However, what is slightly confusing is that the boundary between two non-word characters, such as an exclamation mark and a comma, is also considered a non-word boundary. If you think about it, it actually does make sense, but it's easy to forget when creating regular expressions.
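For reference, here is a sketch that produces both versions of the string in one go, with the strings we would expect a current JavaScript engine to return shown as comments. The variable names are ours.

```javascript
var sentence = "Hello world!, let's look at boundaries said 007.";

// Mark every word boundary.
var wordBoundaries = sentence.replace(/\b/g, "|");
// "|Hello| |world|!, |let|'|s| |look| |at| |boundaries| |said| |007|."

// Mark every position that is not a word boundary.
var nonWordBoundaries = sentence.replace(/\B/g, "|");
// "H|e|l|l|o w|o|r|l|d!|,| l|e|t's l|o|o|k a|t b|o|u|n|d|a|r|i|e|s s|a|i|d 0|0|7.|"

console.log(wordBoundaries);
console.log(nonWordBoundaries);
```

Between them the two results mark every position in the string exactly once: each position is either a word boundary or it is not.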

You'll remember from when we started looking at regular expressions that I used the following example

<html>
<body>
<script language="JavaScript" type="text/JavaScript">
  var myString = "Paul, Paula, Pauline, paul, Paul";
  var myRegExp = /Paul/gi;
  myString = myString.replace(myRegExp, "Ringo");
  alert(myString);
</script>
</body>
</html>

to convert all instances of Paul or paul into Ringo.

However, we found that this code actually converts all instances of Paul to Ringo, even when inside another word.

One option to solve this problem would be to replace the string Paul only where it is followed by a non-word character. The special character for non-word characters is \W, so we need to alter our regular expression to the following:

var myRegExp = /Paul\W/gi;

This gives the result shown in Figure 8-7.

Figure 8-7

Note

Note that on Netscape 4.0–4.05, this actually gives the result

Ringo Paula, Pauline, Ringo Ringo

It's getting better, but it's still not what we want. Notice that the commas after the first Paul and after the lowercase paul have also been replaced, since they matched the \W character. Also, we're still not replacing the Paul at the very end of the string. That's because there is no character after the letter l in the last Paul for \W to match. What is after the l in the last Paul? Nothing, just the boundary between a word character and the end of the string, and therein lies our answer. What we want as our regular expression is Paul followed by a word boundary. Let's alter our regular expression to cope with that by entering the following:

var myRegExp = /Paul\b/gi;

Now we get the result we want, as shown in Figure 8-8.

Figure 8-8

At last we've got it right, and this example is finished.

Covering All Eventualities

Perhaps the trickiest thing about a regular expression is making sure it covers all eventualities. In the previous example our regular expression works with the string as defined, but does it work with the following?

var myString = "Paul, Paula, Pauline, paul, Paul, JeanPaul";

Here the Paul substring in JeanPaul will be changed to Ringo. We really only want to convert the substring Paul where it is on its own, with a word boundary on either side. If we change our regular expression code to

var myRegExp = /\bPaul\b/gi;

we have our final answer and can be sure that Paul, in any mix of upper and lower case, will be matched only where it stands as a whole word.
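To see how each refinement changes the outcome, the following sketch runs the three expressions we have tried against the longer string. The variable names are ours.

```javascript
var names = "Paul, Paula, Pauline, paul, Paul, JeanPaul";

// Paul followed by a non-word character: the commas are swallowed too.
var withNonWord = names.replace(/Paul\W/gi, "Ringo");
// "Ringo Paula, Pauline, Ringo Ringo JeanPaul"

// Paul followed by a word boundary: JeanPaul is still affected.
var withEndBoundary = names.replace(/Paul\b/gi, "Ringo");
// "Ringo, Paula, Pauline, Ringo, Ringo, JeanRingo"

// Paul with a word boundary on both sides.
var withBothBoundaries = names.replace(/\bPaul\b/gi, "Ringo");
// "Ringo, Paula, Pauline, Ringo, Ringo, JeanPaul"

console.log(withNonWord);
console.log(withEndBoundary);
console.log(withBothBoundaries);
```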

Grouping Regular Expressions

Our final topic under regular expressions, before we look at examples using the match(), replace(), and search() methods, is how we can group expressions. In fact it's quite easy. If we want a number of expressions to be treated as a single group, we just enclose them in parentheses, for example /(\d\d)/. Parentheses in regular expressions are special characters that group together character patterns and are not themselves part of the characters to be matched.

The question is, Why would we want to do this? Well, by grouping characters into patterns, we can use the special repetition characters to apply to the whole group of characters, rather than just one.

Let's take the string defined in myString below as an example.

var myString = "JavaScript, VBScript and Perl";

How could we match both JavaScript and VBScript using the same regular expression? The only thing they have in common is that they are whole words and they both end in Script. Well, an easy way would be to use parentheses to group the patterns Java and VB. Then we can use the ? special character to apply to each of these groups of characters to make our pattern any word having zero or one instances of the characters Java or VB, and ending in Script.

var myRegExp = /\b(VB)?(Java)?Script\b/gi;

If we break this expression down, we can see the pattern it requires is as follows:

A word boundary: \b
Zero or one instances of VB: (VB)?
Zero or one instances of Java: (Java)?
The characters Script: Script
A word boundary: \b

If we put this together, we get

var myString = "JavaScript, VBScript and Perl";
var myRegExp = /\b(VB)?(Java)?Script\b/gi;
myString = myString.replace(myRegExp, "xxxx");
alert(myString);

and the output of this code is shown in Figure 8-9.

Figure 8-9

If you look back at the special repetition characters table, you'll see that they apply to the previous item. This can be a character or, where they have been grouped using parentheses, the previous group of characters.

However, there is a potential problem with the regular expression we just defined. As well as matching VBScript and JavaScript, it also matches VBJavaScript. This is clearly not exactly what we meant.

To get around this we need to make use of both grouping and the special character |, which is the alternation character. It has an "or"-like meaning, similar to using || in if statements, and will match the characters on either side of itself.

Let's think about the problem again. We want the pattern to match VBScript or JavaScript. Clearly they have the Script part in common. So what we want is a new word starting with Java or starting with VB, and either way it must end in Script.

First, we know that the word must start with a word boundary.

\b

Next we know that we want either VB or Java to be at the start of the word. We've just seen that in regular expressions | provides the "or" we need, so in regular expression syntax we want

\b(VB|Java)

This would match the pattern VB or Java. Now we can just add the Script part.

\b(VB|Java)Script\b

So our final code looks like this:

var myString = "JavaScript, VBScript and Perl";
var myRegExp = /\b(VB|Java)Script\b/gi;
myString = myString.replace(myRegExp, "xxxx");
alert(myString);
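The difference between the two expressions shows up clearly if we test them against a few words. In this sketch the variable names and the test words are ours. As well as VBJavaScript, the first expression accepts the word Script on its own, because both of its groups are optional.

```javascript
var optionalGroups = /\b(VB)?(Java)?Script\b/i;
var alternation = /\b(VB|Java)Script\b/i;

// Both expressions accept the two words we are interested in.
var bothAcceptJava = optionalGroups.test("JavaScript") && alternation.test("JavaScript"); // true
var bothAcceptVB = optionalGroups.test("VBScript") && alternation.test("VBScript");       // true

// Only the first expression accepts these unwanted words.
var optionalTakesCombined = optionalGroups.test("VBJavaScript"); // true
var optionalTakesBare = optionalGroups.test("Script");           // true
var alternationTakesCombined = alternation.test("VBJavaScript"); // false
var alternationTakesBare = alternation.test("Script");           // false

// The replacement from the text, using the alternation version.
var result = "JavaScript, VBScript and Perl".replace(/\b(VB|Java)Script\b/gi, "xxxx");
// "xxxx, xxxx and Perl"

console.log(result);
```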

Reusing Groups of Characters

We can reuse the pattern specified by a group of characters later on in our regular expression. To refer to a previous group of characters, we just type \ and the order of the group. For example, the first group can be referred to as \1, the second as \2, and so on.

Let's look at an example. Say we have a list of numbers in a string, with each number separated by a comma. For whatever reason, we are not allowed to have the same numbers repeated after each other, so while

009,007,001,002,004,003

would be OK, the following:

007,007,001,002,002,003

would not be valid, because we have 007 and 002 repeated after themselves.

How can we find instances of repeated numbers and replace them with the word ERROR? We need to use the ability to refer to groups in regular expressions.

First let's define our string as follows:

var myString  = "007,007,001,002,002,003,002,004";

Now we know we need to search for a series of one or more number characters. In regular expressions the \d specifies any digit character, and + means one or more of the previous character. So far, that gives our regular expression as

\d+

We want to match a series of digits followed by a comma, so we just add the comma.

\d+,

This will match any series of digits followed by a comma, but how do we search for any series of digits followed by a comma, then followed again by the same series of digits? As the digits could be any digits, we can't add them directly into our expression like so

\d+,007

because this will not work with the 002 repeat. What we need to do is put the first series of digits in a group, then we can specify that we want to match that group of digits again. This can be done using \1, which says, "match the characters found in the first group defined using parentheses." Put all this together, and we have the following:

(\d+),\1

This defines a group whose pattern of characters is one or more digit characters. This group must be followed by a comma and then by the same pattern of characters as were found in the first group. Put this into some JavaScript, and we have the following:

var myString  = "007,007,001,002,002,003,002,004";
var myRegExp = /(\d+),\1/g;
myString = myString.replace(myRegExp,"ERROR");
alert(myString);

The alert box will show

ERROR,001,ERROR,003,002,004
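To see exactly which parts of the string the expression picks out, we can pass it to the match() method, which with the g attribute returns an array of all the matches. This is only a sketch. The second half goes beyond the example in the text: it shows an eventuality the expression does not cover, and one possible fix using word boundaries, which is our own suggestion.

```javascript
var numberList = "007,007,001,002,002,003,002,004";

// Each match is a number, a comma, and the same number again.
var repeats = numberList.match(/(\d+),\1/g);
// ["007,007", "002,002"]

// Without word boundaries, the start of 123 is mistaken for a repeat of 12.
var looseResult = "12,123".replace(/(\d+),\1/g, "ERROR");
// "ERROR3"

// With \b on either side, only whole numbers can form a repeat.
var strictResult = "12,123".replace(/\b(\d+),\1\b/g, "ERROR");
// "12,123"

console.log(repeats.join(" and "));
console.log(looseResult, strictResult);
```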

That completes our brief look at regular expression syntax. Because regular expressions can get a little complex, it's often a good idea to start simple and build them up slowly, as we have done. In fact, most regular expressions are just too hard to get right in one step—at least for us mere mortals without a brain the size of a planet.

If it's still looking a bit strange and confusing, don't panic. In the next sections, we'll be looking at the String object's split(), replace(), search(), and match() methods with plenty more examples of regular expression syntax.