Regular expressions were introduced in JavaScript 1.2 and JScript 3.0 with the RegExp object, so much of their functionality is available through RegExp methods. However, many methods of the String object take regular expressions as arguments; so you will see regular expressions commonly used in both contexts.
Regular expressions are most often created using their literal syntax, in which the characters that make up the pattern are surrounded by slashes (/ and /). For example, to create a regular expression that will match any string containing “http,” you might write the following:
var pattern = /http/;
The way you read this pattern is an “h” followed by a “t” followed by a “t” followed by a “p.” Any string containing “http” matches this pattern.
Flags altering the interpretation of the pattern can be given immediately following the second slash. For example, to specify that the pattern is case-insensitive, the i flag is used:
var patternIgnoringCase = /http/i;
This declaration creates a pattern that will match strings containing “http” as well as “HTTP” or “HttP.” The common flags used with regular expressions are shown in Table 8-1 and will be illustrated in examples throughout the chapter. Don’t worry about any but i for the time being.
Character | Meaning |
---|---|
i | Case-insensitive. |
g | Global match. Finds all matches in the string, rather than just the first. |
m | Multiline matching. |
Regular expressions can also be declared using the RegExp() constructor. The first argument to the constructor is a string containing the desired pattern. The second argument is optional, and contains any special flags for that expression. The two previous examples could equivalently be declared as
var pattern = new RegExp("http");
var patternIgnoringCase = new RegExp("http", "i");
The constructor syntax is most commonly used when the pattern to match against is not determined until runtime. You might allow the user to enter a regular expression and then pass the string containing that expression to the RegExp() constructor.
The most basic method provided by the RegExp object is test(). This method returns a Boolean indicating whether the string given as its argument matches the pattern. For example, we could test
var pattern = new RegExp("http");
pattern.test("HTTP://WWW.W3C.ORG/");
which returns false because pattern matches only strings containing “http.” Or we could test using the case-insensitive pattern,
var patternIgnoringCase = new RegExp("http", "i");
patternIgnoringCase.test("HTTP://WWW.W3C.ORG/");
which returns true because it matches for strings containing “http” while ignoring case. Of course, you won't see much unless you use the returned value:
var patternIgnoringCase = new RegExp("http", "i");
alert(patternIgnoringCase.test("HTTP://WWW.W3C.ORG/"));
Because a regular expression literal evaluates to a RegExp object, you can invoke RegExp methods directly on the literal (just as you can invoke String methods on string literals). For example,
alert(/http/i.test("HTTP://WWW.W3C.ORG/"));
would display true as well.
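Returning to the constructor form, the following is a minimal sketch of building a pattern from text that is known only at runtime. The helper name buildPattern is hypothetical (it is not part of this chapter's code), and the sketch relies on RegExp() throwing an exception when the text is not a valid pattern.

```javascript
// Hypothetical helper: build a RegExp from text supplied at runtime.
// Returns null when the text is not a valid pattern.
function buildPattern(source, flags) {
  try {
    return new RegExp(source, flags || "");
  } catch (e) {
    return null;
  }
}

var fromUser = buildPattern("http", "i");
console.log(fromUser.test("HTTP://WWW.W3C.ORG/")); // true

var broken = buildPattern("(http"); // unbalanced parenthesis
console.log(broken);                // null
```

Returning null is only one possible design; a real form handler might instead report the error message to the user.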
The example patterns so far merely check for the presence of a particular substring; they exhibit none of the powerful capabilities to which we have alluded. Regular expressions use special character sequences enabling the programmer to create more complicated patterns. For example, special characters provide a way to indicate that a certain character or set of characters should be repeated a certain number of times or that the string must not contain a certain character.
The first set of special characters can be thought of as positional indicators, characters that mandate the required position of the pattern in the strings against which it will be matched. These characters are ^ and $, indicating the beginning and end of the string, respectively. For example,
var pattern = /^http/;
matches only those strings beginning with “http.” The following returns false:
pattern.test("The protocol is http");
The $ character causes the opposite behavior:
var pattern = /http$/;
This pattern matches only those strings ending with “http.” You can use both positional indicators in concert to ensure an exact match to the desired pattern:
var pattern = /^http$/;
This regular expression is read as an “h” at the beginning of the string followed by two “t”s followed by a “p” and the end of the string. This pattern matches only the string “http.”
You need to be careful to employ positional indicators properly when doing matches; without them, a pattern matches any string that merely contains it, including strings you did not expect.
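As a quick check of the three anchored patterns above, the sketch below runs each one against a few sample strings. The function name anchorReport and the sample strings are illustrative assumptions, not part of the original examples.

```javascript
// Report how one string fares against the three anchored patterns.
function anchorReport(s) {
  return {
    startsWith: /^http/.test(s),
    endsWith: /http$/.test(s),
    exactly: /^http$/.test(s)
  };
}

console.log(anchorReport("http"));                 // all three true
console.log(anchorReport("http://www.w3c.org/"));  // only startsWith is true
console.log(anchorReport("The protocol is http")); // only endsWith is true
console.log(anchorReport("telnet"));               // all three false
```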
Given the syntax of regular expression literals demonstrated so far, one might wonder how to specify a string that includes slashes, such as “http://www.w3c.org/.” The answer is that as with strings, regular expressions use escape codes to indicate characters having special meaning. Escape codes are specified using a backslash character (\). The escape codes used in regular expressions are a superset of those used in strings (there are far more characters with special meaning, like ^ and $, in regular expressions). These escape codes are listed in Table 8-2. You don’t have to memorize them all; their use will become clear as we explore more features of regexps.
Code | Matches |
---|---|
\f | Form feed |
\n | Newline |
\r | Carriage return |
\t | Tab |
\v | Vertical tab |
\/ | Forward slash / |
\\ | Backslash \ |
\. | Period . |
\* | Asterisk * |
\+ | Plus sign + |
\? | Question mark ? |
\| | Horizontal bar, also called pipe |
\( | Left parenthesis ( |
\) | Right parenthesis ) |
\[ | Left bracket [ |
\] | Right bracket ] |
\{ | Left curly brace { |
\} | Right curly brace } |
\OOO | ASCII character represented by octal value OOO |
\xHH | ASCII character represented by hexadecimal value HH |
\uHHHH | Unicode character represented by the hexadecimal value HHHH |
\cX | Control character represented by ^X; for example, \cH represents Ctrl-H |
Using the appropriate escape code, we can now define a regular expression that matches “http://www.w3c.org/” (and any other string containing it):
var pattern = /http:\/\/www\.w3c\.org\//;
Because / has special meaning in regular expression literals (it means the beginning or end of the pattern), all the forward slashes (/) in the pattern are replaced with their escaped equivalent, \/.
The important thing to remember is that whenever you want to include a character in a pattern that has a special regexp meaning, you must use its escape code instead.
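To see why escaping matters, consider the period, which (as covered later in the chapter) matches any single character when left unescaped. The sketch below compares an unescaped and an escaped pattern; the variable names and test strings are illustrative assumptions. It also shows that a pattern passed to RegExp() as a string literal needs each backslash doubled, because the string parser consumes one level of escaping first.

```javascript
// Unescaped, each period matches any single character (except a newline).
var loose = /www.w3c.org/;
// Escaped, each \. matches only a literal period.
var strict = /www\.w3c\.org/;

console.log(loose.test("www.w3c.org"));  // true
console.log(loose.test("wwwXw3cYorg"));  // true, probably not intended
console.log(strict.test("www.w3c.org")); // true
console.log(strict.test("wwwXw3cYorg")); // false

// In a string literal, "\\." reaches the constructor as \.
var fromString = new RegExp("www\\.w3c\\.org");
console.log(fromString.test("wwwXw3cYorg")); // false
```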
Regular expression repetition quantifiers allow you to specify the number of times a particular item in the expression can or must be repeated. For now, consider that by “particular item” we mean “previous character.” The distinction will become clear later in the chapter. As an example of a repetition quantifier, * (the asterisk) indicates that the previous item may occur zero or more times. Any sequence of zero or more repetitions of the previous item can be present in the strings the pattern will match. For example:
var pattern = /ab*c/;
Read the * as “repeated zero or more times.” Doing so, we read this pattern as matching any string containing an “a” that is followed immediately by “b” repeated zero or more times, followed immediately by a “c.” All the following strings will match this expression:
ac
abc
abbbbbbbbbbbbbbbbbbbbbbbbbbbc
The letters abc begin the alphabet
Similarly, + specifies that the previous character must be repeated one or more times. The following declaration
var pattern = /ab+c/;
is read as “a” followed by “b” repeated one or more times, followed by “c.” Keeping this pattern in mind, you can see that it matches all the following strings:
abc
abbbbbc
The letters abc begin the alphabet
Conversely, the pattern does not match the string “ac” because it does not contain at least one “b” between “a” and “c.”
The ? quantifier indicates that the previous item may occur zero times or one time, but no more. For example:
var pattern = /ab?c/;
Read this pattern as “a” followed by zero or one “b”s followed by “c.” It matches “ac” and “abc,” but not “abbc.” The ? essentially denotes that the preceding item is optional.
The repetition quantifiers so far haven’t provided any way to specify that a particular character is to be repeated some exact number of times. Curly braces ({ }) are used to indicate the number of repetitions allowed for the preceding token (character). For example,
var pattern = /ab{5}c/;
specifies a pattern consisting of an “a” followed by exactly five “b” characters and then the letter “c.” Of course, this particular expression could have also been written as
var pattern = /abbbbbc/;
But this “long” version would be very cumbersome if you wanted to match, say, a character repeated 25 times.
Using curly braces, it is also possible to indicate that the number of repetitions falls within a specific range. To do so, list inside the curly braces the minimum number of repetitions allowed, followed by a comma and the maximum allowed. For example,
var pattern = /ab{5,7}c/;
creates a regular expression matching a single “a” followed by between five and seven (inclusive) “b” characters and then the letter “c.”
Omitting the maximum amount from within the curly braces (but still including the comma) specifies a minimum number of repetitions. For example,
var pattern = /ab{3,}c/;
creates an expression matching an “a” followed by three or more letter “b” characters followed by a “c.”
The full list of repetition quantifiers is summarized in Table 8-3.
Character | Meaning |
---|---|
* | Match previous item zero or more times. |
+ | Match previous item one or more times. |
? | Match previous item zero or one time. |
{m,n} | Match previous item at minimum m times, but no more than n times. |
{m,} | Match previous item m or more times. |
{m} | Match previous item exactly m times. |
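The quantifiers can be compared side by side with a short sketch. The helper accepted and the input list are assumptions made for illustration. Each pattern is anchored with ^ and $ so that the entire string must fit, which makes the repetition counts exact.

```javascript
var inputs = ["ac", "abc", "abbc", "abbbc", "abbbbc"];

// Return the inputs that the given pattern accepts.
function accepted(pattern) {
  return inputs.filter(function (s) { return pattern.test(s); });
}

console.log(accepted(/^ab*c$/));     // all five strings
console.log(accepted(/^ab+c$/));     // everything except "ac"
console.log(accepted(/^ab?c$/));     // ["ac", "abc"]
console.log(accepted(/^ab{2}c$/));   // ["abbc"]
console.log(accepted(/^ab{2,3}c$/)); // ["abbc", "abbbc"]
console.log(accepted(/^ab{2,}c$/));  // ["abbc", "abbbc", "abbbbc"]
```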
Now we’re really starting to glimpse the power of regular expressions, and there is still much more to cover. Don’t give up just yet—while learning regexps can initially be a challenge, it will pay off in the long run in the time saved by not having to write and debug complex code.
Notice how Table 8-3 indicates that the repetition quantifiers match the “previous item” a certain number of times. In the examples seen so far, the “previous item” has been a single character. However, JavaScript lets you easily group characters together as a single unit much like the way statements can be grouped together in a block using curly braces. The simplest way to group characters in a regular expression is to use parentheses. Any characters surrounded by parentheses are considered a unit with respect to the special regular expression operators. For example,
var pattern = /a(bc)+/;
is read as “a” followed by “bc” repeated one or more times. The parentheses group the “b” and “c” together with respect to the +. This pattern matches any string containing an “a” followed immediately by one or more repetitions of “bc.”
Another example is
var pattern = /(very){3,5} hot/;
This pattern matches strings containing “very” repeated three, four, or five times followed by a space and the word “hot.”
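The effect of grouping is easiest to see by comparing a grouped pattern with its ungrouped counterpart. The following sketch uses illustrative variable names and test strings that are not from the original text, and anchors the first two patterns so that the whole string must fit. Note that (very){3,5} repeats “very” with nothing in between, so a pattern that allows spaces between the repetitions has to include the space inside the group.

```javascript
var grouped = /^a(bc)+$/;  // + repeats the unit "bc"
var ungrouped = /^abc+$/;  // + repeats only "c"

console.log(grouped.test("abcbc"));   // true
console.log(ungrouped.test("abcbc")); // false
console.log(grouped.test("abccc"));   // false
console.log(ungrouped.test("abccc")); // true

var hot = /(very){3,5} hot/;
console.log(hot.test("veryveryvery hot"));   // true
console.log(hot.test("very very very hot")); // false: spaces break the repetition

var spacedHot = /(very ){3,5}hot/; // the space is part of the repeated unit
console.log(spacedHot.test("very very very hot")); // true
console.log(spacedHot.test("very very hot"));      // false: only two repetitions
```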
Sometimes it is necessary to match any character from a group of possibilities. For example, to match phone numbers, the group of characters might be digits, or if you wished to validate a country name, the group of valid characters might be alphabetic.
JavaScript allows you to define character classes by including the possible characters between square brackets ([ ]). Any character from the class can be matched in the string, and the class is considered a single unit like parenthesized groups. Consider the following pattern:
var pattern = /[pbm]ill/;
In general, a class [...] is read as “any character in the group,” so the pattern /[pbm]ill/ is read as “p” or “b” or “m” followed by “ill.” This pattern matches “pill,” “billiards,” and “paper mill,” but not “chill.”
Consider another example:
var pattern = /[1234567890]+/;
The class [1234567890] contains all ten digits, and the + repetition quantifier is applied to it. As a result, this pattern matches any string containing one or more digits. This format could get very messy if you needed a large group of allowed characters, but luckily JavaScript allows you to use a dash (-) to indicate a range of values:
var pattern = /[0-9]+/;
This regular expression is the same as the previous example, just written more compactly.
Any time you use the range operator, you specify a range of valid ASCII values. So, for example, you might do this
var pattern = /[a-z]/;
to match any lowercase alphabetic character or
var pattern = /[a-zA-Z0-9]/;
to match any alphanumeric character. JavaScript allows you to place all the valid characters in a contiguous sequence in a character class, as in the last example. It interprets such a class correctly.
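As an illustration of combining several ranges in one class, the sketch below checks for a six-digit hexadecimal color such as #1a2B3c. The function isHexColor and its input format are assumptions chosen for this example, not something defined in the chapter.

```javascript
// Hypothetical validator: "#" followed by exactly six hexadecimal digits.
function isHexColor(text) {
  // One class with three contiguous ranges: 0-9, a-f, and A-F.
  return /^#[0-9a-fA-F]{6}$/.test(text);
}

console.log(isHexColor("#1a2B3c")); // true
console.log(isHexColor("#12345"));  // false: only five digits
console.log(isHexColor("#12345g")); // false: "g" is outside every range

// The explicit and range forms of the digit class behave the same way.
console.log(/^[1234567890]+$/.test("2024") === /^[0-9]+$/.test("2024")); // true
```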
Character classes finally give us an easy way to construct our phone number validation pattern. We could rewrite our function as
function isPhoneNumber(phone) {
  var pattern = /[0-9]{3}-[0-9]{3}-[0-9]{4}/;
  return pattern.test(phone);
}
This pattern matches strings containing any character from the class of digits 0 through 9 repeated three times followed by a dash, followed by another three digits, a dash, and a final four digits. Notice how our code to validate phone numbers presented at the start of the chapter went from about 20 lines without regular expressions to only four when using them! We can test that this function works:
document.write("Is 123456 a phone number? ");
document.writeln(isPhoneNumber("123456"));
document.write("Is 12-12-4322 a phone number? ");
document.writeln(isPhoneNumber("12-12-4322"));
document.write("Is 415-555-1212 a phone number? ");
document.writeln(isPhoneNumber("415-555-1212"));
The output is shown in Figure 8-1.
The truth is that while it appears to work just fine, our isPhoneNumber() function has a subtle flaw commonly overlooked by those new to regular expressions: it is too permissive. Consider the following example:
alert(isPhoneNumber("The code is 839-213-455-726-0078. "));
The result is true, because the string contains “455-726-0078,” which fits the pattern.
Since we didn’t specify any positional information in our pattern, the regexp matches any string containing it, even if the beginning and end of the string have data that doesn’t match. To correct this flaw we use the $ and ^ specifiers:
function isPhoneNumber(phone) {
  var pattern = /^[0-9]{3}-[0-9]{3}-[0-9]{4}$/;
  return pattern.test(phone);
}
Now it will only return true if there are no spurious characters preceding or following the phone number.
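The difference between the two versions can be checked directly. In this sketch the names loosePhone and strictPhone are hypothetical labels for the unanchored and anchored patterns, and console.log stands in for alert() so the code can run outside a browser.

```javascript
var loosePhone = /[0-9]{3}-[0-9]{3}-[0-9]{4}/;    // unanchored
var strictPhone = /^[0-9]{3}-[0-9]{3}-[0-9]{4}$/; // anchored at both ends

var embedded = "The code is 839-213-455-726-0078. ";

console.log(loosePhone.test("415-555-1212"));  // true
console.log(strictPhone.test("415-555-1212")); // true
console.log(loosePhone.test(embedded));        // true: "455-726-0078" fits
console.log(strictPhone.test(embedded));       // false
console.log(strictPhone.test("12-12-4322"));   // false
```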
As another example of the application of regular expressions, we create a pattern to match a case-insensitive username beginning with an alphabetic character followed by zero or more alphanumeric characters as well as underscores and dashes. The following regular expression defines such a pattern:
var pattern = /^[a-z][a-z0-9_-]*/i;
This will match, for example, “m,” “m10-120,” “abracadabra,” and “abra_cadabra,” but not “_user” or “10abc.” Note how the dash was included in the character class last to prevent it from being interpreted as the range operator.
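One caveat: the username pattern above has no trailing $, so it constrains only the beginning of the string. A sketch of a stricter variant follows; the function name isUserName is an assumption for illustration, and the only change to the pattern is the added $.

```javascript
var startOnly = /^[a-z][a-z0-9_-]*/i;   // pattern from the text
var wholeName = /^[a-z][a-z0-9_-]*$/i;  // same pattern with $ appended

// Hypothetical validator built on the anchored variant.
function isUserName(name) {
  return wholeName.test(name);
}

console.log(isUserName("m10-120"));      // true
console.log(isUserName("abra_cadabra")); // true
console.log(isUserName("_user"));        // false
console.log(isUserName("10abc"));        // false

console.log(startOnly.test("abc!"));     // true: only the start is checked
console.log(isUserName("abc!"));         // false
```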
Square brackets can also be used when describing “negative” character classes, namely, classes that specify which characters cannot be present. A negative class is specified by placing a caret (^) at the beginning of the class. For example,
var pattern = /[^a-zA-Z]+/;
will match any sequence of one or more non-alphabetic characters, for instance “314,” “!!%&^,” or “__0.” For a string to match the preceding expression, it must contain at least one non-alphabetic character.
Negative character classes are very useful when matching or parsing fields delimited with a certain value. Sometimes, there is no elegant alternative. For example, it is not straightforward to write a clean regular expression to check that a string contains five comma-separated strings without using a negative character class, but it is simple using negative character classes, as shown here:
var pattern = /[^,]+,[^,]+,[^,]+,[^,]+,[^,]+/;
Read this as one or more characters that aren’t commas, followed by a comma, followed by one or more characters that aren’t commas, and so on. You could even write this pattern more concisely:
var pattern = /[^,]+(,[^,]+){4}/;
You can test that these patterns work:
alert(pattern.test("peter, paul, mary, larry")); // shows false
alert(pattern.test("peter, paul, mary, larry, moe")); // shows true
This is an important lesson: if you’re having trouble coming up with a regular expression for a particular task, try writing an expression using negative character classes first. It may often point the way toward an even cleaner solution.
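Because the comma-separated pattern is unanchored, it accepts any string with at least five fields. If exactly five are required, one option (sketched here with hypothetical variable names and sample strings) is to add the positional indicators:

```javascript
var atLeastFive = /[^,]+(,[^,]+){4}/;   // pattern from the text
var exactlyFive = /^[^,]+(,[^,]+){4}$/; // anchored variant

console.log(atLeastFive.test("a,b,c,d,e"));   // true
console.log(exactlyFive.test("a,b,c,d,e"));   // true
console.log(atLeastFive.test("a,b,c,d,e,f")); // true: the first five fields suffice
console.log(exactlyFive.test("a,b,c,d,e,f")); // false: six fields
console.log(exactlyFive.test("a,b,c,d"));     // false: four fields
console.log(exactlyFive.test("a,,c,d,e"));    // false: [^,]+ rejects an empty field
```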
Commonly used character classes have shorthand escape codes. A particularly useful notation is the period, which matches any character except a newline. For instance
var pattern = /abc..d/;
would match “abcx7d” or “abc_-d.” Other common classes are \s, any whitespace character; \S, any non-whitespace character; \w, any word character; \W, any non-word character; \d, any digit; and \D, any non-digit. (Notice the pattern: the uppercase version of each shorthand is the opposite of the lowercase version.) The complete list of character classes is given in Table 8-4.
Character | Meaning |
---|---|
[chars] | Any one character indicated either explicitly or as a range between the brackets. |
[^chars] | Any one character not between the brackets represented explicitly or as a range. |
. | Any character except newline. |
\w | Any word character. Same as [a-zA-Z0-9_]. |
\W | Any non-word character. Same as [^a-zA-Z0-9_]. |
\s | Any whitespace character. Same as [ \t\n\r\f\v]. |
\S | Any non-whitespace character. Same as [^ \t\n\r\f\v]. |
\d | Any digit. Same as [0-9]. |
\D | Any non-digit. Same as [^0-9]. |
\b | A word boundary. The empty “space” between a \w and \W. |
\B | A word non-boundary. The empty “space” between word characters. |
[\b] | A backspace character. |
We can use these shorthands to write an even more concise version of our isPhoneNumber() function:
function isPhoneNumber(phone) {
  var pattern = /^\d{3}-\d{3}-\d{4}$/;
  return pattern.test(phone);
}
We’ve replaced each [0-9] character class with its shorthand, \d.
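The other shorthands can be exercised the same way. The patterns and sample strings in this sketch are illustrative assumptions rather than examples from the original text.

```javascript
// \w covers letters, digits, and the underscore, but not the dash.
console.log(/^\w+$/.test("user_01")); // true
console.log(/^\w+$/.test("user-01")); // false

// \d with a repetition count: exactly five digits.
console.log(/^\d{5}$/.test("94110")); // true
console.log(/^\d{5}$/.test("9411a")); // false

// \s finds any whitespace character anywhere in the string.
console.log(/\s/.test("two words")); // true
console.log(/\s/.test("no_spaces")); // false

// \D is the opposite of \d.
console.log(/^\D+$/.test("abc"));  // true
console.log(/^\D+$/.test("abc1")); // false
```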
The final major tool necessary to define useful patterns is |, which indicates the logical OR of several items. For example, to match a string that begins with “ftp,” “http,” or “https,” you might write
var pattern = /^(http|ftp|https)/;
Unlike repetition quantifiers that only apply to the previous item, alternation separates complete patterns. If we had written the preceding example as
var pattern = /^http|ftp|https/;
the pattern would have matched a string beginning with “http” or a string containing “ftp” or a string containing “https.” The initial ^ would’ve been included only in the first alternative pattern. To further illustrate, consider the following regexp:
var pattern = /James|Jim|Charlie Brown/;
Since each | indicates a new pattern, this matches a string containing “James,” a string containing “Jim,” or a string containing “Charlie Brown.” It does not require “Brown” after “James” or “Jim,” as you might have thought, so “James Smith” matches as well. Parenthesizing alternatives limits the effect of the | to the parenthesized items, as in the following pattern,
var pattern = /(James|Jim|Charlie) Brown/;
which matches “James Brown,” “Jim Brown,” and “Charlie Brown.”
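The scope of | can be verified with a short sketch. The variable names and test strings here are assumptions added for illustration.

```javascript
var ungroupedScheme = /^http|ftp|https/; // ^ belongs to the first alternative only
var groupedScheme = /^(http|ftp|https)/; // ^ applies to every alternative

console.log(ungroupedScheme.test("my ftp site")); // true: contains "ftp"
console.log(groupedScheme.test("my ftp site"));   // false: does not begin with it
console.log(groupedScheme.test("ftp://example.org/")); // true

var ungroupedName = /James|Jim|Charlie Brown/;
var groupedName = /(James|Jim|Charlie) Brown/;

console.log(ungroupedName.test("James Smith")); // true: "James" alone is enough
console.log(groupedName.test("James Smith"));   // false
console.log(groupedName.test("Jim Brown"));     // true
```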
The tools described so far work together to permit the creation of useful regular expressions. It is important to be comfortable interpreting the meaning of regular expressions before delving further into how they are used. Table 8-5 provides some practice examples along with strings they do and do not match. You should work through each example before proceeding.
Regular Expression | Matches | Does Not Match |
---|---|---|
/\Wten\W/ | “ ten ” | “ten”, “tents” |
/\wten\w/ | “aten1” | “ten”, “1ten” |
/\bten\b/ | “ten” | “attention”, “tensile” |
/\d{1,3}\.\d{1,3}\.\… | “128.22.45.1” | “abc.44.55.42” |
/^(http|ftp|https):\/… | | “file:///etc/motd” |
/\w+@\w+\.\w{1,3}/ | “president@whitehouse.gov” | |
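One way to work through the practice examples is to run them. The sketch below covers only the patterns that appear complete in Table 8-5, and it assumes that the first matching example is “ ten ” with a space on each side. The cases array and the check function are hypothetical scaffolding, not part of the chapter.

```javascript
// Each case: [pattern, string, expected result of test()].
var cases = [
  [/\Wten\W/, " ten ", true],
  [/\Wten\W/, "ten", false],
  [/\Wten\W/, "tents", false],
  [/\wten\w/, "aten1", true],
  [/\wten\w/, "ten", false],
  [/\wten\w/, "1ten", false],
  [/\bten\b/, "ten", true],
  [/\bten\b/, "attention", false],
  [/\bten\b/, "tensile", false],
  [/\w+@\w+\.\w{1,3}/, "president@whitehouse.gov", true]
];

// Return the cases whose actual result differs from the expected one.
function check(list) {
  return list.filter(function (c) { return c[0].test(c[1]) !== c[2]; });
}

console.log(check(cases).length); // 0 when every expectation holds
```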