Chapter 8

In this chapter we looked at string manipulation using the String object and the use of the RegExp object to match patterns of characters within strings.

Question 1

What problem does the code below solve?

var myString = "This sentence has has a fault and and we need to fix it.";
var myRegExp = /(\b\w+\b) \1/g;
myString = myString.replace(myRegExp,"$1");

If we now change our code, so that we create our RegExp object like this:

var myRegExp = new RegExp("(\b\w+\b) \1");

why would this not work, and how could we rectify the problem?

The problem is that the sentence has "has has" and "and and" inside it, clearly a mistake. A lot of word processors have an autocorrect feature that fixes common mistakes like this, and what our regular expression does is mimic this feature.

So the erroneous myString

"This sentence has has a fault and and we need to fix it."

will become

"This sentence has a fault and we need to fix it."

Let's look at how the code works, starting with the regular expression.

/(\b\w+\b) \1/g;

By using parentheses we have defined a group, so (\b\w+\b) is group 1. This group matches the pattern of a word boundary followed by one or more word characters, that is, a–z, A–Z, 0–9, and _ (the underscore), followed by a word boundary. Following the group we have a space and then \1. What \1 means is match exactly the same characters as were matched by group 1. So, for example, if group 1 matched "has", then \1 will match only "has" as well. It's important to note that \1 always refers to what group 1 captured in the current match. So when group 1 later matches "and", \1 matches "and" and not the "has" that was matched previously.

We use the group again in our replace() method; this time the group is specified using the $ symbol, so $1 refers to whatever group 1 captured. Because each whole match (for example, "has has") is replaced by just the contents of group 1 ("has"), every doubled word is reduced to a single word.
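To see what group 1 captures on each match, here is a small sketch that passes a function, rather than the "$1" string, as the second argument to replace(). The names captured and fixedString are made up for this illustration, and console.log() is used in place of alert() so that the code also runs outside a browser.

```javascript
var myString = "This sentence has has a fault and and we need to fix it.";
var myRegExp = /(\b\w+\b) \1/g;

// Record what group 1 captured each time the pattern matched.
var captured = [];
var fixedString = myString.replace(myRegExp, function (match, group1) {
   captured.push(group1);
   return group1;   // has the same effect as the "$1" replacement string
});

console.log(captured);      // the words that were doubled
console.log(fixedString);   // the sentence with each doubled word reduced to one
```

Returning group1 from the function gives the same result as the "$1" replacement string used in the question; the function form simply makes it easy to inspect each match.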

Turning to the second part of the question, how do we need to change the following code so that it works?

var myRegExp = new RegExp("(\b\w+\b) \1");

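Easy; because we are now passing a string to the RegExp object's constructor, every backslash that belongs to the regular expression must be written as two backslashes. Inside a JavaScript string literal the backslash is itself an escape character, so a single \b is turned into a backspace character before the regular expression ever sees it. The corrected code looks like this: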

var myRegExp = new RegExp("(\\b\\w+\\b) \\1","g");

Notice we've also passed "g" as the second argument to make it a global match.
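The following sketch shows the difference the extra backslashes make. The variable names are invented for this example, and console.log() stands in for alert() so the code can run outside a browser. The \1 escape is deliberately left out of the single-backslash string, because how a string literal treats it depends on the mode the script runs in.

```javascript
// In a string literal, "\b" is one backspace character and "\w" is just "w".
var singleBackslash = "\b\w+\b";
// "\\b" is two characters: a backslash followed by the letter b.
var doubleBackslash = "\\b\\w+\\b";

console.log(singleBackslash.length);   // 4
console.log(doubleBackslash.length);   // 7

// With doubled backslashes the constructor builds the same pattern as the literal.
var fromString = new RegExp("(\\b\\w+\\b) \\1", "g");
var fromLiteral = /(\b\w+\b) \1/g;
console.log(fromString.source === fromLiteral.source);   // true

var myString = "This sentence has has a fault and and we need to fix it.";
var fixedString = myString.replace(fromString, "$1");
console.log(fixedString);
```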

Question 2

Write a regular expression that finds all of the occurrences of the word "a" in the following sentence, and replaces them with "the".

"a dog walked in off a street and ordered a finest beer"

So the sentence becomes:

"the dog walked in off the street and ordered the finest beer"

<html>
<body>
<script language=JavaScript>
var myString = "a dog walked in off a street and ordered a finest beer";
var myRegExp = /\ba\b/gi;
myString = myString.replace(myRegExp,"the");
alert(myString);
</script>
</body>
</html>

Save this as ch08_q2.htm.

With regular expressions, it's often not just what you want to match, but also what you don't want to match that is a problem. Here we want to match the letter "a", so why not just write

var myRegExp = /a/gi;

Well, that would replace every "a" we want to change, but it would also replace the "a" in "walked" and in "and", which we don't want. We want to replace the letter "a" only where it's a word on its own and not where it's inside another word. So when does a letter become a word? The answer is when it's between two word boundaries. The word boundary is represented by the regular expression special character \b, so the regular expression becomes

var myRegExp = /\ba\b/gi;

The gi at the end ensures a global, case-insensitive search.

Now with our regular expression created, we can use it in the replace() method's first parameter.

myString = myString.replace(myRegExp,"the");
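As a quick comparison, this sketch runs both regular expressions against the sentence. The names sentence, naiveResult, and boundedResult are invented for the illustration, and console.log() replaces alert() so the code runs outside a browser too.

```javascript
var sentence = "a dog walked in off a street and ordered a finest beer";

// Without word boundaries, every "a" is replaced, even inside other words.
var naiveResult = sentence.replace(/a/gi, "the");

// With \b on both sides, only the stand-alone word "a" is replaced.
var boundedResult = sentence.replace(/\ba\b/gi, "the");

console.log(naiveResult);     // "walked" and "and" are mangled here
console.log(boundedResult);   // only the word "a" has changed
```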

Question 3

Imagine you have a website with a message board. Write a regular expression that would remove barred words. (I'll let you make up your own words!)

<html>
<body>
<script language=JavaScript>
   var myRegExp = /(sugar )?candy|choc(olate|oholic)?/gi;
   var myString = "Mmm, I love chocolate, I'm a chocoholic. " +
      "I love candy too, sweet, sugar candy";
   myString = myString.replace(myRegExp,"salad");
   alert(myString);
</script>
</body>
</html>

Save this as ch08_q3.htm.

For our example, we'll pretend we're creating a script for a message board on a dieting site, where text relating to candy is barred and will be replaced with a much healthier option, salad.

My barred words are

chocolate

choc

chocoholic

sugar candy

candy

Let's see how I built up the regular expression to remove the offending words.

I started with the two basic words, so to match "choc" or "candy", I used

candy|choc

Next I added the matching for "sugar candy." Since the "sugar" bit is optional, we group it by placing it in parentheses and adding the "?" after it. This means match the group zero times or one time.

 (sugar )?candy|choc

Finally we need to add the optional "olate" and "oholic" end bits. We add these as a group after the "choc" word and again make the group optional. We can match either of the endings in the group by using the | character.

 (sugar )?candy|choc(olate|oholic)?

We then declare it as

   var myRegExp = /(sugar )?candy|choc(olate|oholic)?/gi;

The gi at the end means the regular expression will find and replace words on a global, case-insensitive basis.

So, to sum up

/(sugar )?candy|choc(olate|oholic)?/gi

reads as:

Either match "candy", optionally preceded by "sugar " (the group matches zero times or one time), or match "choc", optionally followed by either "olate" or "oholic".
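To check each stage of the build-up, the sketch below uses the String object's match() method, which returns an array of every match when the regular expression has the g flag. The names stage1, stage2, and stage3 are invented for this illustration, and console.log() is used instead of alert() so the code can run outside a browser.

```javascript
var myString = "Mmm, I love chocolate, I'm a chocoholic. " +
   "I love candy too, sweet, sugar candy";

// Stage 1: just the two basic words.
var stage1 = myString.match(/candy|choc/gi);
// Stage 2: the optional "sugar " in front of "candy".
var stage2 = myString.match(/(sugar )?candy|choc/gi);
// Stage 3: the optional endings after "choc".
var stage3 = myString.match(/(sugar )?candy|choc(olate|oholic)?/gi);

console.log(stage1);   // ["choc", "choc", "candy", "candy"]
console.log(stage2);   // ["choc", "choc", "candy", "sugar candy"]
console.log(stage3);   // ["chocolate", "chocoholic", "candy", "sugar candy"]
```

Only the final version matches each barred word in full, which is what lets replace() swap the whole word for "salad" without leaving fragments such as "olate" behind.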

Finally, the following:

   myString = myString.replace(myRegExp,"salad");

replaces the offending words with "salad" and sets myString to the new clean version:

"Mmm, I love salad, I'm a salad. I love salad too, sweet, salad"