Saturday, September 17, 2005

 

Text Processing -- A Wrinkle means RegExp

I've run into a wrinkle that requires an excursion into the world of RegExp (regular expressions). The HTML codes in this text use a mix of cases. Some of the <br> tags are written <BR>, and other tags have the same inconsistency. So the split("<br>").join("\n") approach doesn't work, because split() with a string argument matches case-sensitively. While in this case I could just double-up with little penalty, I'm going to have to come up with a more general solution for the other tags.
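To see the problem concretely, here is a minimal sketch in plain JavaScript, with no InDesign objects involved. The sample string and the variable names are made up for illustration, and the "double-up" step assumes only the two spellings <br> and <BR> occur.

```javascript
// Hypothetical sample text with mixed-case break tags
var sample = "line one<br>line two<BR>line three";

// split() with a string argument is case-sensitive, so <BR> survives
var once = sample.split("<br>").join("\n");

// "Doubling up": a second pass for the upper-case spelling.
// This only works while the number of spellings stays small.
var twice = once.split("<BR>").join("\n");
```

After the first pass the <BR> is still in the text; only the second pass removes it. A tag with more letters, or attributes, would need far more passes, which is why a more general solution is needed.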

At first glance, the solution to the immediate issue is easy. The RegExp we need looks like this:
var myRG = new RegExp("<br>", "gi");
The first string argument is what we're looking for. In the second argument, the g stands for global, which means seek out every match rather than just the first, and the i indicates case insensitivity. So, it would seem that our cleanUpBreaks() function should look like this:
function cleanUpBreaks(theText) {
 // Work a paragraph at a time
 var myTexts = theText.paragraphs.everyItem().contents;
 var myRG = new RegExp("<br>", "gi");
 for (var j = myTexts.length - 1; j >=0; j--){
  var myText = myTexts[j];
  // Change break tags to forced new lines
  var myNewText = myText.replace(myRG, "\n");
  // Eliminate all space runs (split on two spaces, rejoin with one, until none remain)
  var myParts = myNewText.split("  ");
  while (myParts.length > 1) {
   myNewText = myParts.join(" ");
   myParts = myNewText.split("  ");
  }
  // Eliminate spaces on either side of forced new line
  myNewText = myNewText.split(" \n").join("\n").split("\n ").join("\n");
  // Write back if changed
  if (myText != myNewText) {
   theText.paragraphs[j].characters.itemByRange(0, -2).contents = myNewText.slice(0,-1);
  }
 }
 return true
}
But if you look a little closer, most of that splitting and joining that comes after the use of the RegExp can be bundled into the RegExp itself.

I say "most" because the changing of space runs to single spaces operates globally in the current version, but I did that only to make sure there was not more than one space before or after the inserted forced line break. So, let's forget about that for now (it's an unrelated function and so ought to be done elsewhere, if at all) and improve our regular expression to address any spaces surrounding the original break tag.
function cleanUpBreaks(theText) {
 // Work a paragraph at a time
 var myTexts = theText.paragraphs.everyItem().contents;
 var myRG = new RegExp(" *<br> *", "gi");
 for (var j = myTexts.length - 1; j >=0; j--){
  var myText = myTexts[j];
  // Change break tags to forced new lines
  var myNewText = myText.replace(myRG, "\n");
  // Write back if changed
  if (myText != myNewText) {
   theText.paragraphs[j].characters.itemByRange(0, -2).contents = myNewText.slice(0,-1);
  }
 }
 return true
}
And there you have it. The space-asterisk pair on either side of the <br> in the regular expression causes the replace command in the loop to seek out zero or more spaces followed by <br> followed by zero or more spaces.
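As a quick check outside InDesign, here is a small sketch that applies the same pattern to an ordinary string. The sample text is invented for illustration; only the pattern and flags come from the function above.

```javascript
// Hypothetical sample: mixed-case tags with stray spaces around some of them
var sample = "first line  <BR> second line<br>third line <Br>fourth line";

// Same pattern as in cleanUpBreaks(): zero or more spaces, the tag,
// then zero or more spaces; g = every match, i = ignore case
var myRG = new RegExp(" *<br> *", "gi");

// Each tag, together with any spaces touching it, becomes one forced new line
var result = sample.replace(myRG, "\n");
```

Here result comes out as four lines with no spaces next to the line breaks, while the single spaces inside each line are left alone.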
