Friday, September 16, 2005
Text Processing -- Pause to Reflect
As is often the case with anything other than a simple, single-purpose script, real experience with real data can lead to revelations that require some rethinking. As a result of what I've done so far, I've realized that I set the formatting of those symbol characters too early. Apart from that, everything we've done so far has been with text that is uniformly formatted within each paragraph; until we start applying character styling (or local styling, though there won't be any of that), we can process the text as strings in JavaScript rather than in situ in the document. This has huge speed benefits, not to mention the possibility of using grep (not that I expect to).
So, that means that I need to move the symbol characters call later. It also means that I need to split up the HTML processing somewhat. At the least, the conversion of the <br> tag to a forced line break needs to be handled separately, because that tag is used only in the references paragraphs to delineate the members of lists. This first batch of data reveals that some of these lists are not well formed, so I need to process them to make them well formed. This is best done before inserting any formatting at a more detailed level than the paragraph styles.
With these considerations in mind, the function to convert the breaks to forced line breaks looks like this. It would be interesting to compare this solution with one that used Find/Change within the document to see which is faster.
function cleanUpBreaks(theText) {
    // Work a paragraph at a time, from last to first
    var myTexts = theText.paragraphs.everyItem().contents;
    for (var j = myTexts.length - 1; j >= 0; j--) {
        var myText = myTexts[j];
        // Change break tags to forced new lines
        var myNewText = myText.split("<br>").join("\n");
        // Eliminate all space runs: turn double spaces into single
        // spaces until no double spaces are left
        var myParts = myNewText.split("  ");
        while (myParts.length > 1) {
            myNewText = myParts.join(" ");
            myParts = myNewText.split("  ");
        }
        // Eliminate spaces on either side of forced new line
        myNewText = myNewText.split(" \n").join("\n").split("\n ").join("\n");
        // Write back if changed, leaving the paragraph return untouched
        if (myText != myNewText) {
            theText.paragraphs[j].characters.itemByRange(0, -2).contents = myNewText.slice(0, -1);
        }
    }
    return true;
}
This is all relatively easy to follow thanks to the comments. The while loop eliminates space runs by changing all double spaces to single spaces until there aren't any double spaces left.
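Since working on strings opens the door to grep-style processing, here is a rough sketch of how the same string cleanup might look with regular expressions. It is not the function used above, only a comparison: the name cleanUpBreaksInString and the sample text are made up for illustration, and it assumes the break tag always appears exactly as <br>. It works on a plain string, so it can be tried outside InDesign.

```javascript
// Hypothetical helper (not part of the original script): a regex-based
// equivalent of the split/join cleanup, applied to a plain string.
function cleanUpBreaksInString(text) {
    return text
        .replace(/<br>/g, "\n")     // break tags become forced new lines
        .replace(/ {2,}/g, " ")     // each run of spaces collapses to one space
        .replace(/ ?\n ?/g, "\n");  // drop a space on either side of a forced new line
}

// Made-up sample paragraph; "\r" stands for the paragraph return.
var sample = "Smith, J.  <br>  Jones, K.<br>Brown, L.\r";
var cleaned = cleanUpBreaksInString(sample);
// cleaned is "Smith, J.\nJones, K.\nBrown, L.\r"
```

The three replace() calls mirror the three steps in the comments of cleanUpBreaks(), and the paragraph return at the end passes through untouched.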
The most difficult line in the script is the one that writes the changed paragraphs back into the document. The problem here is that if you simply write the whole new paragraph over the existing one, the paragraph styles get messed up because that information is held in the return at the end of the paragraph, and we're overwriting it. Hence the use of itemByRange() on the characters of the paragraph and the use of slice() on the text in myNewText to leave the paragraph mark alone, thereby preserving the paragraph style. The different numbering schemes used by these two methods can catch you out: itemByRange() includes the character at the second index, whereas slice() does not. That is why the range ends at -2 while the slice ends at -1.
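To see the two numbering schemes side by side without opening InDesign, here is a small sketch on a plain string. The inclusiveRange() function is a hypothetical stand-in that only imitates the inclusive end index described above; it is not the InDesign method itself, and the sample paragraph is made up.

```javascript
// Hypothetical stand-in for an inclusive range: negative indices count
// from the end, and the item at `last` is included in the result.
function inclusiveRange(text, first, last) {
    var from = first < 0 ? text.length + first : first;
    var to = last < 0 ? text.length + last : last;
    return text.slice(from, to + 1);
}

var paragraph = "abc\r";  // three characters plus the paragraph return

var bySlice = paragraph.slice(0, -1);            // "abc": stops before the last character
var byRange = inclusiveRange(paragraph, 0, -2);  // "abc": includes the second-to-last character
var tooFar = inclusiveRange(paragraph, 0, -1);   // "abc\r": would take in the return as well
```

Both (0, -1) for slice() and (0, -2) for the inclusive range select everything except the final return, which is the pairing the write-back line relies on.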