Friday, September 16, 2005
Processing Text -- Trim White Space
Let's face it, this is not something that happens often (at least, not at the start of paragraphs), and when it does, the chances of there being more than one space or tab are pretty low. I can also ignore all other kinds of white space (remember: this text came from Excel). So, let me try something straightforward and see how it does:
    function trimText(theText) {
        // Trims spaces and tabs from the front and back of all paragraphs in theText
        // Start with the tail end of each paragraph
        var myContents = theText.paragraphs.everyItem().contents;
        for (var j = myContents.length - 1; j >= 0; j--) {
            var myContent = myContents[j];
            for (var k = myContent.length - 1; k >= 0; k--) {
                var myChar = myContent.slice(k, k + 1);
                if (myChar == "\r") {continue} // do nothing to return
                if ((myChar != " ") && (myChar != "\t")) {break};
                // If we get here, paragraph has one or more spaces or tabs "at end"
                // Thinks: for some kinds of text a tab is legit at end
                // But this doesn't apply to this project
                theText.paragraphs[j].characters[k].remove();
            }
        }
        // Now look at start of paragraphs; for safety refresh myContents
        myContents = theText.paragraphs.everyItem().contents;
        for (j = myContents.length - 1; j >= 0; j--) {
            myContent = myContents[j];
            for (k = 0; myContent.length > k; k++) {
                myChar = myContent.slice(k, k + 1);
                if (myChar == "\r") {break} // empty paragraph
                if ((myChar != " ") && (myChar != "\t")) {break};
                // If we get here, paragraph has one or more spaces or tabs at start
                // Tabs at start are not valid in this job. Could be in others.
                // Always remove characters[0]: each removal shifts the live text
                // left, while k only indexes the stale myContent string
                theText.paragraphs[j].characters[0].remove();
            }
        }
    }
As I wrote this, various thoughts occurred to me, some of which I wrote into the script as comments because I might be tempted to use this script on other text. The issue of whether or not a tab is legit at the start or end of a paragraph is an important consideration. I've certainly been in situations where they were valid. But for this job, they are not. So the script removes them (if any are found; I inserted a couple while testing just to be sure).
Why, when looking at the end of the paragraph, didn't I just start at the character before the trailing return? Why go to the trouble of detecting and then ignoring it? Because some paragraphs don't have trailing returns, notably the ones at the end of a text flow (note that a text flow could be a cell in a table). For the vast majority of the cells in the table we're processing, this is not an issue, because I'm not going to trim the white space until after the table has been converted to text. For the first row of the table, though, waiting is not a good strategy, because we don't want paragraph style names that start or end in a space or tab.
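The trailing-return handling is easy to try out away from InDesign on plain strings. What follows is only a sketch under that assumption: trimParagraphString is a hypothetical helper, not part of the script, and it works on a string rather than on a paragraph object. It shows the same idea: set aside a final return if there is one, then strip spaces and tabs from both ends.

```javascript
// Hypothetical helper (not in the original script): trims spaces and tabs
// from both ends of one paragraph's contents, keeping a trailing return
// if one is present.
function trimParagraphString(contents) {
    var tail = "";
    var end = contents.length;
    // A paragraph at the end of a text flow (e.g. a table cell) may have no return
    if (end > 0 && contents.charAt(end - 1) == "\r") {
        tail = "\r";
        end--;
    }
    // Walk back over trailing spaces and tabs
    while (end > 0 && (contents.charAt(end - 1) == " " || contents.charAt(end - 1) == "\t")) {
        end--;
    }
    // Walk forward over leading spaces and tabs
    var start = 0;
    while (start < end && (contents.charAt(start) == " " || contents.charAt(start) == "\t")) {
        start++;
    }
    return contents.slice(start, end) + tail;
}
```

The same call handles both cases: a paragraph that ends in a return keeps it, and the contents of a table cell, which have no return, are simply trimmed at both ends.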
This means we have to revisit a function we thought was finished the other day and beef it up. Remember this:
    var myLim = myTable.columns.length;
    for (var j = 0; myLim > j; j++) {
        var myStyle = getParaStyle(theTable.cells[j].contents);
    }
Here, we passed off the untrimmed contents of the first row to be the names of our paragraph styles. It happens that the first cell had a space after its contents that shouldn't have been there, but since these things are invisible, one can hardly blame the data generator. So, now I have:
    trimText(theTable.cells[j].texts[0]);
added into the script immediately before the call to getParaStyle() so that the name passed to the paragraph style functions is not encumbered with leading or trailing white space.
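To see why the trim has to come first, here is a small sketch that runs outside InDesign. Everything in it is an assumption made for illustration: styleRegistry and getStyleByName stand in for the document's paragraph styles and for getParaStyle(), whose real body isn't shown here, and the sketch simply assumes that styles are found or created by exact name.

```javascript
// Stand-in for a document's paragraph styles, keyed by exact name (assumption)
var styleRegistry = {};

// Hypothetical substitute for getParaStyle(): returns the style with this
// name, creating it first if it doesn't exist yet
function getStyleByName(name) {
    if (!styleRegistry.hasOwnProperty(name)) {
        styleRegistry[name] = { name: name };
    }
    return styleRegistry[name];
}

// Strips leading and trailing spaces and tabs from a style name
function trimName(name) {
    return name.replace(/^[ \t]+|[ \t]+$/g, "");
}

var untrimmed = getStyleByName("Heading ");          // invisible trailing space
var trimmed = getStyleByName(trimName("Heading "));  // trimmed before lookup
// untrimmed and trimmed are two separate styles with different names
```

Under that assumption, the stray space quietly produces a second style that looks identical in any list of names, which is exactly the kind of thing that is hard to spot later.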