Originally posted by marcnyc
I have tried your code and I got this:
Warning: Unknown modifier '|' in D:\WWW\chaindlk.com\httpdocs----\formatting.inc(8) : regexp code on line 1
Whoops - yeah, the / should have been escaped in the character class as well (i.e., written as \\/ - one \ because it's in a regexp, and another because the \ itself needs escaping 'cos it's in a string). Since / is being used as the regexp delimiter, when the regexp engine saw that second / in there it thought it had come to the end of the regexp. The characters after that were therefore assumed to be modifiers - and guess what? | isn't a known modifier.
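To see the delimiter problem in isolation, here's a small sketch. The sample text and the patterns are made up for illustration (they are not the ones from the original code); the point is only that you either escape the / inside the pattern or pick a delimiter that doesn't occur in it.

```php
<?php
// Hypothetical sample text and patterns, only to illustrate the delimiter issue.
$text = 'either/or';

// Broken: the second / ends the regexp, so "|]/" is read as modifiers.
// $broken = "/[/|]/";

// Fix 1: escape the slash. "\\/" in a double-quoted string reaches the regexp engine as \/
$escaped = "/[\\/|]/";

// Fix 2: use a delimiter that does not appear in the pattern.
$altDelimiter = '#[/|]#';

$viaEscaped = preg_replace($escaped, ' ', $text);
$viaAltDelimiter = preg_replace($altDelimiter, ' ', $text);
```

Both versions replace the slash with a space; which one reads better is mostly a matter of taste.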
Can we safely assume that your hrefs are all of the form
href="this.is.a.URL" (with or without spurious spaces around those dots)? Basically, we don't really care whether there are dots or not around the place - we just don't want any spaces (spaces aren't valid URL characters!).
$interview =
preg_replace(
"/(?<=href=\")([^\"]*)(?=\")/ie",
'preg_replace(
"/\s/",
"",
"\1")',
$interview);
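One caveat: the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so on a current PHP the same idea has to go through preg_replace_callback() instead. Here's a rough sketch of how that might look. The function name strip_href_spaces and the sample link are my own inventions for the example, and I'm assuming the hrefs are always double-quoted.

```php
<?php
// Hypothetical helper name; assumes double-quoted href attributes.
function strip_href_spaces(string $html): string
{
    return preg_replace_callback(
        '/(?<=href=")([^"]*)(?=")/i',
        function (array $m): string {
            // $m[1] is the attribute value; drop every whitespace character from it.
            return preg_replace('/\s/', '', $m[1]);
        },
        $html
    );
}

// Made-up sample input.
$sample = '<a href="www. example .com/page">here</a>';
$cleaned = strip_href_spaces($sample);
```

Only the text between the quotes of an href is touched; the link text and everything else pass through unchanged.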
I'm not sure that the spurious backslashes on the quotes are coming from the preg_replace, since " is never mentioned in any of them.
Your multiple str_replace() calls can be replaced with a couple of preg_replace()s thus:
$interview = preg_replace("/\s+([.,;:?!])/","\\1",$interview);
$interview = preg_replace("/([.,;:?!])(?! )/","\\1 ",$interview);
The first strips off any whitespace preceding the punctuation, and the second adds a single space following it (but only if there isn't one there already).
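To make that concrete, here's a small sketch that wraps the two replacements in a function and runs them over a made-up sentence. The name tidy_punctuation and the sample string are just for illustration, and the patterns are written with / delimiters.

```php
<?php
// Hypothetical helper wrapping the two replacements described above.
function tidy_punctuation(string $text): string
{
    // Remove whitespace that sits directly before a punctuation mark.
    $text = preg_replace('/\s+([.,;:?!])/', '\\1', $text);
    // Add one space after a punctuation mark unless a space already follows it.
    $text = preg_replace('/([.,;:?!])(?! )/', '\\1 ', $text);
    return $text;
}

// Made-up sample input.
$tidied = tidy_punctuation('Hello ,world !How are you ?Fine.');
```

Note that punctuation at the very end of the text also gets a space appended, and that runs such as "..." or "?!" get split up, so you may want an rtrim() or a stricter lookahead if that matters for your text.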
OK: Change of tack
"Plan to throw one away; you will, anyhow." (Fred Brooks, The Mythical Man-Month, Chapter 11)
If you plan to throw one away, you will throw away two.
A quite different method is to pull all the links out, work on them or the remaining text in isolation from each other (so that messing with one poses no danger to the other), and then put the links back in when you're done. I got the idea for this from reading the Bugzilla source code:
The idea is to take the links out of the text, clean up the text, and then put the links back in.
First of all, we'll need an array to keep the links in.
$links=array();
Now we'll be replacing the links in the text with special notes, which (following Bugzilla) will be of the form ##n##, where n is the appropriate index in the $links[] array. To prevent already-present "##" strings from screwing us up, we'll make sure that there are none such in the text before we begin:
$interview = str_replace('#','%#',$interview);
Now the hairy bit. We'll find and replace <a> tags with ##n##, putting the tag into $links[n] as we go.
$count=0;
$placeholders = array();
$links = array();
while (preg_match("/pattern to match things/", $interview, $matches))
{
    $placeholders[$count] = "##$count##";
    $interview = preg_replace("/pattern to match things/",
                              $placeholders[$count],
                              $interview,
                              1); // Not all at once!
    $links[$count++] = $matches[0];
}
Where /pattern to match things/ is a regexp suitable for identifying <a> tags:
/<a[^>]+>/is
should do the trick. Yes, it will match <a name="thing"> as well, but there's no harm done there - we're taking these bits out so that they don't get mangled, after all.
I also set up an extra array $placeholders to contain the ##n## bits. Not really necessary, since they can be reconstructed, but it will make things more convenient later.
With that loop finished, the links have all been evacuated from the $interview text and are sitting in the $links array for safekeeping. Now you can go hard on the interview text, tidying up the punctuation and so forth - without having to worry about whether or not you're inside a link tag (because the answer is "no").
When you've done that, it's time to put the links back. Because I had the foresight to create a $placeholders array, this can be done in a single line:
$interview=str_replace($placeholders, $links, $interview);
And one last job, unescape the # characters:
$interview = str_replace('%#','#',$interview);
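Putting the steps together, the whole round trip might look something like the sketch below. The wrapper name clean_outside_links, the idea of passing the cleanup in as a callback, and the sample text with its example.com URL are all my own assumptions for the sake of a runnable example; the individual steps are the ones described above.

```php
<?php
// Hypothetical wrapper: $cleanup is any callable that tidies plain text.
function clean_outside_links(string $text, callable $cleanup): string
{
    // Escape existing # so no "##n##" sequence can occur in the input.
    $text = str_replace('#', '%#', $text);

    $placeholders = array();
    $links = array();
    $count = 0;
    // Pull out one <a ...> tag per pass and leave ##n## in its place.
    while (preg_match('/<a[^>]+>/is', $text, $matches)) {
        $placeholders[$count] = "##$count##";
        $text = preg_replace('/<a[^>]+>/is', $placeholders[$count], $text, 1);
        $links[$count++] = $matches[0];
    }

    // The tags are out of the way, so the cleanup cannot touch them.
    $text = $cleanup($text);

    // Put the tags back, then unescape the # characters.
    $text = str_replace($placeholders, $links, $text);
    return str_replace('%#', '#', $text);
}

// Made-up sample input and a punctuation cleanup passed in as the callback.
$input = 'See <a href="http://example.com/a.b?c=d">this page</a> ,issue #5 !Thanks';
$output = clean_outside_links($input, function (string $t): string {
    $t = preg_replace('/\s+([.,;:?!])/', '\\1', $t);
    return preg_replace('/([.,;:?!])(?! )/', '\\1 ', $t);
});
```

The dots and the question mark inside the href come through untouched, while the punctuation in the surrounding text gets tidied.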
Obviously, one could do stuff to the contents of the $links array before putting them back into the main $interview text. You can also have several different arrays each containing different things - links, code blocks, etc. You'd have a separate $placeholders and $links array for each kind of thing, and the placeholder text could be identified by name - "##linkn##" or "##coden##", say.
For example, the same approach could be used to temporarily remove entire links (matching #<a[^>]+>(((?!</a>).)*)</a>#is) so that you can match other bits to put links around them.
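Here's a sketch of that last idea, using the whole-link pattern with [^>] in the opening tag. Instead of the while loop it does the extraction in one pass with preg_replace_callback(), which is a swapped-in technique rather than anything from the loop above. The helper name extract_whole_links, the ##linkN## numbering, and the sample text and URLs are hypothetical, and the sketch assumes any # in the text has already been escaped to %#.

```php
<?php
// Hypothetical helper: swaps each complete <a ...>...</a> element for ##linkN##.
// Assumes # has already been escaped to %# beforehand.
function extract_whole_links(string $text, array &$placeholders, array &$links): string
{
    return preg_replace_callback(
        '#<a[^>]+>(((?!</a>).)*)</a>#is',
        function (array $m) use (&$placeholders, &$links): string {
            $n = count($links);
            $placeholders[$n] = "##link$n##";
            $links[$n] = $m[0];
            return $placeholders[$n];
        },
        $text
    );
}

$placeholders = array();
$links = array();
// Made-up sample input (contains no # characters).
$text = 'Ask <a href="http://example.com/">Bob</a> or Bob himself.';
$stripped = extract_whole_links($text, $placeholders, $links);
// Only the Bob outside the existing link is still visible to further processing.
$linked = str_replace('Bob', '<a href="http://example.com/bob">Bob</a>', $stripped);
$restored = str_replace($placeholders, $links, $linked);
```

Because the existing link is hidden behind its placeholder, the "Bob" inside it doesn't get wrapped in a second link.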