help with RegExp: substitute dot-space with dot in url between <a> tags

marcnyc

I am just about now trying to learn RegExp so I am not quite as confident as I need to be to get this done...

I need to remove spaces after dots in URLs (and spaces after all other URL-allowed characters such as % @ ? etc. as well), but because these URLs are inside a variable that also contains other text, I need to do the replacement ONLY when the dot is between <a> tags; otherwise I'd have no spaces after regular dots in a normal text sentence.

I have come up with this:

$interview = preg_replace("/(\.\s+)/i",".",$interview);

Which I think is right for removing space(s) after dots in URLs, and I was thinking of simply repeating the same preg_replace (or str_replace) call for the other chars, but I don't know how to limit the substitution to only the <a> tags... In other words I want this preg_replace action to ONLY take place from "<a" to "</a>".

Can anybody tell me how in RegExp I can do something only from a certain set of chars to a another set of chars?

Thanks

diego25

Try this:

$interview = preg_replace("/<a.*(\.\s+).*>/i",".",$interview);

Diego

Weedpacket

A more robust approach (i.e., one that won't spill off into the text following the <a> tag, nor trash the rest of the tag) would be to replace

/<a([^.>]*)\. +/i

with

<a\1.

but it will only find one dot-spaces pair per <a> tag. You'll need to repeat this step until the string no longer changes:

$intemp = $interview;
while($interview != ($intemp = preg_replace("/<a([^>.]*)\. +/i", "<a\\1.", $intemp)))
	$interview = $intemp;
unset($intemp);

At least it will properly replace dot-space-space-space with dot in one pass, not three.
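
One caveat worth testing before relying on that loop: the class [^>.] excludes dots, so once the first dot in a tag has been cleaned the pattern can no longer step over it, and later dot-space pairs in the same tag may never be reached. Here's a minimal sketch of one possible variant, assuming PHP 7 or later. The function name is made up for illustration, and the lazy [^>]*? prefix is a swapped-in idea rather than the pattern above:

```php
<?php
// Hypothetical helper: repeat the replacement until the string stops changing.
function strip_dot_spaces_in_a_tags(string $text): string
{
    do {
        $previous = $text;
        // The captured prefix lazily skips anything except ">", dots included,
        // so each pass cleans the first remaining dot-space run in every tag.
        $text = preg_replace('/(<a\b[^>]*?)\. +/i', '$1.', $previous);
    } while ($text !== $previous);

    return $text;
}
```

It only ever looks inside the opening <a ...> tag, so dots in the link text and in the surrounding sentences are left alone.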

The problem is that I can't see how to use assertions so that all pairs can be found at once; I can't use a lookbehind assertion to see if there's no > between the previous <a and the dot-space pair, because I don't know how far back the <a might be, and lookbehind assertions are only fixed-length. But if I can't use lookbehind assertions to find the <a, then I end up having to declare it as part of the match - and I can only do that for one match at a time. I can't see how to use forward assertions either, not without having to count things - and regular expressions can't count.

So, yuck. But still probably preferable to breaking the string into pieces on the <a> tags, then running preg_replace on the bits of pieces that are inside, and then joining them all back together again:

$interview_a = preg_split('/<a/i',$interview);
$interview_ac = count($interview_a);
// The 0th element of the $interview_a array doesn't begin with the
// contents of an <a> tag. The others do.
for($i = 1; $i < $interview_ac; ++$i)
	$interview_a[$i] = preg_replace("/\. +(?=[^>]*>)/", ".", $interview_a[$i]);
$interview = join('<a', $interview_a);
unset($interview_ac, $interview_a);

PS: vBulletin, for reasons best known to itself, likes to mangle PCRE regexps. "." in the search string should be "\." and "\1" should be "\\1".

marcnyc

:-O :-O :-o :-)
woooooow, I can only wonder at your vast knowledge!!!

Sorry I missed your reply, for some reason I haven't got the email notification...

I haven't understood much of what you have so carefully explained, but I have copied the last code and tested it (after adding the slash to the dot that vBulletin removed) and it works GREAT!!! I am in disbelief!!!

With my very very narrow regexp knowledge I wanted to also add a solution to remove any space(s) before the dot that could be found inside the <a> but rather than trying to mess with your majestic code I simply repeated it below yours and changed the search string from:

/\. +(?=[^>]*>)/

to

/(\s+\.)/

like this:

$interview_a = preg_split('/<a/i',$interview);
$interview_ac = count($interview_a);
// The 0th element of the $interview_a array doesn't begin with the
// contents of an <a> tag. The others do.
for($i = 1; $i < $interview_ac; ++$i)
	$interview_a[$i] = preg_replace("/(\s+\.)/", ".", $interview_a[$i]);
$interview = join('<a', $interview_a);
unset($interview_ac, $interview_a);

I was just wondering if this is correct...

Thanks a million for your super knowledgable assistance!

PS By the way, I think if you paste the code like I did (without using vBulletin's tags) the code isn't messed with...

Weedpacket

Originally posted by marcnyc
I simply repeated it below yours and changed the search string from:
/\. +(?=[^>]*>)/
to
/(\s+\.)/
I was just wondering if this is correct...

Almost; you'll want to leave the (?=) assertion in, though, or it will run on past the closing > as well.

In fact, you could probably match for spaces before and after simultaneously:

/\s*\.\s*(?=[^>]*>)/

Note the change from \s+ to \s* - necessary, otherwise it will match only dots that have spaces both fore and behind.

Idle question; does that first piece of code I wrote work?

$intemp = $interview;
while($interview != ($intemp = preg_replace("/<a([^>.]*)\.\s+/i", "<a\\1.", $intemp)))
	$interview = $intemp;
unset($intemp);

Originally posted by marcnyc
PS By the way, I think if you paste the code like I did (without using vBulletin's tags) the code isn't messed with...

Which is why I generally use [CODE] tags to render regexps - easier to read in a fixed-width font.

marcnyc

I am just a little confused as to what I should use now...

I tried:

// removes space(s) before dot between <a> tags
$interview_a = preg_split('/<a/i',$interview);
$interview_ac = count($interview_a);
// The 0th element of the $interview_a array doesn't begin with the
// contents of an <a> tag. The others do.
for($i = 1; $i < $interview_ac; ++$i)
	$interview_a[$i] = preg_replace("/\s*\.\s*(?=[^>]*>)/", ".", $interview_a[$i]);
$interview = join('<a', $interview_a);
unset($interview_ac, $interview_a);

and it works...
Are you saying I should use this instead:

$intemp = $interview;
while($interview != ($intemp = preg_replace("/<a([^>.]*)\.\s+/i", "<a\\1.", $intemp)))
	$interview = $intemp;
unset($intemp);

Also I want to make this check for all characters allowed in a URL... Is there any such list somewhere? I know at least @ % & ? ' are allowed in URLs but I have seen weird looking URLs so I would like to make it comprehensive...

Once I find the list somewhere should I replace the above (whichever is the one) code for every character or is there a way to do it all at once?

I have tried this:

// removes space(s) before dot between <a> tags
$interview_a = preg_split('/<a/i',$interview);
$interview_ac = count($interview_a);
// The 0th element of the $interview_a array doesn't begin with the
// contents of an <a> tag. The others do.
for($i = 1; $i < $interview_ac; ++$i)
$interview_a[$i] = preg_replace("/\s[.|\,]\s(?=[^>]*>)/", "/[.|\,]/", $interview_a[$i]);
$interview = join('<a', $interview_a);
unset($interview_ac, $interview_a);

But even if the search string is correct, I am obviously confused about what the syntax of the replace string should be when the replacement is not one character but several characters, each replacing the same character with space(s) before and/or after it in the search string.

Thanks

Weedpacket

That's okay, I did find a one-line solution in the end. It uses the /e modifier; it first uses preg_replace() to match the <a> tag:

/<a[^>]+>/e

and puts the matched bit through a call to preg_replace()

/\s*\.\s*/

which does the space-dot/dot-space business - replacing the matched string with just a dot.

Slightly eye-twisting at first glance, but the individual regexps are simple enough on their own - the most complicated part is keeping count of backslashes (which is why I'm not going to use syntax highlighting here!).

$interview = preg_replace(
	"/<a[^>]+>/e",
	"preg_replace('/\\s*\\.\\s*/', '.', '\\0')",
	$interview
);
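
For anyone reading this on a newer PHP: the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so the one-liner above won't run there. A minimal sketch of the same two-step idea using preg_replace_callback(), assuming PHP 7 or later; the function name is made up for illustration:

```php
<?php
// Hypothetical helper: the outer pattern finds each opening <a ...> tag,
// and the callback runs the inner dot-space cleanup on that tag only.
function clean_dots_in_a_tags(string $text): string
{
    return preg_replace_callback(
        '/<a\b[^>]+>/i',
        function (array $m): string {
            // $m[0] is the whole matched tag, passed through untouched.
            return preg_replace('/\s*\.\s*/', '.', $m[0]);
        },
        $text
    );
}
```

Because the inner regexp lives in ordinary PHP code rather than in a string that gets evaluated, there is only one level of backslash escaping to keep track of.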

marcnyc

Wow, you have just anticipated my reply to my earlier post by a few minutes... Based on your previous solution (which still works great) I had found the ultimate solution for myself like this:

$replace_url = array (".",
",",
"$",
"-",
"_",
"+",
"?",
"!",
"*",
"'",
"(",
")",
"@",
"#",
"&",
"/",
"|",
"%");

// $name_file = preg_replace ($search, $replace, $name_artist);
// removes space(s) before and after % between <a> tags
$interview_a = preg_split('/<a/i',$interview);
$interview_ac = count($interview_a);
// The 0th element of the $interview_a array doesn't begin with the contents of an <a> tag. The others do.
for($i = 1; $i < $interview_ac; ++$i)
$interview_a[$i] = preg_replace($search_url, $replace_url, $interview_a[$i]);
$interview = join('<a', $interview_a);
unset($interview_ac, $interview_a);

This works great (I tested it inside)... So should I use this or your new thing which I haven't tested yet?

Weedpacket

As for characters that can appear in URLs, the full list is

-/?$_.+!*'(),;:@&=
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789
%

Although it should be remembered of course that many of them have special meanings wrt URLs. %, for example, should only appear as the start of an escape sequence - ie., followed by something of the form [0-9A-Fa-f]{2}; ? should only appear once (to separate an hpath from a search string); and so forth.

marcnyc

Looks like I only missed = : ;

So should I use what I've got with your super-valuable help or should I go for your new solution??? What are the upsides and downsides???

Man, btw, you rock! You are a genius!

Weedpacket

Originally posted by marcnyc
Looks like I only missed = : ;

So should I use what I've got with your super-valuable help or should I go for your new solution??? What are the upsides and downsides???

Dunno, I like my later version - it's a fair bit cleaner (doesn't involve any additional variables) and shorter; I don't know if it's faster (though it does work), and it's debatable whether it's easier to read (I'm inclined to think the new code is slightly easier, but that's just my personal opinion - yours is more important 🙂).

BTW, your search and replace arrays can be folded into a single pair of expressions:

$search = "'\s*([urlcharacters])\s*(?=[^>]*>)'";

(where urlcharacters stands for the list of URL characters; you'll probably need to escape some of the characters in that character class), and

$replace = "\\1";

(i.e., replace the matched string with the contents of the first ()-subexpression).
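
To make the character-class idea concrete, here's a small sketch, assuming PHP 7 or later. The function name is hypothetical, the character list is the punctuation from the list of URL characters quoted in this thread, and preg_quote() is used to do the escaping instead of doing it by hand:

```php
<?php
// Hypothetical helper: drop whitespace around any URL punctuation character.
function collapse_around_url_chars(string $piece): string
{
    // preg_quote() escapes whatever is special in a regexp, plus the "/"
    // delimiter given as its second argument.
    $class   = preg_quote('-/?$_.+!*\'(),;:@&=%', '/');
    $pattern = '/\s*([' . $class . '])\s*/';

    // $1 is whichever single character the class matched, so one
    // replacement string covers every character in the list.
    return preg_replace($pattern, '$1', $piece);
}
```

This sketch leaves out the (?=[^>]*>) assertion, so it should only be handed text that is already known to be part of a URL.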

My new code would be rewritten as

$interview = preg_replace(
	"/<a[^>]+>/e",
	"preg_replace('/\\s*([urlcharacters])\\s*/', '\\\\1', '\\0')",
	$interview
);

See, that's the eye-twisting thing. \\\\1 is a back reference that belongs to the inner regexp, while \\0 belongs to the outer one. That's why \\\\1 is more heavily escaped - to prevent the outer regexp messing with it.

No, I'm not a genius - I've just studied regexps on a professional basis in the past. Got a fair idea of what they can and can't do in the process.

marcnyc

Ok, I am not sure why I can put all characters into one string, don't they have to be in an array to be replaced by the corresponding character from the replacement array? If I write the search array (which in your case is a variable, right?) then how will the characters be matched to the same chars contained in the $replacement array?

I will try your new code tomorrow by the way... It's 3am here so I am off to bed now...

And yes you ARE a genius and if you don't wanna be a genius in my book (I see your point: study study study...) then you are a very generous person (I still can't see why you waste so much of your valuable time to help poor bastards like me solving their problems)...

marcnyc

Hey WP, I am sorry to say but your new code didn't work for me... I replaced the old one with your new one and I tried it in both configurations but it didn't do the trick... Have you tried it yourself? Anyway, it's not important because I can use your old code but I thought I should let you know... but thanks a lot anyway!

marcnyc

Hi Weedpacket, i am sorry to bother you again with this but it looks like the code actually has a flaw...

Your code I am using (I couldn't get the other one to work) is this:

$search_url = array ("'\s*\.\s*(?=[^>]*>)'si",
"'\s*\,\s*(?=[^>]*>)'si",
"'\s*\$\s*(?=[^>]*>)'si",
"'\s*-\s*(?=[^>]*>)'si",
"'\s*_\s*(?=[^>]*>)'si",
"'\s*\+\s*(?=[^>]*>)'si",
"'\s*\=\s*(?=[^>]*>)'si",
"'\s*:\s*(?=[^>]*>)'si",
"'\s*\;\s*(?=[^>]*>)'si",
"'\s*\?\s*(?=[^>]*>)'si",
"'\s*!\s*(?=[^>]*>)'si",
"'\s*\*\s*(?=[^>]*>)'si",
"'\s*\'\s*(?=[^>]*>)'si",
"'\s*\(\s*(?=[^>]*>)'si",
"'\s*\)\s*(?=[^>]*>)'si",
"'\s*@\s*(?=[^>]*>)'si",
"'\s*#\s*(?=[^>]*>)'si",
"'\s*\&\s*(?=[^>]*>)'si",
"'\s*\/\s*(?=[^>]*>)'si",
"'\s*\|\s*(?=[^>]*>)'si",
"'\s*\%\s*(?=[^>]*>)'si");

$replace_url = array (".",
",",
"$",
"-",
"_",
"+",
"=",
":",
";",
"?",
"!",
"*",
"'",
"(",
")",
"@",
"#",
"&",
"/",
"|",
"%");

// removes space(s) before and after all URL-allowed characters between <a> tags
$interview_a = preg_split('/<a/i',$interview);
$interview_ac = count($interview_a);
// The 0th element of the $interview_a array doesn't begin with the contents of an <a> tag. The others do.
for($i = 1; $i < $interview_ac; ++$i)
$interview_a[$i] = preg_replace($search_url, $replace_url, $interview_a[$i]);
$interview = join('<a', $interview_a);
unset($interview_ac, $interview_a);

But today, doing final tests, I found that it does what it is supposed to do only until the SECOND URL it finds... In other words, if there is only ONE URL in the text, it removes the spaces before and after the dots in the URL and leaves alone the dots in the text outside the URL (as wanted). But if another URL comes later in the text, all the spaces before and after all the dots that come after the second URL are removed as well, and the entire text becomes messed up (one big block with no spaces before or after punctuation and the other signs you see in the array above)...

If you want me to use the other code of yours I'll do but maybe you can test it to see why I couldn't get it to work...

Please help me on this one so I can get it over with this script and finally finish it! Thanks a lot!

Here is what actually happens to the string:

STRING BEFORE:
This is a test. It contains a <a href="http://www . chaindlk . com">LI . NK</a> and more text. But more <a
href="li . nks">LI . NKS</a> as well! Do you like, it? Or. D on't you? T his is; a test.

STRING AFTER:
This is a test. It contains a <a href="http://www.chaindlk.com">LI.NK</a> and more text. But more <a href="li.nks">LI.NKS</a> as well!Do you like,it?Or.Don't you?This is;a test.

As you can see at the beginning (before the first link and between the first and the second link) everything is fine. After the second link all spaces are removed inside (as wanted) AND outside (after) the <a> tags (NOT WANTED!)
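
One possible explanation, offered as a guess: the (?=[^>]*>) assertion only asks whether some ">" follows later in the same piece, so if the real interview text has any further tag after the last link (a <br>, say), every dot before that tag qualifies too. A small sketch that shows the effect, assuming PHP 7 or later; the function name and sample pieces are made up for illustration:

```php
<?php
// Hypothetical helper: apply the dot pattern to one piece produced by
// splitting the text on "<a", as in the code above.
function clean_piece(string $piece): string
{
    return preg_replace('/\s*\.\s*(?=[^>]*>)/', '.', $piece);
}

// No tag after the link: the trailing sentence is left alone.
$noLaterTag = ' href="li . nks">LI . NKS</a> as well! Or. Not.';

// Same piece with a <br> at the end: now the sentence dots match as well.
$laterTag = ' href="li . nks">LI . NKS</a> as well! Or. Not.<br>';

echo clean_piece($noLaterTag), "\n";
echo clean_piece($laterTag), "\n";
```

If that is what is happening, the lookahead alone can't tell "inside the link" from "somewhere before another tag".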

Weedpacket

Natch. So

$interview = preg_replace(
	"/<a[^>]+>/e",
	"preg_replace('/\\s*([-,\\\$_+=:;?!*\\'()@#&/|%])\\s*/', '\\\\1', '\\0')",
	$interview
);

doesn't work? (No, I didn't test it). I think I've got the
escaping right. Maybe \\1 instead of \\\\1, and maybe \\$ instead of
\\\$, or \' instead of \\'. One of the drawbacks of trying to quote
strings inside strings 🙁

But I'm wondering what exactly it is you're wanting to do - maybe
another approach is called for. Strip out all the spaces inside an <a>
tag's href attribute?

marcnyc

If you have a minute to test your code it'd be great cause I can only try different things but it's like stumbling in the dark and just guessing different versions at this point... Sometimes I think I understood something and then I realize I didn't so I have to re-read the whole thing... anyway...

Ok here is what I need to do exactly:

In the past month I have been creating my first ever PHP script for my non-profit electronic music webzine www.chaindlk.com
(by the way you will be credited for this script for all your help).

This script allows my contributors to use an online wizard to create interviews with bands and because I am the one (webmaster) who ultimately has to put the interview online and make sure it looks right I have to perform several adjustments...

I have one final string that contains an interview ($interview) with very few html tags (basically only and links' <a> and occasional and ).

Because I am sick and tired of people who type stuff like:

Hello,how're you?
I am well.And you?
I like it here;but more there!Do you agree?No?
I think so too (but I am not sure,) maybe , maybe not
Or maybe even so ,so terribly wrong.

These are just some of the common mistakes people make. In case someone reading this post doesn't know what I am talking about, I am referring to no space after commas and dots, spaces before dots and commas, or commas and dots inside parentheses instead of outside... All these things are very wrong, not only aesthetically but also technically, because text reflowing in a webpage does not occur correctly (if you have align="justify" in your or <div> tags, the "Hello,how're" for example will be considered an entire word, and if it doesn't fit on one line the entire thing will go to the next line, as opposed to "Hello," on one line and "how're you?" on the next - not to mention how ugly it is to see "Hello" on one line and ",how're you?" on the next because somebody wrote "Hello ,how're you?" with the space before the comma).

Enough of this...

To correct all this I have the following actions performed on my variable:

$interview = str_replace(' .','.',$interview);
$interview = str_replace(' ,',',',$interview);
$interview = str_replace(' ;',';',$interview);
$interview = str_replace(' :',':',$interview);
$interview = str_replace(' ?','?',$interview);
$interview = str_replace(' !','!',$interview);
$interview = str_replace('.','. ',$interview);
$interview = str_replace(',',', ',$interview);
$interview = str_replace(';','; ',$interview);
$interview = str_replace(':',': ',$interview);
$interview = str_replace('?','? ',$interview);
$interview = str_replace('!','! ',$interview);

Now the problem with this is that it fucks up all my URLs contained inside the variable because all the dots in the text are replaced with a dot and a following space and the correct links will stop working...

I also have many other actions performed to rectify the text so I won't go into that...

THUS the necessity to remove all spaces before and after all the URL-allowed characters ONLY when they are actually part of a URL (therefore, inside the <a> tags).

I hope this explains my needs clearly.

marcnyc

I have tried your code and I got this:

Warning: Unknown modifier '|' in D:\WWW\chaindlk.com\httpdocs----\formatting.inc(8) : regexp code on line 1

so I escaped the | (one, two and three ways) and kept getting:

Warning: Unknown modifier '\' in D:\WWW\chaindlk.com\httpdocs----\formatting.inc(8) : regexp code on line 1

Then I have tried all your suggested substitutions and I kept getting similar errors except in one case where I got this error:

Parse error: parse error in D:\WWW\chaindlk.com\httpdocs----\formatting.inc(8) : regexp code on line 1

Fatal error: Failed evaluating code: preg_replace('/\s([-,$_+=:;?!'()@#&/|%])\s*/','\1','') in D:\WWW\chaindlk.com\httpdocs----\formatting.inc on line 8

I would really appreciate your guidance in this dark path :-(

marcnyc

Why don't you just test your code with this text string:

This is a test. It contains a <a href="http://www . chaindlk . com">LI . NK</a> and more text. But more <a
href="li . nks">LI . NKS</a> as well! Do you like, it? Or. Don't you? This is; a test.

Analyze the outcome (spaces after and before dots, commas and other punctuation signs) and if the result is anything different from the following it means something didn't work:

This is a test. It contains a <a href="http://www.chaindlk.com">LI.NK</a> and more text. But more <a href="li.nks">LI.NKS</a> as well! Do you like, it? Or. Don't you? This is; a test.

marcnyc

Weedpacket, I have further tried your code without all the URL chars which were obviously making it messy to troubleshoot:
I used this instead (same code, just without all the URL chars and the dot only):

$interview = preg_replace(
"/<a[^>]+>/e",
"preg_replace('/\s([.])\s/','\\1','\0')",
$interview
);

And it did strange things to my text...

Not only did it not do the replacement in the text displayed as the link, but it also added slashes to the double-quotes... Slashes can be stripped of course, but the spaces have been removed only before and after the dots in the href argument (www.chaindlk.com and li.nks) and not in the text appearing as the link (LI . NK and LI . NKS). The result is:

This is a test. It contains a <a href=\"http://www.chaindlk.com\">LI . NK</a> and more text. But more <a href=\"li.nks\">LI . NKS</a> as well! Do you like, it? Or. Don't you? This is; a test. 
This should give us a; space!

I am not good at RegExp but I am guessing the problem is /<a[^>]+>/e

It looks to me as if it only looks for dots from <a href= to the next >; instead it should do so from <a href to </a>, which would include the text displayed in the link!

With my very bare knowledge I have attempted this:

$interview = preg_replace(
"/<a(.*)<\/a>/esi",
"preg_replace('/\s+([.])\s+/','\\1','\0')",
$interview
);

but of course Murphy's Law prevented it from working ;-)

BTW, I have a question: if we need to escape everything in the second preg_replace, shouldn't we escape the RegExp delimiters / as well, like this?

$interview = preg_replace(
"/<a(.*)<\/a>/esi",
"preg_replace('\/\s+([.])\s+\/','\\1','\0')",
$interview
);

I tried that too but it didn't work either (Murphy!!!)

Weedpacket

Originally posted by marcnyc
I have tried your code and I got this:

Warning: Unknown modifier '|' in D:\WWW\chaindlk.com\httpdocs----\formatting.inc(8) : regexp code on line 1

Whoops - yeah, the / should have been escaped (i.e., it should have been written as \\/ in the character class as well - one \ because it's in a regexp, and another because the \ needs to be escaped 'cos it's in a string). Since / is being used for the regexp delimiter, when the regexp engine saw that second / in there it thought it had come to the end of the regexp. The characters after that were therefore assumed to be modifiers - and guess what? | isn't a known modifier.

Can we safely assume that your href's are all of the form

href="this.is.a.URL" (with or without spurious spaces around those dots)? Basically, we don't really care whether there are dots or not around the place - we just don't want any spaces (spaces aren't valid URL characters!).

$interview =
preg_replace(
	"/(?<=href=\")([^\"]*)(?=\")/ie",
	'preg_replace(
		"/\s/",
		"",
		"\1")',
	$interview);
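
On PHP 7 and later, where the /e modifier is gone, the same "strip every space inside href" idea could be sketched with preg_replace_callback(). The function name is made up for illustration, and it assumes the href values are always wrapped in double quotes:

```php
<?php
// Hypothetical helper: remove all whitespace from double-quoted href values.
function strip_href_whitespace(string $html): string
{
    return preg_replace_callback(
        // Lookbehind and lookahead keep href=" and the closing quote out of
        // the match, so $m[0] is just the attribute value.
        '/(?<=href=")[^"]*(?=")/i',
        function (array $m): string {
            return preg_replace('/\s+/', '', $m[0]);
        },
        $html
    );
}
```

As far as I know, /e also escaped quotes in the text it substituted before evaluating it, which may be where stray backslashes on quotes come from; a callback receives the match untouched.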

I'm not sure that the spurious backslashes on the quotes are coming from the preg_replace, since " is never mentioned in any of them.

Your multiple str_replace()s can be replaced with a couple of preg_replace()s thus:

$interview = preg_replace("/\s+([.,;:?!])/","\\1",$interview);
$interview = preg_replace("/([.,;:?!])(?! )/","\\1 ",$interview);

The first strips off any whitespace preceding the punctuation, and the second adds a single space following it (but only if there isn't one there already).
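
A quick sketch of those two steps wrapped in a function, assuming PHP 7 or later; the function name is made up, and the patterns are written with / delimiters:

```php
<?php
// Hypothetical helper: normalise spacing around sentence punctuation.
function tidy_punctuation(string $text): string
{
    // Step 1: drop any whitespace that comes before a punctuation mark.
    $text = preg_replace('/\s+([.,;:?!])/', '$1', $text);

    // Step 2: add one space after a mark unless a space is already there.
    return preg_replace('/([.,;:?!])(?! )/', '$1 ', $text);
}

echo tidy_punctuation("Hello ,how're you?I am well .And you?"), "\n";
```

Note that step 2 also adds a space after a mark at the very end of the string, and that run on its own it will still put spaces into URLs, which is the whole reason the links need protecting.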

OK: Change of tack
"Plan to throw one away; you will, anyhow." (Fred Brooks, The Mythical Man-Month, Chapter 11)
If you plan to throw one away, you will throw away two.

A quite different method is to pull all the links out, work on them or the remaining text in isolation from each other (so that messing with one poses no danger to the other), and then put the links back in when you're done. I got the idea for this from reading the Bugzilla source code:

The idea is to take the links out of the text, clean up the text, and then put the links back in.

First of all, we'll need an array to keep the links in.

$links=array();

Now we'll be replacing the links in the text with special notes, which (following Bugzilla) will be of the form ##n##, where n is the appropriate index in the $links[] array. To prevent already-present "##" strings from screwing us up, we'll make sure that there are none such in the text before we begin:

$interview = str_replace('#','%#',$interview);

Now the hairy bit. We'll find and replace <a> tags with ##n##, putting the tag into $links[n] as we go.

$count=0;
$placeholders = array();
$links = array();
while(preg_match("/pattern to match things/", $interview, $matches))
{	$placeholders[$count]="##$count##";
	$interview = preg_replace("/pattern to match things/",
	        $placeholders[$count],
	        $interview,
	        1); // Not all at once!
	$links[$count++]=$matches[0];
}

Where /pattern to match things/ is a regexp suitable for identifying <a> tags:

/<a[^>]+>/is

should do the trick. Yes, it will match <a name="thing"> as well, but there's no harm done there - we're taking these bits out so that they don't get mangled, after all.

I also set up an extra array $placeholders to contain the ##n## bits. Not really necessary, since they can be reconstructed, but it will make things more convenient later.

With that loop finished, the links have all been evacuated from the $interview text and are sitting in the $links array for safekeeping. Now you can go hard on the interview text, tidying up the punctuation and so forth - without having to worry about whether or not you're inside a link tag (because the answer is "no").

When you've done that, it's time to put the links back. Because I had the foresight to create a $placeholders array, this can be done in a single line:

$interview=str_replace($placeholders, $links, $interview);

And one last job, unescape the # characters:

$interview = str_replace('%#','#',$interview);
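
Putting the steps together, here's a minimal end-to-end sketch, assuming PHP 7 or later. The function name and the $tidy parameter are made up for illustration, and preg_replace_callback() is swapped in for the preg_match()/preg_replace() loop above to collect the tags in one pass:

```php
<?php
// Hypothetical helper: hide opening <a ...> tags behind ##n## placeholders,
// run $tidy on what is left, then put the tags back.
function tidy_outside_links(string $text, callable $tidy): string
{
    // Escape existing "#" so the text can't already contain a "##n##".
    $text = str_replace('#', '%#', $text);

    $links = [];
    $placeholders = [];
    $text = preg_replace_callback(
        '/<a\b[^>]+>/is',
        function (array $m) use (&$links, &$placeholders): string {
            $n = count($links);
            $links[] = $m[0];
            $placeholders[] = "##{$n}##";
            return "##{$n}##";
        },
        $text
    );

    // The links are out of harm's way; clean up the remaining text.
    $text = $tidy($text);

    // Restore the tags, then unescape the "#" characters.
    $text = str_replace($placeholders, $links, $text);
    return str_replace('%#', '#', $text);
}
```

Any callable can be passed as $tidy, for instance a function that applies the punctuation fixes. Only the opening tag is protected here, so the text shown inside the link still gets tidied like the rest.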

Obviously, one could do stuff to the contents of the $links array before putting them back into the main $interview text. You can also have several different arrays each containing different things - links, code blocks, etc. You'd have a separate $placeholders and $links array for each kind of thing, and the placeholder text could be identified by name - "##linkn##" or "##coden##", say.

For example, the same approach could be used to temporarily remove entire links (matching #<a[^>]+>(((?!</a>).)*)</a>#is) so that you can match other bits to put links around them.