[RESOLVED] REGEX to replace relative paths with absolute paths

meltmedown · Jan 19, 2009

Hi guys,

I use the code below to replace all the relative links in an HTML file with absolute ones:

$string="
<a href="link.htm">1</a>
<a href="http://www.site.com/link.htm">2</a>
<a href="#link">3</a>
<a href="mailto:mail@example.com">4</a>
";

$link_noabs = "/<(a.*?href=\")([^http^#])([^>]+)>/";
$link_abs = '<\\1http://www.example.com/folder/\\2\\3 >';
$replace=preg_replace($link_noabs, $link_abs, $string);

The results are shown below:

<a href="http://www.example.com/folder/link.htm" target="_blank">1</a>
<a href="http://www.site.com/link.htm" target="_blank" >2</a>
<a href="#link" >3</a>
<a href="http://www.example.com/folder/mailto:mail@example.com" target="_blank" >4</a>

My problem is here:

$link_noabs = "/<(a.*?href=\")([^http^#])([^>]+)>/";

The regex above is meant to match all the links that do not begin with http or #. I need a regex that matches all the links that do not begin with http, # or mailto:, because otherwise the result looks like this:

<a href="http://www.example.com/folder/mailto:mail@example.com" target="_blank" >4</a>

I've tried the following code:

$link_noabs = "/<(a.*?href=\")([^http^#^mailto])([^>]+)>/";

but with no result.

Please help!

Thank you,

nrg_alpha · Jan 19, 2009

Here's how I would tackle this:

$str=' 
<a href="link.htm">1</a> 
<a href="http://www.site.com/link.htm">2</a> 
<a href="#link">3</a> 
<a href="mailto:mail@example.com">4</a> 
';

$arr = preg_split('#(<a\b[^>]+>)#', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
foreach($arr as &$val){
   if(preg_match('#<a\b#', $val) && !preg_match('~(?:http|#|mailto)~', $val)){
      $val = preg_replace('#^([^"]+")([^"]+)#', '$1'.'http://www.example.com/folder/'.'$2', $val);
   }
}
$arr = implode('', $arr);
echo $arr;

Output (when viewed as source code):

<a href="http://www.example.com/folder/link.htm">1</a> 
<a href="http://www.site.com/link.htm">2</a> 
<a href="#link">3</a> 
<a href="mailto:mail@example.com">4</a>

I think a big problem with what you have is this: [^http^#] in your pattern...
This is a negated character class, which basically says: any single character that is not an h, nor a t, nor a p, nor a caret, nor a hash. That doesn't work out as planned, because only the first caret (^) at the beginning acts as a negation; everything else is simply a list of characters that are not permissible.
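To make the character-class point concrete, here is a small sketch that tests single characters against the two classes from the question. The variable names are made up for illustration; the classes themselves are copied from the patterns above.

```php
<?php
// [^http^#] is ONE negated class: it matches exactly one character that is
// not 'h', 't', 'p', '^' or '#'. It never compares against the word "http".
$class = '/^[^http^#]$/';

$l     = preg_match($class, 'l'); // 1: 'l' is not in the excluded set
$h     = preg_match($class, 'h'); // 0: 'h' is excluded
$t     = preg_match($class, 't'); // 0: 't' is excluded too, so "top.htm" would be skipped
$caret = preg_match($class, '^'); // 0: the inner caret is a literal character
$m     = preg_match($class, 'm'); // 1: which is why "mailto:" gets rewritten

// Adding ^mailto only excludes more single characters (m, a, i, l, t, o),
// so even "link.htm" no longer matches because it starts with 'l'.
$extended = '/^[^http^#^mailto]$/';
$l2       = preg_match($extended, 'l'); // 0

var_dump($l, $h, $t, $caret, $m, $l2);
```

This would also explain why the second attempt appeared to do nothing: with 'l' excluded, the only relative link in the sample stops matching.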

Another problem is perhaps trying to pack it all into one regex. While it can be done, I am more of a fan of mixing regex with additional (and often faster) functions. You can still get the same results as with pure regex, but often with quicker execution.
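For comparison, a pure-regex version is possible with a negative lookahead. This is only a sketch, not the approach used in this post: the variable names are illustrative, and matching `https?:` (rather than just `http`) plus the case-insensitive `i` flag are assumptions added here.

```php
<?php
$html = '
<a href="link.htm">1</a>
<a href="http://www.site.com/link.htm">2</a>
<a href="#link">3</a>
<a href="mailto:mail@example.com">4</a>
';

// (?!...) is a negative lookahead: it asserts that the href value does NOT
// start with one of the listed prefixes, without consuming any characters.
$pattern = '~(<a\b[^>]*?\bhref=")(?!https?:|#|mailto:)([^"]*")~i';
$base    = 'http://www.example.com/folder/';

// ${1} and ${2} are the captured groups; the braces keep the group number
// unambiguous when it is followed by other text.
$result = preg_replace($pattern, '${1}' . $base . '${2}', $html);
echo $result;
```

Only the first link should come back prefixed with the base URL; the absolute, fragment and mailto links are left as they were.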

EDIT - Come to think of it, we can get rid of the first conditional preg_match of the if statement within the foreach loop and replace it with strpos instead...

if(strpos($val, '<a ') !== false && !preg_match('~(?:http|#|mailto)~', $val)){
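One caveat with the condition above: `~(?:http|#|mailto)~` looks anywhere inside the tag, so a relative link such as `page.htm#top` would be skipped as well. Below is a hedged sketch of the same split-and-loop idea wrapped in a function, with the prefix test tied to the start of the href value. The function name `absolutize_links` and its parameters are hypothetical, and the `https?:` prefix is an assumption.

```php
<?php
// Hypothetical helper; the name and signature are illustrative only.
function absolutize_links(string $html, string $base): string
{
    // Split on <a ...> tags and keep the tags themselves in the result.
    $parts = preg_split('#(<a\b[^>]+>)#', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as &$part) {
        // strpos() is a plain substring search, cheaper than a regex match.
        $isAnchor = strpos($part, '<a ') === 0;

        // Look only at the START of the href value, not at the whole tag.
        $isRelative = preg_match('~\bhref="(?!https?:|#|mailto:)~', $part) === 1;

        if ($isAnchor && $isRelative) {
            // Insert the base URL right after the opening quote of href.
            $part = preg_replace('~(\bhref=")~', '${1}' . $base, $part, 1);
        }
    }
    unset($part); // drop the reference left over from the foreach

    return implode('', $parts);
}

$base = 'http://www.example.com/folder/';
echo absolutize_links('<a href="page.htm#top">x</a>', $base), "\n";
echo absolutize_links('<a href="mailto:mail@example.com">4</a>', $base), "\n";
```

Note that an empty `href=""` would also be treated as relative here; whether that is desirable depends on the pages being processed.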

meltmedown · Jan 20, 2009

Many, many thanks nrg_alpha. Works great.

nrg_alpha · Jan 20, 2009

Perfect. Please don't forget to flag this thread as resolved (top menu, Thread Tools > Mark Thread Resolved).

Cheers
