preg_match problems with PHP4

Neff

I'm using the preg_match function in PHP, and I want to grab a link tag
from a page along with the preceding and following text to establish a
context.
If the link is inside a set of <div> tags, I only want to grab the text
within that div; if there are no <div>s, I want to grab the text all
the way out to the <body> tags.

For example, if the HTML is:

[FONT="Courier New"]<html>
<head>
</head>
<body>
aaa aaa aaa
<div>
bbb bbb bbb
<a href='http://www.domain.com/index.html'>link text</a>
ccc ccc ccc
</div>
ddd ddd ddd
</body>
</html>[/FONT]

I want to match three groups:
1: [FONT="Courier New"]bbb bbb bbb[/FONT]
2: [FONT="Courier New"]<a href='http://www.domain.com/index.html'>link text</a>[/FONT]
3: [FONT="Courier New"]ccc ccc ccc[/FONT]

On the other hand, if the divs weren't there and the HTML was:

[FONT="Courier New"]<html>
<head>
</head>
<body>
aaa aaa aaa
bbb bbb bbb
<a href='http://www.domain.com/index.html'>link text</a>
ccc ccc ccc
ddd ddd ddd
</body>
</html>[/FONT]

I'd want to match:
1: [FONT="Courier New"]aaa aaa aaa bbb bbb bbb[/FONT]
2: [FONT="Courier New"]<a href='http://www.domain.com/index.html'>link text</a>[/FONT]
3: [FONT="Courier New"]ccc ccc ccc ddd ddd ddd[/FONT]

The expression I'm working with is:

[FONT="Courier New"]#.*<(?:div|body).*?>(.*?)(<a\s[^>]*?href\s*?=\s*?["']{0,1}http://www.domain.com/index.html['"]{0,1}.*?>.*?</a>)(.*?)</(?:div|body)#i[/FONT]

This is nearly there: it works as expected in the Rad Software
Regular Expression Designer (http://www.radsoftware.com.au/regexdesigner/)
and in the similar Expresso tool (http://www.ultrapico.com/ExpressoBeta.htm),
but it returns no matches when I use it in PHP.

I guess this means something is not implemented the same way in PHP,
but what? Does anyone have a workaround to get the expression working
in PHP?

rmbarnes82

The only thing I can think of in PHP is that you may need to escape the single or double quotes that you use within your regex (depending on whether you have your regex string wrapped in single or double quotes).
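As a rough illustration of that point (the href value `x` and the variable names are made up for this sketch), the same character class can be written in either kind of PHP string, as long as the quote that delimits the string is escaped:

```php
<?php
// Hypothetical pattern written two ways; only the PHP string quoting differs.
$single = '#href\s*=\s*["\']x["\']#i';   // single-quoted string: escape the '
$double = "#href\s*=\s*[\"']x[\"']#i";   // double-quoted string: escape the "

// Both strings hold the identical pattern text.
var_dump($single === $double);

// Either one matches an href wrapped in single or double quotes.
$matchesSingleQuoted = preg_match($single, "<a href='x'>");
$matchesDoubleQuoted = preg_match($double, '<a href="x">');
```

In a double-quoted string a `$` followed by a name would also be interpolated, which is one reason single-quoted strings are often the safer choice for patterns.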

Robin

Neff

Yep, I know. I wrote out the regexp like that in the post for clarity; in the source the actual code is:

[FONT="Courier New"]$hrefpattern = "#.*<(?:div|body).*?>(.*?)(<a\s[^>]*?href\s*?=\s*?[\"']{0,1}http://www.domain.com/index.html['\"]{0,1}.*?>.*?</a>)(.*?)</(?:div|body)#i";
[/FONT]

NogDog

You need to escape any literal periods, too, such as those in the URL.
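A small sketch of why this matters, assuming the URL from the first post and a made-up non-matching subject string: an unescaped dot matches any character, while preg_quote() escapes the dots (and, given a second argument, the pattern delimiter too).

```php
<?php
// URL taken from the thread; the "wrong" subject below is invented for the demo.
$url   = 'http://www.domain.com/index.html';
$wrong = 'http://wwwXdomain.com/index.html';

// Unescaped: each "." is a wildcard, so the wrong URL still matches.
$loose = preg_match('#' . $url . '#', $wrong);

// Escaped: preg_quote() turns each "." into a literal dot.
$quoted      = preg_quote($url, '#');
$strict      = preg_match('#' . $quoted . '#', $wrong);
$strictRight = preg_match('#' . $quoted . '#', $url);

var_dump($loose, $strict, $strictRight);
```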

xblue

As the text you are applying your pattern to is multiline, I think you should set the s-modifier to make the dot match line breaks as well.
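To show the difference, here is a minimal sketch using an invented three-line snippet and a deliberately simplified pattern (not the one from the first post):

```php
<?php
// Illustrative multiline input; the content is made up for this demo.
$html = "<div>\nbbb bbb bbb\n</div>";

// Without the s modifier the dot does not match "\n", so the match fails.
$without = preg_match('#<div>(.*?)</div>#i', $html, $m1);

// With the s modifier the dot also matches line breaks.
$with = preg_match('#<div>(.*?)</div>#is', $html, $m2);

var_dump($without, $with);
```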

Neff

When I'm searching, I hate finding threads that don't have a solution, so...

I did a little reading and a little rethinking, re-did my expression, and now have one that works. The PHP code to create the expression is as follows:

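// $targetURL holds the link address to look for, e.g. the URL from the first post.
$hrefpattern = "#(?:<(?:body|div|td|p)[^>]*>)((?:.(?!</?(?:body|div|td|p)))*)"
             . "(<a\s[^>]*?href\s*?=\s*?[\"']{0,1}"
             . preg_quote($targetURL, '#')  // second argument also escapes the # delimiter
             . "[\"']{0,1}.*?>.*?</a>)"
             . "(.*?)</(?:body|div|td|p)>#is";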

It finds the link tag plus the preceding and following text inside whichever is the innermost of the body, div, table cell or paragraph tags. Well, it does in my tests and on the half dozen "in the wild" pages I've tried it on; I'm always prepared to be proved wrong.
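For anyone wanting to try it, here is a minimal usage sketch against the sample HTML from the first post. The values of $targetURL and $html and the use of trim() are assumptions made for this demo, and the call to preg_quote() passes '#' as the delimiter so the URL cannot break the pattern.

```php
<?php
// Assumed inputs for the demo, based on the example in the first post.
$targetURL = 'http://www.domain.com/index.html';
$html = "<body>\naaa aaa aaa\n<div>\nbbb bbb bbb\n"
      . "<a href='http://www.domain.com/index.html'>link text</a>\n"
      . "ccc ccc ccc\n</div>\nddd ddd ddd\n</body>";

// Same pattern as above, built piece by piece.
$hrefpattern = "#(?:<(?:body|div|td|p)[^>]*>)((?:.(?!</?(?:body|div|td|p)))*)"
             . "(<a\s[^>]*?href\s*?=\s*?[\"']{0,1}"
             . preg_quote($targetURL, '#')
             . "[\"']{0,1}.*?>.*?</a>)"
             . "(.*?)</(?:body|div|td|p)>#is";

$found = preg_match($hrefpattern, $html, $matches);

if ($found) {
    // The captured text keeps its surrounding line breaks, so trim it.
    $before = trim($matches[1]);
    $link   = $matches[2];
    $after  = trim($matches[3]);
    echo $before, "\n", $link, "\n", $after, "\n";
}
```

Because the first group refuses to cross another body, div, td or p tag, the match starts at the innermost container around the link rather than at <body>.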