[RESOLVED] Simple RegEx ?

Good2CU

I'm trying to parse the URL out of a <base href=""> tag in HTML.

I'm having trouble matching either ' or ".

I'm using eregi(), and I think it should look like:

eregi("<base href=('|\")(.+)('|\")>", $text_to_parse, $result);

I want to capture anything between '' or ""

So how do I get it to match ' or "?

Are there any other improvements I can make (besides using the preg functions, which I'm just learning)?

Thanks

Weedpacket

Behold the results of a search:
http://www.phpbuilder.com/board/showthread.php?t=10220796&highlight=regex+href
http://www.phpbuilder.com/board/showthread.php?t=10270729&highlight=regex+href
More can be found if you look.

Good2CU

My question is specifically about capturing ' or "

I believe it is ('|\"), but it doesn't seem to work. Can someone please tell me if this is correct or not?

Thanks

nrg_alpha

To the OP: I would suggest distancing yourself from ereg. The ereg functions were deprecated in PHP 5.3 and removed in PHP 7. As you have mentioned, keep learning preg. Here is one way to approach your problem:

$str = 'This is a base tag, <base href="http://www.somesite.bork"></base> which is a base tag with a url in it.';
preg_match('#<base.+?href=["\']([^"\']+)["\']#i', $str, $match);
echo $match[1];

Output:

http://www.somesite.bork
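If it helps, here is a minimal sketch of the same idea wrapped in a function. The name extract_base_href is hypothetical (it is not a PHP built-in), and the type hints assume PHP 7.1 or later. Instead of a character class on both sides, it captures the opening quote and uses the backreference \1 so the closing quote has to be the same character:

```php
<?php
// Hypothetical helper, not part of PHP: returns the href of the first
// <base> tag in $html, or null when no quoted href is found.
function extract_base_href(string $html): ?string
{
    // (["\']) captures the opening quote; \1 requires the same quote to close.
    // [^>]*? keeps the search for href inside the same tag.
    $pattern = '#<base\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1#is';
    if (preg_match($pattern, $html, $match) === 1) {
        return $match[2];
    }
    return null;
}

echo extract_base_href('<base href="http://www.somesite.bork">'), "\n";
echo extract_base_href("<BASE target='_top' href='/docs/'>"), "\n";
```

Checking the return value of preg_match() before reading the match array avoids a notice when the page has no base tag at all.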

Looking at your example, I noticed you use ('|\"). Alternation is generally slower than a character class such as ["\']. Note that I had to escape the single quote, as my pattern is surrounded by single quotes. A character class is the better fit here because we are only interested in single characters, not a string of characters like jpeg|gif|png.

Lastly, I noticed you used .+ which can be a very inefficient way of doing things...
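To show that the two forms are interchangeable for single characters, here is a small sketch. The sample strings and variable names are made up for illustration:

```php
<?php
// Two ways to match a single ' or " character.
$alternation = '#(\'|")#';  // alternation inside a group
$charClass   = '#["\']#';   // character class

$samples = ['<base href="a">', "<base href='b'>"];
$same = [];
foreach ($samples as $s) {
    preg_match_all($alternation, $s, $a);
    preg_match_all($charClass, $s, $c);
    // Index 0 holds the full matches, so the two lists are directly comparable.
    $same[] = ($a[0] === $c[0]);
}
var_dump($same); // both entries are true
```

Both patterns find the same quote characters; the character class just does it without a group or an alternation.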

As an extremely simplified example:
Say your string is: "I'm Gerry, hi!"
And suppose you are using this pattern: #(.+)gerry#i

What happens is that the regex engine sees .+ and greedily captures everything up to the end, so at this point the capture is equal to the whole string "I'm Gerry, hi!". Along the way, the engine saves states (position markers between each character) as it captures each character greedily.

However, since there is more in the pattern after the (.+) (in this case, the letters g, e, r, r and y), the engine now needs to backtrack. It goes to the last saved state (which points to the ! character) and checks whether this character matches the first character that follows (.+) [in this case, the g in gerry]. Nope. Now the engine must backtrack to yet another saved state and see if that character (i) matches g. Nope. On and on the engine backtracks in reverse order, checking whether the newest backtracked character from the capture matches. Only when the engine reaches G do things start to move forward again, so gerry (case insensitive) matches (and is thus excluded from the capture). This is a lot of work to capture what comes before Gerry in the string. While this example is a small one, it gets worse when you use .* or .+ to find something in the middle of much larger chunks of data.
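A quick sketch of what the greedy pattern ends up capturing. The second subject string is made up to show that greedy .+ only backs off as far as the last occurrence:

```php
<?php
// Greedy: .+ runs to the end of the string, then backtracks until "gerry" fits.
preg_match('#(.+)gerry#i', "I'm Gerry, hi!", $m);
var_dump($m[1]);   // string(4) "I'm "

// With two occurrences, backtracking stops at the LAST "Gerry".
preg_match('#(.+)gerry#i', 'Hi Gerry, bye Gerry', $m2);
var_dump($m2[1]);  // string(14) "Hi Gerry, bye "
```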

A better way here is to use lazy quantifiers.
If the pattern is rewritten as #(.+?)gerry#i , the regex engine captures the first character (the I in I'm), then looks to see if the next character (') is a g. Nope, so it includes that non-matching character in the capture and advances to the next character. It checks whether that one (m) matches g. Nope. It keeps creeping forward in this fashion: as long as the next character doesn't start a match for gerry, it is included in the capture. Once the letters g, e, r, r and y are matched, the capture is complete (without including those matched characters). When the text you want sits near the start of a large chunk of data, this is usually much faster, because the engine never has to run to the end of the string and work its way back. Where you know which character ends the capture, a negated character class such as [^"\']+ (as in my pattern above) is another way to get the same effect.
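Here is a small sketch of the lazy version, using made-up subject strings. The last two lines compare greedy .+ with a negated character class on a quoted attribute, which is closer to the original base href problem:

```php
<?php
// Lazy: .+? grows one character at a time until "gerry" matches.
preg_match('#(.+?)gerry#i', "I'm Gerry, hi!", $lazy);
var_dump($lazy[1]);   // string(4) "I'm "

// With two occurrences, the lazy version stops at the FIRST "Gerry".
preg_match('#(.+?)gerry#i', 'Hi Gerry, bye Gerry', $first);
var_dump($first[1]);  // string(3) "Hi "

// Quoted attribute values: greedy .+ overshoots to the last quote,
// while a negated class stops at the first closing quote.
$tag = '<base href="a" target="b">';
preg_match('#"(.+)"#', $tag, $greedyQuote);
preg_match('#"([^"]+)"#', $tag, $classQuote);
var_dump($greedyQuote[1], $classQuote[1]);
```

The greedy capture comes back as a" target="b, while the negated class returns just a.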

An excellent book that explains how the mechanics of the regex engine work is this one.

Act now! It's x-mas time 😉

Hope all this helps...

Cheers,

NRG

Good2CU

Thanks for the reply. I will try to use the character class instead.

I already have that book, I haven't had much time to get too far in it. I'm focusing on the RegEx sections in the PHP books I have first. I'll get there eventually though.

Thanks