Regex URL Parsing problem

wshost · Aug 29, 2007

Goal:
Create a regex to capture ALL URLs in a string.

The regex should find all URLs but end each match as soon as it reaches whitespace OR a punctuation mark (?,.!'") followed by a whitespace character (i.e. \s in regex).

Here's what I have:

([h|H][t|T][t|T][p|P][s|S]?:\/\/)([w]{0,3}[.]{0,1}[a-zA-Z0-9.-]+.[a-zA-Z0-9]{2,10}[^\s])

It works for everything; the only problem is that the match only ends when whitespace is detected. I need it to end when whitespace is detected OR a punctuation mark followed by whitespace is detected.

Any help is greatly appreciated!

alimadzi · Aug 29, 2007

Not sure about your whitespace issue, but you can make your regex case-insensitive if you're using preg_match(). That would simplify the pattern considerably. Also, not every URL starts with 'www.'. You need to account for (multiple) subdomains.
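To illustrate the case-insensitivity point, here is a minimal sketch in Python's re module (an assumption made for a runnable example; the same pattern text should carry over to PCRE with the 'i' modifier). The names verbose_scheme and simple_scheme are hypothetical. It also shows a side effect of the hand-written classes: inside a character class, | is a literal pipe rather than alternation.

```python
import re

# Scheme part as written in the original pattern. Inside [...] the "|" is a
# literal character, so [h|H] matches "h", "H", or "|".
verbose_scheme = re.compile(r"[h|H][t|T][t|T][p|P][s|S]?://")

# Same intent using the case-insensitive flag instead of per-letter classes.
simple_scheme = re.compile(r"https?://", re.IGNORECASE)

# Both patterns accept any mix of upper and lower case.
for s in ["http://", "HTTPS://", "Http://", "hTtPs://"]:
    assert verbose_scheme.fullmatch(s) and simple_scheme.fullmatch(s)

# Only the verbose version accepts a stray pipe in place of a letter.
pipe_verbose = verbose_scheme.fullmatch("|ttp://") is not None
pipe_simple = simple_scheme.fullmatch("|ttp://") is not None
```

With the flag in place, the scheme part shrinks to https?:// and no longer lets a pipe character through.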

wshost · Aug 29, 2007

Actually, the regex does take into account multiple subdomains, and does not require the domain to start with www.

I'm actually using preg_replace(). Does that support the same case-insensitivity modifier (i)?

Anyway, it's 99% correct and I know it's some stupid thing I'm missing. Can anyone help?

alimadzi · Aug 29, 2007

Ah, OK. I see it now. You do indeed account for multiple subdomains and the optionality of 'www.' Instead of {0,1}, you can just use a question mark for better readability. Actually I don't think you need to specifically handle the 'www.' at all. Just treat it like any other subdomain and get rid of that part of the pattern.

As for preg_replace() and case insensitivity: yes. Just add an 'i' after the closing delimiter (the final /) of your pattern.
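For the original question (stopping at whitespace, or at punctuation that is followed by whitespace), one possible approach is a lazy match plus a lookahead. The sketch below uses Python's re module so it can be run as-is; the pattern avoids Python-only syntax and should port to preg_replace() with the 'i' modifier, though that is untested here. Assumptions not stated in the thread: a run of several trailing punctuation marks is dropped, punctuation at the very end of the string is treated like punctuation before whitespace, and the host name is not validated the way the original pattern attempts. URL_RE, find_urls and linkify are hypothetical names.

```python
import re

# https?://   scheme, case-insensitive via the flag
# \S+?        as few non-whitespace characters as possible
# (?=...)     stop where the rest is optional punctuation from the set
#             ? , . ! ' "  followed by whitespace or the end of the string
URL_RE = re.compile(r"""https?://\S+?(?=[?,.!'"]*(?:\s|$))""", re.IGNORECASE)

def find_urls(text):
    """Return every URL in text, without trailing sentence punctuation."""
    return URL_RE.findall(text)

def linkify(text):
    """Wrap each URL in an anchor tag (no HTML escaping; sketch only)."""
    return URL_RE.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text
    )

sample = "See HTTP://www.example.com/page?id=3, then https://sub.example.org. Done!"
urls = find_urls(sample)
```

Because the quantifier is lazy, a ? or . inside the URL (as in page?id=3) is kept, and only punctuation sitting directly before whitespace or the end of the string is left out of the match.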
