URL Parser script

nrg_alpha

I took notice of this thread and thought it would be a nice little exercise to take a URL and break it up into parts (including sub domains).

My first thought was simply to call parse_url and use ['host'] to get the domain. But if one wants the individual sub domains broken out of what ['host'] returns, it seems more work is needed. So I came up with this script. I would like to know how to make this more streamlined / elegant / efficient (yet still display the results you see here).
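As a quick aside before the script itself, here is a minimal sketch of what parse_url hands back for one of the test URLs used below. The second call (a URL typed without a scheme) is an assumption added for illustration; it shows why ['host'] cannot be relied on unless the scheme is present.

```php
<?php
// Example URL taken from the test array used later in this post.
$parts = parse_url('http://www.subdomain1.phpbuilder.com/board/');

echo $parts['scheme'] . "\n"; // http
echo $parts['host'] . "\n";   // www.subdomain1.phpbuilder.com
echo $parts['path'] . "\n";   // /board/

// parse_url() only returns the components it finds. Without a scheme,
// the whole string ends up in 'path' and there is no 'host' key at all.
$noScheme = parse_url('www.example.ca/page');
var_dump(isset($noScheme['host'])); // bool(false)
```

So parse_url gets as far as the full host name, and splitting that host into sub domains, domain and top-level domain is the part that still needs doing.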

So here is the code:

function urlParser($url){
	$urlHost = parse_url($url);
	echo 'url - ' . $urlHost['host'] . '<br />';
	preg_match('#(?:\.\w{2}\.\w{2}|\.\w{2,})$#', $urlHost['host'], $match); // $match[0] = Top-level Domain (both alternatives anchored to the end of the host)
	$urlRevised = substr($urlHost['host'], 0, strlen($urlHost['host']) - strlen($match[0]));
	$urlArray = explode('.', $urlRevised);
	array_shift($urlArray); // eliminate $urlArray[0] = 'www'
	//////////////////
	// display info //
	//////////////////
	if(count($urlArray) > 1){
		for($i = 0, $x =1, $count = count($urlArray) -1; $i < $count; $i++){
			echo 'Sub domain ' . $x++ . ': ' . $urlArray[$i] . '<br />';
		}
		echo 'Domain : ' . end($urlArray) . $match[0] . '<br />' . '-------------------------------------------------------------' . '<br />';
	} else {
		echo 'Domain : ' . $urlArray[0] . $match[0] . '<br />' . '-------------------------------------------------------------' . '<br />';
	}
}

$str = array('http://www.example.ca', 'http://www.subdomain1.phpbuilder.com/board/', 'http://www.subdomain1.subdomain2.subdomain3.example.co.uk/images/gif/');
foreach($str as $url){
	urlParser($url);
}

Output:

url - www.example.ca
Domain : example.ca
-------------------------------------------------------------
url - www.subdomain1.phpbuilder.com
Sub domain 1: subdomain1
Domain : phpbuilder.com
-------------------------------------------------------------
url - www.subdomain1.subdomain2.subdomain3.example.co.uk
Sub domain 1: subdomain1
Sub domain 2: subdomain2
Sub domain 3: subdomain3
Domain : example.co.uk
-------------------------------------------------------------

Ok, so here is one design snag I ran into. Notice the preg_match() call? At first I was trying to devise a single pattern that would detect either a format ending in '.xx.xx' (think .co.uk) or every other format ('.xx', '.xxx', '.xxxx', etc.), while capturing everything before it as the basis for further calculations. I could only get one format or the other with the rest captured, but not both. So I used $match[0] (which does successfully find the ending in either format), subtracted its length from the length of the host, and went from there.

So the code does work. But again, I sense the function urlParser() can be streamlined / more elegant / effective. I would be very interested in seeing the feedback from all this.

Time to make a fool of myself (again) 😉

Cheers,

NRG

nrg_alpha

Ok, after tightening a few nuts 'n bolts.. here is a revised version:

function urlParser($url){
   $urlHost = parse_url($url);
   echo 'url - '. $urlHost['host'] . '<br />';
   preg_match('#(.+?)((?:\.\w{2}\.\w{2})|(?:\.\w{2,}))$#', $urlHost['host'], $match);
   $urlArray = explode('.', $match[1]);
   array_shift($urlArray);
   $i = 0;
   $domNumber = 1;
   while($i < count($urlArray)-1){ // '<' rather than '!=' so a host with no sub domains cannot loop forever
      echo 'Sub domain ' . $domNumber++ . ': ' . $urlArray[$i++] . '<br />';
   }
   echo 'Domain: ' . end($urlArray) . '<br />Top-level domain: ' . $match[2] . '<br />' . '-----------------------------------------------' . '<br />';
}

$str = array('http://www.example.ca', 'http://www.subdomain1.phpbuilder.com/board/', 'http://www.subdomain1.subdomain2.subdomain3.example.co.uk/images/gif/');
foreach($str as $url){
	urlParser($url);
}

Output:

url - www.example.ca
Domain: example
Top-level domain: .ca
-----------------------------------------------
url - www.subdomain1.phpbuilder.com
Sub domain 1: subdomain1
Domain: phpbuilder
Top-level domain: .com
-----------------------------------------------
url - www.subdomain1.subdomain2.subdomain3.example.co.uk
Sub domain 1: subdomain1
Sub domain 2: subdomain2
Sub domain 3: subdomain3
Domain: example
Top-level domain: .co.uk
-----------------------------------------------

I managed to fix the regex so that it not only finds the final suffix (.xx.xx, .xxx, .xxxx, etc.) but also captures everything in front of it. This saved a line of code. I also streamlined the output code.
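To make the two capture groups concrete, here is a small sketch that runs the same pattern against a few hosts. The helper name splitHost() and the last test host (www.example.com.au) are assumptions added for illustration; they are not part of the script above.

```php
<?php
// Same pattern as in the revised urlParser() above.
$pattern = '#(.+?)((?:\.\w{2}\.\w{2})|(?:\.\w{2,}))$#';

// splitHost() is a hypothetical helper used only for this illustration.
function splitHost($host, $pattern)
{
    if (!preg_match($pattern, $host, $match)) {
        return null; // e.g. 'localhost' has no dot-separated suffix
    }
    // $match[1] = everything before the suffix, $match[2] = the suffix
    return array('rest' => $match[1], 'tld' => $match[2]);
}

$hosts = array(
    'www.example.ca',
    'www.subdomain1.phpbuilder.com',
    'www.example.co.uk',
    'www.example.com.au', // assumed extra case: three letters, then two
);
foreach ($hosts as $host) {
    $parts = splitHost($host, $pattern);
    echo $host . ' => rest: ' . $parts['rest'] . ', tld: ' . $parts['tld'] . "\n";
}
```

The first three hosts split as expected. The last one comes back with 'www.example.com' as the rest and '.au' as the suffix, because the '.xx.xx' alternative only covers two-letter pairs. A length-based pattern is a heuristic, so suffixes of other shapes would need extra handling.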

I think this is about as streamlined and efficient as I can get it.

What do you think?

Cheers,

NRG

nrg_alpha

I realised that after the scheme (http), not all web addresses start with www. Sometimes you may find www2, for example, or, as in this real example, http://gear.ign.com/, where we have 'gear'.

Now we all know you can omit the 'www.' and simply type in the rest of the url and the browser will compensate for this. But in this case (with http://gear.ign.com/), I tried 'http://www.gear.ign.com' and there was no such location! So this is admittedly an oddball. Going back to my previous function, I realized I can replace the line 'array_shift($urlArray);' with a line that outputs $urlArray[0], and simply set $i = 1 instead. So now I output the scheme (I include 'www' or its equivalent in this). The number of lines stays the same, with that extra bit of added functionality:

function urlParser($url){
   echo 'url - '. $url . '<br />';
   $urlHost = parse_url($url);
   preg_match('#(.+?)((?:\.\w{2}\.\w{2})|(?:\.\w{2,}))$#', $urlHost['host'], $match);
   $urlArray = explode('.', $match[1]);
   echo 'Scheme : ' . $urlHost['scheme'] . '://' . $urlArray[0] . '<br />';
   $i = 1;
   $domNumber = 1;
   while($i < count($urlArray)-1){ // '<' rather than '!=' so a bare host such as example.com cannot loop forever
      echo 'Sub domain ' . $domNumber++ . ': ' . $urlArray[$i++] . '<br />';
   }
   echo 'Domain: ' . end($urlArray) . '<br />Top-level domain: ' . $match[2] . '<br />' . '-----------------------------------------------' . '<br />';
}

$str = array('http://gear.ign.com/', 'https://www.subdomain1.subdomain2.phpbuilder.com/board/', 'ftp://ftp.funet.fi/pub/standards/RFC/rfc959.txt');
foreach($str as $url){
	urlParser($url);
}

Which outputs:

url - http://gear.ign.com/
Scheme : http://gear
Domain: ign
Top-level domain: .com
-----------------------------------------------
url - https://www.subdomain1.subdomain2.phpbuilder.com/board/
Scheme : https://www
Sub domain 1: subdomain1
Sub domain 2: subdomain2
Domain: phpbuilder
Top-level domain: .com
-----------------------------------------------
url - ftp://ftp.funet.fi/pub/standards/RFC/rfc959.txt
Scheme : ftp://ftp
Domain: funet
Top-level domain: .fi
-----------------------------------------------

I also echo out the complete url to start with instead of the stripped down ['host'] version. As well, I changed the samples in the array to be plugged into the function to include the oddball url mentioned at the beginning of this reply as well as include an ftp one.

Cheers,

NRG

laserlight

nrg_alpha wrote:
Now we all know you can omit the 'www.' and simply type in the rest of the url and the browser will compensate for this.

That is not true. It depends on the server configuration, and in fact the choice between no-www and yes-www is a holy war.
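One rough way to see this for a given site is to check both spellings of the host yourself. The sketch below is only an illustration: wwwVariants() and hostResolves() are hypothetical helper names, and a DNS lookup is just one part of the picture, since a name that resolves may still be redirected or refused by the web server.

```php
<?php
// Hypothetical helper: return the host without and with a leading "www." label.
function wwwVariants(string $host): array
{
    $bare = preg_replace('#^www\.#i', '', $host);
    return array($bare, 'www.' . $bare);
}

// Hypothetical helper: gethostbyname() returns the unmodified host name
// when the IPv4 lookup fails, so a changed value means the name resolved.
function hostResolves(string $host): bool
{
    return gethostbyname($host) !== $host;
}

// Pure string work, no network access needed:
print_r(wwwVariants('gear.ign.com')); // gear.ign.com and www.gear.ign.com
```

Passing each variant to hostResolves() then shows whether that name exists in DNS at the moment of the check. Neither result says anything about what the browser does.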

nrg_alpha

laserlight;10888560 wrote:
That is not true. It depends on the server configuration, and in fact the choice between no-www and yes-www is a holy war.

Ah.. I thought this was simply the browser checking the url you entered, and if 'www.' is missing, it (internally) auto inserts it and away it goes...

Well, yet another thing learned! Thanks for the correction, Laserlight.
This was a big misunderstanding on my part.

Cheers,

NRG