Hi guys, I wrote this script to generate a sitemap.xml file for me. It's executed weekly via a cron job and seems to be working perfectly. I'm looking for any feedback you might have. Thanks!

<?php

/**
 * Site map generator
 */
// Setup
$s = microtime(TRUE);
set_time_limit(0);
ob_start();

/** Site URL **/
define( 'url', 'http://---.com/'); // obfuscated for public forum

/** Sitemap file with full path **/
$xmlfile = '/home/---/public_html/sitemap.xml'; // obfuscated for public forum

// URLs to put in the sitemap
$urls = array(url);

// Fill the array
getUrls(url,$urls);

// xml starter
$xml = '<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'."\r\n";

// pattern to use for each page
$pagePatt = "\t<url>\r\n\t\t<loc>%s</loc>\r\n\t\t<changefreq>%s</changefreq>\r\n\t</url>\r\n";

// Change frequency options
$freqs = array('weekly','weekly','monthly');
foreach( $urls as $url ) {
   // determine how often the current page "might" change
   $freq = substr_count($url,'/') - 3;
   // add current page to the sitemap
   $xml .= sprintf($pagePatt,$url,$freqs[$freq]);
}

// xml closer
$xml .= '</urlset>';

// Attempt to put the new sitemap in the file
if( file_put_contents($xmlfile,$xml) ) {
   // New sitemap success!
   echo 'Generated new sitemap at '.$xmlfile.' on '.date('Y-m-d')."\n";
} else {
   // New sitemap failure :(
   echo 'Failed to generate new site map on '.date('Y-m-d')."\n";
}

// Log this script's results
$f = fopen('sitemap.log','a');

// Elapsed time for sitemap creation
$e = microtime(TRUE);
$tot = $e - $s;
echo 'Completed sitemap in '.$tot."\n";

// Turn the output into log contents
$out = ob_get_clean();
fputs($f,$out);
fclose($f);


function getUrls($url,&$urls) {
   // Get the page contents from the supplied $url
   $page = getCurlContents($url);
   sleep(5);

   // Get all the links on the page
   if( preg_match_all('/href="([^"]+)"/i',$page,$matches) ) {
      foreach( $matches[1] as $match ) {
         // Check if the link is for this site
         if( strpos($match,url) === FALSE ) continue;
         // Check if the link is a resource
         if( preg_match('/\.(css|js|jpg|jpeg|bmp|gif|zip)/i',$match) ) continue;
         // Check if the link is already in the current result array
         if( !in_array($match,$urls) ) {
            // add link to the result array
            $urls[] = $match;
            // Page hasn't been parsed, get the links on that page (Recurse this function)
            getUrls($match,$urls);
         }
      }
   }
}

// Use curl to get the contents of a $url
function getCurlContents($url) {
   // Create curl handle
   $ch = curl_init();
   // Set curl options
   curl_setopt($ch, CURLOPT_URL,$url);
   curl_setopt($ch, CURLOPT_HEADER, 0);
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
   curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
   curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
   curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'); 
   curl_setopt($ch, CURLOPT_HEADER, false);
   // get the contents
   $contents = curl_exec($ch);
   // close the handle
   curl_close($ch);
   // return the contents
   return $contents;
}

    Pretty cool ... you've got a start on a search spider too, if you want. 😃

    I did this for one of our sites last fall. Since it's essentially like an ecommerce site, I didn't do any spidering ... it's basically a big DB dump. But the Google-love has been Real Nice(tm). 🙂

      I don't see this as the best solution, if I understand it right.

      Why use a cronjob? Why not run a function every time you do an update, and only then? Then your sitemap is always up to date, and you never run the script if nothing has changed. As described, it could take a week before an edit shows up in the sitemap.

      my2c

        How often does your site actually change structure enough to warrant a new sitemap be generated? How many resources would you rather devote to constantly generating new sitemaps rather than serving up requests for visitors?

        Plus, search engines don't depend solely upon highly detailed/extensive sitemaps, so if you're that dependent upon your sitemap then I'd say the real problem is with your website in general.

          gammaster;10998726 wrote:

          I don't see this as the best solution, if I understand it right.

          Why use a cronjob? Why not run a function every time you do an update, and only then? Then your sitemap is always up to date, and you never run the script if nothing has changed. As described, it could take a week before an edit shows up in the sitemap.

          my2c

          I'm using cron because then I don't have to change any application logic, and I also didn't want to slow down new posts by adding a file read/write to the equation.

          bradgrafelman;10998749 wrote:

          How often does your site actually change structure enough to warrant a new sitemap be generated? How many resources would you rather devote to constantly generating new sitemaps rather than serving up requests for visitors?

          Plus, search engines don't depend solely upon highly detailed/extensive sitemaps, so if you're that dependent upon your sitemap then I'd say the real problem is with your website in general.

          I'm not dependent at all on a sitemap. I didn't have one and I wanted one, but I also didn't want to have to update said sitemap myself. My site gets something like 200 hits per week, so the resources this takes are most likely not affecting any page loads. I usually post to the site once or twice a week, so that's about how often it changes.

          Thanks for the responses; however, I was looking more for critique of the code and the logic in it rather than of my choice to use cron, or my choice to create a sitemap automagically.

            preg_match_all("/<a.*? href=\"(.*?)\".*?>/i", $page, $matches);
            

            With this regex you will find the urls straight away as an array in $matches[1] and can skip the images, css and js checking part.

              gammaster;10998840 wrote:
              preg_match_all("/<a.*? href=\"(.*?)\".*?>/i", $page, $matches);
              

              With this regex you will find the urls straight away as an array in $matches[1] and can skip the images, css and js checking part.

               Thanks, but that wouldn't save me from having to check for css or js, since I may link to css or js in the content of a post =D Plus I was already using an href-based pattern:

              if( preg_match_all('/href="([^"]+)"/i',$page,$matches) ) {

              And I saw no reason to match all the other text that makes up the tag like you did.

                 I personally would've gone with a DOM approach rather than regexp, since parsing through an HTML document with a pattern-based approach just feels wrong to me. It would potentially result in fewer unintended matches, more intended matches, and overall make the code more readable as far as understanding the intent/purpose at a glance. With XPath, you could match all 'href' attributes that begin with your site URL (no regexps or even string comparisons needed) all in one step.

                 Same goes for building the XML file; they call them HTML/XML documents for a reason, y'know. :p
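
                 Something roughly like this, for example (just an untested sketch, assuming $page already holds the fetched HTML and re-using your 'url' constant):

                   $doc = new DOMDocument();
                   @$doc->loadHTML($page); // suppress warnings from real-world/sloppy HTML
                   $xpath = new DOMXPath($doc);
                   // one XPath query: every href attribute that begins with the site URL
                   $nodes = $xpath->query('//a/@href[starts-with(., "'.url.'")]');
                   $links = array();
                   foreach( $nodes as $node ) {
                      $links[] = $node->value;
                   }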

                Some other miscellaneous comments:

                 1. fopen()/fwrite()/fclose() could be reduced to a single call to [man]file_put_contents[/man] for the 'sitemap.log' file.

                2. What's with the sleep(5) in getUrls()? Is this simply meant to limit the rate of requests?

                 3. You could probably optimize the script a bit by making the $ch cURL resource a static variable. The second through nth calls to getCurlContents() would then only need to execute these two statements:

                  curl_setopt($ch, CURLOPT_URL,$url); 
                  $contents = curl_exec($ch); 

                  if they could re-use the same cURL resource from the first request.
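
                 In other words, something along these lines (untested sketch covering #1 and #3, keeping your existing options and variable names):

                   // #1: the whole fopen()/fputs()/fclose() block becomes one call
                   file_put_contents('sitemap.log', $out, FILE_APPEND);

                   // #3: create and configure the handle once, then re-use it on every call
                   function getCurlContents($url) {
                      static $ch = NULL;
                      if( $ch === NULL ) {
                         $ch = curl_init();
                         curl_setopt($ch, CURLOPT_HEADER, false);
                         curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
                         curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
                         curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
                         curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1');
                      }
                      curl_setopt($ch, CURLOPT_URL, $url);
                      return curl_exec($ch);
                   }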

                   1. I always forget I can append with file_put_contents.
                   2. Yes, that's why the sleep is there.
                   3. Thanks, will change.
                   Unnumbered: I'll try to use this to learn DOM :-)

                     Still on the css|js|etc pattern matching: that check can be moved out of the loop containing it by using [man]preg_grep[/man] instead.
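
                     i.e. roughly this (untested), applied to $matches[1] right after the preg_match_all() and before the foreach:

                       // filter out resource links in one pass instead of testing each URL inside the loop
                       $matches[1] = preg_grep('/\.(css|js|jpg|jpeg|bmp|gif|zip)/i', $matches[1], PREG_GREP_INVERT);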

                    I was a little confused for a minute between [font=monospace]url[/font] and [font=monospace]$url[/font]. Calling the constant [font=monospace]SITE_URL[/font] would have prevented that.

                       Ok, looking for more feedback now lol. Changes made: SITE_URL instead of url, DOMDocument instead of preg_match_all, preg_grep instead of checking each value myself, SimpleXML to build the xml, and a curl class because I was having a problem getting a static handle to work in the function. I've tested it and the following works and gives me what I would expect:

                      <?php
                      
                      /**
                       * Site map generator
                       */
                      // Setup
                      $s = microtime(TRUE);
                      set_time_limit(0);
                      ob_start();
                      
                      /** Site URL **/ 
                      define( 'SITE_URL', 'http://---.com/'); // obfuscated for public forum 
                      
                      /** Sitemap file with full path **/ 
                      $xmlfile = '/home/---/public_html/sitemap.xml'; // obfuscated for public forum 
                      
                       // URLs to put in the sitemap
                      $urls = array(SITE_URL);
                      
                      // Fill the array
                      getUrls(SITE_URL,$urls);
                      $urls = array_unique($urls);
                      
                      // xml starter
                      $xml = '<?xml version="1.0" encoding="UTF-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"></urlset>';
                      $xml = new SimpleXMLElement($xml);
                      
                      // Change frequency options
                      $freqs = array('weekly','weekly','monthly');
                      foreach( $urls as $url ) {
                         // determine how often the current page "might" change
                         $freq = substr_count($url,'/') - 3;
                      
                         // add current page to the sitemap
                         addURL($xml,$url,$freqs[$freq]);
                      }
                      
                      // Attempt to put the new sitemap in the file
                      if( file_put_contents($xmlfile,$xml->asXML()) ) {
                         // New sitemap success!
                         echo 'Generated new sitemap at '.$xmlfile.' on '.date('Y-m-d')."\n";
                      } else {
                         // New sitemap failure :(
                         echo 'Failed to generate new site map on '.date('Y-m-d')."\n";
                      }
                      
                      // Elapsed time for sitemap creation
                      $e = microtime(TRUE);
                      printf("Completed sitemap in %f\n",($e-$s));
                      
                      // Turn the output into log contents
                      $out = ob_get_clean();
                       file_put_contents('sitemap.log',$out,FILE_APPEND);
                      
                      // Adds a given url and frequency to the specified xml
                      function addURL(SimpleXMLElement &$xml,$url,$freq) {
                      	$t = $xml->addChild('url');
                      	$t->addChild('loc',$url);
                      	$t->addChild('changefreq',$freq);
                      }
                      
                       function getUrls($url,&$urls) {
                          static $myCurl;
                          if( !isset($myCurl) ) $myCurl = new myCurl();

                          // Get the page contents from the supplied $url
                          $page = $myCurl->getContents($url);
                          sleep(5);

                          // Load document
                          $doc = new DOMDocument();
                          $doc->loadHTML($page);

                          // get all anchor tags
                          $anchors = $doc->getElementsByTagName('a');

                          $hrefs = array();
                          // loop thru all the anchor tags
                          foreach( $anchors as $anchor ) {
                             $hrefs[] = $anchor->getAttribute('href');
                          }

                          // Remove external links
                          $hrefs = preg_grep('/^'.preg_quote(SITE_URL,'/').'/',$hrefs);

                          // Remove links to resources
                          $hrefs = preg_grep('/\.(css|js|jpg|jpeg|bmp|gif|zip|txt)$/i',$hrefs,PREG_GREP_INVERT);

                          // Remove links already in our list
                          $hrefs = array_diff($hrefs,$urls);

                          foreach( $hrefs as $href ) {
                             // add link to the result array
                             $urls[] = $href;
                             // Page hasn't been parsed, get the links on that page (Recurse this function)
                             getUrls($href,$urls);
                          }
                       }
                      
                      // Use curl to get the contents of a $url
                      class myCurl {
                      	private $ch;
                      	public function __construct() {
                      		// Create curl handle
                      		$this->ch = curl_init();
                      		// Set curl options
                      		curl_setopt($this->ch, CURLOPT_HEADER, 0);
                      		curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, 1);
                      		curl_setopt($this->ch, CURLOPT_SSL_VERIFYHOST, false);
                      		curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, 1);
                      		curl_setopt($this->ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'); 
                      		curl_setopt($this->ch, CURLOPT_HEADER, false);
                      	}
                      	public function getContents($url) {
                      		// set option to the url
                      		curl_setopt($this->ch, CURLOPT_URL,$url);
                      		// returns the contents
                      		return curl_exec($this->ch);
                      	}
                      	public function __destruct() {
                      		// close the handle
                      		curl_close($this->ch);
                      	}
                      }

                      Thanks again for all your help.
