Crawl/Spider the website

maheswaran

Hi all,

I have a website with more than 70 static pages and a lot of content. The content is well structured. We are now converting this static website to a CMS and are going to insert all the content from this website into the CMS site. Copying and pasting would be a huge amount of work, as every static page has a lot of topics. What we want is to crawl/spider the content between the tags -- here the contents of the topics are loaded -- .

To make it clearer:

<p class="topic_start">

<p class="title"> The Maths Evaluation </p>

<p class="strips">&nbsp;</p>

<p class="contentlioad">

The Maths evaluation period......
...
.....
.....
</p>
</p>
<p class="topic_footer"><img src="images/footer_img.gig"></p>

Here I want to get the topic and content between the tags and ...

Can anybody help? I tried searching with Google but did not succeed.

halojoy

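1. If all those pages are in the same directory,
I would use [man]glob[/man] to get all the page names in the directory.

2. Then I would, in a loop, use [man]file_get_contents[/man]
to load each page, one after another, into a string.

3. On each such page-string I would use [man]preg_match[/man]
(or [man]preg_match_all[/man], if a page holds several topics)
to find the part between <p class="topic_start"> and <p class="topic_footer">.

This regex will NOT include the tags <p class="topic_start"> and <p class="topic_footer">.
But this is easily changed if you wish.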
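
To illustrate that last point, here is a minimal sketch using a shortened, hypothetical one-topic sample (the $html string is made up for the example). After [man]preg_match[/man], index 0 of the match array holds the whole matched text, tags included, while index 1 holds only what the parentheses captured.

```php
<?php
// Hypothetical one-topic sample, shortened from the markup in the question.
$html = '<p class="topic_start"><p class="title">The Maths Evaluation</p></p>'
      . '<p class="topic_footer"></p>';

// Same kind of pattern as in the steps above: capture what lies between the tags.
$pattern = '#<p class="topic_start">(.*)</p>.*<p class="topic_footer">#s';

$withTags = null;
$withoutTags = null;
if (preg_match($pattern, $html, $match)) {
    $withTags    = $match[0]; // whole match, delimiting tags included
    $withoutTags = $match[1]; // only the text captured by (.*)
    echo $withoutTags, "\n";
    echo $withTags, "\n";
}
```

So switching from $match[1] to $match[0] is one way to keep the surrounding tags, without touching the pattern itself.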

<?php

// Regex pattern we will search for.
// The part inside the parentheses is what we capture: (.*?)
// The lazy .*? stops at the first footer, so pages with several topics work too.
$searchfor = '#<p class="topic_start">(.*?)</p>\s*<p class="topic_footer">#s';
// where we will collect all results
$results = array();

// find each page with extension .html in the current directory
foreach (glob('*.html') ?: array() as $file) {
    // read the whole page into $string
    $string = file_get_contents($file);
    if ($string === false) {
        continue; // skip files that could not be read
    }
    // find every part matching $searchfor; $matches[1] holds the captured parts
    if (preg_match_all($searchfor, $string, $matches)) {
        // store each trimmed capture in the results array
        foreach ($matches[1] as $topic) {
            $results[] = trim($topic);
        }
    }
    // loop to the next file and repeat
}

// display all results
print_r($results);

exit('the end');


/* MY TESTING
$string = '<p class="topic_start">

<p class="title"> The Maths Evaluation </p>

<p class="strips">&nbsp;</p>

<p class="contentlioad">

The Maths evaluation period......
...
.....
.....
</p>
</p>
<p class="topic_footer"><img src="images/footer_img.gig"></p>';

$search = '#<p class="topic_start">(.*)</p>.*<p class="topic_footer">#s';
preg_match($search, $string, $match);
print_r(trim($match[1]));
exit();
*/

?>
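
Since the question asks for both the topic title and its content, each captured block could be split further before it goes into the CMS. The following is only a sketch: the class names "title" and "contentlioad" are copied from the sample markup, and the $topic string and the $row array layout are hypothetical.

```php
<?php
// Hypothetical captured topic block, in the form the script above collects.
$topic = '<p class="title"> The Maths Evaluation </p>

<p class="strips">&nbsp;</p>

<p class="contentlioad">

The Maths evaluation period......
</p>';

// Capture 1: the title text. Capture 2: everything inside the content
// paragraph, up to the last </p> of the block.
$pattern = '#<p class="title">(.*?)</p>.*?<p class="contentlioad">(.*)</p>\s*$#s';

$row = null;
if (preg_match($pattern, $topic, $m)) {
    $row = array(
        'title'   => trim($m[1]),
        'content' => trim($m[2]),
    );
}

print_r($row);
```

A regex is used here rather than an HTML parser because the sample nests <p> inside <p>, which a parser would likely restructure. If the real pages differ from the sample, the pattern would need adjusting.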