How to extract web data

btfans · Aug 8, 2005

This site (http://www.rthk.org.hk/rthk/news/expressnews/index_.html) provides updated news (Big5-encoded). I want to extract the first 5 or 10 headlines (right-hand column). Can anyone show me a neater/simpler way (using preg_match_all)?

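<?php

// Print the headlines found on fixed line numbers of the express news page.
function getnews()
{
	$file = "http://www.rthk.org.hk/rthk/news/expressnews/index_.html";

	// Read the page into an array, one element per line.
	$contents = file($file);
	if ($contents === false) {
		return;
	}

	// Zero-based line numbers that currently hold the first five headlines.
	$headlineLines = array(296, 300, 304, 308, 312);

	foreach ($headlineLines as $i) {
		if (isset($contents[$i])) {
			// Strip the markup, then convert the Big5 text to UTF-8.
			echo iconv("BIG5", "UTF-8", strip_tags($contents[$i]) . " <br>");
		}
	}
}
?>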

drew010 · Aug 8, 2005

<?php

$data = file_get_contents('http://www.rthk.org.hk/rthk/news/expressnews/index_.html');

preg_match_all('/<!-- news headline --><font size=2>\s*<a href=\/rthk\/news\/expressnews\/newsframe\.htm\?([^>]+)>(.*?)<\/a>/i', $data, $matches);

print_r($matches);

?>
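
To get from the raw `$matches` array to the first 5 headlines as UTF-8 text, something along these lines might work. It is only a sketch: the function name `extractHeadlines`, the `$limit` parameter, and the sample markup are made up for illustration, and it assumes the page keeps the markup that the pattern above expects.

```php
<?php

// Return up to $limit headlines from the page source as UTF-8 strings.
// The pattern is the one from the post above; the function name and
// $limit are illustrative assumptions.
function extractHeadlines($html, $limit = 5)
{
    $pattern = '/<!-- news headline --><font size=2>\s*<a href=\/rthk\/news\/expressnews\/newsframe\.htm\?([^>]+)>(.*?)<\/a>/i';

    // $matches[1] holds the query strings, $matches[2] the link text.
    if (!preg_match_all($pattern, $html, $matches)) {
        return array();
    }

    $headlines = array();
    foreach (array_slice($matches[2], 0, $limit) as $text) {
        // The page is Big5-encoded, so convert each headline to UTF-8.
        $headlines[] = iconv('BIG5', 'UTF-8', trim(strip_tags($text)));
    }
    return $headlines;
}

// Made-up markup in the shape the pattern expects (ASCII only, for illustration).
$sample = '';
for ($n = 1; $n <= 7; $n++) {
    $sample .= "<!-- news headline --><font size=2>\n"
        . "<a href=/rthk/news/expressnews/newsframe.htm?item$n>Headline $n</a></font>\n";
}

print_r(extractHeadlines($sample, 5));
```

For the live page, pass in the result of `file_get_contents()` instead of `$sample`. Matching on the markup rather than on fixed line numbers means the code keeps working if lines are added above the headlines, but it will still break if the site changes its HTML.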

Roger_Ramjet · Aug 9, 2005

Don't they do an RSS feed then?

btfans · Aug 9, 2005

Very neat code, will try....

Thanks drew010 .

LoganK · Aug 10, 2005

Isn't this considered scraping?

Roger_Ramjet · Aug 10, 2005

Yes, or stealing content if you want to be crude about it. That is why I asked whether they have an RSS feed: a feed means they want you to use their content.

LoganK · Aug 10, 2005

I thought so... Perhaps btfans can get in touch with the site's developers and ask them to add an RSS feed. Everyone likes RSS feeds! No one likes scraping.
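
If a feed does turn up, reading the headlines from it needs no regex at all. Here is a rough sketch using SimpleXML, assuming a standard RSS 2.0 layout (`channel` containing `item` elements with a `title`). The function name `feedTitles` and the sample feed are made up for illustration; nothing here is the site's actual feed.

```php
<?php

// Return the titles of the first $limit items in an RSS 2.0 document.
// Assumes the usual rss > channel > item > title layout.
function feedTitles($xml, $limit = 5)
{
    $feed = simplexml_load_string($xml);
    if ($feed === false || !isset($feed->channel->item)) {
        return array();
    }

    $titles = array();
    foreach ($feed->channel->item as $item) {
        if (count($titles) >= $limit) {
            break;
        }
        $titles[] = trim((string) $item->title);
    }
    return $titles;
}

// Made-up feed for illustration; a real one would be fetched with file_get_contents().
$rss = '<rss version="2.0"><channel><title>Example news</title>'
    . '<item><title>First headline</title><link>http://example.com/1</link></item>'
    . '<item><title>Second headline</title><link>http://example.com/2</link></item>'
    . '<item><title>Third headline</title><link>http://example.com/3</link></item>'
    . '</channel></rss>';

print_r(feedTitles($rss, 2));
```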
