This site (http://www.rthk.org.hk/rthk/news/expressnews/index_.html) publishes rolling news updates, encoded in Big5. I want to extract the first 5 or 10 headlines from the right-hand column. Can anyone show me a neater, simpler way to do it, for example with preg_match_all? My current code:


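<?php

// Prints five headlines by relying on hard-coded line numbers in the page source.
function getnews()
{
    $file = "http://www.rthk.org.hk/rthk/news/expressnews/index_.html";

    // Read the page into an array, one element per line.
    $contents = file($file);
    $size = sizeof($contents);

    for ($i = 0; $i < $size; $i++) {
        // The headlines currently sit on these lines; this breaks if the layout changes.
        if ($i == 296 || $i == 300 || $i == 304 || $i == 308 || $i == 312) {
            $data = strip_tags($contents[$i]);
            // The page is Big5-encoded, so convert to UTF-8 before output.
            echo iconv("BIG5", "UTF-8", $data) . " <br>";
        }
    }
}
?>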

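    <?php
    
    // Fetch the raw (Big5-encoded) page as a single string.
    $data = file_get_contents('http://www.rthk.org.hk/rthk/news/expressnews/index_.html');
    
    // $matches[1] receives each link's query string, $matches[2] the headline text.
    preg_match_all('/<!-- news headline --><font size=2>\s*<a href=\/rthk\/news\/expressnews\/newsframe\.htm\?([^>]+)>(.*?)<\/a>/i', $data, $matches);
    
    print_r($matches);
    
    ?>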
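
To get from the raw `$matches` array to "the first 5 or 10 headlines", something along these lines might work. It is only a sketch: `extract_headlines()` is a hypothetical helper name, the pattern assumes the page still uses the markup matched in the snippet above, and the Big5 to UTF-8 conversion simply reuses the `iconv()` call from the original question.

```php
<?php
// Sketch only: extract_headlines() is a hypothetical helper, and the pattern
// assumes the page markup matched by the regex in the reply above.

/**
 * Return up to $limit headlines as UTF-8 text from the page's raw Big5 HTML.
 */
function extract_headlines(string $html, int $limit = 5): array
{
    $pattern = '/<!-- news headline --><font size=2>\s*'
             . '<a href=\/rthk\/news\/expressnews\/newsframe\.htm\?[^>]+>(.*?)<\/a>/is';

    // Group 1 holds the link text of every headline, in page order.
    preg_match_all($pattern, $html, $matches);

    $headlines = [];
    foreach (array_slice($matches[1], 0, $limit) as $raw) {
        // Convert each captured fragment from Big5 to UTF-8, as in the question.
        $text = iconv('BIG5', 'UTF-8', $raw);
        if ($text === false) {
            continue; // skip fragments that cannot be converted
        }
        // Drop any inline markup left inside the link text.
        $headlines[] = trim(strip_tags($text));
    }

    return $headlines;
}
```

Calling `extract_headlines(file_get_contents($url), 10)` with the news page URL would then return at most ten headline strings, ready to `echo` one per line.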
    

      Very neat code, I'll give it a try.

      Thanks, drew010.

        Isn't this considered scraping?

          Yes, or stealing content, if you want to be crude about it. That's why I asked whether they have an RSS feed: offering one would mean they want you to use their content.

            I thought so... Perhaps btfans can get in touch with the site's developers and ask them to add an RSS feed. Everyone likes RSS feeds! 😃 No one likes scraping 🙁
