Greetings. I have gotten some great help here before, and I was hoping someone could help me modify this script.
The PHP below works well for scraping until it runs across a headline containing a special character such as "&" (and a few others), at which point the feed breaks. Is there a fix for the code below?
Thanks in advance.
<?php
// Screen scraping your way into RSS
// Example script, by Dennis Pallett
// http://www.phpit.net/tutorials/screenscrap-rss
// Get page
$url = "http://www.urlgoeshere.com/";
$data = file_get_contents($url);
// Get content items
preg_match_all ("/<div class=\"headline\">([^`]*?)<\/a/", $data, $matches);
// Begin feed
header ("Content-Type: text/xml; charset=ISO-8859-1");
echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
?>
<rss version="2.0"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
<title>Browns News</title>
<description>The latest news from</description>
<link>http://www.urlgoeshere.com</link>
<language>en-us</language>
<?php
// Loop through each content item
foreach ($matches[0] as $match) {
// First, get title
preg_match ("/\>([^`]*?)<\/a/", $match, $temp);
$title = $temp[1];
$title = strip_tags($title);
$title = trim($title);
// Second, get url
preg_match ("/<a href=\"([^`]*?)\">/", $match, $temp);
$url = $temp[1];
$url = trim($url);
// Echo RSS XML
echo "\t\t<item>\n";
echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
echo "\t\t\t<link>http://www.urlgoeshere.com" . strip_tags($url) . "</link>\n";
echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
echo "\t\t\t<content:encoded><![CDATA[ \n";
echo $text . "\n";
echo " ]]></content:encoded>\n";
echo "\t\t\t<dc:creator>" . strip_tags($author) . "</dc:creator>\n";
echo "\t\t</item>\n";
}
?>
</channel>
</rss>
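
A likely cause is that the title is echoed into the XML unescaped, so a bare "&" makes the feed invalid. Below is a minimal sketch of one possible fix, not a tested drop-in. The helper name `rss_escape` is hypothetical (it is not part of the original script), and the sample headline is made up for illustration. It assumes the scraped page uses the same ISO-8859-1 charset that the feed declares.

```php
<?php
// Hypothetical helper (not in the original script): make scraped text
// safe to place inside an RSS/XML element.
function rss_escape($text, $charset = 'ISO-8859-1')
{
    // Decode HTML entities first so an existing "&amp;" is not escaped twice.
    $decoded = html_entity_decode($text, ENT_QUOTES, $charset);
    // Then escape &, <, > and quotes so the result is valid inside XML.
    return htmlspecialchars($decoded, ENT_QUOTES, $charset);
}

// Made-up sample headline containing both an entity and a bare ampersand.
$title = trim(strip_tags('Smith &amp; Jones sign <b>R&D</b> deal'));
echo "<title>" . rss_escape($title) . "</title>\n";
// prints: <title>Smith &amp; Jones sign R&amp;D deal</title>
```

In the loop, the same helper could wrap `$title` and `$url` where they are echoed, since link URLs with query strings can also contain "&". If the "few others" are characters such as curly quotes or dashes, the cause may instead be a charset mismatch between the scraped page and the feed, which this sketch does not address.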