Hello,

I have dumped a number of RSS feeds into a MySQL table. These are news RSS feeds (AP, Reuters, ABC, CNN, CBS News, Digg, Slashdot, etc.).

I am having difficulty writing a script that displays the content of each feed WITHOUT showing duplicate articles. Instead, I want duplicate articles to be listed under the initial article as "Related Headlines", much like Google News does. The question is: how do I do this?

Example:

Title: Rumsfeld Resigns from Iraq.
Source: Yahoo! News Date: Dec 3, 2006
Article Body: Today, Donald Rumsfeld has resigned from Iraq. Millions of Iraqis are partying in the streets.

Related Headlines: "Bush Fires Rumsfeld" - ABC News
"Rumsfeld is No More" - CNN News

Get the idea? I am using similar_text() to compare the headlines of each article. If the similarity is greater than 70%, the compared headline should be removed from the array so that it isn't displayed as an independent article and is instead displayed as a Related Headline for that article.
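
For illustration, here is a minimal sketch of that grouping idea on plain arrays, with no database involved. The function name groupRelatedHeadlines(), the 'lead' and 'related' keys, and the sample headlines are all hypothetical; only similar_text() and the 70% threshold come from the description above. Because similar_text() compares characters rather than meaning, differently worded headlines about the same story may well fall below the threshold.

```php
<?php
// Hypothetical helper: puts each article under the first earlier article
// whose headline is more than $threshold percent similar; otherwise the
// article starts a new group of its own.
function groupRelatedHeadlines(array $articles, float $threshold = 70.0): array
{
    $groups = [];
    foreach ($articles as $article) {
        $placed = false;
        foreach ($groups as &$group) {
            // The third argument receives the similarity as a percentage.
            similar_text(
                strtolower($group['lead']['headline']),
                strtolower($article['headline']),
                $percent
            );
            if ($percent > $threshold) {
                $group['related'][] = $article;
                $placed = true;
                break;
            }
        }
        unset($group); // drop the reference left over from the inner loop
        if (!$placed) {
            $groups[] = ['lead' => $article, 'related' => []];
        }
    }
    return $groups;
}

// Made-up sample data, standing in for rows fetched from the table.
$sample = [
    ['headline' => 'Rumsfeld Resigns', 'source' => 'A'],
    ['headline' => 'Stocks Fall Sharply', 'source' => 'B'],
    ['headline' => 'Rumsfeld Resigns Today', 'source' => 'C'],
];
$grouped = groupRelatedHeadlines($sample);
```

Each entry of $grouped can then be printed as one list item, with its 'related' entries listed underneath as Related Headlines.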

So here is my code attempt:


<?php

$query = "select id, headline, intro, body, author, date, source, vote, xmlsitetype from anews1 where xmlsitetype = 0 ORDER BY date DESC LIMIT 10";

$result = mysql_query($query);    


// Go through each news item from the database table

while ($rownews = mysql_fetch_assoc($result)){


		  // find similar_text and flag duplicate stories to be displayed as Related Headlines
		  $dupquery = "SELECT id, headline, intro, body, author, date, source, vote, xmlsitetype from anews1";

		  echo $dupquery;
		  $dupresult = mysql_query($dupquery);

		  // strip unneeded characters like quotes out of headlines to clean them up:
		  $cleanheadlines = array("\"", "'");

		  while ($dupcheck = mysql_fetch_assoc($dupresult)){
			   foreach ($dupcheck as $key => $dupcheck['headline']){

				  $str1 = str_replace($cleanheadlines, "", addslashes($rownews['headline']));
				  $str2 = str_replace($cleanheadlines, "", addslashes($dupcheck['headline']));
				  echo "<p><b>str1</b> is: ".$str1."<br><b>str2</b> is: ". $str2."<p>";
				  // similar_text() returns the number of matching characters; the percentage (0-100) is written to $p
				  similar_text( $str1, $str2, $p );
				  if ($p > 70){
				  	 echo '<b><u>phrases are similar</u></b>';
				  	 	// flag str2 in the array so that it is not displayed as an independent article, but rather as a Related Headline
				  	} else {

				  		echo 'phrases not similar';
            // if phrases are not similar then print this article as the next independent news article in the news list

				  }
				  echo "Percent: $p%";

   			   }
		   }
		?>
<ol>
				<li>
					<strong><a href="<?php echo $rownews['source']; ?>"><?php echo $rownews['headline']; ?></a></strong><br />
					<span style="font-size:0.8em; color:#999; height: 10px;">&rarr; <a href="<?php echo $rownews['source']; ?>"><?php echo $rownews['date']; ?></a> | <?php echo $rownews['date']; ?></span><br />
					<div style="font-size:1em; color:#000; height: 130px;"><?php echo $rownews['body']; ?></div>
				</li>
			<?php  


  }



?>

</ol>
</div>

The problem with this code is that the foreach loop is printing and comparing EVERY element in the array, so $str2 is assigned each element of the row in turn. I only want $str2 to be assigned the second array element (the headline) so that I can then compare it with $str1.

How do I do this?
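
One possible approach, sketched under the assumption that each row is an associative array like the ones mysql_fetch_assoc() returns: skip the inner foreach and read the 'headline' key directly. The helper name headlinesAreSimilar() and the two stand-in rows are hypothetical. Comparing the 'id' values first keeps an article from being matched against itself.

```php
<?php
// Hypothetical helper: true when the two headlines are more than
// $threshold percent similar according to similar_text().
function headlinesAreSimilar(string $a, string $b, float $threshold = 70.0): bool
{
    $strip = ['"', "'"]; // remove quote characters before comparing
    $a = str_replace($strip, '', $a);
    $b = str_replace($strip, '', $b);
    // The return value is a character count; $percent receives 0-100.
    similar_text($a, $b, $percent);
    return $percent > $threshold;
}

// Stand-in rows; in the real script these would come from the two queries.
$rownews  = ['id' => 1, 'headline' => 'Rumsfeld Resigns'];
$dupcheck = ['id' => 2, 'headline' => 'Rumsfeld Resigns Today'];

// No foreach over the columns: the headline is addressed by its key.
$isRelated = $rownews['id'] !== $dupcheck['id']
    && headlinesAreSimilar($rownews['headline'], $dupcheck['headline']);

echo $isRelated ? "related headline\n" : "independent article\n";
```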
