Grab News Headlines

harris97

Does anyone know the technical basics of grabbing news headlines from cnn.com, Yahoo, etc.?
Thanks in advance!

terje-s

Well, I have done this a couple of times, and usually it turns out to be a living nightmare. Unless whoever you're grabbing headlines from is offering them for grabbing, it can be pretty hard.

You write some sort of application that connects to the web server you are grabbing headlines from, retrieves a document, and saves it. Then you have to write a script (perl/sed/awk are pretty good for this) that determines where the headlines are located and does whatever you want with the data.
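
As a rough illustration of those two steps (fetch a document, then locate the headlines), here is a minimal Python sketch. It assumes the headlines sit inside <h2> tags, which is only a guess and will differ from site to site. HeadlineParser, extract_headlines and fetch are names made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class HeadlineParser(HTMLParser):
    """Collect the text of every element with a given tag name.

    Assumption: the site wraps each headline in that tag (<h2> by default).
    """

    def __init__(self, tag="h2"):
        super().__init__()
        self.tag = tag
        self.depth = 0        # > 0 while we are inside a headline tag
        self.parts = []       # text fragments of the current headline
        self.headlines = []   # finished headlines, in document order

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace and line breaks into single spaces.
                text = " ".join("".join(self.parts).split())
                if text:
                    self.headlines.append(text)
                self.parts = []

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def extract_headlines(html, tag="h2"):
    """Return the headline strings found in an HTML document."""
    parser = HeadlineParser(tag)
    parser.feed(html)
    parser.close()
    return parser.headlines


def fetch(url):
    """Download a page and decode it to text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Usage would be something like extract_headlines(fetch(url)) for a page you are allowed to fetch. When the site changes its layout, the tag (or the whole matching rule) has to be adjusted.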

Be aware that this probably involves a considerable amount of work. I have also found that lynx's -dump option can help you out, e.g. fetch the HTML document, then

lynx -dump document.html > rendered.txt

or similar. This might help your script determine the headlines' actual text, which can be easier than fetching it from the original HTML file. (Hyperlinks in -dump mode are listed at the end, which can be very useful as well.)
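
To show how that link list could be used, here is a small Python sketch that reads a lynx dump and pairs link text with URLs. It assumes the usual dump layout: links appear in the text as [1]Link text, and the end of the file has a "References" section with lines like "   1. http://...". The function names are made up for this example, and link text that wraps onto a second line is only captured up to the line break.

```python
import re

# A numbered entry in the References section, e.g. "   3. http://example.com/x"
REF_LINE = re.compile(r"^\s*(\d+)\.\s+(\S+)\s*$")
# A link marker in the rendered text, e.g. "[3]Some headline"
MARKER = re.compile(r"\[(\d+)\]([^\[\n]*)")


def parse_lynx_references(dump_text):
    """Return {link number: URL} from the References section of a lynx dump."""
    links = {}
    in_refs = False
    for line in dump_text.splitlines():
        if line.strip() == "References":
            # Start over, so only the last "References" line in the file counts.
            in_refs = True
            links = {}
            continue
        if in_refs:
            match = REF_LINE.match(line)
            if match:
                links[int(match.group(1))] = match.group(2)
    return links


def links_with_text(dump_text):
    """Return (link text, URL) pairs for every numbered link in the body."""
    body, _, _ = dump_text.rpartition("\nReferences")
    urls = parse_lynx_references(dump_text)
    pairs = []
    for number, text in MARKER.findall(body):
        url = urls.get(int(number))
        if url and text.strip():
            pairs.append((text.strip(), url))
    return pairs
```

You would still need a site-specific rule to decide which of those pairs are actually headlines.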

What really pisses one off is when you've spent three weeks making this work, and they suddenly change the site layout 🙂

Anon

I'm doing something VERY similar to this right now. I've only started learning PHP within the last few days but (without trying to sound cocky) already I can extract text from a site that doesn't use embedded tags. It's not all that hard once you've figured out how you're gonna make it work. If any of the site's layouts change I only need to change a few values and it would start working again.
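
No code was posted for this, so the following is only a guess at what a "change a few values" setup might look like, sketched in Python rather than PHP. The idea is to keep the site-specific start and end markers in one table and cut out whatever lies between them. extract_between, SITE_RULES, the "example-news" entry and its markers are all hypothetical.

```python
def extract_between(text, start, end):
    """Return every stripped substring found between a start and an end marker."""
    results = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break
        begin += len(start)
        finish = text.find(end, begin)
        if finish == -1:
            break
        results.append(text[begin:finish].strip())
        pos = finish + len(end)
    return results


# Hypothetical per-site values: if a site changes its layout,
# only the markers in this table need updating.
SITE_RULES = {
    "example-news": {"start": '<td class="head">', "end": "</td>"},
}
```

For a page held in a string, extract_between(page, **SITE_RULES["example-news"]) returns the list of texts between each pair of markers.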

terje-s

Then you probably have some tips & tricks to share with the rest of us? I haven't researched the field much, but at least when I was grabbing headlines from some Norwegian newspapers etc., it turned out to be difficult (for me anyway). I may very well have missed something important, so please do explain! 🙂

markawmaw

Why not use a legit newsfeed service such as www.moreover.com? Their newsfeeds include stuff from CNN, BBC, Yahoo etc. They're very good, updated every 15 mins and available in most formats (XML, RDF, JavaScript etc.). An example of where they are used is my site (http://markw.com/news/; it might take a while to download, but that is because it's caching other stuff as well).

Hope it helps

Anon

Hey people,

just look at the code library:
http://www.phpbuilder.com/snippet/detail.php?type=snippet&id=34

There's everything you need there.

Andreas

Anon

I used the script that Andreas posted a link to, messed around with it, and built it up so it can retrieve information from (hopefully) any page rather than just pages that use embedded tags.