Hi,
I am parsing HTML documents and I need to get only the text from them - the easy way to do this is strip_tags, of course - but I also want to divide the text up into areas, because an HTML page such as http://news.bbc.co.uk/ has many different areas and I obviously don't want all of the text clumped together. I need each area added to an array.
I'm a bit puzzled as to how to go about this. Thanks
If you're getting info from BBC news, they have a nifty XML feed that you should probably look into using instead.
I'm guessing I need to divide the document by splitting it using \n as the pattern; however, I also want to keep items together that obviously belong together (such as a headline and the news story underneath it).
Thanks
What is it exactly that you are trying to achieve?
Something like the scrolling news at the top of http://dhost.info/neonradio/msnoogle/ ?
I just used bbc news as an example. Thanks anyway though
Ahhh, right, if the page you want info from has an XML feed, always use it. It'll make your life easier.
No - I want to be able to do this on any HTML page.
Hmm, then you want to parse the structure of the data-holding tags, such as <td> or <span> or <div>, and put their contents into an array.
I'm not sure how you'd do this though, lol.
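Something along these lines might work - a rough sketch using PHP's built-in DOMDocument to collect the text of each container tag into an array. The function name and tag list are just my choice, and note that nested containers will show up more than once:

```php
<?php
// Collect the text content of each data-holding tag into an array.
// Caveat: nested containers (a <div> inside a <div>) appear twice.
function blockTexts(string $html, array $tags = ['div', 'td', 'p']): array
{
    $doc = new DOMDocument();
    // Suppress warnings about the malformed HTML found on real pages
    @$doc->loadHTML($html);
    $areas = [];
    foreach ($tags as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $text = trim($node->textContent);
            if ($text !== '') {
                $areas[] = $text;
            }
        }
    }
    return $areas;
}
?>
```

You'd then do something like `blockTexts(file_get_contents($url))` and loop over the resulting array.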
Well, I've found that splitting the stripped HTML on three newlines (\n\n\n) proves effective enough, but now I have the trouble of loads of excess whitespace and \n all over the place.
trim() has pretty much sorted it.
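For anyone following along, the approach described above could look something like this - strip the tags, split wherever three or more newlines appear, and trim each piece. The function name and regex details are my own illustration, not from the thread:

```php
<?php
// Strip the tags, then split the remaining text into "areas" on runs of
// three or more newlines, trimming whitespace from each piece and
// dropping pieces that end up empty.
function htmlToAreas(string $html): array
{
    $text = strip_tags($html);
    // Split on 3+ newlines, allowing stray \r and spaces between them
    $chunks = preg_split('/(?:\r?\n[ \t]*){3,}/', $text);
    $areas = [];
    foreach ($chunks as $chunk) {
        $chunk = trim($chunk);
        if ($chunk !== '') {
            $areas[] = $chunk;
        }
    }
    return $areas;
}
?>
```

Lines separated by only one or two newlines (a headline and the story under it) stay together in the same array element, which is the behaviour asked for earlier in the thread.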