Hi,
I am parsing HTML documents and I need to get only the text from them - the easy way to do this is strip_tags, of course - but I also want to divide the text up into areas, because an HTML page such as http://news.bbc.co.uk/ has many different areas and I obviously don't want all of the text clumped together. I need each area added to an array.
I'm a bit puzzled as to how to go about this. Thanks
If you're getting info from BBC news, they have a nifty XML feed that you should probably look into using instead.
I'm guessing I need to divide the document by splitting it using \n as the pattern; however, I also want to keep items together that obviously belong together (such as a headline and the news story underneath it).
Thanks
What is it exactly that you are trying to achieve?
Something like the scrolling news at the top of http://dhost.info/neonradio/msnoogle/ ?
I just used bbc news as an example. Thanks anyway though
Ahhh, right, if the page you want info from has an XML feed, always use it. It'll make your life easier.
No - I want to be able to do this on any HTML page.
Hmm, then you want to parse the structure of the data-holding tags, such as <td> or <span> or <div>, and put their contents into an array.
I'm not sure how you'd do this though, lol.
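Something along these lines might work - a rough sketch using PHP's built-in DOMDocument to collect the text of each container tag into an array. The function name and tag list are just my choice, and note that nested containers will show up more than once:

```php
<?php
// Collect the text content of each data-holding tag into an array.
// Caveat: nested containers (a <div> inside a <div>) appear twice.
function blockTexts(string $html, array $tags = ['div', 'td', 'p']): array
{
    $doc = new DOMDocument();
    // Suppress warnings about the malformed HTML found on real pages
    @$doc->loadHTML($html);
    $areas = [];
    foreach ($tags as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $text = trim($node->textContent);
            if ($text !== '') {
                $areas[] = $text;
            }
        }
    }
    return $areas;
}
?>
```

You'd then do something like `blockTexts(file_get_contents($url))` and loop over the resulting array.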
Well, I've found that splitting the stripped HTML on three newlines (\n\n\n) proves effective enough, but now I have the trouble of loads of excess whitespace and \n all over the place.
trim() has pretty much sorted it.
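For anyone following along, the approach described above could look something like this - strip the tags, split wherever three or more newlines appear, and trim each piece. The function name and regex details are my own illustration, not from the thread:

```php
<?php
// Strip the tags, then split the remaining text into "areas" on runs of
// three or more newlines, trimming whitespace from each piece and
// dropping pieces that end up empty.
function htmlToAreas(string $html): array
{
    $text = strip_tags($html);
    // Split on 3+ newlines, allowing stray \r and spaces between them
    $chunks = preg_split('/(?:\r?\n[ \t]*){3,}/', $text);
    $areas = [];
    foreach ($chunks as $chunk) {
        $chunk = trim($chunk);
        if ($chunk !== '') {
            $areas[] = $chunk;
        }
    }
    return $areas;
}
?>
```

Lines separated by only one or two newlines (a headline and the story under it) stay together in the same array element, which is the behaviour asked for earlier in the thread.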