Hi there jdorsch,
many thanks for the posting - I will go the hard way, through the overload of coding and headache. The admins will not want to help me here; I have had bad experiences with that.
jdorsch wrote: Bernard,
The ideal method is to talk to the admins and get hold of the phpBB database data. If that is not an option, then you have some tough coding ahead of you.
Basically you need to do the following
- Open a CURL connection to the first forum page.
- Use CURL to fetch the contents of the page into a gigantic string variable.
- Search through the page either by parsing it into a DOM object or by breaking it into usable sections with regular expressions.
- On the first page, which is the list of threads, you will parse through and find the thread links; you will then need to fetch each of those links with CURL and parse the contents of each page for the data you are looking for.
- Then your code will look for the link to the next page of threads and repeat step 4.
If you don't understand any of this then you need to get some books and start reading.
Got it - it is a bit beyond my capabilities right now, so I will get some books onto my table.
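Just to make sure I understood your steps, this is how I would sketch that fetch/parse/paginate loop in Perl with plain LWP instead of CURL. It is only a rough sketch - the forum URL and every regex in it are placeholders I made up; the real patterns depend on the board's templates.
[code]
#!/usr/bin/perl
# Rough sketch of the loop jdorsch describes, with LWP instead of CURL.
# The URL and the regexes are placeholders - adjust them to the real board.
use strict;
use warnings;
use LWP::UserAgent;

my $ua   = LWP::UserAgent->new( agent => 'bernard-scraper/0.1' );
my $base = 'http://www.example.com/forum/';             # placeholder
my $page = $base . 'viewforum.php?f=1';                 # first thread-list page

while ( defined $page ) {
    my $res = $ua->get($page);
    die 'fetch failed: ' . $res->status_line unless $res->is_success;
    my $html = $res->decoded_content;                   # the "gigantic string"

    # find the thread links on this thread-list page (placeholder pattern)
    my @threads = $html =~ m/href="(viewtopic\.php\?t=\d+)"/g;

    for my $link (@threads) {
        my $thread = $ua->get( $base . $link );
        next unless $thread->is_success;
        # ... parse $thread->decoded_content for the data we want ...
    }

    # look for the "Next" page link; stop when there is none
    if ( my ($next) = $html =~ m/href="([^"]+)"[^>]*>Next</ ) {
        $next =~ s/&amp;/&/g;                            # undo HTML escaping
        $page = $base . $next;
    }
    else {
        $page = undef;
    }
}
[/code]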
btw - have you ever heard of Beautiful Soup?
Beautiful Soup is an HTML/XML parser for Python that can turn even poorly written markup into a parse tree so you can extract information. Download the free program and documentation from: http://www.crummy.com/software/BeautifulSoup/
On the Perl side we have the following options:
WWW::Mechanize to get pages and interact with them,
HTML::TokeParser::Simple (which is a simplified interface to HTML::TokeParser, itself a simplified interface of HTML::Parser) to parse stuff out of them,
DBI to insert the data into a database and do queries (a small sketch of the DBI side follows below).
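For the DBI end of it, here is the kind of thing I have in mind - sketch only: the posts table layout is just a guess at what we might want to store, and I use SQLite here so there is nothing to set up beyond DBD::SQLite.
[code]
use strict;
use warnings;
use DBI;

# Sketch only: the table layout is a guess at what we might want to keep.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=forum.db', '', '',
                        { RaiseError => 1, AutoCommit => 1 } );

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS posts (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    thread_url TEXT,
    author     TEXT,
    body       TEXT
)
SQL

my $insert = $dbh->prepare(
    'INSERT INTO posts (thread_url, author, body) VALUES (?, ?, ?)'
);

# call this once for every post we manage to parse out of a thread page
sub save_post {
    my ( $url, $author, $body ) = @_;
    $insert->execute( $url, $author, $body );
}

# queries are then plain SQL, e.g. a quick count of what has been stored
my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM posts');
print "$count posts stored so far\n";
[/code]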
We would use one of the LWP modules (I recommend LWP::RobotUA so that we respect the site's rules about automated access) to fetch that page, then one of the HTML::Parser variants to get the links within that main table. We then have a choice: either 'get' the NEXT page of links and keep doing that until we have a list of every link to fetch, or fetch each of the links we've already collected before moving away from this page.
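Here is a sketch of that link-collecting part with LWP::RobotUA and HTML::TokeParser. The base URL, the bot name and address, and the assumption that the paging link is literally labelled "Next" are all mine - they need checking against the real board.
[code]
use strict;
use warnings;
use LWP::RobotUA;
use HTML::TokeParser;

# Polite fetching: LWP::RobotUA checks robots.txt and waits between requests.
my $ua = LWP::RobotUA->new( 'bernard-scraper/0.1', 'bernard@example.com' );
$ua->delay(0.5);                                    # minutes between requests

my $base = 'http://www.example.com/forum/';         # placeholder
my $page = $base . 'viewforum.php?f=1';
my @thread_links;

while ( defined $page ) {
    my $res = $ua->get($page);
    last unless $res->is_success;
    my $html = $res->decoded_content;

    my $p = HTML::TokeParser->new( \$html );
    my $next;
    while ( my $tag = $p->get_tag('a') ) {
        my $href = $tag->[1]{href} or next;
        my $text = $p->get_trimmed_text('/a');
        push @thread_links, $base . $href if $href =~ /viewtopic\.php/;
        $next = $href if $text =~ /^Next$/i;        # the paging link
    }
    $page = defined $next ? $base . $next : undef;
}

print scalar(@thread_links), " thread links collected\n";
[/code]
That builds the full list of thread links first; the other route (fetching each thread before moving on to the next index page) would just move the per-thread fetch inside the same loop.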
To fetch them, we simply use LWP to 'get' each one in turn, looping over the array of links we've built. On each thread page it would be a pretty straightforward matter to extract the text for the database using HTML::Parser; I would probably use HTML::TokeParser. There is only one table with class="forumline", so we would just locate that, then move through that table taking what we need.
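The per-thread extraction could look roughly like this - the "forumline" class name comes from the stock phpBB 2 subSilver template, so it is an assumption about this particular board.
[code]
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities;

# Sketch: skip ahead to <table class="forumline">, then collect the text
# inside it until the matching </table> (nested tables are counted so we
# do not stop early). Deciding what to keep from that text is up to us.
sub extract_forumline_text {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html );

    # locate the forumline table
    while ( my $tag = $p->get_tag('table') ) {
        last if ( $tag->[1]{class} || '' ) eq 'forumline';
    }

    my $depth = 1;
    my @text;
    while ($depth) {
        my $token = $p->get_token or last;
        if    ( $token->[0] eq 'S' && $token->[1] eq 'table' ) { $depth++ }
        elsif ( $token->[0] eq 'E' && $token->[1] eq 'table' ) { $depth-- }
        elsif ( $token->[0] eq 'T' ) {
            my $t = decode_entities( $token->[1] );
            $t =~ s/\s+/ /g;
            push @text, $t if $t =~ /\S/;
        }
    }
    return @text;
}
[/code]
If it is a stock phpBB 2 board, the actual post text sits in the cells marked class="postbody", so that class would be the next thing to filter on before handing the rows to the DBI part above.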
Well jdorsch, I would be happy to hear from you.
Best regards,
Bernard