Hi there jdorsch,
many thanks for the posting - I will go the hard way, through the overload of coding and headache. The admins will not want to help me here; I have had bad experiences with that.
jdorsch wrote: Bernard,
The ideal method is to talk to the admins and get hold of the phpBB database data. If that is not an option, then you have some tough coding ahead of you.
Basically you need to do the following
- Open a CURL connection to the first forum page.
- Use CURL to fetch the contents of the page into a gigantic string variable.
- Search through the page either by parsing it into a DOM object or by breaking it into usable sections with regular expressions.
- On the first page, which is the list of threads, you will parse through and find the thread links; you will then need to fetch each of those links with CURL and parse the contents of each page for the data you are looking for.
- Then your code will look for the link to the next page of threads and repeat step 4.
If you don't understand any of this then you need to get some books and start reading.
Got it - it is a bit beyond my capabilities right now, so I will get some books onto my table.
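Just to make sure I understood your steps, this is how I would sketch that fetch/parse/paginate loop in Perl with plain LWP instead of CURL. It is only a rough sketch - the forum URL and every regex in it are placeholders I made up; the real patterns depend on the board's templates.
[code]
#!/usr/bin/perl
# Rough sketch of the loop jdorsch describes, with LWP instead of CURL.
# The URL and the regexes are placeholders - adjust them to the real board.
use strict;
use warnings;
use LWP::UserAgent;

my $ua   = LWP::UserAgent->new( agent => 'bernard-scraper/0.1' );
my $base = 'http://www.example.com/forum/';             # placeholder
my $page = $base . 'viewforum.php?f=1';                 # first thread-list page

while ( defined $page ) {
    my $res = $ua->get($page);
    die 'fetch failed: ' . $res->status_line unless $res->is_success;
    my $html = $res->decoded_content;                   # the "gigantic string"

    # find the thread links on this thread-list page (placeholder pattern)
    my @threads = $html =~ m/href="(viewtopic\.php\?t=\d+)"/g;

    for my $link (@threads) {
        my $thread = $ua->get( $base . $link );
        next unless $thread->is_success;
        # ... parse $thread->decoded_content for the data we want ...
    }

    # look for the "Next" page link; stop when there is none
    if ( my ($next) = $html =~ m/href="([^"]+)"[^>]*>Next</ ) {
        $next =~ s/&amp;/&/g;                            # undo HTML escaping
        $page = $base . $next;
    }
    else {
        $page = undef;
    }
}
[/code]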
btw - have you ever heard of Beautiful Soup?
Beautiful Soup is an HTML/XML parser for Python that can turn even poorly written markup into a parse tree so you can extract information. Download the free program and documentation from: http://www.crummy.com/software/BeautifulSoup/
On the Perl side we have the following options:
WWW::Mechanize to get pages and interact with them,
HTML::TokeParser::Simple (which is a simplified interface to HTML::TokeParser, itself a simplified interface of HTML::Parser) to parse stuff out of them,
DBI to insert the data into a database and do queries (a small sketch of the DBI side follows below).
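For the DBI end of it, here is the kind of thing I have in mind - sketch only: the posts table layout is just a guess at what we might want to store, and I use SQLite here so there is nothing to set up beyond DBD::SQLite.
[code]
use strict;
use warnings;
use DBI;

# Sketch only: the table layout is a guess at what we might want to keep.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=forum.db', '', '',
                        { RaiseError => 1, AutoCommit => 1 } );

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS posts (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    thread_url TEXT,
    author     TEXT,
    body       TEXT
)
SQL

my $insert = $dbh->prepare(
    'INSERT INTO posts (thread_url, author, body) VALUES (?, ?, ?)'
);

# call this once for every post we manage to parse out of a thread page
sub save_post {
    my ( $url, $author, $body ) = @_;
    $insert->execute( $url, $author, $body );
}

# queries are then plain SQL, e.g. a quick count of what has been stored
my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM posts');
print "$count posts stored so far\n";
[/code]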
We would use one of the LWP modules (I recommend LWP::RobotUA so that we respect the site's rules about automated access) to fetch that page, then one of the HTML::Parser variants to get the links within that main table. We then have a choice: either 'get' the NEXT page of links and keep doing that until we have a list of every link to fetch, or fetch each of the links we've already collected before moving away from this page.
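Here is a sketch of that link-collecting part with LWP::RobotUA and HTML::TokeParser. The base URL, the bot name and address, and the assumption that the paging link is literally labelled "Next" are all mine - they need checking against the real board.
[code]
use strict;
use warnings;
use LWP::RobotUA;
use HTML::TokeParser;

# Polite fetching: LWP::RobotUA checks robots.txt and waits between requests.
my $ua = LWP::RobotUA->new( 'bernard-scraper/0.1', 'bernard@example.com' );
$ua->delay(0.5);                                    # minutes between requests

my $base = 'http://www.example.com/forum/';         # placeholder
my $page = $base . 'viewforum.php?f=1';
my @thread_links;

while ( defined $page ) {
    my $res = $ua->get($page);
    last unless $res->is_success;
    my $html = $res->decoded_content;

    my $p = HTML::TokeParser->new( \$html );
    my $next;
    while ( my $tag = $p->get_tag('a') ) {
        my $href = $tag->[1]{href} or next;
        my $text = $p->get_trimmed_text('/a');
        push @thread_links, $base . $href if $href =~ /viewtopic\.php/;
        $next = $href if $text =~ /^Next$/i;        # the paging link
    }
    $page = defined $next ? $base . $next : undef;
}

print scalar(@thread_links), " thread links collected\n";
[/code]
That builds the full list of thread links first; the other route (fetching each thread before moving on to the next index page) would just move the per-thread fetch inside the same loop.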
To fetch them, we simply use LWP to 'get' each one in turn, looping over the array of links we've built. On each thread page it would be a pretty straightforward matter to extract the text for the database using HTML::Parser; I would probably use HTML::TokeParser. There is only one table with class="forumline", so we would just locate that, then move through that table taking what we need.
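The per-thread extraction could look roughly like this - the "forumline" class name comes from the stock phpBB 2 subSilver template, so it is an assumption about this particular board.
[code]
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities;

# Sketch: skip ahead to <table class="forumline">, then collect the text
# inside it until the matching </table> (nested tables are counted so we
# do not stop early). Deciding what to keep from that text is up to us.
sub extract_forumline_text {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html );

    # locate the forumline table
    while ( my $tag = $p->get_tag('table') ) {
        last if ( $tag->[1]{class} || '' ) eq 'forumline';
    }

    my $depth = 1;
    my @text;
    while ($depth) {
        my $token = $p->get_token or last;
        if    ( $token->[0] eq 'S' && $token->[1] eq 'table' ) { $depth++ }
        elsif ( $token->[0] eq 'E' && $token->[1] eq 'table' ) { $depth-- }
        elsif ( $token->[0] eq 'T' ) {
            my $t = decode_entities( $token->[1] );
            $t =~ s/\s+/ /g;
            push @text, $t if $t =~ /\S/;
        }
    }
    return @text;
}
[/code]
If it is a stock phpBB 2 board, the actual post text sits in the cells marked class="postbody", so that class would be the next thing to filter on before handing the rows to the DBI part above.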
Well jdorsch, I would be happy to hear from you.
Best regards,
Bernard