Hi folks,
okay, I know screen-scraping is very complicated and error-prone, but I have a task ahead: I have to parse a huge list of threads like this one: http://www.phpbb.com/phpBB/viewtopic.php?p=2382640. I have to get the data out of them, and I do not know how to parse them.
Question: which one should I take: preg_match(), preg_replace() or preg_split()? See the following:
http://www.php.net/manual/en/function.preg-replace.php
http://php.net/str_replace
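For what it's worth, the three functions do different jobs, so the choice depends on the task. Here is a small sketch on a made-up fragment; the markup and the variable names are assumptions for illustration, not real phpBB output:

```php
<?php
// Hypothetical fragment of a post, invented for this example.
$html = '<span class="name"><b>metabo</b></span> wrote: <span class="postbody">Hello, world</span>';

// preg_match(): find the first match and capture a part of it.
preg_match('#<span class="name"><b>(.*?)</b></span>#', $html, $m);
$username = $m[1];

// preg_replace(): rewrite the text, here by stripping every tag.
$plain = preg_replace('#<[^>]+>#', '', $html);

// preg_split(): cut a string into pieces at a separator pattern.
$words = preg_split('/[\s,]+/', 'Hello, world', -1, PREG_SPLIT_NO_EMPTY);
```

So for pulling fields out of a page, preg_match() (or preg_match_all() for repeated items) is probably the closest fit; the other two are for rewriting and for splitting.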
This is a phpBB bulletin board. I need the data in an almost full and complete format; see this example thread: http://www.phpbb.com/phpBB/viewtopic.php?p=2382640
So I need all the data, like:
username
forum
thread
topic
text of the posting, and so on. How do I accomplish that?
<?php
// BBCode parser by Sevendust
class BBParse
{
    public $InputString;
    public $OutputString;

    public function __construct($InputString)
    {
        $this->InputString = $InputString;
        if ($InputString == '') {
            throw new InvalidArgumentException('You have to provide at least a message of 20 characters!');
        }
        // Define some default BB code tags, such as bold, italic, and url.
        // "[" and "]" are regex metacharacters, so they are escaped; "#" is
        // the delimiter because the closing tags contain "/".
        $BBCode[0] = '#\[b\]#i';
        $BBCode[1] = '#\[/b\]#i';
        $BBCode[2] = '#\[i\]#i';
        $BBCode[3] = '#\[/i\]#i';
        // [url] and [/url] are matched as a pair so the address can be captured.
        $BBCode[4] = '#\[url\](.*?)\[/url\]#i';
        // Replacement strings, in HTML of course.
        $BBReplace[0] = '<b>';
        $BBReplace[1] = '</b>';
        $BBReplace[2] = '<i>';
        $BBReplace[3] = '</i>';
        $BBReplace[4] = '<a href="$1">$1</a>';
        // Keep the result on the object so the caller can read it.
        $this->OutputString = preg_replace($BBCode, $BBReplace, $InputString);
    }
}
?>
Well, loading the HTML and parsing it is not the tricky bit.
We can load HTML over HTTP using fopen() (or file_get_contents()) and parse it using DOMDocument::loadHTML(). The hard bit is ensuring that our application is robust enough to keep working despite design changes to the target pages. Something like phpBB probably doesn't use semantic HTML at all and just has a heap of nested tables, which makes finding a specific piece of content very hard.
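To make the loading and parsing part concrete, here is a minimal sketch. The helper name parse_html() and the stand-in markup are my own assumptions; a real phpBB page will look different, and the fetch line is left commented out so the example runs without network access:

```php
<?php
// Parse an HTML string into a DOM tree. Real-world pages are rarely
// valid, so libxml warnings are collected and discarded, not printed.
function parse_html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Fetching over HTTP could look like this (needs allow_url_fopen = On):
// $html = file_get_contents('http://www.phpbb.com/phpBB/viewtopic.php?p=2382640');

// Stand-in markup, invented for this example.
$html = '<html><head><title>Example topic</title></head><body>'
      . '<table><tr><td><span class="postbody">First post</span></td></tr></table>'
      . '</body></html>';

$doc   = parse_html($html);
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
$spans = $doc->getElementsByTagName('span');
```

The robustness problem stays: the code only knows the tag names it asks for, so a redesign of the page can silently make it return nothing.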
Basically, we can do the following:
- We open a cURL connection to the first forum page.
- We use cURL to fetch the contents of the page into one big string variable.
- We search through the page, either by parsing it into a DOM object or by breaking it into usable sections with regular expressions.
- On the first page, which is the list of threads, we parse through and find the thread links. We then need to fetch each of those links with cURL and parse the contents of each page for the data we are looking for.
- Then our code looks for the link to the next page of threads, and we repeat step 4.
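The steps above could be sketched roughly like this. All function names are hypothetical, and so are the assumptions that thread links contain "viewtopic.php", that paging links contain "viewforum.php" and "start=", and that every href is an absolute URL (relative ones would have to be resolved against the forum URL first):

```php
<?php
// Steps 1-2: fetch one page into a string with cURL.
function fetch_page(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $body;
}

// Step 3: parse the page and collect every href that contains $needle.
function extract_links(string $html, string $needle): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (strpos($href, $needle) !== false) {
            $links[$href] = true; // array keys remove duplicates
        }
    }
    return array_keys($links);
}

// Steps 4-5: fetch every thread on a list page, then move to the next list page.
function crawl_forum(string $forumUrl, int $maxPages = 3): array
{
    $threads = [];
    $seen = [];
    $url = $forumUrl;
    while ($url !== null && count($seen) < $maxPages) {
        $seen[$url] = true;
        $html = fetch_page($url);
        foreach (extract_links($html, 'viewtopic.php') as $threadUrl) {
            $threads[$threadUrl] = fetch_page($threadUrl);
        }
        // Naive "next page" choice: the first paging link not visited yet.
        $url = null;
        foreach (extract_links($html, 'viewforum.php') as $candidate) {
            if (strpos($candidate, 'start=') !== false && !isset($seen[$candidate])) {
                $url = $candidate;
                break;
            }
        }
    }
    return $threads;
}
```

A call such as crawl_forum('http://www.phpBB.com/phpbb') would return an array of thread URL => raw HTML. The $maxPages limit is there so a wrong paging guess cannot loop forever, and a pause between requests would be polite to the server.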
Note: this is the site: http://www.phpBB.com/phpbb
I need all the threads there. For the parsing job, I just had a look at the phpBB pages.
It looks like it is fairly easy, since phpBB uses CSS classes to indicate the different parts of a thread, for instance <span class="postbody">, <span class="postdetails">, <span class="name">, etc.
We can search for span tags with those CSS classes and get the information contained in those tags. If necessary, we can also navigate from those tags to nearby tags. It doesn't matter how many nested tables the span tag is in, since once we find the span tag we can use it as an anchor and find other tags relative to it.
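A minimal sketch of that anchor idea with DOMXPath follows. The markup is invented to imitate nested tables and the field values are made up; only the three class names come from the pages I looked at:

```php
<?php
// Stand-in markup with nested tables; real phpBB output will differ.
$html = '<table><tr>'
      . '<td><span class="name"><b>metabo</b></span><br>'
      . '<span class="postdetails">Joined: 2006</span></td>'
      . '<td><table><tr><td><span class="postbody">I need all the data.</span></td></tr></table></td>'
      . '</tr></table>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);
$xpath = new DOMXPath($doc);

$posts = [];
// Find every span.name, no matter how deeply it is nested.
foreach ($xpath->query('//span[@class="name"]') as $nameSpan) {
    // Use the span as an anchor: climb to the nearest enclosing row,
    // then search for the post body anywhere below that row.
    $row = $xpath->query('ancestor::tr[1]', $nameSpan)->item(0);
    if ($row === null) {
        continue;
    }
    $body = $xpath->query('.//span[@class="postbody"]', $row)->item(0);
    $posts[] = [
        'username' => trim($nameSpan->textContent),
        'text'     => $body ? trim($body->textContent) : '',
    ];
}
```

Note that @class="name" only matches when the attribute is exactly "name"; if a span carries several classes, contains(@class, "name") would be the looser test.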
What do you think? I look forward to hearing from you.
metabo
btw, I have to explain something: I have to grab some data out of a phpBB forum in order to do some field research. I need the data out of a forum that is run by a user community.
I need the data to analyze the discussions.
Nothing harmful, nothing bad, nothing serious or dangerous. But the issue is: I have to get the data, so what now?
I need the data in an almost full and complete format. So I need all the data, like:
username
forum
thread
topic
text of the posting, and so on. How do I do that?
I need some kind of grabbing tool. Can I do it with that kind of tool? And how do I solve the issue of storing the data in a local MySQL database? Well, you see, that is tricky work, and I am pretty sure that I will get help here. So for any and all help I am very thankful. Many thanks in advance.
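For the storing issue, one possible sketch uses PDO with a prepared statement. The table layout, the column names, the database name and the credentials are all assumptions made up for this example; only the field list (username, forum, thread, topic, text) comes from what I need:

```php
<?php
// Hypothetical table layout; adjust names and types to the data collected.
const CREATE_POSTS_SQL = 'CREATE TABLE IF NOT EXISTS posts (
    id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(255) NOT NULL,
    forum VARCHAR(255) NOT NULL,
    thread VARCHAR(255) NOT NULL,
    topic VARCHAR(255) NOT NULL,
    body TEXT NOT NULL
)';

// Reduce a scraped post to exactly the columns of the table,
// trimming whitespace and filling missing fields with ''.
function post_to_row(array $post): array
{
    $row = [];
    foreach (['username', 'forum', 'thread', 'topic', 'body'] as $field) {
        $row[$field] = trim((string) ($post[$field] ?? ''));
    }
    return $row;
}

// Insert one post. The prepared statement keeps quotes in the
// posting text from breaking the query.
function store_post(PDO $pdo, array $post): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO posts (username, forum, thread, topic, body)
         VALUES (:username, :forum, :thread, :topic, :body)'
    );
    $stmt->execute(post_to_row($post));
}

// Usage against a local MySQL server (placeholder credentials):
// $pdo = new PDO('mysql:host=localhost;dbname=forum_research;charset=utf8mb4',
//                'user', 'secret', [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// $pdo->exec(CREATE_POSTS_SQL);
// store_post($pdo, ['username' => 'metabo', 'body' => 'I need all the data.']);
```

Keeping the row clean-up in its own function means it can be checked without a database, and the scraper only has to hand over plain arrays.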