I would like to screen scrape a forum page. I just want to scrape the hyperlinks to the threads and nothing else.

I have looked into the page source and noticed that every thread title is displayed using the same pattern:
<a class="title threadtitle_unread" href="{url}">{title of thread}</a>

So basically, what I need to do is the following:
1) Get the webpage source code
2) Find instances of code which have <a class="title threadtitle_unread" href="{url}">{title}</a>
3) Print these instances

My end goal is to create a central page of all news from various sites for my football club.

    This is something that is easily accomplished with regular expressions (and still somewhat easily accomplished with non-regex string functions). Go ahead and read up on the things provided by PHP and this should be easy peasy for you:

    preg_match()
    preg_match_all()

    (And for extra credit...)
    strpos()
    substr()
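
As a rough illustration of the preg_match_all() route, here is a minimal sketch. It assumes the anchor tags look exactly like the pattern quoted in the question (class first, then href, both in double quotes). The function name extractThreadLinks() and the sample markup are made up for the example; on a real page the HTML would come from something like file_get_contents().

```php
<?php
// Hypothetical helper: returns url/title pairs for every thread link
// that matches the pattern described in the question.
function extractThreadLinks(string $html): array
{
    // Matches <a class="title threadtitle_unread" href="...">...</a>.
    // Assumption: attributes appear in exactly this order, in double quotes.
    $pattern = '~<a\s+class="title threadtitle_unread"\s+href="([^"]*)"[^>]*>(.*?)</a>~is';

    $links = [];
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $links[] = [
                'url'   => html_entity_decode($m[1]),  // turn &amp; back into &
                'title' => trim(strip_tags($m[2])),    // drop any nested markup
            ];
        }
    }
    return $links;
}

// Sample markup standing in for the downloaded page source.
$html = '<li><a class="title threadtitle_unread" href="showthread.php?t=1">First thread</a></li>'
      . '<li><a class="title" href="showthread.php?t=2">Already read</a></li>'
      . '<li><a class="title threadtitle_unread" href="showthread.php?t=3">Second thread</a></li>';

foreach (extractThreadLinks($html) as $link) {
    echo $link['url'], ' ', $link['title'], PHP_EOL;
}
```

A regex like this is brittle: if the forum reorders the attributes or adds another class, it stops matching, so treat it as a starting point rather than a finished scraper.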

      2 years later

      Hi,

      I want to extract data from forums automatically.

      Just to give you an example:

      From this forum:

      http://www.kadinlarkulubu.com/forum.php

      keywords: android, telefon, iphone, samsung

      I want to export the search results for the keywords I enter to an Excel file.

      For each result I want to see: date and time, the message, the nickname of the person who wrote it, the URL of the message, and the keywords that appear in it.

      Normally you have to log in to the forum to see search results.

      And if the search results span 24 pages, I want the program to export the results from all of the pages to Excel automatically.

      Can you suggest any software that can provide this?

      thanks a lot!!

        ozgurk;11024501 wrote:

        Can you suggest any software that can provide this?

        Well, seeing as how this is a PHP forum, I'll go ahead and suggest the obvious: PHP!

          bradgrafelman wrote:

          Well, seeing as how this is a PHP forum, I'll go ahead and suggest the obvious: PHP!

          +1!! 😃

          The PHPExcel class is handy for writing xls and xlsx files: http://phpexcel.codeplex.com/

            dalecosp;11024537 wrote:

            +1!! 😃

            The PHPExcel class is handy for writing xls and xlsx files: http://phpexcel.codeplex.com/

            thanks a lot for your replies..

            Actually, I cannot program myself; the last time I did was in my 2nd year of university, in Visual C++.. 😉
            I need to buy software or find someone to help me with this.

            After testing a few things, the best solution for me looks to be copying the whole forum into a table with these fields:
            subforum, date&time, nickname, message
            Here is a message from this thread as an example:
            php help ; general help; screen scraping - how to scrape just the URL's;26.02.2013 23:05; dalecosp; The PHPExcel class is handy for writing xls and xlsx files: http://phpexcel.codeplex.com/

            do you know any software that can do this? or anyone can do this for some money?

            thanks a lot again

              5 days later
              ozgurk;11024587 wrote:

              do you know any software that can do this? or anyone can do this for some money?

              thanks a lot again

              I don't know of anything that's commercially available, but that's what search engines are for ;-)

              As for someone who'd do it for money, I couldn't say. Most people like to be employed. Whether or not you've got enough funds to attract the right kind of person and pay them for long enough to get a project like this off the ground is quite another question, I suppose.

              I have worked in a firm that did a "scraper" project. It was more than 4 months from conception to BETA (and the jury's still out on whether or not it helped the bottom line). Nothing's certain.

                If you have some money to throw at the problem, you might consider posting a help-wanted ad on craigslist or something. If you are interested in trying to code it yourself, it doesn't sound exceptionally difficult to do a single page. Different forums will have different HTML structures, but they might be pretty consistent depending on the forum software used. I.e., parsing any forum that uses vBulletin might be possible with one script, phpBB might be possible with one other script, etc.

                The basic idea is:
                1) Fetch the HTML of the page
                2) Using either preg_match_all() or DOM, parse out the HTML contents
                3) Write said contents to a file: either XLS using the link provided above, or just straight CSV using fputcsv().
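
The three steps above could be sketched roughly as follows, here using DOM rather than a regex, and CSV rather than XLS. The helper names parseThreadLinks() and writeCsv(), the "threadtitle" class filter, and the sample markup are all assumptions for illustration; a real forum needs its own selector, and login or multi-page handling is not covered. Passing an empty escape character to fputcsv() needs PHP 7.4 or later.

```php
<?php
// Hypothetical helper for step 2: collects href and text of every <a>
// whose class attribute contains "threadtitle" (assumed class name).
function parseThreadLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//a[contains(@class, "threadtitle")]') as $a) {
        $rows[] = [$a->getAttribute('href'), trim($a->textContent)];
    }
    return $rows;
}

// Hypothetical helper for step 3: writes a header line plus one CSV line per row.
function writeCsv($stream, array $rows): void
{
    // Delimiter, enclosure and escape are passed explicitly; the empty
    // escape string turns off PHP's non-standard escape handling.
    fputcsv($stream, ['url', 'title'], ',', '"', '');
    foreach ($rows as $row) {
        fputcsv($stream, $row, ',', '"', '');
    }
}

// Step 1 would normally be $html = file_get_contents($url);
// a literal string keeps this sketch self-contained.
$html = '<ul>'
      . '<li><a class="title threadtitle_unread" href="showthread.php?t=1">First thread</a></li>'
      . '<li><a class="title threadtitle_unread" href="showthread.php?t=3">Second, with a comma</a></li>'
      . '</ul>';

writeCsv(fopen('php://output', 'w'), parseThreadLinks($html));
```

To write a file instead of printing, open it with fopen('threads.csv', 'w') and pass that handle to writeCsv(). Excel opens CSV files directly, which may be enough without a separate XLS library.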
