Screen scrape with javascript:__doPostBack?

Inf51 · Mar 3, 2011

I'm trying to scrape the contents of a page using simple_html_dom but the problem I have is that the page I'm trying to scrape is an aspx page and the links on that page all use javascript:__doPostBack to display the data (i.e. there's no data on that page itself until you click one of the links and it then displays the data on the same page).

So if I try to scrape the data off that page I'd need to somehow follow the links but as they're javascript:__doPostBack links and post back to the same aspx page I'm not sure if it is possible or where I'd start?

Thanks.

Steve.

johanafm · Mar 7, 2011

An HTTP request made from JavaScript doesn't differ at all from any other HTTP request in how it's performed. Some data is sent to a web server, naming a specific resource as its target. The web server passes the data on to whatever is found there, collects the output and sends it back to whoever made the request.
This is how the page is displayed to begin with. The __doPostBack function then makes a new HTTP request to tell the server which link's data it wants. If that request is made from script rather than as an ordinary form submission, the response may come in formats other than (X)HTML, since the web browser is no longer directly responsible for parsing the data; JavaScript does that instead. So although the response may be sent as (X)HTML, it may also be XML, plain text, or, commonly, JSON (JavaScript Object Notation), which gives you JavaScript data structures (arrays and objects) directly.

You only have to figure out what is sent in these requests, and then mimic the same requests. After that you can inspect the responses and deal with parsing them as necessary.
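As a rough illustration of "mimic the same requests", here is a minimal sketch in Python (standard library only; the same idea carries over to PHP). It assumes a standard ASP.NET Web Forms page, where __doPostBack copies its two arguments into the hidden fields __EVENTTARGET and __EVENTARGUMENT and submits the form along with the other hidden fields such as __VIEWSTATE. The page snippet, the control name ctl00$lnkDetails, the field values and the helper name build_postback_fields are all made up for the example; check the real request in your own capture before relying on any of it.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of every <input type="hidden"> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


# Matches the two quoted arguments of __doPostBack('target','argument').
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)',\s*'([^']*)'\)")


def build_postback_fields(page_html, href):
    """Return the form fields a click on the given postback link would send.

    Hypothetical helper: it only handles hidden inputs, not text boxes,
    selects or other controls the real form may also submit.
    """
    match = POSTBACK_RE.search(href)
    if match is None:
        raise ValueError("not a __doPostBack link: " + href)
    parser = HiddenFieldParser()
    parser.feed(page_html)
    fields = dict(parser.fields)
    fields["__EVENTTARGET"], fields["__EVENTARGUMENT"] = match.groups()
    return fields


# Made-up page fragment standing in for the HTML fetched from the site.
sample_page = """
<form method="post" action="list.aspx">
<input type="hidden" name="__VIEWSTATE" value="abc123" />
<input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
<a href="javascript:__doPostBack('ctl00$lnkDetails','')">Details</a>
</form>
"""

fields = build_postback_fields(
    sample_page, "javascript:__doPostBack('ctl00$lnkDetails','')"
)
body = urlencode(fields)  # URL-encoded body for the POST request
print(body)
```

The resulting body would then be sent as a POST to the same .aspx URL, with whatever cookies the first page load set, and the response parsed like any other page.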

Inf51 · Mar 7, 2011

Thanks!

Now that I know it's possible in theory, I can try to figure out how to do it.

johanafm · Mar 7, 2011

Get Firefox and the Firebug add-on. Open Firebug, click the "Console" menu item and enable only "Show XMLHttpRequests". Then click one of the links on the page and you can directly and easily inspect both the request and the response, including headers.

Inf51 · Mar 7, 2011

Great - thanks again.

I was thinking of using a proxy server to log the requests, but Firebug makes it much easier.
