Imagine an image gallery, whose Page 1 has a URL like:
http://www.myserver.com/gallery.php

The subpages share the same URL. Pages 2 and 3 are reached via Javascript links on the page that load new data without refreshing the entire page (think AJAX).

Easy: curl into gallery.php and take whatever I want.

Hard: curl into gallery.php, follow the Javascript links to Pages 2 and 3, and get data off those pages.

Anyone know how to "follow" the Javascript links with curl and get the data off Pages 2 and 3?

thanks!
Jason
Los Angeles, CA

    It's really a hard problem because you have to essentially write a web browser.

    If you want a quick hack, you could find the pattern of what new pages the JS calls and make curl call those pages manually - or you could even find a pattern of what the images are called and call those images through curl.

    But if you are looking for a REAL, completely functional Javascript interpreter, you are going to need to build your own web browser.

    Think about it this way: what if the Ajax does something really bizarre, like counting the number of seconds you've been on the page and loading content based on how long you've been reading? At 1-10 seconds it pulls in content for version #2, at 11-20 seconds it pulls in content for version #3, etc. Now imagine that it checks which images you moused over. If you moused over images 1, 3, 5 and 7, it loads content for version #4, and if you clicked any of the radio buttons, it pulls content for version #5.

    So a quick hack is to assume that the Ajax code is ultra simple - essentially load the next page. The real solution (which you may not need) is to include a Javascript interpreter into your new PHP Curl web browser script which can follow the Ajax calls exactly as they would on a real site.

    If you can examine the Ajax (Javascript) code that calls the next page, then you don't really have to "follow" the links, you can just calculate what they should be. Truly, the Ajax could be assembling the URL to the next page from wacky code like: URL = "h" + "t" +"tp://www" + ".mysite" + ".com/page" + "2.php" in which case, you can't simply "read" the contents of where the Javascript will bring you next... you have to let the Javascript execute so you can let it build the URL for you. And for that, you'd need to have a Javascript interpreter.

    The hack is easier assuming that there is a pattern to what pages get called next.
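A minimal sketch of the quick hack, assuming the pages turn out to follow a predictable pattern. The `?page=N` pattern and the helper names `buildPageUrls()` and `fetchPage()` are hypothetical; the real pattern has to be read out of the site's Javascript.

```php
<?php
// Hypothetical URL pattern: assumes the pages are exposed as gallery.php?page=N.
// Inspect the real Javascript to find out what it actually requests.
function buildPageUrls($base, $firstPage, $lastPage)
{
    $urls = array();
    for ($page = $firstPage; $page <= $lastPage; $page++) {
        $urls[] = $base . '?page=' . $page;
    }
    return $urls;
}

// Fetch one URL with curl and return the body as a string (false on failure).
function fetchPage($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$urls = buildPageUrls('http://www.myserver.com/gallery.php', 2, 3);
// foreach ($urls as $url) { $html = fetchPage($url); /* preg_match the image srcs here */ }
```

If the pattern holds, no Javascript has to be executed at all; if it does not, the pages have to be requested the same way the script requests them.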

    If you are really going to pursue the complicated solution, I'd start by obtaining the source code for Firefox since it already has the "curl" and the Javascript parts in place - but that's what I would call a hard project.

      etully,

      I see you understand my dilemma. Let me clarify a bit further.

      Yes, I can curl to gallery.php, and retrieve what the Javascript links to the "next" pages are. That is very structured and consistent. No problems there.

      Unfortunately, the Javascript links look like
      "javascript:__doLInk('xyz$Main$ImageListings','2')"
      or
      "javascript:__doLInk('xyz$Main$ImageListings','3')"

      So those Javascript functions do not call a new HTML/php page. It is like an Ajax call, and simply replaces the gallery images on the page with new ones.

      I have no way of predicting what the image URLs on pages 2 and 3 are. I am stuck with figuring out a way to "follow" the Javascript function calls to the next page, and then preg_match'ing my image srcs.

      Does this make things more clear to you? Does this change the way you see the problem?

      thanks,
      Jason

        If it's just replacing an image with a new image (like the age old mouse overs)... then yes, it needs to go out to the Internet to retrieve the new image... but that doesn't make it Ajax. When you change the property of an image so that it has a new source, you are forcing the browser to load a new image, but it's not necessarily using Ajax to retrieve that image.

        If it's Ajax, then you should see a part of the code that calls a real live URL. If it's simply changing the source of the image, then it's 1996 technology.

        So I guess I would need to understand what the function __doLInk looks like. Does it use XMLHttpRequest, or does it just assign to .src?

        I suspect that you could copy the Javascript from their page to a page on your own web site and put some alert commands in it to see what URL it's calling (if Ajax) or what image it's calling (if not Ajax). That would get you closer to understanding how the JS functions find the new images, which in turn gets you closer to building a curl app that can predict which images to call.

        As you can tell, I'm grasping at straws here (and maybe not making complete sense) because I can't see the remote site.

        But the root of your problem is that unless you embed a JS interpreter into your PHP based Curl script, you can't execute the JS to truly follow the links... so you'll have to figure out what the JS does and simulate that to predict the next "thing" to call whether it's an image or a URL in Ajax.
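One way to start simulating the JS is to pull the function name and its two arguments out of each javascript: link, so they can be replayed later. This is only a sketch: `extractJsLinkArgs()` is a hypothetical helper, and it assumes the links always have the two-argument, single-quoted shape quoted earlier in the thread.

```php
<?php
// Collect the arguments of every href="javascript:someFunction('target','argument')" link.
function extractJsLinkArgs($html)
{
    $pattern = '~href="javascript:\s*(\w+)\(\'([^\']*)\'\s*,\s*\'([^\']*)\'\)"~';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = array();
    foreach ($matches as $m) {
        $links[] = array('function' => $m[1], 'target' => $m[2], 'argument' => $m[3]);
    }
    return $links;
}

// Sample markup modelled on the links quoted in the thread.
$sample = <<<'HTML'
<a href="javascript:__doLInk('xyz$Main$ImageListings','2')">2</a>
<a href="javascript:__doLInk('xyz$Main$ImageListings','3')">3</a>
HTML;

$links = extractJsLinkArgs($sample);
// $links[0]['target'] is 'xyz$Main$ImageListings' and $links[1]['argument'] is '3'
```

What to do with those arguments still depends on what the function itself does, which is why seeing its source matters.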

          thank you etully!

          Believe it or not, I think we're making progress.

          Here's some of the javascript code:

          <script type="text/javascript">
          <!--
          var theForm = document.forms['aspnetForm'];
          if (!theForm) {
              theForm = document.aspnetForm;
          }
          function __doPostBack(eventTarget, eventArgument) {
              if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
                  theForm.__EVENTTARGET.value = eventTarget;
                  theForm.__EVENTARGUMENT.value = eventArgument;
                  theForm.submit();
              }
          }
          // -->
          </script>

          Unfortunately, my knowledge of javascript is somewhat limited (read maybe a book and a half on the subject).

          But it's clear the page submits a form.

            Looks like it's an ASP.NET script.

            More code:
            <form name="aspnetForm" method="post" action="script.aspx?%3ffuseaction=gallery" id="aspnetForm">
            <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
            <input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />

            So the javascript links submit this form and supply the __EVENTTARGET and __EVENTARGUMENT POST variables.

            I think I can do this!! Just use curl to send those arguments via post to the script identified in the action parameter of the form tag...
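A rough sketch of that idea in PHP. The field names are the ones ASP.NET normally emits (with a leading double underscore); the host in the commented-out call and the helper names `buildPostBackBody()` and `postBack()` are assumptions made for illustration, not taken from the real site.

```php
<?php
// Build the form body that the Javascript would have submitted.
function buildPostBackBody($eventTarget, $eventArgument)
{
    return http_build_query(array(
        '__EVENTTARGET'   => $eventTarget,
        '__EVENTARGUMENT' => $eventArgument,
    ), '', '&');
}

// POST the body to the form's action URL and return the response HTML.
function postBack($actionUrl, $body)
{
    $ch = curl_init($actionUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

$body = buildPostBackBody('xyz$Main$ImageListings', '2');
// Assumed host and path; the action attribute in the page is relative.
// $html = postBack('http://www.myserver.com/script.aspx?%3ffuseaction=gallery', $body);
```

Some ASP.NET pages also expect the other hidden fields of the form to be posted back, so this minimal body may not be enough everywhere.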

              UPDATE

              It worked! What I suggested in the post above is exactly what it took.

              thanks for the helpful brainstorming etully!

                10 months later

                Hi, I'm having the same problem. Can you help me?
                Take a look at this web page:
                http://www.zap.com.br/imoveis/resultado-busca-imoveis.aspx?IDTransacao=3&Transacao=Comprar+um+im%u00f3vel&IDUF=19&UF=RIO+DE+JANEIRO&IDLocalidade=63118&Localidade=RIO+DE+JANEIRO&IDTipo=1&Tipo=Apartamento&ZonaGrupo=9&IDDistrito=0&Distrito=Todos&TipoBusca=Simples

                This page has pages 1, 2, 3 and 4.

                code:
                #######################################################
                <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResultadoBusca','Page$1')">1</a></td><td><span>2</span></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResultadoBusca','Page$3')">3</a></td><td><a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdResultadoBusca','Page$4')">4</a>
                #######################################################

                Next, look at the code of __doPostBack:

                #######################################################
                function __doPostBack(eventTarget, eventArgument) {
                    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
                        theForm.__EVENTTARGET.value = eventTarget;
                        theForm.__EVENTARGUMENT.value = eventArgument;
                        theForm.submit();
                    }
                }
                #######################################################

                All right, I need to send the __VIEWSTATE via POST too:
                #######################################################
                <input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
                <input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
                <input type="hidden" name="__LASTFOCUS" id="__LASTFOCUS" value="" />
                <input type="hidden" name="VIEWSTATE" id="VIEWSTATE" value="/wEPDwUJMzY4NDQzMjQxDxYQHgZhRGFkb3MWAB4IUXRkTGluaGEIAADOQh4Vc1RpdHVsb1Jlc3VsdGFkb0J1c2NhBXNGb3JhbSBlbmNvbnRyYWRvcyAxMDMgcmVzdWx0YWRvcyBjb20gb3Mgc2VndWludGVzIGNyaXTDqXJpb3M6IDxicj5BcGFydGFtZW50bywgVG9kb3MsIFJJTyBERSBKQU5FSVJPL1JJTyBERSBKQU5FSVJPHg1DZWx1bGFzVmF6aWFzBQVGYWxzZR4QUHJpbWVpcm9EZXN0YXF1ZQUEVHJ1ZR4JQ2FiZWNhbGhvBQRUcnVlHgppUGFnZUluZGV4AgEeDlByaW1laXJhT2ZlcnRhBQRUcnVlFgJmD2QWBmYPZBYIAgEPZBYCZg8WAh4EVGV4dGVkAgIPFgIfCAUkPG1ldGEgbmFtZT0iZGVzY3JpcHRpb24iIGNvbnRlbnQ9IiI+ZAIDDxYCHwgFITxtZXRhIG5hbWU9ImtleXdvcmRzIiBjb250ZW50PSIiPmQCIg8WAh8IBTU8bGluayByZWw9InN0eWxlc2hlZXQiIGhyZWY9Ii9jc3MvemFwX25vZnJhbWUuY3NzIiAvPmQCAQ8WAh8IBRFUb3AsUmlnaHQseDEwLHgwOWQCAg9kFg4CAg8WAh8IBTM8U0NSSVBUIExBTkdVQUdFPUphdmFTY3JpcHQ+T0FTX0FEKCdUb3AnKTs8L1NDUklQVD5kAgMPZBYIZg8WAh4HVmlzaWJsZWhkAgEPZBYEZg8PFgQfCAUNU0FPIFBBVUxPLCBTUB4HVG9vbFRpcAUNU0FPIFBBVUxPLCBTUGRkAgEPZBYCAgEPD2QWAh4Jb25rZXlkb3duBbMBaWYoZXZlbnQud2hpY2ggfHwgZXZlbnQua2V5Q29kZSl7aWYgKChldmVudC53aGljaCA9PSAxMykgfHwgKGV2ZW50LmtleUNvZGUgPT0gMTMpKSB7ZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQoJ2N0bDAwJEJhcnJhX2xvZ2luJGJ0bk9LJykuY2xpY2soKTtyZXR1cm4gZmFsc2U7fX0gZWxzZSB7cmV0dXJuIHRydWV9OyBkAgMPZBYCZg8PZBYCHwsFvQFpZihldmVudC53aGljaCB8fCBldmVudC5rZXlDb2RlKXtpZiAoKGV2ZW50LndoaWNoID09IDEzKSB8fCAoZXZlbnQua2V5Q29kZSA9PSAxMykpIHtkb2N1bWVudC5nZXRFbGVtZW50QnlJZCgnY3RsMDAkQmFycmFfbG9naW4kYnRuT0tNaW5oYVNlbmhhJykuY2xpY2soKTtyZXR1cm4gZmFsc2U7fX0gZWxzZSB7cmV0dXJuIHRydWV9OyBkAgYPFgIfCWhkAgQPFgIfCAUzPFNDUklQVCBMQU5HVUFHRT1KYXZhU2NyaXB0Pk9BU19BRCgneDA5Jyk7PC9TQ1JJUFQ+ZAIFD2QWGGYPFgIfCAVUPGlucHV0IHR5cGU9J3RleHQnIGlkPSdoaWRDaGVja2VkJyBuYW1lPSdoaWRDaGVja2VkJyBzdHlsZT0nZGlzcGxheTpub25lOycgdmFsdWU9Jyc+ZAIBDxYCHglpbm5lcmh0bWwF8QE8YSBocmVmPScvZGVmYXVsdC5hc3B4Jz5Ib21lPC9hPiZuYnNwOyZndDsmbmJzcDs8YSBocmVmPScvaW1vdmVpcy9idXNjYS1kZS1pbW92ZWlzLXNpbXBsZXMuYXNweCc+SW0mb2FjdXRlO3ZlaXM8L2E+Jm5ic3A7Jmd0OyZuYnNwOzxhIGhyZWY9Jy9pbW92ZWlzL2J1c2NhLWRlLWltb3ZlaXMtc2ltcGxlcy5hc3B4Jz5CdXNjYSBwb3IgdG9kb3M8L2E+Jm5ic3A7Jmd0
OyZuYnNwOzxiPlJlc3VsdGFkb3MgZGEgYnVzY2E8L2I+ZAIDDxYEHgRocmVmBQEjHgdvbmNsaWNrBR5qYXZhc2NyaXB0OmFicmVNb2RhbCgnbG9naW4nKTtkAggPFgIfDAUGQmFpcnJvZAIXDxBkEBUJDk9yZGVuYXIgcG9yLi4uCkFudW5jaWFudGUFw4FyZWEERGF0YRFEaXN0cml0byAvIEJhaXJybwdRdWFydG9zB1N1w610ZXMFVmFnYXMFVmFsb3IVCQASb3JkZW0sTm9tZUZhbnRhc2lhCm9yZGVtLEFyZWEPb3JkZW0sRGF0YU9yZGVtDm9yZGVtLGRpc3RyaXRvFG9yZGVtLFF0ZERvcm1pdG9yaW9zD29yZGVtLFF0ZFN1aXRlcw5vcmRlbSxRdGRWYWdhcxBvcmRlbSxQcmVjb09yZGVtFCsDCWdnZ2dnZ2dnZxYBZmQCGA8WAh8JaBYCZg8WAh8JaGQCGQ8WAh8JaGQCGg8WAh8JaBYCZg8WAh8JaGQCGw8WAh8JaGQCHA8WAh8IZWQCHQ9kFgJmDxYEHgNzcmMFKi9pbWFnZW0vaW1vdmVpcy9yZXN1bHRhZG9fZGVzdGFxdWVfdGl0LmdpZh4DYWx0BSlJbSZvYWN1dGU7dmVpcyBOb3ZvcyAtIFByb250b3MgcGFyYSBNb3JhcmQCIg8WAh8JaGQCBg8WBB8IZR8JaGQCBw8WAh8JaBYKAgcPDxYCHghSZWFkT25seWhkZAIJDw8WAh8RaGRkAgsPEGRkFgFmZAINDxYCHgpvbmtleXByZXNzBTBqYXZhc2NyaXB0OnJldHVybiBNYXhMZW5ndGhUZXh0QXJlYSh0aGlzLCA0MDAwKTtkAg8PD2QWAh8OBSJqYXZhc2NyaXB0OnJldHVybiBWYWxpZGFaYXBFcnJvKCk7ZAIIDxYCHwgFNTxTQ1JJUFQgTEFOR1VBR0U9SmF2YVNjcmlwdD5PQVNfQUQoJ1JpZ2h0Jyk7PC9TQ1JJUFQ+ZBgEBR5fX0NvbnRyb2xzUmVxdWlyZVBvc3RCYWNrS2V5X18WCAUXY3RsMDAkQmFycmFfbG9naW4kYnRuT0sFIWN0bDAwJEJhcnJhX2xvZ2luJGJ0bk9LTWluaGFTZW5oYQUfY3RsMDAkQmFycmFfbG9naW4kTG5rTG9naW5Nb2RhbAUeY3RsMDAkQmFycmFfbG9naW4kY2hrVGVybW9zVXNvBSNjdGwwMCRCYXJyYV9sb2dpbiRjaGtSZWNlYmVyT2ZlcnRhcwUdY3RsMDAkQmFycmFfbG9naW4kYnRuQ29uZmlybWEFGmN0bDAwJEJhcnJhX2xvZ2luJEJ0bkxvZ2luBTBjdGwwMCRDb250ZW50UGxhY2VIb2xkZXIxJGNoa0RpdkNhcmFjdGVyaXN0aWNhJDAFK2N0bDAwJENvbnRlbnRQbGFjZUhvbGRlcjEkZ3JkUmVzdWx0YWRvQnVzY2EPPCsACAECAgFkBSFjdGwwMCRDb250ZW50UGxhY2VIb2xkZXIxJGdyZE5vdm8PZ2QFKGN0bDAwJENvbnRlbnRQbGFjZUhvbGRlcjEkZ3JkTGFuY2FtZW50b3MPZ2ShqbHM/VXXWUz5y7/voE3vAAAAAA==" />
                #######################################################

                The problem:
                When I look at the headers with a Firefox plug-in, the __VIEWSTATE has a different value.
                I think this is why I can't access the next page.

                $postfields = "$IDTransacao%3D3%26amp%3BTransacao%3DComprar%2Bum%2Bim%25u00f3vel%26amp%3BIDUF%3D19%26amp%3BUF%3DRIO%2BDE%2BJANEIRO%26amp%3BIDLocalidade%3D63118%26amp%3BLocalidade%3DRIO%2BDE%2BJANEIRO%26amp%3BIDTipo%3D1%26amp%3BTipo%3DApartamento%26amp%3BZonaGrupo%3D9%26amp%3BIDDistrito%3D0%26amp%3BDistrito%3DTodos%26amp%3BTipoBusca%3DSimples%26__EVENTTARGET%3Dctl00%24ContentPlaceHolder1%24grdResultadoBusca%26__EVENTARGUMENT%3DPage%241%26__VIEWSTATE%3D";
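The `%26amp%3B` runs in that string suggest two separate things may be going wrong: the `&amp;` entities copied out of the HTML were never decoded, and the finished string was URL-encoded as a whole, which also encodes the `&` and `=` separators. The sketch below shows the difference on a shortened, illustrative query string; it is a guess at the cause, not a confirmed diagnosis of the site's behaviour.

```php
<?php
// Field values taken from the post above; the empty __VIEWSTATE is a placeholder.
$fields = array(
    '__EVENTTARGET'   => 'ctl00$ContentPlaceHolder1$grdResultadoBusca',
    '__EVENTARGUMENT' => 'Page$2',
    '__VIEWSTATE'     => '',
);

// A query string scraped out of an HTML attribute carries "&amp;" and needs decoding first.
$query = html_entity_decode('IDTransacao=3&amp;IDUF=19');

// Encoding the finished string turns the separators into data:
$wrong = urlencode($query . '&__EVENTARGUMENT=Page$2');

// http_build_query() encodes each name and value once and leaves "&" and "=" alone:
$right = http_build_query($fields, '', '&');

echo $wrong, "\n", $right, "\n";
```

With the second form, the server sees three separate fields instead of one long field name.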

                ###############################################################################

                Thanks for any help.
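On the differing __VIEWSTATE value: as far as I understand ASP.NET, the field is generated for each response, so a value seen in one page load need not match the one in another. A cautious approach is to read every hidden field from the same response you are about to post back to, and change only the two event fields. The sketch below uses a hypothetical helper, `extractHiddenFields()`, and a shortened placeholder in place of the real __VIEWSTATE value.

```php
<?php
// Return name => value for every <input type="hidden"> found in the HTML.
function extractHiddenFields($html)
{
    $fields = array();
    preg_match_all('/<input\b[^>]*>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        if (!preg_match('/\btype="hidden"/i', $tag)) {
            continue; // skip visible inputs
        }
        if (preg_match('/\bname="([^"]*)"/i', $tag, $name)) {
            $value = preg_match('/\bvalue="([^"]*)"/i', $tag, $v) ? $v[1] : '';
            $fields[$name[1]] = html_entity_decode($value);
        }
    }
    return $fields;
}

// Stand-in for the HTML returned by the first request (placeholder __VIEWSTATE).
$page = '<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />'
      . '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUJ" />'
      . '<input type="text" name="q" value="x" />';

$fields = extractHiddenFields($page);
$fields['__EVENTTARGET']   = 'ctl00$ContentPlaceHolder1$grdResultadoBusca';
$fields['__EVENTARGUMENT'] = 'Page$3';
$body = http_build_query($fields, '', '&'); // pass this to CURLOPT_POSTFIELDS
```

If the site also ties the state to a session, the same cookies may need to be sent with both requests; that part is not shown here.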

                  My code:
                  #####################################################################
                  acessaPaginaClassificadosOGLOBO('http://www.zap.com.br/imoveis/rio-de-janeiro/venda/centro/apartamento/rio-de-janeiro-venda-centro-apartamento.html');

                  function acessaPaginaClassificadosOGLOBO($pagina_alvo) {
                      // HTTP_USER_AGENT is not set when the script runs from the command line
                      $user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : 'Mozilla/5.0';

                      # Fetch the target page
                      $sessao_curl = curl_init($pagina_alvo);
                      curl_setopt($sessao_curl, CURLOPT_HEADER, 1);
                      curl_setopt($sessao_curl, CURLOPT_USERAGENT, $user_agent);
                      curl_setopt($sessao_curl, CURLOPT_FOLLOWLOCATION, 1);
                      curl_setopt($sessao_curl, CURLOPT_RETURNTRANSFER, 1);
                      $subject = curl_exec($sessao_curl);
                      curl_close($sessao_curl);

                      # Capture the variables for the page-change POST

                      // Total number of results
                      if (!preg_match('/(?<=encontrados )\d+(?= resultados)/', $subject, $result)) {
                          return;
                      }
                      // Each page holds 30 results, so total pages = results / 30;
                      // ceil() rounds fractions up
                      $num_total_paginas = (int) ceil($result[0] / 30);

                      // Query string of the form's action attribute, with &amp; decoded
                      $match = '%(?<=<form name="aspnetForm" method="post" action="/imoveis/resultado-busca-imoveis\.aspx\?)[^"]*(?=" )%';
                      $query = preg_match($match, $subject, $result) ? html_entity_decode($result[0]) : '';
                      $url = 'http://www.zap.com.br/imoveis/resultado-busca-imoveis.aspx?' . $query;

                      // Find the __VIEWSTATE value
                      $match = '/(?<=<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value=")[^"]*(?=")/';
                      if (!preg_match($match, $subject, $result)) {
                          return;
                      }
                      $VIEWSTATE = $result[0];

                      # Request each page of the search results
                      for ($i = 1; $i <= $num_total_paginas; $i++) {
                          // Build the POST body; http_build_query() URL-encodes each value once
                          $postfields = http_build_query(array(
                              '__EVENTTARGET'   => 'ctl00$ContentPlaceHolder1$grdResultadoBusca',
                              '__EVENTARGUMENT' => 'Page$' . $i,
                              '__VIEWSTATE'     => $VIEWSTATE,
                          ), '', '&');

                          $ch = curl_init($url);
                          curl_setopt($ch, CURLOPT_POST, 1);
                          curl_setopt($ch, CURLOPT_POSTFIELDS, $postfields);
                          // Return the response as a string instead of printing it
                          curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
                          curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
                          curl_setopt($ch, CURLOPT_REFERER, $url);

                          $pagina_alvo_individual = curl_exec($ch);
                          // Errors can only be checked after curl_exec() has run
                          if (curl_errno($ch)) {
                              echo "Erro CURL: " . curl_error($ch);
                          }
                          curl_close($ch);

                          echo $pagina_alvo_individual;
                          #acessaPaginaAlvoClassificadosOGLOBO($pagina_alvo_individual);
                      } # end of for
                  }

                  ###########################

                  Thanks for the help.

                    Could you put your Extremely LONG URLs inside a code tag, and your code inside PHP tags?

                    I don't have a 50" wide monitor.
