Hi all,
I tried to write a program that retrieves information from airline websites.
One of them (easyJet) doesn't work, and I don't understand why!
Can anyone help me?
I would like to learn how to perform screen scraping on any website. Are there rules?
If anyone would like to try, the parameters to post are:
http://www.easyjet.com/it/Prenota/step1.asp?
__step=1&
__action=goto&
__goto=step2.asp&
txtorigID=&
txtdestID=&
txtdorig=&
txtddest=&
numOfPax=1&
__STATEDATA=
MIIBpAYJKwYBBAGCN1gDoIIBlTCCAZEGCisGAQQBgjdYAwGgggGBMIIBfQIDAgAA|AgJmAwICAMAECE+j/dEt3wAjBBCXEN1rLO6jks7hwkXwd8kMBIIBUD6Y0Xj8NNMY|0S44VAm2kbDR95yFdrxc0wM36bPfkStMTXLU0PuXN6zXXDDFqyuSIREMSlWvievf|k/XQyh9BYVeZRkYKWCJfx8iibLUQsO6ms+b5OppBRNjHp/y1PQyC090OI7YybPaW|YQzeZP5UvcU6+JpOVk4nM0jQx0k+C4UCLS1Y3tPYWSgU4bphalDBa3nUXoalNs8W|WNOuczmCua/ffJuXYDjz1DfEdeJLcM9AwYAVfVdcILApAWmHcQncFGjh/L0B91ET|+o/BNF+eZXf8mbWocGq0x4B76fYkjRFvnKS8fkNlyGoUTam849v+ZK9Ez7qG5Tcz|/GyQPi+ciL/qTzMpLiQaTIbjma4qVDeTPy6k0FGdk/01bqfx0MjJObWdujtQ9iIZ|lyAC+2VOL/E7X8ByOGxInlBkcz4X8xpzzSy08ty4duNYMdewSePX9Q==|
orig=ATH&
dest=LTN&
oDay=12&
oMonYear=022005&
rDay=00&
rMonYear=00&
numOfAdults=1&
numOfKids=0&
numOfInfants=0

Thanks
P.S. Sorry for my English.

    Did you fake a User-Agent in your HTTP request header?

      I tried, but it didn't solve the problem. Are there steps to follow for screen scraping?
      Please, I need help now...

        Gotta take one step at a time ("divide and conquer").

        First: Make a copy of your link. Then close down all browsers. Open up your favorite browser and paste the link in and what happens? Do you get the page you were looking for? If so, great! Your coding job will be a bit easier. Your page didn't come up? Uh oh. Is it requesting a login? Or does it say the session timed out?

        If it's a login, you need to fake the login BEFORE having your code go to that page.

        If the session timed out, then you need to have your code go to the search page and submit it as if it was a user. Difficult, but definitely not impossible. You just have to sit down and look very closely at the form they are using and find all the hidden fields and figure out what data you have to put in.

        You didn't post any code, but if you're not using it, I'd HIGHLY recommend checking out curl for this kind of job. It'll take away the burden of figuring out how to handle cookies (aka sessions) and how to send POST data. Check out the curl pages for documentation. There's a solid handful of great examples on php.net.
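
The "find all the hidden fields" step above can be sketched roughly as follows. This is only an illustration, assuming PHP's DOM extension is available; the function name `extractHiddenFields` and the sample form markup are made up for the example and are not taken from the easyJet page.

```php
<?php
// Collect the name/value pairs of every hidden <input> in an HTML page.
// extractHiddenFields() is a hypothetical helper name used for illustration.
function extractHiddenFields(string $html): array
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        if (strtolower($input->getAttribute('type')) !== 'hidden') {
            continue; // visible fields are filled in by the user, skip them
        }
        $name = $input->getAttribute('name');
        if ($name !== '') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Made-up sample form, only loosely modelled on the fields in this thread.
$sample = '<form action="step1.asp" method="post">'
    . '<input type="hidden" name="__step" value="1">'
    . '<input type="hidden" name="__STATEDATA" value="abc|def">'
    . '<input type="text" name="orig" value="ATH">'
    . '</form>';

print_r(extractHiddenFields($sample));
```

The idea is to fetch the search page first, run its HTML through a helper like this, and merge the result with the values a user would type in before posting.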

          I tried with other airlines (Ryanair, Virgin, Hapag-Lloyd, etc.) and my script works very well.
          But easyJet (www.easyjet.com) returns this header:
          HTTP/1.1 100 Continue
          Server: Microsoft-IIS/5.0
          Date: Mon, 31 Jan 2005 20:53:41 GMT

          HTTP/1.1 200 OK
          Server: Microsoft-IIS/5.0
          Date: Mon, 31 Jan 2005 20:53:41 GMT
          Connection: close
          Content-Length: 0
          Content-Type: text/html; Charset=UTF-8
          Cache-control: private
          ...and where is the web page?

          P.S. I don't normally use curl, but I tried it and it doesn't work either.

          Please help me! This job is really weighing on me!

          I'M SO STRESSED OUT!

          Thanks.

            From the header, it looks like the server answered but sent back an empty page (Content-Length: 0), so you probably failed to establish a session. How are you treating the cookies? Are you sending the cookies back to their site?

              No, I don't send cookies in my HTTP request, and perhaps this is the problem...
              How should I use cookies?

              Can you show me a script that uses cookies?

              Thanks

                Originally posted by Benedetto
                Are there rules?

                There are these ones; have you asked Easyjet for assistance? Maybe they have restrictions in place to prevent unauthorised copying of their material, and will be able to tell you what you are missing.

                  I read these rules, and I think that this airline checks the "referrer" parameter.

                  I see that on my hard disk there are 2 cookies that contain:

                  and another similar one...
                  So what can I do with these?

                    If you were using curl, it would be 2 lines of code (use curl_setopt to set up a cookie jar).

                    If you don't want to use curl, you'll have to figure out how to assemble the headers to include the cookie data. I don't have code to do that.
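
For the no-curl route mentioned above, assembling the cookie data might look something like this sketch. It assumes you already have the response header lines as an array of strings; `buildCookieHeader` is a hypothetical helper name and the cookie names and values are invented. It deliberately ignores cookie attributes such as path, domain and expiry, so treat it as a starting point only.

```php
<?php
// Turn the Set-Cookie lines of a response into one "Cookie:" request header.
// buildCookieHeader() is a hypothetical helper; attributes after the first
// ";" (path, domain, expires, ...) are dropped, which a real client would
// have to honour.
function buildCookieHeader(array $headerLines): string
{
    $cookies = [];
    foreach ($headerLines as $line) {
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue; // not a cookie header
        }
        $value = trim(substr($line, strlen('Set-Cookie:')));
        $pair  = trim(explode(';', $value, 2)[0]); // keep only "name=value"
        $parts = explode('=', $pair, 2);
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            // A later cookie with the same name replaces the earlier one.
            $cookies[trim($parts[0])] = $parts[1];
        }
    }
    if ($cookies === []) {
        return '';
    }
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

// Invented example headers, shaped like an IIS/ASP response.
$headers = [
    'HTTP/1.1 200 OK',
    'Set-Cookie: ASPSESSIONIDEXAMPLE=XYZ123; path=/',
    'Set-Cookie: lang=it; expires=Tue, 01 Feb 2005 20:53:41 GMT; path=/',
    'Content-Type: text/html',
];

echo buildCookieHeader($headers), "\n";
```

The resulting line would then be sent along with the other request headers on the next request to the same site.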

                      If I use the curl functions, will the script be 2 lines?
                      Do you have a script that does that?
                      Thanks

                        No. The cookie handling would only take 2 lines. The rest of the curl commands could take an additional 3 - 6 more lines depending on how you wish to configure your connection.

                        For examples, check out the curl link I provided. Read through the curl_setopt function and its users' comments. The comments have a lot of great little code snippets.

                          I tried with this script:
                          $url = "http://www.easyjet.com/it/Prenota/step1.asp/";

                          $user_agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
                          $post = "__step=1&__action=goto__goto=step2.asp&txtorigID=&txtdestID=&txtdorig=&txtddest=&numOfPax=1&__STATEDATA=".urlencode ("MIIBpAYJKwYBBAGCN1gDoIIBlTCCAZEGCisGAQQBgjdYAwGgggGBMIIBfQIDAgAA|AgJmAwICAMAECP3CkQb2IXGHBBBKfDW+eavLpBAz99Jt4S23BIIBUAX79dX48dsC|zWQMlEQqSfo28S3Jopi16S+oTP+l+7CD6BsPW8aUchjnfVhDOeRK2W32l8NLAPU+|oePxEqvLnxG/0a1wB4coUFqCjlV3hwDSxx2ptQm2M5MFD2Bwt/F+XqeqD5TKjWfN|hqWMO5HiwQ4SGaeJ2DzIjVYaJpzqx7QgJtQMiXfnSk0Wvjj7G8M0e3CkphRcG+eA|32uKn1INQDoIHJRrBAfZQ1F9tHJYJyV/rGUyVvmzI/VspMIjsrSdX7uSKFG0x9q/|kOYFIrQc2+Jh6Umt/Rh8dKVEk2ONkgtkWvcdzCSFyS5VjrR/FUv+nOevfF5PcuVV|K4PkKZW8QcD/zNc7OjnUMEivpLQ2zBWiOBVbX1jlCEADmSrmHRQPO+OVE3Ttk86o|fIvaITuDnbEyRYoEHUD7NSuhmn1Lh9u2U3l2/fn/B/e5ZJTWeTPl3g==|")."&orig=ATH&dest=LTN&oDay=12&oMonYear=022005&rDay=00&rMonYear=00&numOfAdults=1&numOfKids=0&numOfInfants=0&strEmail=&strPassword=&strBookingRef=";
                          $ch = curl_init();
                          
                          curl_setopt($ch, CURLOPT_HEADER, 1);
                          curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1); 
                          curl_setopt($ch, CURLOPT_URL, $url );      
                          
                          curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
                          curl_setopt($ch, CURLOPT_POST, 1 );
                          curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
                          curl_setopt($ch, CURLOPT_COOKIE, "c:\cookie.txt");
                          curl_setopt($ch, CURLOPT_COOKIEJAR, "c:\cookie.txt");
                          curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
                          
                          $response = curl_exec( $ch ); 
                          curl_close ($ch); 

                          but the result is the same...
                          Can anyone try to write some code?

                            Oops - you missed the instructions on php.net on how this line works (although I've made the same mistake before too): curl_setopt($ch, CURLOPT_POSTFIELDS, $post);

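                            Here, $post can be an array. You've set $post up as a single hand-built string, and that string is missing the & between __action=goto and __goto=step2.asp.

                            Just do:

                            $post['__step'] = 1;
                            $post['__action'] = 'goto';
                            $post['__goto'] = 'step2.asp';
                            ...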

                            Continue on for all the post variables you wish to send.

                            Otherwise, if you wish to send them on the URL, just keep the string and attach it to the URL.
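
As a rough sketch of the advice above: keeping the fields in an array avoids typos in a hand-built string, and `http_build_query` can turn that array into a correctly encoded body. Note that when CURLOPT_POSTFIELDS is given an array, curl sends the request as multipart/form-data, while an encoded string is sent as application/x-www-form-urlencoded, which is what a browser form normally sends. The helper names `buildPostBody` and `postForm` are hypothetical, only a few of the thread's fields are shown, and the `__STATEDATA` value would presumably have to be read fresh from the form because it appears to change between requests. The sketch also uses CURLOPT_COOKIEFILE to read the cookie file, since CURLOPT_COOKIE expects a literal cookie string rather than a file name.

```php
<?php
// Encode an array of form fields as an application/x-www-form-urlencoded body.
function buildPostBody(array $fields): string
{
    return http_build_query($fields, '', '&');
}

// POST the fields and return the response body. Hypothetical helper:
// $cookieFile is any writable path; cookies are read from it at the start
// and written back to it when the handle is closed.
function postForm(string $url, array $fields, string $cookieFile): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // read cookies
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // save cookies
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, buildPostBody($fields));

    $response = curl_exec($ch);
    if ($response === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException('Request failed: ' . $error);
    }
    curl_close($ch);
    return $response;
}

// A subset of the fields from this thread, for illustration only.
$fields = [
    '__step'   => 1,
    '__action' => 'goto',
    '__goto'   => 'step2.asp',
    'orig'     => 'ATH',
    'dest'     => 'LTN',
];

// Show the encoded body; postForm() is not called here because it needs
// network access and a live session.
echo buildPostBody($fields), "\n";
```

A typical flow would be to request the search page first with the same cookie file, so the session cookie exists before the form is posted.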

                              Thanks AstroTeg!!
                              It seems to work, but it's not perfect yet...
                              I'm sure that I can finish it!

                              AstroTeg, you're fantastic!

                                3 years later

                                Please, Benedetto, could you post the code?

                                  8 days later

                                  Could someone help me with this library?
                                  I need to write the same code as Benedetto.
