[RESOLVED] Screen Scrap with Curl - Need some Help

nutt318 · Dec 19, 2007

Just want to say hi first of all; this is my first post and I need some help. I've searched Google, php.net, and many tutorials, and I can't seem to figure this out. I originally got help with this code from someone I can no longer get hold of, so I'm stuck.

Anyway, I have the code below, which worked perfectly about 3-4 weeks ago and now for some reason is not working. The code takes stock prices from the website listed in the code and puts those values into my website.

I have not changed anything in my code, so I am guessing they may have changed something on their end. I would like to know what may be wrong with this code, and whether I am searching for the right fields or variables.

Please let me know if you have any ideas. Thanks.

<?php
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "http://www.bloomberg.com/markets/commodities/energyprices.html");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, 0);

$contents = curl_exec($ch);
curl_close($ch);

function find_values($string, $page)
{
    $string = preg_quote($string, '#');

    // Take everything from the given string to the end of the row
    // (U = ungreedy, s = dot also matches newlines)
    if (!preg_match("#$string(.*)</tr>#Us", $page, $match)) {
        // Label or closing </tr> not found: return nothing instead of reading $match[1]
        return array();
    }

    // Get the values from the row we found previously
    preg_match_all("#<span[^>]*>([^<]*)</span>#s", $match[1], $values);

    // Return the values
    return $values[1];
}

$find1 = find_values('nymex crude future', $contents);
echo "Nymex Crude Future: Price = $find1[0], Change = $find1[1], % Change = $find1[2], Time = $find1[3]<br>";

$find2 = find_values('Dated Brent Spot', $contents);
echo "Nymex Heating Oil Future: Price = $find2[0], Change = $find2[1], % Change = $find2[2], Time = $find2[3]<br>";

$find3 = find_values('WTI Cushing Spot', $contents);
echo "Nymex RBOB Gasoline Future: Price = $find3[0], Change = $find3[1], % Change = $find3[2], Time = $find3[3]<br>";


?>

laserlight · Dec 19, 2007

nutt318 wrote:
I have not changed anything with my code so I am guessing they may have changed something on their end.

Start looking at their clientside source then.

nutt318 · Dec 19, 2007

laserlight wrote:
Start looking at their clientside source then.

I have looked at their site's source code and it seems to match what I am searching for. Here is their source; I think I'm going crazy or blind, because I can't find the mismatch:

<td><span class="tbl_txt">Nymex Crude Future</span></td><td align="right"><span class="tbl_num">91.13</span></td><td align="right"><span class="tbl_txt_green">1.05</span></td><td align="right"><span class="tbl_txt_green">1.17</span></td><td align="right"><span class="tbl_num">11:08</span></td>

bradgrafelman · Dec 20, 2007

Well the first regex grabs data from the string up until a '</tr>', and since the latter of the two doesn't appear in your HTML snippet, I'd say their site source code does not seem to be fine.
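
To illustrate the point, here is a minimal sketch that runs the first pattern from the originally posted find_values() against a shortened copy of the row posted above. The $snippet string and the variable names are made up for this illustration; nothing here is fetched from the live site.

```php
<?php
// Row fragment shortened from the HTML posted above; note there is no closing </tr>.
$snippet = '<td><span class="tbl_txt">Nymex Crude Future</span></td>'
         . '<td align="right"><span class="tbl_num">91.13</span></td>';

// Same "#...(.*)</tr>#Us" pattern as in the original post.
$label = preg_quote('Nymex Crude Future', '#');

// Without a closing </tr> the pattern has nothing to stop at, so it fails.
$withoutRowEnd = preg_match("#$label(.*)</tr>#Us", $snippet);

// Once the row is closed, the very same pattern matches.
$withRowEnd = preg_match("#$label(.*)</tr>#Us", $snippet . '</tr>');

// preg_match() is also case-sensitive unless the "i" modifier is given,
// which matters because the script searches for 'nymex crude future'.
$lower         = preg_quote('nymex crude future', '#');
$caseSensitive = preg_match("#$lower(.*)</tr>#Us", $snippet . '</tr>');
$caseIgnored   = preg_match("#$lower(.*)</tr>#Uis", $snippet . '</tr>');

var_dump($withoutRowEnd, $withRowEnd, $caseSensitive, $caseIgnored);
```

So two things are worth checking in the page the script actually receives: whether the row still ends in '</tr>', and whether the label's capitalisation matches the search string.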

Weedpacket · Dec 20, 2007

Of course, as soon as the site's publishers change the layout of the site you're going to be starting all over again.

Oh, and I think I should point out that you're in violation of Bloomberg's Terms of Service.

nutt318 · Dec 20, 2007

bradgrafelman, do you have any suggestions on what part of my code is incorrect? Could you give me an example of what you see that's wrong?

Thanks.

bradgrafelman · Dec 20, 2007

Well basically what I see is wrong is what Weedpacket mentioned; you're attempting to scrape data from HTML source that can change and break your code without warning. If you really want to include the data from a remote site, you should look into contacting that site and see if they will provide you with some sort of XML feed.

Other than that, your regexp pattern is looking for HTML that isn't there. Read up on regexps (I recommend Regular-Expressions.info) and adjust the pattern to match the current HTML of the remote site.
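
As a rough sketch of what "adjusting the pattern" could look like: the function below is a hypothetical variant of find_values() (the name find_values_loose and its fallback rules are assumptions, not anything from the remote site). It ignores case and no longer insists on a closing '</tr>'. The test row is the snippet posted earlier in this thread.

```php
<?php
// Hypothetical variant of find_values(); the name and fallbacks are assumptions.
function find_values_loose(string $label, string $page): array
{
    $label = preg_quote($label, '#');

    // Capture from the label to the end of the row. If the row is never
    // closed, stop at the next "<tr" or at the end of the input (\z).
    // "i" ignores case, "s" lets "." match newlines.
    if (!preg_match("#$label(.*?)(?:</tr>|<tr\\b|\\z)#is", $page, $match)) {
        return []; // label not found
    }

    // Pull the text of every <span> in the captured part of the row.
    preg_match_all('#<span[^>]*>([^<]*)</span>#is', $match[1], $values);
    return array_map('trim', $values[1]);
}

// The row posted earlier in the thread (no closing </tr>).
$row = '<td><span class="tbl_txt">Nymex Crude Future</span></td>'
     . '<td align="right"><span class="tbl_num">91.13</span></td>'
     . '<td align="right"><span class="tbl_txt_green">1.05</span></td>'
     . '<td align="right"><span class="tbl_txt_green">1.17</span></td>'
     . '<td align="right"><span class="tbl_num">11:08</span></td>';

echo implode(', ', find_values_loose('nymex crude future', $row)), "\n";
// prints: 91.13, 1.05, 1.17, 11:08
```

This is still a regex against someone else's markup, so it will break again the next time the layout changes; it only loosens the two assumptions that are visible in this thread.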

nutt318 · Dec 20, 2007

Looks like I needed to add a few lines of code before the curl_exec() call.

curl_setopt($ch, CURLOPT_FILETIME, true); // My habit
curl_setopt($ch, CURLOPT_REFERER, "");  // I belong here
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)");  // I'm not scraping, I am a browser
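
For anyone finding this later, here is a minimal sketch of the same request with the options in one place and some error checking, so that a refused or failed request shows up as an error instead of as empty values further down. The function name fetch_page, the 10-second timeout, and the choice to treat anything other than HTTP 200 as a failure are my own assumptions, not part of the original script.

```php
<?php
// Hypothetical helper; name, timeout and status handling are assumptions.
function fetch_page(string $url, string $userAgent): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_HEADER         => false,  // keep response headers out of the body
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_TIMEOUT        => 10,     // seconds
    ]);

    $body   = curl_exec($ch);
    $error  = curl_error($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false) {
        throw new RuntimeException("cURL request failed: $error");
    }
    if ($status !== 200) {
        throw new RuntimeException("Unexpected HTTP status $status");
    }
    return $body;
}
```

Usage would be something like $contents = fetch_page($url, $agent); wrapped in a try/catch, with the result then passed to find_values() as before.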

MarkR · Dec 21, 2007

Once you have your agreement for a machine-readable feed in place with bloomberg.com, they will provide a feed in a documented, convenient format that you can develop code with and will remain stable.

Until then you're at the mercy of their team:
1. Blocking your bot for service abuse
2. Sending junk data to your bot
3. Sending the lawyers around to sue your ass.

Mark
