Hey everyone,
I was wondering if it'd be possible to generate a variable containing every word on a web page (given by URL, e.g. http://roblox.com/).
If so, how?
Thanks.
The easiest way I can think of is to load the page with file_get_contents and then use strip_tags to remove the markup.
You could then use a regex to remove punctuation, and explode the string into an array of individual words.
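For what it's worth, here is a minimal sketch of those steps. The function name words_from_html is made up for illustration, and it uses preg_split on whitespace in place of explode so that runs of spaces don't produce empty entries. It also removes script and style blocks up front, since strip_tags keeps the text inside them.

```php
<?php
// Hypothetical helper (not from any library): turn an HTML string into a list of words.
function words_from_html(string $html): array
{
    // Drop <script> and <style> blocks first; strip_tags() would keep their contents.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Put a space before every tag so words in adjacent elements do not fuse together.
    $html = str_replace('<', ' <', $html);
    // Remove the tags, then turn entities such as &amp; back into characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Replace every run of characters that are not letters, digits, or apostrophes with a space.
    $text = preg_replace('/[^\p{L}\p{N}\']+/u', ' ', $text);
    // Split on whitespace, discarding empty pieces.
    return preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}
```

Assuming file_get_contents can reach the site, usage would be something like $words = words_from_html(file_get_contents('http://roblox.com/')); and $words is then the variable holding every word.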
Imperialoutpost;10958607 wrote:The easiest way I can think of is to load the page with file_get_contents and then use strip_tags to remove the markup.
You could then use a regex to remove punctuation, and explode the string into an array of individual words.
file_get_contents doesn't work for me on other websites. fopen and fread do.
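In case it helps, this is roughly what the fopen and fread route looks like. The helper name read_all and the chunk size are just illustrative choices. Note that for URLs both fopen and file_get_contents rely on the allow_url_fopen setting, so this sketch assumes that setting is on.

```php
<?php
// Hypothetical helper: read an entire file or URL in chunks with fopen()/fread().
function read_all(string $location, int $chunkSize = 8192): string
{
    $handle = fopen($location, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Could not open $location");
    }
    $data = '';
    // fread() returns at most $chunkSize bytes per call, so loop until end of stream.
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            break; // read error: stop and return what we have so far
        }
        $data .= $chunk;
    }
    fclose($handle);
    return $data;
}
```

The result of read_all('http://roblox.com/') can then be passed to strip_tags in the same way as the output of file_get_contents.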
You might try using curl. Or even shell_exec('wget http://remotesite.com');
cURL gives you a lot more control.
This is a basic, chopped-down version of what I use for similar tasks:
<?php
$ch = curl_init(); // Create a new cURL handle
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)'; // Identify ourselves as Googlebot
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL, 'http://www.roblox.com/Default.aspx');
curl_setopt($ch, CURLOPT_FAILONERROR, true);    // Treat HTTP status codes >= 400 as failures
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // Follow redirects
curl_setopt($ch, CURLOPT_AUTOREFERER, true);    // Set the Referer header when following redirects
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // Return the response instead of printing it
curl_setopt($ch, CURLOPT_TIMEOUT, 45);          // Give up after 45 seconds
$html = curl_exec($ch); // Pull the page HTML into the string $html
if ($html === false) { // curl_exec returns false on failure; an empty page is not an error
    echo 'cURL error number: ' . curl_errno($ch) . '<br/>';
    echo 'cURL error: ' . curl_error($ch) . '<br/>';
    die();
}
echo strip_tags($html); // Print the page text with the HTML tags removed
?>
I've noticed that if the user agent isn't set to Googlebot, the site returns an error.
I've found that it's helpful to forward the user agent of whoever is visiting your page.
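Something along these lines is one way to do that forwarding. The function name visitor_user_agent and the fallback string are invented for this sketch. It assumes the script runs under a web server so that $_SERVER['HTTP_USER_AGENT'] is populated. Since that value comes from the visitor, control characters are removed before it goes into an outgoing header.

```php
<?php
// Hypothetical helper: pick the User-Agent string to send with the outgoing request.
function visitor_user_agent(
    array $server,
    string $fallback = 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)' // made-up fallback value
): string {
    $ua = (string) ($server['HTTP_USER_AGENT'] ?? '');
    // Replace control characters (including CR and LF) so the value stays on one header line.
    $ua = trim(preg_replace('/[\x00-\x1F\x7F]+/', ' ', $ua));
    // Use the fallback when the visitor sent no User-Agent at all.
    return $ua !== '' ? $ua : $fallback;
}
```

With the cURL handle from the script above, that would be used as curl_setopt($ch, CURLOPT_USERAGENT, visitor_user_agent($_SERVER));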