Hello all.
I have this script, which scrapes the categories from a website (this is being done legally) and sticks them into an array for me. The array goes something like this:
Key
Cat
Key
Sub-Cat
Key
Sub-Cat
etc...
What I need to do is make this array unique. To make sure that I am getting all of the categories, I run one search per vowel (a, e, i, o, u, y). Most category names contain more than one vowel, so the same category comes back from several searches, hence the duplicates.
I have tried a couple of things to do this myself, but none of them work right. Anyway, on to the good stuff: the script and the output.
<?php
// Run one search per vowel and concatenate the raw HTML of every response.
$data = '';
foreach (array('a', 'e', 'i', 'o', 'u', 'y') as $vowel) {
    $ch = curl_init("http://www.REMOVED.com/home.asp?searchstr=" . $vowel);
    curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $data .= curl_exec($ch);
    curl_close($ch);
}
$data = strip_tags($data, '<a>');
$lines = array();
$lines = explode("\n", $data);
foreach ($lines as $key => $line) {
    // Keep only the lines that start with "Home" (case-insensitive).
    // preg_* with the /i flag replaces eregi()/eregi_replace(), which were removed in PHP 7.
    if (!preg_match('/^Home/i', trim($line))) {
        unset($lines[$key]);
        continue;
    }
    // Drop everything before the first tag, then split the line on ">".
    $lines[$key] = preg_replace('/^[^<]*/', '', $lines[$key]);
    $lines[$key] = explode(">", $lines[$key]);
    foreach ($lines[$key] as $nkey => $nline) {
        // Reduce each piece to either a category ID or a category name.
        $lines[$key][$nkey] = preg_replace('/<a(.*)href=/i', '', $lines[$key][$nkey]);
        $lines[$key][$nkey] = preg_replace('/home\.asp\?cid=/i', '', $lines[$key][$nkey]);
        $lines[$key][$nkey] = preg_replace("/'>/", '', $lines[$key][$nkey]);
        $lines[$key][$nkey] = strip_tags($lines[$key][$nkey]);
        $lines[$key][$nkey] = str_replace("'", "", $lines[$key][$nkey]);
        $lines[$key][$nkey] = trim($lines[$key][$nkey]);
        // Remove pieces that end up empty.
        if (empty($lines[$key][$nkey])) unset($lines[$key][$nkey]);
    }
}
//$lines = array_unique($lines);
/*foreach($lines as $key => $value){
$lines[$key] = array_unique($value);
}*/
echo '<pre>';
print_r($lines);
echo '</pre>';
?>
Here is my output ... well, some of it since there are about 6,000 results returned.
Array
(
[87] => Array
(
[0] => 1
[1] => Dog
[3] => 3
[4] => Arthritis, Joint & Pain
)
[89] => Array
(
[0] => 1
[1] => Dog
[3] => 4
[4] => Ear & Eye Care
)
[91] => Array
(
[0] => 1
[1] => Dog
[3] => 5
[4] => Flea & Tick
)
[93] => Array
(
[0] => 1
[1] => Dog
[3] => 6
[4] => Dental
)
[95] => Array
(
[0] => 1
[1] => Dog
[3] => 8
[4] => Shampoo & Rinses
)
[97] => Array
(
[0] => 1
[1] => Dog
[3] => 9
[4] => Skin and Coat
)
[99] => Array
(
[0] => 1
[1] => Dog
[3] => 10
[4] => Treats & Toys
)
[101] => Array
(
[0] => 11
[1] => Cat
)
[103] => Array
(
[0] => 11
[1] => Cat
[3] => 12
[4] => Antibacterial & Antifungal
)
[105] => Array
(
[0] => 11
[1] => Cat
[3] => 13
[4] => Arthritis, Joint & Pain
)
)
There are a lot more, but I will spare you the pain of scrolling forever and ever.
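To make the goal concrete, here is a minimal sketch of one possible way to de-duplicate rows shaped like the output above. It is only an illustration under assumptions: the helper name unique_rows() is hypothetical and not part of my script, the sample rows are copied from the output with one duplicate added by hand, and it treats two rows as duplicates when they hold the same values in the same order. It compares rows by their serialized form, since array_unique() compares elements as strings and so does not distinguish one nested array from another.

```php
<?php
// Hypothetical helper (not part of the original script): removes duplicate
// rows from an array of arrays by comparing their serialized form.
function unique_rows(array $rows)
{
    $seen = array();   // fingerprints of rows already kept
    $unique = array(); // de-duplicated rows, re-indexed from 0
    foreach ($rows as $row) {
        // Re-index the row so gaps in its keys (0, 1, 3, 4) do not matter.
        $values = array_values($row);
        $fingerprint = serialize($values);
        if (!isset($seen[$fingerprint])) {
            $seen[$fingerprint] = true;
            $unique[] = $values;
        }
    }
    return $unique;
}

// Sample rows shaped like the output above; key 201 is a made-up duplicate of key 87.
$lines = array(
    87  => array(0 => '1', 1 => 'Dog', 3 => '3', 4 => 'Arthritis, Joint & Pain'),
    89  => array(0 => '1', 1 => 'Dog', 3 => '4', 4 => 'Ear & Eye Care'),
    201 => array(0 => '1', 1 => 'Dog', 3 => '3', 4 => 'Arthritis, Joint & Pain'),
    101 => array(0 => '11', 1 => 'Cat'),
);

print_r(unique_rows($lines));
```

With the sample data, this keeps three rows: the two distinct Dog sub-categories and the Cat row. Whether the top-level "Cat" row should survive alongside its sub-category rows is a separate decision that this sketch does not make.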
Any help would be great.
TIA.