Hi. I was looking around for PHP spider/web crawler tutorials and couldn't find anything that worked well for me. I am creating my own search engine (the search part is done) and I need a bot that will visit random websites and let me run MySQL code. Any clues? Thanks.
Get the page, use preg_match to pull out the URLs, loop over them, etc. It shouldn't be hard.
Actually, I have no clue how to loop over domain names; that's what I am trying to find out. I need code that will loop and find every existing domain name, or something like that.
But I need the code that gets the URLs in a loop. I need something that loops over every existing URL.
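The loop being asked about here is usually written as a queue of URLs still to visit plus a set of URLs already seen. Below is a minimal sketch, not a finished crawler. The names crawl, $fetch, $extract and $maxPages are made up for illustration: $fetch is assumed to return a page's HTML (or null on failure) and $extract is assumed to return the links found in that HTML.

```php
<?php
// Breadth-first crawl sketch. All names here are hypothetical.
//   $fetch   : callable(string $url): ?string   returns HTML, or null on failure
//   $extract : callable(string $html): array    returns the links found in the HTML
//   $maxPages: hard cap so the loop always terminates
function crawl(array $seeds, callable $fetch, callable $extract, int $maxPages = 50): array
{
    $queue   = array_values($seeds);          // URLs waiting to be fetched
    $seen    = array_fill_keys($seeds, true); // URLs already queued once
    $visited = [];                            // URLs fetched successfully

    while ($queue && count($visited) < $maxPages) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        if ($html === null) {
            continue; // unreachable page, move on
        }
        $visited[] = $url; // this is the point where a MySQL insert would go

        foreach ($extract($html) as $link) {
            // This sketch follows absolute http(s) links only.
            if (preg_match('#^https?://#i', $link) && !isset($seen[$link])) {
                $seen[$link] = true;
                $queue[]     = $link;
            }
        }
    }
    return $visited;
}
```

For a real run, $fetch could wrap file_get_contents() (which returns false on failure and needs allow_url_fopen for URLs) or cURL. A crawler like this only discovers sites that are linked from the pages it starts with, so the choice of seed URLs matters.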
I would use DOMXPath myself; it can also be done with regular expressions.
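For anyone who wants to see the DOMXPath route, here is a rough sketch. It assumes the PHP DOM extension is available; the function name links_with_dom is made up for this example. libxml_use_internal_errors(true) keeps the parser's warnings about sloppy markup from being printed.

```php
<?php
// Sketch: collect href values from <a> tags with DOMDocument + DOMXPath.
// links_with_dom is a hypothetical name, not a built-in function.
function links_with_dom(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    $xpath = new DOMXPath($doc);
    // Every <a> element that has an href attribute, in document order.
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}
```

The values come back exactly as written in the page, so relative links are still relative at this point.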
This is a useful tutorial:
http://www.merchantos.com/makebeta/php/scraping-links-with-php/
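Be aware that there is some very awful, ill-formed HTML out there, and the DOM functions handle it poorly: DOMDocument::loadXML() refuses ill-formed documents outright, and loadHTML() emits a warning for every problem it has to work around.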
It is much better to use regex to do this.
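A rough sketch of the regex approach, for comparison. The function name links_with_regex is made up, and the pattern is deliberately loose: it accepts double-quoted, single-quoted and unquoted href values, but it is not an HTML parser and can be fooled by things like commented-out markup.

```php
<?php
// Sketch: pull href values out of <a> tags with one regular expression.
// links_with_regex is a hypothetical name, not a built-in function.
function links_with_regex(string $html): array
{
    // (?| ... ) is a branch-reset group, so whichever alternative matches
    // ("...", '...' or an unquoted value), the value lands in group 1.
    $pattern = '/<a\b[^>]*?\bhref\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i';
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}
```

Because it never builds a document tree, this keeps working on markup with unclosed tags, at the cost of occasionally matching something that is not really a link.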
Imperialoutpost wrote: This is a useful tutorial: http://www.merchantos.com/makebeta/php/scraping-links-with-php/
I had a look at that yesterday. It doesn't really fit what I need. I am looking for something in PHP that loops and finds as many existing domains as possible.
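There is no list of every domain that a script can simply loop over, so in practice the domains get collected from the URLs a crawler comes across. A minimal sketch of that step, using PHP's parse_url(); the function name unique_hosts is made up for this example.

```php
<?php
// Sketch: reduce a list of discovered URLs to the distinct host names in it.
// unique_hosts is a hypothetical name, not a built-in function.
function unique_hosts(array $urls): array
{
    $hosts = [];
    foreach ($urls as $url) {
        // parse_url() gives null for URLs without a host (e.g. relative ones)
        // and false for URLs it cannot parse at all.
        $host = parse_url($url, PHP_URL_HOST);
        if (is_string($host) && $host !== '') {
            $hosts[strtolower($host)] = true; // host names are case-insensitive
        }
    }
    return array_keys($hosts);
}
```

Each new host found this way can be fed back into the crawl as another starting point, which is how the list of known domains grows over time.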
Also, remember that most URLs in HTML are relative, not absolute.
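PHP has no built-in function for turning a relative link into an absolute one, so a crawler needs a small helper for it. The sketch below is a simplified take on that, assuming the base URL is absolute with a scheme and a host. The name resolve_url is made up, and this is not a full RFC 3986 implementation (it drops user info and collapses duplicate slashes, for instance).

```php
<?php
// Sketch: resolve a link found on a page against that page's own URL.
// resolve_url is a hypothetical name; $base is assumed to be absolute.
function resolve_url(string $base, string $rel): string
{
    if ($rel === '') {
        return $base;
    }
    // Anything with a scheme (http:, https:, mailto:, ...) is already absolute.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $rel)) {
        return $rel;
    }

    $b        = parse_url($base);
    $root     = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    if (strncmp($rel, '//', 2) === 0) {          // protocol-relative: //host/path
        return $b['scheme'] . ':' . $rel;
    }
    if ($rel[0] === '?') {                       // query only
        return $root . $basePath . $rel;
    }
    if ($rel[0] === '#') {                       // fragment only
        return $root . $basePath . (isset($b['query']) ? '?' . $b['query'] : '') . $rel;
    }

    // Split the path off from any trailing ?query or #fragment.
    $cut    = strcspn($rel, '?#');
    $path   = substr($rel, 0, $cut);
    $suffix = substr($rel, $cut);
    if ($path[0] !== '/') {
        // Relative path: append it to the directory part of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $path;
    }

    // Remove "." segments and apply ".." segments.
    $segments = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            array_pop($segments);
        } elseif ($seg !== '.' && $seg !== '') {
            $segments[] = $seg;
        }
    }
    $trailing = ($segments && preg_match('#/(\.\.?)?$#', $path)) ? '/' : '';
    return $root . '/' . implode('/', $segments) . $trailing . $suffix;
}
```

For example, with the base http://example.com/dir/page.html, the link ../up.html resolves to http://example.com/up.html. Non-web schemes such as mailto: pass through unchanged, so a crawler would still want to filter those out before fetching.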