I'm looking for some information on where to find an open source web crawler that I can modify to fit my needs.

-Be able to crawl a site's HTML source and look for a particular piece of code
-If that code is found, it then needs to search the same page for a second piece of code.
-If the first piece of code is present and the second one is absent, it should pull the URL and append it to a txt file; if the second piece of code is also found, it just continues to the next site. So basically: if it contains this and doesn't contain that, append the URL to the text file.

I need it to be able to crawl a list of URLs, though I'd prefer it to just move automatically from site to site without a predefined list. Either way is fine.
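Not a pointer to an existing crawler, but for reference, the filtering logic described above is only a few lines of Python using just the standard library. This is a minimal sketch, assuming a plain text file of URLs (one per line); `FIRST_SNIPPET` and `SECOND_SNIPPET` are hypothetical placeholders for whatever HTML fragments you are actually searching for:

```python
import urllib.request

FIRST_SNIPPET = '<div class="example">'   # hypothetical: code that must be present
SECOND_SNIPPET = '<!-- tracking -->'      # hypothetical: code that must be absent

def matches(html):
    """True when the first snippet is present and the second is absent."""
    return FIRST_SNIPPET in html and SECOND_SNIPPET not in html

def crawl(url_list_path, out_path):
    """Fetch each URL listed in url_list_path; append matching URLs to out_path."""
    with open(url_list_path) as urls, open(out_path, "a") as out:
        for line in urls:
            url = line.strip()
            if not url:
                continue
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue  # skip pages that fail to load or time out
            if matches(html):
                out.write(url + "\n")
```

Going from site to site automatically ("without a list of URLs") is the harder part, since you'd have to extract links from each page and manage a frontier of pages to visit; that is exactly what frameworks like Scrapy or Heritrix handle, so a customizable crawler plus a filter like the one above may be the simplest route.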

Does anyone know of an open source web crawler similar to what I need that can be customized? Thanks a lot.

    What do you mean by "source code"? All a web crawler can see is the HTTP output for any given URL, which typically is going to be the [X]HTML text. If that is what you mean by "source code", then OK, but if you are talking about something like the underlying PHP/Perl/JSP/etc. source code, then no, that will not be accessible to a web crawler (unless there's something terribly wrong with the site being crawled).

By source code I mean the HTML source code. I didn't know that this was impossible.

        Tr1cky;10883329 wrote:

        By source code I mean the HTML source code. I didn't know that this was impossible.

        I think you misread my response. Reading the HTML is no problem; I just wasn't sure what you meant by "source code".

        I'm sure if you search the usual suspects such as sourceforge.net, hotscripts.com, etc., you will find a number of candidates.
