I wonder where search engines get the list of URLs to crawl. Do they use a local database, or do they have the ability to crawl the whole web by themselves? Is the source code of a search engine available to the public?

Thanks.
A.

    alexks wrote:

    I wonder where search engines get the list of URLs to crawl. Do they use a local database, or do they have the ability to crawl the whole web by themselves? Is the source code of a search engine available to the public?

    Thanks.
    A.

    The exact workings of search engines are a closely guarded trade secret. As I understand it, once a search engine finds a web page, it will crawl all the pages that are linked from that page.

    You can submit URLs to any search engine if you are building a new site. You can also specify which files are not to be crawled by creating a file called robots.txt at the root of your domain (http://yourdomain.com/robots.txt).
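
    For example, a minimal robots.txt that lets every crawler in but keeps it out of a hypothetical /private/ directory looks like this:

    User-agent: *
    Disallow: /private/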

    Google also provides some tools for webmasters so you can get your site thoroughly crawled and indexed.

      Actually, I want to build a small search engine. The question is: how do search engines find the web sites to search when someone enters a keyword?

        You can ask until your vocal cords (or in this case, fingers) give out. The problem is that you're asking for trade secrets of the search engine business. Exactly how they find websites and how they crawl them is never going to be released.

        What you could do is start with the ARIN database or some other domain registrar database. But your questions are unfortunately going to lead you to a dead end.

          That's a loaded question. Basically, search engines take a list of URLs, download the content of each page, and analyze it to build a 'distilled' index which describes what the page contains. Analysis involves a lot of things:
          1) Ignore content that is layout-based, like CSS, JavaScript, etc.
          2) Scan the page content for links (like <a> tags) so you can follow them too - I think they call this 'spidering'. (There's a rough sketch of this after the list.)
          3) Build an index of the page's content which provides fast and effective search results. There are many ways to do this. Google's engine apparently involves over 200 data points, many of which don't even come from the page itself.
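
          As a rough sketch of point 2, PHP's DOMDocument can pull the link targets out of a fetched page. The URL below is just a placeholder; a real spider would do this for every page it downloads:

          <?php
          // Pull the href values out of a page's <a> tags so the spider
          // knows which URLs to follow next ('spidering').
          $html = @file_get_contents('http://example.com/');   // placeholder URL
          if ($html === false || $html === '') {
              exit("Could not fetch the page\n");
          }

          $doc = new DOMDocument();
          @$doc->loadHTML($html);   // @ hides warnings caused by sloppy real-world HTML

          $links = array();
          foreach ($doc->getElementsByTagName('a') as $anchor) {
              $href = $anchor->getAttribute('href');
              if ($href !== '') {
                  $links[] = $href;   // these are the candidate URLs to crawl next
              }
          }

          print_r($links);
          ?>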

          It's a really elaborate process. For starters, read this:
          http://www.mathworks.com/company/newsletters/news_notes/clevescorner/oct02_cleve.html

          If you're looking for a simpler approach, I would recommend a process something like this:
          1) Fetch a list of URLs - they can be user-submitted, extracted from a page, or ones you just provide yourself.
          2) Check a given URL against the robots.txt at that domain, according to the robot exclusion standard I linked above. I.e., if the URL is http://somedomain.com/a/path/to/a/file.html then you'd look for a robots file at http://somedomain.com/robots.txt. If the robots file says not to look at it, don't.
          3) Capture the contents of the page.
          4) Use pattern matching to parse the page so you can separate the useful content from the useless JavaScript and HTML. This is going to be a pain in the rear.
          5) When you finally have the useful information, distill it into some kind of index. The way phpBB works is to break all the text of a given topic into words and then build a simple database table that associates each non-trivial word with the topics in which it occurs. (There's a rough sketch of steps 2-5 below.)

          Or something like that...
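
          Here's a very rough PHP sketch of steps 2-5. The function names are made up for illustration, the robots.txt handling is deliberately over-simplified (it ignores User-agent sections), and a real spider would need far more error handling:

          <?php
          // Step 2: a crude robots.txt check - it only honours "Disallow"
          // prefixes and ignores most of the robot exclusion standard.
          function allowed_by_robots($url) {
              $parts  = parse_url($url);
              $robots = @file_get_contents($parts['scheme'] . '://' . $parts['host'] . '/robots.txt');
              if ($robots === false) {
                  return true;   // no robots.txt, assume crawling is allowed
              }
              $path = isset($parts['path']) ? $parts['path'] : '/';
              foreach (explode("\n", $robots) as $line) {
                  if (preg_match('/^Disallow:\s*(\S+)/i', trim($line), $m) && strpos($path, $m[1]) === 0) {
                      return false;   // the URL falls under a disallowed prefix
                  }
              }
              return true;
          }

          // Steps 3-5: fetch the page, strip the markup, and reduce it to a
          // word list that could be written to an index table (word -> URL).
          function index_page($url) {
              if (!allowed_by_robots($url)) {
                  return array();
              }
              $html = @file_get_contents($url);     // step 3: capture the contents
              if ($html === false) {
                  return array();
              }
              $text  = strip_tags($html);           // step 4: very crude content extraction
              $words = str_word_count(strtolower($text), 1);
              return array_unique($words);          // step 5: the distilled word list
          }

          print_r(index_page('http://somedomain.com/a/path/to/a/file.html'));
          ?>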

            There are two parts: a web spider and a search engine. In reality there are probably a lot more.

            The web spider finds pages to read and goes and reads them. This could include:
            - Pages already in the database
            - Pages linked from pages in the database
            - Pages manually submitted
            - Possibly entries in various directories

            Each page is read and the relevant text extracted. The exact method for doing this will vary, but they'd typically use an HTML parser. Major search engines (such as Google) can also extract text and links from MS Word documents, PDFs, etc.
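
            For what it's worth, here's one way to do the text extraction with PHP's built-in HTML parser (DOMDocument) rather than regexes; the URL is just a placeholder:

            <?php
            // Extract the visible text of a page, dropping <script> and
            // <style> elements so their contents don't pollute the index.
            $html = @file_get_contents('http://example.com/');   // placeholder URL
            if ($html === false || $html === '') {
                exit("Could not fetch the page\n");
            }

            $doc = new DOMDocument();
            @$doc->loadHTML($html);   // @ hides warnings from invalid HTML

            foreach (array('script', 'style') as $tag) {
                $nodes = $doc->getElementsByTagName($tag);
                while ($nodes->length > 0) {               // the node list is "live",
                    $node = $nodes->item(0);               // so remove from the front
                    $node->parentNode->removeChild($node);
                }
            }

            // textContent gives the remaining text in one string.
            $text = $doc->documentElement->textContent;
            echo trim(preg_replace('/\s+/', ' ', $text));
            ?>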

            The next part is to put the page into a highly specialised database which allows for very fast searching - effectively they have a type of full-text index which is very specialised for their application.

            All the search engine really does is search this database using this highly specialised index.
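
            A toy version of that idea in PHP: an inverted index mapping each word to the URLs it appears on, so a query is just a lookup plus an intersection instead of a scan of the page text. The data and function name are made up for illustration; a real engine keeps this in a purpose-built on-disk structure, not a PHP array.

            <?php
            // word -> list of URLs containing that word
            $index = array(
                'php'    => array('http://example.com/a.html', 'http://example.com/b.html'),
                'spider' => array('http://example.com/b.html'),
            );

            // Return the URLs that contain every word of the query.
            function search(array $index, $query) {
                $results = null;
                foreach (str_word_count(strtolower($query), 1) as $word) {
                    $urls    = isset($index[$word]) ? $index[$word] : array();
                    $results = ($results === null) ? $urls : array_intersect($results, $urls);
                }
                return ($results === null) ? array() : array_values($results);
            }

            print_r(search($index, 'php spider'));   // only b.html contains both words
            ?>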


            I've written a spider in PHP; it wasn't very easy and didn't work particularly well. There are a lot of issues you will need to think about, chiefly how to get around broken servers (e.g. ones which send back a 200 OK response for files which don't exist) and servers / domains which send content to deliberately confuse or pollute your spider.
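
            One way to spot the "200 OK for everything" servers is to probe for a path that almost certainly doesn't exist and see whether the server still claims success. A sketch, assuming PHP's cURL extension is available (the probe URL and function name are just placeholders):

            <?php
            // Return the HTTP status code for a URL without downloading the body.
            function http_status($url) {
                $ch = curl_init($url);
                curl_setopt($ch, CURLOPT_NOBODY, true);           // HEAD-style request
                curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
                curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
                curl_exec($ch);
                $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
                curl_close($ch);
                return $code;
            }

            $probe = 'http://somedomain.com/this-page-should-not-exist-12345';
            if (http_status($probe) == 200) {
                // The server "finds" a page that can't exist, so its 200
                // responses can't be trusted when deciding what to index.
                echo "Server appears to send 200 OK for missing pages\n";
            }
            ?>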

            Mark
