Hiya

I've an idea for something I'd like to tinker around with. It would involve quite a few automated Google searches.

I know that Google have an API which you can use to send a query to them, but there is a limit on the free use of that. Other than that it looks like exactly what I want.

I just want to be able to send a query term to Google and get responses back in XML. Do any of you good folks happen to know of any other source of this particular wheel, or is this something I'll need to build myself by parsing the HTML?

Thanks

    Other than it being illegal for you to do it in any way OTHER than using Google's API, I have no suggestions.

      ..interesting.

      I'm just "spidering" their site. Surely Google can't make an issue of me doing what they do all day every day.

      They don't even include directives on their pages to instruct any spider not to index their pages.

      I wonder how it's illegal. Where did you get that advice?

      Also I wonder where it's illegal. Is it illegal in my part of the world?

        Legality aside, it'd be pretty trivial for them to notice a spike in traffic from a single IP or group of IPs, and just block them. Unless you plan to use trojan/zombie machines all over the world, I'd think if they noticed the traffic (probably), and didn't like it (maybe), they could make it stop at will (definitely).

          Can't argue with any of that.

          I work in a reasonably large office, behind a proxy server, and I have noticed occasional google searches failing with a "too many searches from your IP" type message, so they obviously do have some automated frequency limit.

          The idea I originally had was to put a tool together to search Google for a phrase and monitor daily where in the results my site appeared for that phrase.

          I then thought of extending it a little to watch for where my URL was being linked to or referenced on other web pages, so that I could generate for myself a daily update of "Site mentioned on..." and "The following sites have recently linked to our site..." type stuff.
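
          Roughly what I have in mind for the rank-check step - a minimal sketch, assuming I've already got the result URLs in order (however I end up fetching them); the site name and URLs below are just made-up examples:

          <?php
          // Return the 1-based position of the first result belonging to my site,
          // or 0 if the site doesn't appear in the results at all.
          function findSiteRank(array $resultUrls, $mySite)
          {
              $position = 0;
              foreach ($resultUrls as $url) {
                  $position++;
                  $host = parse_url($url, PHP_URL_HOST);
                  // Match "example.com" as well as "www.example.com"
                  if ($host && preg_match('/(^|\.)' . preg_quote($mySite, '/') . '$/i', $host)) {
                      return $position;
                  }
              }
              return 0;
          }

          $urls = array(
              'http://www.other-site.com/page.html',
              'http://www.somewhere-else.net/',
              'http://www.example.com/about.html',
          );
          echo findSiteRank($urls, 'example.com');   // prints 3
          ?>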

          Then, finally, it occurred to me to make this service available to others via the web.

          I wouldn't have thought that the traffic would be particularly noticeable, but then again, 800 desk jockeys can occasionally accidentally trigger it in the course of a normal day's work, so who knows.

          Maybe I'll just develop this for myself and let anyone who wants it take a copy of the code and run it locally.

            http://www.google.com/intl/en/terms_of_service.html

            No Automated Querying

            You may not send automated queries of any sort to Google's system without express permission in advance from Google. Note that "sending automated queries" includes, among other things:

            * using any software which sends queries to Google to determine how a website or webpage "ranks" on Google for various queries;
            * "meta-searching" Google; and
            * performing "offline" searches on Google.

            Please do not write to Google to request permission to "meta-search" Google for a research project, as such requests will not be granted.

            You can see all the legality issues on that page. Just by making the software available for download, you're opening yourself up to lawsuits. Personal use also leaves you open.

            As a general rule, you should check the TOS of any site before trying to script against it. Otherwise, you're being negligent, and if you get sued over it, it really is your own fault.

              Originally posted by justsomeone
              ..interesting.

              I'm just "spidering" their site. Surely Google can't make an issue of me doing what they do all day every day.

              Do they directly use the content derived from your site?

              They don't even include directives on their pages to instruct any spider not to index their pages.

              http://www.google.com/robots.txt

              I wonder how it's illegal. Where did you get that advice?

              http://www.google.com/intl/en/terms_of_service.html

              Also I wonder where it's illegal. Is it illegal in my part of the world?

              If you violate their TOS regardless of location, you are liable...whether they'll win or not is another matter.

                Well, I guess that answers most of my questions then - certainly the "Why can't I?" ones - so thanks for all your efforts on that. You did spend some time on this.

                Strange how no-one offered anything on the "How could I?" side of this. But I guess you get that sometimes.

                I wonder if anyone has ever sued Google for breach of copyright for creating a copy (sorry, "cache") of their site.

                Anyway it's all by the way. It looks like what would have been a very useful tool wouldn't be appreciated by the folks at Google. Though it would be alright if I paid someone on minimum wage to do the very same task.

                Thanks again for the time you all put into this - I can't say it exactly cheered me up, and my concerns may be more moral than legal, but I appreciate you folks spent some time on it.

                J

                  The "How" part is pretty simple, past the legality of it. Since google uses GET variables to perform the search, you can just populate the URL and fget the HTML.

                    Originally posted by justsomeone

                    I wonder if anyone has ever sued Google for breach of copyright for creating a copy (sorry, "cache") of their site.

                    For it to be a copyright violation they would have to be altering, selling, or manipulating your data. What they do clearly falls under fair use, because they never do anything to infringe upon your copyright. To sue Google for copyright infringement would mean suing every person who ever visited your site and stored a copy in their browser's cache.

                    The difference between what they're doing and what you plan on doing should be a bit more obvious at this point. You want to use Google's application, take the data from it, and use it as your own.

                    Copyright in relation to the internet is a very complicated matter, to say the least. But (at least for the time being 🙁 ) there is still a fair-use clause that allows for non-infringing use of others' copyrights.

                      For it to be a copyright violation they would have to be altering, selling, or manipulating your data. What they do clearly falls under fair use, because they never do anything to infringe upon your copyright. To sue Google for copyright infringement would mean suing every person who ever visited your site and stored a copy in their browser's cache.

                      Um, nope to this.

                      Google makes its cache available to anyone with a PC. They copy other people's content and publish it. If you're looking for an analogy, try thinking of anyone who rips the tracks off a CD, stores the MP3s on a PC, and lets anyone who wants to download them. Doesn't that sound like the sort of thing people get sued for?

                        What Google is no doubt trying to avoid is someone putting up a search engine that just queries Google and reports back to some client somewhere. They're doing all the work and the Google brand gets no credit.

                        As for legality, are you willing to risk it? You may not even be noticed. If you get caught and they actually care, god help you. They have several billion dollars lying around to sue you so hard you might die.

                        As for technical implementation, there are a variety of PHP functions you can use, but I think they're pretty good about screening non-browser queries. You'd probably have to use an approach that lets you construct the headers manually, so you'd look like Mozilla or IE or whatever. Additionally, you'd probably have to dance around on a variety of IPs. You could try an approach like the SETI@home project, which parses astronomical observations for hints of extraterrestrial life on thousands of volunteer machines all over the world.
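
                        For instance, the curl extension lets you set the request headers yourself. A rough sketch only - untested against Google, and dressing the request up as a browser obviously doesn't make it any more permitted:

                        <?php
                        // Fetch a page while presenting a browser-like User-Agent header.
                        function fetchAsBrowser($url)
                        {
                            $ch = curl_init($url);
                            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body rather than printing it
                            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
                            curl_setopt($ch, CURLOPT_USERAGENT,
                                'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Gecko/20030624');
                            curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Language: en-us,en;q=0.5'));

                            $body = curl_exec($ch);
                            if ($body === false) {
                                echo 'curl error: ' . curl_error($ch) . "\n";
                            }
                            curl_close($ch);
                            return $body;
                        }

                        $html = fetchAsBrowser('http://www.google.com/search?q=' . urlencode('my phrase'));
                        ?>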

                        Or you could actually come up with your own useful technology rather than hassling a company that seems fairly benign.

                          Originally posted by justsomeone
                          Um, nope to this.

                          Google makes its cache available to anyone with a PC. They copy other people's content and publish it. If you're looking for an analogy, try thinking of anyone who rips the tracks off a CD, stores the MP3s on a PC, and lets anyone who wants to download them. Doesn't that sound like the sort of thing people get sued for?

                          IANAL, but I think it's more like they are taking a picture of someone's content and publishing that - just as a photographer retains the copyright to his own pictures, even if it's you in the picture.

                            true dat. and if you wanted them not to list you, i'm sure they'd be more than happy to oblige.

                              Originally posted by Elizabeth
                              IANAL

                              Me neither, so maybe we should park the legal side of this 🙂

                                Originally posted by Weedpacket
                                http://www.google.com/webmasters/faq.html#cached
                                Put briefly, Google uses a robot exclusion file.

                                Google provides two services - a search engine and a site cache.

                                As far as I know it's possible to "opt out" of either.

                                So, you can have Google ignore you completely.

                                Or you can have Google list you in its index, make copies of all your content and serve it up to anyone who wants it.

                                Or you can have Google list you in its index, but not take copies of your content.

                                Both should be "opt in", in my opinion. If I want Google to list me, I should have to ask them to. If I want them to make a copy of my site, I should have to ask them to.

                                Put briefly, "Opt out" sucks.

                                And speaking of briefly, I've had enough of this thread. No doubt you have too. Thanks for your time, effort and contributions - even if I disagree with the spirit of most of them.

                                I'll probably build my tool anyway, so I can monitor links to the sites I'm responsible for. I don't have the time to do the actual monitoring myself, and I'd rather not have to pay someone to do it. If the software solution gets me into trouble, I may "cease and desist" and go for a "pecking chicken" solution - fire up the query manually, move the pointer to the "next" link and get some mechanical device to click on the mouse button every so often. Then I can come back and parse my cached pages as I see fit. It's just that the pure software solution seems more elegant.
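
                                The parsing half is straightforward either way - something like this over whatever pages end up saved locally (the directory name and site are just placeholders):

                                <?php
                                $cacheDir = './saved-pages';   // wherever the fetched or cached result pages live
                                $mySite   = 'example.com';     // the site whose mentions I'm looking for

                                foreach (glob($cacheDir . '/*.html') as $file) {
                                    $html = file_get_contents($file);
                                    if ($html === false) {
                                        continue;
                                    }
                                    // Pull out any links that point at my site.
                                    $pattern = '/href="(https?:\/\/[^"]*' . preg_quote($mySite, '/') . '[^"]*)"/i';
                                    if (preg_match_all($pattern, $html, $m)) {
                                        echo basename($file), " links to:\n";
                                        foreach (array_unique($m[1]) as $link) {
                                            echo "  ", $link, "\n";
                                        }
                                    }
                                }
                                ?>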

                                  Originally posted by justsomeone
                                  Both should be "opt in", in my opinion. If I want Google to list me, I should have to ask them to. If I want them to make a copy of my site, I should have to ask them to.

                                  So you want Google to either index a uselessly tiny corner of the web, or maintain a second database consisting of every single site in the world, which it would need to check every time it follows a link.

                                  You missed another option: don't put anything on the web yourself. Don't opt in, in other words. If Google want to spend billions of dollars in equipment and bandwidth to surf the web at no cost to you, why should you complain?

                                  Thanks for your time, effort and contributions - even if I disagree with the spirit of most of them.

                                  Probably because we disagree with the spirit of your proposal. But then, if you want to use Google's billions of dollars worth of equipment and bandwidth for something you don't want to pay to do yourself, why should Google complain?

                                    Why don't you just use Alexa?

                                    alexa.com/data/details/main?q=&url=phpbuilder.com

                                    Seems to have a few of the details you mentioned you are interested in, and would save you a lot of time & effort (and possibly lawyers' fees!)

                                    adam
