Hello. I'm new to PHP.

I use INCLUDE files to frame my pages with static elements (like the company slogan and contact information) that are identical on every page of my site.

My concern is that bots like Google will decide this is all duplicate content (which it is, since the same includes appear on every page) and consequently refuse to crawl and index my site.

From the research I have done, I believe I can put all of these includes (with the redundant text) into one directory, then use a robots.txt file in my website's root directory to simply disallow bot access to that particular directory.

The thing that confuses me is this: since PHP pulls these includes in on the server side, I don't know whether the robots.txt file will work, or whether PHP will just pull these elements (includes) into the main page so that the bots never realize the content came from the "forbidden" includes directory.

Will the robots.txt file work in this scenario to keep the content of the includes from being crawled by bots as part of the main page? Will this fix my duplicate content problem as far as the search engines are concerned, or is there a BETTER way of accomplishing this?

Please advise!

Thanks a lot for any assistance!
😃 Jeff

    One option is to simply make the directory with those include files inaccessible from the Web.

      If you put all your included files into a directory and then use robots.txt to disallow indexing of that folder, web crawlers will not index what's in the include directory.

      The crawler sees nothing more than what a regular user sees (well, except that it doesn't see the images; it gets an all-text version of the site). So if you use PHP includes, the only hint of PHP the crawler ever gets is the file extension.

      I don't think Google will refuse to index your entire site just because you have static content. At worst it will not re-index a page that hasn't changed, so the rest of your pages will still be crawled. If crawlers behaved the way you're worried about, 99% of the web would never be indexed, since many pages don't change. Take blogs, for example: the main page might only change once a week, so a crawler that visits daily will notice there has been no change and treat the page as static.

      In short, robots.txt will work as you want. You can even disable crawling for specific pages, so you don't have to move your pages into a folder. Something like:

      User-agent: *
      Disallow: /include.php
      Disallow: /include2.php
      Disallow: /inc/
      Disallow: /img/
      Disallow: /images/
      Disallow: /js/
      Disallow: /css/

      for example.

        A few thoughts:

        Robots only "obey" the robots.txt file if they choose to do so. Naturally, the malicious ones will choose not to.

        Robots can only find one of your include files if there is a hyperlink to it from a web page the robot has already found, or if the file is in a web-accessible directory that (a) has no default web page file and (b) is served by a web server configured to show a directory listing when such a directory is accessed.

        By far the simplest solution so that you don't have to worry about any of the above is to store your include files outside of the web document root directory tree.
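
        For example (just a sketch; the /home/you/... paths and the header.php file name are hypothetical), if your document root is /home/you/public_html you could keep the include files in /home/you/includes and reference them by filesystem path:

        <?php
        // /home/you/includes sits outside public_html, so it can never be
        // requested by URL, but PHP can still read it from disk when the
        // page is built.
        include '/home/you/includes/header.php';
        ?>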

          NogDog wrote:

          By far the simplest solution so that you don't have to worry about any of the above is to store your include files outside of the web document root directory tree.

          Yes, that would be a little simpler than my suggestion of configuring the server to exclude that directory from the web space.
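
          If you did want to go the server-configuration route, a minimal sketch for Apache (assuming the includes live in a directory such as inc/ under the document root) is an .htaccess file in that directory that refuses direct requests:

          # .htaccess in the inc/ directory (Apache 2.2 syntax;
          # Apache 2.4 would use "Require all denied" instead)
          Order allow,deny
          Deny from all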

            Hello - Thanks for all the feedback!

            In my research I had also learned of placing the includes folder above the root folder. Regardless of whether I use that technique or robots.txt, the same thing confuses me:

            If the bots access the site the same way any live visitor does, by going to the home page and following navigation links, then the server will automatically insert the includes into the home page and EVERY subsequent page throughout my site (without anyone knowing there is even included content).

            For example, if I go to my home page and select "View Source" in the browser, I can't view any "PHP code" for the "includes"; I just see the HTML content of those includes automatically woven into the main body code of my webpage. Why would bots not see this code the same way, since the PHP is all run on the server side and not on the browser side?

            I'm just trying to wrap my brain around this. I really appreciate all the feedback.

            Thanks!
            😃 Jeff

              Hello again!

              Is there a way that I can view my site exactly as the search engines view it? (Be that text only, without scripts, or however they view it?)

              Thanks.
              🙂 Jeff

                Basically, just open a page in your browser and do a "view source". That's what the robots see (and that's what your browser sees, too). All a robot does is send HTTP requests just like your browser does. If the request is for a valid web resource then the server sends back the applicable output, which can be HTML text, the bytes of an image file, etc. Typically the robot will then scan any text it receives for hyperlinks and add them to its list of URLs to try, along with scraping out any text it finds useful/interesting for whatever purpose it is being used.
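
                To make that concrete, here is a rough sketch (not any search engine's actual code, just an illustration, and it assumes allow_url_fopen is enabled) of what a simple crawler does with a page: request it over HTTP, then scan the returned HTML for links to follow.

                <?php
                // The crawler only ever sees the finished HTML the server sends
                // back -- the same thing "View Source" shows -- never the PHP
                // that produced it.
                $url  = 'http://www.example.com/index.php'; // hypothetical URL
                $html = file_get_contents($url);            // same response a browser would get

                // Scrape out href values so they can be queued for crawling later.
                preg_match_all('/href\s*=\s*["\']([^"\']+)["\']/i', $html, $matches);
                foreach ($matches[1] as $link) {
                    echo $link, "\n";
                }
                ?>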

                  NogDog;10892755 wrote:

                  Basically, just open a page in your browser and do a "view source". That's what the robots see (and that's what your browser sees, too). All a robot does is send HTTP requests just like your browser does. If the request is for a valid web resource then the server sends back the applicable output, which can be HTML text, the bytes of an image file, etc. Typically the robot will then scan any text it receives for hyperlinks and add them to its list of URLs to try, along with scraping out any text it finds useful/interesting for whatever purpose it is being used.

                  Thanks for the feedback!

                  I have to admit, though, that just confuses me all the more as to why the robots.txt file (or placing the files above the public_html directory) would deter bots from the PHP INCLUDE content, since the bot wouldn't "see" any evidence of the presence of those files by viewing the page source (since PHP is processed on the server side).

                  If the bots access the site the same way any live visitor does, by going to the home page and following my navigation links, then the server will automatically insert the includes into the home page and EVERY subsequent page throughout my site (without anyone knowing there is even included content).

                  For example, if I go to my home page and select "View Source" in the browser, I can't view any "PHP code" for the "includes"; I just see the HTML content of those includes automatically woven into the main body code of my webpage. Why would bots not see this code the same way, since the PHP is all run on the server side and not on the browser side?

                  Sorry, I don't mean to be dense; I just really want to understand this. Thanks for the help!
                  😃 Jeff

                    example.php:

                    <html><head><title>Test</title></head><body>
                    <p>This is a test.</p>
                    <?php
                    include "include.php";
                    ?>
                    </body></html>
                    

                    include.php:

                    <?php
                    echo "<p>Hello, World!</p>\n";
                    ?>
                    

                    Access example.php from your browser, and you see:

                    This is a test.

                    Hello, World!

                    If you then do a "View Source" of that output in your browser, you see:

                    <html><head><title>Test</title></head><body>
                    <p>This is a test.</p>
                    <p>Hello, World!</p>
                    </body></html>
                    

                    That "View Source" output is exactly what a robot would see if it sent an HTTP request for the example.php file: it would have no clue that there was an "include.php" file, or any sort of include at all. Therefore, it would have no idea that there is such a file it could try to access.

                      NogDog;10892773 wrote:

                      That "View Source" output is exactly what a robot would see if it sent an HTTP request for the example.php file: it would have no clue that there was an "include.php" file, or any sort of include at all. Therefore, it would have no idea that there is such a file it could try to access.

                      Thanks for the feedback & examples NogDog!

                      That is how I understood it to work. In your example, though, instead of "Hello, World!" I have a two-paragraph static company bio that shows up on every single page of my site (in the left column of my page). Since it is in the left column, it shows up in "View Source" BEFORE the main body of my page.

                      According to my understanding, if a GoogleBot scans the page and sees redundant content (my company bio AGAIN...) then it will abort scanning/indexing that page, before it ever reaches the main body of my page, which has the content unique to that page.

                      I'm trying to figure out how to prevent the bots from scanning and indexing that left column (PHP Include).

                      If I tell a robots.txt not to scan and index my includes directory (as has been suggested), or even locate the directory above my public_html directory, why would that stop the bots from viewing the content of the include files when that content is dynamically fed into my webpage from the server side? Again, this included content is on every single webpage of my site. The bots don't have to know that an "include" is there; the HTML portion of the included content is automatically woven into my webpage from the server side.

                      How can I stop the bots from scanning duplicate content (stored in my includes) served on every page of my site?

                      Thanks for the help!
                      😃 Jeff

                        For one, I'm not sure that your assumption is true: that Google will ignore a page simply because it starts out with the same "boiler plate" text as do other pages. These forums have the same starting text on each page, yet new posts here regularly show up in Google within a day or less of being posted.

                        Anyway, as I tried to illustrate above, Google would have absolutely no idea which part of your page came from an include, so as far as your page-rank concerns go, includes have nothing to do with it. All Google sees is the output of the URL it requests; it has no more idea what other files might have been involved in generating that output on the server side than your browser does. As far as either one knows, you typed out all that "boiler plate" text separately on each page of the site.

                        If you truly feel that it affects Google's indexing of your pages, then the only thing that would change that would be to redesign/resequence the actual HTML output; but I really do doubt that it is necessary.

                          NogDog;10892779 wrote:

                          For one, I'm not sure that your assumption is true: that Google will ignore a page simply because it starts out with the same "boiler plate" text as do other pages. These forums have the same starting text on each page, yet new posts here regularly show up in Google within a day or less of being posted.

                          Thanks again for the feedback NogDog!

                          Your site here is a vB site with a lot of constantly new and relevant content, minimal static content, and a good page-rank. I also have a vB site that does OK in the search engines, though not nearly as well as your site here does.

                          However, the site that I'm working on right now (which I'm writing about) is NOT a vB site, has done notoriously BADLY in the search engines for the past few years, and has a LOT of static content on it. A lot of the research that I have done so far preaches heavily against redundant/duplicate content. I'm not positive that this is the problem with my site, but I know that my ranking stinks and that this can't be helping me any.

                          With that said, what you have described in how the bots read a webpage makes perfect sense to me. Without my being educated on the mechanics of bots, that is precisely how I would expect them to act. That is why I was concerned that the solutions above (robots.txt & directory hierarchy) would NOT prevent bots from scanning PHP server-side included html content.

                          Which sadly brings me back to where I started, with more understanding but still no solution. So does ANYONE know, is there A WAY for me to prevent bots from scanning this duplicate content (which I INCLUDE on every single page of my site)?

                          Thanks again for all the help!
                          😃 Jeff

                            You could include the common content at the end of the pages, then make some clever use of CSS (probably absolute positioning) to move it to the desired place on the page.
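
                            A rough sketch of that idea (the id names and widths here are made up, and it assumes the bio block has no positioned ancestor, so it is placed relative to the page):

                            <!-- in the body: unique content first, shared bio last -->
                            <div id="main">Unique page content...</div>
                            <div id="bio">Static company bio...</div>

                            /* in the stylesheet: leave room on the left, then pin the bio into it */
                            #main { margin-left: 220px; }
                            #bio  { position: absolute; top: 0; left: 0; width: 200px; }

                            That way the unique content comes first in the source that crawlers read, while visitors still see the bio in the left column.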
