Search Engines Are Not Obeying Robots.txt?


7 replies to this topic

#1 azhar5i

azhar5i

    HR 1

  • Members
  • 4 posts
  • Location:London

Posted 26 August 2006 - 07:17 AM

Hello,

I have set up a robots.txt such as:

useragent: *
Disallow: /a
Disallow: /b/home.php

I really wanted all the biggies (Google, Yahoo and MSN) to stay away from my PHP version, but damn! They keep accessing my PHP pages.

The question is: I have read somewhere that to stop dynamically generated pages we must use,

useragent: *
Disallow: /a
Disallow: /b/home.php/*?

Is (/*?) better?
Or would (/*?) help me stop the search engines any better?

I am really confused. Please share your views and answers; I am desperately waiting.

Thanks

#2 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 26 August 2006 - 08:17 AM

Welcome azhar5i!

Is the robots.txt uploaded to the root level of your site? E.g., when you load up www.yourdomain.com/robots.txt do you see your robots.txt instructions? Also, does the request for this file produce a 200 OK response? And is it named robots.txt, or Robots.txt with the R in upper case?

I ask because it sounds like the spiders aren't recognizing your robots.txt file.
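
Those checks (root-level location, lowercase file name, 200 OK response) can be sketched in Python. This is only an illustration: the helper names robots_txt_url and fetch_status are made up for this example, and www.yourdomain.com is the placeholder domain used above.

```python
from urllib.error import HTTPError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site serving page_url."""
    parts = urlsplit(page_url)
    # Spiders request /robots.txt (lowercase) at the root of the host only.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_status(url: str) -> int:
    """Return the HTTP status code for url (200 means the file was found)."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.status
    except HTTPError as err:
        # 4xx/5xx answers are raised as HTTPError; the code is still useful.
        return err.code


url = robots_txt_url("http://www.yourdomain.com/b/home.php?id=1")
print(url)
# fetch_status(url) needs network access and a real domain, so it is not
# called here; on a live site a result of 200 is the one you want.
```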

As I understand it, your current robots.txt should work. There is an implied wildcard at the end of each disallow line. So your Disallow: /a would exclude all of the following:

1. A sub-directory off of root named /a/
2. A sub-directory off of root named /answers
3. A file at the root level named a.html or a.php
4. A file at the root level named ahead.html or answers.php, basically any file at your root level that starts with the letter a since your exclusion does not include a trailing slash to let the spiders know you mean to exclude a directory.

While your 2nd disallow instruction would exclude:

1. The file that is located at /b/home.php no matter what query strings may follow the base file name.
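
The prefix matching described above can be tried out with Python's standard-library robots.txt parser. This is a rough sketch, not how any particular search engine works: the parser is only a stand-in for a real spider, ExampleBot is a made-up crawler name, and the file is written with the standard User-agent spelling.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /a
Disallow: /b/home.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def blocked(path: str) -> bool:
    """True if the rules above stop a generic crawler from fetching path."""
    return not parser.can_fetch("ExampleBot", "http://www.yourdomain.com" + path)


# Each Disallow value is treated as a case-sensitive prefix of the URL path.
for path in ["/a/", "/answers", "/a.html", "/answers.php",
             "/b/home.php", "/b/home.php?id=3", "/b/other.php", "/index.html"]:
    print(path, "blocked" if blocked(path) else "allowed")
```

With this parser, everything starting with /a and every request to /b/home.php (with or without a query string) comes back blocked, while /b/other.php and /index.html stay allowed.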

You shouldn't need to use the query string thingee you mentioned. First, it's non-standard, though some of the engines will obey both the * wildcard and the ? query notice. The only time I've seen it used is when you want some queries to a base filename to be excluded, while others are allowed. For instance, if you wanted to exclude a printer-friendly version of your pages and a "print" variable showed up in the query string for those requests, you could use it to exclude those pages from being crawled.

#3 azhar5i

azhar5i

    HR 1

  • Members
  • 4 posts
  • Location:London

Posted 28 August 2006 - 01:35 AM

Hello Randy,

Thanks for the warm welcome.

Yeah my robots.txt file is in root directory with naming (robots.txt), it is accessible and doesn't produce a 200 error?

Another thing I have read somewhere is that the (/*?) wildcard characters are especially helpful for disallowing dynamically generated pages, and yes, my site does have dynamically generated pages.

But the main PHP pages are home.php, products.php etc., which should be disallowed when a bot is accessing the site.

What if I include (/*?) with my PHP pages? Like:

Useragent: Googelbot
Disallow: /b/home.php/*?
Disallow: /b/products.php/*?

Etc?

#4 projectphp

projectphp

    Lost in Translation

  • Moderator
  • 2,203 posts
  • Location:Sydney Australia

Posted 28 August 2006 - 01:58 AM

OK, just got it:
CODE
useragent: *
Disallow: /a
Disallow: /b/home.php


SHOULD be:
CODE
User-agent: *
Disallow: /a
Disallow: /b/home.php

Notice the missing hyphen/dash in User-agent.
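
To see why the hyphen can matter, here is a small sketch using Python's standard-library robots.txt parser as a stand-in for a spider. It is an assumption that a given search engine behaves the same way; real spiders may be more forgiving of such typos. ExampleBot and the helper name can_crawl are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, path: str) -> bool:
    """Parse robots_txt and report whether a generic bot may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", "http://www.test.com" + path)


BROKEN = "useragent: *\nDisallow: /a\nDisallow: /b/home.php\n"
FIXED = "User-agent: *\nDisallow: /a\nDisallow: /b/home.php\n"

# This parser does not recognise "useragent", so the Disallow lines that
# follow belong to no record and are dropped.
print(can_crawl(BROKEN, "/b/home.php"))  # True: nothing is excluded
print(can_crawl(FIXED, "/b/home.php"))   # False: the rule applies
```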

For reference, I wrote the rest of this before I noticed that.

QUOTE
Yeah my robots.txt file is in root directory with naming (robots.txt), it is accessible and doesn't produce a 200 error?

200 is actually GOOD! It means the page is found.

Let's start again: robots.txt says do not download anything that starts with the string you add to a single line after Disallow:, but it DOES NOT say do not index. As such, the URL may still appear in Google et al., but they SHOULD NOT download the file.

So, some examples. Let's say your domain is:
www.test.com

Your robots.txt needs to be here:
www.test.com/robots.txt

Your current setup will ban the following page:
www.test.com/b/home.php

It will also stop ANYTHING on your site that starts with an a (lowercase ONLY), e.g.:
www.test.com/a-file.html
www.test.com/a-directory/

It will NOT stop:
www.test.com/A-file.htm
www.test.com/A-directory.htm

#5 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 28 August 2006 - 06:40 AM

QUOTE
Notice the missing hyphen/dash in User-agent.


Good catch! It's always the little things, isn't it?

#6 azhar5i

azhar5i

    HR 1

  • Members
  • 4 posts
  • Location:London

Posted 28 August 2006 - 07:49 AM

That's not it.

Actually I made that mistake when writing here, but not when uploading my robots.txt file...

OK, I am pasting my robots.txt file here; please point out anything that is wrong...

CODE
User-agent: Googlebot
Disallow: /_vti_txt/
Disallow: /arbicocomputers.co.uk/
Disallow: /arbicocomputers.com/
Disallow: /BACK-UP/
Disallow: /for ebay shop/
Disallow: /ftptest/
Disallow: /logs/
Disallow: /otherdomains/
Disallow: /.bash_history/
Disallow: /_vti_inf.html
Disallow: /postinfo.html
Disallow: /wsc86535089/admin/
Disallow: /wsc86535089/arbico/
Disallow: /wsc86535089/arbicomis/
Disallow: /wsc86535089/catalog/
Disallow: /wsc86535089/db/
Disallow: /wsc86535089/ebay/
Disallow: /wsc86535089/files/
Disallow: /wsc86535089/fulfilment/
Disallow: /wsc86535089/help/
Disallow: /wsc86535089/include/
Disallow: /wsc86535089/log/
Disallow: /wsc86535089/mail/
Disallow: /wsc86535089/main/
Disallow: /wsc86535089/modules/
Disallow: /wsc86535089/pages/
Disallow: /wsc86535089/payment/
Disallow: /wsc86535089/payments/
Disallow: /wsc86535089/provider/
Disallow: /wsc86535089/schemes/
Disallow: /wsc86535089/shipping/
Disallow: /wsc86535089/single/
Disallow: /wsc86535089/skin1/
Disallow: /wsc86535089/skin1_original/
Disallow: /wsc86535089/smarty-2.6.9/
Disallow: /wsc86535089/sql/
Disallow: /wsc86535089/templates_c/
Disallow: /wsc86535089/upgrade/
Disallow: /wsc86535089/websitecm/
Disallow: /wsc86535089/about_us.php
Disallow: /wsc86535089/adaptive.php
Disallow: /wsc86535089/auth.php
Disallow: /wsc86535089/cart.php
Disallow: /wsc86535089/cat_count.php
Disallow: /wsc86535089/changes_password.php
Disallow: /wsc86535089/check_requirements.php
Disallow: /wsc86535089/cleanup.php
Disallow: /wsc86535089/cmp.doc
Disallow: /wsc86535089/cmpi_popup.php
Disallow: /wsc86535089/config.php
Disallow: /wsc86535089/COPYRIGHT
Disallow: /wsc86535089/customer_testimonials.php
Disallow: /wsc86535089/default_icon.gif
Disallow: /wsc86535089/default_image.gif
Disallow: /wsc86535089/default_logo.gif
Disallow: /wsc86535089/download.php
Disallow: /wsc86535089/ebay_customer_feedback.php
Disallow: /wsc86535089/error_message.php
Disallow: /wsc86535089/featured_products.php
Disallow: /wsc86535089/functions_seo.js
Disallow: /wsc86535089/giftcert.php
Disallow: /wsc86535089/help.php
Disallow: /wsc86535089/home.php
Disallow: /wsc86535089/https.php
Disallow: /wsc86535089/icon.php
Disallow: /wsc86535089/image.php
Disallow: /wsc86535089/index.php
Disallow: /wsc86535089/INSTALL
Disallow: /wsc86535089/install.php
Disallow: /wsc86535089/jscripts.zip
Disallow: /wsc86535089/manufacturers.php
Disallow: /wsc86535089/minicart.php
Disallow: /wsc86535089/mlogo.php
Disallow: /wsc86535089/news.php
Disallow: /wsc86535089/nocookie_warning.php
Disallow: /wsc86535089/order.php
Disallow: /wsc86535089/orders.php
Disallow: /wsc86535089/Orignal_home.php
Disallow: /wsc86535089/pages.php
Disallow: /wsc86535089/patch.1
Disallow: /wsc86535089/patch.pl
Disallow: /wsc86535089/phpinfo.php
Disallow: /wsc86535089/popup_info.php
Disallow: /wsc86535089/popup_poptions.php
Disallow: /wsc86535089/prepare.php
Disallow: /wsc86535089/process_order.php
Disallow: /wsc86535089/product.php
Disallow: /wsc86535089/product_image.php
Disallow: /wsc86535089/products.php
Disallow: /wsc86535089/README
Disallow: /wsc86535089/recommends.php
Disallow: /wsc86535089/referer.php
Disallow: /wsc86535089/register.php
Disallow: /wsc86535089/search.php
Disallow: /wsc86535089/secure_login.php
Disallow: /wsc86535089/send_to_friend.php
Disallow: /wsc86535089/shop_closed.html
Disallow: /wsc86535089/sitemap.php
Disallow: /wsc86535089/smarty.php
Disallow: /wsc86535089/test.php
Disallow: /wsc86535089/top.inc.php
Disallow: /wsc86535089/UPGRADE.readme
Disallow: /wsc86535089/VERSION
Disallow: /wsc86535089/vote.php

user-agent: Slurp
Disallow: /_vti_txt/
Disallow: /arbicocomputers.co.uk/
Disallow: /arbicocomputers.com/
Disallow: /BACK-UP/
Disallow: /for ebay shop/
Disallow: /ftptest/
Disallow: /logs/
Disallow: /otherdomains/
Disallow: /.bash_history/
Disallow: /_vti_inf.html
Disallow: /postinfo.html
Disallow: /wsc86535089/admin/
Disallow: /wsc86535089/arbico/
Disallow: /wsc86535089/arbicomis/
Disallow: /wsc86535089/catalog/
Disallow: /wsc86535089/db/
Disallow: /wsc86535089/ebay/
Disallow: /wsc86535089/files/
Disallow: /wsc86535089/fulfilment/
Disallow: /wsc86535089/help/
Disallow: /wsc86535089/include/
Disallow: /wsc86535089/log/
Disallow: /wsc86535089/mail/
Disallow: /wsc86535089/main/
Disallow: /wsc86535089/modules/
Disallow: /wsc86535089/pages/
Disallow: /wsc86535089/payment/
Disallow: /wsc86535089/payments/
Disallow: /wsc86535089/provider/
Disallow: /wsc86535089/schemes/
Disallow: /wsc86535089/shipping/
Disallow: /wsc86535089/single/
Disallow: /wsc86535089/skin1/
Disallow: /wsc86535089/skin1_original/
Disallow: /wsc86535089/smarty-2.6.9/
Disallow: /wsc86535089/sql/
Disallow: /wsc86535089/templates_c/
Disallow: /wsc86535089/upgrade/
Disallow: /wsc86535089/websitecm/
Disallow: /wsc86535089/about_us.php
Disallow: /wsc86535089/adaptive.php
Disallow: /wsc86535089/auth.php
Disallow: /wsc86535089/cart.php
Disallow: /wsc86535089/cat_count.php
Disallow: /wsc86535089/changes_password.php
Disallow: /wsc86535089/check_requirements.php
Disallow: /wsc86535089/cleanup.php
Disallow: /wsc86535089/cmp.doc
Disallow: /wsc86535089/cmpi_popup.php
Disallow: /wsc86535089/config.php
Disallow: /wsc86535089/COPYRIGHT
Disallow: /wsc86535089/customer_testimonials.php
Disallow: /wsc86535089/default_icon.gif
Disallow: /wsc86535089/default_image.gif
Disallow: /wsc86535089/default_logo.gif
Disallow: /wsc86535089/download.php
Disallow: /wsc86535089/ebay_customer_feedback.php
Disallow: /wsc86535089/error_message.php
Disallow: /wsc86535089/featured_products.php
Disallow: /wsc86535089/functions_seo.js
Disallow: /wsc86535089/giftcert.php
Disallow: /wsc86535089/help.php
Disallow: /wsc86535089/home.php
Disallow: /wsc86535089/https.php
Disallow: /wsc86535089/icon.php
Disallow: /wsc86535089/image.php
Disallow: /wsc86535089/index.php
Disallow: /wsc86535089/INSTALL
Disallow: /wsc86535089/install.php
Disallow: /wsc86535089/jscripts.zip
Disallow: /wsc86535089/manufacturers.php
Disallow: /wsc86535089/minicart.php
Disallow: /wsc86535089/mlogo.php
Disallow: /wsc86535089/news.php
Disallow: /wsc86535089/nocookie_warning.php
Disallow: /wsc86535089/order.php
Disallow: /wsc86535089/orders.php
Disallow: /wsc86535089/Orignal_home.php
Disallow: /wsc86535089/pages.php
Disallow: /wsc86535089/patch.1
Disallow: /wsc86535089/patch.pl
Disallow: /wsc86535089/phpinfo.php
Disallow: /wsc86535089/popup_info.php
Disallow: /wsc86535089/popup_poptions.php
Disallow: /wsc86535089/prepare.php
Disallow: /wsc86535089/process_order.php
Disallow: /wsc86535089/product.php
Disallow: /wsc86535089/product_image.php
Disallow: /wsc86535089/products.php
Disallow: /wsc86535089/README
Disallow: /wsc86535089/recommends.php
Disallow: /wsc86535089/referer.php
Disallow: /wsc86535089/register.php
Disallow: /wsc86535089/search.php
Disallow: /wsc86535089/secure_login.php
Disallow: /wsc86535089/send_to_friend.php
Disallow: /wsc86535089/shop_closed.html
Disallow: /wsc86535089/sitemap.php
Disallow: /wsc86535089/smarty.php
Disallow: /wsc86535089/test.php
Disallow: /wsc86535089/top.inc.php
Disallow: /wsc86535089/UPGRADE.readme
Disallow: /wsc86535089/VERSION
Disallow: /wsc86535089/vote.php

User-agent: Msnbot
Disallow: /_vti_txt/
Disallow: /arbicocomputers.co.uk/
Disallow: /arbicocomputers.com/
Disallow: /BACK-UP/
Disallow: /for ebay shop/
Disallow: /ftptest/
Disallow: /logs/
Disallow: /otherdomains/
Disallow: /.bash_history/
Disallow: /_vti_inf.html
Disallow: /postinfo.html
Disallow: /wsc86535089/admin/
Disallow: /wsc86535089/arbico/
Disallow: /wsc86535089/arbicomis/
Disallow: /wsc86535089/catalog/
Disallow: /wsc86535089/db/
Disallow: /wsc86535089/ebay/
Disallow: /wsc86535089/files/
Disallow: /wsc86535089/fulfilment/
Disallow: /wsc86535089/help/
Disallow: /wsc86535089/include/
Disallow: /wsc86535089/log/
Disallow: /wsc86535089/mail/
Disallow: /wsc86535089/main/
Disallow: /wsc86535089/modules/
Disallow: /wsc86535089/pages/
Disallow: /wsc86535089/payment/
Disallow: /wsc86535089/payments/
Disallow: /wsc86535089/provider/
Disallow: /wsc86535089/schemes/
Disallow: /wsc86535089/shipping/
Disallow: /wsc86535089/single/
Disallow: /wsc86535089/skin1/
Disallow: /wsc86535089/skin1_original/
Disallow: /wsc86535089/smarty-2.6.9/
Disallow: /wsc86535089/sql/
Disallow: /wsc86535089/templates_c/
Disallow: /wsc86535089/upgrade/
Disallow: /wsc86535089/websitecm/
Disallow: /wsc86535089/about_us.php
Disallow: /wsc86535089/adaptive.php
Disallow: /wsc86535089/auth.php
Disallow: /wsc86535089/cart.php
Disallow: /wsc86535089/cat_count.php
Disallow: /wsc86535089/changes_password.php
Disallow: /wsc86535089/check_requirements.php
Disallow: /wsc86535089/cleanup.php
Disallow: /wsc86535089/cmp.doc
Disallow: /wsc86535089/cmpi_popup.php
Disallow: /wsc86535089/config.php
Disallow: /wsc86535089/COPYRIGHT
Disallow: /wsc86535089/customer_testimonials.php
Disallow: /wsc86535089/default_icon.gif
Disallow: /wsc86535089/default_image.gif
Disallow: /wsc86535089/default_logo.gif
Disallow: /wsc86535089/download.php
Disallow: /wsc86535089/ebay_customer_feedback.php
Disallow: /wsc86535089/error_message.php
Disallow: /wsc86535089/featured_products.php
Disallow: /wsc86535089/functions_seo.js
Disallow: /wsc86535089/giftcert.php
Disallow: /wsc86535089/help.php
Disallow: /wsc86535089/home.php
Disallow: /wsc86535089/https.php
Disallow: /wsc86535089/icon.php
Disallow: /wsc86535089/image.php
Disallow: /wsc86535089/index.php
Disallow: /wsc86535089/INSTALL
Disallow: /wsc86535089/install.php
Disallow: /wsc86535089/jscripts.zip
Disallow: /wsc86535089/manufacturers.php
Disallow: /wsc86535089/minicart.php
Disallow: /wsc86535089/mlogo.php
Disallow: /wsc86535089/news.php
Disallow: /wsc86535089/nocookie_warning.php
Disallow: /wsc86535089/order.php
Disallow: /wsc86535089/orders.php
Disallow: /wsc86535089/Orignal_home.php
Disallow: /wsc86535089/pages.php
Disallow: /wsc86535089/patch.1
Disallow: /wsc86535089/patch.pl
Disallow: /wsc86535089/phpinfo.php
Disallow: /wsc86535089/popup_info.php
Disallow: /wsc86535089/popup_poptions.php
Disallow: /wsc86535089/prepare.php
Disallow: /wsc86535089/process_order.php
Disallow: /wsc86535089/product.php
Disallow: /wsc86535089/product_image.php
Disallow: /wsc86535089/products.php
Disallow: /wsc86535089/README
Disallow: /wsc86535089/recommends.php
Disallow: /wsc86535089/referer.php
Disallow: /wsc86535089/register.php
Disallow: /wsc86535089/search.php
Disallow: /wsc86535089/secure_login.php
Disallow: /wsc86535089/send_to_friend.php
Disallow: /wsc86535089/shop_closed.html
Disallow: /wsc86535089/sitemap.php
Disallow: /wsc86535089/smarty.php
Disallow: /wsc86535089/test.php
Disallow: /wsc86535089/top.inc.php
Disallow: /wsc86535089/UPGRADE.readme
Disallow: /wsc86535089/VERSION
Disallow: /wsc86535089/vote.php


?

[Edited to add code tags. ~Randy]

#7 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 28 August 2006 - 08:14 AM

Okay, first off there are lots of things that could stand to be cleaned up. Starting with the fact that you don't need to declare each search engine spider if you're excluding the same files for each. Just use the wildcard User-agent: *

There are also several entries that do nothing. The ones that reference domain names won't work. There are no sub-directories or files by those names. The same goes for several other files I spot checked. Basically, if you can't put your base URL plus what's in a disallow line together and load a real page (e.g. not your 404 Error page), you don't need that disallow line.

For areas that are password protected (for example your /logs/ entry) you also don't need an exclusion. Spiders don't enter usernames or passwords, so all you're doing is letting potential hackers know which areas you have that are protected.

You can also safely remove hidden system files, for example .bash_history. These are system-level-only files and are not web accessible.

You have a lot of entries for the wsc86535089 sub-directory. If you want to exclude everything in that folder you can do it with one simple rule: Disallow: /wsc86535089/. There's no need to exclude every single file or directory that sits under this path if you want to exclude them all in one fell swoop.

For directories or files with a space in the name, you should replace the space character with %20.
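
As a quick illustration of the %20 replacement, Python's standard library can do the encoding. The path is the one from the posted robots.txt; the variable names are just for this example.

```python
from urllib.parse import quote, unquote

raw_path = "/for ebay shop/"
# quote() leaves "/" alone by default and turns each space into %20.
encoded = quote(raw_path)

print("Disallow: " + encoded)
```

This prints Disallow: /for%20ebay%20shop/, which is the form to put in the robots.txt line.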

Back to the original question. Do I gather that you simply want to exclude any dynamic pages that include a query string? Or do you need to exclude the pages even if no query string is present?

If it's the former, then a single line saying Disallow: /*? will work. At least for Google. I don't recall seeing if MSNbot or Slurp will honor such wildcards, but if they don't they'll simply ignore the rule.
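
To make the /*? rule concrete, here is a minimal sketch of wildcard matching. It assumes the semantics usually described for spiders that support this extension (* matches any run of characters, a trailing $ anchors the end of the URL, everything else is a plain prefix). It is not any engine's actual code, and the function name rule_matches is made up.

```python
import re


def rule_matches(pattern: str, url_path: str) -> bool:
    """Return True if a Disallow pattern with * wildcards matches url_path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each * stand for any characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, so a plain pattern acts as a prefix.
    return re.match(regex, url_path) is not None


print(rule_matches("/*?", "/b/home.php?id=1"))          # has a query string
print(rule_matches("/*?", "/b/home.php"))               # no query string
print(rule_matches("/b/home.php", "/b/home.php?id=1"))  # plain prefix rule
```

Under these assumptions the first and third calls match and the second does not, which is the difference between excluding only the query-string URLs and excluding the file outright.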

If you need to exclude the dynamic file or the sub-directory it's housed in, simply include that as a Disallow. Don't worry about the query strings. They'll get excluded too as long as either the dynamic file or the directory is excluded.

#8 azhar5i

azhar5i

    HR 1

  • Members
  • 4 posts
  • Location:London

Posted 29 August 2006 - 12:31 AM

OK Randy, first of all thanks for your help.

1- Yeah, I had User-agent: * before I replaced it with separate sections for each robot; I thought it might be causing the problem...

2- I don't want any search engine to access files that do not load a real page; that is all garbage to me, because I already have thousands of generated HTML pages in my root directory and don't want to create problems for the robots while they index my web site.

3- Basically my root directory is /wsc86535089/ and all my HTML pages reside under it...
Like: /wsc86535089/index.html

So that is the reason I didn't disallow the whole "/wsc86535089/" directory.
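
The difference between the two approaches can be sketched with Python's standard-library parser. This is an illustration under assumptions: the parser stands in for a real spider, ExampleBot is a made-up name, www.test.com is the example domain used earlier in the thread, and only two of the PHP files are listed to keep it short.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, path: str) -> bool:
    """Parse robots_txt and report whether a generic bot may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", "http://www.test.com" + path)


WHOLE_DIRECTORY = "User-agent: *\nDisallow: /wsc86535089/\n"
PER_FILE = (
    "User-agent: *\n"
    "Disallow: /wsc86535089/home.php\n"
    "Disallow: /wsc86535089/products.php\n"
)

for name, rules in [("whole directory", WHOLE_DIRECTORY), ("per file", PER_FILE)]:
    for path in ["/wsc86535089/index.html", "/wsc86535089/home.php?cat=1"]:
        print(name, path, "allowed" if allowed(rules, path) else "blocked")
```

The single directory rule also blocks /wsc86535089/index.html, while the per-file rules block only the PHP pages (query strings included) and leave the HTML pages crawlable.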

4- Thanks for suggesting %20 instead of a SPACE; I will do that.

And finally my real question: it is now clear that the wildcard characters (/*?) are only sure to work for GOOGLE BOT. MSN BOT and SLURP may not obey them...

So I would prefer not to use wildcard characters.

Thanks for the great help, Randy... and projectphp
(Y)



