Blocking Subdomains


7 replies to this topic

#1 amabaie

amabaie

    HR 6

  • Active Members
  • 606 posts
  • Location:Ontario, Canada

Posted 27 August 2007 - 10:11 PM

We are about to launch a site, but Google seems to have indexed the development site, despite no links to it. In response, I placed

# go away
User-agent: *
Disallow: /

in the robot.txt file in the root directory. Now Google Webmaster Tools says that the site is not indexed, but it specifies the site with the www. The problem is that there are several subdomains (one for each city) that are dynamically created through ASP coding, and most of them (not all) remain indexed. Is there a way in the robot.txt file to list each subdomain so that it is excluded, in its entirety, from the index?

Thanks.

#2 Alan Perkins

Alan Perkins

    Token male admin

  • Admin
  • 1,559 posts
  • Location:UK

Posted 28 August 2007 - 05:00 AM

Hi amabaie

First things first, it's robots.txt, not robot.txt. :)

QUOTE
Is there a way in the robot.txt file to list each subdomain to be, in its entirety, excluded from the index?

No.

You need to either
  • put each subdomain in a separate directory on your server, each with its own robots.txt, or
  • dynamically write the robots.txt file


#3 amabaie

amabaie

    HR 6

  • Active Members
  • 606 posts
  • Location:Ontario, Canada

Posted 28 August 2007 - 08:07 AM

QUOTE(Alan Perkins @ Aug 28 2007, 06:00 AM)
Hi amabaie

First things first, it's robots.txt, not robot.txt. :)


Sorry Alan...my typo.

Thanks for the response. I was hoping there would be a way...


#4 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 28 August 2007 - 12:27 PM

To clarify the issue, in case it wasn't already...

As far as the search engines are concerned, each subdomain is a standalone site/domain and thus needs to have its own robots.txt. That means a request for somesub.domain.com/robots.txt has to return a file.

As Alan pointed out above, as long as the subdomain files are in their own directory where the server is managing the connection/rewriting, you're okay. You can also create a robots.txt file dynamically if need be, which can be done in one of several ways. The easiest is probably a scripted 404 error page that looks for requests for robots.txt and handles them dynamically.
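The "dynamically create a robots.txt" idea boils down to picking the file body from the hostname the request came in on. Here is a minimal, language-neutral sketch in Python (not the ASP used on the site in question). The function name `robots_txt_for`, the `main_host` default and the allow-all body for the main site are assumptions made for illustration only.

```python
# Body that blocks every crawler from the whole host
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"
# Stand-in for whatever the real main-site robots.txt contains (assumption)
ALLOW_ALL = "User-agent: *\nDisallow:\n"


def robots_txt_for(host, main_host="www.domain.com", main_body=ALLOW_ALL):
    """Return the robots.txt body to serve for a given Host header value."""
    # Drop any port and normalise case; hostnames are case-insensitive
    hostname = host.split(":", 1)[0].strip().lower()
    if hostname == main_host:
        return main_body
    # Every other host (subdomains, bare domain) gets the blanket disallow
    return DISALLOW_ALL


if __name__ == "__main__":
    for h in ("www.domain.com", "somesub.domain.com", "domain.com"):
        print(h, "->", repr(robots_txt_for(h)))
```

Whatever serves /robots.txt (a script, a 404 handler, a rewrite rule) only has to make this one decision per request.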

A third way you can handle it with subdomains, if you're on a *nix system running Apache with mod_rewrite, is to set up a little RewriteCond/RewriteRule conditional that catches any request for the robots.txt filename and delivers a different file. I do this with my test design subdomains to make sure they don't accidentally get spidered/indexed during development, while leaving the main www sub open to be spidered. In case it'll help anybody, the .htaccess I use to do this is as follows:

CODE
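Options +FollowSymlinks
RewriteEngine on
# Only act on requests for robots.txt ...
RewriteCond %{REQUEST_FILENAME} robots\.txt
# ... that arrive on any host other than the main www site
RewriteCond %{HTTP_HOST} !www\.mydomain\.com [NC]
# Serve the blanket-disallow file instead
RewriteRule ^(.*)$ norobots.txt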


The norobots.txt file is simply one that contains the standard User-agent: * and Disallow: / lines.
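As a quick sanity check that those two lines really do shut out a compliant crawler, they can be run through Python's standard-library robots.txt parser. This is only a sketch: the helper name `allowed`, the `dev.mydomain.com` URL and the "Googlebot" agent string are placeholders chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# The contents of norobots.txt as described above
NOROBOTS = "User-agent: *\nDisallow: /\n"


def allowed(robots_body, url, agent="Googlebot"):
    """Return True if `agent` may fetch `url` under the given robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical development URL
    print(allowed(NOROBOTS, "http://dev.mydomain.com/any/page.html"))
```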

FWIW, the above would also tell spiders not to crawl non-www addresses for the domain, which in theory might help to combat this type of duplicate content. This is not the way I would suggest handling www/non-www issues however. I'm already handling that redirection earlier in my htaccess, so I don't have to deal with it for my test design subs.
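To see which hostnames the negated host condition catches, here is a rough Python emulation of just that one test. It is an approximation for illustration, not how Apache evaluates the rule internally; it assumes only that the pattern is an unanchored, case-insensitive regex (the `[NC]` flag) and that the leading `!` negates the match. The function name `gets_norobots` and the sample hostnames are made up.

```python
import re

# Same pattern as the RewriteCond, with [NC] mapped to IGNORECASE
MAIN_HOST = re.compile(r"www\.mydomain\.com", re.IGNORECASE)


def gets_norobots(host):
    """True when the negated condition holds, i.e. the host is served norobots.txt."""
    # The pattern is unanchored, so it matches anywhere in the hostname
    return MAIN_HOST.search(host) is None


if __name__ == "__main__":
    for host in ("www.mydomain.com", "WWW.MyDomain.com",
                 "mydomain.com", "test.mydomain.com"):
        print(host, "->", "norobots.txt" if gets_norobots(host) else "normal robots.txt")
```

The bare `mydomain.com` lands in the blocked group, which is the non-www side effect mentioned above.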

#5 chrishirst

chrishirst

    A not so moderate moderator.

  • Moderator
  • 5,882 posts
  • Location:Blackpool UK

Posted 28 August 2007 - 06:47 PM

Dynamic robots.txt using a custom 404 page

CODE
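<%
' IIS passes the originally requested URL to a custom 404 page in the
' query string, typically in the form 404;http://sub.host.tld:80/robots.txt
Dim strQuery
strQuery = LCase(Request.ServerVariables("QUERY_STRING"))

If InStr(strQuery, "robots.txt") > 0 Then
    ' Serve as plain text rather than the ASP default of text/html
    Response.ContentType = "text/plain"
    If InStr(strQuery, "www") = 0 Then
        ' Any host without "www": block all crawling
        Response.Write "User-agent: *" & vbCrLf
        Response.Write "Disallow: /" & vbCrLf
    Else
        ' The www host: stream the real robots file
        StreamText "main-robots.txt"
    End If
End If
%>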


StreamText sub
CODE
<%
Sub StreamText(TextFileName)
' read in a file and stream it out to the browser
Dim objFSO, objTextFile
Set objFSO = CreateObject("Scripting.FileSystemObject")
Set objTextFile = objFSO.OpenTextFile(Server.MapPath(TextFileName))

Do While Not objTextFile.AtEndOfStream
    Response.Write objTextFile.ReadLine  & vbCrLf  
Loop
objTextFile.Close
Set objTextFile = Nothing
Set objFSO = Nothing
End Sub

%>


Rename your robots.txt to main-robots.txt.

A request for www.host.tld/robots.txt will stream main-robots.txt to the UA, while anything else, including host.tld/robots.txt, will get a wildcard disallow.

#6 Alan Perkins

Alan Perkins

    Token male admin

  • Admin
  • 1,559 posts
  • Location:UK

Posted 29 August 2007 - 03:18 AM

Hey Chris, I'm no ASP expert, but shouldn't there be a line in there somewhere to set the HTTP response to a 200?

#7 amabaie

amabaie

    HR 6

  • Active Members
  • 606 posts
  • Location:Ontario, Canada

Posted 29 August 2007 - 09:23 AM

Wow. Thanks for all the suggestions. I am sure our programmer will be able to use something in here!

#8 chrishirst

chrishirst

    A not so moderate moderator.

  • Moderator
  • 5,882 posts
  • Location:Blackpool UK

Posted 30 August 2007 - 07:13 PM

QUOTE
but shouldn't there be a line in there somewhere to set the HTTP response to a 200

Not necessary at all.

A scripted 404 page will give a 200 response anyway; it's the 404 response you need to send explicitly.
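The same status-code split can be shown outside ASP. Below is a small Python WSGI sketch, offered purely as an illustration: `error_handler` is a hypothetical stand-in for the scripted 404 page, and unlike the ASP case nothing here defaults to 200, so both statuses are written out to make the difference visible.

```python
DISALLOW_ALL = "User-agent: *\nDisallow: /\n"


def error_handler(environ, start_response):
    """Hypothetical catch-all handler standing in for the scripted 404 page."""
    path = environ.get("PATH_INFO", "")
    if path == "/robots.txt":
        # The request is being answered, so crawlers should see a 200
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [DISALLOW_ALL.encode("ascii")]
    # Anything else really is missing: the 404 has to be sent explicitly
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not found\n"]


def call(path):
    """Invoke the handler with a stub start_response; return (status, body)."""
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    body = b"".join(error_handler({"PATH_INFO": path}, start_response))
    return seen["status"], body


if __name__ == "__main__":
    print(call("/robots.txt")[0])
    print(call("/no-such-page.html")[0])
```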




