Blocking Subdomains
Started by amabaie, Aug 27 2007 10:11 PM
7 replies to this topic
#1
Posted 27 August 2007 - 10:11 PM
We are about to launch a site, but Google seems to have indexed the development site, despite no links to it. In response, I placed
# go away
User-agent: *
Disallow: /
in the robot.txt file in the root directory. Now Google Webmaster Tools says that the site is not indexed, but it specifies the site with the www. The problem is that there are several subdomains (one for each city) that are dynamically created through ASP coding, and most of them (not all) remain indexed. Is there a way in the robot.txt file to list each subdomain to be, in its entirety, excluded from the index?
Thanks.
#2
Posted 28 August 2007 - 05:00 AM
Hi amabaie
First things first, it's robots.txt, not robot.txt.
QUOTE
Is there a way in the robot.txt file to list each subdomain to be, in its entirety, excluded from the index?
No. You need to either
- put each subdomain in a separate directory on your server, or
- dynamically write the robots.txt file
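To illustrate the second option, here is a minimal sketch in Python (not ASP) of the decision a dynamically written robots.txt has to make: look at the Host header and return either the normal rules or a blanket disallow. The names MAIN_HOST, ALLOW_ALL, BLOCK_ALL and robots_txt_for_host are hypothetical, and www.mydomain.com is only a placeholder for the one host that should stay crawlable.

```python
# Hypothetical names and values throughout; adapt to your own site.
MAIN_HOST = "www.mydomain.com"  # the only host that should be crawlable

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty Disallow permits everything
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # blanket disallow


def robots_txt_for_host(host_header: str) -> str:
    """Return the robots.txt body to serve for a given Host header value."""
    # Drop an optional ":port" suffix and normalise case,
    # since hostnames are case-insensitive.
    hostname = host_header.split(":", 1)[0].strip().lower()
    return ALLOW_ALL if hostname == MAIN_HOST else BLOCK_ALL
```

In this sketch a request arriving with Host "toronto.mydomain.com" (or the bare "mydomain.com") would get the blanket disallow, while "www.mydomain.com" would get the permissive rules. In practice the permissive branch would more likely return the contents of the real robots.txt.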
#4
Posted 28 August 2007 - 12:27 PM
To clarify the issue, in case it wasn't already...
As far as the search engines are concerned, each subdomain is a standalone site/domain and thus needs its own robots.txt. That means if they request somesub.domain.com/robots.txt, a file needs to be delivered.
As Alan pointed out above, as long as each subdomain's files are in their own directory, with the server managing the connection/rewriting, you're okay. You can also dynamically create a robots.txt file if need be, which can be accomplished in one of several ways. The easiest is probably a scripted 404 error page that looks for requests for robots.txt and handles them dynamically.
A third way to handle it with subdomains, if you're on a *nix system (Apache with mod_rewrite), is to set up a little RewriteCond/RewriteRule conditional that catches any request for the robots.txt filename and delivers a different file. I do this with my test design subdomains to make sure they don't accidentally get spidered/indexed during development, while leaving the main www sub open to be spidered. In case it'll help anybody, the .htaccess I use to do this is as follows:
CODE
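Options +FollowSymLinks
RewriteEngine on
# Match requests whose filename ends in robots.txt
RewriteCond %{REQUEST_FILENAME} robots\.txt$
# ...on any host that does not start with www.mydomain.com
RewriteCond %{HTTP_HOST} !^www\.mydomain\.com [NC]
# Serve the blanket-disallow file instead and stop processing rules
RewriteRule ^(.*)$ norobots.txt [L]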
The norobots.txt file simply contains the standard User-agent: * and Disallow: / lines.
FWIW, the above would also tell spiders not to crawl non-www addresses for the domain, which in theory might help combat www/non-www duplicate content. However, this is not the way I would suggest handling www/non-www issues. I'm already handling that redirection earlier in my .htaccess, so I don't have to deal with it for my test design subs.
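If you want to sanity-check that a file like norobots.txt really blocks everything, something along these lines may help. It is only a sketch using Python's standard urllib.robotparser; the function name blocks_everything, the "ExampleBot" user agent and the sample paths are made up for illustration, and checking two paths is a spot check rather than a proof.

```python
from urllib.robotparser import RobotFileParser


def blocks_everything(robots_txt: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the given robots.txt text denies the sample paths."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    # Spot-check the root and one deeper (hypothetical) path.
    sample_paths = ["/", "/some/page.html"]
    return not any(parser.can_fetch(user_agent, path) for path in sample_paths)
```

You would pass it the text that each subdomain actually returns for /robots.txt (fetched however you like) and expect True for the development subdomains and False for the www host.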
#5
Posted 28 August 2007 - 06:47 PM
Dynamic robots.txt using a custom 404 page
CODE
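<%
' Custom 404 handler: IIS passes the originally requested URL in QUERY_STRING,
' e.g. "404;http://www.host.tld:80/robots.txt"
Dim qs
qs = Request.ServerVariables("QUERY_STRING")
If InStr(1, qs, "robots.txt", vbTextCompare) > 0 Then
    With Response
        .ContentType = "text/plain"
        If InStr(1, qs, "www", vbTextCompare) = 0 Then
            ' No "www" in the requested URL: block everything
            .Write "User-agent: *" & vbCrLf
            .Write "Disallow: /" & vbCrLf
        Else
            ' www host: serve the real robots.txt content
            StreamText "main-robots.txt"
        End If
        .End ' stop here so the rest of the 404 page is not appended
    End With
End If
%>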
StreamText function
CODE
<%
Sub StreamText(TextFileName)
    ' read in a file and stream it out to the browser
    Dim objFSO, objTextFile
    Set objFSO = CreateObject("Scripting.FileSystemObject")
    Set objTextFile = objFSO.OpenTextFile(Server.MapPath(TextFileName))
    Do While Not objTextFile.AtEndOfStream
        Response.Write objTextFile.ReadLine & vbCrLf
    Loop
    objTextFile.Close
    Set objTextFile = Nothing
    Set objFSO = Nothing
End Sub
%>
Rename your robots.txt to main-robots.txt, so that requests for /robots.txt no longer find a file and fall through to the custom 404 page.
A request for www.host.tld/robots.txt will stream main-robots.txt to the UA, while anything else, including host.tld/robots.txt, will get a wildcard disallow.
#6
Posted 29 August 2007 - 03:18 AM
Hey Chris, I'm no ASP expert, but shouldn't there be a line in there somewhere to set the HTTP response to a 200?
#7
Posted 29 August 2007 - 09:23 AM
Wow. Thanks for all the suggestions. I am sure our programmer will be able to use something in here!
#8
Posted 30 August 2007 - 07:13 PM
QUOTE
but shouldn't there be a line in there somewhere to set the HTTP response to a 200
Not necessary at all.
A scripted 404 page will give a 200 response anyway; it's the 404 response that you need to send explicitly.