Mar 1 2004, 03:50 PM
Post
#1
|
|
HR 2 Group: Members Posts: 12 Joined: 27-February 04 User's local time: Feb 9 2010, 12:02 PM Member No.: 2,691 |
Hello,
I was advised to create a 1x1 transparent GIF and link it to a page in order to track bots that ignore my robots.txt deny list. The purpose was to find the bots that do not adhere to the standard and decide whether I want to exclude them from crawling. I then read that doing this is considered a "no-no" and would cause my site to be dropped by the SEs. Can anyone confirm the accuracy of this information? thanks.... David |
|
|
|
Mar 1 2004, 03:55 PM
Post
#2
|
|
HR 7 Group: Active Members Posts: 2,333 Joined: 13-August 03 User's local time: Feb 9 2010, 10:02 AM From: Phoenix, AZ Member No.: 501 |
A lot of people use transparent images to track activity or something of that sort. There is no problem with this. I would recommend using something other than a 1x1 size, as it is my opinion that SEs may have automatic spam filters to detect 1x1 images.
I don't have any factual evidence for this. At any rate, if you are simply going to have one transparent image on one page, you shouldn't have any problems SEO-wise with that. |
|
|
|
Mar 1 2004, 07:35 PM
Post
#3
|
|
Token male admin Group: Admin Posts: 1,436 Joined: 28-July 03 User's local time: Feb 9 2010, 05:02 PM From: UK Member No.: 45 |
QUOTE(DavidatWork @ Mar 1 2004, 08:50 PM) I was given the advice to create a 1x1 transparent gif and link it to a page in order to track bots that ignored my robots.txt deny list. The purpose was to find the bots that were not adhering to the standards and see if I wanted to exclude them from crawling.
That's a fine purpose. However, your intent might be misinterpreted. I advise against using any invisible link, even to a page excluded by robots.txt.
QUOTE(searchrank) Alot of people will use transparent images to track activity or something of that source
Yes, these are known as Web bugs or Web beacons. However, they don't require anchor tags to be wrapped around them. Invisible links should be avoided, no matter what size they are. |
|
|
|
Mar 2 2004, 08:24 AM
Post
#4
|
|
HR 2 Group: Members Posts: 12 Joined: 27-February 04 User's local time: Feb 9 2010, 12:02 PM Member No.: 2,691 |
From the different opinions, I get the feeling that the jury is still out on this topic. I guess I will err on the side of caution and not use it. I will have to try to think of another method. thanks..... |
|
|
|
Mar 2 2004, 08:43 AM
Post
#5
|
|
HR 5 Group: Active Members Posts: 385 Joined: 29-January 04 User's local time: Feb 9 2010, 07:02 PM From: Cape Town, South Africa Member No.: 2,171 |
If you use a 1x1 transparent pixel without a link on it, it shouldn't be a problem, since sometimes you need just such a transparent pixel for spacing purposes, like when you need to force the browser to leave a gap. I have used the transparent pixel in dynamic menus on some sites, but that was limited use and for layout purposes only.
For bot tracking you can simply use any page of your choice, which might actually turn out to be more useful for bot testing. Bernhard |
|
|
|
Mar 2 2004, 11:49 AM
Post
#6
|
|
HR 6 Group: Moderator Posts: 918 Joined: 24-July 03 User's local time: Feb 9 2010, 12:02 PM From: Michigan USA Member No.: 17 |
One-pixel images have been used on the Internet for at least eight years that I know of, with absolutely no ill effects. But this isn't a question of using an invisible image, one-by-one or otherwise, but rather of using a hidden link.
For those who don't know what a Spider Trap is, it's essentially a link that no visitor will ever use (because it's hidden) and no reputable spider will ever use (because it's disallowed in robots.txt). Ergo, any access to the Trap must come from a "bad 'bot," one you probably don't want using your bandwidth. This might include email harvesters, content thieves, or just poorly written scripts. The Spider Trap tricks these bad 'bots into announcing their presence, leaving behind both their user agent and IP address, either of which can then be used to block their access to your site.
In short, Spider Traps serve a useful purpose. Could one conceivably get you in trouble with a search engine? It's possible. There are certainly not many valid reasons to hide a link from your visitors, and the invalid reasons are all pretty much synonymous with SE spam, so the mere presence of such a hidden link could potentially be a problem. Personally, I don't think it is, and I'm absolutely certain it shouldn't be.
We shouldn't do useless things simply to rank higher in a search engine, and the flip side of that is we shouldn't avoid doing useful things from fear of NOT ranking in a search engine. Forget that search engines exist, and a Spider Trap will still make sense. Unfortunately, that doesn't necessarily make the choice any more comfortable. Fortunately, the choice really isn't one that needs to be made anyway. It's usually not that difficult to set up a Spider Trap that is virtually guaranteed not to raise any SE eyebrows.
The trick is to implement your Trap on an excluded page. Find a page in your site you don't need indexed, such as a Privacy Statement or Feedback Form, and exclude it in your robots.txt file. Do NOTHING to that page immediately. Spiders don't check your robots.txt prior to grabbing every page, or even prior to every visit, so that page isn't really excluded until your logs indicate the spider has requested your robots.txt file again. Only then can you be certain the spider will know to avoid the page.
You can now safely install your Trap on that excluded page, confident that no reputable spider will ever see the hidden link. The bad 'bots, however, will see it, because they ignore robots.txt, and will still follow it, thus tripping your trap. All gain, no pain. |
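As a rough illustration of the Spider Trap idea described above, the sketch below does two things: it checks that a spider honoring robots.txt would refuse the trap URL, and it pulls the IP address and user agent of anything that requested the trap out of an access log. The trap path, the bot name, and the assumption that the server writes logs in the common "combined" format are all made up for the example; adjust them to your own site.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical trap location; pick a generic-looking name on a real site.
TRAP_PATH = "/extras/page2.html"

# The robots.txt rule that tells well-behaved spiders to stay away.
ROBOTS_TXT = f"""\
User-agent: *
Disallow: {TRAP_PATH}
"""


def compliant_bot_may_fetch(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if a spider that honors ROBOTS_TXT may request path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


# Assumed log layout (combined format):
# ip ident user [time] "METHOD path proto" status size "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def trap_hits(log_lines):
    """Return (ip, user_agent) pairs for every request to the trap page."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        # Ignore any query string when comparing against the trap path.
        if match and match.group("path").split("?")[0] == TRAP_PATH:
            hits.append((match.group("ip"), match.group("agent")))
    return hits
```

Anything returned by `trap_hits` ignored the exclusion, so those IP addresses and user agents are the candidates for blocking. Whether to block on IP, user agent, or both is left open here, as it is in the post.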
|
|
|
Mar 2 2004, 12:20 PM
Post
#7
|
|
Token male admin Group: Admin Posts: 1,436 Joined: 28-July 03 User's local time: Feb 9 2010, 05:02 PM From: UK Member No.: 45 |
Great post, Ron.
|
|
|
|
Mar 3 2004, 03:40 AM
Post
#8
|
|
|
HR 3 Group: Active Members Posts: 102 Joined: 20-September 03 User's local time: Feb 9 2010, 01:02 PM From: Ogden, Utah, USA Member No.: 856 |
Agreed, good post, Ron.
Having a good amount of experience, I have come to the following conclusion:
o Resources mentioned for exclusion in a robots.txt file become meaningless to a good search engine like Google.
o Therefore, a good search engine won't care whether you've got a 1x1 invisible pixel or a huge screaming link; it's going to ignore the link and the page it references, period.
I have had absolutely no trouble implementing spider traps. I think Google et al. know that this is an issue that real webmasters have to deal with and therefore won't try to punish you under these circumstances. I say, go for it. Do the 1x1 pixel, do whatever you want; it will be fine, so long as your spider trap is mentioned in your robots.txt file.
By the way, I've found quite a few pretty successful ways to catch bad spiders. Here are some additional ideas for you (besides what has already been mentioned):
o Add an entry to your robots.txt file for a page that isn't linked to from anywhere (i.e. it's an island page). Anything that visits that page should be banned.
o (This one's my favorite.) Say you have an images directory. You'll have links to the images inside the directory, but not to the directory itself. I've found that some bad robots will attempt to harvest the image directory even though there was no direct link to it (they are looking for all the images). So, what you do is code an index.html page in the images directory with a big fat link that says, 'don't click this link'. If anything follows the link, it goes to a page which bans it. The reason to include the link is that sometimes legit users are just really into your site and want to try to see everything possible. You don't want them to be banned, just the dumb spiders that don't know how to read English (or your favorite language).
o Make sure to add a (practically) invisible link to both the top and the bottom of the page, as I've found some bad spiders will traverse the links on your page in reverse order. Therefore, if your link to the banning page is at the top, they might not find it until they've already traversed all the links on the page. Even worse is when the robot does a depth-first search from bottom to top (i.e. it finds the bottom-most link on your site, follows it, then finds the next one, follows it, etc.). These robots would harvest your whole site before finding the banning link at the top of the page.
o Some robots will avoid pages whose names contain 'spam', 'trap', 'bait' or 'bot'. They are trying not to get caught, so they won't follow these links. Instead, make sure your spider traps are named more generically.
Probably most importantly, always (did I stress?) ALWAYS include information about why a bad robot was banned from your site. There are a lot of legit users out there who want to download your content, but have slow modems and need to walk away from their desks. So, they turn on their auto downloaders, which attempt to download your site while they're gone. The problem is that most of the time these are probably configured to ignore the robots.txt file. That doesn't mean you don't want them as your customers. So, make sure you include a simple error message describing the problem and a mechanism to get out of it (like a link that will remove their IP address, or whatever you're using to ban the robot by).
As far as implementation goes, since my websites are mostly dynamically driven (PHP), I have an include file which checks my bad robot database (IP address mostly) and returns a simple page that says essentially, 'your ip has been banned. please contact us to have this ban released' along with information about why it was banned.
And finally, if you really want to have fun with a bad robot, create what's called a "poison trap." Essentially, it's an endless pool of links that just lead to more endless links. A good example of this technique for email harvesting robots is called wpoison. You can fill a bad robot's hard drive up real fast using this technique. Though, you've got to have a lot of bandwidth available. =) |
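The ban check described above (an include that looks up the visitor in a bad robot database and returns an explanatory page) could look roughly like the following. This is a Python sketch rather than the poster's PHP include, and everything in it is an assumption for illustration: the `BanList` class, its in-memory dictionary standing in for the database, the 403 status, and the wording of the message.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class BanList:
    """In-memory stand-in for a 'bad robot database' keyed by IP address."""

    bans: dict = field(default_factory=dict)

    def ban(self, ip: str, reason: str) -> None:
        """Record a ban; called by the page the trap link points to."""
        # Keep the first reason so the message explains what tripped the trap.
        self.bans.setdefault(ip, (reason, datetime.now(timezone.utc)))

    def release(self, ip: str) -> bool:
        """Lift a ban; returns True if the address had been banned."""
        return self.bans.pop(ip, None) is not None

    def check(self, ip: str):
        """Return (status, body) to send instead of the page, or None if allowed."""
        if ip not in self.bans:
            return None
        reason, when = self.bans[ip]
        body = (
            "Your IP address has been banned.\n"
            f"Reason: {reason}\n"
            f"Banned on: {when:%Y-%m-%d %H:%M} UTC\n"
            "If you are a person and not a robot, please contact us "
            "to have this ban released."
        )
        return 403, body
```

Calling `check` at the top of every dynamic page gives the behavior described in the post: banned visitors see why they were banned and how to get out of it, and `release` is the hook a "remove my IP address" link would call.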
|
|
|
Mar 3 2004, 09:35 AM
Post
#9
|
|
HR 2 Group: Members Posts: 12 Joined: 27-February 04 User's local time: Feb 9 2010, 12:02 PM Member No.: 2,691 |
QUOTE I say, go for it. Do the 1x1 pixel, do whatever you want, it will be fine, so long as your spider trap is mentioned in your robots.txt file.
Are you saying I should just put some commented text in my robots.txt file that says "hey, i have a bad bots trap and this is where it is"?
QUOTE If anything follows the link, they go to a page which bans them.
Is there any pre-written script out there that does this? Since I'm not a programmer, I don't have the knowledge to write one from scratch. I can edit, though, if it's HTML or JSP. Just looking for some examples of pages that would ban.
QUOTE Probably most importantly, always (did I stress?) ALWAYS include information about why a bad robot was banned from your site
I assume you mean to put this information in some kind of return page if they have hit the banned page.
Lots of great info, all... thanks very much. |
|
|
|
Mar 3 2004, 01:20 PM
Post
#10
|
|
Vintage Babe Group: Moderator Posts: 4,142 Joined: 31-July 03 User's local time: Feb 9 2010, 12:02 PM From: Triangle area, NC, USA, Earth (usually) Member No.: 89 |
QUOTE(DavidatWork @ Mar 3 2004, 09:35 AM) Are you saying I should just put some commented text in my robots.txt file that says "hey, i have a bad bots trap and this is where it is" ?
I think what was meant was that the spider trap page should be disallowed in your robots.txt file. Spiders that obey robots.txt won't index the page; those that disobey robots.txt (bad spiders) will get "caught" by visiting a page they shouldn't.
QUOTE I assume you mean to put this information in some kind of return page if they have hit the banned page.
In this case, I believe that what was meant was this: when an agent (browser, spider, whatever) which has been banned visits your site, instead of just returning a generic error or showing nothing at all (I assume this would generate a "forbidden" error?), modify the error page to show why someone might have been banned and to give instructions for how to go about getting "unbanned". A spider will not be able to read the page and follow the instructions, whereas a human being who has "accidentally" gotten him/herself banned will.
My two cents. --Torka |
|
|
|
Mar 3 2004, 02:34 PM
Post
#11
|
|
HR 2 Group: Members Posts: 12 Joined: 27-February 04 User's local time: Feb 9 2010, 12:02 PM Member No.: 2,691 |
Makes cents .... I mean ..... sense to me.
thanks..... |
|
|
|
Mar 3 2004, 08:57 PM
Post
#12
|
|
|
HR 3 Group: Active Members Posts: 102 Joined: 20-September 03 User's local time: Feb 9 2010, 01:02 PM From: Ogden, Utah, USA Member No.: 856 |
QUOTE(torka @ Mar 3 2004, 02:20 PM)
QUOTE(DavidatWork @ Mar 3 2004, 09:35 AM) Are you saying I should just put some commented text in my robots.txt file that says "hey, i have a bad bots trap and this is where it is" ?
I think what was meant was that the spider trap page should be disallowed in your robots.txt page. Spiders that obey robots.txt won't index the page; those that disobey robots.txt (bad spiders) will get "caught" by visiting a page they shouldn't.
Your summary is very good; however, just to clarify, what I was saying was that some bad robots will look through a robots.txt file just to find the things they are not supposed to index, and then head straight to them. So, you'll catch a few robots by simply including an entry in your robots.txt file that points straight to a trap. This won't catch all or even most bad robots, but it's a quick way to find a few of them. I guess these would be the _really bad_ robots, who parse the robots.txt file just to find things to index. |
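To make the "robots.txt-only" trap concrete: if the bait page is linked from nowhere and appears only as a Disallow entry, then a client that fetched robots.txt and afterwards requested the bait almost certainly learned the URL from robots.txt itself. A minimal sketch, assuming you can already reduce your logs to chronological (ip, path) pairs; the bait path and the function name are invented for the example.

```python
# Hypothetical bait page: listed in robots.txt, linked from nowhere.
BAIT_PATH = "/archive/old-index.html"


def classify_bait_visitors(requests):
    """Map each IP that requested BAIT_PATH to whether it read robots.txt first.

    requests is an iterable of (ip, path) pairs in chronological order.
    True means the client fetched /robots.txt before hitting the bait,
    which suggests it mined the Disallow list for targets.
    """
    read_robots = set()
    result = {}
    for ip, path in requests:
        if path == "/robots.txt":
            read_robots.add(ip)
        elif path == BAIT_PATH and ip not in result:
            # Only the first visit to the bait is classified.
            result[ip] = ip in read_robots
    return result
```

Every IP in the result tripped the trap; the boolean only separates the robots that read the exclusion list and went there anyway from those that found the bait some other way.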
|
|
|
Mar 5 2004, 10:55 AM
Post
#13
|
|
|
HR 1 Group: Members Posts: 1 Joined: 5-March 04 User's local time: Feb 9 2010, 01:02 PM Member No.: 2,773 |
Can you tell me how you are going about tracking the bots using the transparent image? Also, how is it that you are not allowing the bots? I am obviously very new to all this. My understanding is that you can stop the bots with particular meta tags.
Thanks for any help you are willing to give me. Chris |
|
|
|
Mar 5 2004, 11:26 AM
Post
#14
|
|
HR 6 Group: Active Members Posts: 588 Joined: 5-August 03 User's local time: Feb 9 2010, 02:02 PM From: Massachusetts Member No.: 307 |
One factor to consider is that the invisible link is to a page on the same domain. Sites are full of links to pages on the same domain. It would be a strange search engine that considered that to be a spam link.
From the search engine spam perspective, what would be useful is analyzing the links to any given site to see whether they come from tiny image files on other sites. If they did, that would be suspicious. |
|
|
|
Mar 5 2004, 11:35 AM
Post
#15
|
|
Token male admin Group: Admin Posts: 1,436 Joined: 28-July 03 User's local time: Feb 9 2010, 05:02 PM From: UK Member No.: 45 |
QUOTE(cluksha @ Mar 5 2004, 03:55 PM) Can you tell me how you are going about tracking the bots using the trans image?
Welcome to the forums, cluksha. Anybody that arrives at the page linked to from the transparent image is assumed to be a bot. You then take a note of its IP address, ready for the next stage ...
QUOTE Also how it is that you are not allowing the bots?
You can configure your server to bar requests from certain IP addresses and ranges. Use the IP address you discovered at the previous stage to do this.
QUOTE(cline) Sites are full of links to pages on the same domain. It would be a strange search engine to consider that to be a spam link.
Hidden links to pages on the same domain are often used for spam. For example, it's a means of getting thousands of doorway pages indexed. |
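Barring "certain IP addresses and ranges" is normally done in the web server's own configuration, which depends on the server you run. Purely to illustrate the matching logic, here is a small Python sketch that tests a client address against a list of single addresses and ranges. The addresses are reserved documentation ranges used as placeholders, not real offenders, and `is_blocked` is a hypothetical helper name.

```python
import ipaddress

# Placeholder entries only: a single address (/32) and a whole range (/24).
BLOCKED = [
    ipaddress.ip_network(entry)
    for entry in ("203.0.113.9/32", "198.51.100.0/24")
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked address or range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in network for network in BLOCKED)
```

A single trapped address goes in as a /32; if the same robot keeps returning from neighboring addresses, widening the entry to a range covers them in one rule, at the cost of possibly blocking innocent visitors in that range.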
|
|
|