
The Potential For Unintended Abuse From Udacity


#1 Michael Martinez

  • Active Members
  • 5,325 posts
  • Location: Georgia

Posted 10 April 2012 - 04:45 PM

If you haven't heard about them, Udacity is a startup that is offering college-level courses for free to anyone. The courses are all online and are built on a technology that has handled up to 160,000 students in a class.

One of the courses Udacity offers is on designing a search engine, which includes building a crawler. I looked at the FAQ for the class last week -- thinking it might be a useful resource for people who want to learn about advanced SEO -- when I realized that Udacity has apparently made no effort to advise Webmasters on how to recognize or manage traffic that may come from Udacity student crawlers.

One of the students told me on Twitter that Udacity asks them to be polite and not hammer the servers -- but it is evident to me that no one is adding up the numbers here. Udacity doesn't explain where the crawlers get their seed sets from. If they all start from the same place then many sites will be hit hard by a lot of crawlers following the same paths.
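To make the politeness point concrete, here is a minimal sketch (in Python, which the course apparently uses) of what a well-behaved student crawler would do: honor robots.txt and space out its requests. The robots.txt rules, delay value, and user-agent name below are my own invented examples, not anything Udacity prescribes.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

CRAWL_DELAY = 2.0  # minimum seconds between requests to the same host

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

def may_fetch(url, user_agent="student-crawler"):
    """Return True only if the robots.txt rules permit fetching this URL."""
    return rules.can_fetch(user_agent, url)

def polite_pause(last_request_time):
    """Sleep long enough that consecutive requests are CRAWL_DELAY apart."""
    elapsed = time.time() - last_request_time
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)

print(may_fetch("http://example.com/index.html"))     # -> True (allowed)
print(may_fetch("http://example.com/private/x.html")) # -> False (disallowed)
```

Nothing exotic here -- it's all in the Python standard library -- which is exactly why there is no excuse for a class of thousands not to be told to do it.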

Because Udacity does not specify a user-agent with which students should identify their crawlers, it leaves room for students to create all sorts of user-agent names (although if they leave the default in place, it will be something like Python-urllib/2.7).
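For comparison, here is how a student could set an identifiable user-agent on each request using only the Python standard library. The crawler name and contact URL are hypothetical -- the point is that webmasters could then recognize and filter the traffic:

```python
import urllib.request

# Hypothetical identifying string; Udacity does not prescribe one.
UA = "UdacityStudentCrawler/0.1 (+http://example.com/crawler-info)"

# Attach the custom User-Agent header instead of the default Python-urllib one.
req = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": UA},
)

# urllib normalizes header names to capitalized form internally.
print(req.get_header("User-agent"))
```

One line of code per request, and every server operator on the receiving end would know what hit them.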

So who is most likely to feel the impact of this relentless crawling? People who lease and co-locate their own servers. I suspect that people who lease space on shared servers and/or Cloud services may be protected by measures their hosting ISPs take when they detect heavy alien crawling. People who, like me, are responsible for managing their own server configurations have to watch out for this activity themselves.
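For those of us watching our own servers, this is the kind of quick check I mean: tally requests per user-agent from an Apache/nginx combined-format access log to surface heavy, unidentified crawlers. The log lines below are fabricated samples.

```python
import re
from collections import Counter

# Fabricated combined-log-format sample lines for illustration.
SAMPLE_LOG = [
    '1.2.3.4 - - [10/Apr/2012:16:45:00 -0400] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '5.6.7.8 - - [10/Apr/2012:16:45:01 -0400] "GET /a HTTP/1.1" 200 512 "-" "Python-urllib/2.7"',
    '5.6.7.8 - - [10/Apr/2012:16:45:02 -0400] "GET /b HTTP/1.1" 200 512 "-" "Python-urllib/2.7"',
]

# In the combined format, the final quoted field is the user-agent.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def count_user_agents(lines):
    """Tally how many requests each user-agent string made."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

counts = count_user_agents(SAMPLE_LOG)
print(counts.most_common(5))
```

Run something like this against a real access log and any generic Python-urllib signature racking up hundreds of hits stands out immediately -- at which point you can decide whether to throttle or block it.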

There are already many, many more crawlers out there than I can possibly identify. These students are not really trying to build their own companies (which might some day offer me value in return for allowing them to crawl my servers); the idea is that they will be able to get jobs working for search engines around the world.

A typical Web server may allow 100-200 concurrent connections by default, and most people never change that number. A swarm of student crawlers could therefore tie up connections and block a significant amount of legitimate traffic on some sites.

While I feel that the number of sites adversely impacted by this activity at any given time will be relatively small, it's not creating value for the Webmasters. I want people to be aware of this issue because we cannot predict the consequences of Udacity's actions. Udacity is being promoted by TechCrunch, and one of its founders came out of Stanford University. I don't expect these guys to be fly-by-night do-nothings who come and go quickly; I suspect they will be around for a while.

#2 Michael Martinez

Posted 10 April 2012 - 06:12 PM

From a student who just Tweeted me:

@seo_theory Udacity told us not to go rogue, used test sites, not real Internet, discussed politeness, cost, had Google guest speaker, etc.

My reply:

@priscillaoppy Your explanation is reassuring but @udacity needs to be more explicit in its FAQ what it does to protect webmasters.

Edited by Michael Martinez, 10 April 2012 - 06:12 PM.

