
Identifying Duplicate Pages


8 replies to this topic

#1 suthra

suthra

    HR 4

  • Active Members
  • 134 posts

Posted 08 April 2004 - 07:31 AM

Hi, how does a search engine like Google identify a duplicate page?

For example, I create 2 websites (separate domains) with content like this:

Same template, footer, and header for both domains, and the content goes like this.
1st Domain :
In todays world everyone searches for money....

2nd Domain

In this world Everyone searches for money.....

Now humans know this is a duplicate page, but how does a bot know that it is a duplicate page?


:raspberry:

Plz reply.,
Regds.,
sarathy.s

#2 Randy

Randy

    Convert Me!

  • Moderator
  • 17,540 posts

Posted 08 April 2004 - 08:08 AM

No one can tell you exactly how the various search engines identify duplicate content, suthra, since none of us work for any of the search engines (to my knowledge). However, programmatically speaking, it's not all that hard to scan a database and look for patterns. That's what computers are very good at, actually.

Add into the mix that they can also factor in Whois domain ownership information, host/IP information, etc., and it can become fairly easy to flag content that is essentially duplicate. Flag it for another look by the algorithm, or flag it to be sent to a real, live person to review.

Obviously they're not perfect at it yet. If they were, we'd never be able to use a search engine to find the content thieves. And we can most certainly still do that.

#3 OldWelshGuy

OldWelshGuy

    Work is Fun

  • Moderator
  • 4,713 posts
  • Location:Neath, South Wales, UK

Posted 08 April 2004 - 08:19 AM

I am with Randy here, in as much as calculating possibilities and odds is where computers excel, and spamming the SEs is all about laws of probability.

It is as easy as this: choose 6 numbers from 1 to 49 (UK lottery) and the odds of them coming up are about 14 million to one. However, the odds against them coming up in the same order as you chose them are far higher still.
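To put rough numbers on the lottery comparison, here is a small sketch. It only works out the arithmetic for picking 6 numbers from 49, with and without order mattering; it says nothing about how any search engine actually scores pages.

```python
import math

# Unordered draw: any 6 numbers out of 49, order ignored.
unordered = math.comb(49, 6)

# Ordered draw: the same 6 numbers must also come up in the chosen order.
ordered = math.perm(49, 6)

print(f"6 from 49, any order:   1 in {unordered:,}")
print(f"6 from 49, exact order: 1 in {ordered:,}")

# Every set of 6 numbers can be arranged in 6! = 720 different orders.
print(f"ordered is {ordered // unordered} times less likely")
```

The ordered case is 720 times less likely than the unordered one, which is the point of the analogy: the longer and more exact the matching pattern, the less plausible it is that the match happened by chance.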

With that in mind, the odds of duplicate content, phrases etc. appearing are low, but the odds against the same 200 words appearing on a page in a similar pattern are again massive; result = cheating. Filter in barrier-moving factors, such as news sites being MORE likely to have dupe content (syndication and PR companies the same, forums also), and you have a workable system.

They cannot spot duplicate content as such; they just calculate the odds, and any page that exceeds those odds gets removed. If, on the other hand, too many pages are getting through, they can shorten the odds. Many pages will be caught in the 'trawl', but what are the odds of them not knowingly having duplicate content, be it through theft or deception? :raspberry:
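One common way to measure "the same words in a similar pattern" is to compare overlapping word sequences (shingles) and then apply an adjustable cut-off. The sketch below is only an illustration of that idea, using the two sentences from the opening post; the function names, the shingle size of 3, and the 0.4 cut-off are all assumptions made up for the example, not anything a search engine has published.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Shared shingles divided by all distinct shingles (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two texts too short to shingle are treated as equal
    return len(sa & sb) / len(sa | sb)


page_one = "in todays world everyone searches for money"
page_two = "in this world everyone searches for money"
unrelated = "the quick brown fox jumps over the lazy dog"

THRESHOLD = 0.4  # hypothetical cut-off; lower it to catch more pages

for other in (page_two, unrelated):
    score = jaccard(page_one, other)
    verdict = "flag" if score >= THRESHOLD else "pass"
    print(f"{score:.2f} {verdict}: {other}")
```

Moving THRESHOLD up or down is the "shortening the odds" knob: a lower value flags more pages, including more innocent ones caught in the trawl.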

That is my simplistic view of it anyhow, I am sure others here will tidy it up, correct it, and present it in some form of intelligent copy :rant:

#4 sherri

sherri

    Perceptum et Invenio

  • Active Members
  • 89 posts
  • Location:Lost in Canada

Posted 08 April 2004 - 01:14 PM

I have a similar problem. I think it has to be a certain % identical. I've been changing page titles and a few paragraphs to help with the problem. Amazing that my pages were shown as identical, since they are different Real Content pages, each with a different paragraph of text and a handful of unique links. Oh well.


... the non-disclosure agreements that the Google employees must have to sign must be enormous!

#5 Grumpus

Grumpus

    HR 6

  • Active Members
  • 786 posts

Posted 08 April 2004 - 01:19 PM

Sherri - you're right - page titles are a big element here when it comes to these filters. As you're surfing around, find a site (usually a dynamic one) where the page titles are identical on every page. Then do the same experiment I showed above. You'll find FAR fewer pages listed before you get to that "click here for unfiltered results" page.

G.

#6 mcanerin

mcanerin

    HR 7

  • Active Members
  • 2,242 posts
  • Location:Calgary, Alberta, Canada

Posted 08 April 2004 - 01:33 PM

Actually, what Google does is detect duplication based on the search term.

When Google returns results, it doesn't use the page description as the description in the SERPs. It uses a snippet of text that surrounds the search terms you are looking for, then includes the title.

It uses the same technology to detect duplication. The more content there is on the page, the longer the snippet they use. They then compare the snippets to see how similar they are. The threshold can be changed by Google depending on how much they want to filter, so there is no number I could give you for that part.

The advantage to this system is that if you have a page, for example, that quotes heavily from a particular source (or is a reprint of an article or parts of it), this can detect that and then use other methods (usually PR) to present the most authoritative version.

Note this is not a penalty - it's a filter - no sense showing the same article over and over again just because it's popular - just show the main one.

Bottom line - it's a text snippet based on your search term, rather than a page by page comparison. This can result in the same 2 pages being considered duplicates for one search term and separate documents for another.
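A rough sketch of that snippet idea, to show how the same two pages can come out as duplicates for one search term and as separate documents for another. Everything here is an assumption made for illustration: the function names, the 2-word window, the 0.9 threshold, and the use of Python's difflib for the comparison. It is not a description of Google's actual code.

```python
from difflib import SequenceMatcher


def snippet(text, term, radius=2):
    """Words within `radius` words of the first hit for `term`, else None."""
    words = text.lower().split()
    term = term.lower()
    if term not in words:
        return None
    i = words.index(term)
    return " ".join(words[max(0, i - radius): i + radius + 1])


def looks_duplicate(page_a, page_b, term, threshold=0.9):
    """Compare only the snippets around the search term, not whole pages."""
    a, b = snippet(page_a, term), snippet(page_b, term)
    if a is None or b is None:
        return False  # the term is missing from one of the pages
    return SequenceMatcher(None, a, b).ratio() >= threshold


# Two made-up pages: different intros, same reprinted sentence at the end.
page_a = ("our review of garden tools for spring planting season "
          "the steel spade cuts through heavy clay soil with ease")
page_b = ("a short review of pasta recipes from my kitchen "
          "the steel spade cuts through heavy clay soil with ease")

print(looks_duplicate(page_a, page_b, "spade"))   # term sits in the shared text
print(looks_duplicate(page_a, page_b, "review"))  # term sits in the unique text
```

A search for "spade" lands inside the reprinted sentence, so the snippets match; a search for "review" lands in the part each page wrote for itself, so they do not.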

Remember, this is Google only, I'm not certain of the others, but I imagine they are similar.

Ian

#7 OldWelshGuy

OldWelshGuy

    Work is Fun

  • Moderator
  • 4,713 posts
  • Location:Neath, South Wales, UK

Posted 08 April 2004 - 01:56 PM

This is a good thread :)

#8 hugeaffiliates

hugeaffiliates

    HR 1

  • Members
  • 6 posts

Posted 10 April 2004 - 10:32 PM

I've written a few articles and put them on my site, and have submitted them for free reprint to other sites and ezines because of the extra referral traffic and links they could bring...

The question always at the back of my mind is: as other sites start to reproduce my article, will the original article on my site (or the whole site itself) be flagged as duplicate content and get penalised?

#9 Ruud

Ruud

    HR 4

  • Active Members
  • 129 posts
  • Location:Rimouski, Canada (Quebec)

Posted 11 April 2004 - 02:04 PM

Interesting indeed. Especially as at times I get frustrated when a slew of results is all the same Amazon feed. No wording changed, just yet another Amazon affiliate shop. In fact, on one of my own sites I often would get people coming in on a product search. Worse (read: weirder) at times that page would rank above Amazon's own page. Without optimization done that is odd, right?

Also, the duplicate content filter has to be off or on a very high threshold. Many blogs around the web use archive pages; often the original article and the archived one both appear initially in the search results.

This DevShed article gives you a nice idea of how you can use similarity checks in your own systems.

The longest common substring between two strings is the longest contiguous chain of characters that exists in both strings. The longer the substring, the better the match between the two strings. This simple approach can work very well in practice.
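As a minimal sketch of the longest common substring idea quoted above (my own illustration, not the code from the DevShed article), here is a standard dynamic-programming version applied to the two sentences from the opening post. The function name is made up for the example.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous run of characters found in both strings."""
    best_len, best_end = 0, 0
    # prev[j] = length of the common run ending at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended one character earlier in both.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


first = "In todays world everyone searches for money".lower()
second = "In this world Everyone searches for money".lower()

common = longest_common_substring(first, second)
print(repr(common))
print(f"{len(common)} of {len(first)} characters are in the shared run")
```

The longer the shared run is relative to the length of the texts, the stronger the match. This version takes time proportional to the product of the two lengths, so it suits comparing a pair of pages rather than a whole index.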


There are numerous ways of doing similarity calculations though.

I assume that the fresh index is less strict on similarity checking as they want to get it out there asap. For the full index they must have a bit more time so they can afford some more calculations.

Reasoning back from the user interface I would say (guess) that Google must be getting better at spotting similar content. Stemming, semantics, search phrase error checking. In which way they apply that knowledge is not known to me.

Ruud

Note: thinking out loud here. None of the above reflects absolute or assumed knowledge of Google's ways to spot duplicate content.



