Identifying Duplicate Pages
Posted 08 April 2004 - 07:31 AM
For example, I create 2 websites (separate domains) with the same template, footer, and header for both, and the content goes like this.
1st Domain:
In today's world everyone searches for money....
2nd Domain:
In this world everyone searches for money.....
Now humans know this is a duplicate page, but how does a bot know that it is a duplicate page?
Posted 08 April 2004 - 08:08 AM
Add into the mix that they can also factor in Whois domain ownership information, host/IP information, etc., and it becomes fairly easy to flag content that is essentially duplicate. Flag it for another look by the algorithm, or flag it to be sent to a real, live person to review.
Obviously they're not perfect at it yet. If they were, we'd never be able to use a search engine to find the content thieves. And we can most certainly still do that.
Posted 08 April 2004 - 08:19 AM
It is as easy as this: choose 6 numbers from 1 to 49 (UK lottery) and the odds of them coming up are 14 million to one. However, the odds of them coming up in the same order as you chose them are astronomically higher.
With that in mind, the odds of the odd duplicate phrase appearing by chance are reasonable, but the odds against the same 200 words appearing on a page in a similar pattern are again massive; result = cheating. Factor in barrier-moving considerations, such as news sites being MORE likely to have legitimate duplicate content, likewise syndication and PR companies, and forums too, and you have a workable system.
They cannot spot duplicate content as such; they just calculate the odds, and any page that passes those odds gets removed. If, on the other hand, too many pages are getting through, they can shorten the odds. Many pages will be caught in the trawl, but what are the odds of those pages not knowingly having duplicate content, be it through theft or deception?
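To make that concrete, here is a rough Python sketch of the "calculate the odds and apply a threshold" idea. The shingle size and the 0.6 threshold are numbers I made up purely for illustration, not anything the engines actually use.

def shingles(text, n=5):
    # Break a page into overlapping n-word "shingles".
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(text_a, text_b, n=5):
    # Jaccard similarity of the two shingle sets, from 0.0 to 1.0.
    a, b = shingles(text_a, n), shingles(text_b, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# The threshold is the adjustable part: lower it to catch more pages,
# raise it to let more through. 0.6 is an arbitrary illustration.
THRESHOLD = 0.6

page1 = "In todays world everyone searches for money and wants to get rich quickly"
page2 = "In this world everyone searches for money and wants to get rich quickly"

if overlap_score(page1, page2) > THRESHOLD:
    print("likely duplicate")
else:
    print("probably distinct")

The point is only that the similarity is a number, and the cut-off can be moved whenever too much or too little is getting caught.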
That is my simplistic view of it anyhow; I am sure others here will tidy it up, correct it, and present it in some form of intelligent copy.
Posted 08 April 2004 - 01:14 PM
... the non-disclosure agreements that Google employees have to sign must be enormous!
Posted 08 April 2004 - 01:19 PM
Posted 08 April 2004 - 01:33 PM
When Google returns results, it doesn't use the page description as the description in the SERPs; it uses a snippet of text that surrounds the search terms you are looking for, then includes the title.
It uses the same technology to detect duplication. The more content there is on the page, the longer the snippet it uses. They then compare the snippets to see how similar they are. The threshold can be changed by Google depending on how strict they want to be, so there is no number I could give you for that part.
The advantage of this system is that if you have a page, for example, that quotes heavily from a particular source (or is a reprint of an article or parts of it), it can detect that and then use other methods (usually PR) to present the most authoritative version.
Note this is not a penalty, it's a filter: there's no sense in showing the same article over and over again just because it's popular; just show the main one.
Bottom line: it's a text snippet based on your search term, rather than a page-by-page comparison. This can result in the same 2 pages being considered duplicates for one search term and separate documents for another.
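To illustrate what I mean, here is a rough Python sketch of a query-dependent snippet comparison. The window size, sample pages, and scoring are my own invention, and this is only my guess at the idea, not Google's actual code.

import difflib

def snippet(text, term, window=4):
    # Up to `window` words either side of the first word containing `term`.
    words = text.split()
    term = term.lower()
    hit = next((i for i, w in enumerate(words) if term in w.lower()), None)
    if hit is None:
        return ""
    return " ".join(words[max(0, hit - window):hit + window + 1])

page_a = "Cheap loans here. Everyone searches for money in todays world. Our mortgage rates beat the banks."
page_b = "Insurance deals daily. Everyone searches for money in todays world. Call now for cheap cover."

# The higher the ratio, the more alike the two snippets look; a search
# engine would compare it against some adjustable threshold.
for term in ("money", "cheap"):
    a, b = snippet(page_a, term), snippet(page_b, term)
    ratio = difflib.SequenceMatcher(None, a, b).ratio()
    print(term, round(ratio, 2))

The same two pages score very differently depending on which term you pull the snippets around, which is exactly why they can be duplicates for one query and not for another.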
Remember, this is Google only; I'm not certain about the others, but I imagine they are similar.
Posted 10 April 2004 - 10:32 PM
The question always at the back of my mind is: as other sites start to reproduce my article, will the original article on my site (or the whole site itself) be flagged as duplicate content and get penalised?
Posted 11 April 2004 - 02:04 PM
Also, the duplicate content filter must either be off or set to a very high threshold. Many blogs around the web use archive pages; often both the original article and the archived one appear initially in the search results.
This DevShed article gives you a nice idea of one way you can use similarity checks in your own systems.
The longest common substring between two strings is the longest contiguous chain of characters that exists in both strings. The longer the substring, the better the match between the two strings. This simple approach can work very well in practice.
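As a quick illustration (my own sketch, not the code from the DevShed article), the Python standard library can find the longest common substring for you:

from difflib import SequenceMatcher

def longest_common_substring(a, b):
    # The longest contiguous run of characters appearing in both strings.
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]

s1 = "In todays world everyone searches for money"
s2 = "In this world everyone searches for money"
common = longest_common_substring(s1, s2)
print(common)                                # the shared run of text
print(len(common) / max(len(s1), len(s2)))   # a crude similarity score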
There are numerous ways of doing similarity calculations though.
I assume that the fresh index is less strict on similarity checking, as they want to get it out there ASAP. For the full index they must have a bit more time, so they can afford some more calculations.
Reasoning back from the user interface, I would say (guess) that Google must be getting better at spotting similar content: stemming, semantics, search-phrase error checking. In which way they apply that knowledge is not known to me.
Note: thinking out loud here. None of the above reflects absolute or assumed knowledge of Google's ways to spot duplicate content.