I am with Randy here, inasmuch as calculating possibilities and odds is where computers excel, and spamming the SEs is all about the laws of probability.
It is as easy as this: choose 6 numbers from 1 to 49 (UK lottery) and the odds against them coming up are about 14 million to one. However, the odds against them coming up in the same order as you chose them are far longer still, roughly 10 billion to one.
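For anyone who wants to check the lottery arithmetic, here is a quick Python sketch. It only assumes the 6-from-49 draw described above; the variable names are mine.

```python
from math import comb, perm

# UK lottery as described above: 6 numbers drawn from 1 to 49.
any_order = comb(49, 6)    # the six numbers come up, order ignored
exact_order = perm(49, 6)  # the six numbers come up in the order you chose

print(f"any order:   1 in {any_order:,}")
print(f"exact order: 1 in {exact_order:,}")
print(f"insisting on the order makes it {exact_order // any_order} times harder")
```

The factor between the two is 6! = 720, the number of ways to arrange six numbers.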
With that in mind, the odds against a stray duplicate phrase or two appearing by chance are low, but the odds against the same 200 words appearing on a page in a similar pattern are again massive, so result = cheating. Factor in things that move the barrier, such as news sites being MORE likely to have dupe content, syndication and PR companies the same, forums also, and you have a workable system.
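To put a rough number on "massive", here is a toy calculation. It assumes every word is picked independently and evenly from a 10,000-word vocabulary, which real writing certainly is not (common phrases repeat all the time), so treat it as an illustration of scale only. The function name and the vocabulary size are made up for the example.

```python
from math import log10

def chance_match_log10_odds(run_length: int, vocab_size: int) -> float:
    """log10 of the odds against two writers independently producing the
    same run of words, under a crude model: each word is chosen uniformly
    and independently from vocab_size words."""
    return run_length * log10(vocab_size)

# Assumed vocabulary of 10,000 words; short phrase versus a 200-word block.
for n in (3, 10, 200):
    exponent = chance_match_log10_odds(n, 10_000)
    print(f"{n:>3} words in the same order: about 1 in 10^{exponent:.0f}")
```

Even if the real figure is out by hundreds of orders of magnitude, a 200-word match is not something chance produces.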
They cannot spot duplicate content as such; they just calculate the odds, and any page that goes past those odds gets removed. If, on the other hand, too many pages are getting through, they can shorten the odds. Many pages will be caught in the 'trawl', but what are the odds of them not knowingly having duplicate content, be it through theft or deception?
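A minimal sketch of the kind of system I mean, in Python. I am not claiming this is what any search engine actually runs: the 5-word runs, the threshold numbers and the site types are all invented for illustration. "Shortening the odds" here would just mean lowering the numbers in THRESHOLDS.

```python
def shingles(text: str, size: int = 5) -> set:
    """Every run of `size` consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def overlap(page: str, other: str, size: int = 5) -> float:
    """Fraction of the page's word runs that also appear in the other page."""
    a, b = shingles(page, size), shingles(other, size)
    return len(a & b) / len(a) if a else 0.0

# Hypothetical barriers: news and forums are allowed more shared text.
THRESHOLDS = {"news": 0.8, "forum": 0.6, "default": 0.3}

def is_flagged(page: str, other: str, site_type: str = "default") -> bool:
    """True when the shared fraction goes past the barrier for this site type."""
    limit = THRESHOLDS.get(site_type, THRESHOLDS["default"])
    return overlap(page, other) > limit

original = "the quick brown fox jumps over the lazy dog near the river bank today"
page = "the quick brown fox jumps over the lazy dog sat down and slept all day"

print(f"shared runs: {overlap(page, original):.2f}")
print("ordinary site flagged:", is_flagged(page, original))
print("news site flagged:    ", is_flagged(page, original, "news"))
```

The same page is caught on an ordinary site but let through on a news site, which is the barrier moving I was talking about.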
That is my simplistic view of it anyhow. I am sure others here will tidy it up, correct it, and present it in some form of intelligent copy.