I watch a couple of football games and do the "sleep" thing and look - the thread goes crazy. Cool.

Good stuffs in here.
[quote name='"Scottie"']And, how do they determine trustworthiness of the sources? Are we back to PR ratings?[/quote]
Topic-sensitive PR could do it for the sectors where topics have been defined. LocalRank could play into it as well - though obviously the search would have to be run first to generate the "term/LocalRank relationship". Google is also tracking clicks from time to time nowadays. Many speculate that this is going to give sites that get lots of traffic a boost in the ranks. I don't think so (well, it will, sort of). Those clicks (if the user doesn't go back to the same set of SERPs and click something else) help to determine satisfaction. If I don't go back to the SERPs after visiting a site, I most likely got my answer, and therefore that site is a good source of information on that subject.
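To make the "satisfaction" idea concrete, here's a rough Python sketch. Everything in it is my own illustration: the click-log format and the `satisfaction_rate` name are made up, and nobody outside Google knows what they actually record or how they use it.

```python
from collections import defaultdict

def satisfaction_rate(click_log):
    """click_log: list of (session_id, query, url) tuples in click order.

    Hypothetical rule: a click is "satisfied" when it is the last click
    in that session for that query, i.e. the user did not come back to
    the same SERP and pick something else.
    """
    last_click = {}              # (session, query) -> last url clicked
    clicks = defaultdict(int)    # url -> total clicks received
    for session, query, url in click_log:
        last_click[(session, query)] = url
        clicks[url] += 1

    satisfied = defaultdict(int)  # url -> clicks that ended the search
    for url in last_click.values():
        satisfied[url] += 1

    # Share of each url's clicks that were not followed by another click.
    return {url: satisfied[url] / clicks[url] for url in clicks}
```

A site that gets lots of clicks but keeps sending people back to the SERPs scores low here, which is why raw traffic alone wouldn't be the boost people think it is.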
[quote name='"Scottie"']Heck, we've now got two pages on the web that appear to confirm that Ray Charles is love.[/quote]
If there are enough sites out there so that a pattern emerges, then yes, that assumption could be made. Chances are, though, that the word love is going to appear contextually before "Ray Charles" as in "I love Ray Charles' music" more often than anything else on those pages. Pages that actually have the example you gave are likely humor type sites and there will be no other mention of Ray Charles, so you're not going to get the confirming numbers needed.
The comparison of two documents just won't cut it. If I'm a baseball player and I play in two games and have a batting average of .400, it is a sign that I might be a good player. If you've played in 500 games and have an average of .300, you're definitely a good player, and even though that's less than .400, smart money is going to give you the advantage when we go head to head in a batting derby.
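If you want to see the batting average point in numbers, here's a small Python sketch using the Wilson score lower bound, which is one standard way to ask "how good is this rate at minimum, given how little data I have?". This is my choice of statistic for illustration, not anything Google has said it uses, and I'm counting at-bats rather than games since that's what an average is measured over.

```python
import math

def wilson_lower_bound(hits, at_bats, z=1.96):
    """Lower end of the ~95% Wilson score interval for hits / at_bats."""
    if at_bats == 0:
        return 0.0
    p = hits / at_bats
    z2 = z * z
    centre = p + z2 / (2 * at_bats)
    margin = z * math.sqrt(p * (1 - p) / at_bats + z2 / (4 * at_bats ** 2))
    return (centre - margin) / (1 + z2 / at_bats)

# Assumed numbers: a .400 hitter with 10 at-bats vs a .300 hitter with 2000.
rookie = wilson_lower_bound(4, 10)
veteran = wilson_lower_bound(600, 2000)
print(veteran > rookie)
```

The veteran's lower bound comes out well above the rookie's even though the raw average is lower. Same deal with pages: two documents agreeing isn't a pattern, a few thousand is.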
In the LocalRank Arena:
[quote name='"Scottie"']Standard PR is used to pull the top 1000 (or whatever number they choose) results.[/quote]
The number is likely much lower with respect to LocalRank, but that's pretty much the gist of it. 1000 is the total number of documents in the set that are ranked, but since people are only going to look at the first couple of pages (as a general rule), LocalRank is likely calculated on a smaller number to save on resources. Once you get down below 100 or 250 or whatever, the bonus you get from this won't move you close enough to the top to make it worth doing the math anyway.
There are three steps to Google's calculations on a query.
Step 1: Google comes up with an unsorted list of relevant documents.
Step 2: Google sorts these documents using PR, keyword existence/density, etc. This phase deals mainly with the "on-page" factors. It then takes the top 1000 documents (that number we know is true).
Step 3: Off-page factors kick in on the top X pages so that they can be re-sorted amongst themselves - inbound link text, LocalRank, and all of the stuff that deals with the "relationships between pages".
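The three steps can be sketched in Python like this. Big caveat: the scoring formulas, the document fields (`id`, `text`, `pr`, `inlinks`) and the `rerank_n` cutoff are all invented for the sketch. It only shows the shape of the process, not Google's actual math.

```python
def rank(query_terms, docs, top_n=1000, rerank_n=100):
    """docs: list of dicts with 'id', 'text', 'pr' and 'inlinks' (ids linking in)."""
    terms = [t.lower() for t in query_terms]

    # Step 1: unsorted list of documents that mention any query term.
    relevant = [d for d in docs
                if any(t in d["text"].lower().split() for t in terms)]

    # Step 2: sort on PR and on-page factors (here: plain keyword density),
    # then keep the top_n. The pr * density formula is a made-up stand-in.
    def on_page(d):
        words = d["text"].lower().split()
        density = sum(words.count(t) for t in terms) / len(words)
        return d["pr"] * density

    ranked = sorted(relevant, key=on_page, reverse=True)[:top_n]

    # Step 3: re-sort only the top rerank_n among themselves, using links
    # that come from other documents in that same top set (LocalRank-style).
    head, tail = ranked[:rerank_n], ranked[rerank_n:]
    head_ids = {d["id"] for d in head}

    def local_votes(d):
        return sum(1 for src in d["inlinks"]
                   if src in head_ids and src != d["id"])

    head = sorted(head, key=lambda d: on_page(d) * (1 + local_votes(d)),
                  reverse=True)
    return head + tail
```

Note that anything below `rerank_n` keeps its step 2 position, which is the "not worth doing the math" point from above.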
[quote name='"Torka"']I'm just glad I don't have to work out the math behind it. I've got enough trouble with three dimensions, nevermind thousands.[/quote]
Amen to that! There probably even comes a point where too much knowledge/understanding of it is going to be detrimental as you'll end up second guessing yourself all the time.

The important thing is to just have an understanding of the concepts and the potential of the technology. Looking around the forums and talking to my SEO pals since the Florida update, there is one finding that is constant.
Those of us who understood and had been employing tactics to accommodate Semantics, LocalRank, TSPR, and other "up and coming" elements related to those (all of which we've known about since late 2002 and early 2003) didn't report any major problems. Sure, maybe something lost a page or two in the SERPs, but the stuff either came back on its own or a minor adjustment fixed it. Dropping a page or two can happen with any update that employs an algo change - heck, that's what the algo change is designed to do - change the results.
Those who didn't know about these technologies or chose to put them in the "This isn't going to help me now, I'll worry about it when it happens" folder had some sites that were relatively unaffected and some sites that vanished off the face of the earth. (Most likely they ended up on Mars because it's warmer than Rochester.)
[quote name='"Torka"']the W3C document I managed to absorb before brain failure, RDF and XML have something to do with each other.[/quote]
Think of RDF (Resource Description Framework) as the foundation. Think of XML (eXtensible Markup Language) as the formatting language used to deliver it.
HTTP (Hypertext Transfer Protocol) is to HTML (Hypertext Markup Language) as RDF is to XML.
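Here's a small Python example of what that looks like in practice. The XML below is just the wrapper; the RDF is the set of "this page has this property with this value" statements you get out of it. The page URL, title, and author are made up. The two namespaces are the real RDF and Dublin Core ones.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# A made-up page described in RDF, delivered as XML.
RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.example.com/anteaters.html">
    <dc:title>Anteater Biology</dc:title>
    <dc:creator>Jane Example</dc:creator>
  </rdf:Description>
</rdf:RDF>"""

def triples(rdf_xml):
    """Return (subject, predicate, object) statements from simple RDF/XML.

    Only handles the flat rdf:Description form shown above, not the
    full RDF/XML syntax.
    """
    root = ET.fromstring(rdf_xml)
    out = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree writes tags as "{namespace}name"; join them into a URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            out.append((subject, predicate, prop.text))
    return out

for statement in triples(RDF_XML):
    print(statement)
```

No guessing involved for the machine reading it: the title is the title because the feed says so, which is exactly why it suits the "specific application" cases.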
[quote name='"Torka"']we don't have to convert everything over to XML (at least not right away) in order for this to work.[/quote]
Nope. We don't. And we probably won't. Though for specific applications (like that marketplace thing described in Vijay's article, Froogle, etc.) it makes sure that there are no errors, problems, bad assumptions. When it comes to specific applications, there is little room for error because folks are looking to perform a specific task and get 100% accurate results.
Searching the web is different. Because it's all-encompassing and it delivers "everything", there is simply no way to deliver 100% accurate results - unless you were to require everyone to provide a formatted feed - but then you end up not delivering "everything" anymore.
(Wanna get rich? Start a search engine that uses an RDF feed from all its sources and bill it as an "accurate" and not a "comprehensive" engine. Do it fast, though - everyone and their brother will be doing it soon!).
[quote name='"Torka"']So, now, how does one actually optimize RIGHT NOW (and in the next year or two) under such a scenario?[/quote]
See the above two answers. As far as the search engines go (at least with Google and Ink and the other players that are on the "search the entire web" track), it won't be critical to know XML. (At least I don't think so.) Since Froogle started out (which is the best way to gauge Google's success in this area), Google has made huge strides in extracting the proper information from a page.
We can see AllTheWeb playing with this type of technology, too. If you do a search, you'll notice that they provide two snippets from the page. The first is the same type of snippet that Google provides - it just grabs the keywords and shows you the phrase(s) near one instance of the term. Then you have the description section. This is a complete sentence that describes the content of the page (in most cases). These descriptions may or may not have the keywords in them. In some cases, they may be the DMOZ description of the site, but in most (since most "pages" don't have a DMOZ listing) it's extracted right from the page.
Example:
AllTheWeb Search for "anteater biology". Except for the USGS Usage Stats page, the descriptions are summary sentences, there are no instances of the word "anteater" in them, and only 1 has the word "biology". Yet each provides an excellent description of what you are going to find on the page.
How did they do it? Well, they found the "element" on the page that, through semantic pattern recognition, best describes the overall topic of the page. In this case, though, the semantics aren't applied as they relate to the search term, but to identifying the overall focus of the page.
This is done using the DMOZ RDF dump as the "seed", but the end results are extrapolations from that.
Pretty neat stuff, huh?
[quote name='"Scottie"']Based on the very interesting Middlebury.edu article you posted, the words relative to each other don't matter. All that matters is frequency and "uniqueness" in the indexing. The words are stripped to their stem version, counted and mapped, then normalized (density adjustments) and compared to pages with the same words.[/quote]
Pretty much - but the foundation of the whole concept is based on being able to do this by extracting these "concepts" from "natural language". In highly competitive areas where the norm is to highly optimize a page and to focus on keywords rather than natural language - yes, keyword stuffed pages would get bonuses. In other less competitive sectors where SEO isn't done as much, natural language flows and it all happens normally.
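For anyone who wants to see the "stem, count, normalize, compare" steps Scottie describes, here's a bare-bones Python sketch. The stemmer is a crude suffix chopper I wrote for the example (real systems use something like the Porter stemmer), I've left out the "uniqueness" weighting entirely, and cosine similarity is just one common way to do the comparison.

```python
import math
import re
from collections import Counter

def stem(word):
    """Crude stand-in for a real stemmer: chop a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_vector(text):
    """Stem every word, count the stems, normalize by document length."""
    stems = [stem(w) for w in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(stems)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def cosine(a, b):
    """Similarity of two term vectors; word order plays no part."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

same_words = cosine(term_vector("anteaters eating ants"),
                    term_vector("ants eat anteater"))
different = cosine(term_vector("anteaters eating ants"),
                   term_vector("cooking recipes"))
print(same_words > different)
```

The first pair shuffles the words and changes their endings, yet the two vectors come out identical, which is the "words relative to each other don't matter" point. It also shows why stuffed pages are a problem: the math can't tell a natural phrase from a forced one.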
So, this presents a problem for Google. Large batches of keyword rich pages all fighting over the same terms create an artificial phrase that doesn't normally exist in natural language. Semantics isn't capable of finding out what's real or artificial, so how can Google use this system to rank on natural occurrences rather than artificial ones?
Well, I'll tell ya.

1) CKDA (I just invented that acronym. It stands for Complex Keyword Density Analysis). In the olden days, keyword density was calculated based upon the number of times a keyword appeared in comparison to the total number of words on the page. Too low, and it's not going to rank. Too high, and you're stuffing so it's not going to rank. CKDA takes it further - it does keyword density in the title tag (the number of times the keyword appears in the title tag compared to all the words in the title tag). It does an anchor density analysis (number of times compared to all the different words within anchor tags on the page). It does the same thing with formatting tags (words that appear in <B><EM><H1>, etc. compared to all words within those tags). And, even though they aren't really used for ranking - alt tags, title tags, meta tags, etc. could all be used for density analysis to determine if it's a naturally occurring instance of the word, or if it's a forced instance.
We know (or at least I know, based upon enough observation to be able to consider it as fact in my own mind) that Google is now utilizing CKDA and that they might even be working with highly tightened ranges of acceptable density. They have seemingly loosened up the ranges a bit since Florida first hit. It's still in play, though.
2) Natural Language Identification. I can't be 100% certain that this is in use. We know that ATW (see my example above) has the capability of extracting a natural language description from a page - i.e. to be able to say "This is a sentence". We can assume that Google has the technology to do this, as well. Now, as I say, they may not be fully utilizing it, but the possibility is there. Google wants to provide results based upon natural occurrences, so it only makes sense that Google is (or will be soon) using something along these lines to make certain that at least a good part of a page has some natural language in it.
Obviously, pages have sections that don't use natural language like navigational elements and such so it's not looking for 100%. It may not even be looking for 50%. It might just say, "are there several sentences here?" I'm not sure.
3) Other things I haven't thought of.
So, when it comes to making sure that keyword stuffed pages don't start creating "artificial" instances of phrases in the database that handles the semantics, Google merely runs a few checks before a page gets there, and if the pages don't qualify, they don't get to contribute to the information there.
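Here's roughly how I picture those checks, as a Python sketch. To be clear, this is my guess at the shape of it: the zone list, the density ceilings, the "at least four words, starts with a capital, ends with punctuation" sentence test and the `qualifies` name are all invented. I only check the "too high" side of density here.

```python
import re
from html.parser import HTMLParser

# Which tags feed which density "zone" (an assumed, partial list).
ZONES = {"title": "title", "a": "anchor", "b": "emphasis", "em": "emphasis",
         "strong": "emphasis", "h1": "emphasis", "h2": "emphasis"}

class ZoneCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_zones = []   # zones we are currently inside
        self.words = {"page": [], "title": [], "anchor": [], "emphasis": []}
        self.text = []         # visible text chunks (everything but <title>)

    def handle_starttag(self, tag, attrs):
        if tag in ZONES:
            self.open_zones.append(ZONES[tag])

    def handle_endtag(self, tag):
        if tag in ZONES and ZONES[tag] in self.open_zones:
            self.open_zones.remove(ZONES[tag])

    def handle_data(self, data):
        words = re.findall(r"[a-z0-9']+", data.lower())
        self.words["page"].extend(words)
        for zone in set(self.open_zones):
            self.words[zone].extend(words)
        if "title" not in self.open_zones:
            self.text.append(data)

def densities(html, keyword):
    """Keyword density per zone, plus the page's visible text."""
    collector = ZoneCollector()
    collector.feed(html)
    collector.close()
    kw = keyword.lower()
    dens = {zone: (words.count(kw) / len(words) if words else 0.0)
            for zone, words in collector.words.items()}
    return dens, " ".join(collector.text)

def count_sentences(text):
    """Very rough 'is this a sentence' test, not real language analysis."""
    chunks = re.split(r"(?<=[.!?])\s+", text.strip())
    return sum(1 for s in chunks
               if s[:1].isupper() and s[-1:] in ".!?" and len(s.split()) >= 4)

def qualifies(html, keyword, limits, min_sentences=3):
    """True if the page may contribute to the (hypothetical) semantics database."""
    dens, text = densities(html, keyword)
    if any(dens[zone] > limits[zone] for zone in limits):
        return False   # some zone looks stuffed
    return count_sentences(text) >= min_sentences
```

You'd call it with something like `qualifies(page_html, "anteater", {"page": 0.10, "title": 0.5, "anchor": 0.5, "emphasis": 0.6})`. Those ceilings are pulled out of thin air; the real ranges, if they exist, are anybody's guess.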
As I say, and this is important, in sectors that are highly competitive and highly optimized across the board, there may not be enough pages that qualify to contribute to a valid "semantics database" relating to that sector. This is why it may appear that there are several algos running at Google. It's not that there are several algos, it's that there is a gaping hole in a key element of the ranking right now. Suddenly changing from a keyword rich page to one that is more natural isn't going to help you right now. It'll happen slowly, over time, as more pages begin to use the technique.
And remember, Google is famous for taking something it doesn't like and rather than just wiping everything out, they nudge it out of existence. It starts out where you lose a few spots, then a few updates later, you lose a few more. And pretty soon you're gone (if you haven't been slowly adapting to what they now want from you).
Right now, in most sectors, keyword richness still works. Over time (probably within the year) it'll work less and less well. You'll still need keywords, but they'll need to appear naturally and no stuffing allowed.
----
Vijay - I have to go actually get some work done today. I'll check out your links from your last post later this afternoon. Thanks for posting them!
G.