QUOTE(Robert813 @ Nov 7 2005, 03:45 PM)
QUOTE(Michael Martinez @ Nov 7 2005, 03:27 PM)
They never presented it as crucial to their relevance ranking algorithm.
What would that have to do with the separate PR algo?
PageRank has always, going back to the original paper (which was about the search engine, not about PageRank itself), been only one of many components, and really not a very vital one.
Not true, Michael...
"The Google search engine has two important features that help it produce high precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank and is described in detail in [Page 98]. Second, Google utilizes link to improve search results. "
Look again. There is nothing in that paragraph which addresses how they determine the final results.
If you look above that section, you'll see where they wrote: "... In particular, link structure [Page 98] and link text provide a lot of information for making relevance judgments and quality filtering. Google makes use of both link structure and anchor text (see Sections 2.1 and 2.2). "
So, they're throwing "anchor text" into the mix right from the start. And then there is section 2.3:
2.3 Other Features
Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font are weighted higher than other words. Third, full raw HTML of pages is available in a repository.
"Words in a larger or bolder font are weighted higher than other words."
Hm. Where have I pointed that out before?
So, if we judge relative importance by the document's own section numbering, PageRank gets equal billing with anchor text and with bold/large-font usage.
They then go into extensive detail about how the search engine database is structured, so PageRank hardly constitutes a major portion of the document. But finally we get down to Section 4.5.1 (RANKING SYSTEM) where they write:
4.5.1 The Ranking System
Google maintains much more information about web documents than typical search engines. Every hitlist includes position, font, and capitalization information. Additionally, we factor in hits from anchor text and the PageRank of the document. Combining all of this information into a rank is difficult. We designed our ranking function so that no particular factor can have too much influence. First, consider the simplest case -- a single word query. In order to rank a document with a single word query, Google looks at that document's hit list for that word. Google considers each hit to be one of several different types (title, anchor, URL, plain text large font, plain text small font, ...), each of which has its own type-weight. The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list. Then every count is converted into a count-weight. Count-weights increase linearly with counts at first but quickly taper off so that more than a certain count will not help. We take the dot product of the vector of count-weights with the vector of type-weights to compute an IR score for the document. Finally, the IR score is combined with PageRank to give a final rank to the document.
So, here they are looking at: position, font, capitalization, anchor text, title, URL, plain text large font, plain text small font, ..., and PageRank.
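To make the single-word case concrete, here is a rough sketch of what that paragraph describes. The paper publishes none of the actual numbers, so the type-weights, the cap in count_weight, and the linear blend in final_rank (including the alpha parameter) are all assumptions of mine for illustration, not Google's values or formula.

```python
# Hypothetical type-weights; the paper does not give the real values.
TYPE_WEIGHTS = {
    "title": 5.0,
    "anchor": 4.0,
    "url": 3.0,
    "plain_large": 2.0,
    "plain_small": 1.0,
}

CAP = 8  # assumed cutoff: hits beyond this count stop helping


def count_weight(count, cap=CAP):
    """Grow linearly with the count, then go flat at the cap (assumed shape)."""
    return float(min(count, cap))


def ir_score(hit_counts):
    """Dot product of the count-weight vector with the type-weight vector."""
    return sum(
        TYPE_WEIGHTS[hit_type] * count_weight(hit_counts.get(hit_type, 0))
        for hit_type in TYPE_WEIGHTS
    )


def final_rank(hit_counts, pagerank, alpha=0.5):
    """Combine IR score and PageRank. The paper only says they are
    'combined'; a linear blend is an assumption made here."""
    return alpha * ir_score(hit_counts) + (1 - alpha) * pagerank


if __name__ == "__main__":
    doc = {"title": 1, "plain_small": 20}
    print(ir_score(doc))
    print(final_rank(doc, pagerank=3.0))
```

Note that in this sketch a page with 20 small-font hits scores the same as one with 8, which is the "more than a certain count will not help" behavior, and PageRank only enters at the very last step.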
But then they complicate it for multi-word queries:
For a multi-word search, the situation is more complicated. Now multiple hit lists must be scanned through at once so that hits occurring close together in a document are weighted higher than hits occurring far apart. The hits from the multiple hit lists are matched up so that nearby hits are matched together. For every matched set of hits, a proximity is computed. The proximity is based on how far apart the hits are in the document (or anchor) but is classified into 10 different value "bins" ranging from a phrase match to "not even close". Counts are computed not only for every type of hit but for every type and proximity. Every type and proximity pair has a type-prox-weight. The counts are converted into count-weights and we take the dot product of the count-weights and the type-prox-weights to compute an IR score. All of these numbers and matrices can all be displayed with the search results using a special debug mode. These displays have been very helpful in developing the ranking system.
Now we're into the importance of proximity pairs, count weights, and whatnot.
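A similar sketch for the multi-word case, limited to two query words. The only figure taken from the paper is the 10 proximity bins; the binning rule, the type-prox-weights, the count cap, and the "nearest hit of the same type" matching are all assumptions I made up to show the shape of the calculation.

```python
from collections import Counter

NUM_BINS = 10  # from the paper: 10 bins, phrase match to "not even close"

# Hypothetical base weights per hit type.
BASE_WEIGHTS = {"title": 5.0, "anchor": 4.0, "plain": 1.0}


def proximity_bin(distance):
    """Assumed binning: adjacent words fall in bin 0, far-apart words in bin 9."""
    return min(max(distance - 1, 0), NUM_BINS - 1)


def type_prox_weight(hit_type, prox_bin):
    """Hypothetical weight for a (type, proximity) pair; closer scores higher."""
    return BASE_WEIGHTS[hit_type] * (NUM_BINS - prox_bin) / NUM_BINS


def count_weight(count, cap=8):
    """Linear at first, flat after the (assumed) cap."""
    return float(min(count, cap))


def multiword_ir_score(hits_a, hits_b):
    """hits_a, hits_b: lists of (position, hit_type) for two query words
    within one document. Each hit of word A is matched to the nearest
    hit of word B that has the same type (an assumed matching rule)."""
    counts = Counter()
    for pos_a, type_a in hits_a:
        candidates = [pos_b for pos_b, type_b in hits_b if type_b == type_a]
        if not candidates:
            continue
        distance = min(abs(pos_b - pos_a) for pos_b in candidates)
        counts[(type_a, proximity_bin(distance))] += 1
    # Dot product of count-weights with type-prox-weights.
    return sum(
        count_weight(n) * type_prox_weight(hit_type, prox_bin)
        for (hit_type, prox_bin), n in counts.items()
    )


if __name__ == "__main__":
    near = multiword_ir_score([(10, "plain")], [(11, "plain")])
    far = multiword_ir_score([(10, "plain")], [(100, "plain")])
    print(near, far)
```

With these made-up weights, two words sitting next to each other score well above the same two words 90 positions apart, and PageRank still does not appear anywhere in the IR score itself.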
Oh, sure, PageRank is included somewhere in the mix.
They were proud of PageRank, to be sure, but it doesn't have anything to do with relevance.