If I put you in charge of fixing this problem, what would you do, how complicated would it be and how long would it take to "fix"? No problem is unfixable, and the solution may give the problem context.
I think they could have made the switch to 5 bytes within a few months, assuming that they committed sufficient resources to the task.
OK, Daniel, you seem to have missed the point of my question.
I asked you this multi-part question: "what would you do, how complicated would it be and how long would it take to 'fix'?"
You have made the claim, several times, that this is a non-trivial problem to solve. OK, I accept that. But what is the solution? "They could have made the switch to 5 bytes within a few months, assuming that they committed sufficient resources to the task" is not an answer with any real technical evidence to support it, offers no alternative possibilities, and is an assessment of feasibility, not actually a solution.
Given this thread is now 6 pages long, the almost complete lack of technical explanations for this "problem" is bizarre. I felt sure, given an open invitation, you could have supplied me with technical data like this article: http://www.searchgui...article215.html, which even goes so far as to offer a specific code snippet showing the trivial nature of a fix.
Specifically, I was a bit disappointed that you have not provided any evidence to support the notion that 5 bytes is the optimal size for a docID, nor that this is the best approach, nor in fact any indication of what would be required, in terms of coding, to actually implement such a solution. One paragraph to answer the solution question, and then 4-5 on why Google did all this. Doesn't seem right to me.
Alas, all we have had since my last post is more speculation about sociological, marketing and monetary concerns, as well as a segue into quality of results, rather than raw technical data.
This problem, if it ever existed, is purely a design and technical problem. Other factors may play a part in prioritising its "fixing", but that assumes the problem still exists and wasn't already fixed.
Usually, in any debate, speculation as to why a thing supposedly happened, and theories on why it supposedly continues, is reserved for after a theory is proven or demonstrated, not before and/or as part of the argument to prove such a belief. Motivation for doing or not doing something cannot form the primary basis of evidence for a thing, nor indeed should 5-year-old evidence.
So, in the interest of not being hypocritical, let me offer my $0.02 solution. Why not keep 32-bit docIDs but have multiple inverted indexes? A docID would then be specific to one inverted index, which in turn has its own index ID. This would require an extra 32 bits per URL, not per entry in an inverted index. Normalising a database saves space, and this would have additional benefits as well.
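To make that concrete, here is a rough sketch in Python. It is purely illustrative: the names, structures and numbers are mine, not anything Google actually uses. The point is just that the index ID is paid for once per URL, while the posting lists themselves still only hold 32-bit docIDs:

from collections import defaultdict

class ShardedIndex:
    """Toy model: 32-bit docIDs scoped to separate per-shard inverted indexes."""

    def __init__(self, num_indexes):
        # One inverted index (term -> list of 32-bit docIDs) per shard.
        self.postings = [defaultdict(list) for _ in range(num_indexes)]
        # URL table: url -> (index_id, doc_id). The extra 32 bits for the
        # index ID are stored once per URL here, not once per posting entry.
        self.url_table = {}
        self.next_doc_id = [0] * num_indexes

    def add_document(self, url, terms, index_id):
        doc_id = self.next_doc_id[index_id]   # stays within 32 bits per index
        self.next_doc_id[index_id] += 1
        self.url_table[url] = (index_id, doc_id)
        for term in set(terms):
            self.postings[index_id][term].append(doc_id)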
This solution would also allow multiple threads on multiple machines to search through each index separately, utilising pre-existing functions. If indexes were kept small, this would also improve speed, as there would be far fewer entries per inverted index to search through.
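Again, this is only a sketch building on the toy class above, but searching could then fan out across the indexes and merge the (index ID, docID) hits, with each worker scanning only its own, smaller posting list:

from concurrent.futures import ThreadPoolExecutor

def search_all_indexes(index, term):
    """Look up `term` in every inverted index in parallel and merge the hits."""

    def search_one(index_id):
        # Re-uses the existing single-index lookup, just scoped to one shard.
        return [(index_id, doc_id)
                for doc_id in index.postings[index_id].get(term, [])]

    with ThreadPoolExecutor() as pool:
        per_index_hits = pool.map(search_one, range(len(index.postings)))

    # Flatten into (index_id, doc_id) pairs for the caller.
    return [hit for hits in per_index_hits for hit in hits]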
Now, such a solution would require changes in core logic, no doubt, but it would not take massive amounts of time (certainly not 17 months), and it also makes logically separate indexes, like News, easier to implement.
Now, this is "off the top of my head", and I am sure there are plenty of other solutions to such a problem. Perhaps someone can offer some others, so that we can all better understand the expense and/or difficulty Google would face.