In a blog post today, Google says they’ve identified 1 trillion unique URLs on the web. The real number is actually higher, they say, but some web pages have multiple URLs with exactly the same content, or URLs that are auto-generated copies of each other.
What they note way down in the fourth paragraph, however, is that they don’t actually index all of those pages, so you can’t find them on Google. Estimates put the true size of the Google index at a mere 40 billion pages or so. If those estimates are right, that is about 4 percent of the URLs Google says it has found.
Why don’t they index all the pages they’ve found? Some of them are spam. But it’s also very expensive to index sites. And the fact that Google re-indexes many news sites, blogs and other rapidly changing web sites every 15 minutes makes all that indexing even more expensive. So they make a value judgment about what to actually index and what to skip. And most of the web is left out.
Google also says, “But we’re proud to have the most comprehensive index of any search engine.”
That may be true today, but it probably won’t be true next week (check back here then). Google knows that as well as we do, and that’s why they posted this today.