Website Promotion Tips - Google Dance - The Index Update of the Google Search Engine
Google Dance - The Index Update of the Google Search Engine
The name "Google Dance" has often been used to describe
the index update of the Google search engine. Google's index update
occurred on average once per month. During an index update there was
significant movement in search results and Google showed new
backward links for pages. However, in mid-2003 Google started to
update its index continuously. It appears that a complete index
update is still necessary once in a while and that new backward
links are shown during this time. But, because of the continuous
update, the effects on search results seem to be rather
insignificant.
We will keep this site up and running because it provides some
information beyond the Google Dance. But we will no longer monitor
updated data centers during a "Dance".
The Technical Background of the Google Dance
The Google search engine pulls its results from more than 10,000
servers, which are simple Linux PCs that Google uses for reasons of
cost. Naturally, an index update cannot be carried out on all those
servers at the same time. One server after the other has to be
updated with the new index.
Many webmasters think that, during the Google Dance, Google is in
some way able to control whether a server with the new index or a
server with an old index responds to a search query. But, since
Google's index is an inverted index, this would be very complicated.
As we will show below, there is no such control within the system. In fact, the
reason for the Google Dance is Google's way of using the Domain Name
System (DNS).
Google Dance and DNS
Not only is Google's index spread over more than 10,000 servers, but
these servers are also, as of now, placed in 13 different data
centers. These data centers are mainly located in the US (e.g. Santa
Clara, California and Herndon, Virginia) and in Dublin, Ireland.
In order to direct traffic to all these data centers, Google could
theoretically collect all queries centrally and then send them to the
data centers. But this would obviously be inefficient. Instead, each
data center has its own IP address (numerical address on the
Internet), and the way these IP addresses are accessed is managed by
the Domain Name System.
Basically, the DNS works like this: On the Internet, data transfers
always take place between IP addresses. The information about
which domain resolves to which IP address is provided by the name
servers of the DNS. When a user enters a domain into a browser, a
locally configured name server obtains the IP address for that
domain by contacting the name server which is responsible for that
domain. (The DNS is structured hierarchically. Illustrating the
whole process would go beyond the scope of this paper.) The IP
address is then cached by the local name server, so that it is not
necessary to contact the responsible name server each time a
connection to a domain is established.
The records for a domain at the responsible name server specify
how long a record may be cached by a caching name server. This
is the Time To Live (TTL) of a record. As soon as the TTL expires,
the caching name server has to fetch the record for the domain again
from the responsible name server. Quite often, the TTL is set to one
or more days. In contrast, the Time To Live of the domain
www.google.com is only five minutes. So, a name server may only
cache Google's IP address for five minutes and then has to look up
the IP address again.
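
The caching behavior described above can be sketched in a few lines of
Python. This is only a toy model under our own assumptions: the class
name CachingNameServer, the authoritative_lookup callback and the
injectable clock are hypothetical and do not describe any real name
server software.

```python
import time


class CachingNameServer:
    """Toy model of a caching name server that honors record TTLs."""

    def __init__(self, authoritative_lookup, clock=time.monotonic):
        # authoritative_lookup(domain) -> (ip, ttl_seconds); it stands in
        # for the name server that is responsible for the domain.
        self._lookup = authoritative_lookup
        self._clock = clock
        self._cache = {}  # domain -> (ip, expiry_time)

    def resolve(self, domain):
        now = self._clock()
        cached = self._cache.get(domain)
        if cached is not None and now < cached[1]:
            return cached[0]  # record still fresh: answer from the cache
        # Record unknown or TTL expired: ask the responsible server again.
        ip, ttl = self._lookup(domain)
        self._cache[domain] = (ip, now + ttl)
        return ip
```

With a TTL of 300 seconds, such a server would answer from its cache
for five minutes and then contact the responsible name server again,
which is the point at which a different data center may be returned.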
Each time Google's name server is contacted, it sends back the IP
address of only one data center. In this way, Google queries are
continually directed to different data centers by changing DNS
records. On the one hand, the DNS records may be based on the load of
the individual data centers. In this way, Google would conduct a
simple form of load balancing by its use of the DNS. On the other
hand, the geographical location of a caching name server may
influence how often it receives the individual data centers' IP
addresses, so that the distance for data transmissions can be
reduced. To illustrate the DNS records of the domain www.google.com,
we present them here as seen by one caching name server.
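
How a name server might hand out one data center per answer based on
load can be sketched as follows. We do not know how Google actually
chooses the address, so the weighting by spare capacity, the function
name pick_data_center and the load figures are assumptions made purely
for illustration.

```python
import random


def pick_data_center(loads, rng=random):
    """Return one data center IP, favoring lightly loaded data centers.

    loads maps IP -> current load from 0.0 (idle) to 1.0 (saturated).
    """
    ips = sorted(loads)
    # Spare capacity serves as the selection weight. The small floor keeps
    # every weight positive so that the weights never sum to zero.
    weights = [max(1.0 - loads[ip], 0.001) for ip in ips]
    return rng.choices(ips, weights=weights, k=1)[0]


# Example with two made-up data centers (documentation-only addresses).
print(pick_data_center({"192.0.2.10": 0.9, "192.0.2.20": 0.1}))
```

A geographic preference could be modeled in the same way by adjusting
the weights per caching name server instead of per data center load.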
How data centers, DNS and the Google Dance are related is easily
explained. During the Google Dance, the data centers do not receive
the new index at the same time. Instead, the new index is
transferred to one data center after the other. When a user queries
Google during the Google Dance, he may get results from a data
center which still has the old index at one point in time and from a
data center which has the new index a few minutes later. From the
user's perspective, the index update took place within a few minutes.
But of course, this may also happen in reverse, so that Google
seemingly switches between the old and the new index.
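
The apparent switching between old and new index can be reproduced
with a small simulation. It assumes, for simplicity, that every DNS
answer picks one of the 13 data centers uniformly at random; the
function name simulate_dance is ours, and real DNS answers are
certainly not uniform.

```python
import random


def simulate_dance(updated, total, queries, rng=random):
    """Return the index version ('old' or 'new') seen by each query.

    updated: number of data centers already serving the new index
    total:   number of data centers in the DNS rotation
    """
    versions = ["new"] * updated + ["old"] * (total - updated)
    # Each DNS answer is modeled as a uniformly random data center.
    return [rng.choice(versions) for _ in range(queries)]


# Halfway through a Dance, successive queries see a mix of both indexes.
print(simulate_dance(updated=6, total=13, queries=10))
```

Before the Dance every query returns the old index, and after it every
query returns the new one; only in between does the result depend on
which data center the current DNS record points to.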
Finally, it should be noted that Google did the DNS load balancing
itself until September 2003. Since then, it has used the services
and, hence, the name servers of Akamai Technologies, Inc.
IP Addresses and Domains of Google's Data Centers
The progression of a Google Dance could basically be watched by
querying the IP addresses of Google's data centers. But queries on
the IP addresses are normally redirected to www.google.com. However,
Google has domains which resolve to the individual data centers' IP
addresses. These domains as well as their IP addresses are shown in
the following list.
Note: Searches at www-zu and www-sj are currently redirected to
other data centers. Since results for searches at their IP addresses
fluctuate heavily during a Google Dance, these searches also seem to
be routed internally to other data centers. As we can see from our
statistics for Google's DNS records, no searches at www.google.com
are currently directed to www-zu and www-sj. So, we can
assume that these data centers are offline.
Those who keep an eye on Google's index updates often think that
the Google Dance is over when they see the new index at
www.google.com or when they don't see the old index at
www.google.com for some time. In fact, the update is not finished
until all the domains listed above provide results from the new
index.
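
The completion criterion can be written down as a simple check. The
mapping from data center domain to observed index version would have
to be gathered by hand; the function name dance_finished and the
version labels are hypothetical.

```python
def dance_finished(index_by_domain, new_version):
    """Return True once every data center domain serves new_version."""
    # An empty observation set proves nothing, so it counts as unfinished.
    return bool(index_by_domain) and all(
        version == new_version for version in index_by_domain.values()
    )


# Seeing the new index at one domain is not enough.
observed = {"www-zu": "new", "www-sj": "old"}
print(dance_finished(observed, "new"))
```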
The index update at an individual data center seems to happen at one
point in time. As soon as a data center shows results from the new
index, it won't switch back to the old index. This is most likely
because the index is stored redundantly at each data center and, at
first, only one part of the servers (possibly half of them) is
updated. During this period, only the other half of the servers is
active and provides search results. As soon as the update of the
first half of the servers is finished, they become active and provide
search results while the other half receives the new index. Thus,
from the user's perspective, the update of one data center happens
at one point in time.
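
The presumed two-step update inside one data center can be modeled as
follows. This is a sketch of our hypothesis only: the class name
DataCenter and the strict split into exactly two halves are
assumptions, not knowledge about Google's infrastructure.

```python
class DataCenter:
    """Toy data center whose servers are split into two redundant halves."""

    def __init__(self, index_version):
        self.halves = [index_version, index_version]  # index on each half
        self.active = 0  # the half that currently answers queries

    def search_index(self):
        return self.halves[self.active]

    def update(self, new_version):
        """Yield the index version users see after each update step."""
        standby = 1 - self.active
        self.halves[standby] = new_version  # step 1: load the idle half
        yield self.search_index()           # users still see the old index
        self.active = standby               # step 2: swap roles at once
        yield self.search_index()           # users now see the new index
        self.halves[1 - self.active] = new_version  # step 3: other half
        yield self.search_index()


dc = DataCenter("old")
print(list(dc.update("new")))
```

Because users only ever see the active half, the visible index changes
exactly once, at the moment the roles are swapped.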
Finally, it should be noted that access to the individual data
centers is generally controlled by the DNS only, but sometimes
queries are redirected. However, this is easy to detect: when, for a
query at one of the domains listed above, the links to Google's
cache do not match the IP address that belongs to the domain,
the query has been redirected. If this happens, Google is blocking,
for whatever reason, access to that data center.
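
This check can be expressed as a small helper. It assumes that the
cache links on a result page are absolute URLs whose host part is the
IP address of the answering data center; the function name
is_redirected and the example link are made up for illustration.

```python
from urllib.parse import urlsplit


def is_redirected(expected_ip, cache_links):
    """Return True if any cache link points at a host other than expected_ip."""
    hosts = {urlsplit(link).hostname for link in cache_links}
    return any(host != expected_ip for host in hosts)


# Hypothetical cache link copied from a result page (documentation address).
links = ["http://192.0.2.20/search?q=cache:example"]
print(is_redirected("192.0.2.10", links))
```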
The Google Dance Test Domains www2 and www3
The beginning of a Google Dance can always be watched at the test
domains www2.google.com and www3.google.com. Those domains normally
have stable DNS records which make the domains resolve to only one
(often the same) IP address. Before the Google Dance begins, at
least one of the test domains is assigned the IP address of the data
center that receives the new index first.
Building a completely new index once per month can cause quite
some trouble. After all, Google has to spider several billion
documents and then process many terabytes of data. Therefore, testing
the new index is essential. Of course, the folks at Google don't need
the test domains themselves. Most certainly, they have many options
to check a new index internally, but they do not have a lot of time
to conduct the tests.
So, the reason for having www2 and www3 is rather to show the new
index to webmasters who are interested in their upcoming rankings.
Many of these webmasters discuss the new index in Google-related
forums on the web. These discussions can be observed by Google
employees. At that time, the general public cannot see the new index
yet, because the DNS records for www.google.com normally do not
point to the IP address of the data center that is updated first
when the update begins.
As long as Google's test community of forum members does not find
any severe malfunctions caused by the new index, Google's DNS
records can be changed to make www.google.com resolve to the data
center that is updated first. This is the time when the Google Dance
begins. But if severe malfunctions become obvious during this test
phase, there is still the possibility to cancel the update at the
other data centers. The domain www.google.com would not resolve to
the data center which has the flawed index and the general public
would not notice anything. In this case, the index could be
rebuilt or the web could be spidered again.
So, the search results which are to be seen on www2.google.com and
www3.google.com will always appear on www.google.com later on, as
long as there is a regular index update. However, there may be minor
fluctuations. On the one hand, the index at one data center is never
exactly identical to the index at another data center. We can easily
check this by watching the number of results for the same query at
the data center domains listed above, which often differ from each
other. On the other hand, it is often assumed that the iterative
PageRank calculation is not yet finished when the Google Dance
begins, so that preliminary values influence rankings at
that point in time.
The New PageRank Values during the Google Dance
Most webmasters are interested in ranking changes for their website
during the Google Dance. But, besides that, many also want to know
about their new PageRank values. Normally, the Google Toolbar
fetches the PageRank values from the data center that is specified
by its IP address in the current DNS record for www.google.com.
Hence, when the Google Dance begins, the Toolbar usually displays
the old PageRank values.
Google sends PageRank values to the Toolbar in simple text files.
Formerly, this happened via XML. The switch to text files
occurred in August 2002. The PageRank files can be requested directly
from the domain www.google.com. Basically, the URLs for those files
look as follows (without line breaks):
There is only one line of text in the PageRank files. The last
number in this line is the PageRank.
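
Reading the value out of such a file can be sketched as follows. Since
the exact layout of the line is not shown here, the helper simply
takes the trailing run of digits; the function name parse_pagerank and
the sample line are invented and do not reproduce a real PageRank file.

```python
import re


def parse_pagerank(line):
    """Return the trailing number of a one-line PageRank file, or None."""
    match = re.search(r"(\d+)\s*$", line)
    return int(match.group(1)) if match else None


# Invented sample line; only the trailing number matters to the helper.
print(parse_pagerank("field:field:7\n"))
```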
The parameters included in the URL are required
for displaying the PageRank files in a browser. The value "navclient-auto"
for the parameter "client" identifies the Toolbar. The URL of the
page is submitted via the parameter "q". The value "Rank" for the
parameter "features" determines that the PageRank files are
requested. If it is omitted, Google's servers still transmit XML
files. The parameter "ch" transfers a checksum for the URL to Google;
this checksum can only change when the Toolbar version is updated by
Google.
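
The four parameters can be assembled into a query string with Python's
standard library. The sketch covers only the parameters named above:
the function name toolbar_query is ours, the checksum "0000" is a
placeholder (real checksums have to be taken from the browser cache),
and the value of "q" may need a different form than the bare URL used
here.

```python
from urllib.parse import urlencode


def toolbar_query(page_url, checksum):
    """Build the query string for a PageRank file request."""
    params = {
        "client": "navclient-auto",  # identifies the Toolbar
        "features": "Rank",          # ask for the PageRank text file
        "ch": checksum,              # checksum belonging to page_url
        "q": page_url,               # the page whose PageRank is wanted
    }
    return urlencode(params)


# Placeholder checksum; a real one cannot be derived from this sketch.
print(toolbar_query("http://www.example.com/", "0000"))
```

urlencode takes care of escaping the characters in the page URL, which
is easy to get wrong when such a query string is put together by hand.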
The PageRank files that are requested by the Google Toolbar are
cached by Internet Explorer. So, their URLs and the checksums
can simply be found by having a look at the folder Temporary
Internet Files. Knowing the checksums of your URLs, you can view the
PageRank files in your browser. Since the PageRank files are kept in
the browser cache and, thus, are clearly visible, and as long as
requests are not automated, watching the PageRank files in a browser
should not be a violation of Google's Terms of Service. However, you
should be cautious. The Toolbar submits its own User-Agent to
Google. It is:
Mozilla/4.0 (compatible; GoogleToolbar 1.1.60-deleon; OS SE 4.10)
1.1.60-deleon is a Toolbar version which may of course change. OS is
the operating system that you have installed. So, Google is able to
identify requests by browsers, if they do not go out via a proxy and
if the User-Agent is not modified accordingly.
Now, let's see how we can get the new PageRank values. Taking a look
at IE's cache, you will notice that the PageRank files are not
requested from the domain www.google.com but from IP addresses like
216.239.33.102. Additionally, the PageRank files' URLs often contain
a parameter "failedip" that is set to values like
"216.239.35.102;1111" (its function is not entirely clear).
However, it is pretty easy to get the new PageRank values. Simply
modify the IP address in the URL so that the request goes to one
of the data centers that already has the new index. The necessary
information is given above.
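
Swapping the IP address in a cached URL can be done by hand, or with a
small helper such as the one below. The function name retarget and the
example URL are hypothetical; the path and parameters of a real
PageRank URL have to be copied from the browser cache.

```python
from urllib.parse import urlsplit, urlunsplit


def retarget(url, data_center_ip):
    """Return url with its host replaced by data_center_ip.

    The whole network location is replaced, so a port in the original
    URL is dropped as well.
    """
    parts = urlsplit(url)
    return urlunsplit(parts._replace(netloc=data_center_ip))


# Made-up URL using documentation-only addresses.
print(retarget("http://192.0.2.10/path?x=1&y=2", "192.0.2.20"))
```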
About The Author
The contents of this document may be reproduced on the web, provided
that a copyright notice is included and that there is a straight
HTML hyperlink to the corresponding page at
dance.efactory.de in direct
context.