15 June, 2008
UPDATE 06/16/08: After making Slashdot today and reading the comments, I feel I should clarify
a few things regarding this post.
First, it was pointed out (quite rightly) that Gootrude trends the number of documents in
Google's index that contain each search term, whereas Google Trends tracks the search volume
associated with a search term. These are not the same thing. However, I would wager that
they are related, though certainly not in some nice 1:1 correspondence. That is, if a
fairly distinctive word - say, "myspace" - is mentioned on 100,000 web sites, then it will be
used as a search term in Google more often than if it were mentioned on only 10 web sites,
simply because it is exposed to many more people who might then become interested in all
things "myspace". I would also wager that the reverse is true: if "myspace" is an extremely
popular search term, then it will probably also appear on many more web sites around the
Internet. Gootrude attempts to trend the number of hits a given search term returns in Google,
and by the above reasoning, there is a loose correspondence between this and the number of
times the term is searched for in Google. Perhaps this is not useful; only time will tell,
once Gootrude has been used by a lot of people. If you will allow the above interpretation,
then the remainder of the post below makes sense.
Now for the original post:
The Google Trends project allows you to input search terms like "Myspace", "2008 election", or
"Linux", and see how the popularity of these search terms changes over time. The resulting
graphs can be quite interesting - spikes in search volume can sometimes be correlated with
particular news articles and world events, and the Google Trends interface points these out.
This is a handy tool, but there are many search terms for which Google Trends does not display
any results. According to the error message that Google Trends generates, such terms (for
example "Linux Firewalls" - with the quotes) have insufficient search volume to display graphs.
Fair enough. I suppose that Google sets an internal threshold on search volume, and this
threshold could be set for reasons that range anywhere from Google Trends still being
experimental to Google not wanting to provide data on how it builds its massive search index
for emerging search terms. Either way, I would like a way to see search term trends that Google
doesn't currently make available to me.
Although I'm an open source developer and author, search terms related to my projects are not
yet popular enough to be displayed by Google Trends. So, I had to roll my own trending
mechanism, and this blog post announces the release of Gootrude, a new open source tool that I
wrote to do just this (see the quick start, source code, and download links).
The basic strategy is to take a collection of search terms defined by the user, automatically
query Google for the number of results associated with each of these search terms (this is
displayed by Google when doing a web search), and graph these numbers over time with Gnuplot.
Let me state up front that Gootrude only makes use of data that Google freely provides to
everyone with normal web searches, and is meant to be run once per day (so as to not be a pest
in terms of the number of queries it makes). As an example, if you type the word "security"
into the Google web interface, it will return a string like
"Results 1 - 10 of about 1,010,000,000". The "1,010,000,000" number is collected by Gootrude
and stored in a file along with the current time.
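To make the collection step concrete, here is a minimal Python sketch of pulling the count out of a results string and appending it to a data file with a timestamp. This is not Gootrude's actual code: the function names, the regular expression, and the "epoch-seconds count" file layout are all assumptions made for illustration, and the step that fetches the page from Google is deliberately left out.

```python
import re
import time

# Matches the count in strings such as "Results 1 - 10 of about 1,010,000,000".
# The exact wording Google uses is an assumption based on the example above.
RESULT_COUNT_RE = re.compile(r"of about (\d[\d,]*)")


def parse_result_count(text):
    """Return the hit count found in `text` as an int, or None if absent."""
    match = RESULT_COUNT_RE.search(text)
    if match is None:
        return None
    # Strip the thousands separators before converting.
    return int(match.group(1).replace(",", ""))


def append_data_point(path, count, timestamp=None):
    """Append one "<epoch seconds> <count>" line to the file at `path`.

    The two-column layout is hypothetical; it is simply a format that
    Gnuplot can read directly.
    """
    if timestamp is None:
        timestamp = int(time.time())
    with open(path, "a") as fh:
        fh.write(f"{timestamp} {count}\n")
```

For example, `parse_result_count("Results 1 - 10 of about 1,010,000,000")` gives the integer 1010000000, which `append_data_point` would then write as one line per daily run.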
For the past year, I have sent a set of search terms through Google once per day with Gootrude
and the results are displayed below. Visible within the data returned from Google are strange
oscillations that vary quite a bit more than I would have expected, and also evidence for what
happens when a large site (like linux.com) posts an article about a Cipherdyne project.
First, below is the graph of the fairly distinctive word "cipherdyne" since late June 2007. The
filled-in red curve is the absolute number of search results (taken each day around 1am), and
the green line is the 10-day moving average.
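For readers who want to reproduce the green line, a 10-day moving average can be sketched in a few lines of Python. I am assuming a trailing window here (each point averages that day and the nine days before it), and that the first few points simply average over whatever data exists so far; the function name is illustrative and not taken from the Gootrude sources.

```python
def moving_average(values, window=10):
    """Trailing moving average over `values`.

    Each output point is the mean of the current value and up to
    `window - 1` values before it. Early points, where fewer than
    `window` values exist, average over what is available.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running_sum -= values[i - window]
        averages.append(running_sum / min(i + 1, window))
    return averages
```

With a window of 2, the series `[1, 2, 3, 4]` becomes `[1.0, 1.5, 2.5, 3.5]`. A longer window smooths out more of the day-to-day noise at the cost of lagging further behind real changes.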
As you can see, at the beginning of the graph around July 1st Google steadily shows
about 28,000 results for "cipherdyne", but towards the end of July this dips to well below 20,000 only to
rebound in August to about 30,000. Then, beginning around March 1st, 2008 the results shoot
up to over 100,000 briefly and then back to around 70,000 in May. How does one interpret this
data? It seems unlikely that these fluctuations can be entirely explained by "actual"
day-by-day changes in how external sites reference the term "cipherdyne" - there must be some
index updating component that is internal to Google at work here, and we'll see a better example
of this below.
Now, here is the graph of the search term "gpgdir":
The most obvious feature of the gpgdir graph above is the large spike to around 60,000 results
around May 1st. It turns out that an article was posted to linux.com on the 24th of April, so
given that "gpgdir" is not a common word, the spike seems nicely correlated with the posting of
the article as it got bounced around the Internet and blogosphere. A more interesting feature
perhaps is
the sharp cyclic oscillation between July and December 2007. During this time, search results for
gpgdir bounced from 1,000 to around 10,000 and back again several times, and the transition
each time was fast - making the jump to 10,000 over the course of two days and then stabilizing
for about 10 days or so and then back down to 1,000. It is almost as though Google was trying to
establish the proper order of magnitude for "gpgdir" search results during this time via a sort
of step function.
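One way to pick these step-like transitions out of the raw data is to flag any day whose count differs from the count a couple of days earlier by a large factor. The sketch below is only an illustration: the function name, the default two-day lag, and the factor-of-five threshold are my own assumptions, not something Gootrude does.

```python
def find_steps(counts, lag=2, min_ratio=5.0):
    """Flag abrupt jumps in a daily series of result counts.

    Returns a list of (index, earlier_count, current_count) tuples for
    every day whose count differs from the count `lag` days earlier by
    a factor of at least `min_ratio`, in either direction.
    """
    steps = []
    for i in range(lag, len(counts)):
        earlier, current = counts[i - lag], counts[i]
        if earlier <= 0 or current <= 0:
            continue  # a ratio is meaningless for zero or missing counts
        ratio = max(earlier, current) / min(earlier, current)
        if ratio >= min_ratio:
            steps.append((i, earlier, current))
    return steps
```

On a made-up series such as `[1000, 1000, 10000, 10000, 1000]` with `lag=1`, this flags the jump up at index 2 and the drop back down at index 4. Note that with a lag greater than one, a single transition can be flagged on more than one consecutive day.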
Finally here is the graph of "single packet authorization":
Again, we see a dramatic spike in search results - from around 5,000 to well over 50,000 before
settling down to about 10,000 around the beginning of June. There has been some activity
related to SPA in the Ubuntu forums and also in the Gentoo forums, but if this caused Google to
report over 50,000 search results, why did this number drop so precipitously back to around
10,000? The links have not gone away, but they were probably mentioned on other referencing
sites and then moved to less important pages over time on those sites. Perhaps Google is trying
to find the appropriate steady state for its search results, and there are many factors that
Google takes into account that are not available to the public.
There are lots of unanswered questions this sort of data brings to mind:
- All of the data for the above graphs was collected from a single Linux system. How
different would the results be if several systems in different geographic locations
collected the data and the average for each data point was used instead?
- Each data point was collected around 1am every morning. If the data collection time
were, say, 1pm, would the results have been significantly different?
- What is the "optimal" time scale for the moving average? Given that Google's own
Trends interface seems to show search results on the macro level, would a much longer
moving average than 10 days - perhaps on the order of several weeks - be a more
accurate reflection of search popularity?
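The first question above could be explored with something like the following Python sketch, which averages each day's count over however many collection systems reported a value for that day. The data layout (a dictionary of per-system series keyed by date string) and the function name are hypothetical; Gootrude does not currently support multiple collectors.

```python
from collections import defaultdict
from statistics import mean


def average_across_collectors(series_by_collector):
    """Combine per-system daily counts into one averaged series.

    `series_by_collector` maps a collector name to a dict of
    {day: count}. The result maps each day to the mean count over the
    collectors that reported that day, with days in sorted order.
    """
    counts_by_day = defaultdict(list)
    for series in series_by_collector.values():
        for day, count in series.items():
            counts_by_day[day].append(count)
    return {day: mean(counts) for day, counts in sorted(counts_by_day.items())}
```

For instance, if one system reports 100 and another reports 300 for the same day, the combined data point is 200; a day seen by only one system keeps that system's value. Taking the mean is just one choice, and a median might be more robust if one location occasionally sees wildly different numbers.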
One thing is clear - getting search results that are meaningful is much easier with unusual
search terms. With the posting of this blog entry, the term "Gootrude" should evolve nicely
within Google results, and the graph of these results will be updated daily on the
main Gootrude page so you can see this evolution as it unfolds.
In closing, I would like to mention that Gootrude is just getting started, so there are
lots of enhancements that need to be made. Some of the most important features to develop
are:
- Integration with the Google Charts API.
- Development of an online web portal for Gootrude so that users don't have to have their
own infrastructure to run Gootrude.
- Ability to import search data from different Gootrude collection systems.
- Support for data collected from additional search engines.
If you are an open source developer and would like to contribute, see the TODO file for an
updated list of development tasks, or send me an email (mbr[at]cipherdyne[dot]org). Also,
if you have any ideas or feedback on why some of the graphs above look the way they do in the
context of how Google builds its index, please email me.
Finally, here are a few additional graphs of search terms over the past year: