It happens. According to Google Analytics, a website I built had 14.4 billion in revenue on one day in November 2008. While I wish it were true, the bigger problem is that any chart that includes this datapoint is essentially useless, since it dwarfs the daily revenue of every other day.
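To make the scale problem concrete, here's a rough sketch with made-up numbers (nothing here is the real data): one absurd datapoint pushes every normal day to effectively zero on a shared axis, and the usual workaround is just to drop or cap it before plotting.

    # Made-up daily revenue figures; one bogus datapoint like the 14.4 billion day.
    daily_revenue = [1200.0, 950.0, 1430.0, 14_400_000_000.0, 1100.0]

    # Plotted on a shared axis, every normal day is a tiny fraction of the peak.
    peak = max(daily_revenue)
    print([v / peak for v in daily_revenue])   # normal days come out around 1e-7

    # Common workaround: drop (or cap) anything wildly beyond the median
    # before charting, accepting that the bad day is simply masked out.
    median = sorted(daily_revenue)[len(daily_revenue) // 2]
    cleaned = [v for v in daily_revenue if v <= 100 * median]
    print(cleaned)                             # [1200.0, 950.0, 1430.0, 1100.0]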
"Originally, Google neglected updating Google Trends on a regular basis. In March 2007, internet bloggers noticed that Google had not added new data since November 2006, and Trends was updated within a week. Google did not update Trends from March until July 30, and only after it was blogged about, again.[2] Google now claims to be "updating the information provided by Google Trends daily; Hot Trends is updated hourly."
Google Insights for Search seems to be better for this sort of analysis since it offers regional filtering options and puts the searched term into context.
Seems like they had an error in how it was calculated? The spike exists for every term I can find that existed back then, but it seems relative to the total, so maybe they accidentally counted one search as two or something? Maybe it was an issue with their use of AJAX that caused two search requests to be fired off to Google? Or maybe the data isn't wrong, and something genuinely caused extra searches?
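Purely to illustrate that guess (the terms and counts here are invented, and nothing is known about Google's actual logging): if some frontend change caused every search in a window to be recorded twice, every term's count would double for that period, and since Trends plots each term against its own history, that looks like a simultaneous spike in everything.

    from collections import Counter

    # Invented traffic for the window in question.
    real_searches = ["xkcd", "penny arcade", "xkcd", "weather"]

    # Normal logging: one record per search.
    normal_log = Counter(real_searches)

    # Hypothetical bug: each search also fires a second request that gets logged.
    buggy_log = Counter(real_searches * 2)

    print(normal_log)  # Counter({'xkcd': 2, 'penny arcade': 1, 'weather': 1})
    print(buggy_log)   # every count doubled -> every term spikes at once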
When I first saw the spike it was for XKCD and Penny Arcade, so I was expecting it to be internet-culture related. At the time XKCD was still growing in readership, so a colossal jump was a bit weird, and I was wondering about a Digg or Reddit boost. Then, when I noticed it was wider and extended into more obscure terms, I wondered if it was an oddball 4chan event. However, I saw it in literally every term I searched; the only ones I couldn't see it in were terms that already had huge random spikes.
So I thought I'd post it here and see if anyone else could figure it out, and it looks like a few people have good suggestions - gotta love HN.
Do they even have the ability to repair the data? If the information is logged in real time and there is no easy way to filter through billions of search query terms to de-dupe them (or apply whatever fix may be required), it might not be possible to correct the dataset.
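For what a fix would even have to look like - a sketch that assumes duplicate rows exist and that raw records were kept, neither of which we actually know - you'd need some key that identifies each logical search and then a pass over every record in the affected window:

    # Hypothetical raw records: (timestamp, client_id, query).
    records = [
        ("2008-11-17T12:00:01", "client-a", "xkcd"),
        ("2008-11-17T12:00:01", "client-a", "xkcd"),   # duplicate from the supposed bug
        ("2008-11-17T12:00:05", "client-b", "penny arcade"),
    ]

    # De-dupe by treating identical (timestamp, client, query) rows as one search.
    seen = set()
    deduped = []
    for row in records:
        if row not in seen:
            seen.add(row)
            deduped.append(row)

    print(deduped)  # the duplicate xkcd row is gone
    # At Google's scale this means re-reading billions of rows for the window,
    # and it only works at all if those raw rows still exist.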
Because the original data on which the statistics are based was probably deleted a long time ago, and that raw data is the only way to recompute the 'right' numbers.
The only option would be to filter out the peak, but then again, that would also lose all the real information in that timespan. Just too much bother.
I would assume that Google, or any other similar data-aggregation company, would log and keep statistics and summary information but discard the actual raw data. They do have some of the biggest storage capabilities around, but that's no reason to fill them full of Apache logs.
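A toy sketch of that kind of rollup (this is not Google's actual pipeline, just the general pattern being described): raw request lines get aggregated into per-day counts and then discarded, so any inflated count is baked into the summary with nothing left to check it against.

    from collections import defaultdict

    # Invented raw log lines for one day.
    raw_log = [
        ("2008-11-17", "xkcd"),
        ("2008-11-17", "xkcd"),          # the extra record you'd want to remove later
        ("2008-11-17", "penny arcade"),
    ]

    # Roll up into per-day, per-term counts and keep only the summary.
    daily_counts = defaultdict(int)
    for day, query in raw_log:
        daily_counts[(day, query)] += 1
    del raw_log  # aggregates survive, raw data doesn't

    print(dict(daily_counts))
    # {('2008-11-17', 'xkcd'): 2, ('2008-11-17', 'penny arcade'): 1}
    # Once this is all that's stored, you can't tell a bogus extra count
    # from a genuine second search.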
For instance, if a spam site ranks for a query, banning it manually is only a very last resort - they would prefer to change the next incarnation of the algorithm to block that spam.