Friday, November 1, 2013

Taking Google’s Ngram For a Spin

Eric Schultz

As part of a recent parents' weekend at a fine New England educational institution, I had the pleasure of watching Peter Norvig, director of research at Google, demonstrate Google’s Ngram Viewer (Ngram section starts around 9:30).  Released in December 2010, Ngram (Wikipedia tells us) is a “phrase-usage graphing tool” that charts the yearly count of selected n-grams (letter combinations), or words and phrases.  The Google database currently includes 5.2 million books published between 1500 and 2008 containing 500 billion words in American English, British English, French, German, Spanish, Russian, and Chinese.
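For readers who like to see an idea in code, here is a minimal Python sketch of what counting a word n-gram means. It is only an illustration under my own assumptions: the helper name `word_ngrams` and the toy sentence are invented for this example, and nothing here reflects how Google actually builds its database.

```python
from collections import Counter

def word_ngrams(text, n):
    """Return every run of n consecutive words in text, case preserved."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Toy sentence standing in for a scanned book; not real Ngram data.
sample = "the United States is large and the United States is old"

# Tally how often each four-word phrase occurs in the sample.
counts = Counter(word_ngrams(sample, 4))
print(counts[("the", "United", "States", "is")])
```

Repeat that tally for every book published in a given year and you have, roughly, one point on an Ngram curve.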

Peter Norvig’s example that day illustrated the power of the Ngram database, comparing the word combinations “The United States is” and “The United States are.”

The result seems logical: the singular “is” becomes the dominant verb after the American Civil War.  This was a relatively straightforward example, however, and I should warn you that, as you read through this article, you’ll need to channel a little of your inner art historian as the graphs become more complicated and require a longer look, often informed by a refresher on dates.

I also need to warn that the Ngram tool is a little like a chain saw in the hands of a beginner; my graph does not look exactly like Peter’s (and I’m not sure why), everything is case-sensitive, and there’s no simple facility yet (of which I am aware) for combining terms.  So, for example, one cannot search on “Franklin Delano Roosevelt” and “FDR” combined.
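If one had the yearly counts for each term in hand, combining them would be simple arithmetic. The Python sketch below assumes exactly that; the function name `combine_series` is hypothetical and the numbers are made up for illustration, not drawn from Google's data.

```python
from collections import defaultdict

def combine_series(*series):
    """Sum several {year: count} dictionaries year by year."""
    total = defaultdict(int)
    for s in series:
        for year, count in s.items():
            total[year] += count
    # Return a plain dict ordered by year for easy reading or plotting.
    return dict(sorted(total.items()))

# Made-up counts for illustration only; not taken from Google's data.
fdr_full = {1933: 120, 1940: 200, 1945: 310}   # "Franklin Delano Roosevelt"
fdr_short = {1940: 50, 1945: 90, 1950: 40}     # "FDR"

combined = combine_series(fdr_full, fdr_short)
print(combined)
```

The same trick would handle the capitalization problem: gather the counts for each capitalized variant of a phrase separately, then add them together.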

As my kindly doctor says, he won’t prescribe for an illness he cannot diagnose.  So, as a complete novice, I am not endorsing Ngram for serious historical research.  I have concluded, however, that in a competition between Angry Birds and Ngram, the latter is a far more fascinating diversion.

Here’s a comparison I ran on the terms “one nation under God” and “one nation indivisible.”  Remembering that “under God” became a pressing issue in the 1950s and that its addition to the Pledge of Allegiance was signed into law by President Eisenhower in 1954, this graph again makes good sense.

Now let me offer something a little more nuanced, comparing the terms “George Washington” and “Abraham Lincoln” (and remembering that “President Washington” or “Abe Lincoln” might be good terms to one day combine in a total search).  The results are below.

What to make of this?  I would have bet that Lincoln had more sheer volume of mentions than Washington, at least in the last generation, but it turns out to be the opposite.  We can see increases at the time of Lincoln’s 100th birthday in 1909, and Washington’s 200th in 1932.  Beyond that, it appears that Washington really remains first in the hearts (or at least the publications) of his countrymen.  (To complicate the picture, when John Adams is added, he dominates both Washington and Lincoln for most of the 19th century before falling behind permanently around 1900.)

Another graph shows the comparison of four wartime events and seems more straightforward, with the emotional force of Pearl Harbor clearly reflected in literature.

Having written about the history of air conditioning for United Technologies (“Weathermakers to the World,” 2012), I was curious to see what would happen when I tested the term.  Sure enough, the history of the technology was plotted on the screen, from its introduction to the public in movie theaters and department stores beginning around 1925, to its hyped status in the 1930s as a technology capable of pulling America out of the Great Depression, to its growth as the parents of the Baby Boomers returned from WWII and invaded suburbia.

When the New York Times called America’s 1970 census “the Air-Conditioned Census,” it resulted in a decade of torrid press until air conditioning became a mature, more mundane topic.  As climate change becomes a persistent topic (and Google updates its Ngram data beyond 2008), we might well see another upsurge in “air conditioning” literature.

I graphed a small sample of American historians, just to get a sense for the push and pull of various interpretations.  (I could see this as the dreaded final exam in a History class, with the simple instructions: “Comment.”)  

Remembering that every “John Fiske” (historian, philosopher and other) ever written about is contained in the Ngram results, I leave it to my professional historian friends to make sense of this chart.  I might add only that, knowing the world a bit as an entrepreneur, the emphasis on Frederick Jackson Turner’s frontier thesis beginning in the 1980s is not surprising, as he is the adopted historian of the high-tech crowd.

Finally, as some of this writing was done with the World Series raging, I wanted to compare the Cardinals and Red Sox. Being a lifelong Sox fan, I was overjoyed to learn that the recent separation in press shown by Ngram accurately foretold the results of the on-field competition.

1 comment:

hcr said...

This is GREAT!!! Historical tinkering at its best!