Simanaitis Says

On cars, old, new and future; science & technology; vintage airplanes, computer flight simulation of them; Sherlockiana; our English language; travel; and other stuff


CALL ME Big Data. Several years ago—never mind how long precisely—computers became potent enough to analyze huge collections of information. These metastudies of data mining have included everything from medical reports to classical rock lyrics.


A fascinating piece by Steve Lohr in The New York Times, January 26, 2013, describes Big Data techniques applied to literature. And I mean really big: One of these computer analyses considered 3592 different works published from 1780 to 1900.

This data mining was done by Matthew L. Jockers, assistant professor of English at the University of Nebraska as well as researcher at that institution’s Center for Digital Research in the Humanities. He has a book coming out on the subject, Macroanalysis: Digital Methods & Literary History.


Macroanalysis: Digital Methods & Literary History, by Matthew L. Jockers, University of Illinois Press, June 2013.

As with other Big Data analyses, his computer algorithms seek out patterns and themes that link to each other. His particular mining, though, takes place in novels, not societal trends. Like Google’s ranking of websites, these algorithms assess the strength of the links.
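For the curious, here is a minimal sketch of what "assessing the strength of the links" can look like. It is a generic PageRank-style calculation, not Jockers' actual algorithm. The novel names, the weights and the function `rank_by_links` are all made up for illustration. An entry such as "Novel A" pointing to "Novel B" with weight 0.9 is assumed to mean that A strongly echoes B in style or theme.

```python
# Hypothetical similarity links between novels. The weights are invented
# for illustration and are not taken from any real study.
links = {
    "Novel A": {"Novel B": 0.9, "Novel C": 0.4},
    "Novel B": {"Novel C": 0.7},
    "Novel C": {"Novel A": 0.2},
    "Novel D": {"Novel A": 0.8, "Novel C": 0.5},
}


def rank_by_links(links, damping=0.85, iterations=50):
    """PageRank-style scoring: a work scores high when it is strongly
    linked from other works that themselves score high."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every work starts each round with a small baseline score.
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source in nodes:
            targets = links.get(source, {})
            total = sum(targets.values())
            if total == 0:
                # A work with no outgoing links shares its score evenly.
                for n in nodes:
                    new[n] += damping * score[source] / len(nodes)
                continue
            # Otherwise it passes its score along in proportion to link weight.
            for target, weight in targets.items():
                new[target] += damping * score[source] * weight / total
        score = new
    return score


scores = rank_by_links(links)
for title, value in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{title}: {value:.3f}")
```

In this toy example nothing links to "Novel D," so it ends up at the bottom of the ranking; the works that many others echo rise to the top.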

One interesting result concerns the influence of authors on the works of others. Conventional wisdom and previous “micro” literary analyses place the likes of Charles Dickens and Mark Twain among the most influential novelists of the 19th century. However, Jockers’ macrostudy identifies Pride and Prejudice’s Jane Austen and Ivanhoe’s Sir Walter Scott as having the greatest effect on other authors in terms of style and themes.

Harvard researchers Jean-Baptiste Michel and Erez Lieberman-Aiden undertook similar mining of Big Data in the repository of Google Books. Their goal was to track word usage over time, compare related words and invoke graphical techniques in analyzing the usage.


Lieberman-Aiden and Michel describe Culturomics with graphical techniques.

Google has scanned 20 million books, some dating back to 1500. As an example of usage over time, it’s noted that the word “men” absolutely overwhelmed “women” in printed appearance for centuries—until a crossover in 1985, with “women” leading ever since.

In Science, 14 January 2011, Michel and colleagues use the term “Culturomics” to describe this application of Big Data techniques to quantifying culture. An example offered is purely a temporal one: References to the word “1880” peaked in that year and fell to half this amount by 1912, a lag of 32 years. By contrast, “1973” had a half-life of only 10 years.
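The half-life arithmetic is simple enough to sketch. The yearly counts below are invented so that the result matches the 32-year lag quoted above; they are not the Google Books figures, and the function `half_life` is my own hypothetical helper, not anything from the Science paper.

```python
def half_life(counts_by_year):
    """Return (peak_year, years until the count first falls to half
    the peak or below), or (peak_year, None) if it never does."""
    peak_year = max(counts_by_year, key=counts_by_year.get)
    half = counts_by_year[peak_year] / 2
    for year in sorted(y for y in counts_by_year if y > peak_year):
        if counts_by_year[year] <= half:
            return peak_year, year - peak_year
    return peak_year, None


# Made-up counts of the string "1880" per year; only the shape matters.
mentions_1880 = {
    1878: 20, 1879: 60, 1880: 100, 1890: 80,
    1900: 65, 1911: 52, 1912: 50, 1913: 48,
}
print(half_life(mentions_1880))  # (1880, 32)
```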

Remarked Michel, “We are forgetting our past faster with each passing year.”

Another study, performed by Cornell computer scientist Jon Kleinberg and colleagues, mined the Big Data of “Quotes” offered at Internet Movie Database. They studied quoted lines from about 1000 films and their recurrences and related links on the Web.

The Internet Movie Database contains a wealth of film nuggets waiting to be mined.

For instance, there’s the classic line from Apocalypse Now, “I love the smell of napalm in the morning.” The computer algorithm recognized its similarity to an advertising tag line, precisely the same but for replacing the word “napalm” with “coffee.”
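Spotting a one-word swap like this is easy to sketch, though what follows is only a toy and surely not what the Cornell group actually ran. The helper names `tokens` and `one_word_swap` are hypothetical, and the tag line is reconstructed from the description above.

```python
import re


def tokens(line):
    """Lowercase a line and split it into words, ignoring punctuation."""
    return re.findall(r"[a-z']+", line.lower())


def one_word_swap(a, b):
    """If two lines match word for word except at exactly one position,
    return the differing pair (word_in_a, word_in_b); otherwise None."""
    ta, tb = tokens(a), tokens(b)
    if len(ta) != len(tb):
        return None
    diffs = [(x, y) for x, y in zip(ta, tb) if x != y]
    return diffs[0] if len(diffs) == 1 else None


quote = "I love the smell of napalm in the morning."
tagline = "I love the smell of coffee in the morning."
print(one_word_swap(quote, tagline))  # ('napalm', 'coffee')
```

A real system would also have to cope with added or dropped words, which this position-by-position comparison does not.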

Now there’s a nugget of Big Data. ds

© Dennis Simanaitis, 2013


This entry was posted on February 1, 2013 in Sci-Tech.