From Galileo to Google: How Big Data Illuminates Human Culture
By Maria Popova
Given my longtime fascination with the so-termed digital humanities and with data visualization, and my occasional dabbles in the intersection of the two, I’ve followed the work of data scholars Erez Aiden and Jean-Baptiste Michel with intense interest since its public beginnings. Now, they have collected and contextualized their findings in the compelling Uncharted: Big Data as a Lens on Human Culture (public library) — a stimulating record of their seven-year quest to quantify cultural change through the dual lens of history and digital data by analyzing the contents of the 30 million books digitized by Google, using Google’s Ngram Viewer tool to explore how the usage frequency of specific words changes over time and what that might reveal about corresponding shifts in our cultural values and beliefs about economics, politics, health, science, the arts, and more.
Aiden and Michel, who met at Harvard’s Program for Evolutionary Dynamics and dubbed their field of research “culturomics,” contextualize the premise:
At its core, this big data revolution is about how humans create and preserve a historical record of their activities. Its consequences will transform how we look at ourselves. It will enable the creation of new scopes that make it possible for our society to more effectively probe its own nature. Big data is going to change the humanities, transform the social sciences, and renegotiate the relationship between the world of commerce and the ivory tower.
And big data is indeed big — humongous, even. Each of us, on average, has an annual data footprint of nearly one terabyte, and together we amount to a staggering five zettabytes per year. Since each byte consists of eight bits — short for “binary digits,” with each bit representing a binary yes-no question answered either by a 1 (“yes”) or a 0 (“no”) — humanity’s aggregate annual data footprint is equivalent to a gobsmacking forty sextillion (40,000,000,000,000,000,000,000) bits. Aiden and Michel humanize these numbers, so challenging for the human brain to grasp, with a pause-giving analog analogy:
If you wrote out the information contained in one megabyte by hand, the resulting line of 1s and 0s would be more than five times as tall as Mount Everest. If you wrote out one gigabyte by hand, it would circumnavigate the globe at the equator. If you wrote out one terabyte by hand, it would extend to Saturn and back twenty-five times. If you wrote out one petabyte by hand, you could make a round trip to the Voyager 1 probe, the most distant man-made object in the universe. If you wrote out one exabyte by hand, you would reach the star Alpha Centauri. If you wrote out all five zettabytes that humans produce each year by hand, you would reach the galactic core of the Milky Way. If instead of sending e-mails and streaming movies, you used your five zettabytes as an ancient shepherd might have—to count sheep—you could easily count a flock that filled the entire universe, leaving no empty space at all.
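The forty-sextillion figure cited above is simple arithmetic, and a minimal sketch makes it easy to check. It assumes decimal (SI) units, in which one zettabyte is 10^21 bytes; the constant and function names are illustrative only and do not come from the book.

```python
# Assumption: decimal (SI) prefixes, so 1 zettabyte = 10**21 bytes.
BITS_PER_BYTE = 8
ZETTABYTE = 10**21


def bytes_to_bits(n_bytes: int) -> int:
    """Convert a byte count to a bit count (8 bits per byte)."""
    return n_bytes * BITS_PER_BYTE


# Humanity's aggregate annual data footprint, per the figure quoted in the text.
annual_bytes = 5 * ZETTABYTE
annual_bits = bytes_to_bits(annual_bytes)

# Print with thousands separators so the zeros can be counted.
print(f"{annual_bits:,}")
```

Five zettabytes times eight bits per byte gives 4 × 10^22 bits, which is forty sextillion on the short scale.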
But what makes our age unlike any preceding era is precisely that this information exists not as handwritten documents but as digital data, which opens up wholly new frontiers for making sense of the meaning embedded in these seemingly meaningless strings of 1s and 0s. Aiden and Michel put it beautifully:
Like an optic lens, which makes it possible to reliably transform and manipulate light, digital media make it possible to reliably transform and manipulate information. Given enough digital records and enough computing power, a new vantage point on human culture becomes possible, one that has the potential to make awe-inspiring contributions to how we understand the world and our place in it.
Aiden and Michel have focused their efforts on one particular, and particularly important, aspect of the big-data universe: books. More specifically, the more than 30 million books digitized by Google, or roughly a quarter of humanity’s existing books. They call this digital library “one of the most fascinating datasets in the history of history,” and it certainly is, not least for its scale, which exceeds the collections of any university library, from Oxford’s 11 million volumes to Harvard’s 17 million, as well as the National Library of Russia with its 15 million and the National Library of China with its 26 million. At the outset of Aiden and Michel’s project, the only analog library still greater than the Google Books collection was the Library of Congress, which contains 33 million — but Google may well have surpassed that number by now.
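The basic measurement behind this kind of work, how often a word appears in a given year relative to all the words printed that year, can be sketched in a few lines. This is only a toy illustration under loud assumptions: the `yearly_frequency` function and the three-item `toy_books` list are hypothetical, whitespace splitting stands in for real tokenization, and none of it represents the authors’ or Google’s actual pipeline.

```python
from collections import Counter


def yearly_frequency(books, word):
    """Return {year: share of that year's words that equal `word`}.

    `books` is an iterable of (year, text) pairs. Tokenization is a naive
    lowercase whitespace split, which is an assumption made for brevity.
    """
    hits = Counter()    # occurrences of `word` per year
    totals = Counter()  # all tokens per year
    target = word.lower()
    for year, text in books:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    # Skip years with no tokens to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


# Hypothetical miniature "library": invented sentences, not real book data.
toy_books = [
    (1900, "the telescope shows the planet"),
    (1900, "the canals of the planet"),
    (2000, "data about the planet and more data"),
]

freq = yearly_frequency(toy_books, "data")
for year, share in freq.items():
    print(year, round(share, 3))
```

Dividing by each year’s total word count matters because far more books are printed in some years than in others; raw counts alone would mostly track the size of the library rather than the fortunes of the word.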
Still, big data presents a number of problems. For one, it’s messy — something that doesn’t sit well with scientists’ preference for “carefully constructed questions using elegant experiments that produce consistently accurate results,” Aiden and Michel point out. By contrast, a big dataset tends to be “a miscellany of facts and measurements, collected for no scientific purpose, using an ad hoc procedure … riddled with errors, and marred by numerous, frustrating gaps.”
To further complicate things, big data doesn’t comply with the basic premise of the scientific method — rather than eventuating causal relationships born out of pre-existing hypotheses, it presents a seemingly bottomless pit of correlations awaiting discovery, often through the combination of doggedness and serendipity, an approach diametrically opposed to hypothesis-driven research. But that, arguably, is exactly what makes big data so alluring — as Stuart Firestein has argued in his fantastic case for why ignorance rather than certitude drives science, modern science could use what the scientific establishment so readily dismisses as “curiosity-driven research” — exploratory, hypothesis-free investigations of processes, relationships, and phenomena.
Michel and Aiden address these biases of science:
As we continue to stockpile unexplained and underexplained patterns, some have argued that correlation is threatening to unseat causation as the bedrock of scientific storytelling. Or even that the emergence of big data will lead to the end of theory. But that view is a little hard to swallow. Among the greatest triumphs of modern science are theories, like Einstein’s general relativity or Darwin’s evolution by natural selection, that explain the cause of a complex phenomenon in terms of a small set of first principles. If we stop striving for such theories, we risk losing sight of what science has always been about. What does it mean when we can make millions of discoveries, but can’t explain a single one? It doesn’t mean that we should give up on explaining things. It just means that we have our work cut out for us.
Such curiosity-driven inquiries speak to the heart of science — the eternal question of what science actually is — which Michel and Aiden capture elegantly:
What makes a problem fascinating? No one really agrees. It seemed to us that a fascinating question was something that a young child might ask, that no one knew how to answer, and for which a few person-years of scientific exploration — the kind of effort we could muster ourselves — might result in meaningful progress. Children are a great source of ideas for scientists, because the questions they ask, though superficially simple and easy to understand, are so often profound.
The promise of big data, it seems, is at once to return us to the roots of our childlike curiosity and to advance science to new frontiers of understanding the world. Much like the invention of the telescope transformed modern science and empowered thinkers like Galileo to spark a new understanding of the world, the rise of big data, Aiden and Michel argue, offers to “create a kind of scope that, instead of observing physical objects, would observe historical change” — and, in the process, to catapult us into unprecedented heights of knowledge:
The great promise of a new scope is that it can take us to uncharted worlds. But the great danger of a new scope is that, in our enthusiasm, we too quickly pass from what our eyes see to what our mind’s eye hopes to see. Even the most powerful data yields to the sovereignty of its interpreter. … Through our scopes, we see ourselves. Every new lens is also a new mirror.
They illustrate this with an example by way of Galileo himself, who began a series of observations of Mars in the fall of 1610 and soon noticed something remarkably curious: Mars seemed to be getting smaller and smaller as the months progressed, shrinking down to a third of its September size by December. This, of course, indicated that the planet was drifting farther and farther from Earth, which went on to become that essential piece of evidence demonstrating that the Ptolemaic idea of the geocentric universe was wrong: Earth wasn’t at the center of the cosmos, and the planets were moving according to their own orbits.
But Galileo, with his primitive telescope, couldn’t see any detail of the red planet’s surface — that didn’t happen until centuries later when an astronomer by the name of Giovanni Schiaparelli aimed his far more powerful telescope at Mars. Suddenly, before his eyes were mammoth ridges that covered the planet’s surface like painted lines. These findings made their way to a man named Percival Lowell and impressed him so that in 1894, he built an entire observatory in Flagstaff, Arizona, equipped with a yet more powerful telescope, so that he could observe those mysterious lines. Lowell and his team went on to painstakingly record and map Mars’s mesh of nearly 700 criss-crossing “canals,” all the while wondering how they might have been created.
Turning to the previous century’s theory that Mars’s scarce water reserves were contained in the planet’s frozen poles, Lowell assumed that the lines were a meticulous network of canals made by the inhabitants of a perishing planet in an effort to rehydrate it back to life. Based solely on his telescopic observations and the hypotheses of yore, Lowell concluded that Mars was populated by intelligent life — a “discovery” that at once excited and riled the scientific community, and even permeated popular culture. Even Henry Norris Russell, the unofficial “dean of American astronomers,” called Lowell’s ideas “perhaps the best of the existing theories, and certainly the most stimulating to the imagination.” And so they were — by 1898, H.G. Wells had penned The War of the Worlds.
While Lowell’s ideas dwindled in the decades that followed, they still held their appeal. It wasn’t until NASA’s landmark Mariner mission beamed back close-up photos of Mars — the significance of which Carl Sagan, Ray Bradbury, and Arthur C. Clarke famously debated — that the anticlimactic reality set in: There were no fanciful irrigation canals, and no little green men who built them.
The moral, as Aiden and Michel point out, is that “Martians didn’t come from Mars: They came from the mind of [Lowell].”
What big data offers, then, is hope for unbridling some of our cultural ideas and ideologies from the realm of myth and anchoring them instead to the spirit of science — which brings us to the crux of the issue:
Digital historical records are making it possible to quantify our human collective as never before.
Human history is much more than words can tell. History is also found in the maps we drew and the sculptures we crafted. It’s in the houses we built, the fields we kept, and the clothes we wore. It’s in the food we ate, the music we played, and the gods we believed in. It’s in the caves we painted and the fossils of the creatures that came before us. Inevitably, most of this material will be lost: Our creativity far outstrips our record keeping. But today, more of it can be preserved than ever before.
What makes Aiden and Michel’s efforts particularly noteworthy, however, is that they are as much a work of scrupulous scholarship as of passionate advocacy. They are doing for big data in the humanities what Neil deGrasse Tyson has been doing for space exploration, instigating both cultural interest and government support. They remind us that in today’s era of big science, where the Human Genome Project’s price tag was $3 billion and the Large Hadron Collider’s quest for the Higgs boson cost $9 billion, there is an enormous disconnect between the cultural value of the humanities and the actual price we put on better understanding human history — by contrast to such big science enterprises, the entire annual budget of the National Endowment for the Humanities is a mere $150 million. Michel and Aiden remind us just what’s at stake:
The problem of digitizing the historical record represents an unprecedented opportunity for big-science-style work in the humanities. If we can justify multibillion-dollar projects in the sciences, we should also consider the potential impact of a multibillion-dollar project aimed at recording, preserving, and sharing the most important and fragile tranches of our history to make them widely available for ourselves and our children. By working together, teams of scientists, humanists, and engineers can create shared resources of extraordinary power. These efforts could easily seed the Googles and Facebooks of tomorrow. After all, both these companies started as efforts to digitize aspects of our society. Big humanities is waiting to happen.
And yet the idea is nothing new. Count on the great Isaac Asimov to have presaged it, much like he did online education, the fate of space exploration, and even Carl Sagan’s rise to stardom. In his legendary Foundation trilogy, Asimov conceives his hero, Hari Seldon, as a masterful mathematician who can predict the future through complex mathematical equations rooted in aggregate measurements about the state of society at any given point in time. Like Seldon, who can’t anticipate what any individual person will do but can foreshadow larger cultural outcomes, big data, Aiden and Michel argue, is the real-life equivalent of Asimov’s idea, which he termed “psychohistory” — an invaluable tool for big-picture insight into our collective future.
Perhaps more than anything, however, big data holds the promise of righting the balance of quality over quantity in our culture of information overabundance, helping us to extract meaning from (digital) matter. In a society that tweets more words every hour than all of the surviving ancient Greek texts combined, we certainly could use that.
Uncharted is an excellent and timely read in its entirety, both as a curious window into the secret life of language and as an important piece of advocacy for the value of the digital humanities in the age of data. Sample the project with Aiden and Michel’s entertaining and illuminating TED talk:
Published January 17, 2014