This week I gave a talk at a workshop organized by Chris Chase-Dunn and Hiroko Inoue at the University of California, Riverside. The talk was about the current status of Seshat: Global History Databank. As I was preparing the talk, I read an article in the Atlantic about digital archaeology, “Archaeology’s Information Revolution.” Among other things, the article discussed a huge database of Biblical-era pottery, a project led by Thomas Levy, an archaeologist at the University of California, San Diego.
“We’re collecting billions of those data points,” Levy told Adrienne Lafrance, the author of the story. Later Lafrance writes,
It’s mind-boggling to think of the amount of data now flowing into the annals of archaeology. But the same thing that makes all this data useful—the sheer volume of information—presents difficult new challenges. Archaeologists aren’t yet sure about the best way to preserve these datasets, and they don’t know how, and in what format, they should be shared across networks.
After reading this article I went back to putting together my talk. One slide in it reported on where we are in Seshat—we are coding roughly 1500 variables for over 400 polities (a polity is an independent political unit, which includes not only states and empires, but also city-states, chiefdoms, and even politically independent villages). As those of you who follow Seshat on Twitter know, we currently have over 115,000 coded data points. This is a truly massive amount of data.
But as I finished this summary slide, I thought, will my colleagues be impressed by the scale of Seshat? After all, what are a mere 100,000 data points compared to the billions and billions of data points that digital archaeologists and digital historians deal with? The Atlantic article mentioned another data project by Sarah Parcak, a “space archaeologist” (I love that moniker), who analyzes satellite imagery of Earth to find unknown archaeological sites. Who knows how many gigabytes, terabytes, or perhaps even petabytes of information she deals with every day?
But as the Atlantic article makes very clear, there is a big difference between those billions of data points and the data in Seshat. What we have in Seshat is really not data, but facts, each being a complex of curated information. Let me illustrate this with one such fact. I will use the coded value of one particular variable, population of the largest settlement, for a particular polity, the New Kingdom of Egypt in the Ramesside period. The bare fact looks like this:
♠ Population of the largest settlement ♣ [250,000-300,000] ♥
The square brackets indicate that the precise value for this variable lies somewhere between 250,000 and 300,000 people. In other words, Seshat “knows” not only how big this city was, but also that there is some uncertainty about this estimate, and what the limits of this uncertainty are. In addition, Seshat can record when there is not only uncertainty, but disagreement among experts. For example, there are two schools of thought on the population of Italy under the first Roman emperor, Augustus. Their estimates differ by a factor of three. Seshat takes note of such disagreements.
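For readers who like to see such things concretely, here is a minimal Python sketch of how a coded value in the notation above might be read into a structure that keeps the uncertainty range. The names CodedValue and parse_coded_value are hypothetical and are not part of Seshat's actual software. Accepting a single number without brackets is also just an assumption for illustration, and the sketch does not try to model disagreement among experts.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedValue:
    """One coded value: a variable name plus a lower and an upper bound."""
    variable: str
    low: int
    high: int

    @property
    def is_uncertain(self) -> bool:
        # A range with distinct bounds signals uncertainty about the value.
        return self.low != self.high


# Matches lines such as: ♠ Variable name ♣ [250,000-300,000] ♥
# The brackets and the upper bound are optional (an assumption of this
# sketch), so a single value such as "♠ Name ♣ 5,000 ♥" is accepted too.
_PATTERN = re.compile(
    r"♠\s*(?P<variable>.+?)\s*♣\s*\[?\s*(?P<low>[\d,]+)\s*"
    r"(?:-\s*(?P<high>[\d,]+))?\s*\]?\s*♥"
)


def parse_coded_value(line: str) -> CodedValue:
    """Parse one coded line into a CodedValue (hypothetical helper)."""
    match = _PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a coded value: {line!r}")
    low = int(match["low"].replace(",", ""))
    high_text = match["high"]
    # With no upper bound given, treat the value as a point estimate.
    high = int(high_text.replace(",", "")) if high_text else low
    return CodedValue(match["variable"], low, high)


fact = parse_coded_value(
    "♠ Population of the largest settlement ♣ [250,000-300,000] ♥"
)
print(fact.variable, fact.low, fact.high, fact.is_uncertain)
```

The point of keeping both bounds rather than a midpoint is that any later analysis can see how wide the uncertainty is, instead of treating the estimate as exact.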
And this is not all. Seshat also knows which city we are talking about—it’s Pi-Ramesses, which was located in the eastern delta of the Nile. There is a descriptive paragraph following the numerical estimate, which explains where it comes from:
“the later Ramesside period marked a new era, when Pi-Ramesses, in the eastern Delta, became the main capital of the kingdom. The Austrian excavations are gradually revealing the huge dimensions and complexity of this metropolis of about 18 km² and 250,000–300,000 dwellers.”
This paragraph is a quote from the reference, which a Seshat research assistant located while searching for information about the New Kingdom. The RA also found another article, which reports the results of excavations at Pi-Ramesses, maps the city's topography, and gives its extent in hectares; the entry includes a link to that article.
Today the link is simply the citation of the article. But in the not too distant future, we will have a live link that you can follow to an archaeological database to view the excavation map and many other things we know about this city.
Finally, there is additional information coming from expert historians. Eventually all facts in Seshat will be vetted by at least one academic historian or archaeologist, who is an expert on the coded society. In this case, the population estimate of Pi-Ramesses was discussed in a workshop we ran a year and a half ago in Oxford, which brought together five world-famous experts on Ancient Egypt, so our degree of confidence in this information is quite high.
This is what I mean by the difference between a fact and a mere “datum” (the singular of data).
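One way to picture the difference is as a record with several layers around the bare number. The Python sketch below is only an illustration: the Fact class and its field names are hypothetical, not Seshat's actual schema, and the reference entry is a placeholder rather than a real citation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fact:
    """A curated fact: a value plus the context that makes it usable."""
    polity: str
    variable: str
    low: int                      # lower bound of the estimate
    high: int                     # upper bound of the estimate
    description: str = ""         # explanatory note or quoted passage
    references: List[str] = field(default_factory=list)
    vetted_by: List[str] = field(default_factory=list)

    @property
    def is_vetted(self) -> bool:
        # True once at least one expert review has been recorded.
        return len(self.vetted_by) > 0


pi_ramesses = Fact(
    polity="New Kingdom of Egypt (Ramesside period)",
    variable="Population of the largest settlement",
    low=250_000,
    high=300_000,
    description="Pi-Ramesses, in the eastern delta of the Nile",
    references=["<citation placeholder>"],  # not a real citation
    vetted_by=["expert workshop, Oxford"],
)

# A bare datum, by contrast, would be just the number with no context.
datum = 250_000

print(pi_ramesses.is_vetted, pi_ramesses.high - pi_ramesses.low)
```

Everything except low and high is what a raw data point lacks: which society and variable it belongs to, where the number comes from, and whether an expert has checked it.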
So I think you can see now that 115,000 such facts is, indeed, a very impressive number.
I am currently analyzing the variables in Seshat related to the social complexity of past societies, and I will soon blog about these preliminary results. And that’s another huge advantage of structured data in Seshat: we are using them to test theories about historical dynamics and cultural evolution. In contrast, the mind-boggling amount of data flowing into archaeological databases is precisely that: mind-boggling. The sheer volume actually makes them very hard to use. It’s akin to drinking from a firehose.
Not a terribly convenient way to slake your thirst. Whereas Seshat is like bottled water: a much smaller volume, but infinitely more useful for understanding the cultural evolution of past societies.