Earlier this month we ran a semi-annual general meeting of the Seshat project in Santa Fe, New Mexico. This came about because many of us were already in Santa Fe to participate in another workshop at the Santa Fe Institute, and since we were there anyway, we decided to stay in this beautiful location for our own meeting. (I spent my 2008 sabbatical year at the SFI, and have tended to return there once or twice a year since.)
The SFI–Seshat workshop was organized by Peter Peregrine, who has led such database initiatives as the Atlas of Cultural Evolution and is involved in the Human Relations Area Files (he is also the archaeology consultant for Seshat, as well as the editor for several of the world areas for which we are currently collecting data). The general idea that motivated this workshop is this: until now, the groups building databases of archaeological and historical data have worked largely in isolation. But it doesn’t make sense to continue doing so – it’s wasteful of limited resources, and we can achieve much more if we cooperate. How to do it is, of course, a very big question, and I’ll return to it later in this post.
Our own meeting went very well. We are now profiting from several years of very intense conceptual work devoted to honing our approach for coding data that can be used to test theories about cultural evolution and historical dynamics. Last Fall, at the previous semi-annual meeting in Oxford, we largely finalized the first set of variables that we have been focusing on over the last three years: social complexity, rituals, warfare, and resources. Since then we have been busy collecting data, and data are rolling in at a really amazing rate. In December of 2014 Seshat contained 28,000 data points; that amount doubled during the Spring of 2015, and it will double again by the time we start analyzing data next Fall.
We are currently expanding the list of coded variables to include institutions, equity, economics, and well-being. We started the process when we received two research grants last year (from the Tricoastal Foundation and the John Templeton Foundation, respectively), and we are still working out how precisely we are going to capture these difficult-to-quantify concepts. Currently we are at the stage where the talented post-docs supported by these grants (Dan Hoyer, Dan Mullins, and Alessio Palmisano) are testing and honing the current version of the Code Book for these variables. I expect that in the Fall we will be able to put the collection of these data on the conveyor line – meaning that we put our dedicated Research Assistants on the job.
The third workshop we ran in Santa Fe focused on what is happening under the hood of the Seshat Databank. Currently Seshat is implemented as a Wiki – a text-based platform that is very flexible and forgiving of frequent revisions of our data collection schemes. This has been an asset during the early days of developing the databank, but as data continue to roll in, it is becoming increasingly labor-intensive to propagate updates to our coding schemes through the mass of data.
Fortunately, we already have a plan for the transition to a much more powerful implementation of the database, thanks to our information technology collaborators at Trinity College Dublin (TCD), who have received a large grant from the European Union’s Horizon 2020 program.
Traditional relational databases organize data as tables; our TCD colleagues Kevin Feeney and Rob Brennan advocate instead a graph-based approach to representing data relations. I wrote about this technology in a previous blog post.
Now, thanks to the Horizon 2020 grant, we have started the move from the current Wiki implementation to the Seshat Triplestore (the transition will occur in several steps and will be complete only in 2016).
I am extremely excited about the move. Some of the advantages are obvious. We will have more robust data entry and checking tools. Less obviously, we will be able to (semi)automate many of the routine tasks, dramatically increasing the efficiency, quality, and quantity of data flowing into Seshat.
But the graph-based approach to representing data also has huge implications for the question with which I started this post: how do we integrate the different groups, each working on its own separate database initiative?
In the relational-database approach, data are organized as tables, and it takes a human to interpret what the rows and columns mean – computers have no ‘understanding’ of the data. In the graph-based approach, a databank includes not only the data but also a formal, machine-readable description of what the data mean. Producing this involves a lot of additional labor – developing vocabularies, specifying formal relations between concepts, and so on. But the investment pays for itself down the road, because other data sets that have been treated in the same manner can be linked in such a way that they can be read automatically by computers (see Linked Data).
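To make the contrast concrete, here is a minimal sketch in Python. All the identifiers below (the `seshat:` prefix, predicate names, and the example values) are hypothetical illustrations, not the actual Seshat vocabulary; the point is only the difference between a bare row and self-describing subject–predicate–object triples.

```python
# Relational style: a single row. A human must know, out of band, that the
# columns mean (polity name, start year, end year, peak population).
row = ("Roman Empire", -27, 476, 60_000_000)

# Graph style: every fact is an explicit triple, and the predicates
# themselves can be defined in a shared, machine-readable vocabulary.
# (Prefixes and predicate names here are invented for illustration.)
triples = [
    ("seshat:RomanEmpire", "rdf:type",          "seshat:Polity"),
    ("seshat:RomanEmpire", "seshat:startYear",  -27),
    ("seshat:RomanEmpire", "seshat:endYear",    476),
    ("seshat:RomanEmpire", "seshat:population", 60_000_000),
]

def values_of(graph, subject, predicate):
    """All objects attached to `subject` via `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]

# Because the meaning is carried by the predicate, a program can query
# the graph without hard-coded knowledge of column positions.
print(values_of(triples, "seshat:RomanEmpire", "seshat:population"))
# → [60000000]
```

In a real triplestore the same idea is expressed in RDF and queried with SPARQL rather than list comprehensions, but the structure of the data is the same.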
And that’s what we are going to do over the next 2–3 years. We will link the data contained in repositories such as Seshat, HRAF, and the SFI database on the evolution of early states, and publish them together as a cluster of interlinked data.
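The mechanics of such linking can be sketched very simply. In the snippet below (again with entirely hypothetical identifiers and facts, standing in for real Seshat and HRAF records), two independently curated triple sets are combined just by concatenation, and a single `owl:sameAs`-style link is enough to let a query about one group’s entity pick up the other group’s facts.

```python
# Two independently built triple sets, each with its own identifiers.
# (All names and values here are invented for illustration.)
seshat_data = [
    ("seshat:RomanEmpire", "seshat:peakPopulation", 60_000_000),
]
hraf_data = [
    ("hraf:Rome", "hraf:subsistenceType", "agriculture"),
]

# One bridging triple asserts that the two identifiers denote the
# same entity, in the spirit of owl:sameAs in Linked Data.
links = [
    ("seshat:RomanEmpire", "owl:sameAs", "hraf:Rome"),
]

combined = seshat_data + hraf_data + links

def same_as(graph, subject):
    """All identifiers equated with `subject` (including itself)."""
    ids = {subject}
    for s, p, o in graph:
        if p == "owl:sameAs":
            if s in ids:
                ids.add(o)
            elif o in ids:
                ids.add(s)
    return ids

def facts_about(graph, subject):
    """Every (predicate, object) pair known about the entity,
    gathered across all of its equated identifiers."""
    ids = same_as(graph, subject)
    return [(p, o) for s, p, o in graph if s in ids and p != "owl:sameAs"]

# A query about the Seshat entity now also returns the HRAF fact.
print(facts_about(combined, "seshat:RomanEmpire"))
```

Real Linked Data publishing does this with globally unique URIs and shared ontologies rather than string prefixes, but the payoff is the same: independently maintained databanks become jointly queryable without merging or restructuring either one.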