Garbage In? How We Can Improve the Quality of Historical Data




A week ago the urban archaeologist Mike Smith wrote a scathing post about a new article in the journal Scientific Data. In the article, Meredith Reba and coworkers report on how they “spatialized” the dataset on urban settlements, based on previous publications by Tertius Chandler and George Modelski. As Smith writes in his blog, “The data in both Chandler and Modelski are a mess, routinely dismissed by urban demographic historians as worthless for serious scholarship.” The title of his blog post asks, “Why would a journal called ‘Scientific Data’ publish bad data?”

The spatial coverage of points representing all cities included in the final dataset of Reba et al. 2016. Source

With all due respect (and it’s not an empty phrase, I know Mike and greatly respect his work and scholarship), his negative critique is unfair and counterproductive.

It’s unfair because Reba et al. have made an important addition to the Chandler and Modelski data by “spatializing” it. In other words, they added geographic coordinates to urban settlements in the Chandler/Modelski datasets. They also did it in a thoughtful and scholarly manner. Read their paper to see how much care they took with locating the cities on the map. As they write in the abstract, “The dataset creation process also required data cleaning and harmonization procedures to make the data internally consistent. Additionally, we created a reliability ranking for each geocoded location to assess the geographic uncertainty of each data point.”

Mike Smith doesn’t criticize them for doing a poor job of locating the urban settlements on the map. His problem is with the estimates that Chandler and Modelski make about the population sizes of these settlements. But developing better population estimates for cities is something that can be done independently of their geographic location. Reba et al. have added to our knowledge of historical cities by giving them spatial coordinates. They incremented our knowledge. It’s up to urban archaeologists and historians to improve the estimates of settlements’ population sizes.

Unfortunately, these specialists are not eager to increment our knowledge. Here’s what Mike Smith writes in his blog:

I have to admit that I really despair of this situation. I am very upset that such obviously poor data are being used by otherwise rigorous scholars, and I am upset that I don’t have better data. I have talked to quite a few colleagues—archaeologists and ancient historians—about this situation. I have asked if any of them were involved in assembling reliable and accurate data on ancient city sizes in their region of specialty, and the answer has been negative. I have asked if they knew of anyone doing systematic urban demographic history in their region, and again the answer is no. In my own region, Mesoamerica, there was a flurry of demographic work on city size in the 1980s, but then scholars lost interest. I have asked if anyone might be interested in mounting such a systematic comparative project, again with a negative answer.

The upshot for someone who wants to do analyses is: you can’t use the Chandler/Modelski data, and we have nothing better for you to use.

I don’t buy this counsel of despair. In any case, just how bad are the Chandler/Modelski data? Are their estimates off by 50%, by a factor of 2, or even 3? In many analyses, in which we consider settlements ranging in size from hundreds to millions (that’s four orders of magnitude), a mere factor of 2 is not that much of an error. So scholars argue about whether the population of Rome in the first century BCE was 0.5 or 1 million. After you have log-transformed these numbers, it’s not going to make that much difference to a global cross-cultural analysis that includes the whole spectrum of settlement sizes across the last 10,000 years.
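To make the arithmetic concrete, here is a minimal Python sketch. It uses only the numbers mentioned above: settlement sizes from 100 to 1,000,000 and the two rival estimates for Rome. The variable names are illustrative, and the use of base-10 logarithms is an assumption (any base gives the same ratio).

```python
import math

# Range of settlement sizes considered in a cross-cultural analysis
# (illustrative endpoints taken from the text: 100s to 1,000,000s).
smallest = 100
largest = 1_000_000

# The two rival estimates for Rome in the first century BCE quoted above.
rome_low = 500_000
rome_high = 1_000_000

# Width of the whole spectrum on the log10 scale (orders of magnitude).
span = math.log10(largest) - math.log10(smallest)

# Width of the disagreement about Rome on the same scale: log10(2).
gap = math.log10(rome_high) - math.log10(rome_low)

print(f"Full range on the log10 scale: {span:.2f}")
print(f"Gap between the Rome estimates: {gap:.2f}")
print(f"Gap as a share of the range: {gap / span:.1%}")
```

On the log scale the factor-of-2 disagreement is about 0.3 units out of 4, or roughly 7.5% of the full range. Whether an error of that size matters will of course depend on the particular analysis.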

This is not to say that I endorse the Chandler/Modelski conceptual approach. In the Seshat project we use a much more sophisticated one. First, we don’t simply provide a “point estimate”, e.g. 1,000,000. If there is a significant degree of uncertainty, our research assistants are instructed to code it with a range. It can be, for example, [500,000–2,000,000], and that’s fine: this is a useful datum. Second, when experts disagree, we include both (or more) rival estimates. Finally, these estimates are just the proverbial tip of the iceberg. We also include explanations of where they come from. Eventually we are going to connect to more detailed archaeological databases that provide the solid scientific basis for these estimates. See my post on the Anatomy of a Seshat Fact.
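As a rough illustration of the three ideas above (ranges instead of point estimates, rival estimates kept side by side, and a note on where each comes from), here is a small Python sketch. It is not the actual Seshat schema: the class name `PopulationEstimate`, its fields, and the placeholder source labels are all hypothetical, and the numbers are just the examples used in this post.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class PopulationEstimate:
    """One coded estimate: a range plus a note on its origin (hypothetical layout)."""
    low: int
    high: int
    source: str  # free-text explanation of where the estimate comes from

    def log_midpoint(self) -> float:
        """Midpoint of the range on the log10 scale."""
        return (math.log10(self.low) + math.log10(self.high)) / 2


# Two rival estimates for the same city, kept side by side rather than merged.
# The source strings are placeholders, not real citations.
rivals = [
    PopulationEstimate(500_000, 2_000_000, "hypothetical expert A"),
    PopulationEstimate(1_000_000, 1_000_000, "hypothetical expert B"),
]

# The overall envelope covered by all rival estimates.
envelope = (min(e.low for e in rivals), max(e.high for e in rivals))

for e in rivals:
    print(f"[{e.low:,} to {e.high:,}] log10 midpoint {e.log_midpoint():.2f} ({e.source})")
print("Envelope:", envelope)
```

Keeping each estimate as its own record means an analysis can be rerun with the low bounds, the high bounds, or one expert's figures at a time, to check how sensitive the conclusions are to the uncertainty.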

So what the Seshat project offers is an evolutionary way forward that avoids the Scylla and Charybdis of either bad data or despair.


Between Scylla and Charybdis Source

This is how science works. It’s cumulative. We start with naïve ideas, bad approximations, and wrong theories. Then, by applying the scientific method, we get progressively better ideas, more accurate approximations, and logically sound and well-tested theories. So let’s abandon negativism, roll up our sleeves, and get to work!

Mike Smith

Peter – Nice post. You are right that their spatial work seems fine. It seems ironic that the care they took with spatial localization is completely out of proportion to the care that Chandler and Modelski did NOT take with their population estimates. I think I am upset mostly at the economic historians who just take the city size data and analyze it without worrying about its origin or quality. While it is useful to see this from the long perspective as you suggest – as one step along the path to improving data quality – I fear that now more scholars will be eager to keep analyzing the bad city-size data, now that it has been more rigorously “spatialized” and vetted by a supposedly rigorous journal. So, I remain depressed at this situation. We have readily available bad data that lots of people would like to use, and few scholars seem interested in improving it. There will be small pockets of good population estimates for ancient cities – in SESHAT, and in various regional publications. But the nonspecialists will not have the skills or patience to find these data.

Rudy Cesaretti

It’s nice to see such a healthy dialogue between scholars. Mike’s critique of Chandler is dead accurate. But Peter is right that each specialized contribution to historical databases moves the historical sciences forward. One of the best parts of Seshat (in my opinion) is its commitment to the validity and accuracy of the data. Indeed, the scholarly community is well on the road to a new database of city populations (among other variables). The ASU-SFI Urban Scaling working group has made major strides in this direction, and Seshat certainly has the best approach to compiling and integrating the best sources and databases. It would be fantastic to see the Urban Scaling group collaborate with Seshat, as their visions and goals are so kindred.


As the years pile up, I’m getting pretty shaky on what “the scientific method” might be. For my part, if you think of science as what you’ve found out about how things are (or how nature is, if you remember that people are part of nature too,) by actually examining them and changing your ideas to conform to reality as best you know it to be…then making historical databases, and making them more accurate, or at least quantifying as much as you can how accurate or precise they are, simply is science.

Therefore it needs no more defense than similar activities such as mapping the universe, classifying species of organisms or devising a timeline of geological changes. You seem to be suggesting that historical databases are justified because they make statistical testing of hypotheses possible. It’s not clear that this implicitly falsificationist perspective is compatible with the notion of historical science. Further, statistical falsification in such fields as economics and evolutionary psychology doesn’t seem to support any notion of “the” scientific method as the touchstone of knowledge.


If scientists weren’t able to use data that wasn’t good, no science would be possible. It’s not just that data can always be improved, but that there are reasons to believe that science can’t be done without deciding on a completely arbitrary and mostly false starting point, and then bootstrapping your way toward something better. The best example of that is provided by Hasok Chang in his Inventing Temperature (2006). When you have no way of measuring temperature, how do you study it? Well, you start somewhere, even if it’s completely false, and you slowly get to a point where your results converge toward something that looks like it accurately represents the world.

Mike Smith

Sorry, I don’t buy many of these remarks. Can you show me a single scholar who has used the Chandler data (beyond perhaps Morris and Acemoglu, whom I cite), and then proceeded to propose upgrades or amendments to it? Can you show me scholars who have examined the sources of those data and tried to improve the data? It is nice to describe ideal accounts of how science should proceed. No arguments there. But in this particular case, I don’t see much progress being made on a really poor dataset. People simply use the data and assume it is rigorous, when it is not.

I guess one can argue, like Peter, that having accurate locations for cities over the millennia is a good thing. But how much of an advance is this? These cities were not lost, but their spatial coordinates only existed in local or limited-distribution sources. Now we have geo-referenced and more readily available location data. This will make it much easier for people to do shoddy analyses with the data, so perhaps this development will have a negative effect on the progress of scholarship (assuming one agrees that having more published bad analyses is not progress).


© Peter Turchin 2023 All rights reserved
