Enhancing Seshat with large language models

Seshat: The Global History Databank contains a wealth of data on past societies across the globe, stretching from the Neolithic to the present day. Until now, all data has been collected and entered manually by research assistants, with the aid of domain experts. Finding relevant literature, extracting information pertinent to our variables, and entering it into the databank takes many hours of human labour. Each data point has one or more citations, drawn from over 10,000 books, book chapters and articles (as of early 2023).

Recent advances in large language models (LLMs) allow for a synergistic relationship with the Seshat Databank. The text explanations and quotes recorded in each Seshat data point comprise a highly specialized text dataset of historical knowledge, which can be used for fine-tuning LLMs. In turn, LLMs offer the potential to expand Seshat more easily by streamlining data extraction and updates.
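
To illustrate, the quote-plus-explanation structure of a Seshat data point maps naturally onto instruction-style fine-tuning examples. The sketch below is hypothetical: the field names (`polity`, `variable`, `quote`, `explanation`) and the prompt template are assumptions for illustration, not the project's actual schema or method.

```python
import json

# Hypothetical shape of a Seshat data point; the real schema may differ.
data_point = {
    "polity": "ItRomPr",  # assumed polity identifier
    "variable": "Professional military officers",
    "value": "present",
    "quote": "The legions were commanded by a professional officer corps...",
    "explanation": "Roman legions had a standing cadre of career officers.",
}

def to_finetuning_example(dp: dict) -> dict:
    """Convert one data point into a prompt/completion pair for LLM fine-tuning."""
    prompt = (
        f"Source text: {dp['quote']}\n"
        f"Question: For the polity {dp['polity']}, is the variable "
        f"'{dp['variable']}' present, absent, or unknown?"
    )
    completion = f"{dp['value']}. {dp['explanation']}"
    return {"prompt": prompt, "completion": completion}

print(json.dumps(to_finetuning_example(data_point), indent=2))
```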

Maria del Rio-Chanona (UCL) and Jakob Hauser (Complexity Science Hub Vienna) are currently building a structured Seshat NLP Dataset that links information in Seshat to the corresponding paragraphs in academic sources. This dataset will be used to train a natural language processing (NLP) model to detect paragraphs likely to contain information about a particular variable. Ultimately, we aim to build an algorithm that puts promising sources at the fingertips of research assistants. While we currently focus on variables already coded in Seshat, advances in transfer learning offer the opportunity to develop algorithms that can help code newly defined variables.
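
One way to picture such a linked dataset is as a set of records pairing a Seshat variable with a source paragraph and a relevance label. The record layout below is a hypothetical sketch; the actual dataset design may differ.

```python
from dataclasses import dataclass

@dataclass
class LabelledParagraph:
    """One record of the Seshat NLP Dataset, as hypothesized here."""
    polity_id: str    # e.g. a Seshat polity code
    variable: str     # the Seshat variable the paragraph is checked against
    source_ref: str   # citation for the book, chapter, or article
    paragraph: str    # the paragraph text extracted from the source
    relevant: bool    # label: does the paragraph bear on the variable?

# Illustrative example with assumed values.
example = LabelledParagraph(
    polity_id="ItRomPr",
    variable="Professional military officers",
    source_ref="Goldsworthy 2003, The Complete Roman Army",
    paragraph="The legions were commanded by a professional officer corps...",
    relevant=True,
)
```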

The overall goals of this project are to (i) make Seshat more accessible and translate it into a labelled dataset that can be used for supervised training, and (ii) develop an NLP algorithm that can help expand and update current databases.
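
To make goal (ii) concrete, a baseline paragraph detector could be a simple supervised classifier trained on such labelled records. The sketch below uses TF-IDF features with logistic regression from scikit-learn on toy data; it is a minimal baseline under assumed inputs, not the project's actual method, which may instead rely on fine-tuned LLMs.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: paragraphs labelled for relevance to one variable,
# e.g. 'Professional military officers'. Real data would come from the
# Seshat NLP Dataset described above.
paragraphs = [
    "The legions were commanded by a career officer corps.",
    "Centurions served for decades and were promoted on merit.",
    "The harvest festival lasted three days each autumn.",
    "Pottery styles in the region changed gradually over centuries.",
]
labels = [1, 1, 0, 0]  # 1 = relevant to the variable, 0 = not relevant

# TF-IDF bag-of-words plus logistic regression: a standard text-classification baseline.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(paragraphs, labels)

# Rank new paragraphs by predicted relevance so research assistants can
# triage the most promising passages first.
candidates = ["Officers of the legion held permanent commissions."]
scores = model.predict_proba(candidates)[:, 1]
print(sorted(zip(scores, candidates), reverse=True))
```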

This project acknowledges funding from Clariah-AT, the James S. McDonnell Foundation, and the Complexity Science Hub Vienna.
