New paper: ‘The Measure of the Archive: The Robustness of Network Analysis in Early Modern Correspondence’

The Networking Archives team are very happy to share this new paper, looking at the robustness of network metrics on correspondence archives with missing data, published today in the Journal of Cultural Analytics.

On the Networking Archives project we work a lot with network analysis metrics. Though we would all agree it’s important to contextualise quantitative results with domain expertise, we still use metrics as starting points or as parts of historical arguments. We might, for example, use the centrality rankings of a number of Secretaries of State to investigate how they worked with each other.

But interpreting results is not simple, because the sources we work with, digitised early modern correspondence archives, are naturally full of ‘missing data’: letters that have been destroyed or lost, for example, or simply not yet digitised.

So we set out to test in detail how our results might fare in the light of missing data. To do this we adapted an approach taken by others in social sciences and archaeology: taking out random, successively larger chunks of the data, re-running the metrics, and comparing the original data with the ‘sampled’ versions.

A key difference between our approach and those we’ve seen before is that we tailored it to simulate the kinds of missing data one might find in historical archives: rather than deleting random nodes, we deleted random letters, folios, years, and whole catalogues. To compare the original with the samples, we computed Spearman’s rank correlation between the two, deleting from 1% all the way to 99% of the data, for a series of key metrics. We repeated each deletion 40 times to understand the variation.
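The full pipeline in the paper is more involved, but the core sampling loop can be sketched in Python. This is a minimal illustration using networkx and scipy on a small random stand-in graph, not the project’s actual data or code: letters are approximated here as edges, and only betweenness centrality is tested.

```python
import random

import networkx as nx
from scipy.stats import spearmanr

# A random stand-in for a letter network: nodes are correspondents,
# edges approximate letters. (Illustration only, not the project data.)
random.seed(42)
G = nx.gnm_random_graph(100, 400, seed=42)
nodes = list(G.nodes())
original = nx.betweenness_centrality(G)

mean_rho = {}
for frac in (0.1, 0.5, 0.8):          # proportion of 'letters' removed
    rhos = []
    for _ in range(40):               # 40 repetitions to gauge variability
        keep = random.sample(list(G.edges()),
                             int(G.number_of_edges() * (1 - frac)))
        H = nx.Graph()
        H.add_nodes_from(nodes)
        H.add_edges_from(keep)
        sampled = nx.betweenness_centrality(H)
        # Rank correlation between the full and sampled versions
        rho, _ = spearmanr([original[n] for n in nodes],
                           [sampled[n] for n in nodes])
        rhos.append(rho)
    mean_rho[frac] = sum(rhos) / len(rhos)
    print(f"{frac:.0%} removed: mean Spearman rho = {mean_rho[frac]:.2f}")
```

The same loop generalises to other deletion units (folios, years, catalogues) by grouping the edges before sampling.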

We then plotted the results like this:

These plots tell us how the metrics respond to artificially missing data: the blue line is the average correlation (a measure of agreement between the original and the version with data missing), and the grey shaded area represents the variability. A blue line which descends more quickly tells us that a metric is particularly sensitive to missing data. In this example you can see that removing catalogues has a high impact on betweenness, though with low variability; a lower impact on degree; and a high impact on eigenvector centrality, with much more variability.

The results were surprising: in many cases, the correlation stayed very high until 50 or 60% of the data had been removed! You can read the paper here for the full results.

This is not to say that missing data is not something to be conscious of: of course it is, and as humanities scholars will know, the specifics of the ‘missingness’ are also important. But we can say that correspondence archives are surprisingly robust, even when very partial: perhaps it’s an opportunity to relax some of the anxieties we have surrounding missing historical data and quantitative results.

Alongside the paper, we have created an open-source application which allows users to upload their own network and test its robustness. It’s available here, but if you have a large network it will be best to download the source code from here and run it locally.

Co-Citation Networks from Letter Mentions: A Short Guide

A co-citation network is a network model built on the principle that those mentioned (or cited) in the same document may share some kind of link. This approach has been widely used to understand, for example, the structure of scholarly communities: if two works are frequently cited in the same document, they likely share some kind of semantic link, and this can be used to check whether the authors in those communities have shared characteristics, or whether there are distinct communities of scholars or academic areas which repeatedly cite each other. Bipartite networks have also been used to understand ecological food webs: connecting animals which are all prey for the same predator, for example, or flowers pollinated by the same insect.
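The projection itself is straightforward to sketch. With a toy set of letters and hypothetical mentioned names, networkx’s bipartite tools can turn letter–person mentions into a weighted co-citation network:

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy letter-mention data (hypothetical names): each letter lists the
# people mentioned in it.
mentions = {
    "letter_1": ["Hartlib", "Dury", "Comenius"],
    "letter_2": ["Dury", "Comenius"],
    "letter_3": ["Hartlib", "Oldenburg"],
}

# Build a bipartite graph of letters and people...
B = nx.Graph()
for letter, people_in_letter in mentions.items():
    for person in people_in_letter:
        B.add_edge(letter, person)

# ...then project onto the people: two people are linked if they are
# mentioned in the same letter, weighted by how often that happens.
people = {p for ps in mentions.values() for p in ps}
co_citation = bipartite.weighted_projected_graph(B, people)

print(sorted(co_citation.edges(data="weight")))
```

Here Dury and Comenius, mentioned together twice, get a heavier edge than pairs co-mentioned only once.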

The method can also be used to understand other types of citations. One project, Six Degrees of Francis Bacon, uses co-citation of people in Oxford Dictionary of National Biography articles as a way of inferring a likely social connection, again based on the premise that if two people were repeatedly mentioned together in articles, they probably share some kind of link. Or, as done here, you could use co-citation to draw a link between two individuals if they are cited, or mentioned, in the same letter.

Read the full post here.

Exploring the State Papers with Word Embeddings

One way we can represent text so that a machine can interpret it is with a word vector. A word vector is simply a numerical representation of a word within a corpus (a body of text, often a series of documents), usually consisting of a sequence of numbers. This type of representation is used for a variety of Natural Language Processing tasks, for instance measuring the similarity between two documents.

A new blog post by the team uses a couple of R packages and GloVe, a method which learns word vectors from word co-occurrence statistics, to produce a series of vectors which give useful clues as to the semantic links between words in a corpus. The method is then applied to the printed summaries of State Papers Online, showing how they can be used to understand how the associations between words and concepts changed over the course of the seventeenth century.
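The blog post itself works in R. Purely to illustrate the underlying idea, that words used in similar contexts end up with similar vectors which can be compared with cosine similarity, here is a minimal Python sketch using raw co-occurrence counts on a tiny made-up corpus (GloVe instead learns dense vectors from such counts, but the geometry is analogous):

```python
from collections import defaultdict

import numpy as np

# A toy corpus standing in for the calendar summaries.
corpus = [
    "the king wrote to the ambassador",
    "the queen wrote to the ambassador",
    "the ship sailed from the port",
    "the ship arrived at the port",
]

# Count co-occurrences within a +/-2 word window.
window = 2
cooc = defaultdict(lambda: defaultdict(int))
for doc in corpus:
    words = doc.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooc[w][words[j]] += 1

vocab = sorted({w for doc in corpus for w in doc.split()})
index = {w: i for i, w in enumerate(vocab)}

# Each word's vector is its row of co-occurrence counts.
M = np.zeros((len(vocab), len(vocab)))
for w, neighbours in cooc.items():
    for v, n in neighbours.items():
        M[index[w], index[v]] = n

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(M[index["king"]], M[index["queen"]]))
print(cosine(M[index["king"]], M[index["ship"]]))
```

In this toy example ‘king’ and ‘queen’ appear in near-identical contexts, so their vectors are far more similar than ‘king’ and ‘ship’.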

Read the full post here.

New Blog Post: Text Mining the State Papers

Much of the work on the Networking Archives project has been using the metadata (people, dates, places) of correspondence rather than the content itself. Here we investigate applying text mining techniques to the printed summaries.

Most of the quantitative research on the Networking Archives project, which studies seventeenth-century intelligencing, has used the metadata from the digitised correspondence of State Papers Online. Metadata in this sense means everything except the content of the letters: author and recipient names, dates, places of sending and so on. Gale’s State Papers Online brings together a number of historical primary sources: not only the manuscript images from the State Papers, but also full-text versions of the ‘Calendars of State Papers’, a set of printed finding aids mostly produced in the nineteenth century. These printed summaries represent another huge store of data, which we also draw on in our analysis.

As anyone who has worked with the calendars will tell you, they were produced to very different standards and as such interpret the documents they represent in very different ways. They tend to suffer from an identity crisis: never quite sure whether they should be purely a manuscript finding aid or a fuller description. And because there’s no inherent logic behind the inconsistencies other than changing editorial policies, it’s hard to get a sense of exactly how they are inconsistent. Data analysis can help, by examining the entire dataset at scale to understand the changing shape of the printed calendars by time, topic, and office.

Read the full post here.

Wikidata and Correspondence Archives

On the Networking Archives project we’ve been using Wikidata IDs as unique identifiers for some of our data types. At present, we use Wikidata identifiers to disambiguate geographic places in our dataset and, where available, Wikipedia links as unique identifiers for people records.

Wikidata is a knowledge graph: a type of database which stores information in what are known as triples. The ‘things’ in Wikidata are stored as entities, which are connected by properties that express the relationships between them. For example: the entity ‘Henry Oldenburg’ is connected to the entity ‘Bremen’ by the property ‘place of birth’. Both Oldenburg and Bremen are themselves connected to many other entities through many other properties, resulting in a complex web of interrelated data.
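In miniature, a triple store can be sketched with plain Python tuples. Real Wikidata stores entities as Q-numbers and properties as P-numbers (P19 is ‘place of birth’, for instance); readable labels are used below instead:

```python
# Subject-property-object triples, using labels instead of Wikidata's
# Q- and P-identifiers for readability.
triples = [
    ("Henry Oldenburg", "place of birth", "Bremen"),
    ("Henry Oldenburg", "member of", "Royal Society"),
    ("Bremen", "country", "Germany"),
]

def objects(subject, prop):
    """All entities linked to `subject` by property `prop`."""
    return [o for s, p, o in triples if s == subject and p == prop]

print(objects("Henry Oldenburg", "place of birth"))
```

Chaining such lookups (Oldenburg’s place of birth, then that place’s country) is how queries traverse the web of interrelated data.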

Click here for a blog post (with code examples) by one of our Research Fellows, Yann, on the use of Wikidata IDs and the correspondence metadata on the Networking Archives project.

Networking Archives publication in Computational Humanities Research

The Networking Archives project is reconciling three separate datasets—Early Modern Letters Online, the Tudor State Papers Online, and Stuart State Papers Online—into one meta-archive. One commonality between the three is that they all, to some degree, contain missing and partial data—potentially a source of anxiety when we come to consider the veracity of our findings. In a recent paper authored by some of the project team, presented at the first Computational Humanities Research Workshop, we outlined some strategies for dealing with missing data, and argued that perhaps we shouldn’t be so worried after all.

First we set out to understand the data in detail, and to this end we’re working on a set of ‘views’, which will visualise the shape of the data along different dimensions. What struck us first is how remarkably similar the State Papers data looks to EMLO, despite their very different origins. These visualisations also help us to analyse the precise ways in which the data is missing and partial—we’re mapping absences as well as presences. Mapping absences has led us to understand, for example, that dates in SPO were more reliable during the secretaryship of the bureaucrat-extraordinaire Joseph Williamson, and less so during the interregnum. They also show that some types of missing data are more correlated than others: statistically, a record in EMLO missing a date is significantly more likely than chance to be missing an author or recipient, but the fact has less bearing on whether that record will be missing an origin or a destination field, for example. Potentially these findings can help us to model in even greater detail the effects of very specific types of absences in the data.
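This kind of co-missingness check is easy to sketch: with a toy table of hypothetical records, correlating the missingness indicators of each column shows which absences tend to travel together (the values and field names below are illustrative, not EMLO data):

```python
import pandas as pd

# Toy records standing in for letter metadata: None marks a missing field.
df = pd.DataFrame({
    "date":   ["1650", None, None, "1655", None, "1660"],
    "author": ["Hartlib", None, None, "Dury", "Dury", "Oldenburg"],
    "origin": [None, "London", "Oxford", None, "Paris", "London"],
})

# Correlate the missingness indicators: a high value between two columns
# means records missing one field tend also to be missing the other.
missing = df.isna()
corr = missing.astype(int).corr()
print(corr)
```

In this toy table, records missing a date usually also miss an author, but not an origin, so the date/author correlation is high while date/origin is not.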

Many of the findings on the Networking Archives project are based on network science metrics. We might, for example, use a ranked list of a particular metric to make a claim about an individual’s proximity to the centre of power, or to find individuals who acted as ‘sustainers’ between different parts of a network. To understand the impact of the missing data, described above, on these kinds of rankings, we ran a series of experiments inspired by the work carried out by Matthew Peeples on archaeological networks. To put it simply, we removed random chunks of letter records from the datasets, re-ran the network algorithms, and compared the ranks of the metrics across the original and ‘sample’ networks. Surprisingly, we found that the rankings remained remarkably similar, even when 60 or 70% of the network had been removed.

The last part of the paper is about why we’re interested in studying these joined-up catalogues. One reason is because it allows us to find new, ‘informal’ catalogues at the intersection of the formal collections. Take the example of John Dury: a Scottish minister who worked as a diplomat and towards the promotion of peace amongst Christian factions, he spent much of his life travelling across Europe trying to convince secular leaders of his cause. As such, rather than his correspondence being collected in a single ‘Dury Archive’, his letters are scattered across a number of others (much of it is in the archive of his friend Samuel Hartlib, but we found him in eight other catalogues in EMLO as well as in the Stuart State Papers). Computational methods allow us to find other individuals like this, and in the case of Dury, gather his dispersed correspondence into a single, informal catalogue, and through this get a more complete picture of his role in seventeenth-century religious, intellectual and diplomatic networks.

Historians are often—understandably—skeptical about quantitative results of this kind, because working in historical archives makes one only too aware of their partial, often chaotic nature. We suggest that in terms of network science at least, this partiality has less effect than might be expected. In fact, what we’ve discovered is that in most fields using network analysis, complete data is more of an illusion than a fact, and that we should work around absence rather than pretend it isn’t there.

New Networking Archives Fellows Yann and Philip

We are pleased to introduce two new members to the Networking Archives team: Yann Ryan and Philip Beeley. They joined us as Fellows earlier this year, and will be undertaking key parts of the collaborative research projects that we have scoped, including co-authoring the project ‘multigraph’ with the rest of the team, and co-editing the collection of essays coming out of the training schools.

Yann recently completed a PhD thesis ‘Networks, Maps and Readers: Foreign News Reporting in London Newsbooks, 1645–1649’ (QMUL), which looked at the flow of news from overseas to London, and examined how this can be traced and measured using computational techniques (including network analysis) as well as more traditional scholarly methods. Prior to the Networking Archives project, he worked at the British Library as a Curator of newspaper data – a newly-created post which sought to promote the use of the Library’s digital newspaper holdings to a wider audience.

His current research interests include historical network analysis, the history of news and intelligencing in Europe, digital and spatial humanities, as well as early modern post and communications. He’s also keen on developing alternative ways of communicating historical research, and is experimenting with writing an open-source book on newspaper data as well as producing computational tools for the Networking Archives team.

Philip’s research and publications are focused on the history of science and epistolary cultures in early modern Europe. He is especially interested in the role played by correspondence networks in the emergence of early modern scientific thought and in the ways in which mathematical ideas were disseminated and discussed both in scholarly communities and across different social milieus. A particular focus is the history of the early Royal Society and of its relations to cognate institutions across the continent. A further area of his research is early modern cryptography and its significance for diplomatic decision-making, as well as in shaping political affairs and military events in seventeenth-century Europe.

He has been involved with the Oxford-based Cultures of Knowledge project and its collaborative database of early modern correspondence EMLO since their inception. Until recently, he was Co-I on the AHRC-funded Reading Euclid project, which investigated the impact of Euclid’s Elements of Geometry on early modern culture in Britain and Ireland by examining educational, editorial, and reading practices through printed, scribal, and other material records.

Philip and Yann’s arrival on the project has already given us a huge boost in terms of productivity, and we are looking forward to sharing the fruits of our collaborative labour in due course.

Ruth and Sebastian make an appearance in new PBS documentary on historical networks

Two members of the Networking Archives project team, Ruth and Sebastian, recently contributed to a new PBS documentary, Networld. The documentary, created by historian Niall Ferguson, is a three-part series exploring the history of social networks.
In the episode they talk about their work on Protestant letter networks in the reign of Mary I. You can watch it in full on YouTube, here, or visit the official site for more information.

Job Opening: Postdoctoral Research Associate (University of Oxford)

In addition to the currently open postdoctoral position at Queen Mary University London, we’re offering a second ‘Postdoctoral Research Associate’ post beginning January 2020 at the University of Oxford.

This is a full-time, fixed-term post for 18 months. The successful candidate will conduct independent research on the intersections of political and intellectual ‘intelligencing’ in mid-17th century England. In addition to this, they will participate in the collaborative, interdisciplinary ‘laboratories’, in which experiments will be conducted on the newly curated and merged datasets whilst also developing plans for disseminating the results.

For full details please see the official job posting. Interested candidates are encouraged to contact Prof Howard Hotson ([email protected]) after September 6th for an informal discussion of the job and its requirements.

The closing date for applications is 14 October 2019. Interviews are expected to be held shortly thereafter.

Job Opening: Postdoctoral Research Associate (Queen Mary University London)

We’re looking for a ‘Postdoctoral Research Associate’ based at Queen Mary University London to join our project in January 2020.

The Postdoctoral Research Associate will be actively involved in all facets of the project and will be provided with the necessary training, although pre-existing skills in network analysis (or other digital humanities training) would be beneficial. In discussion with the PI and Co-Is, the Research Associate will develop their own research agenda and publications arising from the experimental monthly Lab meetings, analysing the archive of 430,000 letters using a combination of quantitative network analysis and traditional literary-historical research.

For full details please see the official job posting at QMUL. Interested candidates are also encouraged to contact Dr Ruth Ahnert ([email protected]) for an informal discussion of the job and its requirements.

The closing date for applications is 23 September 2019. Interviews are expected to be held shortly thereafter.

P.S. See also our second postdoctoral job opening for this project at the University of Oxford.