The Networking Archives team are very happy to share this new paper, published today in the Journal of Cultural Analytics, which examines the robustness of network metrics on correspondence archives with missing data.
On the Networking Archives project we work a LOT with network analysis metrics. Though we would all agree it’s important to contextualise quantitative results with domain expertise, we still use metrics as starting points, or as parts of historical arguments. We might, for example, use centrality rankings for a number of Secretaries of State to investigate how they worked with each other.
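To make the idea of a centrality ranking concrete, here is a minimal sketch using networkx. The correspondents and letters are entirely invented for illustration, and betweenness is just one of several centrality measures one might rank by:

```python
# Minimal sketch: ranking correspondents by centrality in a toy letter network.
# The names and edges below are invented, not drawn from any real archive.
import networkx as nx

# Directed graph: an edge A -> B means A sent at least one letter to B.
G = nx.DiGraph()
G.add_edges_from([
    ("Walsingham", "Burghley"),
    ("Burghley", "Walsingham"),
    ("Cecil", "Burghley"),
    ("Walsingham", "Cecil"),
    ("Davison", "Walsingham"),
])

# Rank nodes by betweenness centrality, highest first.
ranking = sorted(nx.betweenness_centrality(G).items(),
                 key=lambda kv: kv[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A ranking like this, rather than the raw scores, is typically what feeds into a historical argument, which is exactly why we need to know how stable the rankings are.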
But interpreting results is not simple, because the sources we work with, digitised early modern correspondence archives, are naturally full of ‘missing data’: letters that have been destroyed, lost, or simply not yet digitised.
So we set out to test in detail how our results might fare in the light of missing data. To do this we adapted an approach taken by others in social sciences and archaeology: taking out random, successively larger chunks of the data, re-running the metrics, and comparing the original data with the ‘sampled’ versions.
A key difference between our approach and those we’ve seen before is that we tailored it to simulate the kinds of missing data one might find in historical archives: rather than deleting random nodes, we deleted random letters, folios, years, and whole catalogues. To compare the original with the samples, we computed Spearman’s rank correlation between the two, deleting from 1% all the way to 99% of the data, for a series of key metrics. We repeated each deletion 40 times to understand the variation.
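The overall procedure can be sketched in a few lines of Python. This is a simplified, hedged version, not the project's actual code: it only implements the simplest deletion unit (random letters, i.e. edges), uses a random graph and degree centrality as stand-ins, and follows the paper's described shape of repeated deletion plus Spearman correlation:

```python
# Sketch of the robustness test: delete successively larger random chunks of
# "letters" (edges), re-run a metric, and compare with Spearman's correlation.
# Graph, metric, and deletion unit here are illustrative stand-ins.
import random

import networkx as nx
from scipy.stats import spearmanr

def robustness_curve(G, metric, fractions, repeats=40, seed=0):
    """Mean Spearman correlation between the metric on the full graph and
    the metric on graphs with a fraction of edges (letters) removed."""
    rng = random.Random(seed)
    nodes = list(G.nodes())
    full = metric(G)
    original = [full[n] for n in nodes]
    curve = []
    for frac in fractions:
        rhos = []
        for _ in range(repeats):
            edges = list(G.edges())
            rng.shuffle(edges)
            keep = edges[int(len(edges) * frac):]  # drop the first frac share
            H = nx.Graph()
            H.add_nodes_from(nodes)
            H.add_edges_from(keep)
            sampled = metric(H)
            rho, _ = spearmanr(original, [sampled.get(n, 0) for n in nodes])
            rhos.append(rho)
        curve.append(sum(rhos) / len(rhos))
    return curve

# Example run: degree centrality on a random 100-node graph.
G = nx.gnm_random_graph(100, 400, seed=1)
curve = robustness_curve(G, nx.degree_centrality, [0.1, 0.5, 0.9])
```

Deleting folios, years, or whole catalogues would follow the same pattern, just grouping the edges by the relevant attribute before deletion.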
We then plotted the results like this:
These plots tell us how the metrics respond to artificially missing data: the blue line is the average correlation (a measure of the agreement between the original and the version with data missing), and the grey shaded area represents the variability. A blue line which descends more quickly tells us that the metric is particularly sensitive to missing data. In this example you can see that removing catalogues has a high impact on betweenness, though with low variability; a lower impact on degree; and a high impact on eigenvector centrality, with much more variability.
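A plot in this style (mean line plus shaded variability band) is straightforward to reproduce with matplotlib. The data below are fabricated purely to show the plotting pattern, not taken from the paper:

```python
# Sketch of the plot style described above: a mean-correlation line with a
# shaded band showing variability across repeated runs. Data are invented.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

fractions = np.linspace(0.01, 0.99, 20)
# Fake repeated runs: correlation that decays as more data is removed.
rng = np.random.default_rng(0)
runs = np.clip(1 - fractions**2 + rng.normal(0, 0.05, (40, 20)), -1, 1)

mean = runs.mean(axis=0)
lo, hi = runs.min(axis=0), runs.max(axis=0)

fig, ax = plt.subplots()
ax.plot(fractions, mean, color="tab:blue", label="mean correlation")
ax.fill_between(fractions, lo, hi, color="grey", alpha=0.3,
                label="variability across runs")
ax.set_xlabel("fraction of data removed")
ax.set_ylabel("Spearman correlation")
ax.legend()
fig.savefig("robustness.png")
```

Reading such a plot is then a matter of comparing how steeply the blue line falls and how wide the grey band is.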
The results were surprising: in many cases, the correlation stayed really high until 50 or 60% of the data was removed! You can read the paper here for the full results.
This is not to say that missing data is not something to be conscious of: of course it is, and as humanities scholars will know, the specifics of the ‘missingness’ are also important. But we can say that correspondence archives are surprisingly robust, even when very partial: perhaps it’s an opportunity to relax some of the anxieties we have surrounding missing historical data and quantitative results.
Alongside the paper, we have created an open-source application which allows users to upload their own network and test its robustness. It’s available here, but if you have a large network it’s best to download the source code from here and run it locally.