Old Bailey online

The complete proceedings of London’s Central Criminal Court (the Old Bailey) have recently gone online. This fantastic website allows users to view the details of almost 200,000 criminal trials held at the court from 1674 to 1913. I have already spent several hours browsing the site! Since this digitization project has many parallels with the Biodiversity Heritage Library (a project to scan the literature describing all 1.8 million known species) I thought I’d compare the two:

Scope: The Old Bailey text comprised about 195,000 pages. By comparison BHL has to deal with an estimated 300 million pages.

Source material: The Old Bailey project was working from two sources (the Old Bailey Proceedings from 1674 to 1913, and the Ordinary of Newgate's Accounts, 1690 to 1772) that were already on microfilm in one location. BHL is working from many thousands of original sources that are not on microfilm and are scattered across 10 major natural history museum libraries around the world.

Text Extraction: The smaller volume of material meant that the Old Bailey project could rekey (at least once) the complete text of the entire works. BHL could never afford to do this, so it must rely on OCR for text extraction.

Names: Person names in the Old Bailey text were identified using the GATE software, which was able to identify 80-90% of the person names correctly. BHL is using a slightly different approach: we already have a list of 10 million taxonomic names (though this list is incomplete) and match the scanned text against it. BHL’s ability to identify names therefore depends more on OCR quality than on software recognition. Nevertheless, I am sure BHL misses some names, and I’d be interested in any empirical data on this.
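
To illustrate why list-based matching is so sensitive to OCR quality, here is a minimal Python sketch. It is not BHL's actual implementation: the function name, the two-word matching rule, the tiny name list, and the sample sentences are all made up for illustration.

```python
import re


def find_known_names(page_text, known_names):
    """Return known two-word names found in the text, in order of appearance."""
    # Split the page into alphabetic tokens; digits and punctuation are dropped.
    words = re.findall(r"[A-Za-z]+", page_text)
    found = []
    # Check every pair of adjacent words against the list of known names.
    for first, second in zip(words, words[1:]):
        candidate = f"{first} {second}"
        if candidate in known_names:
            found.append(candidate)
    return found


# Hypothetical stand-in for the list of taxonomic names.
known_names = {"Homo sapiens", "Panthera leo"}

clean_page = "Specimens of Panthera leo were compared with Homo sapiens."
# Same sentence with a single OCR error: the letter "l" read as the digit "1".
noisy_page = "Specimens of Panthera 1eo were compared with Homo sapiens."

print(find_known_names(clean_page, known_names))
print(find_known_names(noisy_page, known_names))
```

With the clean text both names are found, but a single misread character is enough to lose "Panthera leo", even though the name is in the list.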

Markup: The Old Bailey text was relatively structured (perhaps because it essentially appeared in a single publication). Nevertheless, the bulk of this (1674-1834) was marked up manually by a team of 5 people. By comparison, although species descriptions are relatively formulaic, the diversity of publications, coupled with the different approaches taken by different authors, often working in very different scientific fields, means that markup at the level of detail delivered by the Old Bailey project is just not possible on any meaningful scale. The process would have to be far more automated for it to work for BHL.

Time frame: It took the Old Bailey project eight years to complete the process. In contrast, BHL will take much longer, although this is not a fair comparison, as what the two projects can realistically achieve is quite different. BHL has already scanned 5.5 million pages (more than 500,000 via a single scribe machine at the Natural History Museum alone). However, this is still just 2% of the total!
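
For anyone who wants to check the arithmetic, here is a quick Python sketch using only the page counts quoted in this post (the variable names are mine):

```python
# Page counts as quoted in this post.
old_bailey_pages = 195_000
bhl_scanned_pages = 5_500_000
bhl_estimated_total_pages = 300_000_000

# Share of the estimated BHL corpus scanned so far.
fraction_scanned = bhl_scanned_pages / bhl_estimated_total_pages

# How many Old Bailey projects' worth of pages BHL has already scanned.
old_bailey_equivalents = bhl_scanned_pages / old_bailey_pages

print(f"Scanned so far: {fraction_scanned:.1%}")
print(f"Old Bailey equivalents: {old_bailey_equivalents:.1f}")
```

This gives 1.8% (the "2%" above, rounded) and about 28 times the Old Bailey page count.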

More technical details about what the Old Bailey project did are available from their website, but in short the text was scanned from microfilm as 400dpi TIFF files and the older text (1674-1834) was “double rekeyed”. In other words, the text was typed manually by two different typists, and the two transcripts were then compared electronically, with any differences reconciled manually. More recent text (1834-1913) was manually keyed once and electronically reconciled against a second transcript generated by OCR software. Through a combination of manual and automated markup, the text was tagged to populate a database of about 18 fields that describe the characteristics of each case. They also used GATE software to identify person names in the text.
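
The comparison step in double rekeying can be sketched in a few lines of Python. This is only an illustration of the idea, not the Old Bailey project's actual tooling: it assumes a word-level comparison using the standard library's difflib, and the function name and sample sentences are invented.

```python
import difflib


def find_discrepancies(transcript_a, transcript_b):
    """Return word-level differences between two independent transcripts."""
    words_a = transcript_a.split()
    words_b = transcript_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    discrepancies = []
    # Each opcode describes how a run of words in A maps onto a run in B.
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            # Keep only the runs where the two typists disagree.
            discrepancies.append(
                (tag, words_a[a_start:a_end], words_b[b_start:b_end])
            )
    return discrepancies


# Made-up example: the two typists disagree on the spelling of a surname.
typist_one = "John Smith was indicted for stealing a silver watch"
typist_two = "John Smyth was indicted for stealing a silver watch"

for tag, from_a, from_b in find_discrepancies(typist_one, typist_two):
    print(tag, from_a, from_b)
```

Only the flagged runs need a human to look at the page image and decide which version is right. The same comparison would work with an OCR transcript in place of the second typist, which is roughly what was done for the 1834-1913 text.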

I wish BHL could achieve for biodiversity science what the Old Bailey project will do for historians and humanities researchers. Alas, the technical barriers, heterogeneity, and sheer scale of the biodiversity literature mean that BHL still has a long way to go.