European Distributed Institute of Taxonomy (EDIT)

EDIT is an EU-funded Network of Excellence program with the goal of reducing the fragmentation of biological taxonomic research and coordinating an effort to facilitate taxonomic research using the World Wide Web. Within EDIT there has been much discussion about how this can be achieved, and I want to explain what my NHM colleagues and I have been doing as part of this project.

Integrating diverse sources of digital information is a major challenge for biodiversity informatics. Not only are we faced with numerous disparate data providers, each with their own specific user communities, but the information we are interested in is also heterogeneous and often specific to each community. Coupled with this, we know that our resources are limited, so decisions on how to achieve this integration must scale to the needs of many research communities. The technical abilities of these communities are also limited, so we have to do this in a way that respects this fact. Coming up with a single "standard" specification or approach for biodiversity data integration that solves all these problems might at best be described as difficult and, judging by the products of past efforts, is arguably futile. What unites us is a common goal to share our data with as wide an audience as possible, and it is generally agreed that, in one form or another, we should do this on the Web. So how do we achieve this?

Over the past few months my NHM colleagues and I have been tackling this in three ways:

  1. Generalized Content Management Systems (CMS)
    These are web-based systems that assist users in the process of content management. By content I am referring to digital information we put into such a system (e.g. text, images, video etc), and in the context of biosystematics I am referring to "data" that is variably structured and atomized (e.g. taxon names, specimen lists, images, literature, documents etc). Because CMS systems are highly generalized, they are not well adapted to any one community. However, the best of these systems can be customized and their functionality augmented through the addition of modules that perform certain tasks (e.g. handling and annotating images, or managing lists of bibliographic references etc). Importantly, CMS systems handle all the housekeeping tasks of managing content, such as user authentication, logging, and archival, that are otherwise difficult to manage and a distraction from the process of creating content.

    In the context of EDIT we have created a template CMS ("Scratchpads") that we have crudely adapted to the needs of biological taxonomists. Using the Drupal CMS we have inserted modules handling various data types (e.g. bibliographic literature, images etc), and are offering them as templates for communities of taxonomists to build content. Users obtain sites through an electronic registration procedure. To date we have 8 such sites, one of which (http://www.milichiidae.info/) is being used by an EDIT exemplar group. Exemplar groups are the taxonomic groups fortunate enough to receive core EDIT funding. Functionally these sites have very significant limitations. However, they do allow communities to gain an initial web presence and have proven popular with those that use them, though decidedly less popular with two exemplar groups that don’t.

  2. Specialized (Taxonomic) Content Management Systems (TCMS)
    These are CMS systems that have been customized to the particular needs of specific research communities. Some are tightly focused on particular types of data (e.g. the Berlin Model database handling taxon names and taxon concepts) or particular taxa (e.g. Species File). Others are much more general but lack the capacity for web editing of data (e.g. CATE, and Specify), though their output is on the web. These systems facilitate much better semantic structuring of data; however, this comes at the price of additional complexity. Thus TCMS systems are very hard to develop, invariably difficult to populate with data, and rarely scale to the broader needs of many biosystematic research communities.

    In the context of EDIT we have mounted one such system at the NHM in recent months. Phasmid Species File is an extensive database of biosystematic data (taxon names, classification, images, literature, ecological and geographic data, keys etc) on stick insects and their relatives. It is based on the Orthoptera Species File model and is the first of several systems developed by David Eades and colleagues that will be mounted at the NHM. The next will be on cockroaches and will be mounted at the NHM in two months' time. Phasmid Species File (http://beach.nhm.ac.uk/) is currently only visible to researchers inside the NHM domain but will be accessible to others shortly. In the coming year the external collaborators (led by Paul Brock) will be adding a further 20,000 references and 3GB of images to this database.

  3. Semantic markup of unstructured data
    Most biosystematic data documenting the diversity of life is locked up in millions of Natural History articles published over the last 150 years. Getting access to these journals, let alone the data within them, is an extremely complex task. One approach is to apply Natural Language Processing (NLP) to identify component data so that we can query this information. At its simplest, NLP algorithms can recognize contextually identical parts of text (e.g. descriptions, specimen references, taxon names) and mark these up in such a way that it is possible to recover contextually related pieces of information from one or many documents. Donat Agosti and colleagues have been working on this problem and have produced an application called GoldenGATE that facilitates this. In the context of EDIT we have been assessing GoldenGATE and the principles behind this approach to determine its suitability for converting documents into a marked-up format, with the notion of creating a single XML repository of biosystematic data. This work is being conducted by Julius Welby at the NHM. You can read more about his work at http://www.editwebrevisions.info/node/60.
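
To make the idea of semantic markup a little more concrete, here is a minimal Python sketch of the simplest case described above: recognizing taxon names in free text, wrapping them in XML-style tags, and then recovering them again. It is not how GoldenGATE works. The `taxonName` tag, the function names, and the lookup list of known genera are all assumptions made purely for illustration, and real taxon name recognition is far more involved than a single regular expression.

```python
import re
from xml.sax.saxutils import escape

# Illustrative assumption: a binomial is a capitalized genus followed by a
# lowercase species epithet of at least three letters.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")


def mark_up_taxon_names(text, known_genera):
    """Wrap binomials whose genus is in known_genera in <taxonName> tags.

    The tag and attribute names are hypothetical, not a published schema.
    """
    parts = []
    last = 0
    for match in BINOMIAL.finditer(text):
        genus, species = match.group(1), match.group(2)
        if genus not in known_genera:
            continue  # leave unrecognized capitalized words untouched
        parts.append(escape(text[last:match.start()]))
        parts.append(
            '<taxonName genus="%s" species="%s">%s</taxonName>'
            % (genus, species, escape(match.group(0)))
        )
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


def extract_taxon_names(marked_up):
    """Recover the tagged names from one marked-up document."""
    return re.findall(r"<taxonName [^>]*>([^<]+)</taxonName>", marked_up)


sentence = ("Specimens of Carausius morosus were collected near "
            "Extatosoma tiaratum colonies.")
marked = mark_up_taxon_names(sentence, {"Carausius", "Extatosoma"})
names = extract_taxon_names(marked)
```

Once documents are marked up in this way, the same extraction step can be run across many of them, which is what makes a single queryable repository plausible.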
