Chicago Code Sprint

Several people have asked me to say a few words about the code sprint that took place at the Chicago Field Museum (8-11 Sept. 08). The event set out to address some of the underlying barriers in the Drupal Content Management System (CMS) that would hinder our use of Drupal for the Encyclopedia of Life's "LifeDesk" project, and that we have been grappling with at the NHM in the Scratchpad project. The goal was to bring the Drupal developer community together with a few tech-savvy biologists and come up with solutions that would benefit both Drupal and the biological community. The meeting was supported by BioSynC, the Biodiversity Synthesis Center, part of the EOL group based at the Chicago Field Museum. BioSynC provided the logistical support and were spectacular hosts, looking after our every whim. The meeting was run by David Shorthouse, who also did the initial planning. In total there were about 15 participants: mainly independent developers, several EOL informaticians, and a few biologists (myself included).

After an initial day of discussion and demonstrations, four key issues emerged as the topics to be addressed: importing and exporting hierarchies (classifications); adding flexible metadata to terms (taxa) in classifications; providing a better interface for managing and editing terms and metadata in classifications; and improved handling of ultra large (2 million+ term) classifications. What follows is a short account of the progress made in each of these areas, and of how they have moved forward since the sprint:

Import and Export of Classifications
The Scratchpads allow users to import their own custom classifications directly from an Excel spreadsheet saved as a tab-separated file (we supply a simple Excel template with instructions on how to do this). We also allow users to import classifications via a web service from uBio’s ClassificationBank, although the number of classifications available from this service right now is very limited. Ultimately we need to expand this range and create a proper repository of classifications that are accessible as a web service for users to import and manage within their site. There are numerous projects trying (and failing?) to create such a repository, and we need a common standard that enables these repositories to supply classifications and associated metadata, both to and from the LifeDesk / Scratchpad environment. After some discussion we settled upon TDWG’s Taxonomic Concept Schema (TCS), and a breakout group (spearheaded by Dan Morrison) worked on modifying the Taxonomy XML (import/export) module to support TCS. Specifically, we need to be able to import a TCS XML file and extract the metadata needed by LifeDesk / Scratchpad users. By the end of the sprint Dan demonstrated a prototype of the import/export module that works with TCS XML. This worked with a temporary web service set up by Roger Espinosa that served the Catalogue of Life classification as TCS XML, and with some example classifications in TCS that we could find on the web. I am unsure how this has been followed up since the Sprint (I can see there has been some recent activity by Dan on the Taxonomy XML module), but since much of this is so specific to the biological community, we (the Scratchpad project and EOL) need to follow it up to ensure it is robust enough for widespread use. At that point we will (at the very least) get this working on the Scratchpads.
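To make the tab-separated import a little more concrete, here is a minimal Python sketch of reading such a file into a parent-to-children map. It is not the Scratchpad import code, and the two-column layout (`name`, `parent`) is a hypothetical one chosen for illustration; the real Excel template may be organised quite differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical layout: one term per row, with its parent's name in the
# second column. An empty parent marks a root term.
SAMPLE = (
    "name\tparent\n"
    "Animalia\t\n"
    "Chordata\tAnimalia\n"
    "Aves\tChordata\n"
    "Mammalia\tChordata\n"
)


def read_classification(tsv_text):
    """Parse tab-separated (name, parent) rows into a parent -> children map."""
    children = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Root terms are filed under the key None.
        children[row["parent"] or None].append(row["name"])
    return dict(children)


def render_tree(children, parent=None, depth=0):
    """Return the hierarchy as indented lines, one term per line."""
    lines = []
    for name in children.get(parent, []):
        lines.append("  " * depth + name)
        lines.extend(render_tree(children, name, depth + 1))
    return lines


if __name__ == "__main__":
    tree = read_classification(SAMPLE)
    print("\n".join(render_tree(tree)))
```

A real importer would also have to cope with duplicate names, missing parents and the extra metadata columns, none of which this sketch attempts.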

Flexible Metadata on Terms
Once a taxonomy is present within a site, metadata (authority information, validity, type records etc.) on a term (taxon) has to go somewhere. At present a taxonomy term has only a very limited set of fields to which metadata can be added, and these are woefully inadequate for most biologists' needs. Likewise, many Drupal users have expressed concern about the lack of space to store additional metadata about a term. To fix this, a breakout group started modifying the Taxonomy Enhancer module. This would allow a taxonomy term (e.g. a species name) to point to a node that could flexibly contain any kind of metadata associated with the term. This work was completed during the Sprint and has been incorporated into the Taxonomy Enhancer module. We (the biologists) now need to follow this up by creating a content type that supports the metadata that biologists want to store. As a first step, we need to support the minimal set of Taxonomic Concept Schema data that will be imported through the import/export module. Assuming EOL does not pick this up, we will try to do this through the Scratchpad project.
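The idea of a term pointing to a node can be pictured with a small Python sketch. This only illustrates the data relationship, not how the Taxonomy Enhancer module is implemented; the class names, field names and the `metadata_for` helper are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A content node holding arbitrary metadata fields (hypothetical shape)."""
    node_id: int
    fields: dict = field(default_factory=dict)


@dataclass
class Term:
    """A taxonomy term that may point at a metadata node."""
    term_id: int
    name: str
    node_id: Optional[int] = None


def metadata_for(term, nodes):
    """Return the metadata of the node a term points to, or {} if it has none."""
    node = nodes.get(term.node_id)
    return node.fields if node else {}


# One node carrying the kind of metadata a biologist might want on a name.
nodes = {10: Node(10, {"authority": "Linnaeus, 1758", "rank": "species"})}
sparrow = Term(1, "Passer domesticus", node_id=10)
bare = Term(2, "Aves")  # a term with no metadata node attached

if __name__ == "__main__":
    print(metadata_for(sparrow, nodes))
    print(metadata_for(bare, nodes))
```

Because the metadata lives on the node rather than on the term itself, new fields can be added without changing the term structure, which is the flexibility described above.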

Handling Ultra Large Taxonomies
Biological taxonomies are huge. The biggest we (the Scratchpad project) have is about 2 million terms in a single hierarchy, and potentially these could get much bigger. Right now, Drupal with a standard memory allocation (32 MB) seriously struggles with more than about 8,000 terms. While this might be okay for the average Scratchpad / LifeDesk, it does not scale to many potential users, who often have individual classifications of 40-50 thousand terms. Handling large classifications is computationally difficult, and the way Drupal handles them is rather outdated and inefficient. To improve on this, Simon Rycroft, as part of the Scratchpad project, developed the LeftandRight module, which implements the nested set algorithm for storing large hierarchies. This has tided us over, allowing Drupal to store huge taxonomies (for example, check out this demo site with 2 million+ terms). However, (for reasons I won’t go into here) it has problems sorting terms. Specifically, it prevents users from customizing the arrangement (weight) of terms (taxa). Thus, for example, in an image gallery the gallery arrangement gets very mixed up. During the sprint, Simon started work on an alternative module that implements a different algorithm (materialized paths) that would correct this. This stores the entire path (hierarchy) of a term, and solves the sorting problem. Unfortunately, in benchmarking (speed tests) subsequent to the sprint, materialized paths proved slower than the nested sets approach. They also produced some huge tables (nearly 500 MB!) for indexes with 2 million terms. Thus we are in the process of working on some halfway-house approaches that give us the speed improvements of nested sets but the sorting (weighting) options of materialized paths. This work is ongoing, and we will keep you posted on this.
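For readers unfamiliar with the two storage schemes, the following Python sketch shows the general idea of each on a toy hierarchy. It is a textbook-style illustration, not the code of the LeftandRight module or of Simon's materialized-paths module, and the function names and the "/" path separator are assumptions made for the example.

```python
def build_nested_sets(children, root):
    """Number each term with (left, right) bounds via a depth-first walk.

    A term's descendants are exactly the terms whose bounds fall strictly
    inside its own bounds.
    """
    bounds = {}
    counter = 1

    def visit(term):
        nonlocal counter
        left = counter
        counter += 1
        for child in children.get(term, []):
            visit(child)
        bounds[term] = (left, counter)
        counter += 1

    visit(root)
    return bounds


def descendants_nested(bounds, term):
    """Find descendants by comparing integer bounds."""
    lo, hi = bounds[term]
    return [t for t, (left, right) in bounds.items() if lo < left and right < hi]


def build_paths(children, root):
    """Store the full path from the root for every term (materialized path)."""
    paths = {}
    stack = [(root, root)]
    while stack:
        term, path = stack.pop()
        paths[term] = path
        for child in children.get(term, []):
            stack.append((child, path + "/" + child))
    return paths


def descendants_paths(paths, term):
    """Find descendants by matching on the stored path prefix."""
    prefix = paths[term] + "/"
    return [t for t, p in paths.items() if p.startswith(prefix)]


# Toy hierarchy used for both schemes.
children = {
    "Animalia": ["Chordata", "Arthropoda"],
    "Chordata": ["Aves", "Mammalia"],
}

if __name__ == "__main__":
    print(build_nested_sets(children, "Animalia"))
    print(build_paths(children, "Animalia"))
```

Nested sets keep two integers per term, whereas a materialized path stores a string that grows with the depth of the term, which fits with the larger tables we saw in benchmarking.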

Taxonomy Management and Editing
Having got a classification into a site, and created somewhere for the metadata to go, we now need an intuitive environment for editing and managing the terms and their metadata. The closest Drupal has to such a tool is the Taxonomy Manager module created by Matthias Hutterer. Right now, this lacks ease of use. In particular, a number of enhancements are needed to make this environment more intuitive and to support direct editing of the metadata associated with terms. Addressing these problems was beyond the scope of what could be achieved at the Sprint. However, we discussed with Matthias the improvements we would like to see (specifically in the context of a drag and drop editing environment). In the first instance the EOL group was planning to take this forward, and I proposed that if EOL can supply the funds, I know of a Drupal developer who has the skills to address these problems. However, at the time of writing, this issue has not (to my knowledge) moved forward.

A few other technical advances were made at the Sprint. You will find a summary of them on the Wiki. Notably, Nathaniel Catchpole and Ben Melancon made several patches to fix minor bugs in the way Drupal handles taxonomy (see here for a technical summary), one of which was committed to the Drupal Core during the sprint.

Follow-ups
On the NHM side, we have been focused on addressing the issues of handling ultra large taxonomies. Arguably this is a precursor to picking up the other problems. We have also been addressing issues outside of taxonomy, so this has not been our only focus. The EOL group had a taxonomy workshop last week, which presumably is also picking up some of these issues, but at the time of writing, I’m not sure.

Useful Links
Here are a few links related to some of the activities and participants during the event:
Blog post by David Shorthouse (Drupal Taxonomy Code Sprint Redux)
EOL Taxonomy Sprint Wiki (Drupal EOL Taxonomy Group)

Workshop Photo
From left to right, Simon Rycroft, Nathaniel Catchpole, Anthony Goddard, Lisa Walley, Roger Espinosa, Matthias Hutterer, Cyndy Parr, Dan Morrison, Chacha Sikes, David Shorthouse, Benjamin Doherty, Vitthal Kudal, Alexey Shipunov, Ben Melancon. Note that I went AWOL during the photo shoot to talk with John Bates, Head of Zoology and Associate Curator of Birds at the Field (sorry!).


