Some time ago I blogged about the concept of mySpecies - a space for biological taxonomists to work on the web. The idea was inspired by Rod Page's work on iSpecies and even led me to purchase several mySpecies domain names back in August 2006. mySpecies is a fusion of the principles behind the user-generated content of mySpace and the search engine aggregation technology of iSpecies. Through my work on the EDIT Scratchpads I have had the chance to rethink some of these ideas, specifically how the concepts behind mySpecies and iSpecies relate to my broader goals of helping taxonomists help themselves through use of the Web. These ideas are outlined in the illustration below. Don't take any of this too literally - it is just a thought experiment for now:
There is an array of source databases holding information of potential relevance to taxonomists and systematists. Examples include NCBI's GenBank (molecular) and PubMed (citation) databases, TreeBASE (for phylogenies), Google Scholar (for contemporary literature), the proposed Biodiversity Heritage Library (for heritage literature), and KE EMu databases run by various institutions (for specimens). To make use of these data, custom scripts would have to be written for each data source so that, as necessary, the data can be atomized, normalized and wrapped into a form (such as RDF) that can be meaningfully used. This is only tractable because these source databases are few in number.
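To make the "atomize, normalize and wrap" idea a little more concrete, here is a minimal sketch of what one of these per-source scripts might do. Everything in it is an assumption for illustration: the field names, the "ex:" vocabulary, the sample record and the atomize function are all made up, and none of them reflect the real export formats of GenBank or PubMed. It only shows the shape of the task, which is turning a source-specific record into simple subject-predicate-object statements that could later be serialized as RDF.

```python
from dataclasses import dataclass

# Hypothetical field mappings: source-specific keys -> shared vocabulary terms.
# Neither the keys nor the "ex:" terms come from the real services.
FIELD_MAPS = {
    "genbank": {"accession": "ex:identifier", "organism": "ex:taxonName"},
    "pubmed": {"pmid": "ex:identifier", "title": "ex:title"},
}


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str


def atomize(source, record, id_key):
    """Split one source record into subject-predicate-object statements."""
    mapping = FIELD_MAPS[source]
    subject = f"ex:{source}/{record[id_key]}"
    triples = []
    for key, predicate in mapping.items():
        value = record.get(key)
        if value is None:
            continue  # skip fields the source did not supply
        # Normalize whitespace so the same value compares equal across sources.
        cleaned = " ".join(str(value).split())
        triples.append(Triple(subject, predicate, cleaned))
    return triples


# Made-up sample record, not a real accession.
sample = {"accession": "X00001", "organism": "  Aus   bus "}
triples = atomize("genbank", sample, "accession")
```

Each new source would need its own entry in the mapping (and usually its own parsing code), which is why this approach only scales to a handful of databases.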
mySpecies websites are based on the principles we are using in the EDIT Scratchpads. These are Content Management Systems (currently Drupal) that we have fitted out with various modules and templates that allow taxonomists to populate them quickly and intuitively with their own data. Each module manages datasets (e.g. images, citations, documents etc.) that are conceptually related to the taxon that is the focus of the Scratchpad. I have blogged about this elsewhere so I won't repeat this information here. For more information I would encourage you to check out the EDIT WP6 Scratchpads site or sign up for one yourself. In sum they provide a completely customizable interface to a collection of web tools that help communities of taxonomists collaboratively manage and share their data on the web. Currently initiation of a Scratchpad is managed manually but this process could be automated in the model proposed here. In this workflow mySpecies sites are the only vehicle through which taxonomists can influence the content of iSpecies pages (see section 5), but for most mySpecies users each site will be the end product of their work. Users should be able to pre-populate mySpecies modules with data from selected source databases (currently this is not part of the EDIT Scratchpad design), but they also need the option to import their own data (something we can do with selected EDIT Scratchpad modules). Unlike iSpecies pages, mySpecies websites allow anyone to say anything about any taxon, even if it repeats or contradicts content in another mySpecies website. This "multiverse" of data counters the single "universe" of information contained within the automatically generated iSpecies web pages (see section 5).
Data for each mySpecies site is held in a single isolated database (at least this is what we currently do for the EDIT Scratchpads). In this workflow these mySpecies databases feed a handful of consolidated databases based on the mySpecies modules. Because we control the backend database of each mySpecies site this is easy to achieve. For example, one of our bibliographic EDIT Scratchpad modules is already serving OAI metadata for each Scratchpad. These feeds could easily be consolidated into a single literature database. Each consolidated database would include algorithms (such as those found in EndNote for the literature database) to help de-duplicate content and, where possible, flag contradictory or suspicious information (e.g. misspellings or close matches to current content). Alerts (e.g. RSS feeds) would inform the mySpecies user of a potential error in their original site, and provide them with an option to correct or leave the source data.
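As a rough illustration of the de-duplication and flagging step, the sketch below merges exact duplicates and raises an alert for close matches, using Python's standard difflib module. It is not how EndNote does it; the consolidate function, the 0.9 similarity threshold, the site names and the sample titles are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase a title and replace punctuation with spaces."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in title.lower())
    return " ".join(kept.split())


def consolidate(records, threshold=0.9):
    """Merge exact duplicates and flag near matches.

    records: iterable of (site, title) pairs from the mySpecies feeds.
    Returns (unique_titles, alerts); each alert is
    (site, submitted_title, similar_existing_title).
    """
    unique = []  # (normalized key, original title) for each retained record
    alerts = []
    for site, title in records:
        key = normalize(title)
        exact = False
        close = None
        for seen_key, seen_title in unique:
            if key == seen_key:
                exact = True
                break
            if SequenceMatcher(None, key, seen_key).ratio() >= threshold:
                close = seen_title
        if exact:
            continue  # already held in the consolidated database
        if close is not None:
            # Suspicious near match: keep the record, but alert the contributor.
            alerts.append((site, title, close))
        unique.append((key, title))
    return [title for _, title in unique], alerts


# Made-up feed entries from three hypothetical sites.
records = [
    ("siteA", "A revision of the genus Aus"),
    ("siteB", "A Revision of the Genus Aus."),
    ("siteC", "A revison of the genus Aus"),
]
unique_titles, alerts = consolidate(records)
```

Note that a flagged record is kept rather than dropped, which matches the idea that the contributor chooses whether to correct or leave their source data. Each alert tuple could then be published on the contributing site's RSS feed.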
This is the fuzziest part of the workflow because it is technically less clear to me how this might work, and because some elements in this section are required earlier in the workflow for the mySpecies sites. Regardless, the premise of this section is that atomized and (where possible) normalized data needs to be tagged with some form of Globally Unique IDentifier (GUID) so that it can be subsequently reused and repurposed without fear of being altered or duplicated, while maintaining a connection to the original data source so that it can be attributed (tracked back) to the original data provider. Rod Page's bioGUIDs come closest to achieving this. By providing resolvable URIs for biological objects (publications, names, sequences, specimens etc.) they can be understood by a web browser and displayed as HTML, but can also be resolved to RDF, allowing them to be aggregated and queried for iSpecies pages (see step 5). In addition, the taxon names associated with these data need to be tagged such that we can subsequently reassemble these data elements around each name. To achieve this we first need an index of all taxon names (currently uBio's 9.5 million "namebank" records come closest to this) and one (or more) ontologies that specify how these names relate to each other. Arguably the Catalogue of Life project is the closest to this single ontology, defining the relationships of about one million names. The combined index of resolvable GUIDs for biological objects derived from selected source databases identified in step 1, and the consolidated "user contributed" databases of step 3, associated with a single common ontology of taxon names, provide the source data for the creation of "intelligent" (iSpecies) pages.
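The reassembly of data elements around each name could be sketched along the following lines. This is only a toy under stated assumptions: the name index is a hand-written dictionary standing in for a real names index plus classification, the GUIDs are invented strings, and "Aus bus" and friends are placeholder names rather than real taxa.

```python
from collections import defaultdict

# Hypothetical name index: every known name string -> its accepted name.
NAME_INDEX = {
    "Aus bus": "Aus bus",
    "Aus buss": "Aus bus",  # a synonym or variant spelling, for illustration
    "Aus cus": "Aus cus",
}


def group_by_taxon(objects, name_index):
    """Group (guid, name) pairs under the accepted name for each taxon.

    Returns (pages, unmatched): pages maps accepted name -> list of GUIDs,
    unmatched lists GUIDs whose name is missing from the index.
    """
    pages = defaultdict(list)
    unmatched = []
    for guid, name in objects:
        accepted = name_index.get(name)
        if accepted is None:
            unmatched.append(guid)  # no accepted name, so no page to attach to
        else:
            pages[accepted].append(guid)
    return dict(pages), unmatched


# Invented identifiers standing in for resolvable GUIDs.
objects = [
    ("guid:sequence/1", "Aus bus"),
    ("guid:publication/7", "Aus buss"),
    ("guid:specimen/3", "Aus cus"),
    ("guid:image/9", "Aus dus"),
]
pages, unmatched = group_by_taxon(objects, NAME_INDEX)
```

Because each object keeps its own GUID, grouping it under a name never severs the link back to the original data provider.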
iSpecies pages are dynamically created and not directly editable. Their content can only be influenced by changing the algorithms used to produce them or by users changing the source data (steps 1-3). This is analogous to the processes used to generate Google News. Each iSpecies page provides a single unified view of information that is conceptually linked to a valid taxon name according to a single authoritative list such as those defined in the Catalogue of Life project. Taxa not in this authoritative list will not have an iSpecies page until those names are submitted to the list. This provides an incentive for taxonomists to submit their classifications to complete the Catalogue of Life - without submitting these names their taxa are effectively invisible to iSpecies. Again this submission process would be done via the mySpecies sites through a purpose-built module. These pages are navigated by the single taxonomic classification defined by the chosen catalogue. Information on each page is organized as discrete modules based on the sets of data common to the source databases (e.g. literature, images, specimens etc.). Each module on each page has its own RSS feed and (where possible) elements of individual pages (such as an image, a specimen record or a DNA sequence) can be tagged "del.icio.us" style by a user for their bespoke reuse external to iSpecies. Each iSpecies page has a Digital Object Identifier (DOI) and all changes to the content of each page are archived by date, as in a conventional wiki. Thus a combination of date and DOI makes content on any page citable. DOIs also offer the advantage of being instantly accepted by the publishing industry. DOIs are also represented on each page as a QR-code (see the top right-hand corner of each page), giving users a physical representation of the identifier that can be resolved to the appropriate page. These QR-codes can be attached to real biological objects (e.g. specimens displayed in museums, fish species in an aquarium shop) so that the objects can be resolved back to iSpecies pages. This technology/software already exists for free (I am using it on my Nokia N91). Pages (i.e. taxa) might (arguably) include non-intrusive sponsorship to generate a revenue stream that pays the contributor (or the contributor's institution) according to their level of individual contribution. This provides the incentive for contributors to add more data through their mySpecies site.
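To show why DOI plus date is enough to make a page citable, here is a small sketch of a dated archive for one page. The PageArchive class, its method names, the citation format and the DOI string are all hypothetical (the DOI is a placeholder, not a registered identifier). The point is only that a fixed identifier plus a date picks out exactly one archived version.

```python
from bisect import bisect_right
from datetime import date


class PageArchive:
    """Dated snapshots of one iSpecies page, keyed by a fixed DOI."""

    def __init__(self, doi):
        self.doi = doi
        self._dates = []     # snapshot dates, in ascending order
        self._contents = []  # page content saved on each date

    def save(self, when, content):
        """Archive a snapshot; snapshots must be saved in date order."""
        if self._dates and when < self._dates[-1]:
            raise ValueError("snapshots must be saved in chronological order")
        self._dates.append(when)
        self._contents.append(content)

    def as_of(self, when):
        """Return the content that was current on the given date, or None."""
        i = bisect_right(self._dates, when)
        return self._contents[i - 1] if i else None

    def cite(self, when):
        """Build a citation string from the DOI and an access date."""
        return f"doi:{self.doi} (accessed {when.isoformat()})"


# Placeholder DOI and made-up page content.
archive = PageArchive("10.9999/ispecies.example")
archive.save(date(2007, 1, 1), "version 1")
archive.save(date(2007, 3, 1), "version 2")

cited_on = date(2007, 2, 15)
citation = archive.cite(cited_on)
content_then = archive.as_of(cited_on)
```

A reader following the citation later would be served the archived content for that date, even after the live page has been regenerated from newer source data.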
That is enough for now. Feel free to take these ideas apart.