(C) 2011 W. G. Berendsohn. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
For reference, use of the paginated PDF or printed version of this article is recommended.
One of the most serious bottlenecks in the scientific workflows of biodiversity sciences is the need to integrate data from different sources, software applications, and services for analysis, visualisation and publication. For more than a quarter of a century the TDWG Biodiversity Information Standards organisation has had a central role in defining and promoting data standards and protocols supporting interoperability between disparate and locally distributed systems. Although often not sufficiently recognised, TDWG standards are the foundation of many popular biodiversity informatics applications and infrastructures ranging from small desktop software solutions to large-scale international data networks. However, individual scientists and groups of collaborating scientists have difficulties in fully exploiting the potential of standards that are often notoriously complex, lack non-technical documentation, and use different representations and underlying technologies. In the last few years, a series of initiatives such as Scratchpads, the EDIT Platform for Cybertaxonomy, and biowikifarm have started to implement and set up virtual work platforms for biodiversity sciences which shield their users from the complexity of the underlying standards. Apart from being practical workhorses for numerous working processes related to biodiversity sciences, they can be seen as information brokers mediating information between multiple data standards and protocols. The ViBRANT project will further strengthen the flexibility and power of virtual biodiversity working platforms by building software interfaces between them, thus facilitating the essential information flows needed for comprehensive data exchange, data indexing, web publication, and versioning. This work will make an important contribution to the shaping of an international, interoperable, and user-oriented biodiversity information infrastructure.
EDIT, Common Data Model, CDM, Scratchpads, Standards, TDWG, biowikifarm, Taxonomy, Biodiversity, Biodiversity informatics
In the last two to three decades there has been growing recognition that biological diversity is a global asset of tremendous value to present and future generations (Convention on Biological Diversity,
Efforts to share these data soon led to the realisation that capture and storage of biodiversity data are not enough; although most of the attributes are shared across the entire domain, the data sets are not easily linked or integrated. The lack of shared vocabularies and the diversity of data structures used have impeded (and still impede) the sharing of data. Data sharing is essential to facilitate the collaboration and large-scale analysis needed for a successful treatment of the pressing issues connected with biodiversity. Standards provide a consistent representation of the data to be shared, enabling data from different sources to be combined whilst minimising loss or duplication of data.
“Biodiversity Information Standards (TDWG)” is an organisation that works on defining such standards in the field of biodiversity informatics. TDWG was originally established as the “Taxonomic Databases Working Group” by major botanical institutions and projects from around the globe in 1985 (
To be able to discuss the role of biodiversity information platforms, we need to look at some examples of the TDWG standards and other formats currently used in the field of taxonomy.
ABCD (Access to Biological Collection Data) and DwC (Darwin Core) are two standards intended to support the exchange of collection and observation data. Both have been ratified by TDWG as standard XML schemas. The ABCD standard (see
The DwC standard (
TCS (Taxonomic Concept Transfer Schema,
SDD (Structured Descriptive Data,
DwC-A (Darwin Core Archives,
Taxonomy relies on the results of more than 250 years of research laid down in scientific publications. Digitisation of this content is well under way, but to become truly useful the content needs to be converted into structured databases. Efforts are under way to standardise the markup for the content of taxonomic literature as a prerequisite for this process. TaxPub is an extension of the NLM/NCBI Journal Publishing DTD (Version 3.0) that adds elements and attributes relevant to taxonomic descriptions to the already included elements for document features (
The work done has led to a comprehensive overview of the data in the highly complex domain of biodiversity informatics. But all these modelling efforts and resulting standards have no effect if the applications the researchers use cannot import and export standardised data. This is only starting to happen. For example, tools for descriptive data can exchange data using SDD, and some formats that are not (yet) TDWG standards such as Species Profile (
There is a need for workflow-based approaches that convert and integrate data while shielding the user from the complexity of the standards and data structures. Focusing on this problem, the European Distributed Institute of Taxonomy (EDIT) created the EDIT Platform for Cybertaxonomy. The EDIT Platform supports the entire taxonomic workflow and therefore provides the means to import and export data in standardised formats (ABCD, DwC, SDD). Additionally, the EDIT-funded Scratchpads provide a scalable data publishing framework with flexible data models that can be modified by their users.
Biodiversity information platforms as information brokers
Lack of interoperability is one of the major obstacles to the establishment of efficient workflows that help scientists and other users and user groups of biodiversity informatics infrastructures to improve the quality and efficiency of their working processes. Advanced workflow management systems such as Taverna (
In contrast, the emerging biodiversity information platforms implement a different and complementary approach: they cover many different scientific and other working activities while hiding the underlying data models and access protocols completely from their users. These platforms are usually centred around a local or distributed data store based on a comprehensive information model, providing a unified instance of all data needed for scientific activities ranging from field work to data publication on paper and in web portals. Moreover, biodiversity information platforms provide the necessary interfaces to deploy external software tools and services in such a way that users can still work with the (often highly specialised) software applications they are used to. Data from external applications can be seamlessly integrated and further processed in the platform environment. In this way, biodiversity information platforms exploit their potential as information brokers and help users to benefit from information standards which they would otherwise be unable to deploy.
The ViBRANT project work package 4 (standardisation) aims to improve interoperability between biodiversity information platforms and focuses on three emerging systems: Scratchpads (
The software platform Scratchpads (
Scratchpads are built on the content management system
The content entities are related to each other by tagging them with terms from the associated vocabularies; in that sense taxonomic names provide a central link between diverse items of information about a taxon. Scratchpad taxon pages allow users to dynamically construct and curate pages of information (e.g. phenotypic data, genomic data, images, specimens, geographic distribution). External data from some third-party data services (BHL, Flickr, Wikipedia, Yahoo Images) can also be dynamically aggregated into these taxon pages. In general, however, data provided by web services can only be placed into taxon pages; it is not possible to integrate and process them in the local data structure. The only exception is taxonomic classifications, which can be obtained via the uBio ClassificationBank for Species 2000, ITIS and NCBI GenBank.
File-based imports exist for classification terms, locations and specimen data, all based on the CSV (Comma-Separated Values) file format. In line with the principle of high flexibility, none of these imports requires the data fields to be ordered in a predefined structure; as a consequence, the imports always involve user interaction and cannot be automated. Imports of structured data in standardised exchange formats exist only for bibliographic data: Scratchpads can import Tagged EndNote, RIS, MARC, EndNote 7 XML, EndNote 8+ XML and BibTeX formatted bibliographies.
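The mapping step that makes these CSV imports flexible but non-automatable can be pictured as follows. This is a minimal sketch only: the `import_records` helper and the internal field names are hypothetical, not actual Scratchpads code, and the mapping dictionary stands in for the interactive column-assignment step performed by the user.

```python
import csv
import io

def import_records(csv_text, column_mapping):
    """Map arbitrarily ordered CSV columns onto internal fields.

    `column_mapping` stands in for the interactive step in which the
    user assigns each CSV header to an internal field; the field names
    used here are hypothetical, not actual Scratchpads identifiers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{internal: row[header]
             for header, internal in column_mapping.items()}
            for row in reader]

# A file whose columns arrive in no predefined order ...
data = "Country,Collector,Catalogue No\nItaly,Smith,B-1234\n"
# ... and the mapping the user chose in the web interface:
mapping = {"Catalogue No": "unit_id",
           "Country": "country",
           "Collector": "collector"}
print(import_records(data, mapping))
```

Because the mapping depends on the uploaded file's headers, it cannot be fixed in advance, which is exactly why these imports require user interaction.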
Scratchpads provide a limited range of services to expose data to other software systems. At present these are restricted to specimen and bibliographic data (
The flexibility that allows Scratchpads to be adapted to individual needs leads to semi-structured, heterogeneous data, which often hinders their integration into service-oriented software environments. A major task in achieving better platform interoperability will be to implement web service APIs which communicate data in commonly accepted exchange formats.
EDIT Platform for Cybertaxonomy
The EDIT Platform for Cybertaxonomy (
Establishing interoperability between various existing applications and data standards was a major aim in developing the EDIT Platform. A central data repository and information broker application has been created to achieve interoperability with and between existing applications and web-based data providers. It allows other software to exchange data via import and export functionality in major data formats, or via web services.
This data repository as well as the core components “EDIT Taxonomic Editor” and “EDIT DataPortal” are based on the EDIT Common Data Model (CDM), which comprehensively covers the information domain, including nomenclature, taxonomy, descriptive data, media, geographic information, literature, specimens, and persons. Wherever possible, the CDM has been made compatible with existing community data standards. This model as a base allows managing data consistently in highly structured form. A Java (
The import and export functionalities are complemented by web services exposed by the CDM Community Server (
Another web service implements the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting,
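As a rough illustration of how a harvester interacts with such a service, the following sketch builds an OAI-PMH request and parses a simplified, invented server response. The verbs and argument names come from the OAI-PMH 2.0 specification; the endpoint URL is hypothetical, not the actual CDM Community Server address.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def build_request(base_url, verb, **args):
    """Construct an OAI-PMH request URL; verbs and argument names
    are defined by the OAI-PMH 2.0 specification."""
    return base_url + "?" + urllib.parse.urlencode({"verb": verb, **args})

def parse_identifiers(response_xml):
    """Extract the record identifiers and the resumption token
    (used to fetch the next page of results), if any."""
    root = ET.fromstring(response_xml)
    ids = [el.text for el in root.iter(OAI_NS + "identifier")]
    token = root.find(".//" + OAI_NS + "resumptionToken")
    return ids, (token.text if token is not None else None)

# The endpoint path below is hypothetical, for illustration only.
url = build_request("http://example.org/cdmserver/oai",
                    "ListIdentifiers", metadataPrefix="oai_dc")

# A minimal (invented) server response:
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>urn:lsid:example.org:taxon:1</identifier></header>
    <resumptionToken>page2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""
print(parse_identifiers(sample))
```

A real harvester would loop, re-requesting with the returned resumption token until the server omits it, which is the protocol's paging mechanism.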
EDIT Platform components like the EDIT Taxonomic Editor can directly use external data providers. This is made possible by service wrappers that allow external data to be queried, retrieved and integrated into a CDM data store, where these remote objects can be reused without losing the information on their origin. Service wrappers already exist for specific data providers like the International Plant Names Index (
The EDIT Platform architecture is mainly service oriented, thus all data flows between different applications are established through web services. The EDIT Map Services, for example, produce distribution and occurrence maps based on data coming from a CDM Store. The communication between both components is effected through a URI based web service API.
Architecture of the CDM Library and the EDIT Platform
Biowikifarm.net (
Using MediaWiki as a technological basis allows biowikifarm to focus on the “long tail” of scientific information (
The biowikifarm is part of the activities of Plazi (
Through the MediaWiki API, the software can be used as a service-oriented architecture, providing services to obtain page, file and relation objects (links, categories, semantic properties, data records) in a wide variety of data formats, including XML, RDF, JSON, HTML, and plain text.
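For example, a client might retrieve such objects through the standard `api.php` entry point along these lines. This is a minimal sketch: the parameters follow the standard MediaWiki action API, but the wiki URL and the response content shown are invented for illustration.

```python
import json
import urllib.parse

def api_request(wiki_base, titles, props=("links", "categories")):
    """Build a MediaWiki action API query URL asking for the link and
    category objects of the given pages, serialised as JSON."""
    params = {
        "action": "query",
        "titles": "|".join(titles),
        "prop": "|".join(props),
        "format": "json",
    }
    return wiki_base + "/api.php?" + urllib.parse.urlencode(params)

def page_titles(response_text):
    """Pull the resolved page titles out of an API response."""
    pages = json.loads(response_text)["query"]["pages"]
    return [p["title"] for p in pages.values()]

# The wiki URL is hypothetical; any MediaWiki installation exposes
# the same api.php entry point.
url = api_request("http://example.biowikifarm.net/w", ["Main Page"])

# A trimmed-down (invented) JSON response:
sample = '{"query": {"pages": {"123": {"title": "Main Page"}}}}'
print(page_titles(sample))
```

The same entry point serves all the output formats listed above by changing the `format` parameter, which is what makes the wikis usable as machine-readable services.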
Each of the MediaWiki installations contributing to this shared repository has different data structures, from simple arbitrarily structured wiki text pages to wikis like the “Offene Naturführer” (
None of the described platforms existed 5 years ago, nor was there any commonly used tool available to edit and share biodiversity data in general or taxonomic core data in particular. At that time most applications designed for explicit handling of taxonomic core data (names, concepts, classifications) were in-house products, covering only the restricted requirements of local users and not supporting import and export of data in any standard format, perhaps with the sole exception of the databases providing collection and observation data in GBIF and related networks. User-driven export of data to share it with other applications or projects was either not possible or ended up in user-defined formats. An example is provided by the Global Compositae Checklist (
Obviously there was a considerable need for applications providing a joint platform for such projects. However, building them is more challenging than generally expected. To mention only the main obstacles: (1) a very complex and broad data domain covering the major fields of taxonomy and nomenclature, specimens and observations, descriptive data, literature, media, molecular data, and more; (2) high demands on the usability of user interfaces, which have to cover all the complexity of the domain; (3) the absence of a standard that covers the entire domain – existing standards cover only parts and often overlap; and (4) the huge number of use cases to cover. The development is further complicated by the prerequisites for sustainable interoperability, namely (5) the demand for a generic open architecture that allows users with IT skills to adapt the software to their needs and to participate in development (open source approach), (6) the demand for independence from hardware, operating system and database management platforms, and (7) the demand for web services, which make data and functions machine-accessible and thus allow integration with other applications and with automated workflows.
Over the past years the described platforms first concentrated on the implementation of their core functionality, enabling users to do their everyday work of compiling, editing and integrating data, and publishing the results on the web or as a print publication. At the same time, the basis for more advanced features was laid by building the systems on flexible and generic (though very different) architectures, as described in the previous section.
At present, all platforms have left prototype status and are being used to create content of high value. Although the list of requested improvements and additional features is still long, it is now time to take a step back and reconsider how to integrate this content into the larger biodiversity e-infrastructure and how to connect the platforms in a way that creates additional value. All three platforms as well as other platforms like the emerging GBIF checklist bank (
As an example, we want to use the capability of the EDIT Platform to act as a data warehouse handling multiple classifications within one database for complex, high-level queries on several datasets compiled using Scratchpads and CDM implementations. For this, a periodic import of all relevant data from the respective Scratchpads and CDM Data Stores into a CDM based database will be needed. An automated procedure using a service producing DwC Archives as the transfer format is being devised for this purpose. It is also envisioned to use the result as a contribution to the Global Names Architecture, once its setup becomes clear.
There are a number of other use cases for data exchange between platforms. Single users or user groups may want to compile data within one system but synchronise them with a repository based on another system for other purposes. This use case is comparable to the handling of contact data: present-day users do not necessarily hold their contact data within only one system but synchronise them among systems and tools, each with its own purpose (direct calls, exchanging vCards, sending faxes, creating serial letters, advanced backup, synchronisation, etc.). Also, with emerging requirements and growing software functionality, users may want to switch systems without losing data or having to re-enter it manually, just as they may change their preferred mail client or word processing tool from time to time.
Moreover, other platforms can be used for backing up or versioning data. This may be a preferred use of biowikifarm, taking advantage of the highly developed versioning technology of MediaWiki. Exporting data from one platform to a MediaWiki may serve as a perfect way to fulfil the requirement of providing stable and accessible versions within a constantly changing data environment.
As described in the sections above, the Scratchpads as well as the EDIT Platform already support a number of available and commonly used standards for data exchange. However, as most existing TDWG standards and other commonly used formats handle only a subset of the full data domain managed by the platforms, these standards have only limited value for inter-platform data exchange. They are best used for initial imports or to enrich existing data. For example, ABCD and Darwin Core imports are used to add specimen data to existing taxonomy data. SDD can be used to enrich taxon records by supplementing them with highly structured descriptive data. The various literature formats are also very helpful for enriching a community site with a commonly shared literature repository. Users of platform software can communicate with, for example, the Biodiversity Heritage Library, both to use the indexed and digitised taxonomic literature and to provide information on missing titles. On the export side, existing standards like ABCD and Darwin Core are used to expose data subsets such as specimen data to data aggregators like GBIF by using existing wrapper technologies such as BioCASe (
However, most of the standards or standard implementations have drawbacks: they cover only a certain slice of the data, are not widely accepted by the community, and/or are not yet fully implemented, or are implemented only for either import or export, not both. Also, many of the implementations are file-based and do not provide the web services needed for use within automated workflows. This may be circumvented by implementing further software similar to the BioCASe software that is used for web service based exchange of ABCD data.
An interesting question in the context of web services is the problem of data rights and licenses. Where the schema for the resulting document does not contain a space for licensing information, there is presently no way for a user (and especially for a machine) to discern the licensing terms under which the data are provided. Care has thus to be taken to include this information where possible (e.g. in the metadata for a Darwin Core Archive or in the metadata section of the ABCD schema). Where this is not possible, appropriate service extensions must be defined. Of course, in an ideal world the data would be provided under a Creative Commons 0 license (see
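Where a metadata slot does exist, the rights statement can be read mechanically. The sketch below extracts the `intellectualRights` element from the EML metadata document that accompanies a Darwin Core Archive; the sample document is invented for illustration, and a production reader would handle namespaces and richer markup more carefully.

```python
import xml.etree.ElementTree as ET

def extract_rights(eml_text):
    """Return the textual rights statement from an EML metadata
    document, or None if the dataset does not declare one."""
    root = ET.fromstring(eml_text)
    # intellectualRights is the EML element for license/usage terms
    el = root.find(".//intellectualRights")
    return None if el is None else "".join(el.itertext()).strip()

# A minimal (invented) EML document:
sample = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1">
  <dataset>
    <intellectualRights>
      <para>Released under Creative Commons CC0.</para>
    </intellectualRights>
  </dataset>
</eml:eml>"""
print(extract_rights(sample))
```

A `None` result is precisely the problematic case described above: the machine has no way to determine the terms of use.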
For taxonomy-centred datasets TCS – the official TDWG standard for exchanging taxonomic data – should be the preferred format. However, TCS defines only the structure of the taxonomic backbone. Other data types such as specimen, literature or descriptive data need to be explicitly implemented using another format. This causes problems when trying to exchange broad and rich data like those stored in the Scratchpads or the EDIT Platform: both sides need to support not only TCS but also all extensions used by the other side to fill in the gaps. As this requires considerable coordination effort, TCS export has not yet been fully implemented by either platform, and TCS imports still have limitations.
Darwin Core Archive (DwC-A), a new format developed by GBIF and others, tries to address the described problems by offering a more comprehensive data format that covers all major areas of biodiversity data. It currently comes in two different flavours – either taxonomy-centred or specimen-centred – but other implementations are possible. Although DwC-A has its limitations, it is already much more widely used than TCS due to its ease of use and relative lack of ambiguity. However, all three platforms currently support DwC-A only in part or not at all. This will be addressed within ViBRANT: DwC-A import and export functionality will be implemented for the Scratchpads and the EDIT Platform; for MediaWiki it is probably sufficient to implement import functionality.
As DwC-A is primarily a file-based exchange format, a harvesting mechanism will be implemented to complement the export functionality, allowing automated harvesting of DwC-A data via web services. Within ViBRANT, this technology will be used in particular to integrate all Scratchpad data within one large EDIT Platform based database to allow visualisation and advanced querying across Scratchpad (and EDIT Platform) data. The service implementation will enable users to provide access to selected slices of the dataset (e.g. to exclude access to data on research in progress). Adequate filter mechanisms need to be implemented that allow exact definition of which data should be exported. This may become a challenging task due to the complexity of the respective data models.
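The structure such a harvester has to process can be sketched as a minimal reader that unpacks a DwC-A, consults the `meta.xml` descriptor for the location and field terms of the core data file, and returns the rows as term-to-value mappings. The toy archive content below is invented for illustration, and a full reader would also honour the delimiter and encoding settings declared in `meta.xml` (tab separation is simply assumed here).

```python
import csv
import io
import xml.etree.ElementTree as ET
import zipfile

DWCA_NS = "{http://rs.tdwg.org/dwc/text/}"

def read_core(archive):
    """Read the core data file of a Darwin Core Archive, using the
    meta.xml descriptor to locate the file and its field terms.
    (Simplification: tab separation is assumed instead of reading
    the fieldsTerminatedBy attribute.)"""
    with zipfile.ZipFile(archive) as zf:
        meta = ET.fromstring(zf.read("meta.xml"))
        core = meta.find(DWCA_NS + "core")
        location = core.find(".//" + DWCA_NS + "location").text
        fields = sorted(core.findall(DWCA_NS + "field"),
                        key=lambda f: int(f.get("index")))
        terms = [f.get("term") for f in fields]
        with zf.open(location) as fh:
            reader = csv.reader(io.TextIOWrapper(fh, "utf-8"),
                                delimiter="\t")
            return [dict(zip(terms, row)) for row in reader]

# Build a toy archive in memory (contents invented for illustration).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("meta.xml", """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files><location>taxon.txt</location></files>
    <field index="0" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
  </core>
</archive>""")
    zf.writestr("taxon.txt", "Abies alba Mill.\n")
print(read_core(buf))
```

Because each row carries fully qualified Darwin Core term URIs as keys, records from different Scratchpads can be merged into one warehouse without per-source column conventions.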
In contrast to the Scratchpads and especially the EDIT Platform, the data in MediaWikis are often unstructured or at most semi-structured. Creating generic export functionality for them will thus be very difficult, involving potentially extensive data curation measures and rendering it impractical in most cases. Importing data, however, will be straightforward and will enable users to use this platform as a repository for versioning and publishing. DwC-A could be used here as an exchange format, but as MediaWiki is a mainly text-based system it may be better served by export formats used for text publications. For example, the emerging publication format TaxPub (
Biodiversity informatics faces an increasing need for integrated working environments facilitating efficient and streamlined data capture, processing, and publishing based on community standards. The EDIT Platform for Cybertaxonomy, Scratchpads, and biowikifarm each provide practical and innovative software solutions which help users organise their data in a standardised and networked manner. Further integration will be achieved in the course of the ViBRANT project by designing and implementing interfaces between the technologies. In work package 4 (“Standardisation”), the development of several data exchange modules will contribute to an improved overall interoperability between the EDIT Platform, Scratchpads and the biowikifarm as well as facilitate external connectivity. Based on a new DwC-A export module for Scratchpads and a corresponding import function built into the Java API of the EDIT Platform for Cybertaxonomy, a comprehensive data index across all Scratchpad and CDM Datastore instances can be realised for the first time. The index will serve both (human) users wishing to perform cross-platform searches and software systems that need machine-readable access to the “ViBRANT universe”.
Connectivity between CDM stores and Scratchpads as well as CDM stores and the biowikifarm platform will be realised with XSL transformation of CDM XML publishing output. Based on these pipelines, data managed by an EDIT Platform installation can be further processed in a Scratchpad. Stable versions can be created with exports into the biowiki platform providing a semi-structured and addressable snapshot of dynamic taxonomic databases.
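Python's standard library has no XSLT processor, so as a language-neutral illustration of this transformation step the following sketch converts a hypothetical, highly simplified CDM publishing fragment into MediaWiki markup. The element names are invented for illustration; the real pipeline applies an XSL stylesheet to the actual CDM XML publishing output.

```python
import xml.etree.ElementTree as ET

def cdm_to_wikitext(cdm_xml):
    """Turn a (hypothetical, simplified) taxon fragment into MediaWiki
    markup; in the real pipeline this step is an XSL transformation
    over the actual CDM XML publishing schema."""
    taxon = ET.fromstring(cdm_xml)
    lines = ["== %s ==" % taxon.findtext("name")]  # wiki heading
    for syn in taxon.iterfind("synonym"):
        lines.append("* ''%s''" % syn.text)        # italicised list item
    return "\n".join(lines)

# A minimal (invented) input fragment:
sample = """<taxon>
  <name>Abies alba Mill.</name>
  <synonym>Pinus picea L.</synonym>
</taxon>"""
print(cdm_to_wikitext(sample))
```

Pushing such generated wiki text into a MediaWiki page then yields the versioned, citable snapshot described above.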
The processing of descriptive data will be handled as a complementary mechanism using the Xper² system. For this, SDD-based interfaces between the EDIT Platform and Xper² will be implemented and optimised for the transfer of high volumes of descriptive information. Collaborative compilation and development of new character and character-state lists will be enabled through the biowikifarm system. A service for the generation of interactive keys based on SDD documents will greatly improve the user-friendliness of portal systems across platforms.
With this, the different platforms will for the first time be able to mutually benefit from their respective strengths. The new development will represent an important cornerstone for the establishment of a harmonised and consistent international biodiversity information infrastructure.
EDIT was funded within the European Union’s Sixth Framework Programme under grant agreement n°018340. ViBRANT is funded within the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n°261532. We thank the three reviewers (Cynthia Parr, Vladimir Blagoderov and Mike Sadka) as well as Gregor Hagedorn for their most helpful comments on the manuscript.