Research Article
Corresponding author: Roderic Page (roderic.page@glasgow.ac.uk)
Academic editor: Ellinor Michel
© 2016 Roderic Page.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Page RDM (2016) Surfacing the deep data of taxonomy. In: Michel E (Ed.) Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond. ZooKeys 550: 247–260. https://doi.org/10.3897/zookeys.550.9293
Taxonomic databases are perpetuating approaches to citing literature that may have been appropriate before the Internet, often being little more than digitised 5 × 3 index cards. Usually the original taxonomic literature is either not cited, or is represented in the form of a (typically abbreviated) text string. Hence much of the “deep data” of taxonomy, such as the original descriptions, revisions, and nomenclatural actions, are largely hidden from all but the most resourceful users. At the same time there are burgeoning efforts to digitise the scientific literature, and much of this newly available content has been assigned globally unique identifiers such as Digital Object Identifiers (DOIs), which are also the identifier of choice for most modern publications. This represents an opportunity for taxonomic databases to engage with digitisation efforts. Mapping the taxonomic literature on to globally unique identifiers can be time-consuming, but need be done only once. Furthermore, if we reuse existing identifiers, rather than mint our own, we can start to build the links between the diverse data that are needed for the kinds of inference that biodiversity informatics aspires to support. Until this practice becomes widespread, the taxonomic literature will remain balkanised, and much of the knowledge that it contains will linger in obscurity.
Keywords: Biodiversity informatics, identifiers, DOI, literature, taxonomy, dark taxa, data cleaning, data integration
As an example of the consequences of this obscurity, consider the fate of the name Leviathan as used for a recently discovered fossil whale described in Nature (
Unless we want the taxonomic literature to linger in obscurity we need to make it easily findable and accessible. An obvious starting point would be if taxonomic databases linked to the digitised taxonomic literature. However, most taxonomic databases are little more than online collections of 5 × 3 index cards, a technology Linnaeus himself pioneered (
If we accept that the key documents of taxonomy are the publications that contain the names, descriptions, nomenclatural changes, and taxonomic revisions, then a major challenge is to “surface” these documents so that readers can discover them. This means changing practices that have served the community well in the pre-digital era, but which are now hindering its progress. One of the key changes will be the adoption of globally unique identifiers for the taxonomic literature.
The taxonomic community’s experience with globally unique identifiers has been mixed. Several factors have contributed to this. The first is the saga of Life Science Identifiers (LSIDs) (
“Tyrannobdella rex n. gen. n. sp. and the evolutionary origins of mucosal leech infestations. PLoS ONE, 5(4) 2010: e1057, 1–8.”
ZooBank mints its own identifier for the PLoS ONE paper: urn:lsid:zoobank.org:pub:8D431ED1-B837-4781-A591-D3886285283A (since this was written ZooBank has added the DOI for this article). Ironically, the only thing that links these two records together is the taxonomic name “Tyrannobdella”.
A consequence of the failure to reuse existing identifiers is that the biodiversity informatics community has created a large amount of data identified by a technology few people understand (LSIDs, which by default wouldn’t work in a web browser) and with very few cross-links. This lack of links means each database is effectively another silo, and hence many of the expected benefits of serving biodiversity data in RDF (
This experience may encourage a healthy scepticism about the utility of identifiers, but I would argue that this is because we’ve overlooked the importance of their reuse. If different databases insist on minting their own identifiers and not using (or linking to) existing identifiers, then our data will remain in silos. Reusing identifiers will help establish links between databases, and it is these links that will be the basis of many of the hoped-for inferences we can make in biodiversity informatics (
Taxonomic databases often contain names devoid of references to the literature. Names by themselves are of little value; it is the literature, specimens, and data derived from those specimens that are the primary data of taxonomy. Yet much of this information remains hard to obtain (even discovering that it exists can be challenging). Many citations to the taxonomic literature are obscure unless you are familiar with the conventions. For example, if you are searching for the original publication of the name Tachyglossus Illiger, 1811 (a genus of spiny anteaters) then Nomenclator Zoologicus (
One approach to tackling the plethora of ambiguous, if not downright obscure, citations is to use globally unique identifiers to refer to the publications. In the case of the “Prodromus systematis mammalium et avium” (
Using existing bibliographic identifiers has several immediate advantages. The first is that it all but eliminates ambiguity in citations. Because the same citation can be represented in multiple ways (consider the bewildering and completely unnecessary proliferation of citation styles for different journals), matching citations as strings of characters is fraught with problems. Citation strings can also “mutate” over time (
Identifiers provide additional value if they come with supporting services. For example, DOIs can be resolved to both human- and machine-readable content, which enables tools to be built that can consume DOIs and automatically populate databases with bibliographic information (most bibliographic management software makes use of these services). There are also services that take a bibliographic citation and find the corresponding DOI; publishers utilise these to add links to the list of literature cited in an article.
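As a rough illustration of the two kinds of service just described, the sketch below builds a request that asks the DOI resolver for machine-readable metadata through HTTP content negotiation, and a query URL that searches CrossRef's public REST API for a free-text citation. The function names are hypothetical, and the CSL JSON media type and the `query.bibliographic` parameter are assumptions about how those services currently behave, not details taken from this article; consult their documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumed media type for bibliographic metadata via DOI content negotiation.
CSL_JSON = "application/vnd.citationstyles.csl+json"


def doi_metadata_request(doi):
    """Build (but do not send) a request for machine-readable metadata for a DOI."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": CSL_JSON})


def citation_search_url(citation, rows=3):
    """Build a CrossRef search URL for a free-text citation string."""
    query = urllib.parse.urlencode({"query.bibliographic": citation, "rows": rows})
    return "https://api.crossref.org/works?" + query


def fetch_json(request_or_url):
    """Send the request and decode the JSON response (requires network access)."""
    with urllib.request.urlopen(request_or_url, timeout=30) as response:
        return json.load(response)


# Usage: nothing is sent until fetch_json() is called.
request = doi_metadata_request("10.3897/zookeys.550.9293")
search = citation_search_url(
    "Ogilby W (1838) Proceedings of the Zoological Society of London 1838:5-15"
)
```

Keeping request construction separate from `fetch_json` means the identifiers and queries can be inspected, logged, or cached before any network call is made.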
But the real value of identifiers becomes apparent when they are shared, that is, when different databases use the same identifiers for the same entities, instead of minting their own. Reusing identifiers can enable unexpected connections between databases. For example, the PubMed biomedical literature database has a record (PMID:948206) for the paper “Monograph on ‘Lithoglyphopsis’ aperta, the snail host of Mekong River Schistosomiasis” (
Making these connections requires not only that we have digital identifiers, but also that wherever possible we reuse existing identifiers. If we restrict ourselves to project-specific identifiers then we stymie attempts to create a network of connected data on biodiversity.
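To make the point concrete, here is a minimal sketch of how two databases that both store DOIs could be linked without any string matching on citations. The record layout (`id` and `doi` fields) and the function names are hypothetical. The only domain facts assumed are that DOIs are case-insensitive and are often written with a `doi:` or resolver-URL prefix.

```python
import re

# Prefixes commonly seen in front of a bare DOI (an assumption, not an exhaustive list).
_DOI_PREFIX = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalise_doi(raw):
    """Reduce the common ways of writing a DOI to a bare, lower-case string."""
    return _DOI_PREFIX.sub("", raw.strip()).lower()


def link_by_doi(records_a, records_b):
    """Return (id_a, id_b) pairs for records in two databases that share a DOI."""
    index = {}
    for record in records_b:
        if record.get("doi"):
            index[normalise_doi(record["doi"])] = record["id"]

    links = []
    for record in records_a:
        if record.get("doi"):
            match = index.get(normalise_doi(record["doi"]))
            if match is not None:
                links.append((record["id"], match))
    return links


# Hypothetical records from a names database and a literature database.
names_db = [
    {"id": "name:1", "doi": "doi: 10.1111/j.1096-3642.1838.tb01402.x"},
    {"id": "name:2", "doi": None},  # no identifier stored, so no link is possible
]
literature_db = [
    {"id": "pub:A", "doi": "https://doi.org/10.1111/J.1096-3642.1838.TB01402.X"},
]
links = link_by_doi(names_db, literature_db)
```

A record that stores only a project-specific identifier, like `name:2` here, cannot take part in the join, which is the silo problem in miniature.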
It is worth exploring ways we can reuse identifiers. One approach is to include links to existing identifiers wherever possible. For example, if a database includes an article that has a DOI, then that database should store the DOI as one of its fields. This is the easiest form of reuse, and doesn’t prevent the database minting its own identifiers. This approach makes sense if we are adding data that hasn’t yet been linked to existing identifiers, or if identifiers may only become available later (e.g., after a database entry has been created, a publisher subsequently digitises the print archive of a journal and issues DOIs for each article). A more powerful example of reuse is when a database incorporates existing identifiers into its own identifiers. The BBC is an excellent example of this: their music and nature sites reuse “slugs” from external resources, such as MusicBrainz and Wikipedia, respectively (
“This may not be much of a revelation to many, but is a notion that is sinking home more deeply for me of late. By “Community”, I don’t necessarily mean the online community … I mean the taxonomic community.” David Shorthouse, “The community is dead”, http://ispiders.blogspot.co.uk/2009/06/community-is-dead.html
There are many reasons why communities may or may not form, but arguably a community that shares an interest in a given topic benefits from having a standard way to refer to the things they care about. The increasing adoption of standard bibliographic identifiers such as DOIs makes it easier to build social bookmarking tools around the scientific literature (such as CiteULike http://www.citeulike.org/ and Mendeley http://www.mendeley.com/) because it becomes easier to determine how many members of the network have bookmarked the same paper.
Taxonomic communities are likely to be small and taxon-focussed. But this does not mean that these are the only communities that taxonomists can engage with, or that people outside the taxonomic community will not share the interests of those working on a particular taxon. Using bibliographic identifiers we can discover networks of people interested in particular topics that may intersect with taxonomists (obvious examples are people interested in ecology, conservation and evolutionary biology). By making publications the unit of sharing, companies such as Mendeley have grasped perhaps better than most that the connection between researchers is often not a direct social link, but rather shared interest in the same publication (formalised by patterns of citation and co-citation). For this reason, I suspect that attempts to build communities around taxa (
The taxonomic community has long felt disadvantaged by the role of citation-based “impact factor” in assessing the importance of taxonomic research (
At the same time, the concern about impact may help motivate the use of identifiers such as DOIs. There is a growing “altmetrics” movement (http://altmetrics.org/manifesto/) that aims to provide metrics for the post-publication impact of a publication in terms of activity such as social bookmarking, and commentary on web sites (
The first step towards improving the current generation of taxonomic databases would be to associate the taxonomic literature with existing digital identifiers, such as DOIs. Admittedly, this will not always be straightforward: although DOIs are the bibliographic identifier of choice, and CrossRef provides tools for locating an existing DOI for a reference, the DOI for a given publication can still be hard to find. Part of the difficulty in citing the older literature is that it lacks many of the conventions we take for granted in modern scientific articles. Modern articles have titles, and are published in journals that usually have an unambiguous name, volume number, and pagination. This triplet of journal, volume, and pagination is usually unique, and makes it relatively easy to locate an article in a bibliographic database (
Ogilby W (1838) On a collection of Mammalia procured by Captain Alexander during his journey into the country of the Damaras. Proceedings of the Zoological Society of London 1838:5–15.
This journal has been digitised by both Wiley and BHL. Wiley makes pages 5–15 available as an article with the DOI 10.1111/j.1096-3642.1838.tb01402.x and attributes the authorship to Richard Owen, not W. Ogilby. On inspection we see that pages 5–15 comprise two articles, one by Ogilby and one by Owen. The first paragraph of page 5 contains the text:
“A selection of the Mammalia procured by Captain Alexander during his recent journey into the country of the Damaras, on the South West Coast of Africa, was exhibited, and Mr. Ogilby directed the attention of the Society to the new and rare species which it contained.”
Subsequent authors have transformed this sentence into the article title “On a collection of Mammalia procured by Captain Alexander during his journey into the country of the Damaras”. Note also that in this case, there is a mismatch between the granularity at which taxonomists cite the literature and the granularity at which Wiley has assigned the identifier (the DOI corresponds to two articles). Perhaps the most obvious example of this mismatch is the BHL, which typically recognises units at the scale of journal volume, or individual pages, but not at article level (
Discovering existing identifiers for the taxonomic literature will sometimes be difficult, for a multitude of reasons. For example, taxonomic databases often store an abbreviated (or even corrupted) version of the citation, the citation may be translated from its original language, or the journal may have been renamed and the new name applied retrospectively to older issues (
While DOIs are the best-known bibliographic identifier, there are several others that are relevant to the taxonomic literature (
Identifiers also exist for aggregations of publications, such as journals. The practice of abbreviating journal titles has led to a plethora of ways to refer to the same journal. For example, the BioStor database (
Bulletin of Zoological Nomenclature
The Bulletin of Zoological Nomenclature
Bull. Zool. Nom.
Bull.zool. Nom.
Bull. Zool. Nom
Bull, Zool. Nom.
Bull Zool. Nom.
Bull. Zool.nom.
Bull. Zool Nom.
Bull., Zool. Nom.
Bull. Zool. . Nom.
Bulletin Zoological Nomenclature
Bull Zoological Nomenclature
Bull Zool Nomen
Bull. Zool. Nomencl
Bull Zool Nom.
Bulletin of Zoological Nomeclature
Bulletin Zool. Nom.
Bull. Zool. Nomencl.
This practice of abbreviating journal names (motivated by the desire to conserve space on the printed page) complicates efforts to match citations to identifiers. One approach to tackling this problem is to map abbreviations to journal-level globally unique identifiers, such as International Standard Serial Numbers (ISSNs); for the Bulletin of Zoological Nomenclature the ISSN is 0007-5167. Besides reducing ambiguity, ISSNs can be passed to web services that return the history of name changes for a journal, which in turn can help clarify the (often complicated) history of long-lived journals.
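One way such a mapping might be bootstrapped is sketched below. It is an illustrative heuristic, not the method used by BioStor or any other database: a candidate string is accepted as an abbreviation of a journal title if, after discarding punctuation and the words “the” and “of”, each remaining word is a prefix of the corresponding word in the full title. The function names and the one-entry lookup table are hypothetical; the ISSN is the one given above.

```python
import re

# Full titles keyed by ISSN; a real table would hold many journals.
JOURNALS = {"0007-5167": "Bulletin of Zoological Nomenclature"}
STOPWORDS = {"the", "of"}


def title_tokens(title):
    """Lower-case words of a title, ignoring punctuation and stopwords."""
    return [w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOPWORDS]


def is_abbreviation_of(candidate, full_title):
    """True if every word of the candidate is a prefix of the matching title word."""
    short, full = title_tokens(candidate), title_tokens(full_title)
    return len(short) == len(full) and all(
        whole.startswith(part) for part, whole in zip(short, full)
    )


def issn_for(candidate):
    """Return the ISSN of the first journal the candidate abbreviates, else None."""
    for issn, title in JOURNALS.items():
        if is_abbreviation_of(candidate, title):
            return issn
    return None


for variant in ["Bull.zool. Nom.", "Bull., Zool. Nom.", "Bulletin of Zoological Nomeclature"]:
    print(variant, "->", issn_for(variant))
```

Applied to the variants listed above, this heuristic resolves every string except the misspelled “Bulletin of Zoological Nomeclature”, which would need approximate string matching or manual curation. With more than one journal in the table, short abbreviations could match several titles, so ambiguous hits would also need review.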
To assess the extent of taxonomic digitisation I harvested the metadata associated with the LSID for each record in the ION database. This database records names published under the International Code of Zoological Nomenclature. Over 4 million records have been harvested and imported into BioNames (http://bionames.org) (
Of course, having the literature digitised is not the same as having ready access to it. Numerous parties are undertaking digitisation efforts, and the results are being made available under a wide range of conditions. Some output is available under explicitly open access licenses (
As a final motivation to surface deep taxonomic data, consider the rise of “dark taxa” in genomics databases (
It is clear that some dark taxa do, in fact, have names. For example, consider the frog “Gephyromantis aff. blanci MV-2005” (NCBI tax_id 321743), which has a single sequence AY848308 associated with it. This sequence was published as part of a DNA barcoding study (
A key question facing attempts to find names for dark taxa is whether the methods available can be scaled to handle the magnitude of the problem. One could argue that newer technologies such as DNA barcoding make classical taxonomy less relevant, and perhaps the effort in digitising older literature and exposing the taxonomic names it contains is misplaced. A counterargument would be that the taxonomic literature potentially contains a wealth of information on ecology, morphology and behaviour, often for taxa in areas that have been subsequently altered by human activity. Furthermore, as technologies such as barcoding uncover previously overlooked variation, older taxonomic names previously sunk in synonymy may yet become relevant. For example, several taxa have been synonymised with the silvery mole-rat Heliophobius argenteocinereus Peters, 1846 (
Names may have a special place in the hearts of taxonomists (
I thank Ellinor Michel for the invitation to speak at the Sherborn meeting, and for her patience as I eventually got around to writing the promised manuscript. I thank the reviewers, Donat Agosti, Ken Johnson, and Rich Pyle, for their constructive comments on the manuscript.