Research Article
Corresponding author: Richard Pyle ( deepreef@bishopmuseum.org ) Academic editor: Ellinor Michel
© 2016 Richard Pyle.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Pyle RL (2016) Towards a Global Names Architecture: The future of indexing scientific names. In: Michel E (Ed.) Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond. ZooKeys 550: 261–281. https://doi.org/10.3897/zookeys.550.10009
For more than 250 years, the taxonomic enterprise has remained almost unchanged. Certainly, the tools of the trade have improved: months-long journeys aboard sailing ships have been reduced to hours aboard jet airplanes; advanced technology allows humans to access environments that were once utterly inaccessible; GPS has replaced crude maps; digital high-resolution imagery provides far more accurate renderings of organisms than even the best commissioned artists of a century ago; and primitive candle-lit microscopes have been replaced by an array of technologies ranging from scanning electron microscopy to DNA sequencing. But the basic paradigm remains the same. Perhaps the most revolutionary change of all – which we are still in the midst of, and which has not yet been fully realized – is the means by which taxonomists manage and communicate the information of their trade. The rapid evolution in recent decades of computer database management software, and of information dissemination via the Internet, have both dramatically improved the potential for streamlining the entire taxonomic process. Unfortunately, the potential still largely exceeds the reality. The vast majority of taxonomic information is either not yet digitized, or digitized in a form that does not allow direct and easy access. Moreover, the information that is easily accessed in digital form is not yet seamlessly interconnected. In an effort to bring reality closer to potential, a loose affiliation of major taxonomic resources, including GBIF, the Encyclopedia of Life, NBII, Catalog of Life, ITIS, IPNI, ICZN, Index Fungorum, and many others, have been crafting a “Global Names Architecture” (GNA). The intention of the GNA is not to replace any of the existing taxonomic data initiatives, but rather to serve as a dynamic index to interconnect them in a way that streamlines the entire taxonomic enterprise: from gathering specimens in the field, to publication of new taxa and related data.
Taxonomy, Carl Linnaeus, Charles Davies Sherborn, Biodiversity Data, Global Names Usage Bank, Global Names Index, ZooBank, Biodiversity Library
Although biological taxonomy is sometimes referred to as the “oldest profession” (
Species tot sunt diversae quot diversas formas ab initio creavit infinitum Ens. [There are as many species as the Infinite Being produced diverse forms in the beginning.]
This is not at all surprising, given that Darwin’s concept of evolution was not proposed until a century after the start of modern nomenclature (Darwin 1859). But even then, Darwin opted not to attempt a precise definition of “species”, writing (p. 40):
Hence, in determining whether a form should be ranked as a species or a variety, the opinion of naturalists having sound judgment and wide experience seems the only guide to follow. We must, however, in many cases, decide by a majority of naturalists, for few well-marked and well-known varieties can be named which have not been ranked as species by at least some competent judges.
This idea was reflected by the definition of species by
A species is a community, or a number of related communities, whose distinctive morphological characters are, in the opinion of a competent systematist, sufficiently definite to entitle it, or them, to a specific name. [often paraphrased as, “a species is what a competent taxonomist says it is”]
Many modern taxonomists have dismissed this definition as unscientific or too reliant on the notion of what “competent” means, and as a result, debates regarding a more precise and biologically meaningful definition of species have continued over the decades well into modern times (publications too numerous to cite, but see
Regardless of its merit, acceptance, or adoption, a variant of this definition, effectively “a species is what a community of taxonomists says it is” is the de-facto species definition that has been applied since the time of Linnaeus. Taxonomists have asserted individual species circumscriptions over the course of centuries, and those circumscriptions that have met with approval by subsequent taxonomic communities have endured the test of time. In the modern context, while there are certainly species that are subject to ongoing debate, the vast majority of species have achieved some level of stability.
In stark contrast to the dynamic, ongoing, and seemingly endless debates about what a “species” is, the nomenclatural system used by taxonomists during the past two and a half centuries has been remarkably consistent, universal, and stable. The primary reason for this consistent and universal stability has to do with the Codes of scientific Nomenclature (e.g.,
It is the objective and largely stable nature of scientific names of organisms that makes them well-suited for large-scale indexing of the sort that Charles Davies Sherborn (1861–1942) dedicated his life to. Whereas the majority of the nearly 4,400 species circumscriptions described by Linnaeus in his 1758 Systema Naturae bear very little resemblance to the species boundaries asserted by modern biologists, most of the scientific names he established are not only available under the current Code, but are in current use (though often in combination with different generic names than what Linnaeus used). Even when historical scientific names have been synonymized by later workers, they remain available (when Code-compliant), and therefore potentially relevant centuries after their establishment. Although catalogs of species (e.g.,
The system of scientific nomenclature is not the only aspect of the taxonomic enterprise that has remained relatively constant over the centuries. Certainly there have been some improvements to the way taxonomists do their jobs. For example, it once required months to journey across the seas aboard sailing ships, whereas now almost any part of the world can be reached within a few hours aboard modern jet airplanes (Figure
Early taxonomists had only crude maps to plot the locations of their specimens; in this case the French Polynesian islands of Tahiti and Moorea (top, from Prévost D’Exiles 1746–1789). Today, highly accurate maps and satellite imagery can pinpoint particular locations within a few meters (bottom, Landsat).
Highly trained artisans once labored to produce detailed hand-painted illustrations of specimens (top, from
Carl Linnaeus used candle-lit microscopes with primitive optics to examine his specimens (left, H. Kingsbury). Modern technology allows us to generate high-resolution 3D CT scans of the internal structures of specimens without displacing a single scale (right top, Digimorph; Chromis abyssus), capture crisp images of tiny organisms through electron microscopy (right middle, NOAA; single-celled foraminifera), and read DNA sequences (right bottom, BOLD, unspecified taxon).
However, despite these important technological advancements in the tools of the trade for taxonomy, the fundamental process remains the same today as it was centuries ago: impassioned naturalists seek financial support from governments and private entities to travel the globe to discover new species of organisms; they take detailed notes and acquire specimens, which they carefully transport back to Museums; they create color images, dissect, poke, prod, count, and measure their biological treasures; they write detailed descriptions that are printed on paper in books and periodicals. These fundamental steps in the taxonomic process, while aided by advanced technology, have remained fundamentally unchanged since the time of Linnaeus (Figure
There is one aspect of technological change that has been truly revolutionary, which is the means by which taxonomists manage and communicate the information of their trade. The rapid advancement in recent decades of computer database management software, and of information dissemination via the Internet, have both dramatically improved the potential for streamlining the entire taxonomic process. Less than two decades ago, graduate students in taxonomy spent untold hours in libraries, scouring through pages and pages of paper documents to find original descriptions and key taxonomic revisions. Today, with a few searches on Google and with the extremely useful Biodiversity Heritage Library, many original sources are only a few mouse clicks away. And the ease of access is not limited to digitized literature; specimens, images, and vast amounts of biological information are freely available through the Internet. One of the last remaining barriers to information availability – the pay-walls behind which many newly published research hides – is gradually eroding through an increasing demand for open-access models of publication.
This revolution in digital information technology is extremely fortunate, given the rate at which species have been (and continue to be) described. In 1758, the tenth edition of Linnaeus’ Systema Naturae contained nearly 4,400 species-group names. At the time, this compilation represented the entire catalog of all known animals. In the century that followed, the number of scientific names for species had increased by two orders of magnitude, as represented in the volumes of Sherborn’s Index Animalium (Figure
As exciting as the electronic information revolution is, however, in the context of taxonomy there is still far more potential than there is reality in terms of harnessing the power of information technology. The vast majority of taxonomic information either remains non-digitized, or is digitized in a form that does not allow direct and easy access. Moreover, much (if not most) of the information that is easily accessed in digital form is not yet seamlessly interconnected. At present, the total biodiversity knowledge-base for all life forms is scattered across an estimated half-billion pages of printed literature, thousands of natural history collections housing billions of specimens, hundreds of thousands of digital databases and websites, and hundreds of millions of DNA sequences. Consumers of this knowledge-base, which includes tens of thousands of taxonomists, hundreds of thousands of biologists, a hundred million citizen scientists, governmental resource managers and policy makers, and ultimately much of the total human population, have not had easy access to this information (
The next step in the information revolution for biodiversity information involves not just the digitization of content, but also the cross-linking and more seamless integration of existing digital resources.
In an effort to bring reality closer to potential, a loose affiliation of major taxonomic resources, including the Global Biodiversity Information Facility (GBIF; http://www.gbif.org), the Encyclopedia of Life (EOL; http://eol.org), the former U.S. National Biological Information Infrastructure (NBII), Catalog of Life (CoL; http://www.catalogueoflife.org), the Integrated Taxonomic Information System (ITIS; http://www.itis.gov), the International Plant Names Index (IPNI; http://ipni.org), the International Commission on Zoological Nomenclature (ICZN; http://iczn.org), Index Fungorum (IF; http://www.indexfungorum.org), and many others have been crafting a “Global Names Architecture” (GNA). The intention of the GNA is not to replace any of the existing taxonomic or other biodiversity data initiatives, but rather to serve as a dynamic suite of web services and two primary indexes (GNI and GNUB, described below) that interconnect existing data systems in a way that streamlines the entire taxonomic enterprise: from gathering specimens in the field, to publication of new taxa and related data.
The basic premise behind the GNA is that scientific names of organisms represent the key to integrating disconnected biological data, to allow efficient and effective coordination between biological research and exploration activities, and broader understanding and management of biodiversity (
Unfortunately, sources of imprecision and ambiguity severely limit the use of these text-string names for cross-linking digital data content. For example, there are approximately two million scientifically described species (
To overcome the limitations of text-string scientific names, the GNA includes a core component called the Global Names Usage Bank (GNUB). GNUB is a highly normalized database system, the primary purpose of which is to index and assign persistent globally unique identifiers (GUIDs) to Agents, References, and Taxon Name Usage (TNU) instances (among other relevant data objects). Agents are people and organizations, and in the context of GNUB mostly represent Authors of References. References include all published literature, as well as many forms of unpublished documentation (e.g., unpublished reports and manuscripts, specimen labels, herbarium sheets, field notes, etc.). Any static documentation source can be a Reference in the GNUB architecture. A TNU is any usage or treatment of a scientific name within a Reference. TNUs are the foundation for all Code-governed nomenclatural acts, taxon concept definitions, taxonomic treatments, synonymies and classifications, and any other forms of taxonomic assertions. The subset of TNUs that represent the establishment of new scientific names (i.e., original descriptions) are called “Protonyms” (
The core elements of a TNU include the following items (see Table
A unique and persistent identifier for the TNU itself;
A link to the Reference (including page, if applicable) in which the TNU appears;
A recursive link to the Protonym-TNU for the name represented by the TNU;
An indication of the taxonomic rank at which a name was treated (e.g., “genus” for the TNU for Gasterosteus within
The exact spelling (as best as can be represented using UTF-8 encoding) of the name as used within the Reference (e.g.,
A link to the TNU (within the same Reference) representing the immediate parent taxon (e.g., the Protonym TNU for the species saltatrix within
Examples of Taxon Name Usage instances (TNUs). Records representing Protonyms are highlighted in yellow. Records representing treatments of a name as a synonym are highlighted in grey. Ellipses (…) in the Parent column represent links to higher-rank TNUs not included in the sample below. This table is highly simplified and does not represent the actual GNUB data model.
TNUID | Reference | Protonym | Rank | Spelling | Parent | Valid | Representative Usage |
---|---|---|---|---|---|---|---|
1 | | 1 | Genus | Gasterosteus | … | 1 | Protonym of genus Gasterosteus |
2 | | 1 | Genus | Gasterosteus | … | 2 | Subsequent usage of Gasterosteus |
3 | | 3 | Species | Saltatrix | 2 | 3 | Protonym of species G. saltatrix |
4 | Schöpf 1788:167 | 1 | Genus | Gaſteroſteus | … | 4 | Subsequent usage & variant of Gasterosteus |
5 | Schöpf 1788:168 | 3 | Species | Sallatrix | 4 | 5 | Subsequent usage & variant of G. saltatrix |
6 | Lacépède 1802:435 | 6 | Genus | Pomatomus | … | 6 | Protonym of genus Pomatomus |
7 | Lacépède 1802:436 | 7 | Species | Skib | 6 | 7 | Protonym of species P. skib |
8 | | 8 | Genus | Temnodon | … | 8 | Protonym of genus Temnodon |
9 | | 8 | Genus | Temnodon | … | 9 | Subsequent usage of Temnodon |
10 | | 3 | Species | Saltator | 9 | 10 | Subsequent usage & variant of G. saltatrix |
11 | | 7 | Species | Skib | - | 10 | Synonym treatment of P. skib |
12 | | 6 | Genus | Pomatomus | … | 12 | Subsequent usage of Pomatomus |
13 | | 3 | Species | Saltatrix | 12 | 13 | Subsequent usage & combination of G. saltatrix |
In cases where a name is treated as a junior synonym of another name, a link to the TNU (within the same Reference) representing the senior synonym as asserted by the indicated Reference for this junior synonym (e.g.,
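The core TNU elements listed above can be sketched as a small record type. This is only an illustration built from the simplified sample table, not the actual GNUB data model: the class name `TaxonNameUsage`, its field names, and the two helper properties are assumptions introduced here. Rows whose Reference is not given in the sample are stored with `reference=None`, and Parent links that the table elides are stored as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaxonNameUsage:
    """Hypothetical, simplified TNU record mirroring the sample table columns."""
    tnu_id: int                # identifier for the TNU itself
    reference: Optional[str]   # Reference (with page) in which the TNU appears
    protonym_id: int           # link to the Protonym TNU for this name
    rank: str                  # rank at which the name was treated
    spelling: str              # exact spelling as used in the Reference
    parent_id: Optional[int]   # parent TNU in the same Reference (None if not shown)
    valid_id: int              # TNU treated as the valid (senior) name

    @property
    def is_protonym(self) -> bool:
        # A Protonym is a TNU that points to itself as its own Protonym.
        return self.protonym_id == self.tnu_id

    @property
    def is_synonym_treatment(self) -> bool:
        # A junior-synonym treatment points to a different TNU as valid.
        return self.valid_id != self.tnu_id


# A few rows copied from the sample table (TNUIDs 4-7 and 11).
tnus = [
    TaxonNameUsage(4, "Schöpf 1788:167", 1, "Genus", "Gaſteroſteus", None, 4),
    TaxonNameUsage(5, "Schöpf 1788:168", 3, "Species", "Sallatrix", 4, 5),
    TaxonNameUsage(6, "Lacépède 1802:435", 6, "Genus", "Pomatomus", None, 6),
    TaxonNameUsage(7, "Lacépède 1802:436", 7, "Species", "Skib", 6, 7),
    TaxonNameUsage(11, None, 7, "Species", "Skib", None, 10),
]

protonym_ids = [t.tnu_id for t in tnus if t.is_protonym]
synonym_ids = [t.tnu_id for t in tnus if t.is_synonym_treatment]
print(protonym_ids, synonym_ids)
```

Because the Protonym and Valid links are themselves TNU identifiers, questions such as "is this an original description?" or "is this a synonym treatment?" reduce to comparing identifiers, with no string matching on names.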
By building an index of all TNUs across historical literature (starting with the Protonym TNU for each name), GNUB data services can efficiently perform powerful analyses and transformations of taxon names across different spellings, synonymies and classifications. For example, the species Gasterosteus saltatrix Linnaeus, 1766, has also been spelled Sallatrix in at least one Reference, spelled saltator in at least 16 References, and the species epithet (by whichever spelling) has been variously combined with the genus names Pomatomus Lacépède, 1802, Temnodon Cuvier, 1816 and Cheilodipterus Lacépède, 1801. Moreover, the GNUB index also records the fact that at least twelve other species have been treated as junior synonyms of saltatrix, and these species have been variously combined with at least ten different genus names. Thus, through GNUB we can see that the species originally established by
Largely through support from two separate NSF grants (DBI-1062441; DBI-0956415), the GNA has been developed into a highly successful proof of concept. The most visible representation is the ZooBank registry (http://zoobank.org). ZooBank was first proposed as an official online nomenclatural registry for zoology, under the auspices of the International Commission on Zoological Nomenclature (ICZN) by
Prior to 2012, ZooBank registrations grew steadily from approximately 100 registrations per month in 2008–2010, to approximately 500 registrations per month in 2011–2012. After the new GNUB-based implementation of ZooBank was launched in September 2012, registrations increased almost ten-fold, to nearly 5,000 per month. The vast majority of these registrations are prospective – that is, for works and names that are newly established. Retrospective content for ZooBank will be added through the bulk importation of existing databases, and through harvesting protonyms from BHL and other sources. Commensurate with the rise in registrations has been an increase in the ZooBank user-base. From 2008–2012, the ZooBank user base grew steadily to a little over 100 active users. In less than a year since the GNUB-based ZooBank was launched, the user base has also grown nearly ten-fold, to over 1,000 users (and it continues to grow). As successful as the new GNUB-based ZooBank has been, it is only one example of a service that GNUB can facilitate. Besides ZooBank serving as a model for GNUB-based registration systems in other nomenclatural domains, GNUB can support many other services.
Whereas name-usages within static References are indexed directly as TNUs, these are mapped to records in external and/or dynamic data sources through a simple identifier cross-link feature in GNUB. This feature, which currently includes nearly half a million links from records in more than 200 external databases to over 320,000 GNUB records, enables much more than simply linking GNUB records to external databases; specifically, it allows external databases to be linked to each other.
For example, GNUB includes links to over 111,000 registered names in ZooBank, nearly 140,000 records (taxonomic serial numbers) in ITIS, and nearly 70,000 genus-group and species-group name records in the Catalog of Fishes (CoF). Besides allowing these three datasets to link directly to GNUB (and vice-versa), the Identifier cross-link service also enables direct cross-links between each of these otherwise disconnected datasets (in this case, 67,000 linked records between ZooBank and CoF; 26,555 linked records between ZooBank and ITIS, and 26,467 linked records between ITIS and CoF). Because of this cross-linking feature, new names registered in ZooBank could be presented to ITIS and CoF for inclusion in their databases, and corrections made to errors in CoF could be propagated to both ZooBank and ITIS. By establishing cross-links between equivalent records in different database systems, we not only expand the ability for end-users to directly access records in the different systems, but we also create novel opportunities for proactive collaboration between different systems with overlapping content. While other systems include support for similar features (e.g., the EoL “Partner Links”, and the NCBI Taxonomy “LinkOut”), the GNA provides a single shared platform for all cross-links, such that anytime a record is indexed in GNUB, it is automatically cross-linked to all other data systems that are indexed in GNUB.
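The way a shared hub turns per-database links into database-to-database links can be sketched in a few lines. This is a minimal illustration under stated assumptions, not GNUB's implementation: the link tuples, the identifier strings (`gnub:1`, `zb-A`, `tsn-100`, `cof-7`, and so on), and the function `derive_cross_links` are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical identifier cross-links: (hub record, external database, external ID).
links = [
    ("gnub:1", "ZooBank", "zb-A"),
    ("gnub:1", "ITIS", "tsn-100"),
    ("gnub:1", "CoF", "cof-7"),
    ("gnub:2", "ZooBank", "zb-B"),
    ("gnub:2", "CoF", "cof-9"),
    ("gnub:3", "ITIS", "tsn-200"),
]


def derive_cross_links(links):
    """Pair up external records that are linked to the same hub record."""
    by_hub = defaultdict(list)
    for hub_id, database, external_id in links:
        by_hub[hub_id].append((database, external_id))

    pairs = defaultdict(list)
    for records in by_hub.values():
        # Sorting gives each database pair a stable (alphabetical) key.
        for (db_a, id_a), (db_b, id_b) in combinations(sorted(records), 2):
            if db_a != db_b:
                pairs[(db_a, db_b)].append((id_a, id_b))
    return dict(pairs)


cross = derive_cross_links(links)
for (db_a, db_b), matches in cross.items():
    print(db_a, "<->", db_b, matches)
```

Each database only has to maintain its own links to the hub; the pairwise links between databases are derived, which is why indexing one new record in the hub connects it to every other system already linked to that record.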
This cross-linking service is not limited to taxon names. For example, GNUB includes links to over 3,300 journals registered in ZooBank and over 3,200 journals scanned in the Biodiversity Heritage Library (BHL). Through the BHL “OpenURL” service, over 50,000 ZooBank species pages (as well as nearly 100,000 other TNU records) now have direct access to the corresponding page image in BHL. Likewise, because GNUB is linked to over 34,000 authors in the Authors of Plant Names (APN) directory, and nearly 21,000 authors registered in ZooBank, we can compare authorship trends in both domains (e.g., fewer than 1% of all authors have published new scientific names for both plants and animals).
This same cross-linking service applies to records in more than 200 different external databases (not all databases have been fully indexed yet). As such, GNUB can serve as a universal hub to cross-link records (not only Authors, References, and TNUs/Protonyms; but virtually any other data object as well), which will facilitate collaboration and data exchange (as in the names-linking example), enhance web services to infer and establish other links (as in the BHL page example), and to allow analysis of patterns that had not previously been possible (e.g., patterns of authorship over time, such as those as used by
Several other services and APIs were developed for searching, dereferencing, editing, and inserting Agents, References, and TNUs (particularly Protonyms), with a variety of output formats (e.g., HTML, JSON). The most recent service is called “GNIE” (originally an acronym for “Global Names Index Export”, and retained despite its expanded utility), which accepts an identifier for a Protonym and returns a set of all scientific names indexed in GNUB that have been used to represent the same taxon, including all homotypic synonyms (spelling variants, alternate genus combinations, etc.) and heterotypic synonyms (names that have been treated as either junior or senior synonyms of the indicated Protonym). Documentation for all of these services is included on the ZooBank API page (http://zoobank.org/Api).
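The kind of lookup described for GNIE can be sketched against the simplified sample table of TNUs given earlier in this article. This is a toy illustration, not the actual service or its API: the `TNU` tuple, the helper functions, and the returned dictionary layout are assumptions, and Parent links that the table elides are stored as `None`.

```python
from collections import namedtuple

TNU = namedtuple("TNU", "tnu_id protonym_id rank spelling parent_id valid_id")

# Rows 1-13 of the simplified sample table (Reference column omitted).
ROWS = [
    TNU(1, 1, "Genus", "Gasterosteus", None, 1),
    TNU(2, 1, "Genus", "Gasterosteus", None, 2),
    TNU(3, 3, "Species", "Saltatrix", 2, 3),
    TNU(4, 1, "Genus", "Gaſteroſteus", None, 4),
    TNU(5, 3, "Species", "Sallatrix", 4, 5),
    TNU(6, 6, "Genus", "Pomatomus", None, 6),
    TNU(7, 7, "Species", "Skib", 6, 7),
    TNU(8, 8, "Genus", "Temnodon", None, 8),
    TNU(9, 8, "Genus", "Temnodon", None, 9),
    TNU(10, 3, "Species", "Saltator", 9, 10),
    TNU(11, 7, "Species", "Skib", None, 10),
    TNU(12, 6, "Genus", "Pomatomus", None, 12),
    TNU(13, 3, "Species", "Saltatrix", 12, 13),
]
BY_ID = {t.tnu_id: t for t in ROWS}


def full_name(tnu):
    """Combine a usage with the spelling of its parent TNU, when one is recorded."""
    parent = BY_ID.get(tnu.parent_id)
    return f"{parent.spelling} {tnu.spelling}" if parent else tnu.spelling


def usages_of(protonym_id):
    """All names, as spelled and combined, that share one Protonym."""
    return {full_name(t) for t in ROWS if t.protonym_id == protonym_id}


def names_for_protonym(protonym_id):
    """Return homotypic and heterotypic names for a Protonym identifier."""
    linked = set()
    for t in ROWS:
        if t.valid_id == t.tnu_id:
            continue  # not a synonym treatment
        senior = BY_ID[t.valid_id]
        if senior.protonym_id == protonym_id and t.protonym_id != protonym_id:
            linked.add(t.protonym_id)        # treated as a junior synonym of it
        elif t.protonym_id == protonym_id and senior.protonym_id != protonym_id:
            linked.add(senior.protonym_id)   # treated as its senior synonym
    heterotypic = set()
    for pid in linked:
        heterotypic |= usages_of(pid)
    return {"homotypic": usages_of(protonym_id), "heterotypic": heterotypic}


result = names_for_protonym(3)
print(sorted(result["homotypic"]))
print(sorted(result["heterotypic"]))
```

Homotypic names fall out of the shared Protonym link, and heterotypic names come from following the Valid link in synonym treatments; neither step compares name strings, so spelling variants such as Sallatrix or Saltator are grouped without fuzzy matching.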
Although these services were used extensively in the development of ZooBank (Figure
An example ZooBank page, illustrating several GNUB services: 1 user authentication 2 “fuzzy” searching of GNUB content 3 APIs and services 4 ZooBank registration 5 External Identifier cross-linking 6 BHL page linking 7 record editing capabilities 8 similar/related name discovery (via GNI’s name searching service); and 9 multi-lingual support. Not shown are services to manage user accounts, de-duplicate records, prototype reconciliation tools, services for journal publishers, and visualization tools for author publication history and other statistics.
In addition to these services designed primarily to support external systems, several services designed for internal GNUB use were also developed. These include a user/login account management system, a robust record de-duplication resolution system, a prototype data reconciliation tool (currently optimized for Agents and journal titles) used for bulk data imports, multi-lingual support, a tool for visualizing the publication timeline history for authors, a suite of database statistics visualization services, and services to facilitate data contribution and management by publishers and editors of journals.
Unlike most existing biodiversity data initiatives, the components of the GNA are not primarily intended to provide novel information; rather, the GNA includes an index of core facts (and associated data services) that are shared across all of biology. Nothing in the GNA is original or novel content; it merely represents a structured way of organizing information to facilitate broader data integration among other databases that do contain original information. Thus, the GNA does not compete with other data resources; but rather serves as a core infrastructure for cross-linking (and thereby empowering) other biological data sources.
Although the GNA is primarily intended to provide a cross-linking service between existing databases, the data model is sufficiently robust and complete that it can fulfill the primary needs of representing nomenclature, taxonomy and classification for groups that are not otherwise represented by existing databases. Thus, while its primary function is to integrate biodiversity data across multiple disparate systems, the GNA is capable of filling the gaps in taxonomic coverage for groups of organisms not already well-represented in the broader biodiversity data landscape.
Biodiversity is Earth’s greatest Library, representing the culmination of information that has been written and re-written, edited and re-edited, over the past four billion years. We are like Kindergartners running through the Library of Congress: surrounded by vast amounts of incredibly valuable information – the genomic equivalents of the works of Homer, Shakespeare, and blueprints for a nuclear power plant and 95% efficient conversion of sunlight energy to stored chemical energy, but we are currently only able to interpret this information at the equivalent of “See Spot Run”. Someday soon (within the next few decades) we will have the ability to truly understand the information in the Biodiversity Library. As we face the 6th Great Extinction event, we recognize that the Biodiversity Library is burning, so the information will be gone before we have a chance to understand its true value. Whenever a species goes extinct, it’s like burning the last copy of a book. Taxonomists are the Librarians, and have perhaps the most important job of all: building the digital equivalent of the “card catalog” for the Biodiversity Library.
This audacious task was begun in the 1750s by Carl Linnaeus, and was dramatically extended by Charles Davies Sherborn 150 years later. With the advent of modern electronic information management, we are poised to achieve the vision of these two pillars of science; but we are in a race against the destruction of what we seek to document. We are the first generation in human history to understand our own impact to biodiversity, and we are very likely the last generation in a position to do anything about it. The Global Biodiversity Library is burning, and we must tirelessly continue to document the richness of form and function in nature before it is lost forever.
Throughout most of the history of modern taxonomy and nomenclature, the basic tasks performed by taxonomists have remained remarkably unchanged. Technology has allowed some improvements, with modern electronic information dissemination representing the most significant advancement. At this point in history, biodiversity data are being digitized at an impressive rate, but in most cases the data remain in “silos”, with limited interconnectivity. As such, the accumulated digitized data cannot be used to its full potential. The most effective way to integrate disparate biodiversity datasets is through scientific names, but for many reasons, text-string names alone are not effective for this purpose. The Global Names Architecture (GNA) has been developed to provide core indexes and cross-linking services, to help leverage otherwise disparate biodiversity data. ZooBank, the official online registry of zoological nomenclature, represents only one example of how the GNA can improve interconnectivity among biodiversity data. Going forward, the priority should be to continue digitizing data, and to develop robust cross-links among existing biodiversity datasets. Global biodiversity is precious – perhaps Earth’s most valuable resource – yet we have only begun to catalog its contents. With global climate change and accelerating rates of extinction, it is more important than ever to extend the work of Charles Davies Sherborn to apply to all known and yet-to-be-discovered taxa.
Sherborn himself seemed to understand the challenges of his task, many of which remain true today. In the Epilogue of Index Animalium (Section 2, Part 29, pp. vi-vii), he wrote:
Now my work is finished, it may well be to glance at the difficulties met with during compilation… This want of every book and every edition has been a serious hindrance and loss of time to me while working for over forty years in the British Museum (Natural History) and though I have acquired over a thousand volumes for the libraries, gaps still remain to be filled… On the whole one has met with a generous response, but the amused smile, real apathy, or the remark ‘we have no money’ … have been encountered…
He was also acutely aware of the nature of evolving technology:
And now that rotography has superseded photography as regards cost, a rare tract can be reproduced in a few hours and placed on its proper shelf in any Library for a few shillings.
But most important of all, Sherborn understood the grandeur of his quest, and knew full well that it was far greater than his own personal contributions:
In conclusion I may add that the whole of my papers, Books of Reference and apparatus will remain at the Museum for my continuator and I trust that arrangements will be made for the permanent indexing of even current literature as the only true method of economizing the time of the working zoologist.
It is our responsibility as modern biologists to harness the power of new technology to continue this all-important task of documenting biodiversity.
I wish to thank Donat Agosti, Rod Page, Ross Mounce, and especially Michael P. Taylor (http://orcid.org/0000-0002-1003-5675) for their excellent and constructive reviews of this manuscript.