Corresponding author: David Remsen (
Academic editor: E. Michel
Scientific names serve to label biodiversity information: information related to species. Names, and their underlying taxonomic definitions, however, are unstable and ambiguous. This negatively impacts the utility of names as identifiers and as effective indexing tools in biological informatics where names are commonly utilized for searching, retrieving and integrating information about species. Semiotics provides a general model for describing the relationship between taxon names and taxon concepts. It distinguishes syntactics, which governs relationships among names, from semantics, which represents the relations between those labels and the taxa to which they refer. In the semiotic context, changes in semantics (i.e., taxonomic circumscription) do not consistently result in a corresponding and reflective change in syntax. Further, when syntactic changes do occur, they may be in response to semantic changes or in response to syntactic rules. This lack of consistency in the cardinal relationship between names and taxa places limits on how scientific names may be used in biological informatics in initially anchoring, and in the subsequent retrieval and integration, of relevant biodiversity information. Precision and recall are two measures of relevance. In biological taxonomy, recall is negatively impacted by changes or ambiguity in syntax while precision is negatively impacted when there are changes or ambiguity in semantics. Because changes in syntax are not correlated with changes in semantics, scientific names may be used, singly or conflated into synonymous sets, to improve recall in pattern recognition or search and retrieval. Names cannot be used, however, to improve precision. This is because changes in syntax do not uniquely identify changes in circumscription.
These observations place limits on the utility of scientific names within biological informatics applications that rely on names as identifiers for taxa. Taxonomic systems and services used to organize and integrate information about taxa must accommodate the inherent semantic ambiguity of scientific names. The capture and articulation of circumscription differences (i.e., multiple taxon concepts) within such systems must be accompanied by distinct concept identifiers that can be employed in association with, or in replacement of, traditional scientific names.
Remsen D (2016) The use and limits of scientific names in biological informatics. In: Michel E (Ed.) Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond. ZooKeys 550: 207–223. doi:
Scientific names are labels for taxa that are governed by formalized rules of nomenclature. These rules were introduced to bring clarity, stability, economy and uniqueness to the fragmented landscape of pre-Linnaean nomenclature (
The use and value of
Throughout the past 250 years, nearly all information about taxonomic groups such as species has been linked through a name, nearly always a scientific name. (
Names label voucher specimens in natural history museums, for instance, and are used to identify biological observations at all scales, from molecules to ecosystems, providing the key biological context to associated metacontent such as the observation locality and date Figure
Scientific names label information about species.
Given the ubiquitous linkage between biodiversity information and scientific names, there must exist an enormous and virtual super-index of names tied to the world’s species information. Such an index, assembled and presented within a Sherborn-like data store, would, in principle, link to all, or nearly all, information related to all described species. This implies a far more important and central role for names as mediators to biodiversity information. As more and more retrospective and prospective information is placed in online data stores, such an index is becoming increasingly realistic (
Indexing and search engines like Google and Yahoo generate billions of dollars in revenue by processing countless electronic data stores and producing searchable indexes (
There are limits to this utility, however, and these limits are inherent in biological nomenclature and its relationship to the taxa it labels.
These two terms have analogs in taxonomy. Nomenclature, particularly formalized scientific nomenclature, governs much of the syntax domain, while semantics is the realm of taxonomy, which links names with taxon definitions or
The triangle of reference, or semiotic triangle (
In the model (Figure
All information of a species is linked by a name.
The semiotic triangle describes how names communicate meaning.
In biological taxonomy, a species name refers to a concept anchored by a specimen but created in the mind of a biologist. The function of the name is to facilitate communication. Communication is facilitated, however, only when the concepts (not the objects) are approximately congruent. Success is not black and white, but can be partial – whether partial is good enough is contingent on context-specific inference needs that the reciprocal concept alignment must fulfill. Thus, two persons look at the same avocado and one declares it a fruit, because it is derived from floral ovaries, while another declares it is not a fruit because it is not sweet. This conflict occurs when there is no congruency in the concepts invoked through the use of the name. Similar issues occur within taxonomy. In the simple case above, the term ‘fruit’ is associated with two definitions, or, more formally, the
Identifiers such as names have utility in information discovery and retrieval that is directly proportional to the degree of correlation between the term and the associated meaning or, in the semiotic context, to the correlation between syntax and semantics. Laypersons may think of scientific names as stable and unique, with a single Latin binomial name referring to one species for all time. In other words, they assume that there is a stable one-to-one relationship between a name (syntax) and the taxon (semantics) that it labels (
Relevance in information retrieval is measured as a combination of two factors: precision and recall (
Precision vs. recall in search results.
A search result, therefore, can produce two kinds of relevance errors.
A false positive error occurs when the system returns a result that is non-relevant. This is an error of precision.
A false negative error occurs when the system fails to return a relevant result. This is an error of recall.
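The two error types above can be made concrete with a small calculation. The following is a minimal sketch in Python; the record identifiers are hypothetical, and treating an empty result set as having a precision of 1.0 is an assumption of this sketch rather than something stated in the text.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned records."""
    returned, relevant = set(returned), set(relevant)
    true_positives = returned & relevant
    false_positives = returned - relevant  # errors of precision
    false_negatives = relevant - returned  # errors of recall
    # Assumed convention: an empty denominator yields a score of 1.0.
    precision = len(true_positives) / len(returned) if returned else 1.0
    recall = len(true_positives) / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical record identifiers, for illustration only.
relevant = {"r1", "r2", "r3", "r4"}
returned = {"r1", "r2", "x9"}  # "x9" is a false positive; "r3", "r4" are missed

precision, recall = precision_recall(returned, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here one of the three returned records is non-relevant, which lowers precision, and two of the four relevant records are not returned, which lowers recall.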
Based on the above, we can define a perfect identifier as one that returns 100% relevant results; that is, zero false positive and zero false negative results. This is easy to understand in a relational database system that uses internal unique identifiers to ensure that all relevant records are returned in queries. Relational integrity within a database management system relies on a 1:1 relationship between a primary key and the object it represents. Integrity would be lost if two identifiers referred to the same object or if the same identifier referred to two objects. For an identifier to be a perfect identifier, both the cardinality and the correlation between syntax and semantics must be exactly 1:1. From a taxonomic standpoint, this would require a single, unique name to refer to a single, distinct taxon. Any change or difference in semantics should be linked to a corresponding change in syntax (
Laypersons are often surprised to learn that scientific names are neither stable nor unique identifiers for taxa. The underlying causes for this instability have their roots in both syntax and semantics (
There are four cardinal relationships possible between syntax (names) and semantics (taxa) in this regard and they are summarized in Table
Summary of cardinal relationships between names and taxa.
Cardinality | Abbrev. | Example | Impact on recall | Impact on precision |
---|---|---|---|---|
One-to-One | 1:1 | Stable taxon | No | No |
Many-to-One | N:1 | Synonyms | Yes | No |
One-to-Many | 1:N | Homonyms / Polysemes | No | Yes |
Many-to-Many | N:N | Taxon concept | Yes | Yes |
The relationship between a scientific name and the taxon to which it refers always falls into one of the four conditions in this table. Each of these conditions is represented within biological taxonomy and imposes informatics challenges that, in many cases, may be mitigated.
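The four conditions can be detected mechanically once name-to-taxon links are available. The sketch below is illustrative only: the names and taxon identifiers are hypothetical, and the rule used (a link is N:N when its name has several taxa and its taxon has several names) is a simplifying assumption of this example, not a method described in the text.

```python
from collections import defaultdict


def classify_cardinality(links):
    """Classify each (name, taxon) link as 1:1, N:1, 1:N or N:N."""
    taxa_by_name = defaultdict(set)
    names_by_taxon = defaultdict(set)
    for name, taxon in links:
        taxa_by_name[name].add(taxon)
        names_by_taxon[taxon].add(name)

    result = {}
    for name, taxon in links:
        many_names = len(names_by_taxon[taxon]) > 1  # threatens recall
        many_taxa = len(taxa_by_name[name]) > 1      # threatens precision
        if many_names and many_taxa:
            result[(name, taxon)] = "N:N"
        elif many_names:
            result[(name, taxon)] = "N:1"
        elif many_taxa:
            result[(name, taxon)] = "1:N"
        else:
            result[(name, taxon)] = "1:1"
    return result


# Hypothetical names and taxon identifiers, for illustration only.
LINKS = [
    ("Aus bus", "t1"),  # one name, one taxon
    ("Cus dus", "t2"),  # two names for one taxon (synonyms)
    ("Cus eus", "t2"),
    ("Fus gus", "t3"),  # one name for two taxa (homonym or polyseme)
    ("Fus gus", "t4"),
]

for link, label in classify_cardinality(LINKS).items():
    print(link, label)
```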
One-to-One (1:1) cardinality
Syntax | Semantics | Impact | Result |
---|---|---|---|
One name | One meaning | No impact on precision/recall | Maximum relevance |
The perfect identifier, as defined above, returns no false positive or false negative results when applied in a search. Thus, a search by name returns all and only the relevant related objects. In biology, there are many taxa that are so under-studied that they are only known from their original description and none or very few subsequent references (
Recall – A single name ensures that no relevant results are missed, so there are no false negatives.
Precision – A single taxon labeled with the name will ensure no false positives are included in the results.
In reality, Latinized scientific names are complex and easily misspelled, so this pure one-to-one condition is not easily met. When misspellings occur, multiple synonyms refer to the same taxon.
Many-to-One (N:1) cardinality
Syntax | Semantics | Impacts | Result |
---|---|---|---|
Multiple names | One meaning | Recall | False negatives |
Synonyms are multiple names associated with a single taxon. Rules of nomenclature dictate that only one name is the correct label for a taxon. Any others must be “sunk” in synonymy (
Recall – Synonyms impact recall because the use of a single name will result in false negative results.
Precision – Synonyms in an N:1 condition do not impact precision because, by definition, only a single concept is involved. Thus, false positive results are not possible through matching any of the names.
Orthographic or lexical synonyms
Variations in spelling represent one class of synonyms, although they are often not formally referred to as such. The names “
Nomenclatural synonyms
Nomenclatural synonyms represent a syntactic change without an associated change in semantics. This may occur when two names are discovered to refer to the same original publication or to the specimens that form the basis for the description. For example, the name
The binomial nature of scientific names results in a change in syntax when a taxon is moved to a different genus or if a name is not published according to formal nomenclatural rules (
Taxonomic synonyms are the result of a change in circumscription that occurs when two formerly distinct taxa are merged. This may occur when broad variation within a species gives rise to multiple, correctly published species descriptions that are ultimately deemed to belong to the same taxon. For example,
In all of these cases, information tied to a single taxon may be labeled with multiple different labels. This will result in false negative results in search and retrieval across data stores containing multiple names for the taxon.
Different approaches have been applied to overcome the impact on recall inherent to synonymy.
“Fuzzy” name-matching services are used to group orthographic variants and misspellings (
Taxonomic name servers, such as those provided by uBio, iPlant and ITIS, offer thesaurus-like services that provide the list of related names that can be used to conflate a search and improve recall (
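The two approaches above, fuzzy matching and thesaurus-style expansion, can be sketched together. This is a minimal illustration, not the behavior of any of the named services: the synonym sets, names, function names, and similarity cutoff are all hypothetical, and the fuzzy step uses the general-purpose `difflib.get_close_matches` from the Python standard library rather than a matcher tuned for scientific names.

```python
import difflib

# Hypothetical thesaurus: each set holds names treated as labels for one taxon.
SYNONYM_SETS = [
    {"Aus bus", "Aus buss", "Cus bus"},
    {"Dus eus"},
]


def expand_query(name, synonym_sets=SYNONYM_SETS, cutoff=0.8):
    """Return the query name plus fuzzy matches and their synonyms."""
    known = sorted(set().union(*synonym_sets))
    # Fuzzy step: map a possibly misspelled query onto known spellings.
    candidates = set(difflib.get_close_matches(name, known, n=5, cutoff=cutoff))
    candidates.add(name)
    # Thesaurus step: add every name sharing a synonym set with a candidate.
    expanded = set(candidates)
    for group in synonym_sets:
        if group & candidates:
            expanded |= group
    return expanded


def search(records, name):
    """Return records labeled with any name in the expanded query."""
    names = expand_query(name)
    return [rec for rec in records if rec["name"] in names]


# Hypothetical records from different data stores.
RECORDS = [
    {"id": 1, "name": "Aus bus"},
    {"id": 2, "name": "Cus bus"},
    {"id": 3, "name": "Dus eus"},
]

print([rec["id"] for rec in search(RECORDS, "Aus bvs")])
```

A search on the misspelled name alone would miss every record; the expanded query recovers the records labeled with either synonym. Expansion of this kind addresses recall only and does nothing for precision.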
One-to-Many (1:N) cardinality
Syntax | Semantics | Impacts | Result |
---|---|---|---|
One name | Multiple meanings | Precision | False positives |
Homonyms are two identically-spelled names that refer to two distinct taxa. For example, the genus
The word
Recall – Homonyms do not impact recall in this condition, because, by definition, only a single name is relevant and false negative matches are not possible.
Precision – Homonyms impact precision because the name is ambiguous and can produce false positive results when a match is made to a non-target taxon.
There are two ways to improve precision when a name is too ambiguous: syntactic and pragmatic. The syntactic approach is to change the cardinality between the names and taxa from one-to-many to one-to-one. This is achieved by changing the syntax to two distinct forms. In the case of
The pragmatic approach relies on analytic techniques that try to identify context to disambiguate the term. For example, the term “monkey” or “pea” in the vicinity of the use of the name
One-to-Many cardinality
Syntax | Semantics | Impacts | Result |
---|---|---|---|
One name | Multiple meanings | Precision | False positives |
Polysemy (literally “many meanings”) is a condition similar to homonymy in which a single name refers to two taxa. Instead of consisting of entirely distinct taxa, however, the circumscriptions overlap. This occurs when taxa are lumped and split, resulting in two or more taxon concepts (
Recall – Recall is not impacted. Syntactic ambiguity is not a factor here as there is only a single name.
Precision – Polysemes impact precision because a single taxon name refers to two or more different circumscriptions for a taxon.
A polyseme is a single name referring to more than one overlapping or included concept.
The name,
Rules of nomenclature do not support reflective syntax changes due to changes in circumscription. When a taxon is split, the original name is carried on to refer to one of the resultant parts.
Berendsohn (1995) has suggested that the name be concatenated with the annotation “sensu” followed by the author of the split, so that each circumscription reference receives a unique label. In this case, the taxon would be known by two names:
“
“
This syntax would provide the means to distinguish the two circumscriptions in any future application but it leaves all previous applications ambiguous since the earlier application of the name can, in the context of the subsequent split, refer to either of the two new concepts. Any previous applications of the name would have to be re-assessed and re-labeled for any retrospective precision improvements. In some cases, this can be inferred through re-inspection and reasoning, using both manual and automated methods (
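The “sensu” annotation amounts to treating a name plus a circumscription reference as a composite identifier. The sketch below illustrates that idea under stated assumptions: the class, field names, taxon name, and author strings are all hypothetical and are not drawn from the cited proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonConcept:
    """A name paired with the reference that fixes its circumscription."""
    name: str          # scientific name (syntax)
    according_to: str  # hypothetical reference defining the circumscription

    @property
    def label(self) -> str:
        return f"{self.name} sensu {self.according_to}"


def candidate_concepts(name, concepts):
    """Return every concept that a bare name could refer to."""
    return [c for c in concepts if c.name == name]


CONCEPTS = [
    TaxonConcept("Aus bus", "Author A 1900"),  # original, broad circumscription
    TaxonConcept("Aus bus", "Author B 2000"),  # narrowed after a split
]

for concept in CONCEPTS:
    print(concept.label)

# A legacy record labeled only "Aus bus" matches both concepts.
print(len(candidate_concepts("Aus bus", CONCEPTS)))
```

The two labels are distinct going forward, but a record that carries only the bare name still maps to both concepts, which is the retrospective ambiguity described above.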
Many-to-Many cardinality
Syntax | Semantics | Impact | Result |
---|---|---|---|
Many | Many | Precision & Recall | False positives, false negatives |
Polysemes were introduced as single names that refer to multiple, related taxa, a condition that results from splitting a taxon concept into two or more new circumscriptions. Polysemy, however, is not the only result of semantic changes. When
The result of a taxonomic split on syntax and semantics.
Syntax | Semantics |
---|---|
 | Original taxon that infects both dogs and humans |
 | New taxon that only infects dogs |
 | New taxon that only infects humans |
The relationship between the names and the circumscriptions is many-to-many (N:N): the two names and three concepts are all inter-related (Franz 2014).
The net result of the split, and the resultant impact on relevance in search, is summarized in Table
Lumped and split taxon and use of names to impact relevance, where P=Precision and R=Recall.
Taxon infects | Names | Semantics | P | R |
---|---|---|---|---|
|
|
Y | Y | |
|
|
N | Y | |
Humans only |
|
N | Y |
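The effect of a split on name-based search can be worked through numerically. The sketch below is a hypothetical scenario, not a reproduction of the table above: the names “Aus bus” and “Aus cus”, the assumption that the original name is carried on by the humans-only taxon, and the records themselves are all invented for illustration.

```python
# Hypothetical records: (name used as label, host actually observed).
RECORDS = [
    ("Aus bus", "dog"),    # labeled before the split, broad concept
    ("Aus bus", "human"),  # labeled before the split, broad concept
    ("Aus bus", "human"),  # labeled after the split, humans-only concept
    ("Aus cus", "dog"),    # labeled after the split, new name for dogs-only
]


def search_by_names(names):
    """Return indexes of records labeled with any of the given names."""
    return [i for i, (label, _) in enumerate(RECORDS) if label in names]


def precision_recall(returned, relevant):
    hits = len(set(returned) & set(relevant))
    return hits / len(returned), hits / len(relevant)


def relevant_for(host):
    """Records that truly concern the taxon infecting the given host."""
    return [i for i, (_, h) in enumerate(RECORDS) if h == host]


both_names = {"Aus bus", "Aus cus"}

# Target: the dogs-only taxon. Conflating names raises recall, not precision.
dog_single = precision_recall(search_by_names({"Aus cus"}), relevant_for("dog"))
dog_both = precision_recall(search_by_names(both_names), relevant_for("dog"))

# Target: the humans-only taxon. No choice of names removes the false positive.
human_single = precision_recall(search_by_names({"Aus bus"}), relevant_for("human"))
human_both = precision_recall(search_by_names(both_names), relevant_for("human"))

print(dog_single, dog_both)
print(human_single, human_both)
```

In this scenario, adding the second name lifts recall for the dogs-only taxon at the cost of precision, while for the humans-only taxon neither the single name nor the conflated set removes the pre-split dog record from the results. That is consistent with the argument that names can improve recall but not precision.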
Scientific names link nearly all information related to a species but the relationship between nomenclatural syntax and taxonomic semantics is inherently ambiguous. Informatics processes that rely on data-gathering methods linked to taxon names are susceptible to this ambiguity and run the risk of providing imprecise or incomplete sets of data to subsequent downstream processes.
Sets of related scientific names may be used, as in today’s array of taxonomic name servers, to improve recall in search and retrieval for information tied to a taxon. The ambiguity that occurs when the same name refers to two distinct or overlapping taxa, however, means that in many cases a single name returns an imprecise result, and this cannot be rectified through the use of name services.
Comprehensive taxonomic thesauri are required to model the relationships between names and taxa. Nomenclatural databases that currently capture the objective syntactic properties of names could improve their relevance by cataloging nomenclatural synonyms, as attempted in Index Animalium. Effectively modeling semantics requires a clean division between these syntactic aspects of taxonomy and the subsequent subjective processes that result in changes in circumscription (
The author would like to extend his thanks and gratitude to the reviewers, Nico M Franz, Ph.D., Associate Professor and Curator of Insects, Arizona State University, and John Todd, Ph.D., Curator of Molluscs, Natural History Museum, London. Additional thanks go to Dr. Ellinor Michel, ICZN Executive Secretary, for serving as general editor and expediting the whole Sherborn publication process.