Commentary
Corresponding author: Derek S. Sikes (dssikes@alaska.edu). Academic editor: Jeremy Miller
© 2016 Derek S. Sikes, Kyle Copas, Tim Hirsch, John T. Longino, Dmitry Schigel.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Sikes DS, Copas K, Hirsch T, Longino JT, Schigel D (2016) On natural history collections, digitized and not: a response to Ferro and Flick. ZooKeys 618: 145-158. https://doi.org/10.3897/zookeys.618.9986
Keywords: Natural History collections, Museums, digitization, GBIF, georeferencing, data sharing
We thank
Ferro and Flick raise the concern that funding for digitization efforts is siphoning funds away from the maintenance of natural history collections (NHCs). We argue that the distinction between funding NHCs and the production of GBIF-mediated data is artificial – specimen records from NHCs are the foundation of the entomological data accessible through GBIF.org, with U.S. institutions alone providing 7.5 million georeferenced insect occurrence records citing a specimen as the ‘basis of record’, including 3.5 million records relating to insect specimens collected from U.S. lands (GBIF.org (2016-09-14) GBIF Occurrence Download http://doi.org/10.15468/dl.5txrti and GBIF.org (2016-09-14) GBIF Occurrence Download http://doi.org/10.15468/dl.1kayda). There need be no ‘choice’ between maintaining good regional specimen collections and the digitization and publication of data through online aggregated databases. Increasingly in the U.S. (
Number of insect records in GBIF.org (triangles) between December 2007 and March 2016, in comparison to all records (circles).
We agree entirely with the sentiment Ferro and Flick promote with their quote of
The data that
We strongly agree with
As remains typical of the majority of taxonomic work currently being published,
Via communication with Ferro (in lit.), during which we asked about the sharing of his dataset, we learned that Ferro felt his data were not produced in a manner ideal for sharing with GBIF. Ferro also commented on the lack of a user-friendly data pipeline to prepare and upload data to share (more on this issue below). Ferro explained that he georeferenced records using the centers of counties for each locality record rather than georeferencing them following the best practices suggestions in
Distribution mapping is changing. Such maps are not constellations of the maximum number of georeferenced occurrences, but effectively projected, modeled areas where species are thought to occur with a certain uniform or changing probability.
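The georeferencing trade-off described above can be made concrete with a minimal sketch. It assumes that a coarse georeference, such as a county centroid, is shared together with an explicit uncertainty radius, so that a modeler can set it aside while other users can still find it. The keys follow Darwin Core term names (`decimalLatitude`, `decimalLongitude`, `coordinateUncertaintyInMeters`); the records, the 5 km threshold, and the helper functions are hypothetical and are not part of any GBIF tool.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]

# Hypothetical records; keys follow Darwin Core term names.
# Catalog numbers, coordinates, and uncertainties are invented for illustration.
records: List[Record] = [
    # Georeferenced from a detailed label: small uncertainty radius.
    {"catalogNumber": "EX-0001", "decimalLatitude": 35.60,
     "decimalLongitude": -83.50, "coordinateUncertaintyInMeters": 300.0},
    # County centroid: radius large enough to reach the county boundary.
    {"catalogNumber": "EX-0002", "decimalLatitude": 35.75,
     "decimalLongitude": -83.90, "coordinateUncertaintyInMeters": 28000.0},
    # Coordinates present, but no uncertainty was recorded.
    {"catalogNumber": "EX-0003", "decimalLatitude": 36.10,
     "decimalLongitude": -82.40, "coordinateUncertaintyInMeters": None},
]


def fit_for_modeling(record: Record, max_uncertainty_m: float = 5000.0) -> bool:
    """Return True if the record has coordinates and a stated uncertainty
    no larger than the (assumed) threshold."""
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    uncertainty = record.get("coordinateUncertaintyInMeters")
    if lat is None or lon is None or uncertainty is None:
        return False
    return uncertainty <= max_uncertainty_m


def partition(
    recs: List[Record], max_uncertainty_m: float = 5000.0
) -> Tuple[List[Record], List[Record]]:
    """Split records into those fit for modeling and the rest.
    Nothing is discarded: coarse records stay available for other uses."""
    usable: List[Record] = []
    coarse: List[Record] = []
    for rec in recs:
        if fit_for_modeling(rec, max_uncertainty_m):
            usable.append(rec)
        else:
            coarse.append(rec)
    return usable, coarse


usable, coarse = partition(records)
print([r["catalogNumber"] for r in usable])
print([r["catalogNumber"] for r in coarse])
```

The point of the sketch is that the coarse records are partitioned rather than deleted: a county-level record is unsuitable for fine-scale modeling but may still serve a regional checklist.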
GBIF is currently working to improve the representation of available data to make the completeness and fitness for use of any dataset as transparent to the user as possible. We agree that GBIF.org can include clearer text and information about both the context and limitations of data accessible through GBIF. Data will always be of variable completeness and precision, and GBIF’s approach should be to ensure that users such as distribution modelers can easily restrict searches to data fit for their use, while not excluding other data that may still be useful for other purposes. However, taxonomists who are well aware that museum collections are rife with misidentifications and data quality issues such as collector bias (
Indeed, not all scientific users understand that globally aggregated data always need filtering and post-processing, as well as handling of data gaps. A constructive alliance would enlist experts to help address quality issues in the process of global data aggregation. For example, despite the increasing fraction of wrongly annotated fungal sequences in GenBank, the trustworthy ones (
The issue of data quality will never, and should never, go away. All data need vetting. The study of Hjarding et al. (2014) compared expertly vetted data obtained from various NHCs, many of which did not share data with GBIF, to unvetted data available from GBIF and, not surprisingly, found the unvetted data to be unreliable. They wrote: “Our results suggest that before conducting desktop assessments of the threatened status of species, aggregated museum locality data should be vetted against current taxonomy and localities should be verified. We conclude that available online databases are not an adequate substitute for taxonomic experts in assessing the threatened status of species and that Red List assessments may be compromised unless this extra step of verification is carried out.” We agree. This study, and the consequent discussions on iPhylo and Taxacom, covered many of the same concerns seen in
It is worth considering an analogy with GenBank regarding
To carry the GenBank analogy a little further, imagine a researcher who assembled, through their own lab work, a thorough genetic dataset for a proper phylogenetic analysis of a taxon and then did the following: (1) published a critique of GenBank complaining that it lacked most of the data the researcher had to generate, and (2) withheld their data rather than sharing them with the scientific community. Journals generally require authors publishing on newly obtained genetic data to deposit their data with GenBank. It is our hope that the taxonomic community will see the benefits of treating specimen data the way most journals treat genetic data: as an investment in the greater good, as a way of raising the standards of taxonomic research, as a way of saving future generations the time and effort of digitizing specimens (again), as a way of making taxonomic research more useful for non-taxonomic researchers, and as a way of meeting the expectations of funding agencies. We need a GenBank for specimen data – a point made by
The taxonomic community is often quite vocal about conservation of biodiversity. Many conservation efforts are based on geo-political regions, be they nations, states, parks, or refuges. However, because taxonomy organizes data by taxon rather than region, it is easier to determine where a species occurs than to determine how many and which species occur in a region. For entomology, most of these data are found only on labels on pins scattered among various NHCs and in scattered literature organized by taxon, not region. As a result, regional checklists are usually limited in taxonomic scope (e.g. one large order or family).
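The contrast between taxon-organized and region-organized data can be illustrated with a minimal sketch. It assumes that occurrence records have already been digitized and shared with a region field; once that is true, regrouping them by region to produce a checklist is a trivial operation. The keys follow Darwin Core term names (`scientificName`, `stateProvince`); the placeholder names, the records, and the helper function are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical occurrence records, listed the way a revision would present them:
# taxon by taxon. Names and regions are placeholders, not real data.
occurrences: List[Dict[str, str]] = [
    {"scientificName": "Genus alpha", "stateProvince": "Region A"},
    {"scientificName": "Genus alpha", "stateProvince": "Region B"},
    {"scientificName": "Genus beta", "stateProvince": "Region A"},
    {"scientificName": "Genus gamma", "stateProvince": "Region B"},
    {"scientificName": "Genus gamma", "stateProvince": "Region B"},  # second specimen
]


def regional_checklists(records: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Regroup taxon-organized records into one sorted species list per region.
    Records missing a name or a region are skipped."""
    by_region: Dict[str, set] = defaultdict(set)
    for rec in records:
        name = rec.get("scientificName")
        region = rec.get("stateProvince")
        if name and region:
            by_region[region].add(name)
    return {region: sorted(names) for region, names in by_region.items()}


checklists = regional_checklists(occurrences)
for region in sorted(checklists):
    print(region, checklists[region])
```

The regrouping itself is simple; what limits regional checklists in practice is that the underlying label data are not yet available in a shared, searchable form.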
If these data are shared globally they can be used for conservation of biodiversity related to land preservation or in analyses of shifting distributions resulting from climate change (e.g.
Any taxonomist who publishes new occurrence records but fails to share these data is, in effect, handicapping conservation efforts by hiding their taxa “in the dark” from geographically based searches. In particular, newly described species are often highly localized endemics known from few localities, or just the type locality. These species are of great interest to conservationists, but it is the rarest of exceptions in entomology for occurrence data for these species to be shared with GBIF. Given the conservation importance of these species, and often the relatively few specimens involved, it is unfortunate that more such small and easily prepared datasets are not shared.
Identifying and prioritizing more collections for digitization and publication through GBIF.org would serve the long-term needs of conservationists, while providing collections with greater visibility and return on investment because funding agencies are more likely to make awards to NHCs that are digitizing their holdings. A task force convened by GBIF is currently investigating how this can be best achieved through wide consultation with the global collections community (http://www.gbif.org/newsroom/news/accelerating-discovery-of-biocollections-data).
Imagine if all the NHCs from which
Historian J. J. O’Donnell, in his book Avatars of the Word (
We welcome and appreciate the great effort invested by
Because datasets generated from taxonomic revisionary work are the most thorough and highest-quality datasets available, we hope to see changes that enable these datasets to be more easily archived and shared. It is unrealistic and inefficient to expect all specimen digitization efforts to be performed by museums, especially when so much digitization is already being performed by taxonomists who borrow specimens. The changes necessary to realize this goal are both technological (e.g. easy access to data templates that can be filled in and user-friendly methods to share data) and behavioral (e.g. rewards for authors who take the extra effort to archive data,
By publishing data papers and sharing their high-quality data, taxonomic experts critical of the quality of GBIF-mediated data can contribute constructively to improvements and at the same time gain wider visibility and recognition of their professional efforts. It has been asserted many times: the future of taxonomy is decline or digital renaissance (
We thank Ferro and Flick for their thoughtful insight and concern in raising these issues. We thank Tim Robertson, Head of Informatics at GBIF, for preparation of the figure. We also thank the reviewers who made helpful suggestions to improve the manuscript.