DNA barcode library for European Gelechiidae (Lepidoptera) suggests greatly underestimated species diversity

Abstract
For the first time, a nearly complete barcode library for European Gelechiidae is provided. DNA barcode sequences (COI gene – cytochrome c oxidase 1) from 751 out of 865 nominal species, belonging to 105 genera, were successfully recovered. A total of 741 species represented by specimens with sequences ≥ 500 bp and an additional ten species represented by specimens with shorter sequences were used to produce 53 NJ trees. Intraspecific barcode divergence averaged only 0.54% whereas distance to the Nearest-Neighbour species averaged 5.58%. Of the 741 species, 710 possessed unique DNA barcodes, but 31 species could not be reliably discriminated because of barcode sharing or partial barcode overlap. Species discrimination based on the Barcode Index Number (BIN) system was successful for 668 out of 723 species, each of which clustered in one to 22 unique BINs. Fifty-five species shared a BIN with up to four species, so their identification from DNA barcode data is uncertain. Finally, 65 clusters with a unique BIN remained unidentified to species level. These putative taxa, as well as 114 nominal species with more than one BIN, suggest the presence of considerable cryptic diversity, cases which should be examined in future revisionary studies.


Introduction
The megadiverse family Gelechiidae includes approximately 4,700 known species and perhaps a similar number of undescribed taxa. With a remarkable 865 species reported from Europe and adjacent islands (Huemer and Karsholt 2020), the Gelechiidae are the fourth most diverse family of Lepidoptera in Europe after the Noctuidae, Geometridae, and Tortricidae. Due to their generally dull-coloured and inconspicuously patterned wings (Fig. 1), and frequently small size, the Gelechiidae have received little attention from lepidopterists, leading to considerable gaps in knowledge of their taxonomy, systematics, biology, and distribution. In particular, the lack of generic revisions in several diverse groups has created the widespread impression of a "difficult" family, which has further limited interest in this group.
Over the last two decades, the Gelechiidae have received increasing attention as a result of two monographs that treated approximately half the known European species (Karsholt 1999, 2010) and another on the Central European fauna (Elsner et al. 1999). Unfortunately, these publications, as well as several subsequent revisions (e.g., Bidzilya 2005a, 2005b, Bidzilya and Karsholt 2015, Karsholt and Rutten 2005, Karsholt and Šumpich 2015, Li and Sattler 2012), did not take advantage of new molecular methods, in particular DNA barcoding. By contrast, phylogenetic analysis of higher taxa in Gelechiidae benefitted greatly from molecular data. Moreover, recent studies on several genera of European Gelechiidae (Huemer and Mutanen 2012, Huemer and Karsholt 2014, Landry et al. 2017) revealed the power of this approach to aid species delimitation in taxonomically difficult groups, even those with a high level of unrecorded species and cryptic diversity. Similar patterns have been analyzed in several other Lepidoptera in different parts of the world, e.g., in another gelechioid group, in Iberian butterflies (Dincă et al. 2015), in North American Noctuoidea (Zahiri et al. 2017), or in the Lepidoptera fauna of Costa Rica (Janzen and Hallwachs 2016). These results motivated the present effort to compile a comprehensive DNA barcode library for the European Gelechiidae fauna, with the aim of simplifying future revisionary studies while also improving their quality.

Checklist of European Gelechiidae
The lack of an updated checklist for European Gelechiidae (see Karsholt 2004-2019) was such a major impediment to the present study that it necessitated the assembly of a new systematic list (Huemer and Karsholt 2020). This list, which includes 865 species of Gelechiidae in 109 genera, provided the basis for selecting the specimens that were analysed in this study.

Sample material
One major challenge was the difficulty in accessing specimens suitable for molecular analysis, reflecting the rarity of many species. DNA quality was a second important limitation, as sequence recovery from older specimens of rare taxa was either partial or failed completely, even with protocols that employed high-throughput sequencers to analyze short amplicons. In some cases, efforts were made to recollect taxa that lacked a sequence record.
Voucher material was obtained from Europe (Fig. 2) except for eleven taxa whose sequences could not be recovered from specimens from this continent or where it seemed important to analyze specimens to clarify taxonomy (e.g., extra-European type-material) (Suppl. material 2, 3). Approximately two-thirds of specimens originated from four nations: Germany (1319), Austria (1157), Italy (906), and Finland (707). The remaining specimens derived from 33 other countries (Fig. 2).
Many institutions and private collectors contributed to the dataset (see below), supplemented by DNA barcodes from earlier studies.

DNA sequencing
A single leg was removed from each specimen and placed in a 96-well lysis plate that was submitted for analysis at the CCDB (Canadian Centre for DNA Barcoding, University of Guelph, Canada) where DNA extraction, PCR amplification, and sequencing were performed following standard high-throughput protocols (deWaard et al. 2008). In total, 5986 specimens of European Gelechiidae were successfully sequenced. These specimens had been pre-identified by several colleagues from external morphology and, in part, genitalia morphology, and were cross-checked by PH and OK in dubious cases. Details of specimens, including complete voucher data, images, and GenBank accession numbers are available on BOLD (Ratnasingham 2018, Ratnasingham and Hebert 2007) in the public dataset "Lepidoptera (Gelechiidae) of Europe" under the DOI: dx.doi.org/10.5883/DS-GELECHEU.

Data analysis
Levels of intra- and interspecific variation in the DNA barcode fragment were calculated under the Kimura 2-parameter (K2P) model of nucleotide substitution using analytical tools in BOLD systems v4.0 (http://www.boldsystems.org). Fifty-three Neighbor-Joining trees (Maximum Composite Likelihood method, default settings), most including representatives of a single genus, were constructed using MEGA X (Kumar et al. 2018) (Suppl. material 2 and 3). Node confidences were estimated using 500 bootstrap replicates. For genera with few species, several morphologically closely related genera were included in a single tree. Only sequences ≥ 500 bp were used to calculate these trees, except for ten species where only shorter sequences were available (Suppl. material 1). Where the specimens of a single species were assigned to two or more different BINs, the BINs were distinguished by a letter code. Because of the high number of BINs for Megacraspedus dolosellus and M. lanceolellus, these taxa were figured in two separate NJ trees with BINs separated as single clusters. Species sharing a BIN but still possessing a diagnostic barcode were grouped in separate clusters. A three-letter code (ISO 3166-1 alpha-3, https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) was used to abbreviate country names. Identification success was assessed by the Barcode Index Number (BIN) system as implemented on BOLD. This system employs a two-stage algorithm that groups all sequences > 500 bp that meet defined quality criteria into Operational Taxonomic Units (OTUs) and automatically assigns new sequences, irrespective of their previous taxonomy and origin. Concordance or discordance between BINs and morphological species identification was assessed.
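For readers unfamiliar with the K2P model, the following Python sketch illustrates how a pairwise K2P distance can be computed from two aligned sequences. It is not the BOLD implementation; the function name `k2p_distance` and the choice to skip sites with gaps or ambiguity codes are assumptions made for illustration. With P the proportion of transitions and Q the proportion of transversions, the distance is d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q).

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Illustrative helper (hypothetical name). Sites where either sequence
    has a gap or an ambiguity code are skipped, which is an assumption.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")

    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent
    # (saturated) for the K2P correction to be defined.
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)
```

A single transition across ten compared sites, for example, gives a K2P distance of about 0.112, slightly above the uncorrected proportion of 0.1.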

Overview
DNA barcode sequences were recovered from 5986 specimens representing 751 of the 865 species of Gelechiidae described from Europe (Suppl. material 1). In addition, the analysis revealed 65 putative species whose members were each assigned to a different unique BIN. Most sequences (5476) were compliant with the barcode standard as described in BOLD (http://www.boldsystems.org). Most subsequent analyses only considered the 741 species with sequences ≥ 500 bp, but ten additional species with sequences ≥ 300 bp were included in the NJ trees. Sequences from 723 species qualified for BIN analysis.

Species delimitation from DNA barcode divergences
Intraspecific DNA barcode variation in the 741 named species with sequences ≥ 500 bp averaged 0.54%, but this may be an underestimate as 224 taxa were represented only by singletons. Mean intraspecific DNA barcode variation was distributed as follows: 73.1% of sequenced species had variation ranging from 0-1%, 15.8% between 1-2%, 6.3% between 2-3%, and 4.8% > 3%.
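The binning behind such a distribution can be sketched in a few lines of Python. This is an illustration only: the function name `class_shares` is hypothetical, the input values in the usage note are invented, and the treatment of values falling exactly on a class boundary (assigned to the lower class) is an assumption, as the text does not specify it.

```python
def class_shares(mean_divergences_pct):
    """Share of species (in %) per class of mean intraspecific divergence.

    `mean_divergences_pct` holds one mean divergence value (in %) per
    species. Boundary values go to the lower class (assumption).
    """
    counts = {"0-1%": 0, "1-2%": 0, "2-3%": 0, ">3%": 0}
    for value in mean_divergences_pct:
        if value < 0:
            raise ValueError("divergence cannot be negative")
        if value <= 1.0:
            counts["0-1%"] += 1
        elif value <= 2.0:
            counts["1-2%"] += 1
        elif value <= 3.0:
            counts["2-3%"] += 1
        else:
            counts[">3%"] += 1

    total = len(mean_divergences_pct)
    if total == 0:
        raise ValueError("no species supplied")
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}
```

Called on five invented values, `class_shares([0.0, 0.5, 1.5, 2.5, 4.0])`, the function assigns two species to the lowest class and one to each of the others.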

Species delimitation with Barcode Index Number (BIN) system
In total, 5877 sequences were assigned to a BIN. These records fell into 992 BINs that belong to 788 putative taxa (Suppl. material 2 and 3). Among these, 723 corresponded with named species, while another 65 each belong to a unique BIN that is currently unidentified; many of these likely represent additional, unrecognised species. Specimens from 114 of the named species were assigned to more than one BIN; members of 68 species were placed in two BINs, while BIN counts for the other 46 species ranged from three to 22 (Table 2).
Altogether 668 (92.4%) of 723 named species have one or more unique BINs, while 55 species (7.6%) share a BIN with up to four species (Table 3). BIN sharing was particularly frequent in six genera (Acompsia, Dirhinosia, Iwaruna, Scrobipalpula, Teleiopsis, Xenolechia) where species often cannot be discriminated by DNA barcodes. However, most specimens in these taxa have diagnostic barcodes and all possess diagnostic morphological characters.
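The concordance categories used here can be illustrated with a short Python sketch. It is a simplified reading of the text, not the procedure run on BOLD: the function name `bin_concordance`, the input format (species name paired with BIN identifier), and the placeholder names in the usage note are all assumptions. A species counts as having a unique BIN if at least one of its BINs contains no other species.

```python
from collections import defaultdict


def bin_concordance(records):
    """Classify species by BIN concordance.

    `records` is an iterable of (species, bin_id) pairs, one per sequence.
    Returns three sets of species names:
      unique    - at least one BIN holds only this species
      shared    - every BIN of the species also holds another species
      multi_bin - the species is spread over more than one BIN
    """
    bins_by_species = defaultdict(set)
    species_by_bin = defaultdict(set)
    for species, bin_id in records:
        bins_by_species[species].add(bin_id)
        species_by_bin[bin_id].add(species)

    unique, shared, multi_bin = set(), set(), set()
    for species, bins in bins_by_species.items():
        if any(len(species_by_bin[b]) == 1 for b in bins):
            unique.add(species)
        else:
            shared.add(species)
        if len(bins) > 1:
            multi_bin.add(species)
    return {"unique": unique, "shared": shared, "multi_bin": multi_bin}
```

With invented records such as `[("sp_A", "BIN1"), ("sp_A", "BIN2"), ("sp_B", "BIN3"), ("sp_C", "BIN3")]`, species sp_A is reported as having unique BINs and as a multi-BIN species, while sp_B and sp_C are reported as sharing a BIN. The percentages in the text follow directly from the counts: 668 of 723 is 92.4% and 55 of 723 is 7.6%.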

Potential cryptic diversity – unrevised taxa
High levels of 'intraspecific' barcode variation often reflect overlooked species, but there is no fixed level of divergence that indicates species status. Furthermore, deep barcode splits can also arise as a result of the inadvertent recovery of pseudogenes, as a consequence of hybridisation, or Wolbachia infection (Mally et al. 2018, Werren et al. 2008). In Lepidoptera, 2-3% divergence is occasionally viewed as signalling the need for further integrative analysis (Hausmann et al. 2013), but there is clear evidence that no such threshold values exist (see e.g., Kekkonen et al. 2015). In the present dataset, 146 of 741 nominal species possessed a maximum intraspecific divergence of > 2% and 88 species > 3%, while 33 species showed > 5% (Table 4).
In some recently revised taxa with high, geographically structured intraspecific barcode divergence such as Megacraspedus (Huemer and Karsholt 2018) or the Oxypteryx libertinella species-group, no evidence for cryptic diversity was found. However, even lower 'intraspecific' barcode divergence may reflect cases of either allopatric or sympatric speciation, as proven, e.g., for the genus Sattleria (Hebert 2011, Huemer and Timossi 2014). In consequence, several species with unusual genetic patterns need to be carefully re-assessed as they may include additional species. Cryptic diversity was, for example, already suspected for some Caryocolum (Huemer et al. 2015) or Stomopteryx remissella, but may also be detected in recently revised genera such as Acompsia or Chionodes (Karsholt 2002, Huemer and Sattler 1995).
A further group of unrevised species in our dataset includes 65 unidentified DNA barcode clusters which were assigned to separate BINs (Table 5). Many of these cases are likely to represent undescribed species or alternatively, they may represent described species that currently lack barcode coverage. Altogether 26 genera representing approximately one-quarter of European genera are candidates for additional taxa. In fact, four genera (Aproaerema, Aristotelia, Monochroa, Scrobipalpa) are each represented by more than five unidentified clusters. For detailed comments on these cases, see Huemer and Karsholt (2020).

Discussion
During the past decade, several national DNA barcoding campaigns have led to the development of an increasingly well-parameterised DNA barcode library for European Lepidoptera. However, these projects have mainly focused on the fauna of central and northern Europe. As a consequence, genetic coverage for species in the Mediterranean region remains patchy. Reflecting this fact, continent-wide analysis has only considered a few groups so far, such as Nepticulidae (van Nieukerken pers. comm.), Gracillariidae (Lopez-Vaamonde pers. comm.), Elachistinae, Depressariidae (Buchner pers. comm.), Geometridae (Hausmann et al. 2013, Müller et al. 2019), and Papilionoidea (Dincă pers. comm.). By contrast, for most families either few DNA barcodes exist, or comprehensive genetic analysis is not available.
The current DNA barcode library makes it clear that the Gelechiidae are a particularly good example of the serious gaps in the knowledge of European biodiversity. Nearly a quarter of the current fauna has been described since 1990 (Fig. 3). This gap between European gelechiid diversity and adequate coverage in published alpha-taxonomy is most probably a result of: 1) the small number of gelechiid experts, 2) the lack of adequate vouchers for phenotypic and molecular study, 3) the frequently cryptic morphology making them less attractive to non-expert workers, and 4) the infrequent consideration of molecular data to assess taxonomic boundaries.
[Figure 3: species descriptions per decennium and cumulative number of described species; x-axis: decennium, 1785 to 2015.]
In the present study, DNA sequences revealed a high level of possible cryptic diversity in European Gelechiidae, despite extensive revisionary work over the last decades (see e.g., Karsholt 1999, 2010). Although almost 96% of all 741 species possessed unique barcodes, maximum intraspecific divergence exceeded 2% in nearly a fifth of currently recognised species, and in 33 of these cases it exceeded 5%, values that likely signal overlooked species.
The intraspecific DNA barcode variation is reflected in some taxa as allopatric divergence, but in other cases, it reflects sympatric deep splits. However, few of these species have received detailed taxonomic assessment such as the recent comprehensive study on Megacraspedus (Huemer and Karsholt 2018). In many other unrevised genera/species-groups a significant increase in species diversity is likely. The major gaps in taxonomic treatment of European Gelechiidae are further demonstrated by the large number of unidentified genetic clusters revealed by the present investigation as many of these 65 putative taxa are likely to represent undescribed species.

Conclusions
By providing coverage for 751 species of European Gelechiidae, the current DNA barcode library represents the largest release in terms of species diversity for any family of Lepidoptera on this continent. The results reveal unexpected genetic diversity in many taxa as well as numerous unidentified taxa. This indicates that the alpha-taxonomy of this family still requires serious attention, even though one-quarter of the known species have been described since 1990. The current results indicate that the Gelechiidae remain one of the most taxonomically challenging families of Lepidoptera in the world, as complete coverage of even the European fauna will require extensive effort. However, the DNA barcode library generated in this study will allow these revisionary studies to target groups that are particularly problematic, accelerating the documentation of the fauna.
Cultural foundation and Kone foundation for financial support to the Finnish Barcode of Life (FinBOL) project.
We gratefully acknowledge the material donated to our institutional collections by several colleagues, which was of crucial value for our analysis, and we thank the colleagues (mentioned in the list of private and institutional collections) for supporting our DNA barcoding approach.
Philipp Kirschner (Innsbruck, Austria) most kindly helped with the construction of Fig. 3, Romed Unterasinger (TLMF) with the compilation of Suppl. material 2, and Andreas Eckelt (TLMF) with Fig. 1. This study was only possible due to contributions from many colleagues. As such it provides an impetus for closer co-operation among the community of taxonomists working on Gelechiidae and similarly 'difficult' groups of other micro-moths.