A complete time-calibrated multi-gene phylogeny of the European butterflies

With the aim of supporting ecological analyses in butterflies, the third most species-rich superfamily of Lepidoptera, this paper presents the first time-calibrated phylogeny of all 496 extant butterfly species in Europe, including 18 very localized endemics for which no public DNA sequences had been available previously. It is based on a concatenated alignment of the mitochondrial gene COI and up to 11 nuclear gene fragments, using Bayesian inference of phylogeny. To avoid analytical biases that could result from our region-focus sampling, our European tree was grafted upon a global genuslevel backbone butterfly phylogeny for analyses. In addition to a consensus tree, we provide the posterior distribution of trees and the fully-concatenated alignment for future analyses.

The data were mainly collated from published sources and downloaded from NCBI GenBank (Table  S1). One mitochondrial gene, cytochrome c oxidase subunit I (COI, 1464 bp), was available for all species in the data matrix, in particular the 5' half of the gene (658 bp, also known as the DNA barcode). Eleven nuclear genes were included when available: elongation factor-1α (EF-1α, 1240 bp), ribosomal protein S5 (RpS5, 617 bp), ribosomal protein S2 (RpS2, 411 bp), carbamoylphosphate synthase domain protein (CAD, 850 bp), cytosolic malate dehydrogenase (MDH, 733 bp), glyceraldehyde-3-phosphate dehydrogenase (GAPDH, 691 bp), isocitrate dehydrogenase (IDH, 711 bp), wingless (412 bp), Arginine kinase (ArgK, 596 bp) and Dopa Decarboxylase (DDC, 373 bp) and histone 3 (H3, 329 bp). H3 has been sequenced almost exclusively from the family Lycaenidae, while the other gene regions have been sampled widely also in the other butterfly families. For each gene, the longest available sequence was used. However, in the case of several available sequences of similar length, those of European origin were preferentially used. Sequences were aligned manually to maintain protein reading frame, and were curated and managed using VoSeq 77 .
In several cases, new sequences were generated for this study. For these specimens, protocols followed Wahlberg and Wheat 78 or Wiemers and Fiedler 66 . These include several species that did not have any available published sequences, many of which are island endemics ( Table 1). The new sequences have been submitted to GenBank (accessions xxxx-xxxx).

Table 1. Newly sequenced species for which no published sequences had previously been available Taxon
Origin Coenonympha orientalis Greece A biogeographically restricted tree of a given taxon is inherently very asymmetrically sampled. To avoid potentially strong biases when estimating topology and divergence times we chose to build upon the recent genus-level tree of butterflies 47 , which provides a well-supported time-calibrated backbone and corresponds well with a recent phylogenomic analysis of Lepidoptera 80 . This backbone tree contains 994 taxa, each taxon representing a genus across all Papilionoidea. The tree was timecalibrated using a set of 14 fossil calibration points, which provided minimum ages and 10 calibration points based on ages of host plant clades taken from the literature, which provided maximum ages. Importantly, Chazot, et al. 47 tested the robustness of their results to a wide range of alternative assumptions made in the time-calibration analysis, and showed that the estimated times of divergences were robust.

Analysis overview
In order to produce our time-calibrated tree of European butterflies, we identified the position of the European lineages and designed a grafting procedure accordingly. We split the European butterflies that needed to be added to the tree into 12 subclades. For each of these subclades we combined the DNA sequences of the taxa already included in the backbone to the DNA sequences of the European taxa to assemble an aligned molecular matrix. After identifying the best partitioning scheme, we performed a tree reconstruction without time-calibration (only estimating relative branch lengths). The subclade trees were then rescaled using the ages estimated in the backbone and were subsequently grafted. This procedure was repeated using 1000 trees from BEAST posterior distributions of the backbone and subclade trees in order to obtain a posterior distribution of grafted trees. The details of these procedures are described below.

Backbone and subclades
The time-calibrated backbone tree provided by Chazot, et al. 47 contained about 55% of all butterfly genera, including most genera occurring in Europe. A fixed topology was obtained using RAxML 81 and node ages where estimated with BEAST v.1.8.3. 82 . We used this fixed topology from Chazot, et al. 47 to identify at which nodes European clades should be grafted. We partitioned the analysis into 12 subclades. For each subclade, the DNA sequences of all taxa already included in the global backbone (including also non-European taxa) were combined with the DNA sequences of all the new European taxa that were added. In addition to the focal taxa, we added between two and four outgroups.
The subclades, sorted by families, were defined as follows: Nymphalidae -European Nymphalidae were divided into seven subclades. (i) A subclade for the Apaturinae. (ii) In order to add Danaus chrysippus we generated a tree of Danainae. (iii) We combined the sister clades Heliconiinae and Limenitidinae into a single subclade. (iv) Nymphalinae was treated as a single subclade. ( 83 ) was already included in the backbone tree from Chazot et al. 47 . Hence, we used the position of Charaxes castor for Charaxes jasius.

Partitioning the dataset
For each subclade we ran Partition Finder 2 84 in order to partition the data and choose substitution models. The dataset was initially partitioned into genes and codon positions. Branch lengths were set to linked and the comparison between partitioning strategies was made using the greedy algorithm and BIC score 85 .

Phylogenetic reconstruction
For each subclade, the dataset was imported in BEAUTi v.1.8.3 86 and partitioned according to the partitioning strategy identified by Partition Finder 2. We enforced the monophyly of the clade to be grafted (i.e., excluding the outgroups). All other relationships were estimated by BEAST v.1.8.3. 82 . We used an uncorrelated relaxed clock with lognormal distribution. By default, we started by setting one molecular clock per partition. If convergence or good mixing could not be obtained after running BEAST we reduced the number of molecular clocks (see details for each dataset further below). We did not add any time-calibration and therefore only estimated the relative timing of divergence. We performed at least two independent runs with BEAST for each subclade. We checked for convergence and mixing of the MCMC using Tracer v.1.6.0 87 and in the case of full convergence of the runs, the posterior distribution of trees from different runs were combined after removing the burn-in fraction.

Grafting procedure
Subclades were grafted on the backbone as follows. One backbone was sampled from the posterior distribution of time-calibrated trees from Chazot, et al. 47 . For each subclade, one subclade tree was sampled from the posterior distribution of trees, the outgroups removed and the tree was rescaled based on the crown age of the subclade extracted from the backbone tree. Finally, the rescaled subclade tree was grafted on the backbone after removing all lineages belonging to this subclade in the backbone (i.e. only keeping the stem branch). We repeated this procedure for 1000 backbone trees and 1000 subclade trees and thus we obtained a posterior distribution of 1000 grafted trees. The topology of the backbone was fixed (see 47 ) but the topologies of the subclades were free. Hence the posterior distribution of grafted trees includes a posterior distribution of topologies and node ages.
We describe below the details of the phylogenetic tree reconstruction for each subclade.

1-Papilionidae
Dataset -The dataset for the Papilionidae consisted of 36 taxa to which three outgroups were added: Macrosoma tipulata (Hedylidae), Achlyodes busiris (Hesperiidae), Pieris rapae (Pieridae). We concatenated 11 gene fragments (COI, CAD, EF-1α, GAPDH, ArgK, IDH, MDH, RpS2, RpS5, DDC, wingless). BEAST analysis -In order to improve the quality of our runs we replaced the default priors for rates of substitutions by uniform prior ranging between 0 and 10 for the following cases: subset5.at, subset5.cg, subset7.cg, subset7.gt, subset12.cg, subset12.gt. We used one molecular clock per subset identified by Partition Finder and obtained good mixing and convergence. We used a Birth-Death tree prior. We performed three runs of 40 million generations, sampling trees and parameters every 4000 generations.
Grafting -For grafting, the outgroups were removed, as well as Baronia brevicornis, the first Papilionidae to diverge and endemic to Mexico, i.e. we grafted at the most recent common ancestor (mrca) of all Papilionidae but Baronia brevicornis. BEAST analysis -Preliminary analyses showed problems with the subset 3 (ArgKin_pos3) and was therefore removed from the analyses. In order to improve the quality of our runs we replaced the default priors for rates of substitutions by uniform priors ranging between 0 and 10 for the following cases: subset17.cg. The substitution model for the subset 14 was also changed into HKY+I after preliminary analyses. We used one molecular clock per subset identified by Partition Finder and obtained good mixing and convergence. We used a Birth-Death tree prior. We performed two runs of 150 million generations, sampling trees and parameters every 15000 generations.

2-Hesperiidae: Hesperiinae
Grafting -For grafting, the outgroups were removed and the subclade grafted at the mrca of Hesperiinae.

3-Hesperiidae: Pyrginae
Dataset -The dataset for the Pyginae consisted of 77 taxa to which three outgroups were added: Typhedanus ampyx (Hesperiidae), Pyrrhopyge zenodorus (Hesperiidae) and Hasora khoda (Hesperiidae). We concatenated 10 gene fragments (COI, CAD, EF-1α , GAPDH, ArgK, IDH, MDH, RpS2, RpS5, wingless). BEAST analysis -In order to improve the quality of our runs we replaced the default priors for rates of substitutions by uniform prior ranging between 0 and 10 for the following cases: subset7.ac, subset7.gt, subset14.cg, subset3.cg. Preliminary analyses showed problems when using a separate molecular clock for each subset identified by Partition Finder. We restricted the analysis to one molecular clock. We used a Birth-Death tree prior. We performed two runs of 100 million generations, sampling trees and parameters every 10000 generations.
Grafting -For grafting, the outgroups were removed and the subclade grafted at the mrca of Pyrginae. BEAST analysis -In order to improve the quality of our runs we replaced the default priors for rates of substitutions by uniform prior ranging between 0 and 10 for the following case: subset7.cg. The substitution model for the subset 7 was also changed into GTR+G after preliminary analyses. We used one molecular clock per subset identified by Partition Finder and obtained good mixing and convergence. We used a Birth-Death tree prior. We performed two runs of 100 million generations, sampling trees and parameters every 10000 generations.

4-Pieridae
Grafting -For grafting, the outgroups were removed and the subclade grafted at the mrca of Pieridae. BEAST analysis -In order to improve the quality of our runs we replaced the default priors for rates of substitutions by uniform prior ranging between 0 and 10 for the following cases: subset3.cg, subset6.ag, subset6.at, subset11.gt_subst7.cg. We used one molecular clock per subset identified by Partition Finder and obtained good mixing and convergence. We used a Birth-Death tree prior. We performed two runs of 150 million generations, sampling trees and parameters every 15000 generations.

5-Lycaenidae
Grafting -For grafting, the outgroups were removed and the subclade grafted at the mrca of Lycaenidae.

6-Nymphalidae: Danainae
Dataset -The dataset for the Danainae consisted of 7 taxa to which two outgroups were added: Euploea camaralzeman (Nymphalidae) and Lycorea halia (Nymphalidae). We concatenated 9 gene fragments (COI, CAD, EF-1α, GAPDH, IDH, MDH, RpS2, RpS5, wingless). BEAST analysis -We used one molecular clock per subset identified by Partition Finder and obtained good mixing and convergence. We used a Birth-Death tree prior. We performed two runs of 20 million generations, sampling trees and parameters every 2000 generations.
Grafting -For grafting, the outgroups were removed and the subclade grafted at the mrca of Danainae.

7-Nymphalidae: Apaturinae
Dataset -The dataset for the Apaturinae consisted of 9 taxa to which two outgroups were added: Timelaea albescens (Nymphalidae) and Biblis hyperia (Nymphalidae). We concatenated 10 gene fragments (COI, CAD, EF-1α , GAPDH, ArgK, IDH, MDH, RpS2, RpS5, wingless). BEAST analysis -We used one molecular clock per subset identified by Partition Finder and obtained good mixing and convergence. We used a Birth-Death tree prior. We performed two runs of 20 million generations, sampling trees and parameters every 2000 generations.
Grafting -For grafting, the outgroups were removed and the subclade grafted at the mrca of Danainae.

8-Nymphalidae: Heliconiinae + Limenitidinae
Dataset -The dataset combined the sister clades Heliconiinae and Limenitidinae and consisted of 92 taxa to which three outgroups were added: Amnosia decora (Nymphalidae), Apatura iris (Nymphalidae) and Libythea celtis (Nymphalidae). We concatenated 11 gene fragments (COI, CAD, EF-1α , GAPDH, ArgK, IDH, MDH, RpS2, RpS5, DDC, wingless). BEAST analysis -Preliminary analyses showed problems with the subset 14 (RpS2_pos2) which was therefore removed from the analyses. In order to improve the quality of our runs we replaced the default priors for rates of substitutions by uniform priors ranging between 0 and 10 for the following case: subset7.cg. We used one molecular clock per subset identified by Partition Finder and obtained good mixing and convergence. We used a Birth-Death tree prior. We performed two runs of 100 million generations, sampling trees and parameters every 10000 generations.
Grafting -For grafting, the outgroups were removed and the subclade grafted at the split between Limenitidinae and Heliconiinae. BEAST analysis -In order to improve the quality of our runs we replaced the default priors for rates of substitutions by uniform priors ranging between 0 and 10 for the following case: subset5.cg. Preliminary analyses revealed problems when using one molecular clock per subset identified by Partition Finder. We restricted the analysis to one molecular clock for the mitochondrial gene fragments and one molecular clock for the nuclear gene fragments. We used a Birth-Death tree prior. We performed two runs of 100 million generations, sampling trees and parameters every 10000 generations.

9-Nymphalidae: Nymphalinae
Grafting -For grafting, the outgroups were removed and the subclade grafted at the mrca of Nymphalinae.

10-Nymphalidae: Satyrinae 1
Dataset - BEAST analysis -We used one molecular clock per subset identified by Partition Finder and obtained good mixing and convergence. We used a Birth-Death tree prior. We performed two runs of 20 million generations, sampling trees and parameters every 2000 generations.
Grafting -For grafting, the outgroups were removed and the subclade grafted at the crown of the clade after removing the outgroups. BEAST analysis -In order to improve the quality of our runs we replaced the default priors for rates of substitutions by uniform prior ranging between 0 and 10 for the following cases: subset5.ac, subset5.ag, subset5.at, subset5.cg, subset5.gt. We used one molecular clock per subset identified by Partition Finder and obtained good mixing and convergence. We used a Birth-Death tree prior. We performed two runs of 100 million generations, sampling trees and parameters every 10000 generations.

11-Nymphalidae: Satyrinae 2
Grafting -For grafting, the outgroups were removed and the subclade grafted at the crown of the clade after removing the outgroups.

12-Nymphalidae: Satyrinae 3
Dataset -The third Satyrinae dataset consisted of 15 taxa all belonging to the genus Coenonympha, to which two outgroups were added: Sinonympha amoena (Nymphalidae) and Oressinoma sorata (Nymphalidae). We concatenated 9 gene fragments (COI, CAD, EF-1α , GAPDH, IDH, MDH, RpS2, RpS5, wingless). BEAST analysis -We used one molecular clock per subset identified by Partition Finder and obtained good mixing and convergence. We used a Birth-Death tree prior. We performed two runs of 20 million generations, sampling trees and parameters every 2000 generations.
Grafting -For grafting, the outgroups were removed and the subclade grafted at the crown of Coenonympha.

Technical Validation
Species identities of the chosen sequences for the dataset were validated by blasting the DNA barcode sequence against the Barcode Of Life Database, which has a good representation of European butterfly species due to a number of barcoding projects implemented in different countries. In almost all cases, the sequences came from the same voucher specimen itself. In 88 cases (Supporting Information), the sequences used were from different individuals. In these cases special care was taken to use sequences from reliable sources, preferably those with voucher photographs.
We based our time-calibration from a recent re-evaluation of the timing of divergence of higher-level Papilionoidea. We used the topology inferred by Chazot et al. 47 as a backbone in our grafting procedure. This topology was fixed in Chazot et al. 47 , hence only node ages were estimated. Within each subclade we grafted however, we let BEAST estimate the topology in addition to node height. Several sections of the European butterfly tree remain poorly supported. This most likely arises from the lack of molecular information as well as recent and rapid diversification events within Polyommatus, Hipparchia, or Pseudochazara for example. We show here a synthetic tree summarizing the posterior distribution of topologies and node ages but the posterior distribution of grafted trees can be found in Supporting Information, providing a distribution of alternative topologies and node ages estimated by BEAST. We strongly advise any researcher using our phylogenetic framework to repeat the analyses on at least 100 trees randomly sampled from this posterior distribution in order to account for topology and node age uncertainty. This tree can also help identify the sections of the tree lacking molecular information and therefore points at the sections that should be targeted in the future when generating new molecular data.

Usage Notes
We have generated a robust phylogenetic hypothesis for all European species of butterflies with associated times of divergence (Fig. 1). Our purpose is to provide a complete phylogenetic framework for use by the ecological and evolutionary communities. The demand for such a phylogenetic information is high at the moment and various proxies have been used that are not ideal, starting already in 2005 88 . We provide a posterior distribution of topologies and node ages, in order for researchers to be able to take phylogenetic and node age uncertainty into account if they so wish. The tree files are provided in standard Newick format as output from BEAST. Future studies will not necessarily be as comprehensive as the tree we provide. In such cases we recommend using tools such as the ape package 89 in R 90 to remove tips from the tree that are not relevant to a given study.   or ie nt al is C ar ch ar od us flo cc ife ra C ar ch ar od us la va th er ae C ar ch ar od us ba et ic us C ar ch ar od us st au de ri E ry nn is ta ge s E ry n n is m a rl o yi P yr g u s si d a e P yr g u s ca rt h a m i P yr g u s m a lv o id e s P yr g u s m a lv a e P yr g u s a n d ro m e d a e P y rg u s c a c a lia e P y rg u s c e n ta u re a e P y rg u s s e rr a tu la e P y rg u s a rm o ri c a n u s P y rg u s c in a ra e P y rg u s c ir s ii P y rg u s c a rl in a e P y rg u s o n o p o rd i P y rg u s w a rr e n e n s is P y rg u s fo u lq u ie ri P y rg    K re ta n ia e u ry p ilu s K re ta n ia h e sp e ri ca K re ta n ia tr a p p i K re ta n ia se p h ir u s K re ta n ia p yl a o n C ya n iri s se m ia rg u s A gr ia de s op til et e K re ta ni a ps yl or ita A gr ia de s or bi tu lu s A gr ia de s da rd an us A gr ia de s py re na ic us A gr ia de s zu lli ch i A gr ia de s aq ui lo A gr ia de s gl an do n G la br oc ul us cy an e Ar ic ia m or ro ne ns is B ol or ia ch ar ic le a B ol or ia an ga re ns is A pa tu ra iri s A pa tu ra m et is A pa tu ra ili a A ra sc hn ia le va na V an es sa ca rd ui V an es sa vi rg in ie ns is V a n e ss a vu lc a n ia V a n e ss a a ta la n ta A g la is io A g la is u rt ic a e A g la is ic h n u sa P o ly g o n ia e g e a