Research Article
Corresponding author: Thomas M. Donegan (thomasdonegan@yahoo.co.uk)
Academic editor: Yasen Mutafchiev
© 2018 Thomas M. Donegan.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Donegan TM (2018) What is a species? A new universal method to measure differentiation and assess the taxonomic rank of allopatric populations, using continuous variables. ZooKeys 757: 1-67. https://doi.org/10.3897/zookeys.757.10965
Existing models for assigning species, subspecies, or no taxonomic rank to populations which are geographically separated from one another were analyzed. This was done by subjecting over 3,000 pairwise comparisons of vocal or biometric data based on birds to a variety of statistical tests that have been proposed as measures of differentiation. One current model which aims to test diagnosability (
diagnosis, species limits, species scoring, statistics, subspecies limits, taxonomy
This paper aims to help address the “allopatric problem” when determining species rank in taxonomic science. Humans have categorized populations into named groups since the dawn of known civilization (
Sympatric species, which occur together in the same place during the breeding season but do not successfully interbreed to any material extent, are demonstrably real. With enough data and persistence, it is usually possible to determine whether or not sympatric populations interbreed regularly and whether they produce fertile offspring (
A traditionally more difficult problem, and the focus of this paper, is that of “allopatric” (
The subjectivity involved in comparing allopatric species and the rise of molecular science have doubtless encouraged the development of a multitude of different species criteria or concepts. As noted by
Whilst statistical and mathematical techniques to analyze molecular data have been a rich field for methodological advancement, the same cannot be said for the study of real world variables. Supportable statistical schemes for assessing between-population differentiation are noteworthy principally by their absence. Those schemes which have been proposed are either widely criticized, only applicable to particular taxonomic groups or vague.
This paper will concentrate on the traditional currency of taxonomy: continuous variables such as those based on measurement of specimens, whether in the museum or in the field. Many researchers and advanced amateurs do not have a molecular laboratory available and few genera have been exhaustively sampled in a way that includes multiple individuals at population level. In contrast, vocal and biometric data are easy to collate, accessible to many and cheaper to analyze. A wide variety of other ‘real world’ organism characters are capable of measurement as continuous variables. For vocalizations, lengths or acoustic frequencies of notes can be measured using sonograms, for example. Coloration can be measured using spectrometry. Non-continuous or discrete variables, e.g., presence or absence of a particular character and molecular markers, can be analyzed best using cladistics and other phylogenetic tools and are not covered here in detail.
When species rank is assessed across a taxonomic group as a whole, consistency is a virtue. Under a biological species concept-based approach, attaining such consistency will require a determination of which allopatric populations have differentiated to the same extent as related sympatrics and which have not. Those that have so differentiated are species; those that have not are, at most, subspecies. Unfortunately, consistency is not attained in current classifications, especially as regards more diverse tropical faunas. This is generally due to discrepancies in available data, the regularity of different genera being revised and differences in approaches by regional committees or textbook authorities to studies using different taxonomic methods (e.g., molecular vs. morphological) (
Neither of these two splits is problematic from a phylogenetic species concept or “enthusiastic splitter” perspective in isolation; and further studies could give stronger support to these treatments. However, based on my experience of working with birds in the Neotropics, the benchmark applied to these situations would result in the specific recognition of probably several thousands of current subspecies or unnamed taxa occurring in that region.
The
In light of the difficulties with scoring “systems” and other developments, Halley et al. (2017) have argued for a return to monophyly and essentially
Over the last 20 years, I have studied the taxonomy of birds in Colombia using biometric data (from mist-netting and museums) and sound recordings. This work produced a large amount of data relevant to studying differentiation. It has become apparent to me that steps might be taken towards resolving some of these seemingly intractable fundamental disagreements by developing an objective and agreeable basis for assessing the rank of allopatric populations better, more consistently and more rationally: one grounded in scientific method, statistics and the analysis of large data sets, and based on traditional biological species concept thinking. Ultimately, the aim of this study is to attempt definitively to provide a robust, objective and universal method to address the centuries-old question (unresolved since
In the present study, I took a large data set that had been developed for purposes of various particular taxonomic studies of birds (citations below) and used this to road-test proposed and possible alternative statistical tests for measuring differentiation or diagnosis, with the intention of studying outcomes of tests in order to inform recommendations.
I compiled vocal and biometric data from multiple studies, including of representatives of the three major assemblages of birds: non-passerines (three families), suboscine passerines (four families), and oscine passerines (two families) (citations in Tables
Order: Family | Genus | No. taxa / populations | No. spp. before review | No. spp. after review | No. continuous vocal variables | No. Pairwise tests omitted | Pairwise comparisons | Sample sizes (mean ± s.d.) (min–max) | Reference |
---|---|---|---|---|---|---|---|---|---|
Columbiformes: Columbidae | Geotrygon | 2 | 1 | 2 | 2 | 0 | 2 | 22.0 ± 4.6 (18–26) | |
Apodiformes: Trochilidae | Adelomyia | 2 | 1 | 1 | 10 | 0 | 10 | 15.7 ± 1.5 (14–18) | Donegan and Avendaño (2015) |
Piciformes: Bucconidae | Hypnelus | 2 | 1 | 2 | 5 | 0 | 5 | 5.5 ± 1.6 (4–7) | |
Passeriformes: Thamnophilidae | Myrmeciza | 8 | 4 | 5 | 26 | 114 | 614 | 42.7 ± 49.2 (3–179) | |
Passeriformes: Grallariidae | Grallaricula | 10 | 1 | 2 | 14 | 224 | 406 | 18.2 ± 12.9 (3–63) | |
Passeriformes: Rhinocryptidae | Scytalopus 1 | 8 | 3 | 3 | 12 | 0 | 336 | 23.0 ± 13.8 (4–57) | |
Passeriformes: Rhinocryptidae | Scytalopus 2 | 2 | 1 | 1 | 7 | 0 | 7 | 14.9 ± 2.2 (12–17) | |
Passeriformes: Tyrannidae | Sirystes | 4 | 1 | 4 | 18 | 64 | 44 | 39.1 ± 41.5 (3–146) | |
Passeriformes: Parulidae | Basileuterus | 13 | 3 | 6 | 19 | 558 | 924 | 25.5 ± 19.0 (2–78) | |
TOTALS | | 51 | 16 | 26 | 113 | 960 | 2348 | 29.0 ± 32.5 (2–179) | |
Order: Family | Genus | No. taxa / populations | No. spp. before review | No. spp. after review | No. continuous biometric variables | No. Pairwise tests omitted | Pairwise comparisons | Sample sizes (mean ± s.d.) (min.–max.) | Reference |
---|---|---|---|---|---|---|---|---|---|
Apodiformes: Trochilidae | Adelomyia | 2 | 1 | 1 | 4 | 0 | 4 | 9.6 ± 3.2 (6–13) | Donegan and Avendaño (2015) |
Passeriformes: Thamnophilidae | Myrmeciza | 7 | 4 | 5 | 5 | 18 | 87 | 21.1 ± 19.5 (2–65) | |
Passeriformes: Grallariidae | Grallaricula | 11 | 1 | 3 | 6 | 49 | 281 | 12.4 ± 9.3 (3–37) | |
Passeriformes: Rhinocryptidae | Scytalopus 1 | 8 | 4 | 3 | 5 | 31 | 109 | 8.7 ± 7.4 (2–24) | |
Passeriformes: Rhinocryptidae | Scytalopus 2 | 2 | 1 | 1 | 5 | 0 | 5 | 4.9 ± 2.3 (3–9) | |
Passeriformes: Thraupidae | Anisognathus | 10 | 2 | 2 | 5 | 39 | 186 | 25.1 ± 34.7 (4–214) | |
Passeriformes: Parulidae | Basileuterus | 9 | 3 | 5 | 5 | 30 | 150 | 15.4 ± 10.9 (2–42) | |
TOTALS | | 49 | 16 | 20 | 35 | 167 | 822 | 15.5 ± 19.1 (2–214) | |
Vocal variables always included measures of maximum acoustic frequency, length, number of notes and speed. In some studies, change in pace, minimum frequencies, frequencies of particular notes, note bandwidth, changes in acoustic frequency and position of peaks or troughs of frequency within a vocalization, or any of the same measures for particular parts of vocalizations, were also measured. In each study, the variables under study were designed so as to document as fully as possible observed subjective differences between populations. Biometric variables were in all cases wing, tail, tarsus and bill length and mass, except for Trochilidae (no tarsus length) and Grallariidae (where bill width was additionally measured). Note shape and other subjective vocal characters were also studied, as were plumages. However, information on non-continuous variables was discarded for purposes of this present study.
Pairwise comparisons were undertaken on a matrix basis of each population against each other population. Some pairwise tests were omitted due to lack of data for a particular population, i.e., where there were n < 2 recordings of a particular type of vocalization (which could represent either a sampling gap or genuine lack of delivery of such vocalization by the population in question); or n < 2 specimens of the population available in museums that were studied. In such cases, where n < 2, standard deviations could not be calculated and t-tests could not be run, so the comparison was excluded to ensure full comparability between all tests applied.
The data set was not designed for the study of statistical tests used in taxonomy, since this study had not been conceived at the time of data collection. The choice of taxonomic groups was not based only on studies which include among their components sympatric pairs (cf.
Several statistical tests were applied multiple times on a pairwise basis using a Microsoft Excel spreadsheet devised by the author for rapid assessment of multiple pairwise statistical tests across multiple populations. This spreadsheet is being published on the author’s researchgate.net page, and should assist authors in better and more swiftly analyzing diagnosability in future studies. Calculations, described below, were undertaken to measure inter-population differences in the context of various species and subspecies concepts.
First, the entire data set was subjected to various proposed tests of species or subspecies rank. In the formulae used below, x̄1 and s1 are the sample mean and standard deviation of Population 1; x̄2 and s2 refer to the same parameters in Population 2; and the t value uses a one-sided confidence interval at the percentage specified for the relevant population and variable, with t1 referring to Population 1 and t2 referring to Population 2.
LEVEL 1: Welch’s t-test at p<0.05/nv, i.e., applying a Bonferroni correction. An unequal variance (Welch’s) t-test was used. This is preferable to other t-tests in that it makes no assumptions about whether the SD of one population differs from that of the other. For vocal data potentially based on ratios, such as song speed, a two-sample Kolmogorov-Smirnov test can be applied instead to account for the possibility of a non-normal distribution. However, in order to standardize the study outputs, only Welch’s t-test was applied here.
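As an illustration of the Level 1 calculation, the following minimal sketch computes the Welch t statistic and the Welch–Satterthwaite degrees of freedom from summary statistics. The function name `welch_t` and the example values are hypothetical and are not taken from the study's spreadsheet; converting the statistic into a p-value additionally requires the t-distribution, which is left to a statistics package.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t statistic, Welch-Satterthwaite d.f.) from summary statistics.

    Hypothetical helper for illustration; it does not compute a p-value.
    """
    v1 = sd1 ** 2 / n1  # squared standard error of mean 1
    v2 = sd2 ** 2 / n2  # squared standard error of mean 2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Illustrative values only (not data from the study).
t_stat, dof = welch_t(10.0, 2.0, 10, 8.0, 2.0, 10)
print(round(t_stat, 3), round(dof, 1))
```

With equal standard deviations and equal sample sizes, as in this example, the degrees of freedom reduce to n1 + n2 – 2; they fall below that value as the variances or sample sizes diverge.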
When applying tests of statistical significance across multiple variables for the same pair, there is a risk of so-called “type 1” errors (false positives) occurring. If testing at p < 0.05 for 100 independent variables of the same two populations, about 5 variables would be expected to meet the requirements of the relevant test by chance alone, even if the populations did not truly differ. Various methods were tested which purport to reduce the risk of “type 1” errors. First, Bonferroni corrections were applied based on each of: (i) the total number of variables studied for the pair as a whole; (ii) separately for two “families” of vocal versus biometric variables; and (iii) separately for each different kind of vocalization, where applicable. Applying a Bonferroni correction to a study involving five variables, the corrected significance threshold is p < (0.05/5) = 0.01. Dunn-Šidák is a widely used but slightly less conservative alternative to Bonferroni and was also applied to all three of the same situations, in order to examine the impacts and outcomes of using alternative corrections.
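The two corrections can be sketched as follows, assuming the standard definitions: Bonferroni divides the significance level by the number of variables, and Dunn-Šidák uses 1 – (1 – α)^(1/m). The function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_variables):
    """Per-variable significance threshold under a Bonferroni correction."""
    return alpha / n_variables


def dunn_sidak_threshold(alpha, n_variables):
    """Per-variable significance threshold under a Dunn-Sidak correction."""
    return 1 - (1 - alpha) ** (1 / n_variables)


# Thresholds for studies of 5 and 10 variables at alpha = 0.05.
for m in (5, 10):
    print(m, round(bonferroni_threshold(0.05, m), 5),
          round(dunn_sidak_threshold(0.05, m), 5))
```

The Dunn-Šidák threshold is always marginally higher than the Bonferroni threshold for the same number of variables, which is why it is the slightly less conservative of the two.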
LEVEL 2: a ‘50%/95%’ test, following one of
|(x̄1–x̄2)| > (s1(t1 @ 97.5%) + s2(t2 @ 97.5%))/2
LEVEL 3: The traditional ‘75% / 99+%’ test for subspecies (
|(x̄1–x̄2)| > s1(t1 @ 99%) + s2(t2 @ 75%) and
|(x̄2–x̄1)| > s2(t2 @ 99%) + s1(t1 @ 75%)
LEVEL 4: diagnosability based on non-overlap of recorded values (the first part of
LEVEL 5: ‘Full’ diagnosability (where sample means are four average SDs apart at the 95% level, controlling for sample size) the second part of
|(x̄1–x̄2)| > s1(t1 @ 97.5%) + s2(t2 @ 97.5%)
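The inequalities for Levels 2, 3 and 5, together with the non-overlap check of Level 4, can be sketched as below. This is an illustrative sketch rather than the author's spreadsheet: the function names are hypothetical, and the one-sided t values are passed in as arguments (taken here from standard tables for 9 degrees of freedom) rather than computed.

```python
def level2(mean1, s1, t1_975, mean2, s2, t2_975):
    """'50%/95%' test: means further apart than half the Level 5 distance."""
    return abs(mean1 - mean2) > (s1 * t1_975 + s2 * t2_975) / 2


def level3(mean1, s1, t1_99, t1_75, mean2, s2, t2_99, t2_75):
    """'75%/99+%' test: both directional inequalities must hold."""
    diff = abs(mean1 - mean2)
    return diff > s1 * t1_99 + s2 * t2_75 and diff > s2 * t2_99 + s1 * t1_75


def level4(values1, values2):
    """Non-overlap of recorded values between the two samples."""
    return max(values1) < min(values2) or max(values2) < min(values1)


def level5(mean1, s1, t1_975, mean2, s2, t2_975):
    """'Full' diagnosability at the 95% level."""
    return abs(mean1 - mean2) > s1 * t1_975 + s2 * t2_975


# Illustrative case: n = 10 per population (9 d.f.), both SDs equal to 1.
# Table values: t(97.5%) = 2.262, t(99%) = 2.821, t(75%) = 0.703.
print(level2(10, 1, 2.262, 14, 1, 2.262))                # True
print(level3(10, 1, 2.821, 0.703, 14, 1, 2.821, 0.703))  # True
print(level5(10, 1, 2.262, 14, 1, 2.262))                # False
```

In this example a difference of 4 between the means passes Levels 2 and 3 but falls short of the 4.524 required for Level 5 at these sample sizes.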
Figure
These five tests were applied to 2348 population/variable combinations for voice and 822 population/variable combinations for biometrics. A population/variable combination is one comparison between two populations for a single variable. For example, in the Grallaricula study, a comparison of the main East Andes population against the Central Andes population for song length would constitute a single population/variable combination. With five diagnosability tests (Levels 1–5 above) conducted per population/variable combination, this means that a total of 15,610 pairwise statistical tests were run in this part of the study. (A further four tests conducted in later sections bring that total to over 28,000 separate statistical tests in this study.) Each population/variable combination was placed in a category summarizing which diagnosability tests it satisfied. The total number of population/variable combinations meeting particular tests was then summed for the biometric and vocal data sets separately, and then similar kinds of outcomes were grouped using the framework set out in Table
Recorded test satisfaction outcomes and the mapping of such outcomes to diagnosis groupings.
Outcome | Meaning | Grouping |
---|---|---|
0 | None of the tests are met. | No diagnosis |
1 | Statistically significant difference between means but no tests of diagnosis are met and data overlap. | Statistical significance |
14 | Statistically significant difference between means and data show no overlap but no tests of diagnosis are met. | |
12 | Statistically significant difference between means but diagnosis only up to 50% and data overlap. | 50% differentiation |
124 | Statistically significant difference between means, diagnosis up to 50% and data show no overlap. | |
123 | Statistically significant difference between means and diagnosis at both 50% and 75% levels but data overlap. | 75% differentiation |
1234 | Statistically significant difference between means and diagnosis at both 50% and 75% levels and data do not overlap | |
12345 | Statistically significant difference between means and diagnosis at 50%, 75% and 95% levels and data do not overlap | 95% differentiation |
1235 | Statistically significant difference between means and diagnosis at 50%, 75% and 95% levels but data overlap. | |
1245 | Statistically significant difference between means, diagnosis at 50% and 95% levels and data do not overlap, but 75% test is not met. | |
125 | Statistically significant difference between means and diagnosis at 50% and 95% levels, but 75% test is not met and data overlap. | |
2 | No statistically significant difference between means, but 50% diagnosis test is met. | Possible false results |
2345 | No statistically significant difference between means, but 50%, 75% and 95% diagnosis tests are met and data do not overlap. | |
24 | No statistically significant difference between means, but 50% diagnosis test is met and data do not overlap. | |
245 | No statistically significant difference between means, but 50% and 95% diagnosis tests are met and data do not overlap. | |
25 | No statistically significant difference between means and data overlap, but 50% and 95% diagnosis tests are met. | |
4 | Data do not overlap but no other statistical tests are met. | |
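The mapping in the table above can be expressed as a simple rule, sketched here under the assumption that an outcome is described by the set of levels passed. The function name `grouping` is hypothetical, and combinations not listed in the table are handled by the same rule purely as an assumption.

```python
def grouping(levels_passed):
    """Map the set of levels passed (1-5) to a diagnosis grouping."""
    passed = set(levels_passed)
    if not passed:
        return "No diagnosis"
    if 1 not in passed:
        # A diagnosis test is met without statistical significance.
        return "Possible false results"
    if 5 in passed:
        return "95% differentiation"
    if 3 in passed:
        return "75% differentiation"
    if 2 in passed:
        return "50% differentiation"
    return "Statistical significance"


for outcome in ([], [1, 4], [1, 2, 4], [1, 2, 3], [1, 2, 4, 5], [2, 4]):
    print(outcome, "->", grouping(outcome))
```

Level 4 (non-overlap of recorded values) does not change the grouping once Level 1 is passed, which matches the pairing of outcomes such as 12 and 124 in the table.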
Certain minor methodological changes were undertaken here as compared to some of the underlying studies on which this paper is based: (i) where a single population had only one data point, it was excluded here from analyses, since only “Level 4” tests can be applied where degrees of freedom are 0 and this paper sought to compare outcomes for all comparisons; (ii) for the number of notes in the call for Grallaricula, several populations had uniformly one note in their calls, with standard deviation of zero, producing “divide by zero” errors for several tests, and so pairwise comparisons between such populations for that variable were excluded; (iii) some underlying studies presented biometric data for either males or females or all specimens or both; here, one or other of the “male” or “all specimens” data sets was selected, depending on whether material sexual differences in biometrics were observed and on sample size (generally, for studies with larger samples, using male only data is preferable, whilst in those studies with fewer specimens available, a combined data set was used here); (iv) for the main Scytalopus data set (
The second part of this study aimed to measure effect sizes four different ways, in order to inform appropriate benchmarks for measuring or scoring differentiation. The impacts of using pooled standard deviations (as per
Effect sizes were first calculated using the following formula:
|(x̄1–x̄2)| /[(s1+s2)/2]
This uses an arithmetic mean of the standard deviations of the two populations to measure the difference between the means of the same two populations.
A control was applied using t-distribution values, following
|(x̄1–x̄2)| / ¼[s1(t1 @ 97.5%) + s2(t2 @ 97.5%)]
This measures the distance between the means of two populations in terms of numbers of SDs, but controlling for sample size using a t distribution. The factor of ¼ is included to maintain parity with bare unpooled effect sizes and other measures studied in this section, i.e., where mean differences are measured with the equivalent of a single standard deviation for their denominator. For a normal distribution, as n tends to infinity, the t value at 97.5% tends to c.2 (more precisely 1.96), so that about 95% of values fall within roughly 2 standard deviations of the mean. As a result, s1(t1 @ 97.5%) + s2(t2 @ 97.5%) is equivalent to 2s1 + 2s2, or 4s, calling for division by 4 to retain parity with 1s.
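The bare and controlled unpooled measures can be compared with a minimal sketch. The function names are hypothetical, the means and SDs are invented for illustration, and the t values are supplied from standard tables (2.306 for n = 9 and 2.007 for n = 53, one-sided at 97.5%).

```python
def bare_unpooled(mean1, s1, mean2, s2):
    """Difference between means divided by the arithmetic mean of the two SDs."""
    return abs(mean1 - mean2) / ((s1 + s2) / 2)


def controlled_unpooled(mean1, s1, t1_975, mean2, s2, t2_975):
    """As above, but each SD is scaled by its t value; 1/4 restores parity."""
    return abs(mean1 - mean2) / (0.25 * (s1 * t1_975 + s2 * t2_975))


# Invented example: means 4 units apart, both SDs equal to 1.
print(round(bare_unpooled(0.0, 1.0, 4.0, 1.0), 3))
print(round(controlled_unpooled(0.0, 1.0, 2.306, 4.0, 1.0, 2.007), 3))
```

Because both t values exceed 2 at these sample sizes, the controlled effect size is smaller than the bare one; the two converge as sample sizes grow.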
To illustrate the impact of this correction versus the results from using bare unpooled effect sizes, the maximum acoustic frequency in the “slow song” in Santa Marta Warbler Basileuterus basilicus differs from that in the East Andes population of Three-striped Warbler B. tristriatus by 4.087 SDs, using bare unpooled effect sizes. The controlled unpooled effect size for this variable is lower at 3.910 SDs. This is because n = 9 for basilicus and n = 53 for East Andes tristriatus; one-sided t-distribution values at 97.5% are 2.306 and 2.007 respectively (effectively reflecting that an average multiplier of 2.157 SDs at these sample sizes is equivalent to one of c.2 SDs with infinite data points). This particular population/variable comparison therefore moved from being in a diagnosable category (> 4 SDs’ difference) to not being diagnosable (< 4 SDs’ difference and failing
Effect sizes using a pooled standard deviation, or Cohen’s d, were calculated. First, the pooled standard deviation was calculated:
sp = √[((n1–1)s1² + (n2–1)s2²)/(n1+n2–2)]
Cohen’s d was then calculated as:
|(x̄1–x̄2)| / sp
or, in full:
|(x̄1–x̄2)| / √[((n1–1)s1² + (n2–1)s2²)/(n1+n2–2)]
This was the measure of effect size used by
Bare pooled effect sizes were subjected to an equivalent control for sample size (as for bare unpooled effect sizes), but using t-values at the degrees of freedom of the pooled standard deviation:
Cohen’s d / ((tpooled @ 97.5%)/2),
where tpooled is based on the degrees of freedom for the pooled standard deviation: d.f. = n1+n2–2,
or, in full:
|(x̄1–x̄2)| / (√[((n1–1)s1² + (n2–1)s2²)/(n1+n2–2)] × (tpooled @ 97.5%)/2).
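The pooled measures can be sketched in the same way, assuming the standard definition of the pooled standard deviation, in which each population's variance is weighted by its degrees of freedom. Function names and input values are hypothetical; the pooled t value (about 2.000 for 60 degrees of freedom at 97.5%) is taken from standard tables.

```python
import math


def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation, weighting variances by degrees of freedom."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))


def cohens_d(mean1, s1, n1, mean2, s2, n2):
    """Bare pooled effect size (Cohen's d)."""
    return abs(mean1 - mean2) / pooled_sd(s1, n1, s2, n2)


def controlled_pooled(d, t_pooled_975):
    """Cohen's d controlled for sample size via the pooled-d.f. t value."""
    return d / (t_pooled_975 / 2)


# Invented example: n1 = 9, n2 = 53, so d.f. = 60 for the pooled t value.
d = cohens_d(0.0, 1.0, 9, 4.0, 2.0, 53)
print(round(pooled_sd(1.0, 9, 2.0, 53), 4), round(d, 4),
      round(controlled_pooled(d, 2.000), 4))
```

The pooled standard deviation is pulled towards that of the larger sample, which is one reason pooled and unpooled effect sizes can differ when sample sizes are unequal.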
These four measures of effect size were calculated for each population/variable combination and each outcome was then placed into two sets of buckets. First, in order to obtain a general picture of the magnitude of effect sizes in taxonomic studies, population/variable combinations were placed into a set of buckets divided at intervals of 2 effect sizes (i.e., at approximately 50% differentiation): 0–2, 2–4, 4–6, etc. A second set of buckets was based on
To compare the outcomes achieved using the four different measures of effect size and analyses of Levels 1–5, plots were produced between several of the outcomes. Spearman’s rank correlation coefficient was calculated between statistical significance and effect size outcomes, based on the entire vocal and biometric data sets, so as to examine the inter-relation between the outcomes of applying different measures of differentiation.
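For readers unfamiliar with the statistic, Spearman's rank correlation can be sketched as the Pearson correlation of the ranks of the two variables, with tied values given their average rank. This is a generic illustration with hypothetical function names and invented inputs, not a reproduction of the analysis in this study.

```python
import math


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Invented inputs: a perfectly monotonic relationship.
print(spearman_rho([0.5, 1.2, 3.1, 4.0], [0.9, 0.4, 0.01, 0.001]))
```

Because only ranks are used, the coefficient is insensitive to the very different scales of p-values and effect sizes.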
Tables
Effect of applying different “Type 1 error” corrections on the vocal data set. The tests are ordered (A–G) from least to most conservative corrections. Sirystes, Geotrygon and Hypnelus data are presented outside the totals, since there was no biometric data set on which more conservative cumulative corrections could be applied. In the case of the latter two genera, Adelomyia and Scytalopus 2, only one kind of vocalization was studied.
Measure | Adelomyia | Myrmeciza | Grallaricula | Scytalopus 1 | Scytalopus 2 | Basileuterus | TOTALS | [Sirystes] | [Geotrygon] | [Hypnelus] |
---|---|---|---|---|---|---|---|---|---|---|
No. of vocal variables | 10 | 26 | 14 | 12 | 7 | 19 | | 18 | 2 | 5 |
A. No correction | ||||||||||
p< | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | | 0.05 | 0.05 | 0.05 |
Passed | 2 | 406 | 323 | 236 | 4 | 240 | 1211 | 32 | 2 | 2 |
Total | 10 | 614 | 406 | 336 | 7 | 343 | 1716 | 44 | 2 | 5 |
% passed | 20% | 66.1% | 79.6% | 70.2% | 57.1% | 70.0% | 70.6% | 72.7% | 100% | 40% |
B. Dunn-Šidák with each kind of vocalisation separately | ||||||||||
p< | 0.00512 | 0.00639 | 0.00730 | 0.00851 | 0.00730 | 0.00730 | | 0 | 0.0253 | 0.0102 |
Passed | 2 | 357 | 293 | 197 | 3 | 200 | 1052 | 25 | 2 | 1 |
Total | 10 | 614 | 406 | 336 | 7 | 343 | 1716 | 44 | 2 | 5 |
% | 20% | 58.1% | 72.2% | 58.6% | 42.9% | 58.3% | 61.3% | 56.8% | 100% | 20% |
C. Bonferroni with each kind of vocalisation separately | ||||||||||
p< | 0.005 | 0.00625 | 0.00714 | 0.00833 | 0.00714 | 0.00714 | | 0.01 | 0.025 | 0.01 |
Passed | 2 | 357 | 293 | 197 | 3 | 200 | 1052 | 25 | 2 | 1 |
Total | 10 | 614 | 406 | 336 | 7 | 343 | 1716 | 44 | 2 | 5 |
% | 20% | 58.1% | 72.2% | 58.6% | 42.9% | 58.3% | 61.3% | 56.8% | 100% | 20% |
D. Dunn-Šidák with voice and biometrics separately | ||||||||||
p< | 0.00512 | 0.00197 | 0.00366 | 0.00427 | 0.00730 | 0.00730 | | 0.00285 | 0.0253 | 0.0102 |
Passed | 2 | 321 | 267 | 186 | 3 | 189 | 968 | 20 | 2 | 1 |
Total | 10 | 614 | 406 | 336 | 7 | 343 | 1716 | 44 | 2 | 5 |
% | 20% | 52.2% | 65.8% | 55.3% | 42.9% | 55.1% | 56.4% | 45.4% | 100% | 20% |
E. Bonferroni with voice and biometrics separately | ||||||||||
p< | 0.005 | 0.00192 | 0.00357 | 0.00417 | 0.00714 | 0.00714 | | 0.00278 | 0.025 | 0.01 |
Passed | 2 | 321 | 266 | 185 | 3 | 188 | 965 | 20 | 2 | 1 |
Total | 10 | 614 | 406 | 336 | 7 | 343 | 1716 | 44 | 2 | 5 |
% | 20% | 52.2% | 65.5% | 55.1% | 42.9% | 54.8% | 56.2% | 45.5% | 100% | 20% |
F. Dunn-Šidák: biometrics plus voice | ||||||||||
p< | 0.00366 | 0.00165 | 0.00256 | 0.00301 | 0.00427 | 0.00427 | ||||
Passed | 2 | 317 | 260 | 179 | 3 | 182 | 943 | |||
Total | 10 | 614 | 406 | 336 | 7 | 343 | 1716 | |||
% | 20% | 51.6% | 64.0% | 53.3% | 42.9% | 53.1% | 55.0% | |||
G. Bonferroni: biometrics plus voice | ||||||||||
p< | 0.00357 | 0.00161 | 0.0025 | 0.00294 | 0.00417 | 0.00417 | ||||
Passed | 2 | 317 | 260 | 178 | 3 | 181 | 941 | |||
Total | 10 | 614 | 406 | 336 | 7 | 343 | 1716 | |||
% | 20% | 51.6% | 64.0% | 53.0% | 42.9% | 52.8% | 54.8% |
Effect of applying different “Type 1 error” corrections on the biometric data set. The tests are ordered (A–E) from least to most conservative corrections. Anisognathus data are presented outside the totals, since there was no vocal data set on which more conservative cumulative corrections could be applied.
Measure | Adelomyia | Myrmeciza | Grallaricula | Scytalopus 1 | Scytalopus 2 | Basileuterus | TOTALS | [Anisognathus] |
---|---|---|---|---|---|---|---|---|
No. of biometric variables | 4 | 5 | 6 | 5 | 5 | 5 | | 5 |
A. No correction | ||||||||
p< | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | | 0.05 |
Passed | 0 | 46 | 142 | 45 | 3 | 66 | 302 | 88 |
Total | 4 | 87 | 281 | 109 | 5 | 150 | 636 | 186 |
% passed | 0% | 52.9% | 51.8% | 41.3% | 60% | 44% | 47.5% | 47.3% |
B. Dunn-Šidák with biometrics and voice separately | ||||||||
p< | 0.0127 | 0.0102 | 0.00851 | 0.0102 | 0.0102 | 0.0102 | | 0.0102 |
Passed | 0 | 31 | 108 | 35 | 2 | 50 | 226 | 66 |
Total | 4 | 87 | 281 | 109 | 5 | 150 | 636 | 186 |
% | 0% | 35.6% | 38.4% | 32.1% | 40% | 33.3% | 35.5% | 35.5% |
C. Bonferroni with biometrics and voice separately | ||||||||
p< | 0.0125 | 0.01 | 0.00833 | 0.01 | 0.01 | 0.01 | | 0.01 |
Passed | 0 | 31 | 108 | 35 | 2 | 50 | 226 | 66 |
Total | 4 | 87 | 281 | 109 | 5 | 150 | 636 | 186 |
% | 0% | 35.6% | 38.4% | 32.1% | 40% | 33.3% | 35.5% | 35.5% |
D. Dunn-Šidák: biometrics plus voice | ||||||||
p< | 0.00366 | 0.00165 | 0.00256 | 0.00301 | 0.00427 | 0.00213 | ||
Passed | 0 | 33 | 92 | 31 | 2 | 40 | 198 | |
Total | 4 | 87 | 281 | 109 | 5 | 150 | 636 | |
% | 0% | 37.9% | 32.7% | 28.4% | 40% | 26.7% | 31.1% | |
E. Bonferroni: biometrics plus voice | ||||||||
p< | 0.00357 | 0.00161 | 0.0025 | 0.00294 | 0.00417 | 0.00208 | ||
Passed | 0 | 33 | 92 | 31 | 2 | 40 | 198 | |
Total | 4 | 87 | 281 | 109 | 5 | 150 | 636 | |
% | 0% | 37.9% | 32.7% | 28.4% | 40% | 26.7% | 31.1% |
In the biometrics study, lower levels of statistically significant differentiation were found than for voice. More comparisons were non-significant (52.5%) than significant, even prior to applying any type 1 corrections. Applying type 1 corrections eliminated a further 12–16% of outcomes. Fewer than 5% of these eliminations result from treating voice and biometrics together; the bulk resulted from applying Bonferroni on the biometric data set itself. Dunn-Šidák corrections had no impact compared to using Bonferroni.
Tables
Outcomes of pairwise comparisons for vocal characters, placed into the different categories recovered by applying the statistical tests of Levels 1 through 5. 1 = statistically significant, 2 = 50% diagnosis, 3 = 75% diagnosis, 4 = actual value diagnosis, 5 = 95% diagnosis (as detailed further in methods). See Table
Voice: levels passed | None | 1 | 14 | 12 | 124 | 123 | 1234 | 12345 | 1235 | 1245 | 125 | 2 | 2345 | 24 | 245 | 4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Geotrygon | 1 | 1 | ||||||||||||||
Adelomyia | 8 | 2 | ||||||||||||||
Hypnelus | 3 | 1 | 1 | |||||||||||||
Myrmeciza | 284 | 38 | 53 | 35 | 23 | 9 | 4 | 109 | 11 | 2 | 37 | 2 | 5 | 2 | ||
Grallaricula | 118 | 92 | 1 | 47 | 27 | 7 | 5 | 71 | 3 | 12 | 1 | 2 | 12 | 8 | ||
Scytalopus 1 | 148 | 82 | 37 | 17 | 5 | 7 | 36 | 1 | 1 | 2 | ||||||
Scytalopus 2 | 4 | 3 | ||||||||||||||
Sirystes | 22 | 7 | 7 | 3 | 1 | 2 | 1 | 1 | ||||||||
Basileuterus | 504 | 188 | 3 | 73 | 53 | 7 | 6 | 47 | 7 | 2 | 0 | 1 | 11 | 0 | 22 | |
TOTAL | 1091 | 410 | 57 | 199 | 128 | 28 | 23 | 265 | 22 | 16 | 38 | 7 | 12 | 18 | 3 | 31 |
Percentages | 46.5% | 17.5% | 2.4% | 8.5% | 5.5% | 1.2% | 1.0% | 11.3% | 0.9% | 0.7% | 1.6% | 0.3% | 0.5% | 0.8% | 0.1% | 1.3% |
Outcomes of pairwise comparisons for biometric characters, placed into the different categories recovered by applying the statistical tests of Levels 1 through 5. 1 = statistically significant, 2 = 50% diagnosis, 3 = 75% diagnosis, 4 = actual value diagnosis, 5 = 95% diagnosis (as detailed further in methods). See Table
Biometrics: levels passed | None | 1 | 14 | 12 | 124 | 123 | 1234 | 12345 | 1235 | 1245 | 125 | 2 | 2345 | 24 | 245 | 4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Adelomyia | 4 | |||||||||||||||
Myrmeciza | 25 | 24 | 9 | 1 | 4 | 24 | ||||||||||
Grallaricula | 166 | 17 | 1 | 15 | 23 | 1 | 5 | 35 | 3 | 7 | 8 | |||||
Scytalopus 1 | 47 | 6 | 4 | 8 | 12 | 2 | 3 | 27 | ||||||||
Scytalopus 2 | 3 | 1 | 1 | |||||||||||||
Anisognathus | 120 | 43 | 12 | 6 | 1 | 4 | ||||||||||
Basileuterus | 97 | 27 | 1 | 10 | 9 | 1 | 2 | 1 | 2 | |||||||
TOTAL | 462 | 117 | 6 | 54 | 52 | 1 | 9 | 49 | 0 | 3 | 0 | 0 | 0 | 8 | 0 | 61 |
Percentages | 56.2% | 14.2% | 0.7% | 6.6% | 6.3% | 0.1% | 1.1% | 6.0% | 0.0% | 0.4% | 0.0% | 0.0% | 0.0% | 1.0% | 0.0% | 7.4% |
Outcomes of pairwise comparisons using Levels analysis, for voice, by grouping. See Table
Voice: Taxon | Pairwise statistical tests (/5) | No diff. | Poss. false results | Signif. Only | 50% | 75% | 95% |
---|---|---|---|---|---|---|---|
Geotrygon | 2 | 0 | 0 | 1 | 1 | 0 | 0 |
Adelomyia | 10 | 8 | 0 | 2 | 0 | 0 | 0 |
Hypnelus | 5 | 3 | 1 | 0 | 1 | 0 | 0 |
Myrmeciza | 614 | 284 | 9 | 91 | 58 | 13 | 159 |
Grallaricula | 406 | 118 | 22 | 93 | 74 | 12 | 87 |
Scytalopus 1 | 336 | 148 | 3 | 82 | 54 | 12 | 37 |
Scytalopus 2 | 7 | 4 | 0 | 0 | 3 | 0 | 0 |
Sirystes | 44 | 22 | 2 | 7 | 10 | 1 | 2 |
Basileuterus | 924 | 504 | 34 | 191 | 126 | 13 | 56 |
TOTALS | 2348 | 1091 | 71 | 467 | 327 | 51 | 341 |
OVERALL % | | 46.5% | 3.0% | 19.9% | 13.9% | 2.2% | 14.5% |
% (comparable) | | 46.4% | 2.9% | 20.0% | 13.7% | 2.1% | 14.7% |
Outcomes of pairwise comparisons using Levels analysis, for biometrics, by grouping. See Table
Biometrics: Taxon | Pairwise statistical tests (/5) | No diff. | Poss. false results | Signif. only | 50% | 75% | 95% |
---|---|---|---|---|---|---|---|
Adelomyia | 4 | 4 | 0 | 0 | 0 | 0 | 0 |
Myrmeciza | 87 | 25 | 24 | 24 | 10 | 0 | 4 |
Grallaricula | 281 | 166 | 15 | 18 | 38 | 6 | 38 |
Scytalopus 1 | 109 | 47 | 27 | 10 | 20 | 2 | 3 |
Scytalopus 2 | 5 | 3 | 0 | 0 | 1 | 0 | 1 |
Anisognathus | 186 | 120 | 0 | 43 | 18 | 1 | 4 |
Basileuterus | 150 | 97 | 3 | 28 | 19 | 1 | 2 |
TOTALS | 822 | 462 | 69 | 123 | 106 | 10 | 52 |
OVERALL % | | 56.2% | 8.4% | 15.0% | 12.9% | 1.2% | 6.3% |
% (comparable) | | 53.8% | 10.8% | 12.6% | 13.8% | 1.4% | 7.5% |
As foreshadowed in the type 1 error analysis (Tables
Levels 1–5 were generally ordered from least to most exacting in terms of difficulty to pass. However, several examples of “outliers” were uncovered, where more liberal test outcomes were apparently “skipped”, e.g.: (i) only statistical significance and non-overlap (1&4); (ii) statistical significance with 50% and non-overlap but not 75% diagnosis (124); (iii) all tests being passed except non-overlap (123&5); (iv) all tests including 95% diagnosis being passed, but excluding 75% diagnosis (124&5); (v) statistical significance and 50% and 95% diagnosis being met but neither 75% nor non-overlap (12&5); and (vi) combinations skipping statistical significance altogether, but passing other tests (all outcomes starting with 2 or 4). These outcomes are all statistically plausible, including as a result of the values of t at particular sample sizes for different confidence limits, even if in some cases they are logically counterintuitive.
In terms of specific findings for birds, biometric data were less informative than vocal data with “possibly false results” also being more frequent for biometric comparisons.
Results for effect sizes divided into buckets of 2d are set out in Tables
Results of the effect size study for voice, partitioning the data into intervals of 2 effect sizes. The top two tables are based upon actual standard deviations for each set of data subjected to pairwise comparison. The lower two tables are based on pooled standard deviation data. In each case, “bare” effect sizes are shown first (above). The second and fourth tables use “controlled effect sizes” for the relevant pooling approach, calculated by taking into account t-distribution values for the relevant sample size (or pooled sample size).
Bare Unpooled Effect Sizes (Voice) | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Taxon | 0–2 | 2–4 | 4–6 | 6–8 | 8–10 | 10–12 | 12–14 | 14–16 | 16–18 | 18–20 | 20+ |
Geotrygon | 1 | 1 | |||||||||
Adelomyia | 10 | ||||||||||
Hypnelus | 3 | 2 | |||||||||
Myrmeciza | 361 | 119 | 55 | 33 | 11 | 14 | 8 | 6 | 6 | 1 | 0 |
Grallaricula | 208 | 91 | 47 | 16 | 10 | 7 | 3 | 14 | 6 | 3 | 1 |
Scytalopus 1 | 227 | 66 | 30 | 7 | 6 | ||||||
Scytalopus 2 | 4 | 3 | |||||||||
Sirystes | 27 | 14 | 2 | 0 | 1 | ||||||
Basileuterus | 654 | 168 | 47 | 29 | 20 | 4 | 1 | 1 | 0 | 0 | 0 |
TOTAL | 1497 | 462 | 181 | 85 | 48 | 25 | 12 | 21 | 12 | 4 | 1 |
Percentage | 63.7% | 19.8% | 7.7% | 3.6% | 2.0% | 1.1% | 0.5% | 0.9% | 0.5% | 0.2% | 0.0% |
Controlled Unpooled Effect Sizes (Voice) | |||||||||||
Taxon | 0–2 | 2–4 | 4–6 | 6–8 | 8–10 | 10–12 | 12–14 | 14–16 | 16–18 | 18–20 | 20+ |
Geotrygon | 1 | 1 | |||||||||
Adelomyia | 10 | ||||||||||
Hypnelus | 4 | 1 | |||||||||
Myrmeciza | 375 | 114 | 53 | 32 | 14 | 15 | 0 | 7 | 3 | 1 | |
Grallaricula | 219 | 88 | 43 | 25 | 14 | 5 | 6 | 4 | 1 | 0 | 1 |
Scytalopus 1 | 230 | 69 | 27 | 7 | 3 | ||||||
Scytalopus 2 | 4 | 3 | |||||||||
Sirystes | 29 | 12 | 3 | ||||||||
Basileuterus | 717 | 151 | 34 | 14 | 6 | 2 | |||||
TOTAL | 1589 | 439 | 160 | 78 | 37 | 22 | 6 | 11 | 4 | 1 | 1 |
Percentage | 67.7% | 18.7% | 6.8% | 3.3% | 1.6% | 0.9% | 0.3% | 0.5% | 0.2% | 0.0% | 0.0% |
Bare Pooled Effect Sizes (Voice) | |||||||||||
Taxon | 0–2 | 2–4 | 4–6 | 6–8 | 8–10 | 10–12 | 12–14 | 14–16 | 16–18 | 18–20 | 20+ |
Geotrygon | 1 | 1 | |||||||||
Adelomyia | 10 | ||||||||||
Hypnelus | 3 | 2 | |||||||||
Myrmeciza | 371 | 130 | 48 | 19 | 17 | 10 | 9 | 4 | 6 | 0 | 0 |
Grallaricula | 222 | 98 | 33 | 16 | 6 | 9 | 7 | 6 | 3 | 2 | 4 |
Scytalopus 1 | 232 | 58 | 36 | 5 | 5 | ||||||
Scytalopus 2 | 4 | 3 | |||||||||
Sirystes | 28 | 13 | 2 | 0 | 0 | 0 | 0 | 1 | |||
Basileuterus | 687 | 151 | 40 | 16 | 11 | 8 | 5 | 2 | 1 | 1 | 2 |
TOTAL | 1558 | 456 | 159 | 56 | 39 | 27 | 21 | 13 | 10 | 3 | 6 |
Percentage | 66.4% | 19.4% | 6.8% | 2.4% | 1.7% | 1.1% | 0.9% | 0.6% | 0.4% | 0.1% | 0.3% |
Controlled Pooled Effect Sizes (Voice) | |||||||||||
Taxon | 0–2 | 2–4 | 4–6 | 6–8 | 8–10 | 10–12 | 12–14 | 14–16 | 16–18 | 18–20 | 20+ |
Geotrygon | 1 | 1 | |||||||||
Adelomyia | 10 | ||||||||||
Hypnelus | 3 | 2 | |||||||||
Myrmeciza | 374 | 127 | 49 | 21 | 19 | 6 | 8 | 4 | 6 | ||
Grallaricula | 226 | 99 | 31 | 17 | 8 | 8 | 6 | 5 | 2 | 2 | 2 |
Scytalopus 1 | 233 | 58 | 35 | 5 | 5 | ||||||
Scytalopus 2 | 4 | 3 | |||||||||
Sirystes | 28 | 13 | 2 | 0 | 0 | 0 | 1 | ||||
Basileuterus | 694 | 153 | 31 | 22 | 8 | 7 | 4 | 1 | 1 | 1 | 2 |
TOTAL | 1573 | 456 | 148 | 65 | 40 | 21 | 19 | 10 | 9 | 3 | 4 |
Percentage | 67.0% | 19.4% | 6.3% | 2.8% | 1.7% | 0.9% | 0.8% | 0.4% | 0.4% | 0.1% | 0.2% |
Results of the effect size study for biometrics, partitioning the data into intervals of 2 effect sizes. The top two tables are based upon actual standard deviations for each set of data subjected to pairwise comparison. The lower two tables are based on pooled standard deviation data. In each case, “bare” effect sizes are shown first (above). The second and fourth tables use “controlled effect sizes” for the relevant pooling approach, calculated by taking into account t-distribution values for the relevant sample size (or pooled sample size).
Bare Unpooled Effect Sizes (Biometrics) | |||||||||||
Taxon | 0–2 | 2–4 | 4–6 | 6–8 | 8–10 | 10–12 | 12–14 | 14–16 | 16–18 | 18–20 | 20+ |
Adelomyia | 4 | ||||||||||
Myrmeciza | 58 | 25 | 4 | ||||||||
Grallaricula | 170 | 63 | 21 | 14 | 5 | 3 | 1 | 2 | 0 | 2 | |
Scytalopus 1 | 67 | 33 | 7 | 1 | 1 | ||||||
Scytalopus 2 | 2 | 2 | 0 | 1 | |||||||
Anisognathus | 154 | 26 | 5 | 0 | 1 | ||||||
Basileuterus | 116 | 27 | 7 | ||||||||
TOTAL | 571 | 176 | 44 | 16 | 7 | 3 | 1 | 2 | 0 | 2 | 0 |
Percentage | 69.5% | 21.4% | 5.4% | 1.9% | 0.9% | 0.4% | 0.1% | 0.2% | 0.0% | 0.2% | 0.0% |
Controlled Unpooled Effect Sizes (Biometrics) | |||||||||||
Taxon | 0–2 | 2–4 | 4–6 | 6–8 | 8–10 | 10–12 | 12–14 | 14–16 | 16–18 | 18–20 | 20+ |
Adelomyia | 4 | ||||||||||
Myrmeciza | 73 | 10 | 4 | ||||||||
Grallaricula | 192 | 51 | 20 | 12 | 3 | 0 | 0 | 1 | 2 | ||
Scytalopus 1 | 84 | 22 | 2 | 1 | |||||||
Scytalopus 2 | 3 | 1 | 1 | ||||||||
Anisognathus | 163 | 19 | 3 | 1 | |||||||
Basileuterus | 127 | 21 | 2 | ||||||||
TOTAL | 646 | 124 | 32 | 14 | 3 | 0 | 0 | 1 | 2 | 0 | 0 |
Percentage | 78.6% | 15.1% | 3.9% | 1.7% | 0.4% | 0.0% | 0.0% | 0.1% | 0.2% | 0.0% | 0.0% |
Bare Pooled Effect Sizes (Biometrics) | |||||||||||
Taxon | 0–2 | 2–4 | 4–6 | 6–8 | 8–10 | 10–12 | 12–14 | 14–16 | 16–18 | 18–20 | 20+ |
Adelomyia | 4 | ||||||||||
Myrmeciza | 57 | 28 | 2 | ||||||||
Grallaricula | 177 | 54 | 23 | 14 | 3 | 5 | 2 | 2 | 1 | ||
Scytalopus 1 | 67 | 36 | 6 | ||||||||
Scytalopus 2 | 2 | 2 | 1 | ||||||||
Anisognathus | 158 | 23 | 4 | 0 | 1 | ||||||
Basileuterus | 127 | 17 | 6 | ||||||||
TOTAL | 592 | 160 | 42 | 14 | 4 | 5 | 2 | 2 | 1 | 0 | 0 |
Percentage | 72.0% | 19.5% | 5.1% | 1.7% | 0.5% | 0.6% | 0.2% | 0.2% | 0.1% | 0.0% | 0.0% |
Controlled Pooled Effect Sizes (Biometrics) | |||||||||||
Taxon | 0–2 | 2–4 | 4–6 | 6–8 | 8–10 | 10–12 | 12–14 | 14–16 | 16–18 | 18–20 | 20+ |
Adelomyia | 4 | ||||||||||
Myrmeciza | 58 | 27 | 2 | ||||||||
Grallaricula | 181 | 54 | 21 | 11 | 8 | 3 | 1 | 1 | 1 | ||
Scytalopus 1 | 75 | 30 | 4 | ||||||||
Scytalopus 2 | 3 | 2 | |||||||||
Anisognathus | 161 | 21 | 3 | 0 | 1 | ||||||
Basileuterus | 128 | 17 | 5 | ||||||||
TOTAL | 610 | 151 | 35 | 11 | 9 | 3 | 1 | 1 | 1 | 0 | 0 |
Percentage | 74.2% | 18.4% | 4.3% | 1.3% | 1.1% | 0.4% | 0.1% | 0.1% | 0.1% | 0.0% | 0.0% |
A good portion (15–22%) of outcomes fell into the 2–4 effect sizes category, which, when using controlled unpooled effect sizes, corresponds to Level 2 in Tables
Table
Changes in effect size categories resulting from increasingly conservative tests of effect size being applied. This table is based upon changes between the categories in Tables
Voice: changes into effect size category | 0–2 | 2–4 | 4–6 | 6–8 | 8–10 | 10–12 | 12–14 | 14–16 | 16–18 | 18–20 | 20+ |
Bare Unpooled –> Bare Pooled | +63 | -8 | -22 | -29 | -9 | +2 | +9 | -8 | -2 | -1 | +5 |
Bare Pooled –> Controlled Pooled | +15 | 0 | -11 | +9 | +1 | -6 | -2 | -3 | -1 | 0 | -2 |
Controlled Pooled –> Controlled Unpooled | +16 | -17 | +12 | +13 | -3 | +1 | -13 | +1 | -5 | -2 | -3 |
Total change from Bare Unpooled –> Controlled Unpooled | +94 | -25 | -21 | -7 | -11 | -3 | -6 | -10 | -8 | -3 | 0 |
As percentage of total | +4.0% | -1.1% | -0.9% | -0.3% | -0.5% | -0.1% | -0.3% | -0.4% | -0.3% | -0.1% | 0.0% |
Biometrics: changes into effect size category | 0–2 | 2–4 | 4–6 | 6–8 | 8–10 | 10–12 | 12–14 | 14–16 | 16–18 | 18–20 | 20+ |
Bare Unpooled –> Bare Pooled | +21 | -16 | -2 | -2 | -3 | +2 | +1 | 0 | +1 | -2 | 0 |
Bare Pooled –> Controlled Pooled | +18 | -9 | -7 | -3 | +5 | -2 | -1 | -1 | 0 | 0 | 0 |
Controlled Pooled –> Controlled Unpooled | +36 | -27 | -3 | +3 | -6 | -3 | -1 | 0 | +1 | 0 | 0 |
Total change from Bare Unpooled –> Controlled Unpooled | +75 | -52 | -12 | -2 | -4 | -3 | -1 | -1 | +2 | -2 | 0 |
As percentage of total | +9.1% | -6.3% | -1.5% | -0.2% | -0.5% | -0.4% | -0.1% | -0.1% | +0.2% | -0.2% | 0.0% |
Results for effect sizes divided into
Results of the effects size study for voice using
Bare Unpooled Effect Sizes (Voice) | |||||
Taxon | 0–0.2 | 0.2–2 | 2–5 | 5–10 | 10+ |
Geotrygon | 0 | 1 | 1 | ||
Adelomyia | 3 | 7 | |||
Hypnelus | 0 | 3 | 2 | ||
Myrmeciza | 70 | 291 | 152 | 66 | 35 |
Grallaricula | 34 | 174 | 119 | 45 | 34 |
Scytalopus 1 | 31 | 196 | 83 | 26 | |
Scytalopus 2 | 0 | 4 | 3 | ||
Sirystes | 4 | 23 | 14 | 3 | |
Basileuterus | 96 | 558 | 200 | 64 | 6 |
TOTAL | 239 | 1257 | 574 | 204 | 75 |
Percentage | 10.1% | 53.5% | 24.4% | 8.7% | 3.2% |
Controlled Unpooled Effect Sizes (Voice) | |||||
Taxon | 0–0.2 | 0.2–2 | 2–5 | 5–10 | 10+ |
Geotrygon | 0 | 1 | 1 | ||
Adelomyia | 3 | 7 | |||
Hypnelus | 0 | 4 | 1 | ||
Myrmeciza | 73 | 302 | 144 | 69 | 26 |
Grallaricula | 37 | 182 | 121 | 49 | 17 |
Scytalopus 1 | 33 | 197 | 85 | 21 | |
Scytalopus 2 | 0 | 4 | 3 | ||
Sirystes | 4 | 25 | 12 | 3 | |
Basileuterus | 113 | 604 | 171 | 34 | 2 |
TOTAL | 263 | 1326 | 538 | 176 | 45 |
Percentage | 11.2% | 56.5% | 22.9% | 7.5% | 1.9% |
Bare Pooled Effect Sizes (Voice) | |||||
Taxon | 0–0.2 | 0.2–2 | 2–5 | 5–10 | 10+ |
Geotrygon | 0 | 1 | 1 | ||
Adelomyia | 3 | 7 | |||
Hypnelus | 0 | 3 | 2 | ||
Myrmeciza | 71 | 300 | 157 | 57 | 29 |
Grallaricula | 35 | 187 | 118 | 35 | 31 |
Scytalopus 1 | 31 | 201 | 78 | 26 | |
Scytalopus 2 | 0 | 4 | 3 | ||
Sirystes | 4 | 24 | 14 | 2 | |
Basileuterus | 105 | 582 | 181 | 37 | 19 |
TOTAL | 249 | 1309 | 554 | 157 | 79 |
Percentage | 10.6% | 55.7% | 23.6% | 6.7% | 3.4% |
Controlled Pooled Effect Sizes (Voice) | |||||
Taxon | 0–0.2 | 0.2–2 | 2–5 | 5–10 | 10+ |
Geotrygon | 0 | 1 | 1 | ||
Adelomyia | 3 | 7 | |||
Hypnelus | 0 | 3 | 2 | ||
Myrmeciza | 71 | 303 | 158 | 58 | 24 |
Grallaricula | 36 | 190 | 115 | 40 | 25 |
Scytalopus 1 | 30 | 203 | 79 | 24 | |
Scytalopus 2 | 0 | 4 | 3 | ||
Sirystes | 4 | 24 | 14 | 2 | |
Basileuterus | 109 | 585 | 177 | 37 | 16 |
TOTAL | 253 | 1320 | 549 | 161 | 65 |
Percentage | 10.8% | 56.2% | 23.4% | 6.9% | 2.8% |
Results of the effects size study for biometrics using
Bare Unpooled Effect Sizes (Biometrics) | |||||
Taxon | 0–0.2 | 0.2–2 | 2–5 | 5–10 | 10+ |
Adelomyia | 1 | 3 | |||
Myrmeciza | 5 | 53 | 27 | 2 | |
Grallaricula | 24 | 146 | 70 | 33 | 8 |
Scytalopus 1 | 7 | 60 | 40 | 2 | |
Scytalopus 2 | 0 | 2 | 2 | 1 | |
Anisognathus | 22 | 132 | 28 | 4 | |
Basileuterus | 19 | 97 | 32 | 2 | |
TOTAL | 78 | 493 | 199 | 44 | 8 |
Percentage | 9.5% | 60.0% | 24.2% | 5.4% | 1.0% |
Controlled Unpooled Effect Sizes (Biometrics) | |||||
Taxon | 0–0.2 | 0.2–2 | 2–5 | 5–10 | 10+ |
Adelomyia | 1 | 3 | |||
Myrmeciza | 8 | 65 | 14 | ||
Grallaricula | 29 | 163 | 61 | 25 | 3 |
Scytalopus 1 | 17 | 67 | 24 | 1 | |
Scytalopus 2 | 0 | 3 | 2 | ||
Anisognathus | 24 | 139 | 21 | 2 | |
Basileuterus | 28 | 99 | 23 | ||
TOTAL | 107 | 539 | 145 | 28 | 3 |
Percentage | 13.0% | 65.6% | 17.6% | 3.4% | 0.4% |
Bare Pooled Effect Sizes (Biometrics) | |||||
Taxon | 0–0.2 | 0.2–2 | 2–5 | 5–10 | 10+ |
Adelomyia | 1 | 3 | |||
Myrmeciza | 6 | 51 | 28 | 2 | |
Grallaricula | 26 | 151 | 63 | 31 | 10 |
Scytalopus 1 | 8 | 59 | 40 | 2 | |
Scytalopus 2 | 0 | 2 | 3 | ||
Anisognathus | 23 | 135 | 24 | 4 | |
Basileuterus | 20 | 97 | 21 | 12 | |
TOTAL | 84 | 498 | 179 | 51 | 10 |
Percentage | 10.2% | 60.6% | 21.8% | 6.2% | 1.2% |
Controlled Pooled Effect Sizes (Biometrics) | |||||
Taxon | 0–0.2 | 0.2–2 | 2–5 | 5–10 | 10+ |
Adelomyia | 1 | 3 | |||
Myrmeciza | 7 | 51 | 28 | 1 | |
Grallaricula | 30 | 151 | 63 | 31 | 6 |
Scytalopus 1 | 11 | 64 | 32 | 2 | |
Scytalopus 2 | 0 | 2 | 3 | ||
Anisognathus | 23 | 138 | 22 | 3 | |
Basileuterus | 21 | 107 | 22 | ||
TOTAL | 93 | 516 | 170 | 37 | 6 |
Percentage | 11.3% | 62.8% | 20.7% | 4.5% | 0.7% |
Changes between category (Table
Changes in
Voice: changes into effect size category | 0–0.2 | 0.2–2 | 2–5 | 5–10 | 10+ |
Bare Unpooled –> Bare Pooled | +11 | +52 | -20 | -47 | +4 |
Bare Pooled –> Controlled Pooled | +4 | +11 | -5 | +4 | -14 |
Controlled Pooled –> Controlled Unpooled | +10 | +6 | -11 | +15 | -20 |
Total change from Bare Unpooled –> Controlled Unpooled | +25 | +69 | -36 | -28 | -30 |
As percentage of total | +1.1% | +2.9% | -1.5% | -1.2% | -1.3% |
Biometrics: changes into effect size category | 0–0.2 | 0.2–2 | 2–5 | 5–10 | 10+ |
Bare Unpooled –> Bare Pooled | +6 | +5 | -20 | +7 | +2 |
Bare Pooled –> Controlled Pooled | +9 | +18 | -9 | -14 | -4 |
Controlled Pooled –> Controlled Unpooled | +14 | +23 | -25 | -9 | -3 |
Total change from Bare Unpooled –> Controlled Unpooled | +29 | +46 | -54 | -16 | -5 |
As percentage of total | +3.5% | +5.6% | -6.6% | -1.9% | -0.6% |
Tables
Scatter-graphs showing the effects of applying different corrections of effect size on the entire vocal data set. Each axis shows effect size, measured in a different way. A Controlling for sample size using unpooled data – x-axis: bare unpooled effect size; y-axis: controlled unpooled effect size B Controlling for sample size using pooled data – x-axis: bare pooled effect size; y-axis: controlled pooled effect size C Using pooled versus unpooled effect sizes without controlling for sample size – x-axis: bare unpooled effect size; y-axis: bare pooled effect size D Using pooled versus unpooled effect sizes and controlling for sample size – x-axis: controlled unpooled effect size; y-axis: controlled pooled effect size. A single data point of greater than 25 effect sizes was excluded to improve presentation of the results.
Scatter-graphs showing the effects of applying different corrections of effect size on the entire biometric data set. Each axis shows effect size, measured in a different way. A Controlling for sample size using unpooled data – x-axis: bare unpooled effect size; y-axis: controlled unpooled effect size B Controlling for sample size using pooled data – x-axis: bare pooled effect size; y-axis: controlled pooled effect size C Using pooled versus unpooled effect sizes without controlling for sample size – x-axis: bare unpooled effect size; y-axis: bare pooled effect size D Using pooled versus unpooled effect sizes and controlling for sample size – x-axis: controlled unpooled effect size; y-axis: controlled pooled effect size.
Spearman’s rank correlation coefficients between the results of five statistical tests carried out on the vocal data set. Confidence for the values stated was given in PAST as zero or as p<1×10⁻¹⁰⁰ for all tests.
Tests conducted on vocal data set | Bare Unpooled Effect Sizes | Controlled Unpooled Effect Sizes | Bare Pooled Effect Sizes | Controlled Pooled Effect Sizes |
---|---|---|---|---|
Statistical significance (Student’s t) | -0.863 | -0.910 | -0.849 | -0.859 |
Bare Unpooled Effect Sizes | | 0.987 | 0.988 | 0.908 |
Controlled Unpooled Effect Sizes | | | 0.894 | 0.980 |
Bare Pooled Effect Sizes | | | | 0.999 |
Spearman’s rank correlation coefficients between the results of five statistical tests carried out on the biometric data set. Confidence for the values stated was given in PAST as zero or as p<1×10⁻¹⁰⁰ for all tests.
Tests conducted on biometric data set | Bare Unpooled Effect Sizes | Controlled Unpooled Effect Sizes | Bare Pooled Effect Sizes | Controlled Pooled Effect Sizes |
---|---|---|---|---|
Statistical significance (Student’s t) | -0.851 | -0.951 | -0.831 | -0.857 |
Bare Unpooled Effect Sizes | | 0.936 | 0.990 | 0.983 |
Controlled Unpooled Effect Sizes | | | 0.920 | 0.936 |
Bare Pooled Effect Sizes | | | | 0.994 |
Changes in actual effect sizes resulting from changes between different methods of measuring effect sizes in the format actual mean ± standard deviation (minimum–maximum) [mean using absolute values ± standard deviation using absolute values], for voice. The figure in each cell demonstrates the outcomes of subtracting the effect size in the columns from the effect sizes in the rows, for each data point studied.
Tests conducted on vocal data set | Controlled Unpooled Effect Sizes | Bare Pooled Effect Sizes | Controlled Pooled Effect Sizes |
---|---|---|---|
Bare Unpooled Effect Sizes | 0.35 ± 1.15(-0.20–24.26)[0.35 ± 1.15] | 0.15 ± 0.91(-11.73–7.13)[0.36 ± 0.84] | 0.24 ± 1.07(-11.32–19.28)[0.41 ± 1.02] |
Controlled Unpooled Effect Sizes | | -0.19 ± 1.55(-20.70–4.40)[0.45 ± 1.50] | -0.11 ± 1.36(-20.26–5.63)[0.40 ± 1.31] |
Bare Pooled Effect Sizes | | | 0.08 ± 0.46(-0.27–15.40)[0.09 ± 0.46] |
Changes in actual effect sizes resulting from changes between different methods of measuring effect sizes in the format actual mean ± standard deviation (minimum–maximum) [mean using absolute values ± standard deviation using absolute values] for biometrics. The figure in each cell demonstrates the outcomes of subtracting the effect size in the columns from the effect sizes in the rows, for each data point studied.
Tests conducted on biometric data set | Controlled Unpooled Effect Sizes | Bare Pooled Effect Sizes | Controlled Pooled Effect Sizes |
---|---|---|---|
Bare Unpooled Effect Sizes | 0.41 ± 0.68(-0.01–5.61)[0.41 ± 0.68] | 0.09 ± 0.44(-2.17–5.34)[0.19 ± 0.40] | 0.21 ± 0.53(-1.97–5.77)[0.26 ± 0.51] |
Controlled Unpooled Effect Sizes | | -0.32 ± 0.73(-6.31–2.36)[0.39 ± 0.69] | -0.20 ± 0.57(-4.4–2.78)[0.30 ± 0.53] |
Bare Pooled Effect Sizes | | | 0.12 ± 0.28(-0.01–2.84)[0.12 ± 0.28] |
The largest shift was observed between bare unpooled effect sizes and controlled unpooled effect sizes, where a 4.0% (voice) or 9.1% (biometrics) increase in the number of outcomes in the lowest category of differentiation (0–2) was observed.
The impact of using bare pooled versus bare unpooled effect sizes is illustrated in Figures
The overall magnitude of reduction of effect size measurements between bare pooled effect sizes and controlled pooled effect sizes was moderate. Degrees of freedom for pooled standard deviation are higher (the sum of the two samples’ sample sizes minus 2) than when using unpooled methods (where each sample is treated separately), resulting in lower t-values when using pooled standard deviations. Application of t-distribution corrections on effect sizes using unpooled standard deviations resulted in the most conservative of all measures of effect sizes, linked to overall lowest degrees of freedom in corrections and overall higher t-values.
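The degrees-of-freedom argument above can be illustrated with a minimal sketch. It assumes a two-tailed 95% confidence level (the 97.5th percentile of t) and uses a small lookup of standard t-table values; the function names are hypothetical and are not taken from this study.

```python
# Two-tailed 95% critical values of Student's t (97.5th percentile),
# copied from standard t-tables for 1 to 10 degrees of freedom.
T_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
         6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def unpooled_df(n):
    """Degrees of freedom when a sample of size n is treated separately."""
    return n - 1


def pooled_df(n1, n2):
    """Degrees of freedom for a pooled standard deviation."""
    return n1 + n2 - 2


def compare_t_values(n1, n2):
    """Return (t for sample 1 alone, t for sample 2 alone, t for pooled data)."""
    return (T_975[unpooled_df(n1)],
            T_975[unpooled_df(n2)],
            T_975[pooled_df(n1, n2)])


# Hypothetical sample sizes of 3 and 4 individuals.
t1, t2, t_pooled = compare_t_values(3, 4)
print(t1, t2, t_pooled)
```

For small samples, the pooled t-value is lower than either separate t-value, which is why corrections applied to unpooled data are the more conservative.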
Although these overall trends were observed, the impact of applying differing methods of measurement of effect sizes on actual pairwise comparisons was not uniform (see Figures
Statistical significance presented a weak negative correlation with most effect size measurements, but was most closely correlated with controlled unpooled effect sizes. In the case of biometrics, there was a strong negative correlation with controlled unpooled effect sizes (Tables
The variability between particular scores using different effect size measures is set out further in Tables
The relationship between each measurement of effect size and statistical significance is explored in Tables
Effect sizes under the four models studied here, grouped into the three “zones” of statistical significance illustrated in Figures
Statistical significance | Bare unpooled effect sizes | Bare pooled effect sizes | Controlled pooled effect sizes | Controlled unpooled effect sizes |
---|---|---|---|---|
p>0.05 | 0.60 ± 1.31(0.00–11.56) | 0.38 ± 0.33(0.00–2.21) | 0.71 ± 2.13(0.00–22.92) | 0.67 ± 2.02(0.00–22.47) |
0.05/nv<p<0.05 | 1.82 ± 2.69(0.33–18.22) | 1.29 ± 1.24(0.33–8.47) | 1.82 ± 3.00(0.26–21.60) | 1.67 ± 2.61(0.26–20.14) |
p<0.05/nv | 3.69 ± 3.30(0.36–45.33) | 3.32 ± 2.73(0.37–21.07) | 3.32 ± 3.05(0.37–41.45) | 3.22 ± 2.81(0.37–26.05) |
Effect sizes under the four models studied here, grouped into the three “zones” of statistical significance illustrated in Figures
Statistical significance | Bare unpooled effect sizes | Bare pooled effect sizes | Controlled pooled effect sizes | Controlled unpooled effect sizes |
---|---|---|---|---|
p>0.05 | 0.72 ± 0.70(0.00–3.97) | 0.42 ± 0.29(0.00–1.95) | 0.71 ± 0.72(0.00–3.92) | 0.63 ± 0.60(0.00–3.41) |
0.05/nv<p<0.05 | 1.45 ± 0.86(0.44–3.98) | 1.06 ± 0.41(0.44–2.44) | 1.37 ± 0.96(0.40–6.01) | 1.26 ± 0.84(0.40–5.81) |
p<0.05/nv | 3.37 ± 2.64(0.61–19.80) | 2.79 ± 2.07(0.61–17.09) | 3.15 ± 2.44(0.58–16.04) | 2.97 ± 2.23(0.58–15.57) |
The dataset studied here exhibits comparable overall levels of variation to
Several broader aspects of the results can be explained by considering the number of standard deviations’ difference required to satisfy various models (Figure
The overall lower differentiation levels in biometrics can be explained in part by lower sample sizes (see Tables
The outcomes of using pooled versus unpooled and bare versus controlled effect sizes are substantial across the data set as a whole and can be drastic in individual cases (Tables
The distinction between using pooled versus unpooled standard deviations in taxonomy has passed almost without discussion in the ornithological taxonomic literature.
Usage of pooled standard deviations, as a matter of statistical methodology, should only be undertaken where the standard deviations of the two populations under comparison can be assumed to be equal. This does not necessarily mean that measured standard deviations of the two populations must be equal, or even close to one another, since these will usually differ for two measured populations as a result of sampling. However, it must be reasonable to make this assumption in order to apply this method. The pooling formula attributes greater weight to the standard deviation of the population with higher sample size and produces a “weighted average” standard deviation which is closer to that of the population with the larger sample. Degrees of freedom for the pooled standard deviation are greater due to summing those of the two separate populations. In practice, in taxonomy, we will usually have no idea whether the standard deviations of two populations under comparison are equal. Special care should be taken in using pooled standard deviations where estimated population sizes, molecular or geographical attributes of the two populations vary greatly. For example, using pooled standard deviations to compare an isolated, very small montane population with low intra-population molecular variation versus a very widespread lowland population which is known to exhibit substantial clinal variation and has higher intra-population molecular variation would be inappropriate, since assumptions underlying the usage of pooled standard deviations are likely not just to be unknown but incorrect. The greater correlation between statistical significance and controlled unpooled effect sizes (Tables
There are still likely to be “use cases” for pooled standard deviations to measure effect sizes in taxonomy.
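For reference, the weighting behaviour of pooling discussed above can be seen in the standard textbook formula for a pooled standard deviation. The symbols are introduced here for illustration only: s_1 and s_2 are the two sample standard deviations and n_1 and n_2 the sample sizes. It is assumed, not confirmed, that this matches the pooling formula used in the Methods.

```latex
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^{2} + (n_2 - 1)\,s_2^{2}}{n_1 + n_2 - 2}}
```

Each variance is weighted by its own degrees of freedom, so the larger sample dominates, and the denominator n_1 + n_2 - 2 is the summed degrees of freedom of the two samples.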
Most papers concerning the application of statistical tests for determining the taxonomic rank of allopatric populations have noted that statistical significance is not a good measure, due to its potential for liberal satisfaction by increasing sample size, its failure to indicate higher levels of differentiation or false positives when sampling from different parts of a geographical cline; and then move quickly on to discuss better tests (
Although large sample sizes are cited as the basis to reject statistical significance as a useful measure in taxonomy, such problems are rarely faced by taxonomists. Having too few specimens (whether in museums or measured in the field) or sound recordings is likely a more material problem. For new species descriptions in birds published between 1935 and 2009, 332 of 477 (70%) were based on 0–5 specimens, only one was based on >100 specimens and the mean number of specimens was 6 (
The classic test of statistical significance between two populations of data is the Student’s t-test. This evaluates the probability of whether two normally-distributed data sets relate to two different populations, by considering whether or not their means are likely to differ from one another. Various other similar tests can assess differences between mean, median, or modal averages, such as F, Mann-Whitney U, Kolmogorov-Smirnov, Wilks’ Lambda, ANOVA, Kruskal-Wallis, and Tukey-Kramer. Some of these tests are better suited to continuous variables which are non-normally distributed, such as ratios or products of raw data.
Although the t-test will evaluate the likelihood that two populations are different, it tells us little about the extent of differences between the two populations. With a large enough sample, two sample means may differ significantly even when they are very close to one another. Here, the lowest distance between statistically significant outcomes was 0.36 effect sizes. Tests of statistical significance can also be failed on data showing effect sizes as high as 18 (Table
Myrmeciza melanoceps: 2.42 ± 0.12 (1.98–2.62) (n = 143)
Myrmeciza goeldii: 2.29 ± 0.08 (2.01–2.49) (n = 173)
The t-test was passed at p<0.0002, yet these data reveal small differences between means and substantial overlaps in recorded values. The t-test result suggests that the two populations in question have begun to diverge from one another, which is interesting and makes it valid to discuss their relationship and possible isolation mechanisms. However, identification of a sound recording to one or the other species on the basis of these data would be impossible. The effect size here was 1.31, considerably in excess of the lowest (0.36) score, but fewer than 50% of individuals could be identified based on this variable and it would be useless for identification. Regularly, diagnosis is incorrectly asserted in the taxonomic literature based on data like those in this example (see citations above).
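The Myrmeciza example can be reproduced approximately from the rounded summary statistics quoted above. This is a sketch under stated assumptions: the unpooled effect size is taken to be the difference between means divided by the average of the two sample standard deviations, which may differ in detail from the definition in the Methods, and the function names are hypothetical. The t statistic uses the unequal-variance (Welch) form.

```python
import math


def pooled_sd(s1, n1, s2, n2):
    """Standard pooled standard deviation of two samples."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))


def bare_effect_sizes(m1, s1, n1, m2, s2, n2):
    """Return (unpooled, pooled) effect sizes with no sample-size control.

    Assumption: 'unpooled' divides by the simple mean of the two SDs.
    """
    diff = abs(m1 - m2)
    unpooled = diff / ((s1 + s2) / 2)
    pooled = diff / pooled_sd(s1, n1, s2, n2)
    return unpooled, pooled


def welch_t(m1, s1, n1, m2, s2, n2):
    """t statistic for a difference in means, unequal variances."""
    return abs(m1 - m2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


def ranges_overlap(min1, max1, min2, max2):
    """True if the recorded value ranges of the two samples overlap."""
    return min1 <= max2 and min2 <= max1


# Rounded summary statistics from the text: mean, SD, n (and min, max).
melanoceps = (2.42, 0.12, 143)
goeldii = (2.29, 0.08, 173)

unpooled, pooled = bare_effect_sizes(*melanoceps, *goeldii)
t_stat = welch_t(*melanoceps, *goeldii)
overlap = ranges_overlap(1.98, 2.62, 2.01, 2.49)
print(round(unpooled, 2), round(pooled, 2), round(t_stat, 1), overlap)
```

With these rounded inputs both effect sizes come out at about 1.3, close to the 1.31 reported, while the t statistic is about 11 and the recorded ranges overlap. This is the pattern described above: a highly significant difference in means that is of little use for diagnosis.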
The t-test and similar tests demonstrate statistical significance of differences between means. Such differences may have some evolutionary significance. However, a positive t-test is not necessarily of much taxonomic significance (Fig.
In the field of medicine, the outcome of tests of statistical significance is widely understood and accepted to be just a first phase in demonstrating an interesting result. A variety of different approaches exist in medical science which must also be passed to show clinical significance, which, for example, would support the usage of drugs. In any taxonomic study, it is similarly important to move on from the ecology class, beyond statistical significance to consider the taxonomic significance of any results.
That all said, statistical significance can be a tougher test than some proposed measures of differentiation. Instances were found here of pairwise comparisons passing
Scatter-graphs of controlled unpooled effect size (x-axis) versus statistical significance (p<x) (y-axis). All outcomes to the right of the two purple lines represent pairwise comparisons which are given scores of at least 1 under the
Introducing the Dunn-Šidák correction (as opposed to the simpler but overly conservative Bonferroni correction) had a virtually negligible effect (Tables
Bonferroni (and Dunn-Šidák) corrections are appropriately applied to “families” of variables. A middle-ground of treating voice and biometrics as separate “families” is recommended based on this study. This could be criticized, since certain aspects of voice and biometrics can be linked (e.g.,
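The difference between the two corrections can be sketched as follows, using their standard definitions. The function names and the example family of 10 variables are hypothetical; the family size actually used for each data set is the one defined in the Methods.

```python
def bonferroni_alpha(alpha, n_variables):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_variables


def sidak_alpha(alpha, n_variables):
    """Per-test significance threshold under the Dunn-Sidak correction."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_variables)


# Hypothetical family of 10 variables tested at an overall alpha of 0.05.
alpha, n_variables = 0.05, 10
print(bonferroni_alpha(alpha, n_variables), sidak_alpha(alpha, n_variables))
```

The Dunn-Šidák threshold is always slightly higher (less strict) than the Bonferroni threshold for families of more than one variable, but the two are very close at conventional significance levels, which is consistent with the negligible effect reported here.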
Diagnosability was considered to be the most frequently applied criterion to assess rank in a review of over 1000 taxonomic revisions (
In this study, 14.5% of the vocal data set and 6.3% of the biometric data set passed the
There is no consensus as to whether any differentiation below diagnosability for particular characters ought to be recognized in taxonomy.
Originally, 75% diagnosis tests for subspecies used one of two approaches: (i) a 75%/75% test (i.e., 75% of population 1 is diagnosable from 75% of population 2); or (ii) a 75%/99+% test (i.e., 75% of population 1 is diagnosable from essentially all of population 2).
The application of sharp, seemingly arbitrary, tests such as these to classify normally distributed data into segments to which scores are attributed is a situation not unique to taxonomy. Similar hard boundaries are also rife in most education and examination systems. In UK universities, a student scoring 60.1% or 69.9% in an examination will be given the same award (an upper second class degree) but a student attaining 70.1% will get a different award, a first class degree. This is despite the students scoring 69.9% and 70.1% having attained more similar levels of achievement to one another. Whilst any cut-off may be criticized as arbitrarily generous or harsh to outcomes falling close to the line on either side, the application of cut-offs is something that humans tend to do in their quest to categorize things. Where cut-offs are applied, a test of whether the cut-off is a valid one should best be based upon: (a) differentiation of a meaningful number of outcomes; and (b) the setting of boundaries at statistically-, mathematically-, or biologically-meaningful positions.
In this data set, with very large numbers of pairwise comparisons, necessarily many individual cases fall very close to each of the cut-off boundaries proposed by previous models for attributing taxonomic significance, whether at or below diagnosability. Two populations differing by 95% using
In contrast to the 75% (Level 3) test, 50%/95% differentiation (Level 2) measures a mathematically relevant point of differentiation, when the mean of one population moves outside the normal distribution of the other. It also signifies the point at which a population has moved half way towards diagnosability. The number of pairwise comparisons meeting the Level 2 test (but not falling in other buckets) was material but not enormous. Only 30.6% for voice and 20.6% for biometrics of outcomes passed this test at all (Tables
In addition to the diagnosis formula for Level 5,
Traditional interpretations of effect sizes may be appropriately used in other fields but are inappropriate for taxonomic study. It should be borne in mind that the traditional subjective descriptors for effect sizes starting at 0.2 have been developed largely in the fields of social and behavioral science (Cohen 1998,
Overall, this study suggests that: (i)
Solution A:
1 point: Level 1 statistical significance only.
2 points: Level 1 plus Level 2 50% diagnosability.
3 points: Level 1 plus Level 5 full diagnosability.
4 points: Level 1 plus a new measure of a “species and a half” worth of diagnosability (equivalent to 6 controlled effect sizes).
Solution B: would use more proportionate scoring, eliminate the weighting for statistical significance and allow only three scores:
1.5 points: Level 1 plus Level 2 (2 controlled effect sizes).
3 points: Level 1 plus Level 5 (4 controlled effect sizes).
4.5 points: Level 1 plus 6 controlled effect sizes.
Solution C: would abandon these various cut-offs and instead use controlled unpooled effect sizes, calibrated by a scale factor such that no difference = 0 and full diagnosability = 3 and capped at a score of 4.
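The contrast between stepped and continuous scoring in Solutions B and C can be sketched as follows. This is one possible reading, not a definitive implementation: it assumes that full diagnosability corresponds to 4 controlled effect sizes (as stated under Solution B), that Solution C therefore scales by 3/4, and that thresholds are inclusive. All function and constant names are hypothetical.

```python
FULL_DIAGNOSABILITY_ES = 4.0  # controlled effect sizes at Level 5 (Solution B)


def solution_b_score(controlled_es, significant):
    """Stepped score: 0, 1.5, 3 or 4.5 points.

    Assumption: statistical significance (Level 1) is still required,
    but earns no points on its own; thresholds are treated as inclusive.
    """
    if not significant:
        return 0.0
    if controlled_es >= 6.0:
        return 4.5
    if controlled_es >= FULL_DIAGNOSABILITY_ES:
        return 3.0
    if controlled_es >= 2.0:
        return 1.5
    return 0.0


def solution_c_score(controlled_es):
    """Continuous score: 0 at no difference, 3 at full diagnosability,
    capped at 4."""
    return min(4.0, 3.0 * controlled_es / FULL_DIAGNOSABILITY_ES)


# Two hypothetical comparisons falling either side of the Level 5 boundary.
for es in (3.9, 4.1):
    print(es, solution_b_score(es, True), solution_c_score(es))
```

Under the stepped approach, two nearly identical comparisons either side of a boundary receive very different scores, whereas the continuous scale moves them only slightly apart.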
As has been argued elsewhere (e.g.,
For the reasons above, the Level 4 non-overlap test should be abandoned from this framework in order to positively score the 2.5% of vocal outcomes which were diagnosable to 95% but actually overlapped due to very large sample sizes (Table
A disadvantage of the
Logarithmic plot of the same data as in Figure
Graph showing relationship between sample size (x-axis) and numbers of effective SD differences between means or effect sizes (y-axis) required in order to pass a test of diagnosability shown in the legend. Dashed lines represent the four boundaries for affording scores under
Graph illustrating the “hard cut-off” approaches of
One could adapt the
∑[n=1 IF (min s1>max s2 OR min s2>max s1) AND (|(x̄1–x̄2)| > s1(t1 @ 97.5%) + s2(t2 @ 97.5%))]
≤
∑[n=1 IF (min s3>max s4 OR min s4>max s3) AND (|(x̄3–x̄4)| > s3(t3 @ 97.5%) + s4(t4 @ 97.5%))].
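The comparison expressed by this formula can be sketched in code. The sketch assumes one reading of the notation: that "min s" and "max s" refer to the lowest and highest recorded values in a sample, that s in the second condition is the sample standard deviation, and that the t-values at 97.5% are supplied by the caller for each sample's degrees of freedom. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Summary statistics for one variable in one population."""
    mean: float
    sd: float
    minimum: float
    maximum: float
    t975: float  # t at 97.5% for this sample's degrees of freedom


def is_diagnosable(a, b):
    """Both conditions of the formula: non-overlap of recorded values
    and a difference between means exceeding s_a*t_a + s_b*t_b."""
    no_overlap = a.minimum > b.maximum or b.minimum > a.maximum
    far_apart = abs(a.mean - b.mean) > a.sd * a.t975 + b.sd * b.t975
    return no_overlap and far_apart


def count_diagnosable(variable_pairs):
    """Number of variables (one Sample pair each) that are diagnosable."""
    return sum(1 for a, b in variable_pairs if is_diagnosable(a, b))


def meets_benchmark(sympatric_pairs, allopatric_pairs):
    """True if the allopatric pair differs in at least as many variables
    as the sympatric benchmark pair."""
    return count_diagnosable(sympatric_pairs) <= count_diagnosable(allopatric_pairs)
```

Each list passed to `meets_benchmark` holds one pair of `Sample` objects per variable, for populations 1 and 2 (sympatric) and populations 3 and 4 (allopatric) respectively.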
Such a hard-edged statistical framework would go beyond the recommendations of
The
There is a relatively simple solution to this shortcoming: with continuous data, to move away from models which attribute cut-offs and instead to apply precise scoring under a system which only uses a hard cut-off at the very final point of determining species or subspecies rank. Such an approach was effectively attempted in a recent study (e.g.,
Immediately prior to going to press,
In this and the next section, a new, universal measure of differentiation is developed. It is potentially usable in any taxonomic group where continuous variables are studied and in other contexts to measure effect sizes.
Step 1: identify a comparison group.
For an assessment of the rank of allopatric populations, this method compares: (i) two sympatric and closely related populations which are demonstrably good species and broadly accepted as such (Species 1 and Species 2) as well as (ii) two allopatric populations under study (Population 3 and Population 4). Ideally, Species 1 and Species 2 should also be sister taxa or be known or suspected to be very closely related through molecular studies, such that they represent a good benchmark. However, this may not always be known for certain. Preferably, Species 1 and 2 and Populations 3 and 4 should all be congeneric, but this might not be possible and they might be merely a good example from the same family or order, depending on how speciose the relevant higher-level taxonomy is. Either (but not both) of Population 3 or Population 4 might be the same as Species 1 or Species 2 or they may all be different populations.
Step 2: collect data for relevant variables using continuous measurements.
It is critical to ensure a fair identification of variables, which adequately and honestly document the maximum possible observed variations between all populations (i.e., not just the allopatric pair, but also the sympatric pair). Variables differentiating sympatric Species 1 and 2 should not be overlooked, even if more time is spent studying allopatric Populations 3 and 4. Returning to the theme of taxonomic significance and not simply statistical significance, it is important that the variables under study are likely to be taxonomically relevant. Field experience or knowledge of the organisms concerned is important to avoid splits or lumps being published based on statistical tests applied to inappropriately selected variables.
Unlike in multivariate statistics, the technique presented here does not require each data set to have the same measures from the same individuals. This means that a biometric data set based on museum specimens and a vocal data set based on a different set of individuals and with different sampling can be combined, so that data from all possible sources can be collated. The broadest possible geographical and numerical sampling is important (e.g.,
Step 3: undertake pairwise comparisons using controlled unpooled effect sizes.
The following formula should be applied to measure controlled unpooled effect sizes on a pairwise basis, separately for each population/variable combination under study, e.g., for Species 1 and Species 2:
|(x̄1–x̄2)| / ¼[s1(t1 @ 97.5%) + s2(t2 @ 97.5%)]
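As an illustration of the formula above, the following sketch computes a controlled unpooled effect size from summary statistics. The t critical values are passed in as arguments (for example from printed tables or a statistics package) rather than computed, and the function name and example numbers are hypothetical.

```python
def controlled_unpooled_effect_size(mean1: float, sd1: float, t1: float,
                                    mean2: float, sd2: float, t2: float) -> float:
    """|mean1 - mean2| divided by one quarter of (sd1 * t1 + sd2 * t2).

    t1 and t2 are the one-sided 97.5% t critical values for the two samples.
    """
    denominator = 0.25 * (sd1 * t1 + sd2 * t2)
    if denominator <= 0:
        raise ValueError("standard deviations and t values must be positive")
    return abs(mean1 - mean2) / denominator


# Hypothetical samples with sd = 1 and t = 2.0 each.
print(controlled_unpooled_effect_size(10.0, 1.0, 2.0, 14.0, 1.0, 2.0))
print(controlled_unpooled_effect_size(10.0, 1.0, 2.0, 12.0, 1.0, 2.0))
```

Because of the ¼ factor, a difference between means exactly equal to s1·t1 + s2·t2 yields a value of 4, which corresponds to the "4 controlled effect sizes" associated with Level 5 diagnosability.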
Step 4: exclude all the statistically insignificant data.
Comparisons showing no statistical significance should be eliminated and scored as 0. This process needs conducting separately for each population/variable combination under study: a variable might be scored as zero as between Species 1 and Species 2, but may be scored positively as between Population 3 and Population 4. Bonferroni correction is applied here, in order to keep the formula simple and due to the near-nil impact of using less conservative “type 1” error corrections. It is recommended that different sets or “families” of data (biometric, vocal, colorimetric) are treated separately for purposes of determining the appropriate Bonferroni correction. Other more complex “type 1 error” corrections such as Dunn-Šidák should be considered for situations where very large numbers of variables are compared. The exclusion of statistically insignificant data results in the following modification to the effect size formula above, e.g., for Species 1 and Species 2:
p<0.05/nv → |(x̄1–x̄2)| / ¼[s1(t1 @ 97.5%) + s2(t2 @ 97.5%)]
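The significance filter can be sketched as below. The Bonferroni threshold follows the formula above; the Dunn-Šidák threshold uses the standard expression 1 − (1 − α)^(1/nv) and is included only for comparison. Function names and the example values are hypothetical.

```python
def bonferroni_threshold(n_variables: int, alpha: float = 0.05) -> float:
    """Per-variable significance threshold alpha / nv."""
    return alpha / n_variables


def sidak_threshold(n_variables: int, alpha: float = 0.05) -> float:
    """Dunn-Sidak per-variable threshold, slightly less conservative."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_variables)


def filtered_score(effect_size: float, p_value: float,
                   n_variables: int, alpha: float = 0.05) -> float:
    """Keep the effect size only if p passes the Bonferroni threshold, else 0."""
    if p_value < bonferroni_threshold(n_variables, alpha):
        return effect_size
    return 0.0


# Hypothetical family of 10 variables: threshold is 0.005.
print(filtered_score(2.5, 0.004, 10))   # retained
print(filtered_score(2.5, 0.006, 10))   # scored as zero
print(bonferroni_threshold(10), sidak_threshold(10))
```

The two thresholds differ only marginally for families of this size, which is consistent with the statement that less conservative corrections have little impact.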
Step 5: add up all the results of the above calculations (using a Euclidean approach).
It would be simple then to add up all the effect sizes, as follows, and see whether Species 1 vs Species 2 or Population 3 vs Population 4 had the better score. This would apply the formula:
∑ [p<0.05/nv → |(x̄1–x̄2)| / ¼[s1(t1 @ 97.5%) + s2(t2 @ 97.5%)]]
≤
∑ [p<0.05/nv → |(x̄3–x̄4)| / ¼[s3(t3 @ 97.5%) + s4(t4 @ 97.5%)]]
However, this would be sub-optimal statistically. Applying such a formula would reflect the underlying conceptual approach of existing systems to rank allopatric populations (including
(a1, a2, a3, a4 … an) and (b1, b2, b3, b4 … bn),
then Pythagorean principles result in the following calculation of distance between points a and b in multi-dimensional space:
√[(a1–b1)² + (a2–b2)² + (a3–b3)² … + (an–bn)²]
And this can be simplified to:
√(∑ (an–bn)²)
This approach cannot perfectly be applied to a series of effect size measures based on multiple pairwise comparisons, in that such data are not necessarily linked to one another as a set of corresponding coordinates. However, assuming that the variables studied are independent, it is valid to measure distance this way. Independence of variables can be verified through correlation tests and promoted in variable selection by seeking to capture the maximum possible observed variation efficiently.
Each controlled unpooled effect size (that has not been eliminated to zero using the statistical significance filter) can be considered to represent the equivalent of a distance |an-bn|. The distance in multi-dimensional space between the two populations is better approximated than through simple addition by taking the square of each controlled unpooled effect size (which has not been excluded due to non-significance), adding those up, and then calculating the square root of the sum of all of them.
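The difference between simple addition and the Euclidean approach can be seen in a small sketch with made-up effect sizes. The function names are illustrative only.

```python
import math


def simple_sum(scores):
    """Straight addition of effect sizes."""
    return sum(scores)


def euclidean_score(scores):
    """Square root of the sum of squared effect sizes."""
    return math.sqrt(sum(s * s for s in scores))


two_strong = [3.0, 4.0]      # two well differentiated variables
nine_weak = [1.0] * 9        # nine weakly differentiated variables

print(simple_sum(two_strong), euclidean_score(two_strong))
print(simple_sum(nine_weak), euclidean_score(nine_weak))
```

Under simple addition, the nine weak differences outscore the two strong ones. Under the Euclidean measure the ranking reverses, so an accumulation of many small differences carries less weight than a few large ones.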
In studies using continuous variables, allopatric populations should be ranked as species if they show equal to or greater variation than that shown between closely related sympatric species (
Viz, an allopatric population will be a candidate for species rank if:
√(∑ [p<0.05/nv → |(x̄1–x̄2)| / ¼[s1(t1 @ 97.5%) + s2(t2 @ 97.5%)]]²)
≤
√(∑ [p<0.05/nv → |(x̄3–x̄4)| / ¼[s3(t3 @ 97.5%) + s4(t4 @ 97.5%)]]²)
Where:
Species 1 and Species 2 are two sympatric species that are closely related to one another (preferably known to be sisters) and which are related to Population 3 and Population 4.
Population 3 and Population 4 are two allopatric populations whose rank is being determined.
p: the p value from Welch’s unequal variance t-test (or another similar technique for non-normally distributed data), as set out under Level 1 in Methods, testing whether the means of the populations differ.
nv: the number of continuous variables of a particular “family” considered in the study, so as to apply a Bonferroni correction.
1, 2, 3, and 4 refer to relevant data for Species 1, Species 2, Population 3 and Population 4 respectively.
x̄1, x̄2, x̄3, and x̄4 are the sample means of a relevant data set for a particular variable for Species 1, Species 2, Population 3 and Population 4, respectively.
s1, s2, s3, and s4 are the standard deviations for a relevant data set for a particular variable for Species 1, Species 2, Population 3 and Population 4, respectively.
t refers to the t-value (based on t-distribution) using a one-sided confidence interval at the percentage specified for the relevant population and variable, with t1, t2, t3, and t4 referring to such value for Species 1, Species 2, Population 3 and Population 4, respectively.
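Putting the steps together, the whole criterion can be sketched from summary statistics as follows. This is a hedged illustration rather than the published spreadsheet: the class and function names are hypothetical, the t critical values and p values are supplied as inputs, and the example numbers are invented.

```python
import math
from dataclasses import dataclass


@dataclass
class VariableComparison:
    """Summary statistics for one variable compared between two populations."""
    mean_a: float
    sd_a: float
    t_a: float      # one-sided 97.5% t critical value, population a
    mean_b: float
    sd_b: float
    t_b: float      # one-sided 97.5% t critical value, population b
    p_value: float  # e.g. from Welch's unequal variance t-test

    def effect_size(self) -> float:
        return abs(self.mean_a - self.mean_b) / (
            0.25 * (self.sd_a * self.t_a + self.sd_b * self.t_b))


def overall_score(comparisons, n_variables: int, alpha: float = 0.05) -> float:
    """Euclidean sum of effect sizes that pass the Bonferroni filter."""
    threshold = alpha / n_variables
    total = sum(c.effect_size() ** 2 for c in comparisons if c.p_value < threshold)
    return math.sqrt(total)


def is_candidate_for_species_rank(sympatric, allopatric, n_variables: int) -> bool:
    """True if the allopatric pair scores at least as high as the sympatric pair."""
    return overall_score(sympatric, n_variables) <= overall_score(allopatric, n_variables)


# Invented data: two variables per pair, so nv = 2 and the threshold is 0.025.
species_1_vs_2 = [VariableComparison(10, 1, 2.0, 14, 1, 2.0, 1e-6),
                  VariableComparison(5, 1, 2.0, 5.5, 1, 2.0, 0.3)]
population_3_vs_4 = [VariableComparison(10, 1, 2.0, 13, 1, 2.0, 1e-5),
                     VariableComparison(5, 1, 2.0, 9, 1, 2.0, 1e-4)]

print(overall_score(species_1_vs_2, 2))
print(overall_score(population_3_vs_4, 2))
print(is_candidate_for_species_rank(species_1_vs_2, population_3_vs_4, 2))
```

In this invented example the sympatric pair scores 4 (its second variable is filtered out as non-significant) and the allopatric pair scores 5, so the allopatric populations would be candidates for species rank.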
Because this formula is not simple to calculate, a spreadsheet is being published alongside this paper on the author’s researchgate.net site, which facilitates rapid calculations.
In some ways, when one looks carefully at the outcomes of different statistical tests undertaken here, the formula is a statement of the obvious. The statistics underlying it are basic. It merely relies upon the good aspects of long-established statistical methods for comparing continuous variables of previous authors (e.g.,
As regards
Some important recommendations should be borne in mind when using this method, in addition to those set out above under “Steps”:
(i) Continuous versus non-continuous variables: Some issues arose in the case studies here due to data gaps. In the studies of Sirystes and Basileuterus, non-homologous vocalizations were not compared with one another. There, populations with different measures for the same sorts of variables are recovered as more differentiated than those populations whose variables cannot validly be compared at all. Diagnosability based on the comparison of non-homologous vocalizations can be important, but it can also have pitfalls (notably, Chaves et al. 2000 claimed discrete variation in calls to claim sufficient differentiation under the
(ii) Sample sizes: If either Species 1 or Species 2 has a very low sample size, then: (a) for many variables under study, data may not meet the threshold test of statistical significance; and (b) those which do could be affected by low standard deviations and inflated effect sizes caused by clustering. These issues apply in reverse where Population 3 or Population 4 suffers from such constraints. Although the test above is in principle sample size-neutral, caution should be exercised in interpreting results based on smaller samples (see Myrmeciza biometrics discussion below and Table
(iii) Scale factoring and manipulation through overloading: Where there are 15 vocal measures and 5 biometric measures, it should be considered whether to weight scoring on a 50:50 basis, following
(iv) Not going over the top: The formula presented here is proposed for usage in more difficult, borderline, or complicated cases. Where simpler studies can show allopatric populations or newly discovered populations to be very different indeed from one another, then there should be no need for a litany of statistical analyses to be undertaken. It should be appropriate in some cases simply for an author to publish photographs of specimens or sonograms or a brief subjective text to describe the differences observed. A good example of a situation in this category would be the allopatric Western and Eastern Woodhaunters Automolus virgatus and A. subulatus, whose vocalizations resemble one another not one iota (
(v) Possible usage of controlled pooled effect size: The formula above does not use pooled standard deviations and so makes no assumptions about the comparability of the variances of the different populations under study. As discussed above, there may be use cases for controlled pooled effect size, especially as a hedge for small sample sizes, but this should be applied only with caution. In any case where equal SD may be assumed among all four populations 1, 2, 3 and 4, a more complicated formula using controlled pooled standard deviations might be used instead (see Materials and methods for details of equations that may be substituted in).
Myrmeciza was chosen here as an example because the recommendations of the relevant paper (
Table
Worked-through example of the new formula proposed herein, assessing the rank of two allopatric Myrmeciza antbirds (M. immaculata vs M. zeledoni) by comparison to a pair of sympatric sister taxa in the same genus (M. goeldii vs. M. melanoceps), using vocal data only.
Variable | M. goeldii vs. M. melanoceps: controlled unpooled effect size | M. goeldii vs. M. melanoceps: p value | M. goeldii vs. M. melanoceps: score | M. immaculata vs. M. zeledoni: controlled unpooled effect size | M. immaculata vs. M. zeledoni: p value | M. immaculata vs. M. zeledoni: score |
---|---|---|---|---|---|---|
Male song | | | | | | |
No. of notes | 1.41 | 5.2 × 10⁻²⁹ | 1.41 | 4.54 | 6.92 × 10⁻³³ | 4.54 |
Song length | 0.37 | 0.0015 | 0.37 | 0.71 | 0.0022 | 0 |
Song speed | 2.05 | 3.9 × 10⁻⁴⁹ | 2.05 | 7.65 | 1.03 × 10⁻⁸⁹ | 7.65 |
Max. acoustic frequency second note | 1.03 | 2.7 × 10⁻¹⁶ | 1.03 | 3.30 | 7.20 × 10⁻¹⁸ | 3.30 |
Max. acoustic frequency of last note | 1.31 | 3.9 × 10⁻²³ | 1.31 | 2.88 | 2.03 × 10⁻¹⁵ | 2.88 |
Change in acoustic frequency | 0.52 | 3.7 × 10⁻⁵ | 0.52 | 1.36 | 2.62 × 10⁻⁸ | 1.36 |
Position of peak of frequency | 4.75 | 4.5 × 10⁻⁷⁴ | 4.75 | 0.08 | 0.76 | 0 |
Position of trough in frequency | 3.83 | 9.5 × 10⁻⁵⁵ | 3.83 | 0.03 | 0.91 | 0 |
Single note call | | | | | | |
Call length | 0.91 | 0.00270 | 0 | 0.99 | 0.040 | 0 |
Maximum acoustic frequency | 0.99 | 0.00103 | 0.99 | 0.37 | 0.16 | 0 |
Multi-note call | | | | | | |
No. of notes | 0.11 | 0.770 | 0 | 0.01 | 0.99 | 0 |
Song length | 0.06 | 0.880 | 0 | 0.22 | 0.57 | 0 |
Song speed | 0.21 | 0.494 | 0 | 0.21 | 0.56 | 0 |
Max. acoustic frequency | 0.31 | 0.371 | 0 | 1.71 | 0.00049 | 1.71 |
Min. acoustic frequency | 0.19 | 0.529 | 0 | 2.29 | 0.00013 | 2.29 |
Change in acoustic frequency | 0.24 | 0.447 | 0 | 0.45 | 0.25 | 0 |
Position of peak of frequency | 0.49 | 0.396 | 0 | 0.65 | 0.12 | 0 |
Position of trough in frequency | 0.02 | 0.946 | 0 | 0.12 | 0.74 | 0 |
Female song | | | | | | |
No. of notes | 0.78 | 0.0073 | 0 | 5.52 | 2.41 × 10⁻¹⁵ | 5.52 |
Song length | 0.60 | 0.034 | 0 | 1.00 | 0.023 | 0 |
Song speed | 1.18 | 7.07 × 10⁻⁵ | 1.18 | 7.00 | 1.50 × 10⁻¹⁹ | 7.00 |
Max. acoustic frequency second note | 0.61 | 0.031 | 0 | 1.62 | 0.0034 | 0 |
Max. acoustic frequency of last note | 1.14 | 0.00020 | 1.14 | 2.61 | 3.86 × 10⁻⁷ | 2.61 |
Change in acoustic frequency | 0.23 | 0.40 | 0 | 0.71 | 0.18 | 0 |
Position of peak of frequency | 0.04 | 0.89 | 0 | 0.96 | 0.11 | 0 |
Position of trough in frequency | 0.27 | 0.33 | 0 | 0.47 | 0.41 | 0 |
Euclidean distance (square root of sum of the squares) | | | 7.09 (7.14 using data to more s.f.) | | | 13.95 (13.75 using data to more s.f.) |
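The totals in the final row of the table can be checked with the short sketch below, which uses the two-decimal effect sizes and p values exactly as printed. It assumes that all 26 vocal variables are treated as a single family, giving a Bonferroni threshold of 0.05/26; the text does not state nv for this example, although that value reproduces every entry in the Score columns. Variable names are illustrative.

```python
import math

N_VARIABLES = 26                  # assumption: one family of 26 vocal variables
THRESHOLD = 0.05 / N_VARIABLES    # Bonferroni-corrected significance level

# (effect size, p value) for each variable, in the order of the table rows.
GOELDII_VS_MELANOCEPS = [
    (1.41, 5.2e-29), (0.37, 0.0015), (2.05, 3.9e-49), (1.03, 2.7e-16),
    (1.31, 3.9e-23), (0.52, 3.7e-5), (4.75, 4.5e-74), (3.83, 9.5e-55),
    (0.91, 0.00270), (0.99, 0.00103),
    (0.11, 0.770), (0.06, 0.880), (0.21, 0.494), (0.31, 0.371),
    (0.19, 0.529), (0.24, 0.447), (0.49, 0.396), (0.02, 0.946),
    (0.78, 0.0073), (0.60, 0.034), (1.18, 7.07e-5), (0.61, 0.031),
    (1.14, 0.00020), (0.23, 0.40), (0.04, 0.89), (0.27, 0.33),
]
IMMACULATA_VS_ZELEDONI = [
    (4.54, 6.92e-33), (0.71, 0.0022), (7.65, 1.03e-89), (3.30, 7.20e-18),
    (2.88, 2.03e-15), (1.36, 2.62e-8), (0.08, 0.76), (0.03, 0.91),
    (0.99, 0.040), (0.37, 0.16),
    (0.01, 0.99), (0.22, 0.57), (0.21, 0.56), (1.71, 0.00049),
    (2.29, 0.00013), (0.45, 0.25), (0.65, 0.12), (0.12, 0.74),
    (5.52, 2.41e-15), (1.00, 0.023), (7.00, 1.50e-19), (1.62, 0.0034),
    (2.61, 3.86e-7), (0.71, 0.18), (0.96, 0.11), (0.47, 0.41),
]


def overall_score(rows, threshold=THRESHOLD):
    """Euclidean distance over the effect sizes whose p value passes the filter."""
    kept = [effect for effect, p in rows if p < threshold]
    return math.sqrt(sum(e * e for e in kept))


print(round(overall_score(GOELDII_VS_MELANOCEPS), 2))
print(round(overall_score(IMMACULATA_VS_ZELEDONI), 2))
```

With the rounded inputs this gives 7.09 and 13.95, matching the first figure in each cell of the final row; the figures quoted "to more s.f." cannot be reproduced from the printed values.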
Table
Full scores across the Myrmeciza data set for vocal data only. Bold denotes a sympatric pair of sister taxa. Bold italics denote other sympatric pairs. An asterisk denotes pairs which are subspecies based on overall scoring and discrete characters (see also Table
Taxon | M. i. concepcion | M. z. macrorhyncha | M. z. zeledoni | M. fortis | M. goeldii | M. melanoceps |
---|---|---|---|---|---|---|
M. i. immaculata | 3.56* | 12.22 | 11.23 | 28.00 | 18.15 | 14.60 |
M. i. concepcion | | 13.75 | 15.64 | 19.51 | 21.49 | 15.50 |
M. z. macrorhyncha | | | 4.08* | 23.74 | 21.13 | 17.85 |
M. z. zeledoni | | | | 28.45 | 28.15 | 24.39 |
M. fortis | | | | | 22.81 | 12.73 |
M. goeldii | | | | | | 7.14 |
Biometric scores are shown in Table
Scores across the Myrmeciza data set for biometrics. All goeldii scores (sample size n=2 specimens) were actually zero. Square-bracketed figures shown alongside M. goeldii are based on controlled effect sizes without deleting insignificant data and are presented for reference only. Bold denotes a sympatric pair of sister taxa. Bold italics denote other sympatric pairs. An asterisk denotes subspecies based on overall scoring (see also Table
Taxon | M. i. concepcion | M. z. macrorhyncha | M. z. zeledoni | M. fortis | M. goeldii | M. melanoceps |
---|---|---|---|---|---|---|
M. i. immaculata | 0* | 3.02 | 1.72 | 3.35 | [5.36] | 5.96 |
M. i. concepcion | | 2.49 | 0 | 3.01 | [5.13] | 5.72 |
M. z. macrorhyncha | | | 1.64* | 2.22 | [3.01] | 5.07 |
M. z. zeledoni | | | | 2.54 | [4.59] | 5.35 |
M. fortis | | | | | [3.47] | 3.58 |
M. goeldii | | | | | | [3.52] |
As noted above, although a universal formula is proposed here, no universal score is proposed for ranking species, since the differentiation required to rank a species is likely to vary depending on the number of variables studied and by taxonomic group (
There are however some parameters and examples available from the case studies (Table
Scores of examples from the data set which are both (i) sister species (or relevant sympatric subspecies of sister species) as shown by molecular studies; and (ii) sympatric, to show ranges of scores. Also presented are examples of scores passing other authors’ species tests. Note the
Sympatric pair or proposed score for species rank | Type of data | Score |
---|---|---|
Myrmeciza goeldii vs Myrmeciza melanoceps | Voice | 7.14 |
Scytalopus griseicollis griseicollis vs. Scytalopus spillmanni undescribed East Andes population | Voice + biometrics = total | 9.16 + 0 = 9.16 |
Scytalopus griseicollis gilesi vs. Scytalopus spillmanni undescribed East Andes population | Voice + biometrics = total | 10.59 + 0 = 10.59 |
Scytalopus griseicollis morenoi vs. Scytalopus spillmanni undescribed East Andes population | Voice + biometrics = total | 8.79 + 0 = 8.79 |
Grallaricula ferrugineipectus Venezuela vs G. nana nanitaea Merida Andes | Voice + biometrics = total | 7.90 + 8.01 = 15.91 |
Average ± s.d. (min.–max.) (n=sample number) | | 10.32 ± 3.36 (7.14–15.91) (n=5) |
Basis for | Voice: diagnosability of three characters = 3 × 4 SD | 6.92 |
A basis for | Voice or biometrics: 1 × 10 SD (score 4), 1 × 5 SD (score 3) | 11.18 |
A basis for | Voice and biometrics: 1 × 10 SD (score 4), 1 × 2 SD (score 2), 1 × 0.2 SD (score 1) | 10.20 |
A basis for | Voice and biometrics: 1 × 10 SD (score 4), 3 × 0.2 SD (score 1 each) | 10.01 |
A basis for | Voice and biometrics: 2 × 5 SD (score 3 each), 1 × 0.2 SD (score 1) | 7.07 |
A basis for | Voice and biometrics: 1 × 5 SD (score 3), 2 × 2 SD (score 2 each) | 5.74 |
A basis for | Voice and biometrics: 1 × 5 SD (score 3), 1 × 2 SD (score 2), 2 × 0.2 SD (score 1 each) | 5.39 |
A basis for | Voice and biometrics: 3 × 2 SD (score 2 each), 1 × 0.2 SD (score 1) | 3.47 |
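The benchmark scores in the lower part of the table appear to follow from applying the Euclidean measure directly to the stated SD differences. The sketch below checks this under that assumption; the labels and the function name are illustrative.

```python
import math


def euclidean_score(effect_sizes):
    """Square root of the sum of squared effect sizes."""
    return math.sqrt(sum(e * e for e in effect_sizes))


# Combinations of SD differences as listed in the table.
benchmarks = {
    "3 x 4 SD": [4, 4, 4],
    "10 SD + 5 SD": [10, 5],
    "10 SD + 2 SD + 0.2 SD": [10, 2, 0.2],
    "10 SD + 3 x 0.2 SD": [10, 0.2, 0.2, 0.2],
    "2 x 5 SD + 0.2 SD": [5, 5, 0.2],
    "5 SD + 2 x 2 SD": [5, 2, 2],
    "5 SD + 2 SD + 2 x 0.2 SD": [5, 2, 0.2, 0.2],
    "3 x 2 SD + 0.2 SD": [2, 2, 2, 0.2],
}

for label, sizes in benchmarks.items():
    print(f"{label}: {euclidean_score(sizes):.2f}")
```

All values agree with the table when rounded to two decimal places, except the first, where √48 ≈ 6.928 rounds to 6.93 against the 6.92 printed.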
As regards subspecies or PSC species, any score of 4 or more (i.e., allowing full diagnosis in multidimensional space) would be a supportable benchmark. There may however be cases of valid subspecies which achieve lower scores than this, such as in the Adelomyia melanogenys study where the pair discussed above scored only 2.10 for voice and 0 for biometrics, based on a fairly exhaustive attempt at measuring biometric and vocal variables. However, this is probably an exceptionally low-scoring example.
Among the un-named populations in the study group, only the “Apurímac south” population of Basileuterus tristriatus in Peru was recovered as diagnosable versus all proximate subspecies (6.33 versus Marañon to Apurímac population, 14.79 versus Bolivia) and therefore requires formal description. Other notable unnamed populations include the Tamá population of Grallaricula nana (scores 3.32 versus Mérida) and the West Andes populations of the same species (scores 2.86 against Central Andes). The two new Grallaricula taxa described in
Probably the most difficult taxonomic decision in this series of papers was that of how to rank Scytalopus rodriguezi, whose allopatric subspecies scored 5.40 for biometrics and 5.01 for voice (total 10.41) and so were more diagnosable than some sympatric tapaculos. A large component of this score (compared to sympatric pairs) was in biometrics and the two populations were found to respond to one another’s playback. In borderline cases such as this, where different kinds of variables differ between the sympatric pair and allopatric pair, scale factoring may be appropriate. Voice is a very important character for tapaculos and in this case, the vocal score, whilst showing full diagnosis, did not attain the differentiation shown between known sympatric comparators.
The
The method proposed here involves no universal score for species rank. However, it would still be interesting to see how other pairwise situations involving sympatric sister species measure up under this system, and then possibly to revisit the philosophy underlying
Thanks to Michael Patten, Mort Isler, and George Sangster for helpful comments on this paper. The inclusion of a Bonferroni correction in the Level 1 test was suggested by F. Gary Stiles (in litt. 2007). Michael Patten made various insightful comments which resulted in addition of all the analyses and methods involving pooled standard deviation data and the various additional “type 1” corrections. Some of the methods underlying this paper were developed in papers published together with Jorge Avendaño, Paul Salaman, and others. Thanks as always to Blanca and Lucas for their love, good cheer, and encouragement.