Research Article 
Corresponding author: Thomas M. Donegan ( thomasdonegan@yahoo.co.uk ) Academic editor: Yasen Mutafchiev
© 2018 Thomas M. Donegan.
This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation:
Donegan TM (2018) What is a species? A new universal method to measure differentiation and assess the taxonomic rank of allopatric populations, using continuous variables. ZooKeys 757: 167. https://doi.org/10.3897/zookeys.757.10965

Existing models for assigning species, subspecies, or no taxonomic rank to populations which are geographically separated from one another were analyzed. This was done by subjecting over 3,000 pairwise comparisons of vocal or biometric data based on birds to a variety of statistical tests that have been proposed as measures of differentiation. One current model which aims to test diagnosability (
diagnosis, species limits, species scoring, statistics, subspecies limits, taxonomy
This paper aims to help address the “allopatric problem” when determining species rank in taxonomic science. Humans have categorized populations into named groups since the dawn of known civilization (
Sympatric species, which occur together in the same place during the breeding season but do not successfully interbreed to any material extent, are demonstrably real. With enough data and persistence, it is usually possible to determine whether or not sympatric populations interbreed regularly and whether they produce fertile offspring (
A traditionally more difficult problem, and the focus of this paper, is that of “allopatric” (
The subjectivity involved in comparing allopatric species and the rise of molecular science have doubtless encouraged the development of a multitude of different species criteria or concepts. As noted by
Whilst statistical and mathematical techniques to analyze molecular data have been a rich field for methodological advancement, the same cannot be said for the study of real-world variables. Supportable statistical schemes for assessing between-population differentiation are noteworthy principally by their absence. Those schemes which have been proposed are either widely criticized, only applicable to particular taxonomic groups, or vague.
This paper will concentrate on the traditional currency of taxonomy: continuous variables such as those based on measurement of specimens, whether in the museum or in the field. Many researchers and advanced amateurs do not have a molecular laboratory available and few genera have been exhaustively sampled in a way that includes multiple individuals at population level. In contrast, vocal and biometric data are easy to collate, accessible to many and cheaper to analyze. A wide variety of other ‘real world’ organism characters are capable of measurement as continuous variables. For vocalizations, lengths or acoustic frequencies of notes can be measured using sonograms, for example. Coloration can be measured using spectrometry. Noncontinuous or discrete variables, e.g., presence or absence of a particular character and molecular markers, can be analyzed best using cladistics and other phylogenetic tools and are not covered here in detail.
When species rank is assessed across a taxonomic group as a whole, consistency is a virtue. Under a biological species concept-based approach, attaining such consistency will require a determination of which allopatric populations have differentiated to the same extent as related sympatrics and which have not. Those that have so differentiated are species; those that have not are, at most, subspecies. Unfortunately, consistency is not attained in current classifications, especially as regards more diverse tropical faunas. This is generally due to discrepancies in available data, differences in how regularly different genera are revised, and differences in approaches by regional committees or textbook authorities to studies using different taxonomic methods (e.g., molecular vs. morphological) (
Neither of these two splits is problematic from a phylogenetic species concept or “enthusiastic splitter” perspective in isolation; and further studies could give stronger support to these treatments. However, based on my experience of working with birds in the Neotropics, the benchmark applied to these situations would result in the specific recognition of probably several thousands of current subspecies or unnamed taxa occurring in that region.
The
In light of the difficulties with scoring “systems” and other developments, Halley et al. (2017) have argued for a return to monophyly and essentially
Over the last 20 years, I have studied the taxonomy of birds in Colombia using biometric data (from mist-netting and museums) and sound recordings. This work produced a large amount of data relevant to studying differentiation. It has become apparent to me that steps might be taken towards resolving some of these seemingly intractable fundamental disagreements by developing an objective and agreeable basis, grounded in scientific method, statistics, the analysis of large data sets and traditional biological species concept thinking, that could be used to assess the rank of allopatric populations better, more consistently and more rationally. Ultimately, the aim of this study is to provide a robust, objective and universal method to address the centuries-old question (unresolved since
In the present study, I took a large data set that had been developed for purposes of various particular taxonomic studies of birds (citations below) and used this to road-test proposed and possible alternative statistical tests for measuring differentiation or diagnosis, with the intention of studying outcomes of tests in order to inform recommendations.
I compiled vocal and biometric data from multiple studies, including of representatives of the three major assemblages of birds: nonpasserines (three families), suboscine passerines (four families), and oscine passerines (two families) (citations in Tables
Order: Family  Genus  No. taxa / populations  No. spp. before review  No. spp. after review  No. continuous vocal variables  No. Pairwise tests omitted  Pairwise comparisons  Sample sizes (mean ± s.d.) (min–max)  Reference 

Columbiformes: Columbidae  Geotrygon  2  1  2  2  0  2  22.0 ± 4.6 (18–26) 

Apodiformes: Trochilidae  Adelomyia  2  1  1  10  0  10  15.7 ± 1.5 (14–18)  Donegan and Avendaño (2015) 
Piciformes: Bucconidae  Hypnelus  2  1  2  5  0  5  5.5 ± 1.6 (4–7) 

Passeriformes: Thamnophilidae  Myrmeciza  8  4  5  26  114  614  42.7 ± 49.2 (3–179) 

Passeriformes: Grallariidae  Grallaricula  10  1  2  14  224  406  18.2 ± 12.9 (3–63) 

Passeriformes: Rhinocryptidae  Scytalopus 1  8  3  3  12  0  336  23.0 ± 13.8 (4–57) 

Passeriformes: Rhinocryptidae  Scytalopus 2  2  1  1  7  0  7  14.9 ± 2.2 (12–17) 

Passeriformes: Tyrannidae  Sirystes  4  1  4  18  64  44  39.1 ± 41.5 (3–146) 

Passeriformes: Parulidae  Basileuterus  13  3  6  19  558  924  25.5 ± 19.0 (2–78) 

TOTALS  51  16  26  113  960  2348  29.0 ± 32.5 (2–179) 
Order: Family  Genus  No. taxa / populations  No. spp. before review  No. spp. after review  No. continuous biometric variables  No. Pairwise tests omitted  Pairwise comparisons  Sample sizes (mean ± s.d.) (min.–max.)  Reference 

Apodiformes: Trochilidae  Adelomyia  2  1  1  4  0  4  9.6 ± 3.2(6–13)  Donegan and Avendaño (2015) 
Passeriformes: Thamnophilidae  Myrmeciza  7  4  5  5  18  87  21.1 ± 19.5(2–65) 

Passeriformes: Grallariidae  Grallaricula  11  1  3  6  49  281  12.4 ± 9.3(3–37) 

Passeriformes: Rhinocryptidae  Scytalopus 1  8  4  3  5  31  109  8.7 ± 7.4(2–24) 

Passeriformes: Rhinocryptidae  Scytalopus 2  2  1  1  5  0  5  4.9 ± 2.3(3–9) 

Passeriformes: Thraupidae  Anisognathus  10  2  2  5  39  186  25.1 ± 34.7(4–214) 

Passeriformes: Parulidae  Basileuterus  9  3  5  5  30  150  15.4 ± 10.9(2–42) 

TOTALS  49  16  20  35  167  822  15.5 ± 19.1(2–214) 
Vocal variables always included measures of maximum acoustic frequency, length, number of notes and speed. In some studies, change in pace, minimum frequencies, frequencies of particular notes, note bandwidth, changes in acoustic frequency and position of peaks or troughs of frequency within a vocalization, or any of the same measures for particular parts of vocalizations, were also measured. In each study, the variables under study were designed so as to document as fully as possible observed subjective differences between populations. Biometric variables were in all cases wing, tail, tarsus and bill length and mass, except for Trochilidae (no tarsus length) and Grallariidae (where bill width was additionally measured). Note shape and other subjective vocal characters were also studied, as were plumages. However, information on noncontinuous variables was discarded for purposes of this present study.
Pairwise comparisons were undertaken on a matrix basis of each population against each other population. Some pairwise tests were omitted due to lack of data for a particular population, i.e., where there were n < 2 recordings of a particular type of vocalization (which could represent either a sampling gap or genuine lack of delivery of such vocalization by the population in question); or n < 2 specimens of the population available in museums that were studied. In such cases, where n < 2, standard deviations could not be calculated and t-tests could not be run, so the comparison was excluded to ensure full comparability between all tests applied.
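The matrix approach above implies n(n–1)/2 population pairs per variable, less any omitted tests. The following is a minimal sketch of that count, with an illustrative function name that does not come from the author's spreadsheet; the inputs are the Scytalopus 1 and Grallaricula rows of the vocal data table.

```python
from itertools import combinations


def count_pairwise_comparisons(n_populations, n_variables, n_omitted=0):
    """Population/variable combinations tested: pairs x variables - omissions."""
    n_pairs = sum(1 for _ in combinations(range(n_populations), 2))
    return n_pairs * n_variables - n_omitted


# Scytalopus 1 (voice): 8 populations, 12 variables, no omissions.
print(count_pairwise_comparisons(8, 12))        # 336
# Grallaricula (voice): 10 populations, 14 variables, 224 tests omitted.
print(count_pairwise_comparisons(10, 14, 224))  # 406
```

Both results agree with the "Pairwise comparisons" column of the vocal data table.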
The data set was not designed for the study of statistical tests used in taxonomy, since this study had not been conceived at the time of data collection. The choice of taxonomic groups was not based only on studies which include among their components sympatric pairs (cf.
Several statistical tests were applied multiple times on a pairwise basis using a Microsoft Excel spreadsheet devised by the author for rapid assessment of multiple pairwise statistical tests across multiple populations. This spreadsheet is being published on the author’s researchgate.net page, and should assist authors in better and more swiftly analyzing diagnosability in future studies. Calculations, described below, were undertaken to measure interpopulation differences in the context of various species and subspecies concepts.
First, the entire data set was subjected to various proposed tests of species or subspecies rank. In the formulae used below, x̄_{1} and s_{1} are the sample mean and standard deviation of Population 1; x̄_{2} and s_{2} refer to the same parameters in Population 2; and the t value uses a one-sided confidence interval at the percentage specified for the relevant population and variable, with t_{1} referring to Population 1 and t_{2} referring to Population 2.
LEVEL 1: Welch’s t-test at p<0.05/n_{v}, i.e., applying a Bonferroni correction, where n_{v} is the number of variables studied. An unequal variance (Welch’s) t-test was used. This is preferable to other t-tests in that it does not assume that the two populations have equal SDs. For vocal data potentially based on ratios, such as song speed, a two-sample Kolmogorov-Smirnov test can be applied instead to account for the possibility of a non-normal distribution. However, in order to standardize the study outputs, only Welch’s t-test was applied here.
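As a rough illustration of the Level 1 calculation, the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library. It is not the author's spreadsheet; the function name and the sample values are hypothetical. Converting t and the degrees of freedom into a p-value needs a t-distribution function from a statistics package, which is left out here.

```python
import math
from statistics import mean, stdev


def welch_t(sample_1, sample_2):
    """Return (t, df) for an unequal-variance (Welch's) t-test."""
    n1, n2 = len(sample_1), len(sample_2)
    v1 = stdev(sample_1) ** 2 / n1  # squared standard error, Population 1
    v2 = stdev(sample_2) ** 2 / n2  # squared standard error, Population 2
    t = (mean(sample_1) - mean(sample_2)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical measurements for two populations (not taken from the study).
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t, 3), round(df, 3))
```

Each sample needs n ≥ 2 and the two SDs must not both be zero, which mirrors the exclusions described for this data set.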
When applying tests of statistical significance across multiple variables for the same pair, there is a risk of so-called “type 1” errors (false positives) occurring. If testing at p < 0.05 for 100 independent variables of two populations that do not in fact differ, 5 variables would be expected to meet the requirements of the relevant test by chance alone. Various methods which purport to reduce the risk of “type 1” errors were tested. First, Bonferroni corrections were applied based on each of: (i) the total number of variables studied for the pair as a whole; (ii) separately for two “families” of vocal versus biometric variables; and (iii) separately for each different kind of vocalization, where applicable. Applying a Bonferroni correction to a study involving five variables, p < (0.05/5) = 0.01 is the corrected significance threshold. Dunn-Šidák is a widely used but slightly less conservative alternative to Bonferroni and was also applied to all three of the same situations in order to examine the impacts and outcomes of using alternative corrections.
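The two corrections can be sketched in a few lines, assuming the standard definitions: Bonferroni divides alpha by the number of tests, and Dunn-Šidák uses 1 – (1 – alpha)^(1/m). The function names are illustrative only.

```python
def bonferroni_threshold(alpha, n_tests):
    """Corrected significance threshold under Bonferroni."""
    return alpha / n_tests


def dunn_sidak_threshold(alpha, n_tests):
    """Corrected significance threshold under Dunn-Sidak."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


# Five variables, as in the example in the text.
print(round(bonferroni_threshold(0.05, 5), 5))   # 0.01
print(round(dunn_sidak_threshold(0.05, 5), 5))   # 0.01021
```

The Dunn-Šidák threshold is always marginally higher than the Bonferroni threshold for the same number of tests, which is why it is the less conservative of the two.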
LEVEL 2: a ‘50%/95%’ test, following one of
(x̄_{1}–x̄_{2}) > (s_{1}(t_{1 @ 97.5%}) + s_{2}(t_{2 @ 97.5%}))/2
LEVEL 3: The traditional ‘75% / 99+%’ test for subspecies (
(x̄_{1}–x̄_{2}) > s_{1}(t_{1 @ 99%}) + s_{2}(t_{2 @ 75%}) and
(x̄_{2}–x̄_{1}) > s_{2}(t_{2 @ 99%}) + s_{1}(t_{1 @ 75%})
LEVEL 4: diagnosability based on nonoverlap of recorded values (the first part of
LEVEL 5: ‘Full’ diagnosability (where sample means are four average SDs apart at the 95% level, controlling for sample size) the second part of
(x̄_{1}–x̄_{2}) > s_{1}(t_{1 @ 97.5%}) + s_{2}(t_{2 @ 97.5%})
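The Level 2 to Level 5 formulae can be sketched as below. This is an illustration under stated assumptions rather than the author's spreadsheet: function names are hypothetical, the difference between means is taken as an absolute value so that population order does not matter, and one-sided t-values are supplied by the caller from standard tables because the Python standard library has no t-distribution quantile function.

```python
def level_2(mean_1, sd_1, t1_975, mean_2, sd_2, t2_975):
    """'50%/95%' test: half of the Level 5 threshold."""
    return abs(mean_1 - mean_2) > (sd_1 * t1_975 + sd_2 * t2_975) / 2


def level_3(mean_1, sd_1, t1_99, t1_75, mean_2, sd_2, t2_99, t2_75):
    """'75%/99+%' test, required to hold in both directions."""
    diff = abs(mean_1 - mean_2)
    return (diff > sd_1 * t1_99 + sd_2 * t2_75
            and diff > sd_2 * t2_99 + sd_1 * t1_75)


def level_4(values_1, values_2):
    """Non-overlap of the recorded values themselves."""
    return max(values_1) < min(values_2) or max(values_2) < min(values_1)


def level_5(mean_1, sd_1, t1_975, mean_2, sd_2, t2_975):
    """'Full' diagnosability at the 95% level."""
    return abs(mean_1 - mean_2) > sd_1 * t1_975 + sd_2 * t2_975


# Hypothetical means and SDs; t-values 2.306 and 2.007 are the one-sided
# 97.5% values for n = 9 and n = 53. The 99% and 75% values are
# approximate table values for the same sample sizes.
print(level_2(10.0, 1.0, 2.306, 14.0, 1.0, 2.007))              # True
print(level_3(10.0, 1.0, 2.896, 0.706, 14.0, 1.0, 2.40, 0.68))  # True
print(level_5(10.0, 1.0, 2.306, 14.0, 1.0, 2.007))              # False
```

In this example the means are exactly four SDs apart, yet Level 5 fails because the t-values for these sample sizes sum to more than four.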
Figure
These five tests were applied to 2348 population/variable combinations for voice and 822 population/variable combinations for biometrics. A population/variable combination is one comparison between two populations for a single variable. For example, in the Grallaricula study, a comparison of the main East Andes population against the Central Andes population for song length would constitute a single population/variable combination. With five diagnosability tests (Levels 1–5 above) conducted per population/variable combination, this means that a total of 15,610 pairwise statistical tests were run in this part of the study. (A further four tests conducted in later sections bring that total to over 28,000 separate statistical tests in this study.) Each population/variable combination was placed in a category summarizing which diagnosability tests it satisfied. The total number of population/variable combinations meeting particular tests was then summed for the biometric and vocal data sets separately, and then similar kinds of outcomes were grouped using the framework set out in Table
Recorded test satisfaction outcomes and the mapping of such outcomes to diagnosis groupings.
Outcome  Meaning  Grouping 

0  None of the tests are met.  No diagnosis 
1  Statistically significant difference between means but no tests of diagnosis are met and data overlap.  Statistical significance 
14  Statistically significant difference between means and data show no overlap but no tests of diagnosis are met.  
12  Statistically significant difference between means but diagnosis only up to 50% and data overlap.  50% differentiation 
124  Statistically significant difference between means, diagnosis up to 50% and data show no overlap.  
123  Statistically significant difference between means and diagnosis at both 50% and 75% levels but data overlap.  75% differentiation 
1234  Statistically significant difference between means and diagnosis at both 50% and 75% levels and data do not overlap  
12345  Statistically significant difference between means and diagnosis at 50%, 75% and 95% levels and data do not overlap  95% differentiation 
1235  Statistically significant difference between means and diagnosis at 50%, 75% and 95% levels but data overlap.  
1245  Statistically significant difference between means and diagnosis at 50% and 95% levels and data do not overlap but 75% test is not met.  
125  Statistically significant difference between means and diagnosis at 50% and 95% levels but 75% test is not met and data overlap.  
2  No statistically significant difference between means, but 50% diagnosis test is met.  Possible false results 
2345  No statistically significant difference between means, but 50%, 75% and 95% diagnosis tests are met.  
24  No statistically significant difference between means, but 50% diagnosis test is met and data do not overlap.  
245  No statistically significant difference between means, but 50% and 95% diagnosis tests are met and data do not overlap.  
25  No statistically significant difference between means and data overlap, but 50% and 95% diagnosis tests are met.  
4  Data do not overlap but no other statistical tests are met 
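The mapping in the table above can be expressed as a simple rule, sketched here with a hypothetical function name. The rule is inferred from the listed outcomes only: any outcome lacking Level 1 is a possible false result, and otherwise the highest of Levels 5, 3 and 2 that is passed sets the grouping. Outcome codes not listed in the table would be assigned by the same rule, which is an assumption.

```python
def grouping(outcome_code):
    """Map an outcome code such as '124' to its diagnosis grouping."""
    passed = {int(ch) for ch in str(outcome_code) if ch != "0"}
    if not passed:
        return "No diagnosis"
    if 1 not in passed:
        return "Possible false results"
    if 5 in passed:
        return "95% differentiation"
    if 3 in passed:
        return "75% differentiation"
    if 2 in passed:
        return "50% differentiation"
    return "Statistical significance"


for code in ("0", "14", "124", "1234", "1245", "245"):
    print(code, "->", grouping(code))
```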
Certain minor methodological changes were undertaken here as compared to some of the underlying studies on which this paper is based: (i) where a single population had only one data point, it was excluded here from analyses, since only “Level 4” tests can be applied where degrees of freedom are 0 and this paper sought to compare outcomes for all comparisons; (ii) for the number of notes in the call for Grallaricula, several populations had uniformly one note in their calls, with standard deviation of zero, producing “divide by zero” errors for several tests, and so pairwise comparisons between such populations for that variable were excluded; (iii) some underlying studies presented biometric data for either males or females or all specimens or both; here, one or other of the “male” or “all specimens” data sets was selected, depending on whether material sexual differences in biometrics were observed and on sample size (generally, for studies with larger samples, using male only data is preferable, whilst in those studies with fewer specimens available, a combined data set was used here); (iv) for the main Scytalopus data set (
The second part of this study aimed to measure effect sizes four different ways, in order to inform appropriate benchmarks for measuring or scoring differentiation. The impacts of using pooled standard deviations (as per
Effect sizes were first calculated using the following formula:
(x̄_{1}–x̄_{2}) /[(s_{1}+s_{2})/2]
This uses an arithmetic mean of the standard deviations of the two populations to measure the difference between the means of the same two populations.
A control was applied using t-distribution values, following
(x̄_{1}–x̄_{2}) / ¼[s_{1}(t_{1 @ 97.5%}) + s_{2}(t_{2 @ 97.5%})]
This measures the distance between the means of two populations in terms of numbers of SDs, but controlling for sample size using a t distribution. The factor of ¼ is included to maintain parity with bare unpooled effect sizes and other measures studied in this section, i.e., where mean differences are measured with the equivalent of a single standard deviation for their denominator. For a normal distribution, as n tends to infinity, t tends to c.2 (actually nearer to 1.96), capturing c.95% of values within 2 standard deviations of the mean. As a result, s_{1}(t_{1 @ 97.5%}) + s_{2}(t_{2 @ 97.5%}) is equivalent to 2s_{1} + 2s_{2}, or 4s, calling for division by 4 to retain parity with 1s.
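A minimal sketch of the bare and controlled unpooled effect sizes follows. Function names and the input means and SDs are hypothetical, and the absolute difference between means is used so that effect sizes are non-negative, which is an assumption. The t-values 2.306 and 2.007 are the one-sided 97.5% values for n = 9 and n = 53.

```python
def bare_unpooled_effect_size(mean_1, sd_1, mean_2, sd_2):
    """Difference between means in units of the average of the two SDs."""
    return abs(mean_1 - mean_2) / ((sd_1 + sd_2) / 2)


def controlled_unpooled_effect_size(mean_1, sd_1, t1_975,
                                    mean_2, sd_2, t2_975):
    """As above, but with each SD scaled by its one-sided 97.5% t-value."""
    return abs(mean_1 - mean_2) / ((sd_1 * t1_975 + sd_2 * t2_975) / 4)


bare = bare_unpooled_effect_size(10.0, 1.0, 14.0, 1.0)
controlled = controlled_unpooled_effect_size(10.0, 1.0, 2.306,
                                             14.0, 1.0, 2.007)
print(round(bare, 3), round(controlled, 3))  # 4.0 3.71
```

When both t-values equal exactly 2, the controlled measure reduces to the bare measure, which is the parity that the factor of ¼ is intended to preserve.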
To illustrate the impact of this correction versus the results from using bare unpooled effect sizes, the maximum acoustic frequency in the “slow song” in Santa Marta Warbler Basileuterus basilicus differs from that in the East Andes population of Three-striped Warbler B. tristriatus by 4.087 SDs, using bare unpooled effect sizes. The controlled unpooled effect size for this variable is lower at 3.910 SDs. This is because n = 9 for basilicus and n = 53 for East Andes tristriatus; one-sided t-distribution values at 97.5% are 2.306 and 2.007 respectively (effectively reflecting that an average of 2.157 SDs at these sample sizes is equivalent to c.2 SDs with infinite data points). This particular population/variable comparison therefore moved from being in a diagnosable category (> 4 SDs’ difference) to not being diagnosable (< 4 SDs’ difference and failing
Effect sizes using a pooled standard deviation, or Cohen’s d, were calculated. First, the pooled standard deviation was calculated:
s_{p} = √[((n_{1}–1)s_{1}^{2} + (n_{2}–1)s_{2}^{2})/(n_{1}+n_{2}–2)]
Cohen’s d was then calculated as:
(x̄_{1}–x̄_{2}) / s_{p}
or, in full:
(x̄_{1}–x̄_{2}) / √[((n_{1}–1)s_{1}^{2} + (n_{2}–1)s_{2}^{2})/(n_{1}+n_{2}–2)]
This was the measure of effect size used by
Bare pooled effect sizes were subjected to an equivalent control for sample size (as for bare unpooled effect sizes), but using t-values at the degrees of freedom of the pooled standard deviation:
Cohen’s d / ((t_{pooled @ 97.5%})/2),
where t_{pooled} is based on the degrees of freedom for the pooled standard deviation: d.f.=n_{1}+n_{2}–2.
or, in full:
(x̄_{1}–x̄_{2}) / (√[((n_{1}–1)s_{1}^{2} + (n_{2}–1)s_{2}^{2})/(n_{1}+n_{2}–2)]) / ((t_{pooled @ 97.5%})/2).
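The pooled measures can be sketched as follows, using the usual definition of the pooled standard deviation in which each population's variance is weighted by its own degrees of freedom. Function names and inputs are hypothetical, and the absolute difference between means is an assumption. The t-value of 2.000 is the one-sided 97.5% value for 60 degrees of freedom (n_{1} = 9, n_{2} = 53).

```python
import math


def pooled_sd(n_1, sd_1, n_2, sd_2):
    """Pooled standard deviation of two samples."""
    return math.sqrt(((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2)
                     / (n_1 + n_2 - 2))


def cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Bare pooled effect size."""
    return abs(mean_1 - mean_2) / pooled_sd(n_1, sd_1, n_2, sd_2)


def controlled_cohens_d(d, t_pooled_975):
    """Pooled effect size controlled for sample size."""
    return d / (t_pooled_975 / 2)


d = cohens_d(10.0, 1.0, 9, 14.0, 2.0, 53)
print(round(d, 3), round(controlled_cohens_d(d, 2.000), 3))  # 2.108 2.108
```

With unequal sample sizes the pooled SD is pulled towards the SD of the larger sample, so Cohen's d and the unpooled measure can differ noticeably for the same pair.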
These four measures of effect size were calculated for each population/variable combination and each outcome was then placed into two sets of buckets. First, in order to obtain a general picture of effect size magnitudes in taxonomic studies, population/variable combinations were placed into a set of buckets divided at intervals of 2 effect sizes (i.e., at approximately 50% differentiation): 0–2, 2–4, 4–6, etc. A second set of buckets was based on
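The first set of buckets can be sketched as below. The function name is hypothetical, and the treatment of boundary values (a value of exactly 2 falls in the 2–4 bucket) is an assumption, since the text does not state which side of each boundary is inclusive. Plain hyphens are used in the labels.

```python
def bucket_label(effect_size, width=2, cap=20):
    """Return the 2-wide bucket label for a non-negative effect size."""
    size = abs(effect_size)
    if size >= cap:
        return f"{cap}+"
    lower = int(size // width) * width
    return f"{lower}-{lower + width}"


# Bare and controlled unpooled effect sizes from the warbler example.
print(bucket_label(4.087))  # 4-6
print(bucket_label(3.910))  # 2-4
```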
To compare the outcomes achieved using the four different measures of effect size and analyses of Levels 1–5, plots were produced between several of the outcomes. Spearman’s rank correlation coefficient was calculated between statistical significance and effect size outcomes, based on the entire vocal and biometric data sets, so as to examine the interrelation between the outcomes of applying different measures of differentiation.
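For reference, Spearman's rank correlation coefficient is the Pearson correlation of the ranks of two variables, with tied values given their average rank. The sketch below is a standard-library illustration with hypothetical function names and toy inputs; it does not reproduce the calculation carried out for this study.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# A monotonic but non-linear relationship still gives rho = 1.
print(spearman_rho([1, 2, 3, 4], [1, 4, 9, 100]))  # 1.0
```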
Tables
Effect of applying different “Type 1 error” corrections on the vocal data set. The tests are ordered (A–G) from least to most conservative corrections. Sirystes, Geotrygon and Hypnelus data are presented outside the totals, since there was no biometric data set on which more conservative cumulative corrections could be applied. In the case of the latter two genera, Adelomyia and Scytalopus 2, only one kind of vocalization was studied.
Adelomyia  Myrmeciza  Grallaricula  Scytalopus 1  Scytalopus 2  Basileuterus  TOTALS  [Sirystes]  [Geotrygon]  [Hypnelus]  

No. of vocal variables  10  26  14  12  7  19  18  2  5  
A. No correction  
p<  0.05  0.05  0.05  0.05  0.05  0.05  0.05  0.05  0.05  
Passed  2  406  323  236  4  240  1211  32  2  2 
Total  10  614  406  336  7  343  1716  44  2  5 
% passed  20%  66.1%  79.6%  70.2%  57.1%  70.0%  70.6%  72.7%  100%  40% 
B. Dunn-Šidák with each kind of vocalisation separately  
p<  0.00512  0.00639  0.00730  0.00851  0.00730  0.00730  0.0102  0.0253  0.0102  
Passed  2  357  293  197  3  200  1052  25  2  1 
Total  10  614  406  336  7  343  1716  44  2  5 
%  20%  58.1%  72.2%  58.6%  42.9%  58.3%  61.3%  56.8%  100%  20% 
C. Bonferroni with each kind of vocalisation separately  
p<  0.005  0.00625  0.00714  0.00833  0.00714  0.00714  0.01  0.025  0.01  
Passed  2  357  293  197  3  200  1052  25  2  1 
Total  10  614  406  336  7  343  1716  44  2  5 
%  20%  58.1%  72.2%  58.6%  42.9%  58.3%  61.3%  56.8%  100%  20% 
D. Dunn-Šidák with voice and biometrics separately  
p<  0.00512  0.00197  0.00366  0.00427  0.00730  0.00730  0.00285  0.0253  0.0102  
Passed  2  321  267  186  3  189  968  20  2  1 
Total  10  614  406  336  7  343  1716  44  2  5 
%  20%  52.2%  65.8%  55.3%  42.9%  55.1%  56.4%  45.4%  100%  20% 
E. Bonferroni with voice and biometrics separately  
p<  0.005  0.00192  0.00357  0.00417  0.00714  0.00714  0.00278  0.025  0.01  
Passed  2  321  266  185  3  188  965  20  2  1 
Total  10  614  406  336  7  343  1716  44  2  5 
%  20%  52.2%  65.5%  55.1%  42.9%  54.8%  56.2%  45.5%  100%  20% 
F. Dunn-Šidák: biometrics plus voice  
p<  0.00366  0.00165  0.00256  0.00301  0.00427  0.00427  
Passed  2  317  260  179  3  182  943  
Total  10  614  406  336  7  343  1716  
%  20%  51.6%  64.0%  53.3%  42.9%  53.1%  55.0%  
G. Bonferroni: biometrics plus voice  
p<  0.00357  0.00161  0.0025  0.00294  0.00417  0.00417  
Passed  2  317  260  178  3  181  941  
Total  10  614  406  336  7  343  1716  
%  20%  51.6%  64.0%  53.0%  42.9%  52.8%  54.8% 
Effect of applying different “Type 1 error” corrections on the biometric data set. The tests are ordered (A–E) from least to most conservative corrections. Anisognathus data are presented outside the totals, since there was no vocal data set on which more conservative cumulative corrections could be applied.
Adelomyia  Myrmeciza  Grallaricula  Scytalopus 1  Scytalopus 2  Basileuterus  TOTALS  [Anisognathus]  

No. of biometric variables  4  5  6  5  5  5  5  
A. No correction  
p<  0.05  0.05  0.05  0.05  0.05  0.05  0.05  
Passed  0  46  142  45  3  66  302  88 
Total  4  87  281  109  5  150  636  186 
% passed  0%  52.9%  51.8%  41.3%  60%  44%  47.5%  47.3% 
B. Dunn-Šidák with biometrics and voice separately  
p<  0.0127  0.0102  0.00851  0.0102  0.0102  0.0102  0.0102  
Passed  0  31  108  35  2  50  226  66 
Total  4  87  281  109  5  150  636  186 
%  0%  35.6%  38.4%  32.1%  40%  33.3%  35.5%  35.5% 
C. Bonferroni with biometrics and voice separately  
p<  0.0125  0.01  0.00833  0.01  0.01  0.01  0.01  
Passed  0  31  108  35  2  50  226  66 
Total  4  87  281  109  5  150  636  186 
%  0%  35.6%  38.4%  32.1%  40%  33.3%  35.5%  35.5% 
D. Dunn-Šidák: biometrics plus voice  
p<  0.00366  0.00165  0.00256  0.00301  0.00427  0.00213  
Passed  0  33  92  31  2  40  198  
Total  4  87  281  109  5  150  636  
%  0%  37.9%  32.7%  28.4%  40%  26.7%  31.1%  
E. Bonferroni: biometrics plus voice  
p<  0.00357  0.00161  0.0025  0.00294  0.00417  0.00208  
Passed  0  33  92  31  2  40  198  
Total  4  87  281  109  5  150  636  
%  0%  37.9%  32.7%  28.4%  40%  26.7%  31.1% 
In the biometrics study, lower levels of statistically significant differentiation were found than for voice. More comparisons were non-significant (52.5%) than significant, even prior to applying any type 1 corrections. Applying type 1 corrections eliminated a further 12–16% of outcomes. Fewer than 5% of these eliminations resulted from treating voice and biometrics together; the bulk resulted from applying Bonferroni on the biometric data set itself. Dunn-Šidák corrections had no impact compared to using Bonferroni.
Tables
Outcomes of pairwise comparisons for vocal characters, placed into the different categories recovered by testing statistical tests of Levels 1 through 5. 1 = statistically significant, 2 = 50% diagnosis, 3=75% diagnosis, 4 = actual value diagnosis, 5 = 95% diagnosis (as detailed further in methods). See Table
Voice: levels passed  None  1  14  12  124  123  1234  12345  1235  1245  125  2  2345  24  245  4 

Geotrygon  1  1  
Adelomyia  8  2  
Hypnelus  3  1  1  
Myrmeciza  284  38  53  35  23  9  4  109  11  2  37  2  5  2  
Grallaricula  118  92  1  47  27  7  5  71  3  12  1  2  12  8  
Scytalopus 1  148  82  37  17  5  7  36  1  1  2  
Scytalopus 2  4  3  
Sirystes  22  7  7  3  1  2  1  1  
Basileuterus  504  188  3  73  53  7  6  47  7  2  0  1  11  0  22  
TOTAL  1091  410  57  199  128  28  23  265  22  16  38  7  12  18  3  31 
Percentages  46.5%  17.5%  2.4%  8.5%  5.5%  1.2%  1.0%  11.3%  0.9%  0.7%  1.6%  0.3%  0.5%  0.8%  0.1%  1.3% 
Outcomes of pairwise comparisons for biometric characters, placed into the different categories recovered by testing statistical tests of Levels 1 through 5. 1 = statistically significant, 2 = 50% diagnosis, 3 = 75% diagnosis, 4 = actual value diagnosis, 5 = 95% diagnosis (as detailed further in methods). See Table
Biometrics: levels passed  None  1  14  12  124  123  1234  12345  1235  1245  125  2  2345  24  245  4 

Adelomyia  4  
Myrmeciza  25  24  9  1  4  24  
Grallaricula  166  17  1  15  23  1  5  35  3  7  8  
Scytalopus 1  47  6  4  8  12  2  3  27  
Scytalopus 2  3  1  1  
Anisognathus  120  43  12  6  1  4  
Basileuterus  97  27  1  10  9  1  2  1  2  
TOTAL  462  117  6  54  52  1  9  49  0  3  0  0  0  8  0  61 
Percentages  56.2%  14.2%  0.7%  6.6%  6.3%  0.1%  1.1%  6.0%  0.0%  0.4%  0.0%  0.0%  0.0%  1.0%  0.0%  7.4% 
Outcomes of pairwise comparisons using Levels analysis, for voice, by grouping. See Table
Voice: Taxon  Pairwise statistical tests (/5)  No diff.  Poss. false results  Signif. Only  50%  75%  95% 

Geotrygon  2  0  0  1  1  0  0 
Adelomyia  10  8  0  2  0  0  0 
Hypnelus  5  3  1  0  1  0  0 
Myrmeciza  614  284  9  91  58  13  159 
Grallaricula  406  118  22  93  74  12  87 
Scytalopus 1  336  148  3  82  54  12  37 
Scytalopus 2  7  4  0  0  3  0  0 
Sirystes  44  22  2  7  10  1  2 
Basileuterus  924  504  34  191  126  13  56 
TOTALS  2348  1091  71  467  327  51  341 
OVERALL %  46.5%  3.0%  19.9%  13.9%  2.2%  14.5%  
% (comparable)  46.4%  2.9%  20.0%  13.7%  2.1%  14.7% 
Outcomes of pairwise comparisons using Levels analysis, for biometrics, by grouping. See Table
Biometrics: Taxon  Pairwise statistical tests (/5)  No diff.  Poss. false results  Signif. only  50%  75%  95% 

Adelomyia  4  4  0  0  0  0  0 
Myrmeciza  87  25  24  24  10  0  4 
Grallaricula  281  166  15  18  38  6  38 
Scytalopus 1  109  47  27  10  20  2  3 
Scytalopus 2  5  3  0  0  1  0  1 
Anisognathus  186  120  0  43  18  1  4 
Basileuterus  150  97  3  28  19  1  2 
TOTALS  822  462  69  123  106  10  52 
OVERALL %  56.2%  8.4%  15.0%  12.9%  1.2%  6.3%  
% (comparable)  53.8%  10.8%  12.6%  13.8%  1.4%  7.5% 
As foreshadowed in the type 1 error analysis (Tables
Levels 1–5 were generally ordered from least to most exacting in terms of difficulty to pass. However, several examples of “outliers” were uncovered, where more liberal test outcomes were apparently “skipped”, e.g.: (i) only statistical significance and non-overlap (1&4); (ii) statistical significance with 50% and non-overlap but not 75% diagnosis (124); (iii) all tests being passed except non-overlap (123&5); (iv) all tests including 95% diagnosis being passed, but excluding 75% diagnosis (124&5); (v) statistical significance and 50% and 95% diagnosis being met but neither 75% diagnosis nor non-overlap (12&5); and (vi) combinations skipping statistical significance altogether, but passing other tests (all outcomes starting with 2 or 4). These outcomes are all statistically plausible, including as a result of the values of t at particular sample sizes for different confidence limits, even if in some cases they are logically counterintuitive.
In terms of specific findings for birds, biometric data were less informative than vocal data with “possibly false results” also being more frequent for biometric comparisons.
Results for effect sizes divided into buckets of 2d are set out in Tables
Results of the effect size study for voice, partitioning the data into 2 effect size intervals. The top two tables are based upon actual standard deviations for each set of data subjected to pairwise comparison. The lower two tables are based on pooled standard deviation data. In each case, “bare” effect sizes are shown first (above). The second and fourth tables use “controlled effect sizes” for the relevant pooling approach, calculated by taking into account t-distribution values for the relevant sample size (or pooled sample size).
Bare Unpooled Effect Sizes (Voice)  

Taxon  0–2  2–4  4–6  6–8  8–10  10–12  12–14  14–16  16–18  18–20  20+ 
Geotrygon  1  1  
Adelomyia  10  
Hypnelus  3  2  
Myrmeciza  361  119  55  33  11  14  8  6  6  1  0 
Grallaricula  208  91  47  16  10  7  3  14  6  3  1 
Scytalopus 1  227  66  30  7  6  
Scytalopus 2  4  3  
Sirystes  27  14  2  0  1  
Basileuterus  654  168  47  29  20  4  1  1  0  0  0 
TOTAL  1497  462  181  85  48  25  12  21  12  4  1 
Percentage  63.7%  19.8%  7.7%  3.6%  2.0%  1.1%  0.5%  0.9%  0.5%  0.2%  0.0% 
Controlled Unpooled Effect Sizes (Voice)  
Taxon  0–2  2–4  4–6  6–8  8–10  10–12  12–14  14–16  16–18  18–20  20+ 
Geotrygon  1  1  
Adelomyia  10  
Hypnelus  4  1  
Myrmeciza  375  114  53  32  14  15  0  7  3  1  
Grallaricula  219  88  43  25  14  5  6  4  1  0  1 
Scytalopus 1  230  69  27  7  3  
Scytalopus 2  4  3  
Sirystes  29  12  3  
Basileuterus  717  151  34  14  6  2  
TOTAL  1589  439  160  78  37  22  6  11  4  1  1 
Percentage  67.7%  18.7%  6.8%  3.3%  1.6%  0.9%  0.3%  0.5%  0.2%  0.0%  0.0% 
Bare Pooled Effect Sizes (Voice)  
Taxon  0–2  2–4  4–6  6–8  8–10  10–12  12–14  14–16  16–18  18–20  20+ 
Geotrygon  1  1  
Adelomyia  10  
Hypnelus  3  2  
Myrmeciza  371  130  48  19  17  10  9  4  6  0  0 
Grallaricula  222  98  33  16  6  9  7  6  3  2  4 
Scytalopus 1  232  58  36  5  5  
Scytalopus 2  4  3  
Sirystes  28  13  2  0  0  0  0  1  
Basileuterus  687  151  40  16  11  8  5  2  1  1  2 
TOTAL  1558  456  159  56  39  27  21  13  10  3  6 
Percentage  66.4%  19.4%  6.8%  2.4%  1.7%  1.1%  0.9%  0.6%  0.4%  0.1%  0.3% 
Controlled Pooled Effect Sizes (Voice)  
Taxon  0–2  2–4  4–6  6–8  8–10  10–12  12–14  14–16  16–18  18–20  20+ 
Geotrygon  1  1  
Adelomyia  10  
Hypnelus  3  2  
Myrmeciza  374  127  49  21  19  6  8  4  6  
Grallaricula  226  99  31  17  8  8  6  5  2  2  2 
Scytalopus 1  233  58  35  5  5  
Scytalopus 2  4  3  
Sirystes  28  13  2  0  0  0  1  
Basileuterus  694  153  31  22  8  7  4  1  1  1  2 
TOTAL  1573  456  148  65  40  21  19  10  9  3  4 
Percentage  67.0%  19.4%  6.3%  2.8%  1.7%  0.9%  0.8%  0.4%  0.4%  0.1%  0.2% 
Results of the effect size study for biometrics, partitioning the data into intervals of 2 effect sizes. The top two tables are based upon actual standard deviations for each set of data subjected to pairwise comparison. The lower two tables are based on pooled standard deviation data. In each case, “bare” effect sizes are shown first (above). The second and fourth tables use “controlled effect sizes” for the relevant pooling approach, calculated by taking into account t-distribution values for the relevant sample size (or pooled sample size).
Bare Unpooled Effect Sizes (Biometrics)  
Taxon  0–2  2–4  4–6  6–8  8–10  10–12  12–14  14–16  16–18  18–20  20+ 
Adelomyia  4  
Myrmeciza  58  25  4  
Grallaricula  170  63  21  14  5  3  1  2  0  2  
Scytalopus 1  67  33  7  1  1  
Scytalopus 2  2  2  0  1  
Anisognathus  154  26  5  0  1  
Basileuterus  116  27  7  
TOTAL  571  176  44  16  7  3  1  2  0  2  0 
Percentage  69.5%  21.4%  5.4%  1.9%  0.9%  0.4%  0.1%  0.2%  0.0%  0.2%  0.0% 
Controlled Unpooled Effect Sizes (Biometrics)  
Taxon  0–2  2–4  4–6  6–8  8–10  10–12  12–14  14–16  16–18  18–20  20+ 
Adelomyia  4  
Myrmeciza  73  10  4  
Grallaricula  192  51  20  12  3  0  0  1  2  
Scytalopus 1  84  22  2  1  
Scytalopus 2  3  1  1  
Anisognathus  163  19  3  1  
Basileuterus  127  21  2  
TOTAL  646  124  32  14  3  0  0  1  2  0  0 
Percentage  78.6%  15.1%  3.9%  1.7%  0.4%  0.0%  0.0%  0.1%  0.2%  0.0%  0.0% 
Bare Pooled Effect Sizes (Biometrics)  
Taxon  0–2  2–4  4–6  6–8  8–10  10–12  12–14  14–16  16–18  18–20  20+ 
Adelomyia  4  
Myrmeciza  57  28  2  
Grallaricula  177  54  23  14  3  5  2  2  1  
Scytalopus 1  67  36  6  
Scytalopus 2  2  2  1  
Anisognathus  158  23  4  0  1  
Basileuterus  127  17  6  
TOTAL  592  160  42  14  4  5  2  2  1  0  0 
Percentage  72.0%  19.5%  5.1%  1.7%  0.5%  0.6%  0.2%  0.2%  0.1%  0.0%  0.0% 
Controlled Pooled Effect Sizes (Biometrics)  
Taxon  0–2  2–4  4–6  6–8  8–10  10–12  12–14  14–16  16–18  18–20  20+ 
Adelomyia  4  
Myrmeciza  58  27  2  
Grallaricula  181  54  21  11  8  3  1  1  1  
Scytalopus 1  75  30  4  
Scytalopus 2  3  2  
Anisognathus  161  21  3  0  1  
Basileuterus  128  17  5  
TOTAL  610  151  35  11  9  3  1  1  1  0  0 
Percentage  74.2%  18.4%  4.3%  1.3%  1.1%  0.4%  0.1%  0.1%  0.1%  0.0%  0.0% 
A good portion (15–22%) of outcomes fell into the 2–4 effect sizes category, which, when using controlled unpooled effect sizes, corresponds to Level 2 in Tables
Table
Changes in effect size categories resulting from increasingly more conservative tests of effect size being applied. This table is based upon changes between the categories in Tables
Voice: changes into effect size category  0–2  2–4  4–6  6–8  8–10  10–12  12–14  14–16  16–18  18–20  20+ 
Bare Unpooled –> Bare Pooled  +63  -8  -22  -29  -9  +2  +9  -8  -2  -1  +5 
Bare Pooled –> Controlled Pooled  +15  0  -11  +9  +1  -6  -2  -3  -1  0  -2 
Controlled Pooled –> Controlled Unpooled  +16  -17  +12  +13  -3  +1  -13  +1  -5  -2  -3 
Total change from Bare Unpooled –> Controlled Unpooled  +94  -25  -21  -7  -11  -3  -6  -10  -8  -3  0 
As percentage of total  +4.0%  -1.1%  -0.9%  -0.3%  -0.5%  -0.1%  -0.3%  -0.4%  -0.3%  -0.1%  0.0% 
Biometrics: changes into effect size category  0–2  2–4  4–6  6–8  8–10  10–12  12–14  14–16  16–18  18–20  20+ 
Bare Unpooled –> Bare Pooled  +21  -16  -2  -2  -3  +2  +1  0  +1  -2  0 
Bare Pooled –> Controlled Pooled  +18  -9  -7  -3  +5  -2  -1  -1  0  0  0 
Controlled Pooled –> Controlled Unpooled  +36  -27  -3  +3  -6  -3  -1  0  +1  0  0 
Total change from Bare Unpooled –> Controlled Unpooled  +75  -52  -12  -2  -4  -3  -1  -1  +2  -2  0 
As percentage of total  +9.1%  -6.3%  -1.5%  -0.2%  -0.5%  -0.4%  -0.1%  -0.1%  +0.2%  -0.2%  0.0% 
Results for effect sizes divided into
Results of the effects size study for voice using
Bare Unpooled Effect Sizes (Voice)  
Taxon  0–0.2  0.2–2  2–5  5–10  10+ 
Geotrygon  0  1  1  
Adelomyia  3  7  
Hypnelus  0  3  2  
Myrmeciza  70  291  152  66  35 
Grallaricula  34  174  119  45  34 
Scytalopus 1  31  196  83  26  
Scytalopus 2  0  4  3  
Sirystes  4  23  14  3  
Basileuterus  96  558  200  64  6 
TOTAL  239  1257  574  204  75 
Percentage  10.1%  53.5%  24.4%  8.7%  3.2% 
Controlled Unpooled Effect Sizes (Voice)  
Taxon  0–0.2  0.2–2  2–5  5–10  10+ 
Geotrygon  0  1  1  
Adelomyia  3  7  
Hypnelus  0  4  1  
Myrmeciza  73  302  144  69  26 
Grallaricula  37  182  121  49  17 
Scytalopus 1  33  197  85  21  
Scytalopus 2  0  4  3  
Sirystes  4  25  12  3  
Basileuterus  113  604  171  34  2 
TOTAL  263  1326  538  176  45 
Percentage  11.2%  56.5%  22.9%  7.5%  1.9% 
Bare Pooled Effect Sizes (Voice)  
Taxon  0–0.2  0.2–2  2–5  5–10  10+ 
Geotrygon  0  1  1  
Adelomyia  3  7  
Hypnelus  0  3  2  
Myrmeciza  71  300  157  57  29 
Grallaricula  35  187  118  35  31 
Scytalopus 1  31  201  78  26  
Scytalopus 2  0  4  3  
Sirystes  4  24  14  2  
Basileuterus  105  582  181  37  19 
TOTAL  249  1309  554  157  79 
Percentage  10.6%  55.7%  23.6%  6.7%  3.4% 
Controlled Pooled Effect Sizes (Voice)  
Taxon  0–0.2  0.2–2  2–5  5–10  10+ 
Geotrygon  0  1  1  
Adelomyia  3  7  
Hypnelus  0  3  2  
Myrmeciza  71  303  158  58  24 
Grallaricula  36  190  115  40  25 
Scytalopus 1  30  203  79  24  
Scytalopus 2  0  4  3  
Sirystes  4  24  14  2  
Basileuterus  109  585  177  37  16 
TOTAL  253  1320  549  161  65 
Percentage  10.8%  56.2%  23.4%  6.9%  2.8% 
Results of the effects size study for biometrics using
Bare Unpooled Effect Sizes (Biometrics)  
Taxon  0–0.2  0.2–2  2–5  5–10  10+ 
Adelomyia  1  3  
Myrmeciza  5  53  27  2  
Grallaricula  24  146  70  33  8 
Scytalopus 1  7  60  40  2  
Scytalopus 2  0  2  2  1  
Anisognathus  22  132  28  4  
Basileuterus  19  97  32  2  
TOTAL  78  493  199  44  8 
Percentage  9.5%  60.0%  24.2%  5.4%  1.0% 
Controlled Unpooled Effect Sizes (Biometrics)  
Taxon  0–0.2  0.2–2  2–5  5–10  10+ 
Adelomyia  1  3  
Myrmeciza  8  65  14  
Grallaricula  29  163  61  25  3 
Scytalopus 1  17  67  24  1  
Scytalopus 2  0  3  2  
Anisognathus  24  139  21  2  
Basileuterus  28  99  23  
TOTAL  107  539  145  28  3 
Percentage  13.0%  65.6%  17.6%  3.4%  0.4% 
Bare Pooled Effect Sizes (Biometrics)  
Taxon  0–0.2  0.2–2  2–5  5–10  10+ 
Adelomyia  1  3  
Myrmeciza  6  51  28  2  
Grallaricula  26  151  63  31  10 
Scytalopus 1  8  59  40  2  
Scytalopus 2  0  2  3  
Anisognathus  23  135  24  4  
Basileuterus  20  97  21  12  
TOTAL  84  498  179  51  10 
Percentage  10.2%  60.6%  21.8%  6.2%  1.2% 
Controlled Pooled Effect Sizes (Biometrics)  
Taxon  0–0.2  0.2–2  2–5  5–10  10+ 
Adelomyia  1  3  
Myrmeciza  7  51  28  1  
Grallaricula  30  151  63  31  6 
Scytalopus 1  11  64  32  2  
Scytalopus 2  0  2  3  
Anisognathus  23  138  22  3  
Basileuterus  21  107  22  
TOTAL  93  516  170  37  6 
Percentage  11.3%  62.8%  20.7%  4.5%  0.7% 
Changes between category (Table
Changes in
Voice: changes into effect size category  0–0.2  0.2–2  2–5  5–10  10+ 
Bare Unpooled –> Bare Pooled  +11  +52  -20  -47  +4 
Bare Pooled –> Controlled Pooled  +4  +11  -5  +4  -14 
Controlled Pooled –> Controlled Unpooled  +10  +6  -11  +15  -20 
Total change from Bare Unpooled –> Controlled Unpooled  +25  +69  -36  -28  -30 
As percentage of total  1.1%  2.9%  1.5%  1.2%  1.3% 
Biometrics: changes into effect size category  0–0.2  0.2–2  2–5  5–10  10+ 
Bare Unpooled –> Bare Pooled  +6  +5  -20  +7  +2 
Bare Pooled –> Controlled Pooled  +9  +18  -9  -14  -4 
Controlled Pooled –> Controlled Unpooled  +14  +23  -25  -9  -3 
Total change from Bare Unpooled –> Controlled Unpooled  +29  +46  -54  -16  -5 
As percentage of total  3.5%  5.6%  6.6%  1.9%  0.6% 
Tables
Scattergraphs showing the effects of applying different corrections of effect size on the entire vocal data set. Each axis shows effect size, measured in a different way. A Controlling for sample size using unpooled data – x-axis: bare unpooled effect size; y-axis: controlled unpooled effect size B Controlling for sample size using pooled data – x-axis: bare pooled effect size; y-axis: controlled pooled effect size C Using pooled versus unpooled effect sizes without controlling for sample size – x-axis: bare unpooled effect size; y-axis: bare pooled effect size D Using pooled versus unpooled effect sizes and controlling for sample size – x-axis: controlled unpooled effect size; y-axis: controlled pooled effect size. A single data point of greater than 25 effect sizes was excluded to improve presentation of the results.
Scattergraphs showing the effects of applying different corrections of effect size on the entire biometric data set. Each axis shows effect size, measured in a different way. A Controlling for sample size using unpooled data – x-axis: bare unpooled effect size; y-axis: controlled unpooled effect size B Controlling for sample size using pooled data – x-axis: bare pooled effect size; y-axis: controlled pooled effect size C Using pooled versus unpooled effect sizes without controlling for sample size – x-axis: bare unpooled effect size; y-axis: bare pooled effect size D Using pooled versus unpooled effect sizes and controlling for sample size – x-axis: controlled unpooled effect size; y-axis: controlled pooled effect size.
Spearman’s rank correlation coefficients between the results of five statistical tests carried out on the vocal data set. Confidence for the values stated was given in PAST as zero or less than p<1×10^{-100} for all tests.
Tests conducted on vocal data set  Bare Unpooled Effect Sizes  Controlled Unpooled Effect Sizes  Bare Pooled Effect Sizes  Controlled Pooled Effect Sizes 

Statistical significance (Student’s t)  -0.863  -0.910  -0.849  -0.859 
Bare Unpooled Effect Sizes  0.987  0.988  0.908  
Controlled Unpooled Effect Sizes  0.894  0.980  
Bare Pooled Effect Sizes  0.999 
Spearman’s rank correlation coefficients between the results of five statistical tests carried out on the biometric data set. Confidence for the values stated was given in PAST as zero or less than p<1×10^{-100} for all tests.
Tests conducted on biometric data set  Bare Unpooled Effect Sizes  Controlled Unpooled Effect Sizes  Bare Pooled Effect Sizes  Controlled Pooled Effect Sizes 

Statistical significance (Student’s t)  -0.851  -0.951  -0.831  -0.857 
Bare Unpooled Effect Sizes  0.936  0.990  0.983  
Controlled Unpooled Effect Sizes  0.920  0.936  
Bare Pooled Effect Sizes  0.994 
Changes in actual effect sizes resulting from changes between different methods of measuring effect sizes in the format actual mean ± standard deviation (minimum–maximum) [mean using absolute values ± standard deviation using absolute values], for voice. The figure in each cell demonstrates the outcomes of subtracting the effect size in the columns from the effect sizes in the rows, for each data point studied.
Tests conducted on vocal data set  Controlled Unpooled Effect Sizes  Bare Pooled Effect Sizes  Controlled Pooled Effect Sizes 

Bare Unpooled Effect Sizes  0.35 ± 1.15 (-0.20–24.26) [0.35 ± 1.15]  0.15 ± 0.91 (-11.73–7.13) [0.36 ± 0.84]  0.24 ± 1.07 (-11.32–19.28) [0.41 ± 1.02] 
Controlled Unpooled Effect Sizes  -0.19 ± 1.55 (-20.70–4.40) [0.45 ± 1.50]  -0.11 ± 1.36 (-20.26–5.63) [0.40 ± 1.31]  
Bare Pooled Effect Sizes  0.08 ± 0.46 (-0.27–15.40) [0.09 ± 0.46] 
Changes in actual effect sizes resulting from changes between different methods of measuring effect sizes in the format actual mean ± standard deviation (minimum–maximum) [mean using absolute values ± standard deviation using absolute values] for biometrics. The figure in each cell demonstrates the outcomes of subtracting the effect size in the columns from the effect sizes in the rows, for each data point studied.
Tests conducted on biometric data set  Controlled Unpooled Effect Sizes  Bare Pooled Effect Sizes  Controlled Pooled Effect Sizes 

Bare Unpooled Effect Sizes  0.41 ± 0.68 (-0.01–5.61) [0.41 ± 0.68]  0.09 ± 0.44 (-2.17–5.34) [0.19 ± 0.40]  0.21 ± 0.53 (-1.97–5.77) [0.26 ± 0.51] 
Controlled Unpooled Effect Sizes  -0.32 ± 0.73 (-6.31–2.36) [0.39 ± 0.69]  -0.20 ± 0.57 (-4.4–2.78) [0.30 ± 0.53]  
Bare Pooled Effect Sizes  0.12 ± 0.28 (-0.01–2.84) [0.12 ± 0.28] 
The largest shift was observed between bare unpooled and controlled unpooled effect sizes, where the number of outcomes in the lowest category of differentiation (0–2) increased by 3.9% (voice) or 9.1% (biometrics).
The impact of using bare pooled versus bare unpooled effect sizes is illustrated in Figures
The overall magnitude of reduction of effect size measurements between bare pooled effect sizes and controlled pooled effect sizes was moderate. Degrees of freedom for pooled standard deviation are higher (the sum of the two samples’ sample sizes minus 2) than when using unpooled methods (where each sample is treated separately), resulting in lower t-values when using pooled standard deviations. Application of t-distribution corrections on effect sizes using unpooled standard deviations resulted in the most conservative of all measures of effect sizes, linked to overall lowest degrees of freedom in corrections and overall higher t-values.
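The contrast in degrees of freedom described here can be illustrated with a minimal Python sketch. It assumes the textbook pooled standard deviation formula (variances weighted by n − 1), which may differ in detail from the calculation used in this study; the function names and sample figures are hypothetical and serve only as an illustration.

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Textbook pooled SD: each variance is weighted by its own degrees of freedom."""
    df = n1 + n2 - 2
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)

def degrees_of_freedom(n1, n2):
    """Return (pooled df, (unpooled df of sample 1, unpooled df of sample 2))."""
    return n1 + n2 - 2, (n1 - 1, n2 - 1)

# Hypothetical samples: a large sample with SD 2.0 and a small sample with SD 1.0.
sd_pooled = pooled_sd(2.0, 100, 1.0, 5)
df_pooled, df_unpooled = degrees_of_freedom(100, 5)
print(round(sd_pooled, 3), df_pooled, df_unpooled)
```

In this hypothetical case the pooled value sits close to the standard deviation of the larger sample, and the pooled degrees of freedom (103) far exceed those of the smaller sample treated alone (4), which is why a t-based correction is milder under pooling.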
Although these overall trends were observed, the impact of applying differing methods of measurement of effect sizes on actual pairwise comparisons was not uniform (see Figures
Statistical significance presented a weak negative correlation with most effect size measurements, but was most closely correlated with controlled unpooled effect sizes. In the case of biometrics, there was a strong negative correlation with controlled unpooled effect sizes (Tables
The variability between particular scores using different effect size measures is defined further in Tables
The relationship between each measurement of effect size and statistical significance is explored in Tables
Effect sizes under the four models studied here, grouped into the three “zones” of statistical significance illustrated in Figures
Statistical significance  Bare unpooled effect sizes  Bare pooled effect sizes  Controlled pooled effect sizes  Controlled unpooled effect sizes 

p>0.05  0.60 ± 1.31(0.00–11.56)  0.38 ± 0.33(0.00–2.21)  0.71 ± 2.13(0.00–22.92)  0.67 ± 2.02(0.00–22.47) 
0.05/n_{v}<p<0.05  1.82 ± 2.69(0.33–18.22)  1.29 ± 1.24(0.33–8.47)  1.82 ± 3.00(0.26–21.60)  1.67 ± 2.61(0.26–20.14) 
p<0.05/n_{v}  3.69 ± 3.30(0.36–45.33)  3.32 ± 2.73(0.37–21.07)  3.32 ± 3.05(0.37–41.45)  3.22 ± 2.81(0.37–26.05) 
Effect sizes under the four models studied here, grouped into the three “zones” of statistical significance illustrated in Figures
Statistical significance  Bare unpooled effect sizes  Bare pooled effect sizes  Controlled pooled effect sizes  Controlled unpooled effect sizes 

p>0.05  0.72 ± 0.70(0.00–3.97)  0.42 ± 0.29(0.00–1.95)  0.71 ± 0.72(0.00–3.92)  0.63 ± 0.60(0.00–3.41) 
0.05/n_{v}<p<0.05  1.45 ± 0.86(0.44–3.98)  1.06 ± 0.41(0.44–2.44)  1.37 ± 0.96(0.40–6.01)  1.26 ± 0.84(0.40–5.81) 
p<0.05/n_{v}  3.37 ± 2.64(0.61–19.80)  2.79 ± 2.07(0.61–17.09)  3.15 ± 2.44(0.58–16.04)  2.97 ± 2.23(0.58–15.57) 
The dataset studied here exhibits comparable overall levels of variation to
Several broader aspects of the results can be explained by considering the number of standard deviations’ difference required to satisfy various models (Figure
The overall lower differentiation levels in biometrics can in part be explained due to lower sample size (see Tables
The outcomes of using pooled versus unpooled and bare versus controlled effect sizes are substantial across the data set as a whole and can be drastic in individual cases (Tables
The distinction between using pooled versus unpooled standard deviations in taxonomy has passed with barely any discussion in the ornithological taxonomic literature.
Usage of pooled standard deviations, as a matter of statistical methodology, should only be undertaken where the standard deviations of the two populations under comparison can be assumed to be equal. This does not necessarily mean that measured standard deviations of the two populations must be equal, or even close to one another, since these will usually differ for two measured populations as a result of the sampling. However, it must be reasonable to make this assumption in order to apply this method. The pooling formula attributes greater weight to the standard deviation of the population with higher sample size and produces a “weighted average” standard deviation which is closer to that of the population of which there is a larger sample. Degrees of freedom for the pooled standard deviation are greater due to summing those of the two separate populations. In practice, in taxonomy, we will usually have no idea as to whether or not the standard deviations of two populations under comparison are equal or not. Special care should be adopted in using pooled standard deviations where estimated population sizes, molecular or geographical attributes of the two populations vary greatly. For example, comparing an isolated, very small montane population with low intrapopulation molecular variation versus a very widespread lowland population which is known to exhibit substantial clinal variation and has higher intrapopulation molecular variation would be inappropriate, since assumptions underlying the usage of pooled standard deviations are likely not just to be unknown but incorrect. The greater correlation between statistical significance and controlled unpooled effect sizes (Tables
There are still likely to be “use cases” for pooled standard deviations to measure effect sizes in taxonomy.
Most papers concerning the application of statistical tests for determining the taxonomic rank of allopatric populations have noted that statistical significance is not a good measure, due to its potential for liberal satisfaction by increasing sample size, its failure to indicate higher levels of differentiation or false positives when sampling from different parts of a geographical cline; and then move quickly on to discuss better tests (
Although large sample sizes are cited as the basis to reject statistical significance as a useful measure in taxonomy, such problems are rarely faced by taxonomists. Having too few specimens (whether in museums or measured in the field) or sound recordings is likely a more material problem. For new species descriptions in birds published between 1935 and 2009, 332 of 477 (70%) were based on 0–5 specimens, only one was based on >100 specimens and the mean number of specimens was 6 (
The classic test of statistical significance between two populations of data is the Student’s t-test. This evaluates the probability of whether two normally distributed data sets relate to two different populations, by considering whether or not their mean averages are likely to differ from one another. Various other similar tests can assess differences between mean, median, or modal averages, such as F, Mann-Whitney U, Kolmogorov-Smirnov, Wilks’ Lambda, ANOVA, Kruskal-Wallis, and Tukey-Kramer. Some of these tests are better suited to continuous variables which are non-normally distributed, such as ratios or products of raw data.
Although the t-test will evaluate the likelihood that two populations are different, it tells us little about the extent of differences between the two populations. With a large enough sample, the test may be passed even though the two sample means are very close to one another. Here, the lowest distance between statistically significant outcomes was 0.36 effect sizes. Tests of statistical significance can also be failed on data showing effect sizes as high as 18 (Table
Myrmeciza melanoceps: 2.42 ± 0.12 (1.98–2.62) (n = 143)
Myrmeciza goeldii: 2.29 ± 0.08 (2.01–2.49) (n = 173)
The t-test was passed at p<0.0002, yet these data reveal small differences between means and substantial overlaps in recorded values. The t-test result suggests that the two populations in question have begun to diverge from one another, which is interesting and makes it valid to discuss their relationship and possible isolation mechanisms. However, identification of a sound recording to one or the other species on the basis of these data would be impossible. The effect size here was 1.31, considerably in excess of the lowest (0.36) score, but fewer than 50% of individuals could be identified based on this variable and it would be useless for identification. Regularly, diagnosis is incorrectly asserted in the taxonomic literature based on data like those in this example (see citations above).
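The gap between statistical significance and practical identification in this example can be sketched in Python from the summary statistics quoted above. The sketch assumes Welch's unequal-variance form of the t statistic, which is not necessarily the exact variant used in this study; the function names are hypothetical.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic computed from summary statistics only."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

def ranges_overlap(min1, max1, min2, max2):
    """True if the two recorded ranges share any values."""
    return min1 <= max2 and min2 <= max1

# Summary statistics quoted above for M. melanoceps and M. goeldii.
t_stat = welch_t(2.42, 0.12, 143, 2.29, 0.08, 173)
overlap = ranges_overlap(1.98, 2.62, 2.01, 2.49)
print(round(t_stat, 1), overlap)
```

With samples of this size the t statistic is large, yet the recorded range of one population lies wholly within that of the other, which is the situation described in the text.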
The t-test and similar tests demonstrate statistical significance of differences between means. Such differences may have some evolutionary significance. However, a positive t-test is not necessarily of much taxonomic significance (Fig.
In the field of medicine, the outcome of tests of statistical significance is widely understood and accepted to be just a first phase in demonstrating an interesting result. A variety of different approaches exist in medical science which must also be passed to show clinical significance, which, for example, would support the usage of drugs. In any taxonomic study, it is similarly important to move on from the ecology class, beyond statistical significance to consider the taxonomic significance of any results.
That all said, statistical significance can be a tougher test than some proposed measures of differentiation. Instances were found here of pairwise comparisons passing
Scattergraphs of controlled unpooled effect size (x-axis) versus statistical significance (p<x) (y-axis). All outcomes to the right of the two purple lines represent pairwise comparisons which are given scores of at least 1 under the
Introducing the Dunn-Šidák correction (as opposed to the simpler but overly conservative Bonferroni correction) had a virtually negligible effect (Tables
Bonferroni (and Dunn-Šidák) corrections are appropriately applied to “families” of variables. A middle ground of treating voice and biometrics as separate “families” is recommended based on this study. This could be criticized, since certain aspects of voice and biometrics can be linked (e.g.,
Diagnosability was considered to be the most frequently applied criterion to assess rank in a review of over 1000 taxonomic revisions (
In this study, 14.5% of the vocal data set and 6.3% of the biometric data set passed the
There is no consensus as to whether any differentiation below diagnosability for particular characters ought to be recognized in taxonomy.
Originally, 75% diagnosis tests for subspecies used one of two approaches: (i) a 75%/75% test (i.e., 75% of population 1 is diagnosable from 75% of population 2); or (ii) a 75%/99+% test (i.e., 75% of population 1 is diagnosable from essentially all of population 2).
The application of sharp, seemingly arbitrary, tests such as these to classify normally distributed data into segments to which scores are attributed is a situation not unique to taxonomy. Similar hard boundaries are also rife in most education and examination systems. In UK universities, a student scoring 60.1% or 69.9% in an examination will be given the same award (an upper second class degree) but a student attaining 70.1% will get a different award, a first class degree. This is despite the students scoring 69.9% and 70.1% having attained more similar levels of achievement to one another. Whilst any cutoff may be criticized as arbitrarily generous or harsh to outcomes falling close to the line on either side, the application of cutoffs is something that humans tend to do in their quest to categorize things. Where cutoffs are applied, a test of whether the cutoff is a valid one should best be based upon: (a) differentiation of a meaningful number of outcomes; and (b) the setting of boundaries at statistically, mathematically, or biologically meaningful positions.
In this data set, with very large numbers of pairwise comparisons, necessarily many individual cases fall very close to each of the cutoff boundaries proposed by previous models for attributing taxonomic significance, whether at or below diagnosability. Two populations differing by 95% using
In contrast to the 75% (Level 3) test, 50%/95% differentiation (Level 2) measures a mathematically relevant point of differentiation, when the mean of one population moves outside the normal distribution of the other. It also signifies the point at which a population has moved half way towards diagnosability. The number of pairwise comparisons meeting the Level 2 test (but not falling in other buckets) was material but not enormous. Only 30.6% for voice and 20.6% for biometrics of outcomes passed this test at all (Tables
In addition to the diagnosis formula for Level 5,
Traditional interpretations of effect sizes may be appropriately used in other fields but are inappropriate for taxonomic study. It should be borne in mind that the traditional subjective descriptors for effect sizes starting at 0.2 have been developed largely in the fields of social and behavioral science (Cohen 1998,
Overall, this study suggests that: (i)
Solution A:
1 point: Level 1 statistical significance only.
2 points: Level 1 plus Level 2 50% diagnosability.
3 points: Level 1 plus Level 5 full diagnosability (3 points).
4 points: Level 1 plus a new measure of a “species and a half” worth of diagnosability (equivalent to 6 controlled effect sizes).
Solution B: would use more proportionate scoring, eliminate the weighting for statistical significance and allow only three scores:
1.5 points: Level 1 plus Level 2 (2 controlled effect sizes).
3 points: Level 1 plus Level 5 (4 controlled effect sizes).
4.5 points: Level 1 plus 6 controlled effect sizes.
Solution C: would abandon these various cutoffs and instead use controlled unpooled effect sizes, calibrated by a scale factor such that no difference = 0 and full diagnosability = 3 and capped at a score of 4.
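The difference between the stepped scoring of Solution B and the continuous scoring of Solution C can be sketched as follows. This is an illustrative reading of the two proposals rather than a definitive implementation: it assumes that full diagnosability corresponds to 4 controlled effect sizes (as stated under Solution B), and the function names and default parameters are hypothetical.

```python
def solution_b_score(effect_size, significant):
    """Stepped score: 1.5, 3 or 4.5 points at 2, 4 or 6 controlled effect sizes."""
    if not significant:  # Level 1 must be passed but earns no points by itself
        return 0.0
    if effect_size >= 6:
        return 4.5
    if effect_size >= 4:
        return 3.0
    if effect_size >= 2:
        return 1.5
    return 0.0

def solution_c_score(effect_size, full_diagnosability=4.0, cap=4.0):
    """Continuous score scaled so that full diagnosability = 3, capped at 4."""
    return min(cap, 3.0 * abs(effect_size) / full_diagnosability)

# Hypothetical effect sizes falling just either side of a cutoff.
for es in (3.9, 4.1):
    print(es, solution_b_score(es, True), round(solution_c_score(es), 2))
```

Under the stepped approach the two hypothetical comparisons receive quite different scores, whereas under the continuous approach their scores remain close to one another.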
As has been argued elsewhere (e.g.,
For the reasons above, the Level 4 non-overlap test should be abandoned from this framework in order to positively score the 2.5% of vocal outcomes which were diagnosable to 95% but actually overlapped due to very large sample sizes (Table
A disadvantage of the
Logarithmic plot of the same data as in Figure
Graph showing relationship between sample size (xaxis) and numbers of effective SD differences between means or effect sizes (yaxis) required in order to pass a test of diagnosability shown in the legend. Dashed lines represent the four boundaries for affording scores under
Graph illustrating the “hard cutoff” approaches of
One could adapt the
∑[n=1 IF (min s_{1}>max s_{2} OR min s_{2}>max s_{1}) AND ((x̄_{1}–x̄_{2}) > s_{1}(t_{1 @ 97.5%}) + s_{2}(t_{2 @ 97.5%}))]
≤
∑[n=1 IF (min s_{3}>max s_{4} OR min s_{4}>max s_{3}) AND ((x̄_{3}–x̄_{4}) > s_{3}(t_{3 @ 97.5%}) + s_{4}(t_{4 @ 97.5%}))].
Such a hardedged statistical framework would go beyond the recommendations of
The
There is a relatively simple solution to this shortcoming: with continuous data, to move away from models which attribute cutoffs and instead to apply precise scoring under a system which only uses a hard cutoff at the very final point of determining species or subspecies rank. Such an approach was effectively attempted in a recent study (e.g.,
Immediately prior to going to press,
In this and the next section, a new, universal measure of differentiation is developed. It is potentially usable in any taxonomic group where continuous variables are studied and in other contexts to measure effect sizes.
Step 1: identify a comparison group.
For an assessment of the rank of allopatric populations, this method compares: (i) two sympatric and closely related populations which are demonstrably good species and broadly accepted as such (Species 1 and Species 2) as well as (ii) two allopatric populations under study (Population 3 and Population 4). Ideally, Species 1 and Species 2 should also be sister taxa or be known or suspected to be very closely related through molecular studies, such that they represent a good benchmark. However, this may not always be known for certain. Preferably, Species 1 and 2 and Populations 3 and 4 should all be congeneric, but this might not be possible and they might be merely a good example from the same family or order, depending on how speciose the relevant higherlevel taxonomy is. Either (but not both) of Population 3 or Population 4 might be the same as Species 1 or Species 2 or they may all be different populations.
Step 2: collect data for relevant variables using continuous measurements.
It is critical to ensure a fair identification of variables, which adequately and honestly document the maximum possible observed variations between all populations (i.e., not just the allopatric pair, but also the sympatric pair). Variables differentiating sympatric Species 1 and 2 should not be overlooked, even if more time is spent studying allopatric Populations 3 and 4. Returning to the theme of taxonomic significance and not simply statistical significance, it is important that the variables under study are likely to be taxonomically relevant. Field experience or knowledge of the organisms concerned is important to avoid splits or lumps being published based on statistical tests applied to inappropriately selected variables.
Unlike in multivariate statistics, the technique presented here will not require each data set to have the same measures from the same individuals. This means that a biometric data set based on museum specimens and a vocal data set based on a different set of individuals and with different sampling can be combined, so data from all possible sources can be collated and combined. The broadest possible geographical and numerical sampling is important (e.g.,
Step 3: undertake pairwise comparisons using controlled unpooled effect sizes.
The following formula should be applied to measure controlled unpooled effect sizes on a pairwise basis, separately for each population/variable combination under study, e.g., for Species 1 and Species 2:
(x̄_{1}–x̄_{2}) / ¼[s_{1}(t_{1 @ 97.5%}) + s_{2}(t_{2 @ 97.5%})]
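A minimal Python sketch of this calculation, using only the standard library, is given below. Several points are assumptions made for illustration: the degrees of freedom for each sample are taken as n − 1, the absolute difference between means is used, and the t quantile is obtained by numerical integration and bisection rather than from statistical tables. All function names are hypothetical.

```python
import math

def t_cdf(x, df, steps=2000):
    """Student's t CDF for x >= 0, by Simpson's rule on the density over [0, x]."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3

def t_quantile(p, df):
    """Upper quantile (0.5 < p < 1) of Student's t, found by bisection."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:  # widen the bracket until it contains the quantile
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def controlled_unpooled_effect_size(mean1, sd1, n1, mean2, sd2, n2, level=0.975):
    """|mean1 - mean2| divided by a quarter of the summed t-scaled SDs."""
    t1 = t_quantile(level, n1 - 1)  # assumption: df = n - 1 for each sample
    t2 = t_quantile(level, n2 - 1)
    return abs(mean1 - mean2) / (0.25 * (sd1 * t1 + sd2 * t2))

# Summary statistics from the Myrmeciza example given earlier in this paper.
es = controlled_unpooled_effect_size(2.42, 0.12, 143, 2.29, 0.08, 173)
print(round(es, 2))
```

Applied to the rounded summary statistics of the Myrmeciza example, this sketch returns a value close to the effect size of 1.31 reported for that comparison; any small difference may reflect rounding of the published means and standard deviations.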
Step 4: exclude all the statistically insignificant data.
Comparisons showing no statistical significance should be eliminated and scored as 0. This process needs to be conducted separately for each population/variable combination under study: a variable might be scored as zero as between Species 1 and Species 2, but may be scored positively as between Population 3 and Population 4. Bonferroni correction is applied here, in order to keep the formula simple and due to the near-nil impact of using less conservative “type 1” error corrections. It is recommended that different sets or “families” of data (biometric, vocal, colorimetric) are treated separately for purposes of determining the appropriate Bonferroni correction. Other more complex “type 1 error” corrections such as Dunn-Šidák should be considered for situations where very large numbers of variables are compared. The exclusion of statistically insignificant data results in the following modification to the effect size formula above, e.g., for Species 1 and Species 2:
p<0.05/n_{v} → (x̄_{1}–x̄_{2}) / ¼[s_{1}(t_{1 @ 97.5%}) + s_{2}(t_{2 @ 97.5%})]
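The significance filter can be sketched as follows. The function name is hypothetical. The example values are the male song length figures from the worked Myrmeciza example presented later in this paper, on the assumption that n_{v} = 26, i.e., the number of vocal variables listed in that example.

```python
def significance_filtered_score(effect_size, p_value, n_variables, alpha=0.05):
    """Return the effect size if p passes the Bonferroni threshold, else 0."""
    threshold = alpha / n_variables  # Bonferroni-corrected threshold
    return effect_size if p_value < threshold else 0.0


# Male song length: effect size 0.37 with p = 0.0015 for one pair,
# and effect size 0.71 with p = 0.0022 for the other.
n_v = 26  # assumed number of vocal variables in this "family"
kept = significance_filtered_score(0.37, 0.0015, n_v)
dropped = significance_filtered_score(0.71, 0.0022, n_v)
print(kept, dropped)  # 0.37 0.0
```

The second comparison has the larger effect size but is scored as zero, because its p value lies above 0.05/26 (approximately 0.0019).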
Step 5: add up all the results of the above calculations (using a Euclidean approach).
It would be simple then to add up all the effect sizes, as follows, and see whether Species 1 vs Species 2 or Population 3 vs Population 4 had the better score. This would apply the formula:
∑ [p<0.05/n_{v} → (x̄_{1}–x̄_{2}) / ¼[s_{1}(t_{1 @ 97.5%}) + s_{2}(t_{2 @ 97.5%})]]
≤
∑ [p<0.05/n_{v} → (x̄_{3}–x̄_{4}) / ¼[s_{3}(t_{3 @ 97.5%}) + s_{4}(t_{4 @ 97.5%})]]
However, this would be suboptimal statistically. Applying such a formula would reflect the underlying conceptual approach of existing systems to rank allopatric populations (including
(a_{1}, a_{2}, a_{3}, a_{4} …. a_{n}) and (b_{1}, b_{2}, b_{3}, b_{4} … b_{n}),
then Pythagorean principles result in the following calculation of distance between points a and b in multidimensional space:
√[(a_{1}–b_{1})^{2} + (a_{2}–b_{2})^{2} + (a_{3}–b_{3})^{2} … + (a_{n}–b_{n})^{2}]
And this can be simplified to:
√(∑ (a_{n}–b_{n})^{2})
This approach cannot perfectly be applied to a series of effect size measures based on multiple pairwise comparisons, in that such data are not necessarily linked to one another as a set of corresponding coordinates. However, assuming that the variables studied are independent, it is valid to measure distance this way. Independence of variables can be verified through correlation tests and promoted in variable selection by seeking to capture the maximum possible observed variation efficiently.
Each controlled unpooled effect size (that has not been eliminated to zero using the statistical significance filter) can be considered to represent the equivalent of a distance a_{n}b_{n}. The distance in multidimensional space between the two populations is better approximated by squaring each controlled unpooled effect size that has not been excluded for non-significance, summing those squares, and taking the square root of the total, than by simple addition.
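To illustrate the difference between simple addition and the Euclidean approach, the sketch below combines the same sets of scores both ways. The scores are hypothetical and the function names are illustrative only.

```python
import math


def summed_score(scores):
    """Simple addition of effect sizes."""
    return sum(scores)


def euclidean_score(scores):
    """Square root of the sum of squared effect sizes."""
    return math.sqrt(sum(s * s for s in scores))


one_large = [5.0, 0.0, 0.0, 0.0, 0.0]    # one strongly differentiated variable
many_small = [1.0, 1.0, 1.0, 1.0, 1.0]   # five weakly differentiated variables

print(summed_score(one_large), summed_score(many_small))  # 5.0 5.0
print(euclidean_score(one_large))   # 5.0
print(euclidean_score(many_small))  # square root of 5, roughly 2.24
```

Under simple addition the two cases cannot be told apart. Under the Euclidean approach, one large difference counts for more than several small ones, so adding many weakly differentiated variables is rewarded less.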
In studies using continuous variables, allopatric populations should be ranked as species if they show equal to or greater variation than that shown between closely related sympatric species (
Viz, an allopatric population will be a candidate for species rank if:
√(∑ [p<0.05/n_{v} → (x̄_{1}–x̄_{2}) / ¼[s_{1}(t_{1 @ 97.5%}) + s_{2}(t_{2 @ 97.5%})]]^{2})
≤
√(∑ [p<0.05/n_{v} → (x̄_{3}–x̄_{4}) / ¼[s_{3}(t_{3 @ 97.5%}) + s_{4}(t_{4 @ 97.5%})]]^{2})
Where:
Species 1 and Species 2 are two sympatric species that are closely related to one another (preferably known to be sisters) and which are related to Population 3 and Population 4.
Population 3 and Population 4 are two allopatric populations whose rank is being determined.
p: the p-value from Welch’s unequal-variance t-test (or other similar technique for non-normally distributed data), as set out under Level 1 in Methods, testing whether the means of the populations differ.
n_{v}: the number of continuous variables of a particular “family” considered in the study, so as to apply a Bonferroni correction.
1, 2, 3, and 4 refer to relevant data for Species 1, Species 2, Population 3 and Population 4 respectively.
x̄_{1}, x̄_{2}, x̄_{3}, and x̄_{4} are the sample means of a relevant data set for a particular variable for Species 1, Species 2, Population 3 and Population 4, respectively.
s_{1}, s_{2}, s_{3}, and s_{4} are the standard deviations for a relevant data set for a particular variable for Species 1, Species 2, Population 3 and Population 4, respectively.
t refers to the t-value (based on the t-distribution) using a one-sided confidence interval at the percentage specified for the relevant population and variable, with t_{1}, t_{2}, t_{3}, and t_{4} referring to such value for Species 1, Species 2, Population 3 and Population 4, respectively.
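For readers who prefer a compact notation, the criterion may be restated as below. This is only a restatement under assumed notation: the symbols D_{12} and D_{34}, the variable index k, the set K of all variables studied, and the indicator function (equal to 1 when the condition in brackets holds and 0 otherwise) are introduced here for illustration. The divisor n_{v} is taken to be that of the “family” to which variable k belongs.

```latex
\[
D_{ab} \;=\; \sqrt{\sum_{k \in K}
  \left( \mathbf{1}\!\left[\, p_{ab,k} < \frac{0.05}{n_{v}} \right]
  \cdot
  \frac{\bar{x}_{a,k} - \bar{x}_{b,k}}
       {\tfrac{1}{4}\left[\, s_{a,k}\, t_{a,k} + s_{b,k}\, t_{b,k} \right]}
  \right)^{2}}
\]
\[
\text{Populations 3 and 4 are candidates for species rank if}\quad D_{12} \le D_{34}.
\]
```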
Because this formula is not simple to calculate, a spreadsheet is being published alongside this paper on the author’s researchgate.net site, which facilitates rapid calculations.
In some ways, when one looks carefully at the outcomes of different statistical tests undertaken here, the formula is a statement of the obvious. The statistics underlying it are basic. It merely relies upon the good aspects of long-established statistical methods for comparing continuous variables of previous authors (e.g.,
As regards
Some important recommendations should be borne in mind when using this method, in addition to those set out above under “Steps”:
(i) Continuous versus non-continuous variables: Some issues arose in the case studies here due to data gaps. In the studies of Sirystes and Basileuterus, non-homologous vocalizations were not compared with one another. There, populations with different measures for the same sorts of variables are recovered as more differentiated than those populations whose variables cannot validly be compared at all. Diagnosability based on the comparison of non-homologous vocalizations can be important, but it can also have pitfalls (notably, Chaves et al. 2000 relied on discrete variation in calls to claim sufficient differentiation under the
(ii) Sample sizes: If either Species 1 or Species 2 has a very low sample size, then: (a) for many variables under study, data may not meet the threshold test of statistical significance; and (b) those which do could be affected by low standard deviations and inflated effect sizes caused by clustering. These issues apply in reverse where Population 3 or Population 4 suffers from such constraints. Although the test above is in principle sample size-neutral, caution should be exercised in interpreting results based on smaller samples (see Myrmeciza biometrics discussion below and Table
(iii) Scale factoring and manipulation through overloading: Where there are 15 vocal measures and 5 biometric measures, it should be considered whether to weight scoring on a 50:50 basis, following
(iv) Not going over the top: The formula presented here is proposed for usage in more difficult, borderline, or complicated cases. Where simpler studies can show allopatric populations or newly discovered populations to be very different indeed from one another, then there should be no need for a litany of statistical analyses to be undertaken. It should be appropriate in some cases simply for an author to publish photographs of specimens or sonograms or a brief subjective text to describe the differences observed. A good example of a situation in this category would be the allopatric Western and Eastern Woodhaunters Automolus virgatus and A. subulatus, whose vocalizations resemble one another not one iota (
(v) Possible use of controlled pooled effect size: The formula above does not use pooled standard deviations and so makes no assumptions about the comparability of the variances of the different populations under study. As discussed above, there may be use cases for controlled pooled effect size, especially as a hedge for small sample sizes, but this should be applied only with caution. Where equal SDs can be assumed among all four populations 1, 2, 3 and 4, a more complicated formula using controlled pooled standard deviations might be used instead (see Materials and methods for details of equations that may be substituted in).
Myrmeciza was chosen here as an example because the recommendations of the relevant paper (
Table
Worked-through example of the new formula proposed herein, assessing the rank of two allopatric Myrmeciza antbirds (M. immaculata vs M. zeledoni) by comparison to a pair of sympatric sister taxa in the same genus (M. goeldii vs. M. melanoceps), using vocal data only.
Variable  M. goeldii vs. M. melanoceps  M. immaculata vs. M. zeledoni
Controlled unpooled effect size  p value  Score  Controlled unpooled effect size  p value  Score
Male song
No. of notes  1.41  5.2 × 10^{-29}  1.41  4.54  6.92 × 10^{-33}  4.54
Song length  0.37  0.0015  0.37  0.71  0.0022  0
Song speed  2.05  3.9 × 10^{-49}  2.05  7.65  1.03 × 10^{-89}  7.65
Max. acoustic frequency second note  1.03  2.7 × 10^{-16}  1.03  3.30  7.20 × 10^{-18}  3.30
Max. acoustic frequency of last note  1.31  3.9 × 10^{-23}  1.31  2.88  2.03 × 10^{-15}  2.88
Change in acoustic frequency  0.52  3.7 × 10^{-5}  0.52  1.36  2.62 × 10^{-8}  1.36
Position of peak of frequency  4.75  4.5 × 10^{-74}  4.75  0.08  0.76  0
Position of trough in frequency  3.83  9.5 × 10^{-55}  3.83  0.03  0.91  0
Single note call
Call length  0.91  0.00270  0  0.99  0.040  0
Maximum acoustic frequency  0.99  0.00103  0.99  0.37  0.16  0
Multi-note call
No. of notes  0.11  0.770  0  0.01  0.99  0
Song length  0.06  0.880  0  0.22  0.57  0
Song speed  0.21  0.494  0  0.21  0.56  0
Max. acoustic frequency  0.31  0.371  0  1.71  0.00049  1.71
Min. acoustic frequency  0.19  0.529  0  2.29  0.00013  2.29
Change in acoustic frequency  0.24  0.447  0  0.45  0.25  0
Position of peak of frequency  0.49  0.396  0  0.65  0.12  0
Position of trough in frequency  0.02  0.946  0  0.12  0.74  0
Female song
No. of notes  0.78  0.0073  0  5.52  2.41 × 10^{-15}  5.52
Song length  0.60  0.034  0  1.00  0.023  0
Song speed  1.18  7.07 × 10^{-5}  1.18  7.00  1.50 × 10^{-19}  7.00
Max. acoustic frequency second note  0.61  0.031  0  1.62  0.0034  0
Max. acoustic frequency of last note  1.14  0.00020  1.14  2.61  3.86 × 10^{-7}  2.61
Change in acoustic frequency  0.23  0.40  0  0.71  0.18  0
Position of peak of frequency  0.04  0.89  0  0.96  0.11  0
Position of trough in frequency  0.27  0.33  0  0.47  0.41  0
Euclidean distance (square root of sum of the squares)  7.09 (7.14 using data to more s.f.)  13.95 (13.75 using data to more s.f.)
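The final row of the table can be checked with a short calculation. The sketch below copies the non-zero Score values for each pair from the table and takes the square root of the sum of their squares; the variable and function names are illustrative only.

```python
import math

# Non-zero "Score" values copied from the table above.
goeldii_vs_melanoceps = [1.41, 0.37, 2.05, 1.03, 1.31, 0.52, 4.75, 3.83,
                         0.99, 1.18, 1.14]
immaculata_vs_zeledoni = [4.54, 7.65, 3.30, 2.88, 1.36, 1.71, 2.29,
                          5.52, 7.00, 2.61]


def euclidean_score(scores):
    """Square root of the sum of squared scores."""
    return math.sqrt(sum(s * s for s in scores))


sympatric = euclidean_score(goeldii_vs_melanoceps)
allopatric = euclidean_score(immaculata_vs_zeledoni)
print(round(sympatric, 2), round(allopatric, 2))  # 7.09 13.95
print(sympatric <= allopatric)  # True
```

Using the two-decimal values shown in the table, the allopatric pair scores higher than the sympatric comparator pair, so the criterion for species rank is met on vocal data.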
Table
Full scores across the Myrmeciza data set for vocal data only. Bold denotes a sympatric pair of sister taxa. Bold italics denote other sympatric pairs. Asterisks denote pairs which are subspecies based on overall scoring and discrete characters (see also Table
Taxon  M. i. concepcion  M. z. macrorhyncha  M. z. zeledoni  M. fortis  M. goeldii  M. melanoceps
M. i. immaculata  3.56*  12.22  11.23  28.00  18.15  14.60
M. i. concepcion  -  13.75  15.64  19.51  21.49  15.50
M. z. macrorhyncha  -  -  4.08*  23.74  21.13  17.85
M. z. zeledoni  -  -  -  28.45  28.15  24.39
M. fortis  -  -  -  -  22.81  12.73
M. goeldii  -  -  -  -  -  7.14
Biometric scores are shown in Table
Scores across the Myrmeciza data set for biometrics. All goeldii scores (sample size n=2 specimens) were actually zero. Square bracketed figures shown alongside M. goeldii are based on controlled effect sizes without deleting insignificant data and are presented for reference only. Bold denotes a sympatric pair of sister taxa. Bold italics denote other sympatric pairs. Asterisks denote subspecies based on overall scoring (see also Table
Taxon  M. i. concepcion  M. z. macrorhyncha  M. z. zeledoni  M. fortis  M. goeldii  M. melanoceps
M. i. immaculata  0*  3.02  1.72  3.35  [5.36]  5.96
M. i. concepcion  -  2.49  0  3.01  [5.13]  5.72
M. z. macrorhyncha  -  -  1.64*  2.22  [3.01]  5.07
M. z. zeledoni  -  -  -  2.54  [4.59]  5.35
M. fortis  -  -  -  -  [3.47]  3.58
M. goeldii  -  -  -  -  -  [3.52]
As above, although a universal formula is proposed here, no universal score is proposed for ranking species, since the differentiation required to rank a species is likely to vary depending on the number of variables studied and by taxonomic group (
There are however some parameters and examples available from the case studies (Table
Scores of examples from the data set which are both (i) sister species (or relevant sympatric subspecies of sister species) as shown by molecular studies; and (ii) sympatric, to show ranges of scores. Also presented are examples of scores passing other authors’ species tests. Note the
Sympatric pair or proposed score for species rank  Type of data  Score 

Myrmeciza goeldii vs Myrmeciza melanoceps  Voice  7.14 
Scytalopus griseicollis griseicollis vs. Scytalopus spillmanni undescribed East Andes population  Voice + biometrics = total  9.16 + 0 = 9.16 
Scytalopus griseicollis gilesi vs. Scytalopus spillmanni undescribed East Andes population  Voice + biometrics = total  10.59 + 0 = 10.59 
Scytalopus griseicollis morenoi vs. Scytalopus spillmanni undescribed East Andes population  Voice + biometrics = total  8.79 + 0 = 8.79 
Grallaricula ferrugineipectus Venezuela vs G. nana nanitaea Merida Andes  Voice + biometrics = total  7.90 + 8.01 = 15.91 
Average ± s.d. (min.–max.) (n=sample number)  10.32 ± 3.36 (7.14–15.91) (n=5)  
Basis for 
Voice: diagnosability of three characters = 3 × 4 SD  6.92 
A basis for 
Voice or biometrics: 1 × 10 SD (score 4), 1 × 5 SD (score 3)  11.18 
A basis for 
Voice and biometrics: 1 × 10 SD (score 4), 1 × 2 SD (score 2), 1 × 0.2 SD (score 1)  10.20
A basis for 
Voice and biometrics: 1 × 10 SD (score 4), 3 × 0.2 SD (score 1 each)  10.01 
A basis for 
Voice and biometrics: 2 × 5 SD (score 3 each), 1 × 0.2 SD (score 1)  7.07 
A basis for 
Voice and biometrics: 1 × 5 SD (score 3), 2 × 2 SD (score 2 each)  5.74 
A basis for 
Voice and biometrics: 1 × 5 SD (score 3), 1 × 2 SD (score 2), 2 × 0.2 SD (score 1 each)  5.39 
A basis for 
Voice and biometrics: 3 × 2 SD (score 2 each), 1 × 0.2 SD (score 1)  3.47
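The summary row and the benchmark rows of the table above can be checked as follows. This sketch assumes that each benchmark score is obtained by treating the stated differences (in standard deviations) as effect sizes and combining them in the same Euclidean way as the formula; the function name is illustrative only.

```python
import math
import statistics

# Totals for the five sympatric pairs listed in the table above.
sympatric_totals = [7.14, 9.16, 10.59, 8.79, 15.91]
mean_total = statistics.mean(sympatric_totals)
sd_total = statistics.stdev(sympatric_totals)  # sample standard deviation
print(round(mean_total, 2), round(sd_total, 2))  # 10.32 3.36


def benchmark_score(sd_differences):
    """Euclidean combination of differences expressed in standard deviations."""
    return math.sqrt(sum(d * d for d in sd_differences))


print(round(benchmark_score([10, 5]), 2))    # 11.18
print(round(benchmark_score([5, 2, 2]), 2))  # 5.74
```

On this reading, a single difference of 10 SD dominates any benchmark in which it appears, which is why the three rows that include one score alongside minor differences all lie between 10 and 11.2.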
As regards subspecies or PSC species, any score of 4 or more (i.e., allowing full diagnosis in multidimensional space) would be a supportable benchmark. There may however be cases of valid subspecies which achieve lower scores than this, such as in the Adelomyia melanogenys study where the pair discussed above scored only 2.10 for voice and 0 for biometrics, based on a fairly exhaustive attempt at measuring biometric and vocal variables. This is probably an exceptionally low-scoring example.
Among the unnamed populations in the study group, only the “Apurímac south” population of Basileuterus tristriatus in Peru was recovered as diagnosable versus all proximate subspecies (6.33 versus Marañon to Apurímac population, 14.79 versus Bolivia) and therefore requires formal description. Other notable unnamed populations include the Tamá population of Grallaricula nana (scores 3.32 versus Mérida) and the West Andes populations of the same species (scores 2.86 against Central Andes). The two new Grallaricula taxa described in
Probably the most difficult taxonomic decision in this series of papers was that of how to rank Scytalopus rodriguezi, whose allopatric subspecies scored 5.40 for biometrics and 5.01 for voice, total 10.41 and so was more diagnosable than some sympatric tapaculos. A large component of this score (compared to sympatric pairs) was in biometrics and the two populations were found to respond to one another’s playback. In borderline cases such as this, where different kinds of variables differ between the sympatric pair and allopatric pair, then scale factoring may be appropriate. Voice is a very important character for tapaculos and in this case, the vocal score, whilst showing full diagnosis, did not attain the differentiation shown between known sympatric comparators.
The
The method proposed here involves no universal score for species rank. However, it would still be interesting to see how other pairwise situations involving sympatric sister species measure up under this system, and then possibly to revisit the philosophy underlying
Thanks to Michael Patten, Mort Isler, and George Sangster for helpful comments on this paper. The inclusion of a Bonferroni correction in the Level 1 test was suggested by F. Gary Stiles (in litt. 2007). Michael Patten made various insightful comments which resulted in addition of all the analyses and methods involving pooled standard deviation data and the various additional “type 1” corrections. Some of the methods underlying this paper were developed in papers published together with Jorge Avendaño, Paul Salaman, and others. Thanks as always to Blanca and Lucas for their love, good cheer, and encouragement.