Compare means across three or more groups. Results include F-statistic, p-value, eta-squared, and Bonferroni post-hoc tests in APA format.
ANOVA, which stands for Analysis of Variance, is a foundational statistical method used to compare means across three or more independent groups and determine whether any of those group means are statistically different from one another. While a t-test is limited to comparing two groups at a time, ANOVA generalizes this comparison to any number of groups in a single, unified test, controlling the Type I error rate that would otherwise inflate if you ran multiple pairwise t-tests instead.
The technique was pioneered by Sir Ronald A. Fisher in the 1920s while working at the Rothamsted Experimental Station in England. Fisher developed ANOVA to analyze agricultural experiments involving multiple treatments applied to crop yields. His 1925 book, Statistical Methods for Research Workers, introduced the F-distribution and the F-test—named in his honor—which remain the mathematical backbone of every ANOVA today. Over the following century, ANOVA became the workhorse of experimental research in psychology, medicine, education, biology, marketing, and virtually every empirical discipline.
At its core, ANOVA works by partitioning the total variability in your data into two components: between-group variance (the variability due to differences among group means) and within-group variance (the variability due to individual differences within each group, also called error or residual variance). The ratio of these two variance estimates produces the F-statistic. When the between-group variance is substantially larger than the within-group variance, the F-value will be large, and the corresponding p-value will be small—indicating that at least one group mean differs significantly from the others.
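To make this partitioning concrete, here is a minimal sketch in Python; the three groups are hypothetical numbers chosen purely for illustration, and scipy.stats.f_oneway would return the same F:

```python
import numpy as np

# Hypothetical scores for three groups (any number of groups works the same way).
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([7.0, 8.0, 6.0, 7.0]),
    np.array([9.0, 10.0, 11.0, 10.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total sample size

# Between-group SS: squared deviations of group means from the grand mean,
# weighted by group size.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group SS: squared deviations of each score from its own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)      # df_between = k - 1
ms_within = ss_within / (n_total - k)  # df_within = N - k
f_stat = ms_between / ms_within        # large F: means differ more than noise predicts

print(f"F({k - 1}, {n_total - k}) = {f_stat:.2f}")  # F(2, 9) = 38.00
```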
A one-way ANOVA (also called a single-factor ANOVA) tests whether the means of three or more independent groups differ when there is only one independent variable (factor). For example, a clinical researcher might compare pain-relief scores across three drug treatments, or an educator might compare exam performance across four different teaching methods. The "one-way" label indicates that only one grouping variable is being examined. If you have two or more factors (e.g., drug type and dosage), you would need a two-way or factorial ANOVA, which is beyond the scope of this calculator.
The one-way ANOVA produces a single F-statistic with two degrees of freedom: df_between (the number of groups minus one) and df_within (the total sample size minus the number of groups). A significant F-value tells you that at least one group mean is different, but it does not tell you which groups differ from each other. That is the job of post-hoc tests.
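For example, with k = 3 groups and N = 30 total observations, as in the worked example below, df_between = 3 − 1 = 2 and df_within = 30 − 3 = 27, which is why that result is reported as F(2, 27).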
When the omnibus ANOVA F-test is statistically significant, you know that the group means are not all equal—but you need post-hoc (Latin for "after this") comparisons to pinpoint exactly which pairs of groups differ. This calculator uses the Bonferroni correction, one of the most widely used and conservative post-hoc methods. The Bonferroni procedure divides the desired alpha level (typically .05) by the number of pairwise comparisons, ensuring that the overall familywise error rate stays below .05 even when multiple comparisons are made. For three groups there are three pairwise comparisons, so each comparison is evaluated at α = .05 / 3 = .0167. This conservatism protects against false positives, though it can be slightly less powerful than alternatives like Tukey's HSD when you have many groups.
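A sketch of the procedure with hypothetical data: each pairwise t-test uses the pooled within-group variance (the ANOVA's MS_within), and the raw p-values are multiplied by the number of comparisons, which is equivalent to testing each raw p against α / m:

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical groups; in practice these come from your own data.
groups = {
    "A": np.array([4.0, 5.0, 6.0, 5.0]),
    "B": np.array([7.0, 8.0, 6.0, 7.0]),
    "C": np.array([9.0, 10.0, 11.0, 10.0]),
}

n_total = sum(g.size for g in groups.values())
df_within = n_total - len(groups)

# Pooled within-group variance (the ANOVA MS_within).
ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values()) / df_within

pairs = list(combinations(groups, 2))
m = len(pairs)  # 3 pairwise comparisons for 3 groups

for name1, name2 in pairs:
    g1, g2 = groups[name1], groups[name2]
    se = np.sqrt(ms_within * (1 / g1.size + 1 / g2.size))
    t = (g1.mean() - g2.mean()) / se
    p_raw = 2 * stats.t.sf(abs(t), df_within)
    p_adj = min(p_raw * m, 1.0)  # Bonferroni: multiply raw p by m, cap at 1
    print(f"{name1} vs. {name2}: diff = {g1.mean() - g2.mean():+.2f}, p_adj = {p_adj:.4f}")
```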
A pharmaceutical researcher wants to compare the effectiveness of two active drugs against a placebo on pain reduction (measured on a 0–100 visual analog scale). Thirty patients are randomly assigned to one of three groups (n = 10 per group).
Drug A (n = 10)
72, 68, 75, 71, 69, 74, 70, 73, 67, 71
M = 71.00, SD = 2.58
Drug B (n = 10)
65, 60, 63, 62, 67, 64, 61, 66, 63, 59
M = 63.00, SD = 2.58
Placebo (n = 10)
55, 58, 52, 57, 54, 59, 53, 56, 51, 55
M = 55.00, SD = 2.58
ANOVA Summary Table
| Source | SS | df | MS | F | p |
|---|---|---|---|---|---|
| Between Groups | 1280.00 | 2 | 640.00 | 96.00 | < .001 |
| Within Groups | 180.00 | 27 | 6.67 | | |
| Total | 1460.00 | 29 | | | |
Results
F(2, 27) = 96.00, p < .001, η² = .88
The effect size (η² = .88) is very large, indicating that approximately 88% of the total variance in pain scores is accounted for by group membership.
Bonferroni Post-Hoc Comparisons
| Comparison | Mean Diff | p (adjusted) | Significant? |
|---|---|---|---|
| Drug A vs. Drug B | 8.00 | < .001 | Yes |
| Drug A vs. Placebo | 16.00 | < .001 | Yes |
| Drug B vs. Placebo | 8.00 | < .001 | Yes |
All three pairwise comparisons were statistically significant after Bonferroni correction. Drug A produced the highest pain reduction, followed by Drug B, with the Placebo group reporting the least improvement.
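These figures can be verified in a few lines with scipy; the data are the thirty scores listed above:

```python
from scipy import stats

drug_a  = [72, 68, 75, 71, 69, 74, 70, 73, 67, 71]
drug_b  = [65, 60, 63, 62, 67, 64, 61, 66, 63, 59]
placebo = [55, 58, 52, 57, 54, 59, 53, 56, 51, 55]

f_stat, p_value = stats.f_oneway(drug_a, drug_b, placebo)
print(f"F(2, 27) = {f_stat:.2f}, p = {p_value:.2e}")  # F(2, 27) = 96.00

# Eta-squared from the sums of squares in the summary table:
ss_between, ss_total = 1280.00, 1460.00
print(f"eta-squared = {ss_between / ss_total:.2f}")   # 0.88
```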
Choosing the right statistical test depends on the number of groups, the nature of your data, and whether your measurements are independent or repeated. The table below summarizes the most common scenarios and the recommended test for each; a short code sketch mapping these scenarios onto common library calls follows the table.
| Situation | Groups | Recommended Test | Notes |
|---|---|---|---|
| Comparing 2 independent group means | 2 | Independent samples t-test | Welch's t-test recommended as default |
| Comparing 3+ independent group means | 3+ | One-way ANOVA | Follow up with post-hoc tests if significant |
| Non-normal data, 3+ independent groups | 3+ | Kruskal-Wallis H test | Non-parametric alternative to one-way ANOVA |
| Same subjects measured across 3+ conditions | 3+ | Repeated Measures ANOVA | Accounts for within-subject correlation |
| Non-normal repeated measures, 3+ conditions | 3+ | Friedman test | Non-parametric alternative to RM-ANOVA |
| Two or more factors simultaneously | Varies | Two-way / Factorial ANOVA | Tests main effects and interactions |
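As a rough orientation, here is how several of these scenarios map onto calls in Python's scipy library. This is a sketch with hypothetical samples; note that repeated-measures and factorial ANOVA are not in scipy and are typically run through statsmodels or pingouin instead:

```python
from scipy import stats

# Hypothetical samples standing in for your real groups or conditions.
g1, g2, g3 = [5, 6, 7, 6], [7, 8, 9, 8], [9, 10, 11, 10]

stats.ttest_ind(g1, g2, equal_var=False)  # 2 independent groups (Welch's t-test)
stats.f_oneway(g1, g2, g3)                # 3+ independent groups (one-way ANOVA)
stats.kruskal(g1, g2, g3)                 # 3+ independent groups, non-parametric
stats.friedmanchisquare(g1, g2, g3)       # 3+ repeated conditions, non-parametric
```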
A common mistake is to run multiple t-tests instead of ANOVA when you have three or more groups. With three groups, you would need three pairwise t-tests, each at α = .05. The probability of at least one false positive rises to approximately 1 − (1 − .05)³ ≈ .14, nearly three times the intended error rate. ANOVA avoids this problem by testing all groups in a single omnibus test.
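The inflation is easy to check by simulation. A sketch follows; note that the textbook figure 1 − (1 − .05)³ assumes independent tests, and because the three pairwise tests share data, the simulated familywise rate usually lands slightly below .14, but still well above the nominal .05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_sims, false_positive_runs = 0.05, 10_000, 0

for _ in range(n_sims):
    # Three groups drawn from the SAME population, so H0 is true for every pair.
    a, b, c = (rng.normal(0.0, 1.0, 10) for _ in range(3))
    pairwise_p = [stats.ttest_ind(x, y).pvalue for x, y in ((a, b), (a, c), (b, c))]
    if min(pairwise_p) < alpha:  # at least one spurious "significant" result
        false_positive_runs += 1

print(f"independence bound: {1 - (1 - alpha) ** 3:.3f}")                # 0.143
print(f"simulated familywise rate: {false_positive_runs / n_sims:.3f}")
```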
Before interpreting your ANOVA results, you should verify that the following four assumptions are reasonably met. Violating these assumptions can lead to inaccurate p-values and unreliable conclusions.
1. Independence of Observations
Each observation must be independent of every other observation. This means one participant's score should not influence another's. Independence is guaranteed by proper experimental design—random assignment to groups and no clustering or nesting of participants. Violations are common in classroom studies (students in the same class are not independent) and longitudinal designs. If observations are not independent, consider a mixed-effects model or repeated measures ANOVA instead.
2. Normality
The dependent variable should be approximately normally distributed within each group. You can assess normality visually using histograms or Q-Q plots, or formally using the Shapiro-Wilk test. However, ANOVA is remarkably robust to violations of normality when sample sizes are moderate to large (roughly n ≥ 20 per group) thanks to the Central Limit Theorem. For severely skewed data with small samples, use the Kruskal-Wallis H test as a non-parametric alternative.
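For example, using scipy's Shapiro-Wilk implementation on one group at a time (here the Drug A scores from the worked example):

```python
from scipy import stats

drug_a = [72, 68, 75, 71, 69, 74, 70, 73, 67, 71]

w, p = stats.shapiro(drug_a)
if p > 0.05:
    print(f"W = {w:.3f}, p = {p:.3f}: no evidence against normality")
else:
    print(f"W = {w:.3f}, p = {p:.3f}: consider Kruskal-Wallis instead")
```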
3. Homogeneity of Variance (Homoscedasticity)
The variance of the dependent variable should be approximately equal across all groups. This assumption is tested using Levene's test: a non-significant Levene's test (p > .05) suggests that variances are sufficiently equal. As a rule of thumb, ANOVA is robust to unequal variances when group sizes are equal. When group sizes are unequal and Levene's test is significant, consider using Welch's ANOVA (which does not assume equal variances) or the Brown-Forsythe test as alternatives.
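A quick check with scipy, using the worked-example data. Note that scipy.stats.levene with center='mean' is the classic Levene's test, while its default center='median' gives the Brown-Forsythe variant; Welch's ANOVA itself is not in scipy and is provided by packages such as pingouin:

```python
from scipy import stats

drug_a  = [72, 68, 75, 71, 69, 74, 70, 73, 67, 71]
drug_b  = [65, 60, 63, 62, 67, 64, 61, 66, 63, 59]
placebo = [55, 58, 52, 57, 54, 59, 53, 56, 51, 55]

stat, p = stats.levene(drug_a, drug_b, placebo, center="mean")
print(f"Levene: W = {stat:.2f}, p = {p:.3f}")
# p > .05 suggests the equal-variance assumption is reasonable.
```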
4. Interval or Ratio Scale of Measurement
The dependent variable must be measured on a continuous scale (interval or ratio). ANOVA relies on computing means and variances, which are only meaningful for continuous data. If your dependent variable is ordinal (e.g., rankings or Likert-scale items), use the Kruskal-Wallis test. If your outcome is categorical (e.g., pass/fail), use the chi-square test instead.
While the p-value tells you whether the group differences are statistically significant, eta-squared (η²) tells you how large those differences are in practical terms. Eta-squared represents the proportion of total variance in the dependent variable that is explained by group membership. It is calculated as η² = SS_between / SS_total. An η² of .14, for example, means that 14% of the variability in scores is attributable to the grouping variable.
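Using the drug example above: η² = SS_between / SS_total = 1280.00 / 1460.00 ≈ .88, matching the value reported in the Results section.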
Reporting effect sizes is essential because with large enough samples, even trivially small differences can yield significant p-values. Cohen (1988) provided the following widely used benchmarks for interpreting η²:
| η² Value | Interpretation | Practical Meaning |
|---|---|---|
| 0.01 | Small | ~1% of variance explained; groups differ only slightly |
| 0.06 | Medium | ~6% of variance explained; a meaningful, noticeable difference |
| 0.14 | Large | ~14%+ of variance explained; a substantial, important difference |
Note: Some researchers prefer partial eta-squared (ηp²) or omega-squared (ω²) as less biased alternatives, especially for complex factorial designs. For one-way ANOVA with a single factor, eta-squared and partial eta-squared are identical. Omega-squared provides a slightly more conservative estimate and is preferred by some journals.
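For reference, one common formula for omega-squared in a one-way design is ω² = (SS_between − df_between × MS_within) / (SS_total + MS_within). Applied to the drug example, ω² = (1280.00 − 2 × 6.67) / (1460.00 + 6.67) ≈ .86, slightly below η² = .88, which illustrates its more conservative correction.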
According to APA 7th edition guidelines, ANOVA results should include the F-statistic, both degrees of freedom, the p-value, and an effect size measure. Descriptive statistics (means and standard deviations) for each group should also be reported. Here are templates with worked examples:
Omnibus F-Test (One-Way ANOVA)
A one-way ANOVA revealed a statistically significant difference in pain-reduction scores across the three treatment conditions, F(2, 27) = 96.00, p < .001, η² = .88. Drug A (M = 71.00, SD = 2.58) produced significantly higher scores than Drug B (M = 63.00, SD = 2.58) and Placebo (M = 55.00, SD = 2.58).
Post-Hoc Comparisons (Bonferroni)
Bonferroni-corrected post-hoc comparisons indicated that Drug A (M = 71.00, SD = 2.58) produced significantly greater pain reduction than Drug B (M = 63.00, SD = 2.58), p < .001, mean difference = 8.00, 95% CI [5.63, 10.37], and significantly greater pain reduction than Placebo (M = 55.00, SD = 2.58), p < .001, mean difference = 16.00, 95% CI [13.63, 18.37]. Drug B also produced significantly higher scores than Placebo, p < .001, mean difference = 8.00, 95% CI [5.63, 10.37].
Note: Report F-values to two decimal places. Report p-values to three decimal places, except use p < .001 when the value is below .001. Always italicize statistical symbols (F, p, M, SD, η²). Report exact p-values (e.g., p = .034) rather than inequalities (e.g., p < .05) whenever possible, except for values below .001.
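To illustrate these rules in code, here is a small hypothetical helper; the function name and signature are our own, not part of any APA tooling:

```python
def format_anova_result(f: float, df1: int, df2: int, p: float, eta_sq: float) -> str:
    """Format a one-way ANOVA result following the APA rules described above."""
    # Drop the leading zero for p and eta-squared, since neither can exceed 1.
    p_text = "p < .001" if p < 0.001 else f"p = {p:.3f}".replace("0.", ".")
    eta_text = f"{eta_sq:.2f}".replace("0.", ".")
    return f"F({df1}, {df2}) = {f:.2f}, {p_text}, η² = {eta_text}"

print(format_anova_result(96.00, 2, 27, 1e-12, 0.88))
# F(2, 27) = 96.00, p < .001, η² = .88
```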
StatMate's one-way ANOVA calculations have been validated against R's aov() and summary() functions as well as SPSS GLM output. We use the jstat library for the F-distribution and compute Bonferroni-corrected pairwise comparisons using pooled within-group variance. All F-statistics, p-values, eta-squared values, and post-hoc results match R and SPSS output to at least 4 decimal places. Degrees of freedom are computed using standard formulas: df_between = k − 1 and df_within = N − k, where k is the number of groups and N is the total sample size.
- T-Test: Compare means between two groups
- Chi-Square: Test categorical associations
- Correlation: Measure relationship strength
- Descriptive: Summarize your data
- Sample Size: Power analysis & sample planning
- One-Sample T: Test against a known value
- Mann-Whitney U: Non-parametric group comparison
- Wilcoxon: Non-parametric paired test
- Regression: Model X-Y relationships
- Multiple Regression: Multiple predictors
- Cronbach's Alpha: Scale reliability
- Logistic Regression: Binary outcome prediction
- Factor Analysis: Explore latent factor structure
- Kruskal-Wallis: Non-parametric 3+ group comparison
- Repeated Measures: Within-subjects ANOVA
- Two-Way ANOVA: Factorial design analysis
- Friedman Test: Non-parametric repeated measures
- Fisher's Exact: Exact test for 2×2 tables
- McNemar Test: Paired nominal data test