Why the Kruskal-Wallis Test Matters
The Kruskal-Wallis H test is one of the most widely used nonparametric procedures in the social and health sciences. It serves as the rank-based alternative to the one-way ANOVA, allowing researchers to compare the distributions of a continuous or ordinal dependent variable across three or more independent groups — without requiring the assumption of normality.
This matters because real-world data frequently violate the assumptions that parametric tests demand. Pain ratings on a 0-10 scale are often skewed. Likert-type survey responses are ordinal by nature. Clinical outcome measures in small-sample pilot studies rarely produce textbook-normal distributions. In all these situations, the Kruskal-Wallis test provides a valid and robust inferential framework.
Despite its popularity, many researchers struggle with the reporting side. APA 7th edition has clear expectations for how to report the Kruskal-Wallis test: the H statistic with degrees of freedom, exact p values, an appropriate effect size, and — when the omnibus test is significant — post-hoc pairwise comparisons with correction for multiple testing. Omitting any of these elements is a common reason for reviewer criticism and revision requests.
This guide walks through every component of a complete Kruskal-Wallis report, from justification to post-hoc analysis, with copy-paste APA templates you can adapt to your own data.
When to Use Kruskal-Wallis vs One-Way ANOVA
Choosing between the Kruskal-Wallis test and the one-way ANOVA is not a matter of preference — it depends on whether your data satisfy the parametric assumptions. Understanding when each test is appropriate protects both the validity and the statistical power of your analysis.
Non-Normal Distributions Across Three or More Groups
The one-way ANOVA assumes that the dependent variable is approximately normally distributed within each group. When a Shapiro-Wilk test yields p < .05 in one or more groups, or when Q-Q plots reveal substantial departures from normality, the Kruskal-Wallis test is the appropriate alternative. This is especially important with small samples (n < 20 per group), where the central limit theorem provides less protection against non-normality.
Ordinal Dependent Variable
If your outcome variable is measured on an ordinal scale — individual Likert items, pain severity ratings, satisfaction categories — the Kruskal-Wallis test is the correct choice regardless of the distribution shape. Ordinal data lack the equal-interval property that the ANOVA's mean-based calculations require. The Kruskal-Wallis test operates on ranks, which respects the ordinal nature of the data.
Unequal Variances and Outliers
Even when distributions are roughly normal, severely unequal variances across groups (Levene's test p < .05) can undermine the ANOVA's validity. Similarly, extreme outliers distort group means and inflate error terms. The Kruskal-Wallis test is resistant to both problems because it converts raw scores to ranks, which compresses the influence of extreme values.
Decision Flowchart
Use this decision sequence to select the appropriate test:
- Three or more independent groups? If no, use a Mann-Whitney U test (two groups) or Wilcoxon signed-rank test (paired data).
- Ordinal dependent variable? If yes, use Kruskal-Wallis.
- Normality assumption met in all groups? Test with Shapiro-Wilk. If violated in any group, use Kruskal-Wallis.
- Homogeneity of variances satisfied? Test with Levene's test. If violated, use Kruskal-Wallis (or Welch's ANOVA as a parametric alternative).
- Severe outliers present? If yes, use Kruskal-Wallis.
- All assumptions met? Use one-way ANOVA for greater statistical power.
When all ANOVA assumptions are satisfied, the Kruskal-Wallis test retains approximately 95% of the power of the parametric test (its asymptotic relative efficiency under normality is 3/π ≈ .955). Defaulting to Kruskal-Wallis as a "safe" choice therefore sacrifices some ability to detect real differences. Always check assumptions first and select the test that matches your data.
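The decision sequence above can be sketched with scipy.stats. This is an illustrative helper only (the function name and the alpha default are my own, and it is no substitute for inspecting Q-Q plots and outliers yourself):

```python
from scipy import stats

def choose_test(groups, alpha=0.05):
    """Sketch of the decision sequence: check normality in every group,
    then homogeneity of variances, and fall back to Kruskal-Wallis on
    any violation. Illustrative only."""
    # Normality in each group (Shapiro-Wilk)
    if any(stats.shapiro(g).pvalue < alpha for g in groups):
        return "kruskal"
    # Homogeneity of variances (Levene's test)
    if stats.levene(*groups).pvalue < alpha:
        return "kruskal"  # or Welch's ANOVA as a parametric alternative
    return "anova"
```

A severely skewed group (for example, mostly low scores plus extreme outliers) fails the Shapiro-Wilk check and routes the analysis to Kruskal-Wallis.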
The Basic APA Format for Kruskal-Wallis
The standard APA 7th edition template for reporting a Kruskal-Wallis result is:
H(df) = X.XX, p = .XXX, ε^2^ = .XX
Here is what each component represents:
- H: the Kruskal-Wallis test statistic, which follows an approximate chi-square distribution
- df: degrees of freedom, equal to the number of groups minus one (k - 1)
- p: the exact p value reported to three decimal places; use p < .001 when the value falls below .001
- ε^2^: epsilon squared, the most common effect size for the Kruskal-Wallis test
Formatting rules to remember:
- Italicize H, p, and the effect size symbol
- Place degrees of freedom in parentheses directly after H with no space: H(2), not H (2)
- Never write p = .000; use p < .001 instead
- Always include an effect size — a p value alone is insufficient under APA 7th edition guidelines
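If you assemble result strings programmatically, the rules above can be encoded directly. A minimal Python sketch (the helper names are my own, and the output is plain text rather than italicized):

```python
def apa_number(x, decimals):
    """Format a value bounded by 1 without a leading zero (APA style for p and ε²)."""
    s = f"{x:.{decimals}f}"
    return s[1:] if s.startswith("0.") else s

def format_kruskal(h, df, p, eps2):
    """Assemble the APA 7 result string: H(df) = X.XX, p = .XXX, ε² = .XX.
    Applies the p < .001 floor so p = .000 can never be emitted."""
    p_part = "p < .001" if p < .001 else f"p = {apa_number(p, 3)}"
    return f"H({df}) = {h:.2f}, {p_part}, ε² = {apa_number(eps2, 2)}"
```

For example, `format_kruskal(18.42, 2, 0.00004, 0.312)` yields the string used throughout this guide.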
Reporting Kruskal-Wallis: Step by Step
Research Scenario
Imagine a clinical study comparing pain ratings (0-10 numerical rating scale) across three treatment groups: placebo (n = 20), Drug A (n = 20), and Drug B (n = 20). The researcher chose the Kruskal-Wallis test because pain ratings violated the normality assumption in two of the three groups (Shapiro-Wilk p = .008 for placebo and p = .021 for Drug B).
Step 1: Report Descriptive Statistics with Medians and IQR
For nonparametric tests, the primary descriptive statistics are medians (Mdn) and interquartile ranges (IQR) — not means and standard deviations. Means assume a symmetric distribution, which is the very assumption you have already acknowledged is violated.
| Group   | n  | Mdn  | IQR       |
|---------|----|------|-----------|
| Placebo | 20 | 7.00 | 5.25-8.00 |
| Drug A  | 20 | 5.00 | 3.00-6.75 |
| Drug B  | 20 | 3.50 | 2.00-5.00 |
Median pain ratings were highest in the placebo group (Mdn = 7.00, IQR = 5.25-8.00), followed by Drug A (Mdn = 5.00, IQR = 3.00-6.75) and Drug B (Mdn = 3.50, IQR = 2.00-5.00).
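Medians and IQR bounds are straightforward to compute with numpy. A small sketch using hypothetical ratings (note that packages differ in quartile interpolation conventions, so IQR bounds can vary slightly across software):

```python
import numpy as np

def median_iqr(values):
    """Return (Mdn, Q1, Q3) for one group, matching the Mdn/IQR reporting style above.
    Uses numpy's default linear interpolation for the quartiles."""
    v = np.asarray(values, dtype=float)
    return float(np.median(v)), float(np.percentile(v, 25)), float(np.percentile(v, 75))
```

Report the result as Mdn = 7.00, IQR = 6.25-7.75, for example, when the function returns (7.0, 6.25, 7.75).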
Step 2: Report the Omnibus H Test (Significant Result)
A Kruskal-Wallis H test was conducted to compare pain ratings across the three treatment groups. The nonparametric test was selected because pain ratings violated the normality assumption in the placebo and Drug B groups (Shapiro-Wilk p = .008 and p = .021, respectively). The test revealed a statistically significant difference in pain ratings across groups, H(2) = 18.42, p < .001, ε^2^ = .31.
Breaking down each component:
- H(2) = 18.42: The test statistic is 18.42 with 2 degrees of freedom (3 groups - 1).
- p < .001: The result is statistically significant. The exact p value fell below .001, so we report the conventional floor rather than an exact value.
- ε^2^ = .31: A large effect size, indicating that 31% of the variance in ranks is explained by group membership.
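The omnibus test is available as scipy.stats.kruskal, which applies a tie correction automatically. A sketch with hypothetical ratings (illustrative values, not the study's actual data):

```python
from scipy import stats

# Hypothetical pain ratings for three groups (illustrative, not the study's data)
placebo = [1, 2, 2, 3, 3, 3]
drug_a = [4, 5, 5, 6, 6, 6]
drug_b = [8, 9, 9, 10, 10, 10]

# scipy.stats.kruskal takes each group as a separate argument and
# returns the tie-corrected H statistic and its chi-square p value
h, p = stats.kruskal(placebo, drug_a, drug_b)

n_total = len(placebo) + len(drug_a) + len(drug_b)
eps2 = h / (n_total - 1)  # epsilon squared: ε² = H / (N - 1)
```

From here, `h`, `p`, and `eps2` plug directly into the H(df) = X.XX, p = .XXX, ε^2^ = .XX template.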
Step 3: Report a Non-Significant Result
When the Kruskal-Wallis test is not significant, report it with the same level of detail:
A Kruskal-Wallis H test indicated no statistically significant difference in anxiety scores across the three treatment groups, H(2) = 3.17, p = .205, ε^2^ = .05. Median anxiety scores were similar for the placebo (Mdn = 5.00, IQR = 3.00-7.00), Drug A (Mdn = 4.50, IQR = 3.00-6.00), and Drug B (Mdn = 4.00, IQR = 3.00-6.50) groups.
Note: even with a non-significant result, report the effect size and descriptive statistics. Do not conduct post-hoc tests when the omnibus test is non-significant.
Effect Size: Epsilon Squared
APA 7th edition explicitly recommends reporting effect sizes for all statistical tests. For the Kruskal-Wallis test, epsilon squared (ε^2^) is the most widely used measure. It estimates the proportion of variance in ranks explained by the grouping variable.
Calculation
The formula for epsilon squared is:
ε^2^ = H / (N - 1)
where H is the Kruskal-Wallis test statistic and N is the total sample size across all groups.
This is a simplification of the more formal definition:
ε^2^ = H / ((N^2^ - 1) / (N + 1))
Both expressions yield the same value. The simplified form is easier to compute and less error-prone.
Worked Example
With H = 18.42 and N = 60 (three groups of 20):
ε^2^ = 18.42 / (60 - 1) = 18.42 / 59 = .31
Interpretation Benchmarks
The benchmarks for epsilon squared follow the same conventions as eta squared:
| ε^2^ | Interpretation |
|------|----------------|
| .01  | Small effect   |
| .06  | Medium effect  |
| .14  | Large effect   |
In our example, ε^2^ = .31 is a large effect, indicating that treatment group membership accounts for approximately 31% of the variability in pain rating ranks. This gives the reader far more information than the p value alone.
Alternative: Eta Squared for H
An alternative effect size is eta squared based on the H statistic (η^2^~H~), which adjusts for the number of groups:
η^2^~H~ = (H - k + 1) / (N - k)
where k is the number of groups.
For our example: η^2^~H~ = (18.42 - 3 + 1) / (60 - 3) = 16.42 / 57 = .29
Both epsilon squared and eta squared share the same benchmarks. Choose one and use it consistently throughout your paper. Epsilon squared is more common in published research, but either is acceptable.
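Both effect sizes reduce to one-line computations. A sketch reproducing the worked examples above (function names are my own):

```python
def epsilon_squared(h, n_total):
    """Epsilon squared: ε² = H / (N - 1)."""
    return h / (n_total - 1)

def eta_squared_h(h, n_total, k):
    """Eta squared based on H: η²_H = (H - k + 1) / (N - k),
    which adjusts for the number of groups k."""
    return (h - k + 1) / (n_total - k)
```

With H = 18.42 and N = 60, `epsilon_squared` returns ≈ .31 and `eta_squared_h` (with k = 3) returns ≈ .29, matching the worked examples.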
APA format for reporting effect size:
The effect was large, ε^2^ = .31, indicating that approximately 31% of the variability in pain rating ranks was accounted for by treatment group.
Post-Hoc Tests: Dunn's Test with Bonferroni Correction
A significant Kruskal-Wallis result tells you that at least one group differs from at least one other group. It does not tell you which groups differ. When the omnibus test is significant and you have three or more groups, you must conduct pairwise post-hoc comparisons.
When to Conduct Post-Hoc Tests
Post-hoc comparisons are appropriate only when the omnibus Kruskal-Wallis test is significant (p < .05). If the omnibus test is non-significant, do not proceed with pairwise tests — doing so inflates the Type I error rate and produces uninterpretable results.
Dunn's Test
Dunn's test is the standard post-hoc procedure for the Kruskal-Wallis test. It compares all possible pairs of groups using the rank sums from the original omnibus ranking (rather than re-ranking each pair separately). This consistency with the omnibus test makes Dunn's test more appropriate than running separate Mann-Whitney U tests.
For k groups, the number of pairwise comparisons is k(k - 1) / 2. With three groups, that gives three comparisons: Placebo vs. Drug A, Placebo vs. Drug B, and Drug A vs. Drug B.
Bonferroni Correction
The Bonferroni correction controls the familywise Type I error rate by dividing the significance level by the number of comparisons. With three comparisons and alpha = .05, the adjusted threshold is .05 / 3 = .017. Alternatively, many software packages multiply the raw p values by the number of comparisons and compare them to the original alpha. Both approaches yield identical decisions.
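To show how the z statistics and Bonferroni-adjusted p values arise, here is a minimal Python sketch of Dunn's procedure. It omits the tie correction, so for real analyses prefer an established implementation (e.g., the scikit-posthocs package); all names here are illustrative:

```python
import numpy as np
from scipy.stats import rankdata, norm

def dunn_bonferroni(groups):
    """Pairwise Dunn comparisons after a significant Kruskal-Wallis test.

    Uses mean ranks from the single omnibus ranking, then Bonferroni-adjusts
    the two-sided p values. Minimal sketch: no tie correction.
    Returns {(i, j): (z, adjusted p)} for each pair of group indices."""
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    ranks = rankdata(np.concatenate(groups))  # one joint ranking, as in the omnibus test
    bounds = np.cumsum([0] + sizes)
    mean_ranks = [ranks[bounds[i]:bounds[i + 1]].mean() for i in range(len(groups))]
    m = len(groups) * (len(groups) - 1) // 2  # k(k - 1) / 2 comparisons
    out = {}
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            se = np.sqrt(n * (n + 1) / 12 * (1 / sizes[i] + 1 / sizes[j]))
            z = (mean_ranks[i] - mean_ranks[j]) / se
            p_adj = min(1.0, 2 * norm.sf(abs(z)) * m)  # Bonferroni: multiply by m
            out[(i, j)] = (z, p_adj)
    return out
```

Multiplying each p by m and comparing to .05, as here, gives the same accept/reject decisions as dividing alpha by m.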
APA Format for Pairwise Comparisons
Example with three groups (all pairwise results reported):
Post-hoc pairwise comparisons using Dunn's test with Bonferroni correction revealed that Drug B produced significantly lower pain ratings than both the placebo group (z = -4.12, p < .001) and Drug A (z = -2.54, p = .033). The difference between Drug A and the placebo group was also significant (z = -2.08, p = .038). All reported p values are Bonferroni-adjusted.
Example with mixed significant and non-significant results:
Dunn's post-hoc comparisons with Bonferroni correction indicated that participants receiving Drug B reported significantly lower pain ratings than those in the placebo group (z = -4.12, p < .001), but the differences between Drug A and placebo (z = -1.58, p = .342) and between Drug A and Drug B (z = -1.54, p = .371) were not statistically significant.
Always report both significant and non-significant pairwise comparisons. Selectively reporting only significant pairs is a form of reporting bias.
Complete APA Paragraph
Combining all elements into a single results paragraph:
A Kruskal-Wallis H test was conducted to examine differences in pain ratings (0-10 scale) across three treatment conditions: placebo (n = 20), Drug A (n = 20), and Drug B (n = 20). The nonparametric test was selected because pain ratings violated the normality assumption in two groups (Shapiro-Wilk p = .008 and p = .021). Median pain ratings were 7.00 (IQR = 5.25-8.00) for the placebo group, 5.00 (IQR = 3.00-6.75) for Drug A, and 3.50 (IQR = 2.00-5.00) for Drug B. The Kruskal-Wallis test indicated a statistically significant difference in pain ratings across groups, H(2) = 18.42, p < .001, ε^2^ = .31. Post-hoc pairwise comparisons using Dunn's test with Bonferroni correction revealed significant differences between Drug B and placebo (z = -4.12, p < .001), Drug B and Drug A (z = -2.54, p = .033), and Drug A and placebo (z = -2.08, p = .038).
This paragraph contains every element required by APA 7th edition: test justification, descriptive statistics with medians and IQRs, the omnibus H statistic with degrees of freedom and p value, effect size with interpretation benchmark, and post-hoc comparisons with named correction method and adjusted p values.
Common Mistakes to Avoid
1. Reporting Means Instead of Medians
This is the single most frequent error in Kruskal-Wallis reporting. If you chose a nonparametric test because the distribution is non-normal, reporting means and standard deviations as your primary descriptive statistics directly contradicts your rationale. The median is the appropriate measure of central tendency for ranked data, and the interquartile range is the appropriate measure of spread.
Incorrect:
Group A (M = 5.23, SD = 2.14), Group B (M = 3.87, SD = 1.92), Group C (M = 4.56, SD = 2.31)
Correct:
Group A (Mdn = 5.00, IQR = 3.50-7.00), Group B (Mdn = 3.50, IQR = 2.00-5.50), Group C (Mdn = 4.50, IQR = 3.00-6.00)
You may report means alongside medians for additional context, but the medians must be the primary statistics discussed in the text.
2. Running Multiple Mann-Whitney Tests Instead of Kruskal-Wallis
With three groups, some researchers skip the omnibus test entirely and run three separate Mann-Whitney U tests. This is methodologically incorrect because it inflates the familywise Type I error rate. With three comparisons at alpha = .05, the probability of at least one false positive rises to approximately .14 — nearly three times the intended rate.
Always start with the Kruskal-Wallis omnibus test. Only proceed to pairwise comparisons if the omnibus H is significant.
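The approximate .14 figure treats the three comparisons as independent and can be checked directly:

```python
def familywise_error(alpha, m):
    """P(at least one false positive) across m comparisons,
    under the simplifying assumption that they are independent."""
    return 1 - (1 - alpha) ** m

# Three pairwise tests at alpha = .05
fwer = familywise_error(0.05, 3)  # ≈ .14, nearly three times the intended rate
```

With six groups (15 pairwise comparisons), the same calculation gives a familywise rate above .5, which is why the omnibus-first workflow matters even more as k grows.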
3. Forgetting Bonferroni Correction on Post-Hoc Tests
When conducting pairwise post-hoc comparisons after a significant Kruskal-Wallis test, failing to correct for multiple comparisons is a serious error. Without correction, each comparison uses the full alpha = .05, which inflates the familywise error rate. Always specify the correction method (Bonferroni, Holm, or Benjamini-Hochberg) and report the adjusted p values.
4. Not Reporting Effect Size
APA 7th edition requires effect sizes for all inferential tests. A p value alone tells you whether a difference exists; it does not tell you whether the difference is large enough to be practically meaningful. A study with N = 500 might find p = .03 with ε^2^ = .01 — statistically significant but trivially small. Without the effect size, readers cannot evaluate practical significance.
5. Omitting the Justification for Using a Nonparametric Test
Reviewers expect a brief explanation of why you chose the Kruskal-Wallis test over the one-way ANOVA. One sentence referencing the specific violated assumption is sufficient:
Weak: "A Kruskal-Wallis test was conducted."
Strong: "A Kruskal-Wallis H test was used because pain ratings were not normally distributed in two of the three groups (Shapiro-Wilk p = .008 and p = .021)."
6. Reporting Post-Hoc Results Without Naming the Method
Always specify which post-hoc procedure you used (Dunn's test, pairwise Mann-Whitney U) and which correction was applied (Bonferroni, Holm). Without this information, reviewers cannot evaluate or replicate your analysis.
Kruskal-Wallis vs Friedman Test
Both the Kruskal-Wallis and Friedman tests are nonparametric methods for comparing three or more groups, but they apply to fundamentally different research designs. Confusing them is not merely a reporting error — it is a design-level mistake that invalidates the analysis.
| Feature | Kruskal-Wallis | Friedman |
|---------|----------------|----------|
| Design | Independent groups (between-subjects) | Repeated measures (within-subjects) |
| Participants | Different participants in each group | Same participants measured under all conditions |
| Parametric equivalent | One-way ANOVA | Repeated measures ANOVA |
| Ranking method | Ranks all observations together | Ranks within each participant separately |
| Post-hoc procedure | Dunn's test | Nemenyi test or Conover test |
| Effect size | Epsilon squared (ε^2^) | Kendall's W |
Use Kruskal-Wallis when different participants are assigned to different groups (e.g., three separate treatment groups).
Use Friedman when the same participants are measured under multiple conditions (e.g., patients rated on pain at three different time points).
The distinction comes down to one question: are the observations independent across groups? If participants appear in only one group, use Kruskal-Wallis. If the same participants contribute data to every condition, use Friedman.
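In scipy the two designs use different call shapes, which makes the distinction concrete. A sketch with hypothetical data:

```python
from scipy import stats

# Between-subjects: three separate groups of (hypothetical) participants;
# groups may differ in size
g1, g2, g3 = [3, 5, 4, 6], [2, 4, 3, 5], [1, 3, 2, 4]
h, p_kw = stats.kruskal(g1, g2, g3)

# Within-subjects: the SAME four participants measured at three time points;
# every condition must therefore have the same length, in the same order
t1, t2, t3 = [7, 6, 8, 5], [5, 4, 6, 4], [3, 2, 4, 2]
chi2, p_fr = stats.friedmanchisquare(t1, t2, t3)
```

If your conditions are columns of repeated measurements on the same people, `friedmanchisquare` is the correct call; passing that data to `kruskal` would wrongly treat the observations as independent.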
Calculation Accuracy
Formatting Kruskal-Wallis results by hand is tedious and error-prone — especially the post-hoc comparisons with Bonferroni correction. StatMate's free Kruskal-Wallis calculator automates the entire process:
- Enter your group data and get the H statistic, exact p value, and epsilon squared instantly
- Automatic Dunn's post-hoc test with Bonferroni correction when the omnibus test is significant
- Copy-ready APA formatted results paragraph with one click
- Free PDF export of your complete analysis including effect sizes
- Visual box plots comparing each group's distribution
No manual rank calculations, no formula errors, no forgotten correction factors. Paste your data, review the results, and copy the APA-formatted output directly into your manuscript.
Try the Kruskal-Wallis Calculator
Frequently Asked Questions
What is the correct APA format for reporting a Kruskal-Wallis test?
The standard format is: H(df) = X.XX, p = .XXX, ε^2^ = .XX. For example: H(2) = 18.42, p < .001, ε^2^ = .31. Include the test justification, descriptive statistics with medians and IQRs, and post-hoc pairwise comparisons when the omnibus test is significant.
What descriptive statistics should I report alongside Kruskal-Wallis results?
Report medians (Mdn) and interquartile ranges (IQR) as your primary descriptive statistics. You may also report mean ranks if they help clarify the group ordering. Avoid reporting means and standard deviations as the sole descriptive statistics because they assume a symmetric distribution, which contradicts your rationale for choosing a nonparametric test.
How do I calculate epsilon squared for a Kruskal-Wallis test?
Use the formula ε^2^ = H / (N - 1), where H is the test statistic and N is the total sample size. Interpret the result using these benchmarks: .01 = small effect, .06 = medium effect, .14 = large effect. For example, with H = 18.42 and N = 60, ε^2^ = 18.42 / 59 = .31 (large effect).
When should I use Dunn's test vs pairwise Mann-Whitney U tests?
Dunn's test is preferred because it uses the same ranking from the original Kruskal-Wallis omnibus test, maintaining statistical consistency. Pairwise Mann-Whitney U tests re-rank the data for each comparison, which can lead to different rank orderings. Both require a correction for multiple comparisons (Bonferroni or Holm), but Dunn's test is the more methodologically sound choice.
What is the difference between Kruskal-Wallis and Friedman tests?
The Kruskal-Wallis test compares three or more independent groups (different participants in each group), while the Friedman test compares three or more related groups (the same participants measured under all conditions). The Kruskal-Wallis test is the nonparametric equivalent of the one-way ANOVA; the Friedman test is the nonparametric equivalent of repeated measures ANOVA.