What is the Hodges-Lehmann estimator and should I report it?

The Hodges-Lehmann estimator is the nonparametric equivalent of the mean difference. For paired data, it equals the median of all Walsh averages of the difference scores. Reporting it with a confidence interval is recommended because it provides a robust point estimate of the typical shift between conditions, supplementing the median difference and effect size with a measure of precision.

How to Report Wilcoxon Signed-Rank Test in APA 7th Edition — Effect Size & Examples

Why Correct Wilcoxon Reporting Matters

The Wilcoxon signed-rank test is the most frequently used nonparametric alternative to the paired samples t-test. Developed by Frank Wilcoxon in 1945, it evaluates whether the distribution of differences between two related measurements is symmetric around zero, without requiring those differences to follow a normal distribution.

Despite its widespread adoption in clinical trials, educational interventions, and behavioral research, the Wilcoxon signed-rank test remains one of the most inconsistently reported statistics in published literature. Common errors include reporting means instead of medians, omitting effect sizes entirely, confusing the signed-rank test with the rank-sum test, and failing to specify whether exact or asymptotic p-values were used.

APA 7th edition demands that every inferential test include a test statistic, a p-value, and an effect size measure. For the Wilcoxon signed-rank test, meeting these requirements involves understanding multiple notation conventions, choosing between exact and approximate methods, and computing an appropriate effect size. This guide provides a definitive template for every component, with copy-paste examples and a complete worked scenario.


When to Use Wilcoxon vs. the Paired t-Test

The paired samples t-test assumes that the differences between paired observations are normally distributed. When that assumption fails, the Wilcoxon signed-rank test is the correct alternative. Use it when any of the following apply:

Ordinal dependent variable. Your outcome is measured on an ordinal scale such as Likert-type items, pain severity rankings, or satisfaction ratings. Means are not meaningful for ordinal data; ranks are.
Non-normal paired differences. A Shapiro-Wilk test on the difference scores yields p < .05, or a Q-Q plot reveals heavy tails, skewness, or outliers.
Small sample sizes. With fewer than 20-25 pairs, the Central Limit Theorem may not adequately normalize the sampling distribution of the mean difference.
Floor or ceiling effects. Scores cluster at the extremes of the measurement range, producing a distribution that the t-test cannot handle reliably.

Under perfect normality the paired t-test has slightly higher statistical power: the Wilcoxon signed-rank test achieves an asymptotic relative efficiency of approximately 95.5% (3/π) relative to the t-test. However, when normality is violated, Wilcoxon often outperforms the t-test because it is far less distorted by outliers or skewness.

APA Justification Template

When you choose the Wilcoxon test, briefly justify the choice in your results section:

The Wilcoxon signed-rank test was used because the Shapiro-Wilk test indicated significant non-normality in the distribution of paired differences (W = 0.89, p = .003), and visual inspection of the histogram revealed positive skewness with two extreme outliers.

| Decision Factor | Choose Paired t-Test | Choose Wilcoxon |
|----------------|---------------------|----------------|
| Differences normally distributed | Yes | -- |
| Differences skewed or heavy-tailed | -- | Yes |
| Ordinal measurement scale | -- | Yes |
| Continuous interval/ratio scale | Yes | -- |
| Outliers present in differences | -- | Yes |
| Sample size > 30 pairs, mild non-normality | Yes (robust) | Either |
| Sample size < 20 pairs, normality uncertain | -- | Yes |

Understanding the Test Statistics: T, W, and Z

One of the most confusing aspects of Wilcoxon reporting is the inconsistent notation across software packages and textbooks. Three symbols appear in practice, and knowing which one your software outputs is essential.

T (or W): The Sum of Signed Ranks

The core statistic is the sum of ranks for either the positive or negative differences:

| Symbol | Convention | Used By |
|--------|-----------|---------|
| T | Sum of positive (or smaller) ranks | Many statistics textbooks |
| W | Sum of signed ranks | Some textbooks |
| T+ | Sum of positive ranks specifically | Siegel & Castellan notation |
| V | Sum of positive ranks | R (wilcox.test) |

For small samples (typically n < 20), the exact test statistic T (or W) is reported because exact p-values can be computed from the Wilcoxon distribution.

Z: The Standardized Approximation

For larger samples, software converts the rank sum into a Z-statistic using a normal approximation:

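Z = (T - n(n + 1)/4) / sqrt(n(n + 1)(2n + 1)/24)

Here n is the number of non-zero pairs, n(n + 1)/4 is the expected value of T under the null hypothesis, and the square-root term is its standard error before any tie or continuity correction.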

This Z follows an approximately normal distribution and is the statistic most commonly reported in published research.
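As a rough illustration of the normal approximation, the sketch below converts a positive-rank sum into Z using the usual null mean n(n + 1)/4 and standard error sqrt(n(n + 1)(2n + 1)/24). It assumes no tied ranks and no continuity correction; the function name `wilcoxon_z` and the input values are hypothetical. Software that applies tie or continuity corrections will report slightly different values.

```python
import math


def wilcoxon_z(t_plus: float, n: int) -> float:
    """Normal-approximation Z for a positive-rank sum.

    t_plus: sum of ranks of the positive differences
    n:      number of non-zero pairs
    Assumes no ties and applies no continuity correction.
    """
    expected = n * (n + 1) / 4
    std_error = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (t_plus - expected) / std_error


# Hypothetical input: 19 non-zero pairs with a positive-rank sum of 3.
print(round(wilcoxon_z(t_plus=3, n=19), 2))  # -3.7
```

The sign of Z depends on which rank sum is used; passing the negative-rank sum instead flips the sign but leaves the magnitude unchanged.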

Software Conventions

| Software | Default Output | Symbol |
|----------|---------------|--------|
| SPSS | Standardized test statistic | Z |
| R (wilcox.test) | Sum of ranks | V |
| Stata | Sum of ranks + Z approximation | z |
| jamovi | Test statistic + Z | W and Z |
| StatMate | Both rank sum and Z | W and Z |

Always check your software documentation to confirm what the reported value represents before writing your results section.

The APA Reporting Template

APA 7th edition does not prescribe a single rigid format for the Wilcoxon test, but the following templates reflect current best practice.

For Small Samples (Exact Test)

A Wilcoxon signed-rank test indicated that post-intervention scores (Mdn = 4.50) were significantly higher than pre-intervention scores (Mdn = 3.00), T = 45, p = .012, r = .48.

For Larger Samples (Z Approximation)

A Wilcoxon signed-rank test showed a statistically significant change in pain ratings from baseline (Mdn = 7.00, IQR = 5.00-8.00) to follow-up (Mdn = 4.00, IQR = 3.00-6.00), Z = -3.41, p < .001, r = .54.

Essential Components Checklist

Every Wilcoxon APA report must include:

Full test name on first mention (Wilcoxon signed-rank test).
Descriptive statistics: Medians and interquartile ranges for each condition, not means.
Test statistic: T, W, or Z depending on sample size and software.
Exact p-value (or p < .001 for very small values).
Effect size: r, computed either as Z/sqrt(N) or as the matched-pairs rank-biserial correlation (state which).
Direction of difference stated explicitly.

Effect Size: Rank-Biserial Correlation (r)

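A p-value alone tells readers whether the difference is statistically significant but not whether it is practically meaningful. Two effect sizes are in common use for the Wilcoxon signed-rank test, and both are symbolized as r: r = Z/sqrt(N), and the matched-pairs rank-biserial correlation computed from the rank sums. They are different statistics and generally give different values for the same data, so state which one you report.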

Method 1: From the Z-Statistic

The most widely used formula:

r = |Z| / sqrt(N)

where N is the number of pairs, not the number of individual scores. Some sources count only the non-zero pairs, so state which N you used.

Example: With Z = -3.41 and N = 40 pairs:

r = |-3.41| / sqrt(40) = 3.41 / 6.32 = 0.54
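The same arithmetic can be checked in a few lines. This is a minimal sketch; the function name `r_from_z` is hypothetical, and it takes the absolute value of Z so that the result expresses magnitude only.

```python
import math


def r_from_z(z: float, n_pairs: int) -> float:
    """Effect size r = |Z| / sqrt(N), where N is the number of pairs."""
    return abs(z) / math.sqrt(n_pairs)


# Values from the example above: Z = -3.41 with 40 pairs.
print(f"{r_from_z(-3.41, 40):.2f}")  # 0.54
```

Passing the number of individual scores instead of the number of pairs would shrink r, which is one of the common mistakes listed later in this guide.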

Method 2: From the Rank Sums

When Z is not available:

r = (R+ - R-) / (R+ + R-)

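Where R+ is the sum of ranks for positive differences and R- is the sum of ranks for negative differences. This is the matched-pairs rank-biserial correlation. It has an intuitive interpretation: r = 1.0 means every non-zero difference was positive, r = -1.0 means every non-zero difference was negative, and r = 0 means the two rank sums were equal. Because it is not derived from Z, it will generally not equal the Method 1 value for the same data.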
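A small sketch of the rank-sum formula, with hypothetical rank sums chosen for illustration (19 non-zero pairs, so the two sums add up to 190). The function name `rank_biserial` is an assumption, not a library call.

```python
def rank_biserial(r_plus: float, r_minus: float) -> float:
    """Matched-pairs rank-biserial correlation from the two rank sums."""
    return (r_plus - r_minus) / (r_plus + r_minus)


# Hypothetical rank sums: almost all of the rank mass is positive.
print(round(rank_biserial(187, 3), 3))  # 0.968

# All differences in one direction gives the extreme values.
print(rank_biserial(10, 0))   # 1.0
print(rank_biserial(0, 10))   # -1.0
```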

Interpreting the Effect Size

Cohen's conventional benchmarks for r:

| r Value | Interpretation |
|-----------|---------------|
| .10 | Small effect |
| .30 | Medium effect |
| .50 | Large effect |

Always interpret effect size in context. In clinical research, r = .20 may represent a clinically meaningful change. In educational research, r = .40 might be a strong intervention effect. Do not rely solely on arbitrary benchmarks.

APA Format for Effect Size

Report the effect size immediately after the p-value:

Z = -3.41, p < .001, r = .54

If you want to be explicit about how r was computed:

Z = -3.41, p < .001, r = .54 (r = Z/sqrt(N))
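For readers who assemble results strings in code, the formatting conventions used in these examples (no leading zero on p and r, and p < .001 for very small values) can be sketched as follows. The helper names `drop_leading_zero`, `apa_p`, and `apa_line` are hypothetical, and the three-decimal p and two-decimal r are taken from the examples in this guide rather than from any library.

```python
def drop_leading_zero(value: float, digits: int) -> str:
    """Format a number that cannot exceed 1 in absolute value, APA style."""
    text = f"{value:.{digits}f}"
    if text.startswith("0."):
        return text[1:]
    if text.startswith("-0."):
        return "-" + text[2:]
    return text


def apa_p(p: float) -> str:
    """Exact p to three decimals, or 'p < .001' for very small values."""
    if p < 0.001:
        return "p < .001"
    return "p = " + drop_leading_zero(p, 3)


def apa_line(z: float, p: float, r: float) -> str:
    """Assemble the statistic, p-value, and effect size in reporting order."""
    return f"Z = {z:.2f}, {apa_p(p)}, r = {drop_leading_zero(r, 2)}"


print(apa_line(-3.41, 0.00065, 0.54))  # Z = -3.41, p < .001, r = .54
print(apa_line(-1.34, 0.180, 0.21))    # Z = -1.34, p = .180, r = .21
```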

Step-by-Step Reporting Example: Pre-Post Intervention (N = 20)

Scenario

A health psychologist measures sleep quality (1-10 ordinal scale) in 20 patients before and after a 6-week cognitive behavioral therapy for insomnia (CBT-I) program.

Step 1: Report Descriptive Statistics

Present medians and interquartile ranges for both conditions:

Pre-intervention sleep quality had a median of 4.00 (IQR = 3.00-5.00), while post-intervention sleep quality had a median of 7.00 (IQR = 5.75-8.00).
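A minimal sketch of how the median and quartiles might be computed with Python's standard library. The scores below are made up for illustration and are not the data behind the scenario; the function name `median_iqr` is hypothetical. Quartile definitions differ between packages, so the `method="inclusive"` choice here is an assumption and your software may report slightly different IQR bounds.

```python
from statistics import median, quantiles


def median_iqr(scores):
    """Return (median, first quartile, third quartile) for one condition."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), q1, q3


# Illustrative scores only.
mdn, q1, q3 = median_iqr([1, 2, 3, 4, 5])
print(f"Mdn = {mdn:.2f}, IQR = {q1:.2f}-{q3:.2f}")  # Mdn = 3.00, IQR = 2.00-4.00
```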

Step 2: Justify the Nonparametric Choice

Because sleep quality was measured on an ordinal scale and the Shapiro-Wilk test indicated that the distribution of paired differences deviated significantly from normality (W = 0.88, p = .021), the Wilcoxon signed-rank test was used instead of a paired samples t-test.

Step 3: Report the Test Results

A Wilcoxon signed-rank test indicated that sleep quality scores were significantly higher after CBT-I (Mdn = 7.00, IQR = 5.75-8.00) compared to baseline (Mdn = 4.00, IQR = 3.00-5.00), Z = -3.72, p < .001, r = .83. This represents a large effect.

Step 4: Add Contextual Detail

Of the 20 participants, 17 showed an increase in sleep quality scores, 2 showed a decrease, and 1 showed no change. The large effect size (r = .83) indicates that CBT-I produced a substantial improvement in self-reported sleep quality.

Complete APA Paragraph

The Wilcoxon signed-rank test was used to evaluate the effect of a 6-week CBT-I program on self-reported sleep quality (N = 20). The nonparametric test was selected because sleep quality was measured on an ordinal scale and paired differences were not normally distributed (Shapiro-Wilk W = 0.88, p = .021). Pre-intervention sleep quality had a median of 4.00 (IQR = 3.00-5.00) and post-intervention sleep quality had a median of 7.00 (IQR = 5.75-8.00). The Wilcoxon signed-rank test indicated a statistically significant improvement in sleep quality, Z = -3.72, p < .001, r = .83. Of the 20 participants, 17 showed improved scores, 2 showed decreased scores, and 1 showed no change. The effect size indicates a large practical effect of the intervention.

Reporting Non-Significant Results

Non-significant results should be reported with the same level of detail:

A Wilcoxon signed-rank test was conducted to compare self-efficacy ratings before (Mdn = 5.00, IQR = 4.00-6.00) and after (Mdn = 5.00, IQR = 4.00-7.00) the training workshop. The test did not reveal a statistically significant change, Z = -1.34, p = .180, r = .21. The small effect size suggests that the workshop had minimal impact on participants' self-efficacy beliefs.

Key principles:

Report the exact p-value (do not write "p = n.s." or "p > .05").
Still include and interpret the effect size.
Describe the direction of any observed trend.
Avoid language implying the intervention "had no effect." State that the test did not detect a significant effect.

Exact vs. Asymptotic P-Values: When to Use Which

For small samples (typically n < 20-25 pairs), the exact p-value should be reported because the normal approximation may not be accurate. For larger samples, the asymptotic (Z-based) p-value is acceptable.

Small samples (exact test):

T = 12, p_exact = .023

Larger samples (Z approximation):

Z = -2.87, p = .004

If your software provides both, report the exact value for small samples and note which method was used:

A Wilcoxon signed-rank test with exact significance was used due to the small sample size (N = 15).
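To make the idea of an exact p-value concrete, the sketch below enumerates the null distribution of the positive-rank sum by counting, for each possible sum, how many subsets of the ranks 1..n produce it. It assumes no tied ranks and no zero differences, and the function name `exact_p_two_sided` is hypothetical. Statistical packages handle ties and zeros in their own ways, so treat this as an illustration of the principle rather than a replacement for your software's exact test.

```python
def exact_p_two_sided(t_plus: int, n: int) -> float:
    """Exact two-sided p-value for a positive-rank sum with n untied pairs."""
    max_sum = n * (n + 1) // 2
    # counts[s] = number of subsets of {1, ..., n} whose ranks sum to s
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    total = 2 ** n  # each rank is positive or negative with equal probability
    lower = sum(counts[: t_plus + 1]) / total
    upper = sum(counts[t_plus:]) / total
    return min(1.0, 2 * min(lower, upper))


# The most extreme outcome (all differences in one direction):
print(exact_p_two_sided(0, 6))  # 0.03125
print(exact_p_two_sided(0, 5))  # 0.0625
```

The two printed values show why very small samples are limiting: with 5 pairs, even a perfectly consistent result cannot reach two-sided significance at alpha = .05.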

Confidence Intervals: The Hodges-Lehmann Estimator

APA 7th edition encourages reporting confidence intervals alongside point estimates. For the Wilcoxon test, the relevant confidence interval is constructed around the Hodges-Lehmann estimator, the nonparametric analogue of the mean difference.

How It Works

Compute the difference score for each pair (d_i = post - pre).
For every pair of difference scores (d_i, d_j) with i ≤ j, including each score paired with itself, compute the Walsh average (d_i + d_j) / 2. With n differences this gives n(n + 1)/2 averages.
The Hodges-Lehmann estimator is the median of all Walsh averages.

In R: wilcox.test(post, pre, paired = TRUE, conf.int = TRUE)
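The Walsh-average steps can also be sketched directly. This version computes only the point estimate, not the confidence interval, and uses made-up scores; the function name `hodges_lehmann` is hypothetical. Results from statistical software may differ slightly when the data contain ties or zero differences.

```python
from itertools import combinations_with_replacement
from statistics import median


def hodges_lehmann(pre, post):
    """Median of all Walsh averages of the paired differences."""
    diffs = [after - before for before, after in zip(pre, post)]
    # combinations_with_replacement yields every pair (d_i, d_j) with i <= j,
    # so each difference is also averaged with itself.
    walsh = [(x + y) / 2 for x, y in combinations_with_replacement(diffs, 2)]
    return median(walsh)


# Illustrative data: differences are 1, 2, and 6.
print(hodges_lehmann(pre=[0, 0, 0], post=[1, 2, 6]))  # 2.75
```

In this toy example the ordinary median of the differences is 2, while the Hodges-Lehmann estimate is 2.75, which shows that the two statistics need not coincide.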

APA Reporting with Confidence Intervals

A Wilcoxon signed-rank test indicated a statistically significant reduction in pain scores from baseline (Mdn = 7.00) to post-treatment (Mdn = 4.00), Z = -3.41, p < .001, r = .54. The Hodges-Lehmann estimate of the median difference was -2.50, 95% CI [-3.50, -1.75].

Handling Ties and Zero Differences

When paired differences equal zero, these observations are typically excluded from the analysis, reducing the effective sample size. Report the number of ties:

Of the 40 pairs, 3 had zero differences and were excluded, leaving 37 pairs for analysis.

When multiple pairs share the same non-zero absolute difference, tied ranks receive average rank values. If ties are extensive (more than 15-20% of observations), mention this:

Tied ranks were present for 22% of non-zero differences. The analysis used average ranks for tied observations with a continuity correction applied to the Z approximation.
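The handling described in this section (drop zero differences, then give tied absolute differences their average rank) can be sketched as below. The function name `signed_rank_sums` and the data are hypothetical, and this follows the common convention of excluding zeros; some software treats zeros differently.

```python
def signed_rank_sums(pre, post):
    """Return (R+, R-, number of excluded zero differences)."""
    diffs = [after - before for before, after in zip(pre, post)]
    nonzero = [d for d in diffs if d != 0]
    ordered = sorted(nonzero, key=abs)

    # Assign average ranks to runs of equal absolute differences.
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    r_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return r_plus, r_minus, len(diffs) - len(nonzero)


# Differences are 1, -1, 2, 0, 1: one zero is dropped, three values tie at |1|.
print(signed_rank_sums(pre=[5, 5, 5, 5, 5], post=[6, 4, 7, 5, 6]))  # (8.0, 2.0, 1)
```

The two rank sums always add up to n(n + 1)/2 for the n non-zero pairs (here 4 pairs, total 10), which is a quick sanity check on any hand calculation.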

Pre-Post Designs: Additional Considerations

Pre-post designs are the most common application of the Wilcoxon signed-rank test. Additional reporting elements for these designs include:

Handling Multiple Time Points

The Wilcoxon test compares only two conditions. For three or more time points, use either:

Option 1: Pairwise Wilcoxon tests with Bonferroni correction.

Pairwise Wilcoxon signed-rank tests with a Bonferroni-corrected alpha (α = .05/3 = .017) revealed significant improvements from baseline to mid-treatment (Z = -3.12, p = .002, r = .46) and from baseline to post-treatment (Z = -4.87, p < .001, r = .73), but not from mid-treatment to post-treatment (Z = -1.89, p = .059, r = .28).
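The Bonferroni adjustment in this example is simply the overall alpha divided by the number of pairwise comparisons. A minimal sketch, with a hypothetical function name `bonferroni_flags`; the value 0.0005 is a placeholder standing in for the reported "p < .001", since the exact value is not given.

```python
def bonferroni_flags(p_values, alpha=0.05):
    """Return the per-comparison threshold and a significance flag for each p."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]


# Three pairwise comparisons; 0.0005 is a placeholder for "p < .001".
threshold, flags = bonferroni_flags([0.002, 0.0005, 0.059])
print(round(threshold, 3))  # 0.017
print(flags)                # [True, True, False]
```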

Option 2: Friedman test as omnibus, followed by Wilcoxon post-hoc.

A Friedman test indicated a significant effect of time on depression scores, χ²(2) = 34.56, p < .001, Kendall's W = .38. Post-hoc Wilcoxon signed-rank tests with Bonferroni correction were conducted to identify which time points differed.

Clinical Significance

In health research, report the proportion achieving clinically meaningful change:

Of the 45 participants, 33 (73.3%) demonstrated reliable clinical improvement (BDI-II decrease of 8 or more points), 9 (20.0%) showed no reliable change, and 3 (6.7%) showed reliable deterioration.

Handling Dropouts

Of the 52 participants enrolled, 45 completed both baseline and post-treatment assessments (86.5% retention rate). Participants who dropped out did not differ significantly from completers on baseline scores (Mann-Whitney U = 134, p = .312).

Common Mistakes in Wilcoxon Reporting

1. Reporting Means Instead of Medians

The most frequent error. The Wilcoxon test operates on ranks, so medians and IQRs are the appropriate descriptive statistics.

Incorrect: "Scores increased from pre-test (M = 3.42, SD = 1.21) to post-test (M = 4.87, SD = 1.35)."

Correct: "Scores increased from pre-test (Mdn = 3.50, IQR = 2.75-4.25) to post-test (Mdn = 5.00, IQR = 4.00-5.50)."

2. Confusing Signed-Rank with Rank-Sum

The Wilcoxon signed-rank test is for paired samples. The Wilcoxon rank-sum test (Mann-Whitney U) is for independent groups. Always specify the full name on first mention.

3. Incorrect Effect Size Calculations

Using total individuals (N = 60) instead of number of pairs (N = 30) when computing r = Z/sqrt(N).
Reporting Cohen's d instead of rank-biserial r.
Forgetting to use the absolute value of Z when interpreting magnitude.

4. Ignoring Ties and Zero Differences

Report excluded zero-difference pairs and acknowledge extensive ties when present.

5. Missing the Exact vs. Asymptotic Distinction

For small samples (n < 20-25), use exact p-values. For larger samples, the Z approximation is acceptable. Always specify which method was used.

6. Omitting the Effect Size

APA 7th edition requires an effect size for every inferential test. For the Wilcoxon signed-rank test, report r (computed as Z/sqrt(N) or as the rank-biserial correlation) and say which one it is.

Wilcoxon APA Checklist

Before submitting your manuscript, verify your Wilcoxon results section includes:

Full test name on first mention (Wilcoxon signed-rank test)
Sample size (N or number of pairs)
Medians for each condition (not means)
Interquartile ranges (IQR) for each condition
Test statistic (T, W, or Z) clearly labeled
Exact p-value (or p < .001)
Effect size: r (Z/sqrt(N) or rank-biserial correlation, with the method identified)
Effect size interpretation (small, medium, or large)
Direction of the difference stated explicitly
Justification for choosing the nonparametric test
Ties addressed if numerous
Confidence interval for the Hodges-Lehmann estimate (if applicable)
Number of participants showing improvement, decline, and no change

Frequently Asked Questions

What is the difference between the Wilcoxon signed-rank test and the Wilcoxon rank-sum test?

The Wilcoxon signed-rank test is for paired (related) samples, such as pre-post measurements from the same participants. The Wilcoxon rank-sum test (Mann-Whitney U test) is for two independent groups. The signed-rank test evaluates whether the median of paired differences is zero; the rank-sum test evaluates whether one group tends to have larger values.

Can I use the Wilcoxon signed-rank test with Likert scale data?

Yes. The Wilcoxon signed-rank test is appropriate for ordinal data, including individual Likert-type items, because it operates on ranks. However, if you have a composite scale from multiple Likert items that approximates a continuous distribution, a paired t-test may be acceptable when differences are approximately normal.

What sample size do I need for the Wilcoxon signed-rank test?

At least 6 pairs are needed for the exact two-tailed test to be able to reach significance at alpha = .05 (5 pairs for a one-tailed test), because with fewer pairs even a perfectly consistent result yields p > .05. For adequate power to detect a medium effect (r = .30), aim for 25-30 pairs. The Z approximation becomes reliable with approximately 20 or more pairs.

Should I report the exact or asymptotic p-value?

For small samples (fewer than 20-25 pairs), report the exact p-value. For larger samples, the asymptotic Z-based p-value is acceptable. If your software provides both, use the exact value for small samples and note the method.

How do I handle zero differences (ties with zero)?

Pairs with zero differences are excluded from the analysis by most software. Report the number of excluded pairs: "Of the 30 pairs, 4 had zero differences and were excluded, leaving 26 pairs for analysis."

Can I use the Wilcoxon test for more than two time points?

Not directly. For three or more related conditions, use the Friedman test as the omnibus test, followed by pairwise Wilcoxon signed-rank tests with Bonferroni correction as post-hoc comparisons.

Is the Wilcoxon test assumption-free?

No. It assumes that (1) paired differences are independent of each other, (2) differences are measured on at least an ordinal scale, and (3) the distribution of differences is symmetric around the median. The test is fairly robust to mild asymmetry, but violations of independence are problematic.

Try StatMate's Free Wilcoxon Calculator

Formatting Wilcoxon results manually is tedious and error-prone. StatMate's Wilcoxon signed-rank calculator automates the entire process:

Instant APA output. Enter your paired data and get a publication-ready results paragraph with Z, p, and r values formatted to APA 7th edition standards.
Automatic effect size. The rank-biserial correlation is computed and interpreted for you.
Assumption checks. Shapiro-Wilk normality test on the paired differences with clear pass/fail indicators.
Visual output. Paired difference charts show the direction and magnitude of changes.
One-click export. Copy to clipboard, export to PDF, or generate an APA-formatted Word document (Pro).

No formulas to look up, no notation to decode, no formatting to second-guess.
