Why the Mann-Whitney U Test Matters
The Mann-Whitney U test is the most widely used nonparametric alternative to the independent samples t-test. Named after Henry B. Mann and Donald R. Whitney (1947), it evaluates whether one of two independent groups tends to produce larger values than the other — without assuming that the data follow a normal distribution.
This matters for three practical reasons. First, real-world research data frequently violate the normality assumption that parametric tests require. Patient satisfaction ratings, pain severity scores, behavioral frequency counts, and Likert-scale items rarely produce the symmetric, bell-shaped distributions that a t-test assumes. Second, many outcome variables in social science, education, and health research are measured on ordinal scales where means and standard deviations are not meaningful. Third, small clinical studies and pilot experiments often lack the sample sizes needed for the central limit theorem to rescue a parametric approach.
The Mann-Whitney U test handles all of these situations by converting raw data to ranks before analysis. Instead of comparing group means, it tests whether observations from one group are systematically larger or smaller than observations from the other. This rank-based approach makes the test robust to outliers, skewed distributions, and non-interval measurement scales.
Despite its popularity, the Mann-Whitney U test is one of the most frequently misreported statistics in published research. Common errors include reporting means instead of medians, omitting effect sizes, confusing it with the Wilcoxon signed-rank test, and failing to specify whether exact or asymptotic p-values were used. This guide provides the definitive template for reporting Mann-Whitney U test results in APA 7th edition format, with step-by-step instructions and copy-paste examples.
When to Use Mann-Whitney U vs Independent t-Test
Choosing between the Mann-Whitney U test and the independent samples t-test is one of the most common decisions in two-group research designs. The right choice depends on your data characteristics.
Non-Normal Distributions
When a Shapiro-Wilk test yields p < .05 or Q-Q plots reveal substantial departures from normality, the Mann-Whitney U test is the appropriate choice. The t-test assumes approximately normal distributions within each group; when this assumption is violated (particularly with skewed or multimodal distributions), the t-test can produce misleading p-values and inflated Type I error rates.
A common misconception is that the t-test is "robust" to non-normality. While moderate departures from normality have limited impact when sample sizes are large and equal (n > 30 per group), severe skewness, heavy tails, or floor/ceiling effects can distort results regardless of sample size.
Ordinal Data
If your dependent variable is measured on an ordinal scale — such as Likert items (1-5 agreement scales), pain severity ratings (none/mild/moderate/severe), or educational attainment levels — the Mann-Whitney U test is the correct choice. Means and standard deviations are not meaningful for ordinal data because the intervals between scale points are not guaranteed to be equal. The Mann-Whitney U test operates entirely on ranks, making it appropriate for any data that can be meaningfully ordered.
Small Samples With Skewed Distributions
When group sizes are small (n < 15-20 per group) and the distribution shape is unknown or clearly non-normal, the Mann-Whitney U test provides more reliable inference than the t-test. With small samples, normality is difficult to verify statistically (the Shapiro-Wilk test has low power), and a single outlier can dramatically affect the mean and inflate the standard error.
Decision Flowchart
Use this decision process to choose between the two tests:
1. Is the dependent variable ordinal? Yes → Mann-Whitney U
2. Is the dependent variable continuous? Continue to step 3.
3. Does the Shapiro-Wilk test indicate non-normality (p < .05) in either group? Yes → Mann-Whitney U
4. Are there severe outliers that cannot be justified or removed? Yes → Mann-Whitney U
5. Is n < 15 per group with unknown distribution shape? Yes → Mann-Whitney U
6. None of the above? → Independent samples t-test (higher statistical power)
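The decision process above can be sketched as a small function. This is only an illustration: the function name, parameter names, and the returned reason strings are hypothetical, and the thresholds (p < .05, n < 15) are the ones stated in the list.

```python
def choose_two_group_test(is_ordinal, shapiro_p_values, has_severe_outliers,
                          group_sizes, shape_unknown=False):
    """Return (test_name, reason) following the decision steps in order.

    All argument names are hypothetical; shapiro_p_values holds one
    Shapiro-Wilk p-value per group, computed elsewhere.
    """
    if is_ordinal:
        return "Mann-Whitney U", "ordinal dependent variable"
    if any(p < 0.05 for p in shapiro_p_values):
        return "Mann-Whitney U", "non-normality in at least one group"
    if has_severe_outliers:
        return "Mann-Whitney U", "severe outliers that cannot be removed"
    if min(group_sizes) < 15 and shape_unknown:
        return "Mann-Whitney U", "small sample with unknown distribution shape"
    return "Independent samples t-test", "no assumption violation detected"


# Continuous outcome, one group fails Shapiro-Wilk at the .05 level
print(choose_two_group_test(False, [0.40, 0.047], False, [15, 15]))
```

A function like this is mainly useful for documenting the justification that the write-up should state, not for replacing judgment about the data.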
When assumptions are met, the t-test has greater statistical power. The asymptotic relative efficiency of the Mann-Whitney U compared to the t-test is approximately 0.955 for normally distributed data, meaning it requires roughly 5% more observations to achieve the same power. However, when distributions are skewed or contaminated with outliers, the Mann-Whitney U can be substantially more powerful.
The Basic APA Format for Mann-Whitney U
APA 7th edition requires that every inferential test include the test statistic, degrees of freedom or sample information, p-value, and an effect size measure. For the Mann-Whitney U test, the standard reporting template is:
U = X, z = X.XX, p = .XXX, r = .XX
Each component serves a specific purpose:
- U: The Mann-Whitney U statistic — the core test statistic based on rank sums
- z: The standardized z-score — necessary for computing the effect size and interpretable on a standard normal scale
- p: The probability value — reported to three decimal places, or as p < .001 for very small values
- r: The effect size — typically the rank-biserial correlation or r = z / sqrt(N)
In addition to the test statistics, always report descriptive statistics using medians (Mdn) and interquartile ranges (IQR), not means and standard deviations. The Mann-Whitney U test evaluates rank distributions, not means, so medians are the appropriate measure of central tendency.
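As a rough sketch, the reporting template can be produced by a small formatter that applies the rules listed above: two decimals for U and z, three for p, "p < .001" for very small values, and no leading zero on p or the effect size. The function names are hypothetical and not part of any statistics package.

```python
def drop_leading_zero(value, decimals):
    """Format a statistic bounded by 1 in magnitude, e.g. .62 rather than 0.62."""
    text = f"{abs(value):.{decimals}f}"
    if text.startswith("0."):
        text = text[1:]
    # Keep the minus sign only if the rounded value is not zero
    sign = "-" if value < 0 and float(text) != 0 else ""
    return sign + text


def format_p(p):
    """Three decimals, or 'p < .001' when the value is below .001."""
    return "p < .001" if p < 0.001 else "p = " + drop_leading_zero(p, 3)


def format_mann_whitney(u, z, p, effect, effect_label="r"):
    """Build the 'U = X, z = X.XX, p = .XXX, r = .XX' string."""
    return (f"U = {u:.2f}, z = {z:.2f}, {format_p(p)}, "
            f"{effect_label} = {drop_leading_zero(effect, 2)}")


print(format_mann_whitney(42.5, -3.12, 0.002, 0.62, effect_label="rrb"))
```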
Reporting Mann-Whitney U: Step by Step
Research Scenario
A researcher investigates whether a mindfulness-based intervention improves patient satisfaction with hospital care. The satisfaction questionnaire uses a 7-point Likert scale (1 = very dissatisfied to 7 = very satisfied). Fifteen patients received the mindfulness intervention (treatment group) and 15 patients received standard care (control group). Because satisfaction is measured on an ordinal scale with a small sample, the researcher selects the Mann-Whitney U test.
Step 1: Report Descriptive Statistics With Medians and IQR
Always present group-level descriptive statistics before the inferential results. For the Mann-Whitney U test, report medians and interquartile ranges:
| Group | n | Mdn | IQR |
|-------|-----|-------|-----|
| Mindfulness | 15 | 6.00 | 5.00-7.00 |
| Standard care | 15 | 4.00 | 3.00-5.00 |
In running text:
Patients in the mindfulness group reported higher satisfaction (Mdn = 6.00, IQR = 5.00-7.00) compared to the standard care group (Mdn = 4.00, IQR = 3.00-5.00).
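A minimal sketch of how the median and interquartile range might be computed with Python's standard library. The ratings below are invented for illustration and are not the study data; the helper names are hypothetical. Quartile definitions differ between software packages, so the "inclusive" method used here is an assumption and may not match your package's output exactly.

```python
from statistics import median, quantiles


def describe(scores):
    """Return (n, median, first quartile, third quartile)."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return len(scores), median(scores), q1, q3


def describe_apa(scores):
    """Format the descriptives the way they appear in running text."""
    _, mdn, q1, q3 = describe(scores)
    return f"Mdn = {mdn:.2f}, IQR = {q1:.2f}-{q3:.2f}"


# Hypothetical 7-point ratings, for illustration only
example_ratings = [4, 5, 5, 6, 6, 6, 7, 7]
print(describe_apa(example_ratings))
```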
Step 2: Report the Significant Result
A Mann-Whitney U test indicated that patient satisfaction was significantly higher in the mindfulness group (Mdn = 6.00) than in the standard care group (Mdn = 4.00), U = 42.50, z = -3.12, p = .002, rrb = .62.
Step 3: Report a Non-Significant Result
If the same study had produced non-significant results:
A Mann-Whitney U test revealed no statistically significant difference in patient satisfaction between the mindfulness group (Mdn = 5.00) and the standard care group (Mdn = 4.00), U = 89.00, z = -1.21, p = .226, rrb = .21. The small effect size suggests that the mindfulness intervention did not produce a meaningful difference in satisfaction.
Breaking Down Each Component
The U statistic. The raw test statistic calculated from rank sums. When reporting, use the U value provided by your software. Some packages report the smaller of two possible U values; others report U1 or U2. Be consistent and note which convention your software uses.
The z-score. The standardized test statistic computed as:
z = (U - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
Include the sign (positive or negative) because it indicates the direction of the group difference. The z-score is essential for computing the effect size r.
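The following sketch shows one way to obtain U from midranks and to standardize it with the formula above. The function names are hypothetical. The z formula here applies neither a tie correction nor a continuity correction; for U = 42.50 with 15 per group it gives about -2.90, so the z = -3.12 in the example would be consistent with software that applied a tie correction, although the example does not state this.

```python
from math import sqrt


def midranks(values):
    """Rank values 1..N, giving tied values the average of their positions."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def mann_whitney_u(group1, group2):
    """Return (U1, U2). U1 counts pairs where group1 > group2, ties as 0.5."""
    n1, n2 = len(group1), len(group2)
    ranks = midranks(list(group1) + list(group2))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    return u1, n1 * n2 - u1


def z_from_u(u, n1, n2):
    """Normal approximation with no tie or continuity correction."""
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean_u) / sd_u


print(round(z_from_u(42.5, 15, 15), 2))
```

Returning both U1 and U2 makes it easy to state which convention a report follows, since the smaller of the two is simply `min(u1, u2)`.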
The p-value. Report the exact p-value to three decimal places (e.g., p = .002). When p is below .001, write p < .001. Never write p = .000 — a p-value is never exactly zero. For small samples (n < 20 per group), use the exact p-value rather than the asymptotic approximation.
The effect size. Report the rank-biserial correlation (rrb) or r = z / sqrt(N). APA 7th edition mandates effect sizes for all inferential tests. Without an effect size, the reader cannot judge whether a statistically significant result is practically meaningful.
Complete Write-Up
Results
Patient satisfaction ratings were compared between the mindfulness intervention group (n = 15) and the standard care group (n = 15) using a Mann-Whitney U test. A Shapiro-Wilk test indicated that satisfaction ratings deviated significantly from normality in the standard care group, W = 0.88, p = .047, justifying the use of a nonparametric test. The mindfulness group reported significantly higher satisfaction (Mdn = 6.00, IQR = 5.00-7.00) than the standard care group (Mdn = 4.00, IQR = 3.00-5.00), U = 42.50, z = -3.12, p = .002, rrb = .62. The large effect size indicates a substantial difference in satisfaction between the two groups.
Effect Size: Rank-Biserial Correlation
APA 7th edition requires an effect size alongside every inferential test. For the Mann-Whitney U test, the rank-biserial correlation (rrb) is the preferred measure because it has a direct, intuitive interpretation.
Calculation
The rank-biserial correlation is calculated directly from the U statistic:
rrb = 1 - (2U) / (n1 * n2)
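This formula yields a value between -1 and +1. Its sign depends on which U your software reports: many packages return the smaller of U1 and U2, in which case the result is never negative and the direction of the difference must be read from the medians or mean ranks. If you compute it from the U of one designated group, state which group and which convention you used. The magnitude tells you the degree of separation between the two rank distributions.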
An alternative calculation uses the z-score:
r = z / sqrt(N)
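where N is the total sample size across both groups. This method is convenient when only the z-score is available, but r and rrb are different statistics and generally do not coincide. In the satisfaction example, z = -3.12 with N = 30 gives |r| = .57, whereas rrb = .62. State which one you are reporting.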
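Both formulas are simple enough to check by hand. As a minimal sketch (function names are hypothetical), applying them to the satisfaction example's values of U = 42.50, z = -3.12, and 15 patients per group:

```python
from math import sqrt


def rank_biserial(u, n1, n2):
    """rrb = 1 - 2U / (n1 * n2)."""
    return 1 - 2 * u / (n1 * n2)


def r_from_z(z, n_total):
    """r = z / sqrt(N); keeps the sign of z."""
    return z / sqrt(n_total)


rrb = rank_biserial(42.5, 15, 15)
r = r_from_z(-3.12, 30)
print(round(rrb, 2), round(abs(r), 2))
```

The two values are close but not identical, which is why a report should name the effect size it uses.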
Interpretation Benchmarks
The rank-biserial correlation follows standard effect size benchmarks adapted from Cohen (1988):
| rrb | Interpretation | Practical meaning |
|-------------------|----------------|-------------------|
| .10 | Small effect | Minimal practical difference between groups |
| .30 | Medium effect | Noticeable difference that may be practically meaningful |
| .50 | Large effect | Substantial difference with clear practical significance |
In our example, rrb = .62 exceeds the .50 threshold, indicating a large effect: patients receiving the mindfulness intervention reported substantially higher satisfaction than those receiving standard care.
Probabilistic Interpretation
The rank-biserial correlation also has a probabilistic interpretation. It can be transformed into the common language effect size (CLES), which represents the probability that a randomly selected observation from one group exceeds a randomly selected observation from the other:
CLES = (rrb + 1) / 2
For rrb = .62, the CLES = .81, meaning there is an 81% probability that a randomly chosen patient from the mindfulness group reported higher satisfaction than a randomly chosen patient from the standard care group, with tied pairs counted as half a win for each group.
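The conversion, and the pairwise comparison it summarizes, can be sketched as follows. The function names are hypothetical, and the data-based version assumes the common convention of counting a tied pair as half.

```python
def cles_from_rrb(rrb):
    """CLES = (rrb + 1) / 2."""
    return (rrb + 1) / 2


def cles_from_data(group1, group2):
    """Share of all (x, y) pairs in which x from group1 exceeds y from group2.

    Tied pairs count as 0.5, which is an assumed convention.
    """
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in group1 for y in group2)
    return wins / (len(group1) * len(group2))


print(round(cles_from_rrb(0.62), 2))
```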
Confidence Intervals for Effect Size
APA 7th edition recommends reporting confidence intervals when available. For the rank-biserial correlation, confidence intervals can be computed via bootstrapping:
U = 42.50, z = -3.12, p = .002, rrb = .62, 95% CI [.28, .82]
Including confidence intervals conveys the precision of the effect size estimate and allows readers to evaluate the plausible range of the true effect.
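One possible way to bootstrap such an interval is a percentile bootstrap that resamples each group with replacement. The sketch below is an assumption about method, not a description of how the interval in the example was obtained; the function names, the number of resamples, and the seed are all hypothetical choices.

```python
import random


def rank_biserial_from_data(group1, group2):
    """Share of pairs favoring group1 minus share favoring group2."""
    favorable = sum((x > y) - (x < y) for x in group1 for y in group2)
    return favorable / (len(group1) * len(group2))


def bootstrap_ci(group1, group2, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the rank-biserial correlation."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    estimates = []
    for _ in range(n_boot):
        resample1 = rng.choices(group1, k=len(group1))
        resample2 = rng.choices(group2, k=len(group2))
        estimates.append(rank_biserial_from_data(resample1, resample2))
    estimates.sort()
    alpha = 1 - level
    lower = estimates[int(alpha / 2 * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

With small groups the percentile interval can be unstable, so report the bootstrap method and the number of resamples alongside the interval.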
Exact vs Asymptotic P-Values
Statistical software for the Mann-Whitney U test typically provides two types of p-values, and choosing the correct one matters for accurate reporting.
When to Use Exact P-Values
The exact p-value is computed by enumerating all possible permutations of the rank assignments under the null hypothesis. It provides the true probability of observing a U statistic as extreme as (or more extreme than) the one obtained, without relying on any distributional approximation.
Use exact p-values when:
- Small samples (n < 20 per group): The normal approximation is unreliable with small samples. Exact p-values are the gold standard in this range.
- Many tied values: When ties are extensive, the asymptotic approximation may be inaccurate even with moderate sample sizes.
- Conservative reporting is important: In clinical trials, regulatory submissions, or high-stakes research where Type I error control is critical.
Report the exact p-value explicitly:
An exact Mann-Whitney U test indicated that scores differed significantly between groups, U = 18.00, exact p = .014, rrb = .52.
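To make the enumeration idea concrete, here is a minimal sketch of a two-sided exact test that lists every way of assigning the observed ranks to the first group. The function names are hypothetical, and "as extreme" is defined here as a rank sum at least as far from its null expectation as the observed one, which is one common choice. Pure-Python enumeration is only practical for small groups (roughly a dozen per group or fewer).

```python
from itertools import combinations


def midranks(values):
    """Rank values 1..N, giving tied values the average of their positions."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def exact_p_two_sided(group1, group2):
    """Exact permutation p-value based on the rank sum of group1."""
    pooled = list(group1) + list(group2)
    n1, n = len(group1), len(pooled)
    ranks = midranks(pooled)
    expected = n1 * (n + 1) / 2            # null expectation of the rank sum
    observed = abs(sum(ranks[:n1]) - expected)
    total = extreme = 0
    for idx in combinations(range(n), n1):  # every possible group-1 assignment
        total += 1
        if abs(sum(ranks[i] for i in idx) - expected) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Complete separation with 4 per group: only 2 of the 70 assignments are as extreme
print(exact_p_two_sided([1, 2, 3, 4], [5, 6, 7, 8]))
```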
When to Use Asymptotic P-Values
The asymptotic p-value uses the normal distribution (z-score) as an approximation to the exact permutation distribution. This approximation improves as sample sizes increase.
Use asymptotic p-values when:
- Large samples (n >= 20 per group): The normal approximation is highly accurate and the exact computation becomes unnecessary.
- Computational constraints: Exact p-values for very large samples can be computationally intensive, though modern software handles this well.
With asymptotic p-values, always report the z-score:
U = 156.50, z = -3.24, p = .001, r = .46
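The two-tailed asymptotic p-value follows directly from the z-score through the standard normal distribution. A minimal sketch using only the standard library (the function name is hypothetical):

```python
from math import erfc, sqrt


def two_sided_p_from_z(z):
    """Two-tailed p-value under the standard normal: 2 * (1 - Phi(|z|))."""
    return erfc(abs(z) / sqrt(2))


for z in (-3.12, -1.21, -3.24):
    print(z, round(two_sided_p_from_z(z), 3))
```

Checking reported z and p values against each other this way is a quick guard against transcription errors in a results section.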
Small Sample Considerations
For samples with fewer than 20 observations per group, the asymptotic z-test can produce p-values that differ meaningfully from the exact p-value. This discrepancy is most pronounced when the sample sizes are very small (e.g., n < 10) or highly unequal. In these situations, the exact test protects against both liberal and conservative errors in significance testing.
Some software applies a continuity correction to the z-score for small samples. If your software does this, note it in your report:
A Mann-Whitney U test with continuity correction was conducted...
Common Mistakes to Avoid
Mistake 1: Reporting Means Instead of Medians
The most pervasive error in Mann-Whitney U reporting is presenting means and standard deviations as the descriptive statistics. Because the Mann-Whitney U test operates on ranks, the median is the appropriate measure of central tendency and the interquartile range is the appropriate measure of spread.
Wrong:
The treatment group (M = 5.67, SD = 1.45) scored higher than the control group (M = 3.89, SD = 1.72), U = 42.50, p = .002.
Correct:
The treatment group (Mdn = 6.00, IQR = 5.00-7.00) scored higher than the control group (Mdn = 4.00, IQR = 3.00-5.00), U = 42.50, z = -3.12, p = .002, rrb = .62.
You may report means alongside medians for additional context, but always make clear that the Mann-Whitney U test evaluates rank distributions, not mean differences.
Mistake 2: Not Reporting Effect Size
Reporting only U and p without an effect size is incomplete under APA 7th edition. Every inferential test requires an accompanying effect size measure. A statistically significant p-value tells you that a difference exists; the effect size tells you whether that difference matters.
Incomplete: U = 42.50, z = -3.12, p = .002
Complete: U = 42.50, z = -3.12, p = .002, rrb = .62
Mistake 3: Using Mann-Whitney When the t-Test Is Appropriate
Some researchers default to nonparametric tests as a "safe" choice, reasoning that avoiding distributional assumptions is always better. This is incorrect. When the data are continuous, approximately normally distributed, and have similar variances across groups, the independent samples t-test has greater statistical power. Under normality the Mann-Whitney U test is about 95.5% as efficient as the t-test, so using it unnecessarily means you need roughly 5% more observations to reach the same power.
Always justify your test choice. State the specific assumption violation that led you to select the Mann-Whitney U test:
A Shapiro-Wilk test indicated significant non-normality in the control group, W = 0.84, p = .003. Therefore, a Mann-Whitney U test was used instead of an independent samples t-test.
Mistake 4: Confusing Mann-Whitney U With Wilcoxon Signed-Rank
The Mann-Whitney U test and the Wilcoxon signed-rank test are both nonparametric, but they serve entirely different research designs:
| Test | Design | Parametric equivalent |
|------|--------|----------------------|
| Mann-Whitney U | Two independent groups | Independent samples t-test |
| Wilcoxon signed-rank | Paired/repeated measures | Paired samples t-test |
Use Mann-Whitney U when different participants are in each group (between-subjects). Use the Wilcoxon signed-rank test when the same participants are measured twice or when observations are naturally paired (within-subjects).
Mistake 5: Omitting the Z-Score
The U statistic alone is difficult to interpret because its magnitude depends on sample sizes. A U of 150 is close to the null expectation of 200 with n = 20 per group, but far below the null expectation of 1,250 with n = 50 per group. The z-score standardizes U and is necessary for computing the effect size r. Always include both U and z in your report.
Mistake 6: Failing to Specify Exact vs Asymptotic
For small samples, exact and asymptotic p-values can differ meaningfully. Readers need to know which you are reporting to evaluate the accuracy of your significance test. When using exact p-values, state this explicitly (e.g., "exact p = .023"). When using asymptotic p-values, the inclusion of the z-score implicitly signals this, but being explicit is better practice.
Mistake 7: Ignoring Assumptions About Distribution Shape
While the Mann-Whitney U test does not assume normality, it does assume that both groups have similarly shaped distributions if you want to interpret the result as a difference in medians. If the distributions have different shapes (e.g., one is skewed right and the other is symmetric), the test evaluates stochastic dominance rather than a median difference. In this case, report mean ranks rather than medians, or note the distributional difference in your write-up.
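If you report mean ranks, they can be obtained by ranking the pooled data and averaging within each group. A minimal sketch with hypothetical function names:

```python
def midranks(values):
    """Rank values 1..N, giving tied values the average of their positions."""
    positions = {}
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def mean_ranks(group1, group2):
    """Return the mean pooled rank of each group."""
    n1 = len(group1)
    ranks = midranks(list(group1) + list(group2))
    return sum(ranks[:n1]) / n1, sum(ranks[n1:]) / len(group2)


print(mean_ranks([1, 2, 3], [4, 5, 6]))
```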
APA Table Format
When reporting multiple Mann-Whitney U comparisons or multiple outcome variables, an APA-formatted table is more efficient than inline text:
Table 1
Mann-Whitney U Test Results for Patient Outcomes by Treatment Condition
| Outcome | Mindfulness Mdn (IQR) | Standard Care Mdn (IQR) | U | z | p | rrb |
|---------|--------------------------|----------------------------|-----|-----|-----|-------------------|
| Satisfaction (1-7) | 6.00 (5.00-7.00) | 4.00 (3.00-5.00) | 42.50 | -3.12 | .002 | .62 |
| Pain (0-10) | 3.00 (2.00-4.00) | 5.00 (3.00-7.00) | 51.00 | -2.78 | .005 | .55 |
| Anxiety (0-10) | 4.00 (3.00-6.00) | 5.00 (3.50-6.50) | 92.00 | -0.98 | .329 | .18 |
Note. N = 30 (15 per group). Effect size is rank-biserial correlation. Significance is two-tailed.
Calculation Accuracy
Getting the U statistic, z-score, exact p-value, and effect size right requires careful computation — especially with tied ranks and small samples. Manual calculation is tedious and error-prone.
Our free Mann-Whitney U Test Calculator computes all the components you need for APA 7th edition reporting:
- The U statistic with automatic tie correction
- Both exact and asymptotic p-values
- Rank-biserial correlation effect size with interpretation
- Medians and interquartile ranges for each group
- A ready-to-copy APA results sentence
Enter your data, click calculate, and copy the formatted result directly into your manuscript. The calculator also generates publication-quality box plots for visual comparison of the two groups.
Frequently Asked Questions
Is Mann-Whitney U the same as the Wilcoxon rank-sum test?
Yes. The Mann-Whitney U test and the Wilcoxon rank-sum test are mathematically equivalent — they produce identical p-values and test the same null hypothesis. The naming difference is historical. Do not confuse the Wilcoxon rank-sum test (for independent groups) with the Wilcoxon signed-rank test (for paired samples).
Should I report one-tailed or two-tailed p-values?
Use two-tailed p-values unless you specified a directional hypothesis before data collection. APA 7th edition recommends two-tailed tests as the default. If you use a one-tailed test, state this explicitly and justify why a directional prediction was warranted.
What is the minimum sample size for a Mann-Whitney U test?
The test can be run on very small groups, but with fewer than 4 observations per group a two-tailed exact test cannot reach p < .05, and power remains very low at that size. For adequate power (80%) to detect a medium effect (r = .30) with a two-tailed test at the .05 level, plan on roughly 40-45 observations per group. Use an a priori power analysis to determine the sample size needed for your specific research context.
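The lower limit on group size comes from the smallest p-value an exact test can produce. Assuming no ties, the most extreme outcome (complete separation) is one of C(n1 + n2, n1) equally likely rank assignments, so the smallest two-tailed p is twice its reciprocal. A minimal sketch with a hypothetical function name:

```python
from math import comb


def min_two_sided_exact_p(n1, n2):
    """Smallest achievable two-tailed exact p-value, assuming no ties."""
    return min(1.0, 2 / comb(n1 + n2, n1))


for n in (3, 4, 5):
    print(n, round(min_two_sided_exact_p(n, n), 3))
```

With 3 per group the smallest possible value is .10, so significance at the .05 level is out of reach regardless of the data.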
Can I report both means and medians alongside the Mann-Whitney U result?
You may report means for additional context, but the primary descriptive statistics must be medians and interquartile ranges. If you include means, clarify that the Mann-Whitney U test does not evaluate mean differences and that the means are provided for descriptive completeness only.
How do I handle ties when reporting Mann-Whitney U results?
Most statistical software applies a tie correction to the z-score automatically. If ties are extensive (more than 15-20% of observations), mention the correction in your report: "A Mann-Whitney U test with tie correction was used." For small samples with many ties, prefer exact p-values over the asymptotic approximation, as ties affect the accuracy of the normal approximation more than the exact permutation distribution.
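For reference, the usual tie correction shrinks the standard deviation of U by a term based on the size t of each group of tied values: sd = sqrt(n1 * n2 / 12 * ((N + 1) - sum(t^3 - t) / (N * (N - 1)))). The sketch below computes it together with the share of observations involved in ties. The function names are hypothetical, and your software's exact implementation may differ.

```python
from collections import Counter
from math import sqrt


def tied_share(pooled_values):
    """Fraction of observations that share their value with another one."""
    counts = Counter(pooled_values)
    return sum(t for t in counts.values() if t > 1) / len(pooled_values)


def tie_corrected_sd(pooled_values, n1, n2):
    """Standard deviation of U under the null, corrected for ties."""
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in Counter(pooled_values).values())
    return sqrt(n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1))))


pooled = [1, 2, 2, 3]  # hypothetical pooled scores from two groups of 2
print(tied_share(pooled), round(tie_corrected_sd(pooled, 2, 2), 3))
```

Because the corrected standard deviation is smaller, the tie-corrected z is larger in magnitude than the uncorrected one for the same U.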