When to Use T-Test vs Mann-Whitney U Test

The Fundamental Question

You have two independent groups and want to know if they differ. The most common options are the independent samples t-test and the Mann-Whitney U test. Choosing between them is not a matter of preference — it depends on your data characteristics, sample size, and research goals.

This guide provides a systematic comparison so you can make an informed decision every time.

Quick Comparison Table

| Feature | Independent t-Test | Mann-Whitney U Test |
|---------|-------------------|-------------------|
| What it compares | Means | Rank distributions |
| Data type required | Continuous (interval/ratio) | Ordinal or continuous |
| Normality assumption | Yes | No |
| Equal variance assumption | Yes (Welch's corrects this) | Similar shape (for median comparison) |
| Sensitivity to outliers | High | Low |
| Statistical power | Higher (when assumptions met) | Slightly lower |
| Sample size guidance | n >= 30 per group (robust) | Any size; best with n >= 5 |
| Effect size | Cohen's d | Rank-biserial r |

How Each Test Works

The Independent Samples t-Test

The t-test calculates the difference between two group means and divides it by the standard error of the difference. The resulting t-statistic measures how many standard errors the observed difference is from zero.

Formula (simplified):

t = (Mean1 - Mean2) / SE(difference)

The test assumes that the sampling distribution of the mean difference is approximately normal. This holds when the data themselves are normal, or when sample sizes are large enough for the Central Limit Theorem to apply.
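As a minimal sketch of the formula above, the following Python computes Student's pooled-variance t-statistic with only the standard library. The function name `student_t` and the sample values are illustrative assumptions, not the implementation behind any calculator mentioned in this guide.

```python
import math
from statistics import mean, variance

def student_t(group1, group2):
    """Return (t, df) for Student's independent-samples t-test."""
    n1, n2 = len(group1), len(group2)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    # Standard error of the difference between the two means.
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(group1) - mean(group2)) / se_diff
    return t, n1 + n2 - 2

# Hypothetical scores, for illustration only.
t, df = student_t([1, 2, 3], [2, 3, 4])
print(f"t({df}) = {t:.2f}")  # t(4) = -1.22
```

The p-value then comes from the t distribution with `df` degrees of freedom, which a statistics package would normally supply.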

Try it yourself with the independent t-test calculator.

The Mann-Whitney U Test

The Mann-Whitney U test combines both groups, ranks all observations from lowest to highest, and then checks whether one group's ranks are systematically higher or lower than the other's.

Core logic: If the two groups come from the same distribution, you would expect the ranks to be evenly mixed. If one group's values are consistently higher, its ranks will be disproportionately large, producing a significant U statistic.
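The ranking logic can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the helper names `midranks` and `mann_whitney_u` are hypothetical, tied values share the average of their ranks, and no p-value is computed.

```python
def midranks(values):
    """Rank values from lowest to highest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def mann_whitney_u(group1, group2):
    """Return (U1, U2) computed from the rank sum of group1."""
    n1, n2 = len(group1), len(group2)
    ranks = midranks(list(group1) + list(group2))
    rank_sum_1 = sum(ranks[:n1])
    u1 = rank_sum_1 - n1 * (n1 + 1) / 2
    return u1, n1 * n2 - u1

# Every value in the first group outranks every value in the second.
print(mann_whitney_u([3, 4], [1, 2]))  # (4.0, 0.0)
```

When the ranks are evenly mixed, U1 and U2 are both close to n1 × n2 / 2; a very small value for either one signals that one group's ranks are systematically higher.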

Try it yourself with the Mann-Whitney U calculator.

Assumptions: The Decision Starts Here

The most important factor in choosing between these tests is whether your data meet the assumptions of the t-test.

t-Test Assumptions

Independence: Observations are independent within and between groups.
Normality: The dependent variable is approximately normally distributed in each group.
Homogeneity of variance: The two groups have similar variances (addressed by Welch's correction).
Interval or ratio scale: The dependent variable is measured on a scale where differences between values are meaningful and equal.

Mann-Whitney Assumptions

Independence: Same as the t-test.
Ordinal or continuous scale: Data must be at least rankable.
Similar distribution shape: If you want to interpret results as a comparison of medians, both groups should have similarly shaped distributions (similar skewness and spread).

The Mann-Whitney has fewer and weaker assumptions, which is precisely why it exists — as a fallback when the t-test's assumptions are not met.

When to Use the t-Test

Choose the t-test when all of the following are true:

Your Data Are Continuous and Measured on an Interval or Ratio Scale

Test scores, reaction times, blood pressure measurements, weights, and monetary values are all appropriate for the t-test.

Normality Is Reasonably Met

Check normality using:

Visual inspection: Histograms and Q-Q plots should show an approximately symmetric, bell-shaped distribution.
Shapiro-Wilk test: A non-significant result (p > 0.05) is consistent with normality, although with small samples the test has little power to detect departures.

Important nuance: The t-test is robust to moderate normality violations when sample sizes are at least 30 per group. With large samples, even moderately skewed data produce reliable t-test results thanks to the Central Limit Theorem.

There Are No Extreme Outliers

The mean is sensitive to outliers. A single extreme value can shift the mean substantially and inflate or deflate the t-statistic. If outliers are present, consider whether they represent data errors or genuine extreme values before choosing your test.

You Want Maximum Statistical Power

When assumptions are met, the t-test is more powerful than the Mann-Whitney U test, meaning it is more likely to detect a real difference. The advantage is small: the asymptotic relative efficiency of the Mann-Whitney compared to the t-test is approximately 0.955, so the Mann-Whitney needs roughly 5% more observations to reach the same power.

When to Use the Mann-Whitney U Test

Choose the Mann-Whitney when any of the following apply:

Your Data Are Ordinal

Likert scale ratings, severity rankings, or any variable where the intervals between values are not guaranteed to be equal. A satisfaction rating of 4 versus 5 may not represent the same psychological difference as 1 versus 2.

Normality Is Clearly Violated and Samples Are Small

If the Shapiro-Wilk test is significant (p < 0.05) and your sample sizes are below 30, the t-test may produce unreliable results. The Mann-Whitney does not require normality.

Outliers Are Present and Cannot Be Removed

Because the Mann-Whitney works with ranks, an extreme outlier has the same influence as a value that is merely the largest — both receive the highest rank. This makes the test naturally robust.

Distributions Are Heavily Skewed

Income data, response time data, and many biological measurements are naturally skewed. When skewness is severe and transformations are not appropriate, the Mann-Whitney is the safer choice.

Side-by-Side Example

Let us apply both tests to the same dataset and compare the outcomes.

Dataset

A company tests two training methods and measures employee performance scores (0-100) after training.

Method A: 78, 82, 85, 88, 76, 91, 84, 79, 87, 83, 80, 86

Method B: 72, 68, 75, 80, 65, 77, 71, 74, 69, 73, 76, 70

t-Test Results

| Statistic | Value |
|-----------|-------|
| Mean (Method A) | 83.25 |
| Mean (Method B) | 72.50 |
| Mean difference | 10.75 |
| t-statistic | 6.08 |
| df | 22 |
| p-value | 0.000004 |
| Cohen's d | 2.48 |

Mann-Whitney U Results

| Statistic | Value |
|-----------|-------|
| Median (Method A) | 83.5 |
| Median (Method B) | 72.5 |
| U statistic | 5.0 |
| z-score | -3.87 |
| p-value (normal approximation) | 0.0001 |
| Rank-biserial r | 0.931 |

Comparing the Outcomes

Both tests reach the same conclusion: Method A produces significantly higher scores. However, notice that:

The t-test p-value is smaller (0.000004 vs 0.0001), reflecting its greater power with well-behaved data.
Both effect sizes indicate a large effect, but they use different scales (d = 2.48 vs r = 0.931).
The t-test reports means (83.25 vs 72.50) while the Mann-Whitney reports medians (83.5 vs 72.5).

In this case, where the data are approximately normal and there are no outliers, the t-test is the better choice because it extracts more information from the data.

Now Add an Outlier

Suppose Method A had one unusual score: replace 91 with 150 (perhaps a data entry error that was not caught).

Modified Method A: 78, 82, 85, 88, 76, 150, 84, 79, 87, 83, 80, 86

| Test | p-value | Effect size |
|------|---------|-------------|
| t-test | 0.014 | d = 1.09 |
| Mann-Whitney U | 0.0001 | r = 0.931 |

The outlier weakened the t-test substantially: p rose from 0.000004 to 0.014 and d dropped from 2.48 to 1.09. The mean of Method A increased, but its standard deviation was inflated far more, so the standardized difference shrank. The Mann-Whitney result did not change at all, because 91 was already the highest score in the data: the value keeps the top rank whether it is 91 or 150.
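The outlier experiment can be reproduced with a short standard-library sketch. The helper names are hypothetical, the t-statistic uses the pooled-variance (Student's) form, and ties are counted as half a pair when computing U; a statistics package may apply different corrections.

```python
import math
from statistics import mean, variance

def student_t(g1, g2):
    """Pooled-variance t-statistic for two independent groups."""
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
    return (mean(g1) - mean(g2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def u_statistic(g1, g2):
    """Smaller of the two U values, counting each tied pair as 0.5."""
    wins = sum((x > y) + 0.5 * (x == y) for x in g1 for y in g2)
    return min(wins, len(g1) * len(g2) - wins)

method_a = [78, 82, 85, 88, 76, 91, 84, 79, 87, 83, 80, 86]
method_b = [72, 68, 75, 80, 65, 77, 71, 74, 69, 73, 76, 70]
# Replace the top score with the outlier, as in the example above.
method_a_outlier = [150 if score == 91 else score for score in method_a]

for label, scores in (("original", method_a), ("with outlier", method_a_outlier)):
    t = student_t(scores, method_b)
    u = u_statistic(scores, method_b)
    print(f"{label}: t = {t:.2f}, U = {u}")
```

Running it shows the t-statistic falling sharply once the outlier is introduced, while U is identical in both runs because the ordering of the observations has not changed.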

A Decision Framework

Use this flowchart-style approach:

Step 1: What is your data type?

Ordinal data → Use Mann-Whitney.
Continuous data → Go to Step 2.

Step 2: Check sample size.

n >= 30 per group → The t-test is robust to moderate non-normality. Go to Step 3, where only severe departures should count against it.
n < 30 per group → Go to Step 3 (normality matters more, so even moderate departures count).

Step 3: Check normality.

Normal or approximately normal → Go to Step 4.
Clearly non-normal → Use Mann-Whitney.

Step 4: Check for outliers.

No extreme outliers → Use the t-test.
Extreme outliers present → Use Mann-Whitney (or remove outliers with justification and use the t-test).
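The four steps can be written down as a small function. This is one possible reading of the framework, not a definitive rule: the function name, the argument names, and the three normality labels ("normal", "moderate", "severe") are assumptions introduced for illustration, and the assessments themselves still have to come from plots and judgment.

```python
def choose_test(data_type, n_per_group, normality, extreme_outliers):
    """Suggest a test by walking Steps 1-4.

    data_type: "ordinal" or "continuous"
    normality: "normal", "moderate", or "severe" (hypothetical labels for the
               degree of departure from normality)
    """
    # Step 1: ordinal data go straight to the rank-based test.
    if data_type == "ordinal":
        return "Mann-Whitney U"
    # Step 2: with at least 30 per group, moderate non-normality is tolerated.
    large_sample = n_per_group >= 30
    # Step 3: clear non-normality points to Mann-Whitney.
    if normality == "severe" or (normality == "moderate" and not large_sample):
        return "Mann-Whitney U"
    # Step 4: extreme outliers that cannot be justified away also do.
    if extreme_outliers:
        return "Mann-Whitney U"
    return "t-test"

print(choose_test("continuous", 40, "moderate", False))  # t-test
print(choose_test("continuous", 12, "moderate", False))  # Mann-Whitney U
```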

Power Comparison

Statistical power is the probability of detecting a real effect when one exists. Under ideal conditions:

The t-test achieves the specified power (e.g., 0.80).
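The Mann-Whitney needs slightly more data to reach that same power: its asymptotic relative efficiency relative to the t-test is 3/pi, approximately 0.955, so the required sample size is about 1/0.955, or roughly 1.05 times, that of the t-test.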

This means that if you need 64 participants per group for the t-test at 80% power, you would need approximately 67 per group for the Mann-Whitney to achieve the same power. The difference is small and often negligible in practice.
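The 64-to-67 figure follows directly from dividing the t-test sample size by the asymptotic relative efficiency. The sketch below shows the arithmetic; the function name is hypothetical, and the result is a large-sample approximation that holds for normally distributed data, not a substitute for a formal power analysis.

```python
import math

# Asymptotic relative efficiency of Mann-Whitney vs the t-test under normality.
ARE_MANN_WHITNEY = 3 / math.pi

def mann_whitney_n(n_t_test):
    """Approximate per-group n for Mann-Whitney to match the t-test's power."""
    return n_t_test / ARE_MANN_WHITNEY

print(round(ARE_MANN_WHITNEY, 3))  # 0.955
print(round(mann_whitney_n(64)))   # 67
```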

However, when data are non-normal, the Mann-Whitney can actually be more powerful than the t-test, sometimes substantially so. With heavy-tailed distributions or contaminated normal distributions, the Mann-Whitney may need fewer participants than the t-test to detect the same effect.

Effect Size Comparison

| Measure | Test | Scale | Benchmarks |
|---------|------|-------|------------|
| Cohen's d | t-test | -infinity to +infinity | 0.2 small, 0.5 medium, 0.8 large |
| Rank-biserial r | Mann-Whitney | -1 to +1 | 0.1 small, 0.3 medium, 0.5 large |

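Both effect sizes quantify the magnitude of the group difference, but they are not directly comparable. Cohen's d expresses the difference in standard deviation units. The rank-biserial r is based on all possible pairs formed by taking one value from each group: it is the proportion of pairs in which the first group's value is higher minus the proportion in which it is lower. It is therefore a rescaling of the probability that a randomly chosen value from one group exceeds a randomly chosen value from the other (r = 2 × probability - 1, counting ties as half).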
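To make the two scales concrete, here is a minimal standard-library sketch that computes Cohen's d, the probability that a value from one group exceeds a value from the other, and the rank-biserial r derived from that probability. The function names and the toy data are illustrative assumptions.

```python
import math
from statistics import mean, variance

def cohens_d(g1, g2):
    """Mean difference divided by the pooled standard deviation."""
    n1, n2 = len(g1), len(g2)
    pooled_sd = math.sqrt(((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2))
    return (mean(g1) - mean(g2)) / pooled_sd

def prob_superiority(g1, g2):
    """Share of (g1, g2) pairs where the g1 value is larger; ties count as 0.5."""
    wins = sum((x > y) + 0.5 * (x == y) for x in g1 for y in g2)
    return wins / (len(g1) * len(g2))

def rank_biserial(g1, g2):
    """Rank-biserial r as a rescaling of the pairwise win probability."""
    return 2 * prob_superiority(g1, g2) - 1

# Toy data for illustration only.
g1, g2 = [3, 4, 5], [1, 2, 6]
print(cohens_d(g1, g2))  # 0.5
print(round(prob_superiority(g1, g2), 3), round(rank_biserial(g1, g2), 3))
```

In this toy example 6 of the 9 pairs favor the first group, so the win probability is 6/9 and the rank-biserial r is 2 × 6/9 - 1 = 1/3, while d is 0.5. The two numbers describe the same data on different scales.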

What Reviewers Expect

In academic publishing, reviewers generally expect you to:

Justify your test choice by reporting assumption checks (normality test, outlier inspection).
Report the appropriate test for your data characteristics.
Not use both tests and then report whichever gives the more favorable result (this is a form of p-hacking).
Pre-specify the test in your analysis plan when possible.

A common and defensible approach is: run normality checks first, then select the test based on the results. Report the assumption check results and explain your test choice.

Step-by-Step Decision Flowchart: T-Test or Mann-Whitney?

Choosing between the t-test and Mann-Whitney U test does not have to be complicated. Work through the following five questions in order, and your answer will become clear.

Question 1: Are your groups independent?

Before anything else, confirm that your two groups consist of different participants with no pairing or matching. If the same individuals are measured twice (pre-test and post-test, for example), you need a paired t-test or the Wilcoxon signed-rank test, not an independent-samples comparison. Repeated measures on the same subjects violate the independence assumption that both the t-test and Mann-Whitney require.

Question 2: Is your dependent variable continuous?

The t-test requires data measured on an interval or ratio scale where numerical differences are meaningful. If your outcome is categorical (yes/no, pass/fail), consider a chi-square test or Fisher's exact test instead. If the variable is ordinal (rankings, Likert-scale items), skip ahead to the Mann-Whitney — it handles ordinal data natively because it operates on ranks rather than raw values.

Question 3: Is the sample size greater than 30 per group?

Sample size matters because of the Central Limit Theorem. With 30 or more observations per group, the sampling distribution of the mean is usually close to normal even when the underlying data are moderately non-normal, although heavily skewed or heavy-tailed data can need considerably more. This means the t-test becomes robust to moderate non-normality at larger sample sizes. If your groups are smaller than 30, proceed to Question 4 and pay close attention to the distributional shape.

Question 4: Do the data pass normality tests?

Apply the Shapiro-Wilk test to each group separately. If both groups yield p > 0.05, the normality assumption is reasonably supported — use the t-test. If either group shows significant departure from normality (p < 0.05), combine this information with a visual inspection of Q-Q plots and histograms. For small samples with clear non-normality, the Mann-Whitney is the safer choice. For large samples (n > 30) with mild to moderate non-normality, the t-test remains acceptable.

Question 5: Are variances equal?

Run Levene's test for equality of variances. If variances are approximately equal (Levene's p > 0.05), Student's t-test is appropriate. If variances differ significantly (Levene's p < 0.05), use Welch's t-test, which does not assume equal variances. Note that unequal variances alone do not push you toward the Mann-Whitney — Welch's correction handles this effectively within the parametric framework.
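For reference, Welch's version differs from Student's in two places: the standard error uses each group's own variance, and the degrees of freedom come from the Welch-Satterthwaite approximation. The sketch below shows both; the function name and example values are assumptions for illustration.

```python
import math
from statistics import mean, variance

def welch_t(g1, g2):
    """Return (t, df) for Welch's unequal-variance t-test."""
    n1, n2 = len(g1), len(g2)
    # Squared standard error of each group mean, kept separate (no pooling).
    se1_sq = variance(g1) / n1
    se2_sq = variance(g2) / n2
    t = (mean(g1) - mean(g2)) / math.sqrt(se1_sq + se2_sq)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se1_sq + se2_sq) ** 2 / (se1_sq ** 2 / (n1 - 1) + se2_sq ** 2 / (n2 - 1))
    return t, df

# Equal variances and equal n: df matches Student's n1 + n2 - 2.
print(welch_t([1, 2, 3], [2, 3, 4]))
# Very unequal variances: df shrinks well below 4.
print(welch_t([1, 2, 3], [0, 5, 10]))
```

The shrinking degrees of freedom are how the correction guards against unequal variances: the test becomes more conservative as the variances diverge.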

Summary path:

Ordinal data → Mann-Whitney
Continuous, large sample, approximately normal → t-test (Student's or Welch's)
Continuous, small sample, non-normal → Mann-Whitney
Continuous, large sample, severely non-normal with extreme outliers → Mann-Whitney

Comparing APA Output: Side-by-Side Examples

When you report results, the format differs depending on which test you chose. Here is the same study reported both ways so you can see exactly how the APA output changes.

Study context: A researcher compares test anxiety scores between students who received mindfulness training (n = 30) and a control group (n = 30).

T-Test APA Report

An independent samples t-test revealed that students in the mindfulness group (M = 42.3, SD = 8.7) reported significantly lower test anxiety than the control group (M = 48.9, SD = 9.2), t(58) = 2.89, p = .005, d = 0.75.

Key elements: means, standard deviations, t-statistic with degrees of freedom, p-value, and Cohen's d.

Mann-Whitney U APA Report

A Mann-Whitney U test indicated that test anxiety scores were significantly lower for the mindfulness group (Mdn = 41.5) than for the control group (Mdn = 49.0), U = 287, z = -2.67, p = .008, r = .35.

Key elements: medians, U-statistic, z-score, p-value, and rank-biserial correlation r.
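If you assemble these strings programmatically, a small formatter keeps the conventions consistent. The sketch below is an illustration only: the function names are hypothetical, and it covers just the statistic portion of each sentence, dropping the leading zero for values that cannot exceed 1 (p and r) as in the two reports above.

```python
def apa_p(p):
    """Format a p-value without a leading zero; very small values as p < .001."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")

def apa_t(t, df, p, d):
    """Statistic string for an independent-samples t-test."""
    return f"t({df}) = {t:.2f}, {apa_p(p)}, d = {d:.2f}"

def apa_mann_whitney(u, z, p, r):
    """Statistic string for a Mann-Whitney U test."""
    r_text = f"{r:.2f}".lstrip("0")
    return f"U = {u}, z = {z:.2f}, {apa_p(p)}, r = {r_text}"

print(apa_t(2.89, 58, 0.005, 0.75))             # t(58) = 2.89, p = .005, d = 0.75
print(apa_mann_whitney(287, -2.67, 0.008, 0.35))  # U = 287, z = -2.67, p = .008, r = .35
```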

What Changes Between the Two Reports

| Element | T-Test | Mann-Whitney |
|---------|--------|-------------|
| Central tendency | Mean (M) and SD | Median (Mdn) |
| Test statistic | t(df) | U and z |
| Effect size | Cohen's d | Rank-biserial r |
| Effect size scale | 0.2 / 0.5 / 0.8 | 0.1 / 0.3 / 0.5 |
| Confidence interval | For mean difference | For median difference (optional) |

Notice that the p-values differ slightly (0.005 vs 0.008) even with the same data. The t-test extracts more information from normally distributed data, yielding a smaller p-value. The effect sizes also use different scales: d = 0.75 is a medium-to-large effect in Cohen's framework, while r = 0.35 is a medium effect in the rank-biserial framework. Both indicate a meaningful group difference, but direct numerical comparison between d and r is not valid.

Common Misconceptions About Non-Parametric Tests

Several widely held beliefs about non-parametric tests are either wrong or misleading. Understanding these misconceptions helps you make better methodological choices.

Myth 1: Non-parametric tests are always less powerful

Reality: Under ideal conditions (normal data, no outliers), the t-test is only about 5% more efficient than the Mann-Whitney, meaning it reaches the same power with roughly 5% fewer observations. But when data are non-normal (heavy-tailed, skewed, or contaminated with outliers), the Mann-Whitney can be substantially more powerful. With contaminated normal distributions, where even 5-10% of observations come from a different distribution, the Mann-Whitney frequently outperforms the t-test. The blanket statement that non-parametric tests sacrifice power is only true when parametric assumptions hold.

Myth 2: The Mann-Whitney U test compares medians

Reality: The Mann-Whitney tests whether one group's values tend to be larger than the other group's values. Technically, it assesses stochastic dominance — the probability that a randomly selected observation from one group exceeds a randomly selected observation from the other. It only simplifies to a median comparison when both groups have the same distributional shape (same skewness, same spread). If the distributions differ in shape, you can have equal medians with a significant Mann-Whitney result, or different medians with a non-significant result. Report it as a test of rank distributions, not a test of medians.

Myth 3: You must always test normality before choosing a test

Reality: Routine normality testing before every analysis is not always necessary or even desirable. With large samples (n > 30 per group), normality tests like Shapiro-Wilk are overpowered — they reject normality for trivial deviations that have no practical impact on t-test validity. In some fields, the choice of test is dictated by the research design or measurement scale, not by post-hoc normality testing. For example, if your dependent variable is a 5-point Likert scale, the Mann-Whitney is appropriate regardless of what a normality test says. Consider the design context, examine plots visually, and do not let a single normality test p-value drive your entire analytic strategy.

Myth 4: Large samples always justify the t-test

Reality: While large samples make the t-test robust to moderate non-normality, they do not eliminate all problems. Severe outliers can still distort the mean and inflate the standard deviation, even with n = 200 or more. If 5% of your data consists of extreme values (perhaps from measurement errors or a distinct subpopulation), the t-test's mean-based comparison may be misleading regardless of sample size. Large samples also do not fix fundamental measurement issues — if your variable is truly ordinal, using a t-test with 1000 observations does not make it more appropriate than using the Mann-Whitney.

When Both Tests Give Different Results

One of the most unsettling experiences in data analysis is running both tests on the same data and getting contradictory conclusions — one significant, one not. This happens more often than textbooks suggest, and understanding why it occurs will help you handle it properly.

Why Discrepancies Happen

The most common cause is that the effect is near the detection threshold for at least one test. The t-test and Mann-Whitney evaluate different aspects of the data (means vs rank distributions), and borderline effects may cross the significance threshold for one but not the other. Other causes include:

Outliers inflating or deflating the t-test. A few extreme values can push the t-test toward or away from significance while leaving the Mann-Whitney largely unaffected.
Distributional differences between groups. If one group is skewed and the other is symmetric, the tests are effectively asking different questions, and different answers are not surprising.
Tied values reducing Mann-Whitney power. Many identical values in the data reduce the variability of ranks, which can make the Mann-Whitney less sensitive.

What to Do

Report both results transparently. If you ran both tests, report both outcomes. Selectively reporting only the significant one is a form of p-hacking.
Discuss the discrepancy. Explain why the tests might disagree based on your data characteristics (outliers, skewness, ties).
Prioritize the test that matches your data. If normality is violated, the Mann-Whitney result is more trustworthy. If assumptions are met, the t-test result carries more weight.
Focus on effect sizes. When p-values give conflicting messages, effect sizes often tell a clearer story. A small to medium effect size with borderline significance often suggests that the study was not adequately powered to detect the effect with confidence.
Consider a sensitivity analysis. Run the analysis with and without outliers, or with a different transformation, to see how robust the conclusion is.

Frequently Asked Questions

Can I use a t-test if my data is slightly non-normal?

Yes. The t-test is robust to moderate normality violations, especially with sample sizes above 30 per group. The Central Limit Theorem means that the sampling distribution of the mean approaches normality as the sample grows, even when the underlying data are not normal. However, for severe skewness, heavy tails, or small samples (n < 15), the Mann-Whitney U test is more appropriate because its validity does not depend on distributional shape.

What effect size should I report for Mann-Whitney U?

Report the rank-biserial correlation, calculated from the U statistic as r = 1 - 2U / (n1 × n2), where U is the smaller of the two U values and n1 and n2 are the group sizes. A related but distinct effect size, r = Z / sqrt(N), divides the standardized test statistic by the square root of the total sample size; the two generally give different values, so state which one you are reporting. The usual benchmarks are borrowed from the guidelines for Pearson's r: .10 is a small effect, .30 is a medium effect, and .50 is a large effect. Some researchers also report the common language effect size (CLES), which expresses the probability that a randomly selected observation from one group exceeds a randomly selected observation from the other.

Is Mann-Whitney the same as the Wilcoxon rank-sum test?

Yes. The Mann-Whitney U test and the Wilcoxon rank-sum test are mathematically equivalent tests for comparing two independent groups. They produce the same p-value and lead to the same conclusion — they simply use different test statistics (U vs W) that can be converted to one another. Do not confuse either of these with the Wilcoxon signed-rank test, which is a different test designed for paired samples.

Should I always run a normality test before choosing?

Not necessarily. With large samples (n > 30 per group), formal normality tests like Shapiro-Wilk often reject normality for trivial deviations that have no practical impact on the t-test. Consider the research context and measurement scale first. Examine Q-Q plots and histograms visually to assess whether the distribution is reasonably symmetric. Base your decision on the overall shape of the data, the presence or absence of outliers, and the measurement level, rather than relying solely on a normality test p-value.

Can Mann-Whitney handle tied values?

Yes, but many ties reduce the test's power. The standard Mann-Whitney formula includes a tie correction factor that adjusts the variance of the U statistic. If more than 15-20% of all values across both groups are tied (common with discrete data or coarse measurement scales), the test loses sensitivity. In such cases, report the tie-corrected z-statistic and consider whether the measurement could be refined to reduce ties.
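The tie correction mentioned above can be sketched as follows. This version uses the normal approximation without a continuity correction, which is an assumption on my part; software packages differ on that detail, so the z value may not match a given tool exactly. The function name is hypothetical.

```python
import math
from collections import Counter

def mann_whitney_z(g1, g2):
    """Tie-corrected z for the Mann-Whitney U test (no continuity correction)."""
    n1, n2 = len(g1), len(g2)
    n = n1 + n2
    # U for g1: pairs where the g1 value is larger, ties counted as 0.5.
    u1 = sum((x > y) + 0.5 * (x == y) for x in g1 for y in g2)
    # Each set of t tied values contributes t^3 - t to the correction term.
    tie_sizes = Counter(list(g1) + list(g2)).values()
    tie_term = sum(t ** 3 - t for t in tie_sizes)
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    return (u1 - n1 * n2 / 2) / math.sqrt(var_u)

print(round(mann_whitney_z([1, 2, 3], [4, 5, 6]), 2))  # -1.96 (no ties)
print(round(mann_whitney_z([1, 2, 2], [2, 3, 3]), 2))  # -1.65 (several ties)
```

With no ties the correction term is zero and the variance reduces to the familiar n1 × n2 × (N + 1) / 12.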

What sample size is needed for a Mann-Whitney test?

A minimum of 5 observations per group is needed for the test to produce a meaningful result. For the normal approximation (z-test) to be accurate, at least 8-10 per group is recommended. For adequate statistical power to detect medium effects (r = .30), aim for at least 20-30 per group. Use a formal power analysis calculator to determine the exact sample size needed for your expected effect size and desired power level.

Is Welch's t-test a good compromise between t-test and Mann-Whitney?

Welch's t-test addresses one specific assumption violation — unequal variances — but it still assumes that the data are approximately normally distributed. It is an excellent default over Student's t-test because it performs equally well when variances are equal and better when they are not. However, it does not solve non-normality problems. If your concern is about the shape of the distribution rather than unequal spreads, the Mann-Whitney remains the appropriate choice.

Can I use Mann-Whitney with ordinal data?

Yes. The Mann-Whitney U test is specifically designed to work with ordinal data because it operates on ranks rather than raw values. It does not assume equal intervals between data points, making it suitable for Likert-scale items, severity ratings, satisfaction rankings, and other ordered categorical variables. In fact, ordinal data is one of the strongest justifications for choosing the Mann-Whitney over the t-test.

Can I run both tests and report the one with the smaller p-value?

No. This is a form of p-hacking that inflates your false positive rate. Choose one test based on your data characteristics and report its results regardless of the outcome.

What if the two tests give different conclusions?

This can happen with borderline data. If the t-test is significant but the Mann-Whitney is not (or vice versa), it usually indicates that the effect is near the detection threshold. Report the result from the test that is most appropriate for your data, and note the discrepancy in your discussion.

Is the t-test ever appropriate for Likert scale data?

This is debated. Some researchers treat Likert data as interval and use the t-test; others insist it is ordinal and use Mann-Whitney. If you have 7-point or wider scales with data that are approximately normal, the t-test is often acceptable. For 3-point or 5-point scales, the Mann-Whitney is generally safer.

Should I transform my data instead of using Mann-Whitney?

Logarithmic or square root transformations can normalize skewed data, allowing you to use the t-test. This is a valid approach, but it changes what you are comparing (e.g., geometric means instead of arithmetic means). If the transformation makes substantive sense for your field, it can be a good option.
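The point about geometric means can be seen directly: a difference in mean log values, once exponentiated, is the ratio of the two groups' geometric means. A minimal sketch, assuming strictly positive data and using a hypothetical function name:

```python
import math
from statistics import mean

def geometric_mean_ratio(g1, g2):
    """Back-transform the difference in mean logs into a ratio of geometric means."""
    log_diff = mean([math.log(x) for x in g1]) - mean([math.log(x) for x in g2])
    return math.exp(log_diff)

# Skewed toy data: the first group's geometric mean is 10, the second's is 1.
print(round(geometric_mean_ratio([1, 10, 100], [1, 1, 1]), 6))  # 10.0
```

A t-test on the log values therefore tests whether this ratio differs from 1, not whether the arithmetic means differ.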

Does the Mann-Whitney require equal sample sizes?

No. The Mann-Whitney U test works with unequal group sizes. However, grossly unequal sizes (e.g., 10 vs 100) can affect the test's sensitivity. Aim for roughly balanced groups when possible.

Try Both in StatMate

StatMate makes it easy to compare the two approaches on your own data. Enter your values into the t-test calculator and the Mann-Whitney U calculator to see how the results compare. Both calculators include assumption checks, effect sizes, and APA-formatted output.