What Is a p-Value?
A p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. That definition is precise but not always intuitive, so consider this analogy.
Imagine you suspect a coin is unfair. You flip it 20 times and get 15 heads. The p-value answers the question: "If the coin were perfectly fair, how likely would I be to see 15 or more heads in 20 flips?" If that probability is very low (say, p = .021), you have reason to doubt the coin is fair. If it is relatively high (say, p = .41), the result is easily explained by normal chance.
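This tail probability can be computed directly. A minimal sketch using only Python's standard library, restating the coin example above:

```python
from math import comb

# One-sided p-value for "15 or more heads in 20 flips" if the coin is fair:
# P(X >= 15) where X ~ Binomial(n = 20, p = 0.5).
n, k = 20, 15
p_value = sum(comb(n, x) for x in range(k, n + 1)) / 2**n
print(round(p_value, 3))  # 0.021
```

The sum counts every outcome at least as extreme as the one observed, which is exactly what the p-value definition requires.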
The p-value does not tell you whether your hypothesis is correct. It tells you how surprising your data would be if nothing were actually going on. This distinction is critical, and misunderstanding it is the source of most p-value misinterpretations.
How to Interpret p-Values
The Basic Logic
Every hypothesis test begins with a null hypothesis (H0), which typically states there is no effect, no difference, or no relationship. The p-value quantifies how compatible your observed data are with that null hypothesis.
- A small p-value means your data are unlikely under H0. This gives you grounds to reject H0.
- A large p-value means your data are consistent with H0. You fail to reject H0 (but this does not prove H0 is true).
Interpretation Reference Table
| p-value range | Conventional label | Typical interpretation |
|---------------|--------------------|-------------------------|
| p < .001 | Highly significant | Very strong evidence against H0 |
| p < .01 | Significant | Strong evidence against H0 |
| p < .05 | Significant | Sufficient evidence against H0 at the conventional threshold |
| .05 < p < .10 | Marginally significant | Weak evidence; sometimes discussed but not conclusive |
| p > .10 | Not significant | Insufficient evidence to reject H0 |
A Worked Example
Suppose you conduct an independent samples t-test comparing exam scores between a study-group condition (M = 78.4, SD = 9.2, n = 35) and a solo-study condition (M = 73.1, SD = 10.5, n = 35). The test yields t(68) = 2.25, p = .028.
Here is how to interpret this step by step:
- State the null hypothesis: There is no difference in exam scores between the two study conditions.
- Check the p-value against the threshold: p = .028 is less than .05.
- Make a decision: Reject the null hypothesis.
- Interpret in context: Students in the study-group condition scored significantly higher on the exam than those who studied alone.
The p-value of .028 means that if there were truly no difference between conditions, you would observe a difference this large or larger only about 2.8% of the time by chance alone.
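The same result can be reproduced from the summary statistics alone. A sketch assuming SciPy is available (`ttest_ind_from_stats` pools variances by default, matching the df = 68 reported above):

```python
from scipy import stats

# Reproduce the worked example from means, SDs, and sample sizes.
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=78.4, std1=9.2, nobs1=35,   # study-group condition
    mean2=73.1, std2=10.5, nobs2=35,  # solo-study condition
)
print(f"t({35 + 35 - 2}) = {t_stat:.2f}, p = {p_value:.3f}")
```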
The .05 Threshold: Why and When
The convention of using alpha = .05 as the significance threshold traces back to Ronald Fisher in the 1920s. Fisher suggested .05 as a convenient reference point, not as a rigid boundary. Over decades, however, it became treated as an absolute cutoff, which Fisher himself never intended.
When .05 Makes Sense
For most exploratory research in the social and behavioral sciences, alpha = .05 provides a reasonable balance between detecting real effects (power) and avoiding false positives (Type I error). It means you accept a 5% chance of concluding an effect exists when it actually does not.
When to Use a Different Threshold
Some situations call for stricter or more lenient thresholds:
- Multiple comparisons: When testing many hypotheses simultaneously, the family-wise error rate inflates. Bonferroni correction or false discovery rate adjustments lower the per-test alpha.
- High-stakes decisions: Clinical trials, drug approvals, and genomics studies often use p < .01 or p < .001 because the consequences of a false positive are severe.
- Exploratory research: Some fields accept p < .10 for preliminary findings that warrant further investigation.
The key point is that .05 is a convention, not a law of nature. Always consider the context and consequences of your decision.
Common Misinterpretations of p-Values
This section addresses the most widespread errors in p-value interpretation. If you take away one thing from this guide, let it be this: most researchers at some point have held at least one of these misconceptions.
Mistake 1: "p = .03 Means There Is a 97% Chance the Result Is True"
This is perhaps the single most common misinterpretation. The p-value is not the probability that your research hypothesis is true. It is the probability of obtaining your data (or more extreme data) given that the null hypothesis is true. These are fundamentally different statements.
The probability that a hypothesis is true given the data requires Bayesian analysis with prior probabilities. A frequentist p-value simply cannot answer that question.
Mistake 2: "Non-significant Means No Effect"
A result of p = .12 does not prove that no effect exists. It means you did not find sufficient evidence to reject the null hypothesis at your chosen alpha level. The study may have been underpowered (too few participants), the effect may be real but small, or measurement error may have obscured it.
Absence of evidence is not evidence of absence. This is especially important in studies with small sample sizes, where non-significant results are common even when real effects exist.
Mistake 3: "The p-Value Tells You the Size of the Effect"
A very small p-value (say, p < .001) does not mean the effect is large or important. With a large enough sample, even trivially small differences become statistically significant. A study with 50,000 participants might find a 0.5-point difference on a 100-point scale with p < .001. The effect is statistically significant but practically meaningless.
Always report and interpret an effect size alongside the p-value. Common effect size measures include Cohen's d, eta squared (partial eta squared), and R squared.
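The large-sample scenario is easy to verify numerically. A sketch with hypothetical numbers (a 0.5-point difference, SD = 15, n = 25,000 per group), using a normal approximation that is essentially exact at this sample size:

```python
from math import sqrt, erfc

# Hypothetical: 0.5-point mean difference on a 100-point scale, SD = 15,
# n = 25,000 per group.
diff, sd, n = 0.5, 15.0, 25_000
se = sd * sqrt(2 / n)          # standard error of the mean difference
z = diff / se                  # with n this large, t is effectively z
p = erfc(z / sqrt(2))          # two-tailed p-value (normal approximation)
d = diff / sd                  # Cohen's d: standardized effect size
print(f"p = {p:.4f}, d = {d:.3f}")  # p < .001, yet d is trivially small
```

The p-value clears any conventional threshold while the effect size marks the difference as negligible, which is exactly the trap described above.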
Mistake 4: "Smaller p = More Important Result"
A result with p = .001 is not necessarily more important or more replicable than one with p = .04. The p-value is influenced by sample size, variance, and the magnitude of the effect. Two studies examining the same phenomenon can yield different p-values simply because they used different sample sizes.
Importance should be judged by effect size, practical significance, and how well the finding replicates, not by comparing p-values.
Mistake 5: "p = .049 and p = .051 Are Fundamentally Different"
Treating p = .049 as "significant" and p = .051 as "not significant" implies a sharp qualitative boundary that does not exist. The evidence against the null hypothesis is nearly identical for both values. Reporting one as a discovery and the other as a null result is an artifact of dichotomous thinking, not a reflection of the underlying data.
Many statisticians and journal editors now advocate for reporting exact p-values and interpreting them on a continuum rather than relying on pass/fail cutoffs.
Mistake 6: "A Significant p-Value Means the Results Will Replicate"
Statistical significance in a single study does not guarantee that the finding will replicate. A p = .04 result has a meaningful chance of failing to reach significance in an exact replication, particularly if the original study was underpowered or if the true effect is small.
Replication depends on effect size, sample size, and study design. The p-value from a single study is one piece of evidence, not proof.
How to Report p-Values in APA Format
APA 7th edition has specific rules for reporting p-values. Following these conventions signals methodological rigor and helps readers interpret your results consistently.
Rule 1: Report Exact p-Values
Report the exact p-value to two or three decimal places. Do not simply write "p < .05" when you have a more precise value.
- Correct: p = .034
- Correct: p = .007
- Avoid: p < .05 (when you know the exact value)
Rule 2: Use p < .001 for Very Small Values
When the p-value is less than .001, report it as p < .001 rather than writing out many decimal places. Do not write p = .000, as a p-value is never exactly zero.
- Correct: p < .001
- Incorrect: p = .000
- Incorrect: p = .0003
Rule 3: No Leading Zero
Because p-values cannot exceed 1.0, APA style omits the leading zero. The same rule applies to other statistics bounded by 1, such as r and R squared.
- Correct: p = .034
- Incorrect: p = 0.034
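The three rules can be collected into one small helper. This is a sketch, not an official APA tool, and it ignores edge cases such as p-values reported to two decimals:

```python
def format_p(p: float) -> str:
    """Format a p-value per the APA 7th edition rules sketched above."""
    if p < 0.001:
        return "p < .001"                      # Rule 2: never p = .000
    # Rule 1: exact value; Rule 3: drop the leading zero.
    return f"p = {p:.3f}".replace("0.", ".")

print(format_p(0.034))    # p = .034
print(format_p(0.0003))   # p < .001
```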
APA Reporting Examples by Test
Independent samples t-test:
The treatment group (M = 24.50, SD = 4.80) scored significantly higher than the control group (M = 20.10, SD = 5.30), t(58) = 3.45, p = .001, d = 0.89.
One-way ANOVA:
There was a statistically significant difference in satisfaction ratings across the three conditions, F(2, 87) = 4.92, p = .009, partial eta squared = .10.
Pearson correlation:
Study hours and GPA were positively correlated, r(98) = .37, p < .001.
Chi-square test of independence:
There was a significant association between department and turnover status, chi-square(3, N = 240) = 11.85, p = .008, V = .22.
Non-significant result (still report the exact p-value):
The difference between groups was not statistically significant, t(44) = 1.38, p = .175, d = 0.41.
Note that even when results are not significant, you still report the exact p-value and effect size. This information is valuable for meta-analyses and future power analyses.
p-Value vs Effect Size: Why Both Matter
The p-value and effect size answer different questions. The p-value asks: "Is there evidence that an effect exists?" The effect size asks: "How large is that effect?"
| | p-value | Effect size |
|---|---------|-------------|
| Question answered | Is the effect likely real? | How large is the effect? |
| Influenced by sample size | Heavily | Minimally |
| Can be misleading alone | Yes | Yes |
| APA 7th edition requirement | Yes | Yes |
Consider two studies on a new teaching method:
- Study A (N = 500): t(498) = 2.10, p = .036, d = 0.19
- Study B (N = 40): t(38) = 2.85, p = .007, d = 0.90
Study A has a significant result but a tiny effect size. The teaching method produces a barely noticeable improvement. Study B has a smaller p-value and a large effect size, suggesting a substantial and meaningful improvement. Reporting only p-values would obscure this important distinction.
APA 7th edition requires both for good reason. Together, they give a complete picture of your findings.
Statistical Significance vs Practical Significance
Statistical significance means the result is unlikely under the null hypothesis. Practical significance means the result matters in the real world. These are not the same thing.
A pharmaceutical trial might find that a new drug lowers blood pressure by 0.5 mmHg more than a placebo, with p < .001 and N = 20,000. Statistically significant? Yes. Clinically meaningful? Probably not, since doctors consider a change of at least 5 mmHg necessary for practical benefit.
When interpreting your results, always ask three questions:
- Is the effect statistically significant? (Check the p-value against your alpha level.)
- How large is the effect? (Check the effect size against benchmarks and prior research.)
- Does the effect matter in practice? (Consider the real-world implications in your specific domain.)
A finding that satisfies all three is the strongest kind of evidence. A finding that satisfies only the first is the weakest.
The P-Value Controversy: ASA Statement and Beyond
The debate over p-values reached a turning point in 2016 when the American Statistical Association (ASA) published its first-ever formal statement on statistical significance and p-values. This was unprecedented in the ASA's 177-year history and reflected deep concern about the widespread misuse of p-values across the sciences.
The ASA's Six Principles
The ASA statement articulated six principles that every researcher should understand:
- P-values can indicate how incompatible the data are with a specified statistical model. The p-value quantifies the mismatch between the data and the null hypothesis, but it is conditional on the model being correct.
- P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone. This addresses the most common misinterpretation directly.
- Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold. A result should not be dismissed solely because p = .06, nor accepted solely because p = .04.
- Proper inference requires full reporting and transparency. Cherry-picking significant results, running analyses until significance is achieved (p-hacking), and selectively reporting outcomes all undermine the validity of p-values.
- A p-value, or statistical significance, does not measure the size of an effect or the importance of a result. Small p-values can arise from trivial effects in large samples, and large p-values can occur with important effects in small samples.
- By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis. Other approaches such as confidence intervals, Bayesian methods, and effect sizes should accompany the p-value.
Why p < .05 Is Arbitrary
The .05 threshold has no mathematical derivation or scientific justification. Ronald Fisher initially proposed it as a loose guideline, writing that results below this level were worth a second look. Jerzy Neyman and Egon Pearson later formalized hypothesis testing with fixed error rates, and the two frameworks became conflated over time. The .05 cutoff is the result of historical convention, not scientific optimization.
Several consequences of this arbitrary threshold are well documented. Researchers engage in p-hacking, adjusting analyses, sample sizes, or variables until the p-value crosses below .05. Publication bias favors significant results, leaving non-significant findings in the file drawer. And the replication crisis in psychology, medicine, and other fields has been attributed in part to the uncritical application of this threshold.
In 2019, a group of more than 800 scientists published a call to abandon the term "statistically significant" entirely. They argued that the binary classification of results as significant or not significant leads to overconfident claims and overlooked evidence.
The Movement Toward Confidence Intervals and Effect Sizes
In response to these concerns, many journals and professional organizations now require or strongly recommend reporting confidence intervals and effect sizes alongside (or instead of) p-values. The reasoning is straightforward:
- Confidence intervals show the range of plausible values for the parameter of interest, providing information about both the direction and precision of the estimate. A 95% CI of [0.2, 4.8] tells you more than p = .03 alone.
- Effect sizes quantify the magnitude of the observed phenomenon, independent of sample size. Cohen's d = 0.15 versus d = 1.20 tells you far more about practical importance than comparing p-values.
- The distinction between statistical and practical significance is increasingly emphasized. A drug that lowers cholesterol by 0.1 mg/dL with p < .001 is statistically significant but clinically irrelevant. Conversely, a treatment with d = 0.80 and p = .07 in a small pilot study represents a large, potentially meaningful effect that warrants further investigation.
The ASA statement did not call for abandoning p-values. Rather, it urged that p-values be used as one tool among many, never as the sole basis for scientific conclusions.
P-Values in Different Statistical Tests
While the underlying concept is the same across all hypothesis tests, the mechanics of calculating a p-value differ depending on the test statistic and its reference distribution. Understanding these differences helps you interpret p-values more accurately and recognize what each one is actually testing.
T-Test: P-Value From the T-Distribution
In a t-test, the test statistic is calculated as the difference between means divided by the standard error of that difference. This produces a t value, which follows a t-distribution with degrees of freedom determined by the sample size.
The p-value is the area under the t-distribution curve at or beyond the observed t value. For a two-tailed test, this is the combined area in both tails. With large samples, the t-distribution approaches the standard normal distribution, and the p-values converge accordingly.
For example, if t(28) = 2.45, the p-value is the probability of observing a t value of 2.45 or more extreme (in either direction) from a t-distribution with 28 degrees of freedom. This gives approximately p = .021.
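The tail-area calculation can be checked directly, assuming SciPy is available:

```python
from scipy import stats

# Two-tailed p-value for t(28) = 2.45: the area in both tails beyond |t|.
t_obs, df = 2.45, 28
p = 2 * stats.t.sf(t_obs, df)   # sf = survival function = 1 - cdf
print(round(p, 3))  # 0.021
```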
ANOVA: P-Value From the F-Distribution
In analysis of variance, the test statistic is the F-ratio, which compares between-group variance to within-group variance. If the groups have truly equal means, this ratio should be close to 1. Larger F-values indicate greater differences among group means relative to the variability within groups.
The F-distribution is right-skewed and bounded at zero, meaning it only produces positive values. The p-value is the area under the F-distribution curve to the right of the observed F-value. Unlike the t-distribution, there is no "left tail" concern because the F-test is inherently directional (larger F = more evidence against H0).
For example, F(3, 96) = 4.15 means the between-group mean square is 4.15 times the within-group mean square, with degrees of freedom 3 (number of groups minus 1) and 96 (total N minus number of groups). The resulting p = .008 indicates this ratio is unlikely if all group means are truly equal.
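Because the F-test is one-sided by construction, the p-value is a single right-tail area. A sketch assuming SciPy is available:

```python
from scipy import stats

# p-value for F(3, 96) = 4.15: the area to the right of the observed F.
p = stats.f.sf(4.15, 3, 96)   # sf = survival function of the F-distribution
print(round(p, 3))
```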
Chi-Square: P-Value From the Chi-Square Distribution
The chi-square test compares observed frequencies in a contingency table to the frequencies expected under independence (or under a specified distribution for goodness-of-fit tests). The test statistic sums the squared differences between observed and expected values, each divided by the expected value.
Like the F-distribution, the chi-square distribution is right-skewed and non-negative. Larger chi-square values reflect greater discrepancies between observed and expected data. The p-value is the probability of obtaining a chi-square value as large or larger than the one observed, given the degrees of freedom.
For a 3x2 contingency table, df = (3 - 1)(2 - 1) = 2. If chi-square = 9.21, the p-value from the chi-square distribution with 2 df is approximately p = .010.
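For the special case df = 2, the chi-square survival function has a simple closed form, exp(-x / 2), so this example can be verified with the standard library alone (for other df, `scipy.stats.chi2.sf` does the general calculation):

```python
from math import exp

# Right-tail p-value for chi-square = 9.21 with df = 2.
# For df = 2 only, the survival function reduces to exp(-x / 2).
chi2_obs = 9.21
p = exp(-chi2_obs / 2)
print(round(p, 3))  # 0.01
```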
Correlation: P-Value From a T-Distribution Transformation
For Pearson's correlation coefficient r, the p-value is not read directly from a correlation-specific distribution. Instead, r is transformed into a t statistic using the formula:
t = r * sqrt((n - 2) / (1 - r squared))
This transformation follows a t-distribution with n - 2 degrees of freedom under the null hypothesis that the population correlation is zero. The p-value is then obtained from this t-distribution, just as in a regular t-test.
This explains why the same correlation coefficient can be significant in one study and not in another. With r = .25 and n = 100, t = 2.56 and p = .012. But with r = .25 and n = 20, t = 1.10 and p = .288. The correlation is identical, but the evidence against H0 depends heavily on sample size.
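The transformation is short enough to wrap in a helper. A sketch assuming SciPy is available (`r_to_p` is an illustrative name, not a library function):

```python
from math import sqrt
from scipy import stats

def r_to_p(r: float, n: int) -> float:
    """Two-tailed p-value for Pearson's r via the t transformation above."""
    t = r * sqrt((n - 2) / (1 - r**2))
    return 2 * stats.t.sf(abs(t), n - 2)   # df = n - 2

print(round(r_to_p(0.25, 100), 3))  # ~.012
print(round(r_to_p(0.25, 20), 3))   # ~.29 -- same r, far weaker evidence
```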
The Common Principle
Despite these mechanical differences, every p-value answers the same fundamental question: What is the probability of obtaining a result this extreme, or more extreme, if the null hypothesis is true? The test statistic quantifies "how extreme" in a way appropriate to each test, and the reference distribution provides the probability scale. Whether you are comparing means, proportions, variances, or correlations, the logical framework is the same.
One-Tailed vs Two-Tailed P-Values
The distinction between one-tailed and two-tailed tests is a common source of confusion, and the choice between them has real implications for your p-value and your conclusions.
What Is a Two-Tailed Test?
A two-tailed test evaluates whether the observed effect differs from zero in either direction. It considers the possibility that Group A could score higher than Group B, or that Group B could score higher than Group A. The p-value includes the probability of obtaining the observed result or more extreme in both tails of the distribution.
If your t-test yields t = 2.10, the two-tailed p-value counts the probability of observing t greater than or equal to 2.10 and the probability of observing t less than or equal to -2.10. This makes the two-tailed test more conservative.
What Is a One-Tailed Test?
A one-tailed test evaluates whether the observed effect is in a specific, pre-specified direction. For example, you might predict that a new drug will lower blood pressure (not just change it). The p-value then only considers the probability in one tail of the distribution.
For a symmetric test statistic, and provided the observed effect falls in the predicted direction, the one-tailed p-value is exactly half the two-tailed p-value:
One-tailed p = Two-tailed p / 2
So if the two-tailed p = .06, the one-tailed p = .03. This means a result that is non-significant under a two-tailed test can become significant under a one-tailed test.
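The halving relationship is easy to verify numerically (it holds here because the observed t is positive, i.e., in the predicted direction). A sketch assuming SciPy is available:

```python
from scipy import stats

# Hypothetical result: t(48) = 2.10, effect in the predicted direction.
t_obs, df = 2.10, 48
p_two = 2 * stats.t.sf(abs(t_obs), df)   # area in both tails
p_one = stats.t.sf(t_obs, df)            # area in the predicted tail only
print(f"two-tailed p = {p_two:.3f}, one-tailed p = {p_one:.3f}")
```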
When to Use Each
Two-tailed tests are the default in most research, and for good reason:
- They are more conservative, reducing false positives.
- They do not require you to specify a direction before data collection.
- Most journals and reviewers expect two-tailed tests unless a strong justification is provided.
- They protect against unexpected effects in the opposite direction.
One-tailed tests are appropriate only when:
- There is a strong theoretical or empirical basis for predicting the direction of the effect before seeing the data.
- Effects in the opposite direction would be treated identically to null results (i.e., you genuinely do not care about the other direction).
- The directional hypothesis was pre-registered before data collection.
APA Reporting Conventions
APA 7th edition does not mandate one approach over the other, but it requires transparency. If you use a one-tailed test, state this explicitly in your method section and justify the directional prediction. Report the p-value as one-tailed.
Two-tailed example:
The treatment group scored significantly higher, t(48) = 2.15, p = .037 (two-tailed), d = 0.61.
One-tailed example:
As predicted, the treatment group scored significantly higher, t(48) = 2.15, p = .018 (one-tailed), d = 0.61.
Using a one-tailed test after seeing the data to convert a non-significant result into a significant one is considered methodologically inappropriate and is a form of p-hacking.
Multiple Comparisons and P-Value Adjustment
When you conduct a single hypothesis test at alpha = .05, you accept a 5% chance of a false positive. But what happens when you run 20 tests in the same study? The probability of at least one false positive rises dramatically, and this is the multiple comparisons problem.
The Familywise Error Rate Problem
If each test has a 5% false positive rate and the tests are independent, the probability of making at least one Type I error across k tests is:
Familywise error rate = 1 - (1 - 0.05)^k
For 20 independent tests: 1 - (0.95)^20 = .64. That means a 64% chance of at least one false positive even when all null hypotheses are true. This is why running many uncorrected tests and reporting only the significant ones is misleading.
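The inflation is worth seeing across several values of k. A minimal sketch of the formula above:

```python
# Familywise error rate for k independent tests, each at alpha = .05.
alpha = 0.05
for k in (1, 5, 20, 100):
    fwer = 1 - (1 - alpha) ** k
    print(f"k = {k:3d}: FWER = {fwer:.2f}")
```

By k = 20 the familywise error rate is already near two-thirds, and by k = 100 a false positive is a near certainty.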
Bonferroni Correction
The simplest and most widely known correction divides the per-test alpha by the number of comparisons:
Adjusted alpha = 0.05 / k
For 10 comparisons, each individual test would use alpha = .005. This strictly controls the familywise error rate but can be very conservative, especially with many tests, increasing the risk of missing real effects (Type II errors).
When to use Bonferroni:
- Small number of planned comparisons (3-10)
- When controlling the familywise error rate is critical
- Post-hoc pairwise comparisons in ANOVA
False Discovery Rate (Benjamini-Hochberg)
For studies with many simultaneous tests (e.g., genomics with thousands of genes), Bonferroni becomes impractically strict. The Benjamini-Hochberg (BH) procedure controls the false discovery rate (FDR), which is the expected proportion of false positives among all rejected hypotheses, rather than the probability of any false positive at all.
The BH procedure:
- Rank all p-values from smallest to largest.
- For each ranked p-value, calculate: (rank / total number of tests) * desired FDR (e.g., .05).
- Starting from the largest rank, find the first p-value that is less than or equal to its BH threshold. All p-values with smaller ranks are also considered significant.
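The three steps above can be sketched in pure Python (the p-values are hypothetical, chosen so that several would pass an uncorrected .05 threshold):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Pure-Python sketch of the BH step-up procedure.

    Returns a True/False significance flag per p-value, controlling the
    false discovery rate at level q.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    # Find the largest rank whose p-value clears its BH threshold rank/m * q.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            cutoff = rank
    # Reject every hypothesis at or below that rank.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= cutoff
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(pvals)
print(flags)  # only the two smallest p-values survive FDR control here
```

Note that five of these p-values fall below .05 uncorrected, but only two survive the BH procedure, illustrating how FDR control reins in multiplicity without Bonferroni's severity.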
FDR control is less conservative than Bonferroni and is now standard in high-dimensional research such as gene expression studies, neuroimaging, and large-scale survey analyses.
When to Apply Corrections vs When Not To
Not every situation with multiple tests requires a correction:
- Apply corrections when testing multiple hypotheses on the same dataset and the tests address the same research question (e.g., pairwise comparisons after ANOVA, testing multiple dependent variables).
- Do not apply corrections when the tests address genuinely independent research questions that happen to be in the same study. For example, testing the main effects and interaction in a factorial ANOVA does not require Bonferroni correction because each test addresses a distinct hypothesis.
- Pre-registration of specific planned comparisons can justify not applying corrections, provided the number of comparisons is small and theory-driven.
The key question is whether a false positive on one test would be interpreted in the context of the other tests. If so, correct. If not, correction may not be necessary.
Visualizing P-Values: What They Really Show
One of the best ways to build intuition about p-values is to think visually. The p-value is fundamentally about where your observed result falls in a distribution of possible results, and how much of that distribution lies at or beyond your observation.
The Sampling Distribution Concept
Before interpreting a p-value, you need to understand the sampling distribution. This is not the distribution of your raw data. It is the theoretical distribution of a test statistic (such as t, F, or chi-square) that you would obtain if you repeated the study infinitely many times when the null hypothesis is true.
For a t-test with 30 degrees of freedom, the sampling distribution of t under H0 is a bell-shaped curve centered at 0. Most values cluster near zero (indicating no difference), with values far from zero becoming increasingly rare.
Where the Observed Statistic Falls
Your actual study produces one test statistic, one point on this distribution. If the null hypothesis is true, you would expect this value to fall near the center. If it falls far into the tail, your data are inconsistent with H0.
Consider these scenarios for a two-tailed t-test:
- t = 0.5 falls well within the center of the distribution. This is an unremarkable result. The p-value is large.
- t = 2.0 falls in the outer portion of the distribution. Fewer than 5% of random samples would produce a t this extreme under H0. The p-value is small.
- t = 3.5 falls deep in the tail. This is an extremely unusual result under H0. The p-value is very small.
Area Under the Curve Equals the P-Value
The p-value is literally the shaded area under the sampling distribution curve at and beyond your observed test statistic. For a two-tailed test, the shaded area includes both tails.
This is why:
- A t value closer to zero gives a larger shaded area and a larger p-value.
- A t value farther from zero gives a smaller shaded area and a smaller p-value.
- The alpha level (.05) defines a critical boundary: test statistics beyond this boundary are in the rejection region.
Why Extreme Values Give Small P-Values
The tails of probability distributions contain very little area. In a standard normal distribution, only about 5% of the area lies beyond plus or minus 1.96, and only about 1% lies beyond plus or minus 2.58. Test statistics in these regions are rare under H0, which is precisely why they provide evidence against it.
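Those two tail areas can be checked with the standard library, since the two-tailed normal tail probability beyond z is erfc(z / sqrt(2)):

```python
from math import erfc, sqrt

# Two-tailed area beyond +/- z for a standard normal distribution.
tails = {z: erfc(z / sqrt(2)) for z in (1.96, 2.58)}
for z, area in tails.items():
    print(f"area beyond +/-{z}: {area:.3f}")
```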
This visual framework also explains why sample size matters. Larger samples produce sampling distributions with less spread (smaller standard errors), meaning even modest differences between groups push the test statistic into the tails. This is why large-sample studies can find statistical significance for trivially small effects.
Frequently Asked Questions
Does p < .05 mean there is a 95% chance my result is true?
No. This is one of the most common misunderstandings. The p-value is the probability of observing your data (or more extreme) if the null hypothesis is true. It does not tell you the probability that your hypothesis is correct. The probability of a hypothesis being true requires Bayesian analysis with prior probabilities, which is a fundamentally different framework.
What does p = .049 vs p = .051 really mean?
Practically, there is no meaningful difference. The .05 threshold is an arbitrary convention. A p-value of .051 does not mean "no effect" while .049 means "real effect." Both indicate similar levels of evidence against the null hypothesis. The ASA and many leading statisticians recommend treating p-values as continuous measures of evidence rather than as pass/fail cutoffs.
Can a p-value be exactly 0?
No. A p-value represents a probability and can never be exactly zero. When statistical software displays p = .000, it means the value is too small to display at the given decimal precision. In your manuscript, report it as p < .001. There is always some non-zero probability of observing the data under the null hypothesis, no matter how small.
Why do some journals require p < .01 instead of p < .05?
Stricter thresholds reduce false positive rates. Fields with significant replication concerns (such as social psychology) or disciplines where multiple testing is common (such as genomics) may adopt more conservative thresholds. Some researchers have proposed p < .005 as a new default for claims of new discoveries, arguing this would reduce false positive rates from approximately 33% to 5%.
Should I report exact p-values or just p < .05?
APA 7th edition requires exact p-values (e.g., p = .034) rather than inequality statements (p < .05). Exact values allow readers and meta-analysts to evaluate the strength of evidence for themselves. The only exception is for very small values, which should be reported as p < .001 rather than listing many decimal places.
What is the relationship between p-values and confidence intervals?
They are complementary. If a 95% confidence interval for a mean difference does not include zero, the corresponding two-tailed p-value will be less than .05. Conversely, if the CI includes zero, the p-value will be greater than .05. Confidence intervals provide additional information that p-values alone cannot: the estimated magnitude of the effect and the precision of that estimate.
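The duality can be demonstrated from summary statistics. A sketch assuming SciPy is available; the mean difference, standard error, and df below are hypothetical:

```python
from scipy import stats

# Hypothetical mean difference with its standard error and df.
diff, se, df = 2.5, 1.15, 58
t = diff / se
p = 2 * stats.t.sf(abs(t), df)              # two-tailed p-value
crit = stats.t.ppf(0.975, df)               # critical t for a 95% CI
ci = (diff - crit * se, diff + crit * se)
print(f"p = {p:.3f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
# The CI excludes zero exactly when the two-tailed p < .05.
```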
Can I compare p-values across different studies?
No. P-values depend on sample size, effect size, variability, and study design. A p = .001 from a study with 10,000 participants does not necessarily indicate a larger or more important effect than p = .04 from a study with 30 participants. To compare findings across studies, use effect sizes (such as Cohen's d or r) and consider meta-analytic techniques.
What should I do if my p-value is .06?
Report it honestly as non-significant at the .05 level. Discuss the effect size, confidence interval, and practical implications. Do not characterize the result as "marginally significant," "approaching significance," or "trending toward significance," as these phrases are widely viewed as euphemisms for non-significance and are considered a mild form of p-hacking. Instead, interpret the evidence as ambiguous and suggest that future research with greater statistical power may clarify the finding.
Try StatMate's Free Calculators
Every one of StatMate's 20 free calculators automatically computes p-values and formats them in APA 7th edition style. You do not need to look up formatting rules or worry about leading zeros, decimal places, or when to use p < .001. The output is ready to paste into your manuscript.
Here are a few calculators particularly relevant to the concepts in this guide:
- StatMate's free t-test calculator reports t, df, exact p, and Cohen's d in a single output.
- StatMate's free ANOVA calculator provides F, p, and both eta squared and partial eta squared.
- StatMate's free correlation calculator outputs r, p, and R squared together.
- StatMate's free chi-square calculator computes the chi-square statistic, exact p, and Cramer's V automatically.
- StatMate's free sample size calculator helps you plan studies with adequate power so your p-values are meaningful.
All results include both significance testing and effect sizes, so you never have to report one without the other.