Fit a linear model to your data. Results include R², F-test, regression coefficients, scatter plot, and APA-formatted output.
Simple linear regression is a statistical method used to model the relationship between a single independent variable (X) and a dependent variable (Y) by fitting a straight line to the observed data. The regression equation takes the form ŷ = b₀ + b₁x, where b₀ is the y-intercept and b₁ is the slope of the regression line. This method estimates the parameters using ordinary least squares (OLS), which minimizes the sum of squared differences between observed and predicted values.
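To make the OLS calculation concrete, here is a minimal sketch in Python (not part of StatMate itself) that estimates b₀ and b₁ from a small, made-up dataset using the closed-form least-squares formulas:

```python
import numpy as np

# Hypothetical example data: predictor x and outcome y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.1])

# OLS estimates from the closed-form formulas:
# b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²,  b0 = ȳ - b1·x̄
x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

print(f"ŷ = {b0:.3f} + {b1:.3f}x")
```

Any OLS routine produces the same estimates; the closed-form version just makes the "minimize the squared differences" idea explicit.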
Regression analysis was pioneered by Sir Francis Galton in the 1880s during his studies of hereditary stature, where he observed that children's heights tended to "regress" toward the population mean. The mathematical framework was later formalized by Karl Pearson and Ronald Fisher, who developed the inferential statistics (F-test, t-tests for coefficients) used in modern regression analysis. Today, simple linear regression is one of the most fundamental tools in statistics, serving as the foundation for multiple regression, ANOVA, and many machine learning algorithms.
Slope (b₁)
The slope represents the expected change in Y for a one-unit increase in X. A positive slope indicates a positive relationship (as X increases, Y increases), while a negative slope indicates an inverse relationship. The slope is tested for significance using a t-test with n - 2 degrees of freedom.
Intercept (b₀)
The intercept is the predicted value of Y when X equals zero. In many practical situations, X = 0 may not be meaningful (e.g., predicting weight from height), so the intercept should be interpreted cautiously. Its primary role is to position the regression line correctly.
Standard Error of the Estimate
The standard error of the estimate (SEE) measures the average distance between observed values and the regression line. Smaller values indicate that the data points cluster more tightly around the line, suggesting better prediction accuracy.
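These quantities are tightly linked: the SEE is computed from the residuals, and the slope's t-test divides b₁ by a standard error built from the SEE. A rough Python sketch, using made-up data purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.9, 4.2, 5.8, 8.1, 9.9, 12.2])
n = len(x)

# Fit the line, then compute residual-based quantities
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

# Standard error of the estimate: typical distance of points from the line
see = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Standard error and t-test of the slope, with n - 2 degrees of freedom
se_b1 = see / np.sqrt(np.sum((x - x.mean()) ** 2))
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"SEE = {see:.3f}, t({n - 2}) = {t_stat:.2f}, p = {p_value:.4f}")
```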
R² represents the proportion of variance in the dependent variable that is explained by the independent variable. It ranges from 0 to 1, where 0 means the model explains none of the variability and 1 means it explains all of the variability. Adjusted R² accounts for the number of predictors and is particularly useful when comparing models.
| R² Value | Interpretation | Practical Meaning |
|---|---|---|
| < 0.10 | Very Weak | Model explains very little variance; X is a poor predictor |
| 0.10 – 0.30 | Weak | Small but potentially meaningful predictive power |
| 0.30 – 0.50 | Moderate | Meaningful prediction; useful for many social science applications |
| 0.50 – 0.70 | Strong | Substantial predictive accuracy; good model fit |
| > 0.70 | Very Strong | Excellent model fit; X is a strong predictor of Y |
Note: These thresholds are general guidelines. In fields like physics or engineering, R² values above 0.90 are common. In psychology and social sciences, R² values of 0.20–0.40 are often considered meaningful.
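For reference, R² and adjusted R² come directly from the residual and total sums of squares. A short sketch of the formulas with hypothetical observed and fitted values (this is the textbook calculation, not StatMate's actual code):

```python
import numpy as np

# Hypothetical observed values and fitted values from some simple regression
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9])
y_hat = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
n, k = len(y), 1  # k = number of predictors (1 for simple regression)

ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
r_squared = 1 - ss_res / ss_tot

# Adjusted R² penalizes for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(f"R² = {r_squared:.3f}, adjusted R² = {adj_r_squared:.3f}")
```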
A researcher examines whether the number of hours spent studying predicts exam performance in a sample of 10 university students.
| Study Hours (X) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Exam Score (Y) | 2.1 | 4.0 | 5.8 | 8.2 | 9.8 | 12.1 | 14.0 | 15.9 | 18.2 | 19.8 |
Results
F(1, 8) = 11638.00, p < .001, R² = .999
ŷ = 0.03 + 1.99x
The model is statistically significant and explains 99.9% of the variance in exam scores. For each additional hour of study, the predicted exam score increases by approximately 1.99 points.
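If you want to reproduce this example outside StatMate, SciPy's linregress returns the same coefficients and fit statistics (the values in the comments are approximate):

```python
from scipy import stats

# The study-hours example from above
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [2.1, 4.0, 5.8, 8.2, 9.8, 12.1, 14.0, 15.9, 18.2, 19.8]

result = stats.linregress(hours, scores)

print(f"slope     = {result.slope:.2f}")        # ≈ 1.99
print(f"intercept = {result.intercept:.2f}")    # ≈ 0.03
print(f"R²        = {result.rvalue ** 2:.3f}")  # ≈ 0.999
print(f"p-value   = {result.pvalue:.2e}")       # well below .001
```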
Before interpreting your regression results, verify that these assumptions are met. Violating assumptions can lead to biased estimates, incorrect standard errors, and invalid inference.
1. Linearity
The relationship between X and Y must be linear. Inspect a scatter plot of the data. If the relationship is curved (e.g., quadratic, logarithmic), consider transforming your variables or using polynomial regression. A residual plot showing a random scatter around zero supports linearity.
2. Independence of Errors
The residuals (errors) must be independent of each other. This is especially important with time-series data, where successive observations may be correlated (autocorrelation). The Durbin-Watson test can detect autocorrelation. Values near 2 indicate no autocorrelation.
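The Durbin-Watson statistic itself is simple to compute from the residuals, taken in time order; a small sketch with made-up residuals:

```python
import numpy as np

# Hypothetical residuals from a fitted regression, in time order
residuals = np.array([0.3, -0.1, 0.2, -0.4, 0.1, 0.0, -0.2, 0.3])

# Durbin-Watson statistic: ranges roughly from 0 to 4; values near 2
# indicate little or no autocorrelation between successive residuals
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(f"Durbin-Watson = {dw:.2f}")
```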
3. Normality of Residuals
The residuals should be approximately normally distributed. This assumption is important for hypothesis testing and confidence intervals. Check normality using a Q-Q plot or the Shapiro-Wilk test. With large samples (n > 30), the Central Limit Theorem makes regression robust to mild non-normality.
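For example, a Shapiro-Wilk check on the residuals takes one line with SciPy (the residuals below are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical residuals from a fitted regression
residuals = np.array([0.12, -0.08, 0.21, -0.15, 0.03, -0.19, 0.10, -0.04])

# Shapiro-Wilk: a small p-value (e.g. < .05) suggests non-normal residuals
w_stat, p_value = stats.shapiro(residuals)
print(f"W = {w_stat:.3f}, p = {p_value:.3f}")
```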
4. Homoscedasticity (Constant Variance)
The variance of residuals should be approximately constant across all levels of X. In a residual vs. fitted values plot, the spread of residuals should remain roughly the same. If the spread fans out (heteroscedasticity), consider using weighted least squares or robust standard errors.
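A residual-versus-fitted plot is the quickest visual check for both this assumption and the linearity assumption above; a minimal matplotlib sketch with hypothetical values:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical fitted values and residuals from a simple regression
fitted = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0])
residuals = np.array([0.2, -0.3, 0.1, 0.4, -0.2, -0.1, 0.3, -0.4])

# A random, even band around zero supports linearity and homoscedasticity;
# a curve suggests non-linearity, a fan shape suggests heteroscedasticity.
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted values")
plt.show()
```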
According to APA 7th edition guidelines, regression results should include the F-statistic with degrees of freedom, the p-value, R², the regression equation, and individual coefficient statistics. Here is a template you can adapt:
Simple Linear Regression
A simple linear regression was conducted to predict exam scores from study hours. The model was statistically significant, F(1, 8) = 11638.00, p < .001, R² = .999. Study hours significantly predicted exam scores, b = 1.99, t(8) = 107.88, p < .001, 95% CI [1.95, 2.04]. For each additional hour of study, exam scores increased by an average of 1.99 points.
Non-significant Result
A simple linear regression was conducted to predict happiness scores from daily screen time. The model was not statistically significant, F(1, 48) = 1.23, p = .274, R² = .025. Screen time did not significantly predict happiness scores, b = -0.15, t(48) = -1.11, p = .274, 95% CI [-0.42, 0.12].
Note: Report regression coefficients, t-values, and F-values to two decimal places. Report p-values to three decimal places, except use p < .001 when the value is below .001. Always include R² and the 95% confidence interval for key coefficients.
| Situation | Recommended Test |
|---|---|
| One predictor, one continuous outcome | Simple linear regression |
| Multiple predictors, one continuous outcome | Multiple linear regression |
| Relationship strength only (no prediction) | Pearson / Spearman correlation |
| Binary outcome variable | Logistic regression |
| Non-linear relationship | Polynomial regression or data transformation |
| Comparing group means (categorical predictor) | T-test or ANOVA |
StatMate's regression calculations have been validated against R's lm() and summary.lm() functions. We compute the OLS regression using the standard normal equations and derive F-statistics, t-statistics, and confidence intervals using the jstat library for probability distributions. All results match R output to at least 4 decimal places.
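StatMate itself runs in JavaScript, but the same kind of cross-check can be sketched in Python: fit the model by hand with the closed-form OLS formulas and confirm that the coefficients and p-value agree with a reference implementation (the data below are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical data used to cross-check a hand-rolled OLS against a reference
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])
n = len(x)

# Hand-rolled OLS fit and F-test
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
ss_res = np.sum((y - (b0 + b1 * x)) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
f_stat = (ss_tot - ss_res) / (ss_res / (n - 2))
p_value = stats.f.sf(f_stat, 1, n - 2)

# Reference implementation
ref = stats.linregress(x, y)

assert np.isclose(b1, ref.slope) and np.isclose(b0, ref.intercept)
assert np.isclose(p_value, ref.pvalue)
print("Hand-rolled OLS matches the reference implementation")
```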
T-Test
Compare means between two groups
ANOVA
Compare means across 3+ groups
Chi-Square
Test categorical associations
Correlation
Measure relationship strength
Descriptive
Summarize your data
Sample Size
Power analysis & sample planning
One-Sample T
Test against a known value
Mann-Whitney U
Non-parametric group comparison
Wilcoxon
Non-parametric paired test
Multiple Regression
Multiple predictors
Cronbach's Alpha
Scale reliability
Logistic Regression
Binary outcome prediction
Factor Analysis
Explore latent factor structure
Kruskal-Wallis
Non-parametric 3+ group comparison
Repeated Measures
Within-subjects ANOVA
Two-Way ANOVA
Factorial design analysis
Friedman Test
Non-parametric repeated measures
Fisher's Exact
Exact test for 2×2 tables
McNemar Test
Paired nominal data test