When should I use regression analysis?

Use it to predict or explain one variable (the dependent variable) from another (the independent variable). For example, you might analyze how advertising spend affects sales.

How do I interpret R² (coefficient of determination)?

R² represents the proportion of variance in the dependent variable explained by the independent variable. R² = 0.65 means 65% of the variation is explained by the model.

What's the difference between regression and correlation?

Correlation measures only the strength and direction of the linear relationship between two variables. Regression goes further and provides a prediction equation (Y = a + bX) for estimating the dependent variable from the independent variable.
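To make the distinction concrete, here is a minimal sketch that computes both quantities from the same sums of squares. The function name and the five data points are hypothetical, chosen only for illustration.

```python
import math


def correlation_and_line(x, y):
    """Return (r, a, b): Pearson's r plus the line Y = a + bX."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)  # unit-free and symmetric in X and Y
    b = sxy / sxx                   # change in Y per one-unit change in X
    a = mean_y - b * mean_x         # value of the line at X = 0
    return r, a, b


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r, a, b = correlation_and_line(x, y)
print(f"r = {r:.3f}, Y = {a:.2f} + {b:.2f}X")
```

Rescaling Y (say, multiplying every value by 10) leaves r unchanged but multiplies the slope by 10, which is why only the regression equation can be used for prediction in the original units.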

What are residuals?

Residuals are the differences between the observed values and the values predicted by the regression model. If the residuals are approximately normally distributed, have constant variance, and show no systematic pattern, the model's assumptions are reasonably met.

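Simple Linear Regression Calculator

Fit a linear model to your data. Results include R², the F-test, regression coefficients, a scatter plot, and APA-formatted output.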

What is Simple Linear Regression?

Simple linear regression is a statistical method used to model the relationship between a single independent variable (X) and a dependent variable (Y) by fitting a straight line to the observed data. The regression equation takes the form ŷ = b₀ + b₁x, where b₀ is the y-intercept and b₁ is the slope of the regression line. This method estimates the parameters using ordinary least squares (OLS), which minimizes the sum of squared differences between observed and predicted values.
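The OLS estimates described above have a closed form for a single predictor. The following is a minimal sketch, not the calculator's own code. The name fit_line is illustrative, and the sample points are chosen to lie exactly on y = 1 + 2x.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of X, and sum of cross-products of X and Y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    b1 = sxy / sxx              # slope that minimizes the squared errors
    b0 = mean_y - b1 * mean_x   # the fitted line passes through the two means
    return b0, b1


b0, b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)
```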

Regression analysis was pioneered by Sir Francis Galton in the 1880s during his studies of hereditary stature, where he observed that children's heights tended to "regress" toward the population mean. The mathematical framework was later formalized by Karl Pearson and Ronald Fisher, who developed the inferential statistics (F-test, t-tests for coefficients) used in modern regression analysis. Today, simple linear regression is one of the most fundamental tools in statistics, serving as the foundation for multiple regression, ANOVA, and many machine learning algorithms.

Key Concepts in Linear Regression

Slope (b₁)

The slope represents the expected change in Y for a one-unit increase in X. A positive slope indicates a positive relationship (as X increases, Y increases), while a negative slope indicates an inverse relationship. The slope is tested for significance using a t-test with n - 2 degrees of freedom.
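As a rough sketch of how the t statistic for the slope is formed, the function below divides the slope by its standard error and reports n - 2 degrees of freedom. The function name and data are hypothetical. Turning t into a p-value needs a t-distribution CDF from a statistics library, which is left out here.

```python
import math


def slope_t_statistic(x, y):
    """Return (b1, se_b1, t, df) for the slope of a simple linear regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual sum of squares around the fitted line.
    ss_resid = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    df = n - 2                      # two parameters (b0, b1) were estimated
    se_b1 = math.sqrt(ss_resid / df / sxx)
    return b1, se_b1, b1 / se_b1, df


# Hypothetical data, for illustration only.
b1, se_b1, t, df = slope_t_statistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"b1 = {b1:.2f}, SE = {se_b1:.3f}, t({df}) = {t:.2f}")
```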

Intercept (b₀)

The intercept is the predicted value of Y when X equals zero. In many practical situations, X = 0 may not be meaningful (e.g., predicting weight from height), so the intercept should be interpreted cautiously. Its primary role is to position the regression line correctly.

Standard Error of the Estimate

The standard error of the estimate (SEE) measures the typical distance between the observed values and the regression line. It is the square root of the residual sum of squares divided by n - 2. Smaller values indicate that the data points cluster more tightly around the line, suggesting better prediction accuracy.
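A minimal sketch of the SEE calculation, assuming you already have the observed values and the model's predictions. The function name and the numbers are hypothetical.

```python
import math


def standard_error_of_estimate(observed, predicted):
    """Square root of the residual sum of squares divided by n - 2."""
    n = len(observed)
    if n <= 2:
        raise ValueError("need more than two observations")
    ss_resid = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(ss_resid / (n - 2))


# Hypothetical observed values and predictions from a fitted line.
observed = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]
print(round(standard_error_of_estimate(observed, predicted), 3))
```

The result is in the same units as Y, so it can be read directly as a typical prediction error.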

Understanding R² (Coefficient of Determination)

R² represents the proportion of variance in the dependent variable that is explained by the independent variable. It ranges from 0 to 1, where 0 means the model explains none of the variability and 1 means it explains all of the variability. Adjusted R² accounts for the number of predictors and is particularly useful when comparing models.

R² Value	Interpretation	Practical Meaning
< 0.10	Very Weak	Model explains very little variance; X is a poor predictor
0.10 – 0.30	Weak	Small but potentially meaningful predictive power
0.30 – 0.50	Moderate	Meaningful prediction; useful for many social science applications
0.50 – 0.70	Strong	Substantial predictive accuracy; good model fit
> 0.70	Very Strong	Excellent model fit; X is a strong predictor of Y

Note: These thresholds are general guidelines. In fields like physics or engineering, R² values above 0.90 are common. In psychology and social sciences, R² values of 0.20–0.40 are often considered meaningful.
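The definitions above translate into a few lines of code. This is a sketch under the usual definitions (R² = 1 - SS_residual / SS_total, with adjusted R² penalizing the number of predictors). The function name and data are hypothetical.

```python
def r_squared(observed, predicted, n_predictors=1):
    """Return (R², adjusted R²) for a fitted model."""
    n = len(observed)
    mean_y = sum(observed) / n
    ss_total = sum((o - mean_y) ** 2 for o in observed)
    ss_resid = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    r2 = 1 - ss_resid / ss_total
    # Adjusted R² shrinks R² according to the degrees of freedom used.
    adjusted = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adjusted


# Hypothetical observed values and predictions from a fitted line.
r2, adj = r_squared([2, 4, 5, 4, 5], [2.8, 3.4, 4.0, 4.6, 5.2])
print(f"R² = {r2:.3f}, adjusted R² = {adj:.3f}")
```

With a single predictor and a small sample, adjusted R² can be noticeably lower than R², which is a useful reminder not to over-read a fit based on few observations.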

Worked Example: Study Hours Predicting Exam Score

A researcher examines whether the number of hours spent studying predicts exam performance in a sample of 10 university students.

Study Hours (X)

1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Exam Score (Y)

2.1, 4.0, 5.8, 8.2, 9.8, 12.1, 14.0, 15.9, 18.2, 19.8

Results

F(1, 8) = 11638.00, p < .001, R² = .999

ŷ = 0.03 + 1.99x

The model is statistically significant and explains 99.9% of the variance in exam scores. For each additional hour of study, the predicted exam score increases by approximately 1.99 points.
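If you want to check the example by hand, the sketch below recomputes the coefficients, R², and F directly from the ten data points listed above. It uses only the textbook formulas and is not the calculator's own implementation. The variable names are illustrative.

```python
# Data copied from the worked example above.
hours = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
scores = [2.1, 4.0, 5.8, 8.2, 9.8, 12.1, 14.0, 15.9, 18.2, 19.8]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n

# Least-squares slope and intercept.
sxx = sum((x - mean_x) ** 2 for x in hours)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Partition the total variation into model and residual parts.
ss_total = sum((y - mean_y) ** 2 for y in scores)
ss_resid = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(hours, scores))
ss_model = ss_total - ss_resid

r2 = ss_model / ss_total
f_stat = (ss_model / 1) / (ss_resid / (n - 2))  # df = 1 and n - 2

print(f"y-hat = {b0:.2f} + {b1:.2f}x")
print(f"R^2 = {r2:.3f}, F(1, {n - 2}) = {f_stat:.2f}")
```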

Assumptions of Simple Linear Regression

Before interpreting your regression results, verify that these assumptions are met. Violating assumptions can lead to biased estimates, incorrect standard errors, and invalid inference.

1. Linearity

The relationship between X and Y must be linear. Inspect a scatter plot of the data. If the relationship is curved (e.g., quadratic, logarithmic), consider transforming your variables or using polynomial regression. A residual plot showing a random scatter around zero supports linearity.

2. Independence of Errors

The residuals (errors) must be independent of each other. This is especially important with time-series data, where successive observations may be correlated (autocorrelation). The Durbin-Watson test can detect autocorrelation. Values near 2 indicate no autocorrelation.
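The Durbin-Watson statistic itself is simple to compute from the residuals in their original (for example, time) order. This is a minimal sketch with a hypothetical function name and made-up residuals.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals listed in observation order.
print(durbin_watson([0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2]))
```

The statistic ranges from 0 to 4. Values well below 2 point toward positive autocorrelation, and values well above 2 point toward negative autocorrelation.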

3. Normality of Residuals

The residuals should be approximately normally distributed. This assumption matters for hypothesis tests and confidence intervals. Check normality using a Q-Q plot or the Shapiro-Wilk test. With large samples (roughly n > 30), the Central Limit Theorem makes inference on the coefficients robust to mild non-normality.
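A Q-Q plot pairs each sorted residual with the quantile a standard normal distribution would give at the same rank. The sketch below builds those pairs with Python's standard library. The function name is illustrative, and the plotting position (i - 0.5) / n is one common convention among several.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical quantile, standardized residual) pairs for a Q-Q plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    centre, spread = mean(ordered), stdev(ordered)
    standard_normal = NormalDist()
    points = []
    for rank, value in enumerate(ordered, start=1):
        # Quantile a standard normal would give at this rank.
        theoretical = standard_normal.inv_cdf((rank - 0.5) / n)
        points.append((theoretical, (value - centre) / spread))
    return points


# Hypothetical residuals, for illustration only.
for theoretical, observed in qq_points([-0.8, 0.6, 1.0, -0.6, -0.2]):
    print(f"{theoretical:6.3f}  {observed:6.3f}")
```

If the residuals are roughly normal, the pairs fall close to the line y = x when plotted.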

4. Homoscedasticity (Constant Variance)

The variance of residuals should be approximately constant across all levels of X. In a residual vs. fitted values plot, the spread of residuals should remain roughly the same. If the spread fans out (heteroscedasticity), consider using weighted least squares or robust standard errors.
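Alongside the visual check, a crude numeric comparison can flag a fanning pattern. The sketch below is an informal diagnostic, not a formal test such as Breusch-Pagan: it compares the mean squared residual in the upper half of the fitted values with that in the lower half. The function name and data are hypothetical.

```python
def residual_spread_ratio(fitted, residuals):
    """Mean squared residual for the upper half of fitted values
    divided by the mean squared residual for the lower half."""
    pairs = sorted(zip(fitted, residuals))
    half = len(pairs) // 2
    if half == 0:
        raise ValueError("need at least two observations")
    lower = [e for _, e in pairs[:half]]
    upper = [e for _, e in pairs[-half:]]  # middle point is dropped when n is odd
    ms_lower = sum(e * e for e in lower) / half
    ms_upper = sum(e * e for e in upper) / half
    if ms_lower == 0:
        return float("inf")
    return ms_upper / ms_lower


# Hypothetical fitted values and residuals whose spread grows with the fit.
ratio = residual_spread_ratio([1, 2, 3, 4, 5, 6], [0.1, -0.2, 0.3, -0.6, 0.9, -1.2])
print(round(ratio, 1))
```

A ratio near 1 is consistent with constant variance. A ratio far from 1 suggests the spread changes across the fitted values and that the residual plot deserves a closer look.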

How to Report Regression Results in APA Format

According to APA 7th edition guidelines, regression results should include the F-statistic with degrees of freedom, the p-value, R², the regression equation, and individual coefficient statistics. Here is a template you can adapt:

Simple Linear Regression

A simple linear regression was conducted to predict exam scores from study hours. The model was statistically significant, F(1, 8) = 11638.00, p < .001, R² = .999. Study hours significantly predicted exam scores, b = 1.99, t(8) = 107.88, p < .001, 95% CI [1.95, 2.04]. For each additional hour of study, exam scores increased by an average of 1.99 points.

Non-significant Result

A simple linear regression was conducted to predict happiness scores from daily screen time. The model was not statistically significant, F(1, 48) = 1.23, p = .274, R² = .025. Screen time did not significantly predict happiness scores, b = -0.15, t(48) = -1.11, p = .274, 95% CI [-0.42, 0.12].

Note: Report regression coefficients, t-values, and F-values to two decimal places. Report p-values to three decimal places, except use p < .001 when the value is below .001. Always include R² and the 95% confidence interval for key coefficients.
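The formatting rules in the note above are easy to automate. The sketch below is one possible helper, with hypothetical function names. It drops the leading zero from p and R² and switches to p < .001 for very small p-values.

```python
def format_p(p):
    """Format a p-value APA-style: three decimals, no leading zero."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")


def apa_regression(f_stat, df1, df2, p, r2):
    """Return the model summary in the style used in the templates above."""
    r2_text = f"{r2:.3f}".lstrip("0")
    return f"F({df1}, {df2}) = {f_stat:.2f}, {format_p(p)}, R² = {r2_text}"


# Values taken from the non-significant template above.
print(apa_regression(1.23, 1, 48, 0.274, 0.025))
# F(1, 48) = 1.23, p = .274, R² = .025
```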

When to Use Simple Linear Regression vs. Other Tests

Situation	Recommended Test
One predictor, one continuous outcome	Simple linear regression
Multiple predictors, one continuous outcome	Multiple linear regression
Relationship strength only (no prediction)	Pearson / Spearman correlation
Binary outcome variable	Logistic regression
Non-linear relationship	Polynomial regression or data transformation
Comparing group means (categorical predictor)	t-test or ANOVA

Common Mistakes to Avoid

Extrapolating beyond the data range: The regression equation is only valid within the range of observed X values. Predicting Y for X values far outside this range (extrapolation) can produce unreliable and misleading results.
Ignoring assumptions: Regression results are only trustworthy when the assumptions of linearity, independence, normality, and homoscedasticity are met. Always check residual plots before interpreting your model.
Confusing correlation with causation: A significant regression does not prove that X causes Y. Use caution with causal language and consider confounding variables. Randomized experiments provide the strongest basis for causal claims.
Over-interpreting R²: A high R² does not necessarily mean the model is correct or useful. The relationship could still be non-linear, or the model could be driven by outliers. Conversely, a low R² does not mean X is unimportant.
Reporting p = .000: Statistical software sometimes displays p = .000. Always report this as p < .001. A p-value is never exactly zero.
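The first mistake in this list, extrapolation, can be guarded against in code. The sketch below is one hypothetical way to do it: the prediction function warns whenever the new X value lies outside the range the model was fitted on. The function and parameter names are illustrative.

```python
import warnings


def predict(b0, b1, x_new, x_min, x_max):
    """Predict Y from the fitted line, warning when x_new is outside [x_min, x_max]."""
    if not (x_min <= x_new <= x_max):
        warnings.warn(
            f"x = {x_new} is outside the observed range [{x_min}, {x_max}]; "
            "the prediction is an extrapolation and may be unreliable"
        )
    return b0 + b1 * x_new


# Hypothetical coefficients and observed X range, for illustration only.
print(predict(0.5, 2.0, 4, x_min=1, x_max=10))
```

The function still returns a value for out-of-range inputs, leaving the decision to the caller. Raising an error instead would be a stricter alternative.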

Calculation Accuracy

StatMate's regression calculations have been validated against R's lm() and summary.lm() functions. We compute the OLS regression using the standard normal equations and derive F-statistics, t-statistics, and confidence intervals using the jstat library for probability distributions. All results match R output to at least 4 decimal places.
