
Coefficient of determination
Ordinary least squares regression of Okun's law. Since the regression line does not miss any of the points by very much, the R2 of the regression is relatively high.
Comparison of the Theil–Sen estimator (black) and simple linear regression (blue) for a set of points with outliers. Because of the many outliers, neither of the regression lines fits the data well, as measured by the fact that neither gives a very high R2.

In statistics, the coefficient of determination, denoted R2 or r2 and pronounced "R squared", is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).

It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.[1][2][3]

There are several definitions of R2 that are only sometimes equivalent. One class of such cases includes that of simple linear regression where r2 is used instead of R2. When an intercept is included, then r2 is simply the square of the sample correlation coefficient (i.e., r) between the observed outcomes and the observed predictor values.[4] If additional regressors are included, R2 is the square of the coefficient of multiple correlation. In both such cases, the coefficient of determination normally ranges from 0 to 1.

There are cases where R2 can yield negative values. This can arise when the predictions that are being compared to the corresponding outcomes have not been derived from a model-fitting procedure using those data. Even if a model-fitting procedure has been used, R2 may still be negative, for example when linear regression is conducted without including an intercept,[5] or when a non-linear function is used to fit the data.[6] In cases where negative values arise, the mean of the data provides a better fit to the outcomes than do the fitted function values, according to this particular criterion.
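
As an illustration of the first case, the following sketch (hypothetical data, NumPy only) fits a least-squares line forced through the origin to data with a large offset; measured against the usual total sum of squares about the mean, the resulting R2 is negative.

    # A minimal sketch (hypothetical data): forcing the regression through the
    # origin when the data have a large offset can yield a negative R2.
    import numpy as np

    rng = np.random.default_rng(3)
    x = np.linspace(0, 1, 40)
    y = 10.0 - 4.0 * x + rng.normal(0, 0.3, 40)    # offset data, negative slope

    # Least-squares slope with no intercept: minimizes sum((y - b*x)^2)
    b = np.dot(x, y) / np.dot(x, x)
    f = b * x

    r2 = 1 - np.sum((y - f) ** 2) / np.sum((y - y.mean()) ** 2)
    print(b, r2)                                   # R2 comes out strongly negative here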

The coefficient of determination can be more intuitively informative than MAE, MAPE, MSE, and RMSE in evaluating regression analyses, because it can be expressed as a percentage, whereas the latter measures have arbitrary ranges. It also proved more robust for poor fits than SMAPE on the test datasets examined in that article.[7]

When evaluating the goodness-of-fit of simulated (Ypred) vs. measured (Yobs) values, it is not appropriate to base this on the R2 of the linear regression (i.e., Yobs = m·Ypred + b).[citation needed] The R2 quantifies the degree of any linear correlation between Yobs and Ypred, while for the goodness-of-fit evaluation only one specific linear correlation should be taken into consideration: Yobs = 1·Ypred + 0 (i.e., the 1:1 line).[8][9]
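
The distinction can be made concrete with a short sketch (hypothetical simulated and "measured" values, NumPy only): the R2 of a fitted regression of Yobs on Ypred stays high despite a systematic bias, while the R2 computed about the 1:1 line penalizes that bias.

    # Contrast the R2 of a fitted regression of Yobs on Ypred with the R2
    # computed about the 1:1 line (hypothetical, biased "measurements").
    import numpy as np

    rng = np.random.default_rng(0)
    y_pred = np.linspace(0.0, 10.0, 50)                    # simulated values
    y_obs = 2.0 * y_pred + 1.0 + rng.normal(0, 0.5, 50)    # systematically biased observations

    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)

    # R2 of the regression Yobs = m*Ypred + b (rewards any linear correlation)
    m, b = np.polyfit(y_pred, y_obs, 1)
    r2_fit = 1 - np.sum((y_obs - (m * y_pred + b)) ** 2) / ss_tot   # close to 1

    # R2 about the 1:1 line Yobs = 1*Ypred + 0 (penalizes the bias)
    r2_identity = 1 - np.sum((y_obs - y_pred) ** 2) / ss_tot        # much lower, negative here

    print(r2_fit, r2_identity)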

Definitions


The better the linear regression (on the right) fits the data in comparison to the simple average (on the left graph), the closer the value of R2 is to 1. The areas of the blue squares represent the squared residuals with respect to the linear regression. The areas of the red squares represent the squared residuals with respect to the average value.

A data set has n values marked y1, ..., yn (collectively known as yi or as a vector y = [y1, ..., yn]^T), each associated with a fitted (or modeled, or predicted) value f1, ..., fn (known as fi, or sometimes ŷi, as a vector f).

Define the residuals as ei = yi − fi (forming a vector e).

If ȳ is the mean of the observed data:

\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i

then the variability of the data set can be measured with two sums of squares formulas:

  • The sum of squares of residuals, also called the residual sum of squares: SS_\text{res} = \sum_i (y_i - f_i)^2 = \sum_i e_i^2
  • The total sum of squares (proportional to the variance of the data): SS_\text{tot} = \sum_i (y_i - \bar{y})^2

The most general definition of the coefficient of determination is

R^2 = 1 - \frac{SS_\text{res}}{SS_\text{tot}}

In the best case, the modeled values exactly match the observed values, which results in SS_res = 0 and R2 = 1. A baseline model, which always predicts ȳ, will have R2 = 0. Models that have worse predictions than this baseline will have a negative R2.
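
A minimal sketch of this definition (hypothetical data, NumPy only; the helper r_squared is illustrative, not a library function):

    # Compute R2 = 1 - SS_res / SS_tot for a small hypothetical data set.
    import numpy as np

    def r_squared(y, f):
        """Coefficient of determination of predictions f for observations y."""
        y, f = np.asarray(y, float), np.asarray(f, float)
        ss_res = np.sum((y - f) ** 2)          # residual sum of squares
        ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
        return 1.0 - ss_res / ss_tot

    y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

    print(r_squared(y, y))                          # perfect predictions -> 1.0
    print(r_squared(y, np.full_like(y, y.mean())))  # baseline (mean) -> 0.0
    print(r_squared(y, y[::-1]))                    # worse than baseline -> negative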

Relation to unexplained variance

In a general form, R2 can be seen to be related to the fraction of variance unexplained (FVU), since the second term compares the unexplained variance (variance of the model's errors) with the total variance (of the data):

R^2 = 1 - \frac{SS_\text{res}}{SS_\text{tot}} = 1 - \mathrm{FVU}

As explained variance

A larger value of R2 implies a more successful regression model.[4]: 463  Suppose R2 = 0.49. This implies that 49% of the variability of the dependent variable in the data set has been accounted for, and the remaining 51% of the variability is still unaccounted for. For regression models, the regression sum of squares, also called the explained sum of squares, is defined as

SS_\text{reg} = \sum_i (f_i - \bar{y})^2

In some cases, as in simple linear regression, the total sum of squares equals the sum of the two other sums of squares defined above:

SS_\text{res} + SS_\text{reg} = SS_\text{tot}

See Partitioning in the general OLS model for a derivation of this result for one case where the relation holds. When this relation does hold, the above definition of R2 is equivalent to

R^2 = \frac{SS_\text{reg}}{SS_\text{tot}} = \frac{SS_\text{reg}/n}{SS_\text{tot}/n}

where n is the number of observations (cases) on the variables.

In this form R2 is expressed as the ratio of the explained variance (variance of the model's predictions, which is SS_reg / n) to the total variance (sample variance of the dependent variable, which is SS_tot / n).
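
The partition and the equivalence of the two expressions can be checked numerically with a short sketch (hypothetical data, NumPy only), assuming an ordinary least squares fit with an intercept:

    # Check that SS_tot = SS_reg + SS_res for OLS with an intercept,
    # so that 1 - SS_res/SS_tot equals SS_reg/SS_tot.
    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 1, 30)
    y = 3.0 * x + 0.5 + rng.normal(0, 0.2, 30)

    slope, intercept = np.polyfit(x, y, 1)       # OLS fit with intercept
    f = slope * x + intercept

    ss_res = np.sum((y - f) ** 2)
    ss_reg = np.sum((f - y.mean()) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)

    print(np.isclose(ss_tot, ss_reg + ss_res))   # True: the partition holds
    print(1 - ss_res / ss_tot, ss_reg / ss_tot)  # the two R2 expressions agree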

This partition of the sum of squares holds for instance when the model values fi have been obtained by linear regression. A milder sufficient condition reads as follows: The model has the form

f_i = \hat{\alpha} + \hat{\beta} q_i

where the qi are arbitrary values that may or may not depend on i or on other free parameters (the common choice qi = xi is just one special case), and the coefficient estimates α̂ and β̂ are obtained by minimizing the residual sum of squares.

This set of conditions is an important one and it has a number of implications for the properties of the fitted residuals and the modelled values. In particular, under these conditions:

\bar{f} = \bar{y}

As squared correlation coefficient

In linear least squares multiple regression with an estimated intercept term, R2 equals the square of the Pearson correlation coefficient between the observed and modeled (predicted) data values of the dependent variable.

In a linear least squares regression with an intercept term and a single explanator, this is also equal to the squared Pearson correlation coefficient of the dependent variable y and the explanatory variable x.
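
Both statements can be checked numerically with a small sketch (hypothetical data, NumPy only), assuming an ordinary least squares fit of a single explanator with an intercept:

    # With an intercept, R2 equals corr(y, f)^2, and with a single explanator x
    # it also equals corr(y, x)^2.
    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=100)
    y = 1.5 * x - 2.0 + rng.normal(0, 1.0, 100)

    slope, intercept = np.polyfit(x, y, 1)
    f = slope * x + intercept

    r2 = 1 - np.sum((y - f) ** 2) / np.sum((y - y.mean()) ** 2)
    r_yf = np.corrcoef(y, f)[0, 1]               # Pearson correlation of y and f
    r_yx = np.corrcoef(y, x)[0, 1]               # Pearson correlation of y and x

    print(np.isclose(r2, r_yf ** 2))             # True
    print(np.isclose(r2, r_yx ** 2))             # True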

It should not be confused with the correlation coefficient between two explanatory variables, defined as

r_{\hat{\alpha},\hat{\beta}} = \frac{\operatorname{cov}(\hat{\alpha},\hat{\beta})}{\sqrt{\operatorname{var}(\hat{\alpha})\,\operatorname{var}(\hat{\beta})}}






