Title: | Collinearity Detection in a Multiple Linear Regression Model |
---|---|
Description: | The detection of worrying approximate collinearity in a multiple linear regression model is a problem addressed in all existing statistical packages. However, we have detected deficits regarding to the incorrect treatment of qualitative independent variables and the role of the intercept of the model. The objective of this package is to correct these deficits. In this package will be available detection and treatment techniques traditionally used as the recently developed. D.A. Belsley (1982) <doi:10.1016/0304-4076(82)90020-3>. D. A. Belsley (1991, ISBN: 978-0471528890). C. Garcia, R. Salmeron and C.B. Garcia (2019) <doi:10.1080/00949655.2018.1543423>. R. Salmeron, C.B. Garcia and J. Garcia (2018) <doi:10.1080/00949655.2018.1463376>. G.W. Stewart (1987) <doi:10.1214/ss/1177013444>. |
Authors: | R. Salmeron, C.B. Garcia and J. Garcia |
Maintainer: | R. Salmeron <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.0 |
Built: | 2024-10-31 21:08:37 UTC |
Source: | https://github.com/r-forge/colldetreat |
R package to detect collinearity in a multiple linear regression model.
The detection of worrying approximate multicollinearity in a multiple linear regression model is a problem addressed in all existing statistical packages. However, we have detected deficits regarding to the incorrect treatment of qualitative independent variables and the role of the intercept of the model. The objective of this package is to correct these deficits. In this package will be available detection and treatment techniques traditionally used as the recently developed.
Román Salmerón Gómez (University of Granada), Catalina García García (University of Granada) and José García García (University of Almería).
multiColl
: An R Package for Detecting Multicollinearity. Working paper.
This function returns the Condition Number (CN) of the independent variables in a multiple linear regression.
CN(X)
CN(X)
X |
A numeric design matrix that should contain more than one regressor (intercept included). |
Due to the CN takes into account the intercept, it allows to detect not only the essential but also the non-essential collinearity. It also allows to consider non-quantitative independent variables.
Its calculation is obtained from the function lu
, contrary to the function kappa
.
The condition number of a matrix, that is, the maximum condition index.
Values of CN between 20 and 30 indicate near moderate multicollinearity while values higher than 30 indicate near worrying collinearity.
R. Salmeron ([email protected]) and C. Garcia ([email protected]).
D. A. Belsley (1991). Conditioning diagnostics: collinearity and weak dara in regression. John Wiley & Sons, New York.
L. R. Klein and A.S. Goldberger (1964). An economic model of the United States, 1929-1952. North Holland Publishing Company, Amsterdan.
H. Theil (1971). Principles of Econometrics. John Wiley & Sons, New York.
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) CN(theil.X) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) CN(KG.X)
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) CN(theil.X) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) CN(KG.X)
This function returns the Condition Number (CN) of the independent variables of a multiple linear model considering the intercept and without considering it. It also returns the increase produced by going from not taking into account the intercept to having it.
CNs(X)
CNs(X)
X |
A numeric design matrix that should contain more than one regressor (intercept included). |
CN1 |
Condition Number without intercept. |
CN2 |
Condition Number with intercept. |
increment |
Increase (in percentage) in the CN from CN1 to CN2. |
R. Salmerón ([email protected]) and C. García ([email protected]).
D. A. Belsley (1991). Conditioning diagnostics: collinearity and weak data in regression. John Wiley & Sons, New York.
L. R. Klein and A.S. Goldberger (1964). An economic model of the United States, 1929-1952. North Holland Publishing Company, Amsterdan.
H. Theil (1971). Principles of Econometrics. John Wiley & Sons, New York.
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) CNs(theil.X) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) CNs(KG.X)
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) CNs(theil.X) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) CNs(KG.X)
The function calculates the Coeficient of Variation (CV) of a quantitative vector.
CV(x)
CV(x)
x |
A quantitative vector. |
The CV of x
.
R. Salmerón ([email protected]) and C. García ([email protected]).
# random x = sample(1:50, 25) x CV(x)
# random x = sample(1:50, 25) x CV(x)
The function returns the Coeficient of Variation (CV) of a matrix with quantitative columns.
CVs(X, dummy = FALSE, pos = NULL)
CVs(X, dummy = FALSE, pos = NULL)
X |
A numeric design matrix that should contain more than one regressor (intercept included). |
dummy |
A logical value that indicates if there are dummy variables in the design matrix |
pos |
A numeric vector that indicates the position of the dummy variables, if these exist, in the design matrix |
Due to the calculation of the CV only makes sense for quantitative data, other kind of data should be ignored in the calculation. For this reason, it is necessary to indicate if there are non-quantitative variables and also its position in the matrix.
The CV of each column of X
.
R. Salmerón ([email protected]) and C. García ([email protected]).
CV
.
# random cte = array(1, 50) x1 = sample(1:50, 25) x2 = sample(1:50, 25) Z = cbind(cte, x1, x2) head(Z) CVs(Z) x3 = sample(c(array(1,25), array(0,25)), 25) W = cbind(Z, x3) head(W) CVs(W, dummy=TRUE, pos = 4) x0 = sample(c(array(1,25), array(0,25)), 25) Y = cbind(cte, x0, x1, x2, x3) head(Y) CVs(Y, dummy=TRUE, pos=c(2,5))
# random cte = array(1, 50) x1 = sample(1:50, 25) x2 = sample(1:50, 25) Z = cbind(cte, x1, x2) head(Z) CVs(Z) x3 = sample(c(array(1,25), array(0,25)), 25) W = cbind(Z, x3) head(W) CVs(W, dummy=TRUE, pos = 4) x0 = sample(c(array(1,25), array(0,25)), 25) Y = cbind(cte, x0, x1, x2, x3) head(Y) CVs(Y, dummy=TRUE, pos=c(2,5))
Klein and Goldberger data on consumption and wage income.
data("KG")
data("KG")
A data frame with 14 observations in relation to the following four variables:
consumption
Domestic consumption.
wage.income
Wage income.
non.farm.income
Non-wage-non-farm income.
farm.income
Farm income.
Data for the years 1942 to 1944 are not available for the war.
L. R. Klein and A.S. Goldberger (1964). An economic model of the United States, 1929-1952. North Holland Publishing Company, Amsterdan.
data(KG) head(KG)
data(KG) head(KG)
The function returns the index of Stewart of the independent variables in the multiple linear regession model.
ki(X, dummy = FALSE, pos = NULL)
ki(X, dummy = FALSE, pos = NULL)
X |
A numeric design matrix that should contain more than one regressor (intercept included). |
dummy |
A logical value that indicates if there are dummy variables in the design matrix |
pos |
A numeric vector that indicates the position of the dummy variables, if these exist, in the design matrix |
The index of Stewart allows to detect the near essential and non-essential multicollinearity existing in a multiple linear regression model. In addition, due to its relation with the Variance Inflation Factor (VIF), it allows to calculate the proportion of essential and non-essential multicollinearity in each independent variable (intercept excluded). The Stewart's index for the intercept indicates the degree of non-essential multicollinearity existing in the model.
The relation of the the VIF with the index of Stewart implies that it should not be calculated for non-quantitative variables.
ki |
Stewart's index for each independent variable. |
porc1 |
Proportion of essential multicollinearity in the i-th independent variable (without intercept). |
porc2 |
Proportion of non-essential multicollinearity in the i-th independent variable (without intercept). |
R. Salmerón ([email protected]) and C. García ([email protected]).
G. Stewart (1987). Collinearity and least squares regression. Statistical Science, 2 (1), 68-100.
L. R. Klein and A.S. Goldberger (1964). An economic model of the United States, 1929-1952. North Holland Publishing Company, Amsterdan.
H. Theil (1971). Principles of Econometrics. John Wiley & Sons, New York.
VIF
.
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) ki(theil.X, TRUE, pos = 4) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) ki(KG.X)
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) ki(theil.X, TRUE, pos = 4) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) ki(KG.X)
The function transforms the matrix X
so that each column has unit length, it is to say, a module equal to 1.
lu(X)
lu(X)
X |
A numeric matrix that should contain more than one column. |
Original matrix transformed so that each column has a module equal to 1.
R. Salmerón ([email protected]) and C. García ([email protected]).
R. Salmerón, C. B. García and J. García (2018). Variance Inflation Factor and Condition Number in multiple linear regression. Journal of Statistical Computation and Simulation, 88 (12), 2365-2384.
L. R. Klein and A.S. Goldberger (1964). An economic model of the United States, 1929-1952. North Holland Publishing Company, Amsterdan.
H. Theil (1971). Principles of Econometrics. John Wiley & Sons, New York.
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) lu(theil.X) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) lu(KG.X) # random x1 = sample(1:10,5) x2 = sample(1:10,5) x = cbind(x1, x2) x norm(x[,1],"2") norm(x[,2],"2") x.lu = lu(x) x.lu norm(x.lu[,1],"2") norm(x.lu[,2],"2")
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) lu(theil.X) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) lu(KG.X) # random x1 = sample(1:10,5) x2 = sample(1:10,5) x = cbind(x1, x2) x norm(x[,1],"2") norm(x[,2],"2") x.lu = lu(x) x.lu norm(x.lu[,1],"2") norm(x.lu[,2],"2")
The function collects all existing measures to detect worrying multicollinearity in the package multiCol
.
multiCol(X, dummy = FALSE, pos = NULL)
multiCol(X, dummy = FALSE, pos = NULL)
X |
A numeric design matrix that should contain more than one regressor (intercept included). |
dummy |
A logical value that indicates if there are dummy variables in the design matrix |
pos |
A numeric vector that indicates the position of the dummy variables, if these exist, in the design matrix |
If X contains two independent variables (intercept included) see SLM
function.
If X contains more than two independent variables (intercept included):
CV |
Coeficients of variation of quantitative variables in |
Prop |
Proportion of ones in the dummy variables. |
R |
Matrix correlation of the quantitative variables in |
detR |
Determinant of the matrix correlation of the quantitative variables in |
VIF |
Variance Inflation Factors of the quantitative variables in |
CN |
Condition Number of |
ki |
Stewart's index of the quantitative variables in |
For more detail, see the help of the functions in See Also
.
R. Salmerón ([email protected]) and C. García ([email protected]).
L. R. Klein and A.S. Goldberger (1964). An economic model of the United States, 1929-1952. North Holland Publishing Company, Amsterdan.
H. Theil (1971). Principles of Econometrics. John Wiley & Sons, New York.
SLM
, CV
, PROPs
, RdetR
, VIF
, CN
, ki
.
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) multiCol(theil.X, TRUE, pos = 4) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) multiCol(KG.X) # random x1 = array(1,25) x2 = rnorm(25,100,1) x = cbind(x1,x2) head(x) multiCol(x) # random x1 = array(1,25) x2 = sample(cbind(array(1,25),array(0,25)),25) x = cbind(x1,x2) head(x) multiCol(x, TRUE)
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) multiCol(theil.X, TRUE, pos = 4) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) multiCol(KG.X) # random x1 = array(1,25) x2 = rnorm(25,100,1) x = cbind(x1,x2) head(x) multiCol(x) # random x1 = array(1,25) x2 = sample(cbind(array(1,25),array(0,25)),25) x = cbind(x1,x2) head(x) multiCol(x, TRUE)
The functions collects all the measure to detect near worrying multicollinearity existing in the package multiCol
. In adddition, it provides the estimations by ordinary least squares (OLS) of the multiple linear regession model and the variations in the estimations of the coefficients as a consequence of changes in the observed data.
multiColLM(y, X, dummy = FALSE, pos1 = NULL, n, mu, dv, tol = 0.01, pos2 = NULL)
multiColLM(y, X, dummy = FALSE, pos1 = NULL, n, mu, dv, tol = 0.01, pos2 = NULL)
y |
Observations of the dependent variable of the model. |
X |
Observations of the independent variables of the model (intercept included). |
dummy |
A logical value that indicates if there are dummy variables in the design matrix |
pos1 |
A numeric vector that indicates the position of the dummy variables, if these exist, in the design matrix |
n |
Number of times that the perturbation is performed. |
mu |
Any real number. |
dv |
Any real positive number. |
tol |
A value between 0 and 1. By default |
pos2 |
A numeric vector that indicates the position of the independent variables to disturb once you eliminate in |
The estimation by OLS of the linear regression model.
Percentiles 2.5 and 97.5 of the proportion of the variations in the estimations of the coefficients obtained from a perturbation of tol
% in the quantitative variables of X
.
If X contains two independent variables (intercept included) see SLM
function.
If X contains more than two independent variables (intercept included):
CV |
Coeficients of variation of quantitative variables in |
Prop |
Proportion of ones in the dummy variables. |
R |
Matrix correlation of the quantitative variables in |
detR |
Determinant of the matrix correlation of the quantitative variables in |
VIF |
Variance Inflation Factors of the quantitative variables in |
CN |
Condition Number of |
ki |
Stewart's index of the quantitative variables in |
For more detail, see the help of the functions in See Also
.
R. Salmerón ([email protected]) and C. García ([email protected]).
L. R. Klein and A.S. Goldberger (1964). An economic model of the United States, 1929-1952. North Holland Publishing Company, Amsterdan.
H. Theil (1971). Principles of Econometrics. John Wiley & Sons, New York.
SLM
, CV
, PROPs
, RdetR
, VIF
, CN
, ki
, multiCol
, perturb
, perturb.n
.
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) head(theil.X) multiColLM(theil[,2], theil.X, dummy = TRUE, pos1 = 4, 5, 5, 5, tol=0.01, pos2 = 1:2) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) head(KG.X) multiColLM(KG[,1], KG.X, n = 500, mu = 5, dv = 5, tol=0.01, pos2 = 1:3)
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) head(theil.X) multiColLM(theil[,2], theil.X, dummy = TRUE, pos1 = 4, 5, 5, 5, tol=0.01, pos2 = 1:2) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) head(KG.X) multiColLM(KG[,1], KG.X, n = 500, mu = 5, dv = 5, tol=0.01, pos2 = 1:3)
The function modifies a set of quantitative data.
perturb(x, mu, dv, tol = 0.01)
perturb(x, mu, dv, tol = 0.01)
x |
A numeric quantitative vector. |
mu |
Any real number. |
dv |
Any real positive number. |
tol |
A value between 0 and 1. By default |
The vector of data set is modified a tol
% by following the procedure presented by Belsley (1982).
The vector x
modified a tol
%.
R. Salmerón ([email protected]) and C. García ([email protected]).
D. Belsley (1982). Assessing the presence of harmfull collinearity and other forms of weak data throught a test for signal-to-noise. Journal of Econometrics, 20, 211-253.
L. R. Klein and A.S. Goldberger (1964). An economic model of the United States, 1929-1952. North Holland Publishing Company, Amsterdan.
H. Theil (1971). Principles of Econometrics. John Wiley & Sons, New York.
# Henri Theil's textile consumption data modified data(theil) head(theil) consume.p1 = perturb(theil[,2], 3, 4, 0.01) consume.p2 = perturb(theil[,2], 50, 10, 0.01) x = cbind(theil[,2], consume.p1, consume.p2) head(x) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) farm.income.p1 = perturb(KG[,4], -3, 40, 0.01) farm.income.p2 = perturb(KG[,4], 10, 8, 0.01) x = cbind(KG[,4], farm.income.p1, farm.income.p2) head(x)
# Henri Theil's textile consumption data modified data(theil) head(theil) consume.p1 = perturb(theil[,2], 3, 4, 0.01) consume.p2 = perturb(theil[,2], 50, 10, 0.01) x = cbind(theil[,2], consume.p1, consume.p2) head(x) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) farm.income.p1 = perturb(KG[,4], -3, 40, 0.01) farm.income.p2 = perturb(KG[,4], 10, 8, 0.01) x = cbind(KG[,4], farm.income.p1, farm.income.p2) head(x)
The function quantifies the variations in the estimations of the coefficients of a multiple linear regression when a perturbation is introduced in the quantitative data set.
perturb.n(data, n, mu, dv, tol = 0.01, pos = NULL)
perturb.n(data, n, mu, dv, tol = 0.01, pos = NULL)
data |
Data set |
n |
Number of times that perturbation is performed. |
mu |
Any real number. |
dv |
Any real positive number. |
tol |
A value between 0 and 1. By default |
pos |
A numeric vector that indicates the position of the independent variables to disturb once you eliminate in |
tols |
A vector presenting the percentage of disturbance induced in the variables indicated in each iteration. |
norms |
A vector presenting the percentage of variation in the estimations of the coefficients in each iteration. |
tols
must be a constant vector equal to tol
. It is obtained to check if data have been correctly perturbed.
R. Salmerón ([email protected]) and C. García ([email protected]).
D. Belsley (1982). Assessing the presence of harmfull collinearity and other forms of weak data throught a test for signal-to-noise. Journal of Econometrics, 20, 211-253.
L. R. Klein and A.S. Goldberger (1964). An economic model of the United States, 1929-1952. North Holland Publishing Company, Amsterdan.
H. Theil (1971). Principles of Econometrics. John Wiley & Sons, New York.
tol = 0.01 mu = 10 dv = 10 # Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.y.X = cbind(theil[,2], cte, theil[,-(1:2)]) head(theil.y.X) iterations = 5 perturb.n.T = perturb.n(theil.y.X, iterations, mu, dv, tol, pos = c(1,2)) perturb.n.T mean(perturb.n.T[,1]) mean(perturb.n.T[,2]) c(min(perturb.n.T[,2]), max(perturb.n.T[,2])) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.y.X = cbind(KG[,1], cte, KG[,-1]) head(KG.y.X) iterations = 1000 perturb.n.KG = perturb.n(KG.y.X, iterations, mu, dv, tol, pos = c(1,2,3)) mean(perturb.n.KG[,1]) mean(perturb.n.KG[,2]) c(min(perturb.n.KG[,2]), max(perturb.n.KG[,2]))
tol = 0.01 mu = 10 dv = 10 # Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.y.X = cbind(theil[,2], cte, theil[,-(1:2)]) head(theil.y.X) iterations = 5 perturb.n.T = perturb.n(theil.y.X, iterations, mu, dv, tol, pos = c(1,2)) perturb.n.T mean(perturb.n.T[,1]) mean(perturb.n.T[,2]) c(min(perturb.n.T[,2]), max(perturb.n.T[,2])) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.y.X = cbind(KG[,1], cte, KG[,-1]) head(KG.y.X) iterations = 1000 perturb.n.KG = perturb.n(KG.y.X, iterations, mu, dv, tol, pos = c(1,2,3)) mean(perturb.n.KG[,1]) mean(perturb.n.KG[,2]) c(min(perturb.n.KG[,2]), max(perturb.n.KG[,2]))
The functions returns the proportion of ones in the dummy variables existing in a matrix.
PROPs(X, dummy = TRUE, pos = NULL)
PROPs(X, dummy = TRUE, pos = NULL)
X |
A numeric matrix that should contain more than one regressor (intercept included). |
dummy |
A logical value that indicates if there are dummy variables in the matrix |
pos |
A numeric vector that indicates the position of the dummy variables, if these exist, in the matrix |
R. Salmerón ([email protected]) and C. García ([email protected]).
# random x1 = sample(1:50, 25) x2 = sample(1:50, 25) x3 = sample(cbind(array(1,25), array(0,25)), 25) x4 = sample(cbind(array(1,25), array(0,25)), 25) x = cbind(x1, x2, x3, x4) head(x) PROPs(x, TRUE, pos = c(3,4))
# random x1 = sample(1:50, 25) x2 = sample(1:50, 25) x3 = sample(cbind(array(1,25), array(0,25)), 25) x4 = sample(cbind(array(1,25), array(0,25)), 25) x = cbind(x1, x2, x3, x4) head(x) PROPs(x, TRUE, pos = c(3,4))
The function returns the matrix of simple linear correlations between the independent variables of a multiple linear model and its determinant.
RdetR(X, dummy = FALSE, pos = NULL)
RdetR(X, dummy = FALSE, pos = NULL)
X |
A numeric design matrix that should contain more than one regressor (intercept included). |
dummy |
A logical value that indicates if there are dummy variables in the design matrix |
pos |
A numeric vector that indicates the position of the dummy variables, if these exist, in the design matrix |
The measures calculated by this function ignore completelly the role of the intercept in the linear relations between the independent variables. Thus, these measures only detect the near essential multicollinearity. Although the simple correlations only quantify relation between pairs of variables, the determinant of the matrix of correlations is able to detect broader relations. Due to the coefficients of simple linear regression are calculated for quantitative variables, if the model contains other kinds of variables (such as dummy variables), they should be omitted in the analysis by using the arguments dummy
and pos
.
R |
Correlation matrix of the independent variables of the multiple linear regression model. |
detR |
Determinant of |
Values of the coefficient of simple linear correlation higher than 0.9487 imply worrying near essential multicollinearity between pairs of variables.
Values of the determinant of R
lower than 0.1013 + 0.00008626*n - 0.01384*k
, where n
is the number of observations and k
the number of indepedent variables (intercept included), indicate worrying near essential multicollinearity.
R. Salmerón ([email protected]) and C. García ([email protected]).
C. García, R. Salmerón and C. B. García (2019). Choice of the ridge factor from the correlation matrix determinant. Journal of Statistical Computation and Simulation, 89 (2), 211-231.
D. Marquardt and R. Snee (1975). Ridge regression in practice. The American Statistician, 1 (29), 3–20.
L. R. Klein and A.S. Goldberger (1964). An economic model of the United States, 1929-1952. North Holland Publishing Company, Amsterdan.
H. Theil (1971). Principles of Econometrics. John Wiley & Sons, New York.
VIF
.
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) RdetR(theil.X, TRUE, pos = 4) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) RdetR(KG.X)
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) RdetR(theil.X, TRUE, pos = 4) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) RdetR(KG.X)
The function analyzes the presence of near worrying multicollinearity in the Simple Linear Model (SLM).
SLM(X, dummy = FALSE)
SLM(X, dummy = FALSE)
X |
A numeric design matrix that should contain two independent variables (intercept included). |
dummy |
A logical value that indicates if there are dummy variables in the design matrix |
The analysis of the presence of near worrying multicolllinearity in the SLM has been systematically ignored in some existing statistical softwares. However, it is possible to find worrying non essential multicollinearity in the SLM. In this case, the linear relation will be given by a second variable of X
with very little variablity. For this reason, the coeficient of variation is calculated when the variable is quantitative and the proportion of ones if the variable is non-quantitative.
If dummy=TRUE
:
Prop |
Proportion of ones in the dummy variable. |
CN |
Condition Number of |
If dummy=FALSE
:
CV |
Coeficient of variation of the second variable in |
VIF |
Variance Inflation Factor. |
CN |
Condition Number of |
ki |
Stewart's index of |
The VIF only detects the near essential multicollinearity and for this reason it is not appropriate to detect multicollinearity in the SLM. Indeed, in this case, the VIF will be always equal to 1.
R. Salmerón ([email protected]) and C. García ([email protected]).
R. Salmerón, C. B. García and J. García (2018). Variance Inflation Factor and Condition Number in multiple linear regression. Journal of Statistical Computation and Simulation, 88 (12), 2365-2384.
L. R. Klein and A.S. Goldberger (1964). An economic model of the United States, 1929-1952. North Holland Publishing Company, Amsterdan.
H. Theil (1971). Principles of Econometrics. John Wiley & Sons, New York.
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) SLM(theil.X, TRUE) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) SLM(KG.X) # random x1 = array(1,25) x2 = sample(1:50,25) x = cbind(x1,x2) head(x) SLM(x) # random x1 = array(1,25) x2 = rnorm(25,100,1) x = cbind(x1,x2) head(x) SLM(x) # random x1 = array(1,25) x2 = sample(cbind(array(1,25),array(0,25)),25) x = cbind(x1,x2) head(x) SLM(x, TRUE)
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) SLM(theil.X, TRUE) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) SLM(KG.X) # random x1 = array(1,25) x2 = sample(1:50,25) x = cbind(x1,x2) head(x) SLM(x) # random x1 = array(1,25) x2 = rnorm(25,100,1) x = cbind(x1,x2) head(x) SLM(x) # random x1 = array(1,25) x2 = sample(cbind(array(1,25),array(0,25)),25) x = cbind(x1,x2) head(x) SLM(x, TRUE)
Henri Theil's textile consumption data modified.
data("theil")
data("theil")
A data set with 17 observations in relation to the following five variables:
obs
Year.
consume
Volume of textile consumption per capita (base 1925=100).
income
Real Income per capita (base 1925=100).
relprice
Relative price of textiles (base 1925=100).
twentys
Dummy variable that differentiates between the twenties and thirties.
This data set is developed based on the original Henri Theil's textile consumption data. With the goal of showing the treatment of the detection of collinearity when non-quantitative variables exists in the multiple linear regression, a new dummy variable has been incorporated distinguishing between the twenties and thirties.
H. Theil (1971). Principles of Econometrics. John Wiley & Sons, New York.
data(theil) head(theil)
data(theil) head(theil)
The function returns the Variance Inflation Factors (VIFs) of the independent variables of the multiple linear regression model.
VIF(X, dummy = FALSE, pos = NULL)
VIF(X, dummy = FALSE, pos = NULL)
X |
A numeric design matrix that should contain more than one regressor (intercept included). |
dummy |
A logical value that indicates if there are dummy variables in the design matrix |
pos |
A numeric vector that indicates the position of the dummy variables, if these exist, in the design matrix |
The function returns the VIFs from the main diagonal of the inverse of the matrix of correlations of the independent variables of the multiple linear regression. Due to the VIF is only calculated for the independent variables, it only allows to detect the essential collinearity. In addition, the VIF is not adequate for dummy variables since it is obtained from the matrix of simple correlations.
Variance Inflation Factor of each independent variable excluded the intercept.
Values of VIF that exceed 10 indicate near essential multicolinearity.
R. Salmerón ([email protected]) and C. García ([email protected]).
D. Marquardt and R. Snee (1975). Ridge regression in practice. The American Statistician, 1 (29), 3–20.
L. R. Klein and A.S. Goldberger (1964). An economic model of the United States, 1929-1952. North Holland Publishing Company, Amsterdan.
H. Theil (1971). Principles of Econometrics. John Wiley & Sons, New York.
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) VIF(theil.X, TRUE, pos = 4) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) VIF(KG.X)
# Henri Theil's textile consumption data modified data(theil) head(theil) cte = array(1,length(theil[,2])) theil.X = cbind(cte,theil[,-(1:2)]) VIF(theil.X, TRUE, pos = 4) # Klein and Goldberger data on consumption and wage income data(KG) head(KG) cte = array(1,length(KG[,1])) KG.X = cbind(cte,KG[,-1]) VIF(KG.X)