Help for package rvif

Type:

Package

Title:

Collinearity Detection using Redefined Variance Inflation Factor and Graphical Methods

Version:

3.1

Date:

2025-09-01

Author:

R. Salmerón [aut, cre], C.B. García [aut]

Maintainer:

R. Salmerón <romansg@ugr.es>

Description:

The detection of troubling approximate collinearity in a multiple linear regression model is a classical problem in Econometrics. The objective of this package is to detect it using the redefined variance inflation factor and the scatterplot between the variance inflation factor and the coefficient of variation. For more details see Salmerón R., García C.B. and García J. (2018) <doi:10.1080/00949655.2018.1463376>, Salmerón, R., Rodríguez, A. and García C. (2020) <doi:10.1007/s00180-019-00922-x>, Salmerón, R., García, C.B, Rodríguez, A. and García, C. (2022) <doi:10.32614/RJ-2023-010>, Salmerón, R., García, C.B. and García, J. (2025) <doi:10.1007/s10614-024-10575-8> and Salmerón, R., García, C.B, García J. (2023, working paper) <doi:10.48550/arXiv.2005.02245>. You can also view the package vignette using 'browseVignettes("rvif")', the package website using 'browseURL(system.file("docs/index.html", package = "rvif"))' or version control on GitHub (https://github.com/rnoremlas/rvif_package).

License:

GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]

Encoding:

UTF-8

URL:

http://colldetreat.r-forge.r-project.org/, https://github.com/rnoremlas/rvif_package

Depends:

R (≥ 3.5.0), multiColl, car

LazyData:

true

Suggests:

knitr, rmarkdown, testthat (≥ 3.0.0)

VignetteBuilder:

knitr

Config/testthat/edition:

NeedsCompilation:

Packaged:

2025-09-01 08:46:32 UTC; Usuario

Repository:

CRAN

Date/Publication:

2025-09-05 15:50:02 UTC

Detecting multicollinearity using RVIF and graphical methods.

Description

Detecting troubling near-multicollinearity in multiple linear regression models is a classical econometric problem. The purpose of this package is to detect it by using the Redefined Variance Inflation Factor (RVIF) and the scatterplot between the Variance Inflation Factor (VIF) and the Coefficient of Variation (CV).

In addition, the RVIF is used to determine whether the statistical analysis of the model is affected by the degree of multicollinearity in the model.

Details

This package contains four functions. The first two, cv_vif and cv_vif_plot, respectively return the values of the Variance Inflation Factor (VIF) and the Coefficient of Variation (CV), as well as their representation in a scatterplot. It should be noted that the VIF is useful for detecting essential multicollinearity, while the CV is useful for detecting non-essential multicollinearity. Thus, the scatterplot of both measures can provide interesting information for determining whether there is a troubling degree of multicollinearity and identifying the type of multicollinearity present and the variables causing it.

On the other hand, the function rvifs calculates the Redefined Variance Inflation Factor (RVIF) and the percentage of approximate multicollinearity due to each independent variable.

Finally, the function multicollinearity determines whether the degree of multicollinearity in the regression model affects the statistical analysis of the model, i.e., whether the non-rejection of the null hypothesis in the individual significance tests is due to the linear relationships between the independent variables of the model.

Author(s)

Román Salmerón Gómez (University of Granada) and Catalina B. García García (University of Granada).

Maintainer: Román Salmerón Gómez (romansg@ugr.es)

References

Salmerón, R., García, C.B. and García, J. (2018). Variance inflation factor and condition number in multiple linear regression. Journal of Statistical Computation and Simulation, 88:2365-2384, doi: https://doi.org/10.1080/00949655.2018.1463376.

Salmerón, R., Rodríguez, A. and García, C.B. (2020). Diagnosis and quantification of the non-essential collinearity. Computational Statistics, 35(2), 647-666, doi: https://doi.org/10.1007/s00180-019-00922-x.

Salmerón, R., García, C.B., Rodríguez, A. and García, C. (2022). Limitations in detecting multicollinearity due to scaling issues in the mcvis package. R Journal, 14(4), 264-279, doi: https://doi.org/10.32614/RJ-2023-010.

Salmerón, R., García, C.B. and García, J. (2025). A redefined Variance Inflation Factor: overcoming the limitations of the Variance Inflation Factor. Computational Economics, 65, 337-363, doi: https://doi.org/10.1007/s10614-024-10575-8.

Overcoming the inconsistences of the variance inflation factor: a redefined VIF and a test to detect statistical troubling multicollinearity by Salmerón, R., García, C.B and García, J. (working paper, https://arxiv.org/pdf/2005.02245).

Cobb-Douglas data

Description

Data used in Example 2 of Salmerón, García and García (2024) (subsection 4.2) on data for the Cobb-Douglas production function.

Usage

data("CDpf")

Format

A data frame containing 28 observations on the following 4 variables:

P: Production (dependent variable).
cte: Intercept.
logK: Capital (in logarithm).
logW: Work (in logarithm).

Details

This dataset was originally used by Olva Maldonado (2009).

References

Olva Maldonado, H. (2009). Análisis de la función de producción Cobb-Douglas y su aplicación en el sector productivo mexicano. Tesis, Universidad Autónoma de Chapingo.

Examples

  head(CDpf, n=5)
  y = CDpf[,1]
  x = CDpf[,2:4]  
  multicollinearity(y, x)

First simulated data for the simple linear regression model

Description

First data used in example 4 of Salmerón, García and García (2024) (subsection 4.4) on the special case of the simple linear model.

Usage

data("SLM1")

Format

A data frame with 50 observations on the following 3 variables:

y1: Dependent variable simulated as y = 3 + 4*V + u where u is normally distributed with a mean of 0 and a variance of 2.
cte: Intercept.
V: Simulated from a normal distribution with a mean of 10 and a variance of 100.
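
A data set with this structure could be simulated as in the following sketch. The seed and the object name sim are assumptions made for illustration, so the values will not match those shipped in SLM1. Note that rnorm() takes a standard deviation, so the stated variances are passed as square roots.

```r
# Sketch only: the seed is an assumption, so this does not reproduce SLM1 itself
set.seed(1)
obs = 50
cte = rep(1, obs)                            # intercept column
V = rnorm(obs, mean = 10, sd = sqrt(100))    # variance 100
u = rnorm(obs, mean = 0, sd = sqrt(2))       # variance 2
y1 = 3 + 4*V + u
sim = data.frame(y1, cte, V)
head(sim, n=5)
```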

Examples

  head(SLM1, n=5)
  y = SLM1[,1]
  x = SLM1[,2:3]
  multicollinearity(y, x)

Second simulated data for the simple linear regression model

Description

Second data used in example 4 of Salmerón, García and García (2024) (subsection 4.4) on the special case of the simple linear model.

Usage

data("SLM2")

Format

A data frame with 50 observations on the following 3 variables:

y2: Dependent variable simulated as y = 3 + 4*Z + u where u is normally distributed with a mean of 0 and a variance of 2.
cte: Intercept.
Z: Simulated from a normal distribution with a mean of 10 and a variance of 0.1.
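
The regressors of SLM1 and SLM2 share the same mean and differ only in their variance, which changes their coefficient of variation. As a rough worked example, the sketch below takes the CV as the standard deviation over the absolute mean of the generating distribution (an assumption about the definition; the package may use a slightly different estimator) and compares it with the CV threshold 0.1002506 quoted in the cv_vif_plot documentation.

```r
# Theoretical CV = sd / |mean| for the generating distributions
cv_V = sqrt(100) / abs(10)   # regressor of SLM1 (variance 100)
cv_Z = sqrt(0.1) / abs(10)   # regressor of SLM2 (variance 0.1)
threshold = 0.1002506        # CV threshold quoted for cv_vif_plot
c(V = cv_V, Z = cv_Z)
c(V = cv_V < threshold, Z = cv_Z < threshold)  # TRUE flags a CV below the threshold
```

Under these assumptions only Z falls below the threshold, which is the pattern associated with a regressor that is strongly related to the intercept.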

Examples

  head(SLM2, n=5)
  y = SLM2[,1]
  x = SLM2[,2:3]
  multicollinearity(y, x)

Wissel data

Description

Wissel data on outstanding mortgage debt.

Usage

data("Wissel")

Format

A data frame with 17 observations on the following 6 variables:

t: Year.
D: Outstanding mortgage debt (dependent variable).
cte: Intercept.
C: Personal consumption (trillions of dollars).
I: Personal income (trillions of dollars).
CP: Outstanding consumer credit (trillions of dollars).

References

Wissel, J. (2009). A new biased estimator for multivariate regression models with highly collinear variables. Ph.D. thesis, Erlangung des naturwissenschaftlichen Doktorgrades der Bayerischen Julius-Maximilians-Universität Würzburg, url: https://opus.bibliothek.uni-wuerzburg.de/opus4-wuerzburg/frontdoor/deliver/index/docId/2949/file/wissel.pdf.

Examples

  head(Wissel, n=5)
  y = Wissel[,2]
  x = Wissel[,3:6]
  multicollinearity(y, x)

VIF and CV calculation

Description

This function provides the values for the Variance Inflation Factor (VIF) and the Coefficient of Variation (CV) for the independent variables (excluding the intercept) in a multiple linear regression model.

Usage

cv_vif(x, tol = 1e-30)

Arguments

x

A numerical design matrix containing more than one regressor, including the intercept in the first column.

tol

A real number that indicates the tolerance below which the system is considered computationally singular when calculating the VIF. The default value is tol=1e-30.
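
Whether a design matrix is linearly dependent can be explored with base R before calling cv_vif. The sketch below uses simulated data in which one regressor is the exact sum of two others (all object names are illustrative); it is not the check that cv_vif performs internally and it does not use the tol argument.

```r
set.seed(2025)
obs = 100
cte = rep(1, obs)
x2 = rnorm(obs, 5, 10)
x3 = rnorm(obs, -1, 30)
x4 = x2 + x3                 # exact linear combination of x2 and x3
x = cbind(cte, x2, x3, x4)
rank_x = qr(x)$rank          # numerical rank from a QR decomposition
rank_x < ncol(x)             # TRUE: the columns are linearly dependent
```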

Details

It is interesting to note the distinction between essential and non-essential multicollinearity. Essential multicollinearity happens when there is an approximate linear relationship between two or more independent variables (not including the intercept), while non-essential multicollinearity involves a linear relationship between the intercept and at least one independent variable. This distinction matters because the Variance Inflation Factor (VIF) only detects essential multicollinearity, while the Coefficient of Variation (CV) is useful for detecting only non-essential multicollinearity. Understanding this distinction and the limitations of each detection measure can be very useful for identifying whether there is a troubling degree of multicollinearity, and for determining the kind of multicollinearity present and the variables causing it.
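
The two measures can be approximated with base R as a cross-check. This is a sketch under assumptions: the helper names vif_sketch and cv_sketch are hypothetical, the VIF is taken as 1/(1 - R^2) from the auxiliary regression of each regressor on the others, and the CV uses the sample standard deviation, so the values may differ slightly from those returned by cv_vif.

```r
# Hypothetical helpers, not part of rvif
vif_sketch = function(x) {
  z = as.matrix(x)[, -1, drop = FALSE]              # drop the intercept column
  sapply(seq_len(ncol(z)), function(j) {
    r2 = summary(lm(z[, j] ~ z[, -j]))$r.squared    # auxiliary regression
    1 / (1 - r2)
  })
}
cv_sketch = function(x) {
  z = as.matrix(x)[, -1, drop = FALSE]
  apply(z, 2, function(v) sd(v) / abs(mean(v)))     # sample sd over absolute mean
}

set.seed(2025)
obs = 100
cte = rep(1, obs)
x2 = rnorm(obs, 5, 0.01)        # almost constant: related to the intercept
x3 = rnorm(obs, 5, 10)
x4 = x3 + rnorm(obs, 5, 1)      # related to x3
x = cbind(cte, x2, x3, x4)
data.frame(CV = cv_sketch(x), VIF = vif_sketch(x))
```

In this simulated design x2 is expected to show a very small CV with a VIF close to 1, while x3 and x4 are expected to show large VIFs.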

Value

CV

Coefficient of Variation of each independent variable.

VIF

Variance Inflation Factor of each independent variable.

Author(s)

R. Salmerón (romansg@ugr.es) and C. García (cbgarcia@ugr.es).

Examples

### Example 1 
### At least three independent variables, including the intercept, must be present

	head(SLM1, n=5)
	y = SLM1[,1]
	x = SLM1[,2:3]
	cv_vif(x)

### Example 2
### Creating the design matrix

	library(multiColl)
	set.seed(2025)
	obs = 100
	cte = rep(1, obs)
	x2 = rnorm(obs, 5, 0.01)
	x3 = rnorm(obs, 5, 10)
	x4 = x3 + rnorm(obs, 5, 1)
	x5 = rnorm(obs, -1, 30)
	x = cbind(cte, x2, x3, x4, x5)
	cv_vif(x)

### Example 3 
### Obtaining the design matrix after executing the command 'lm'

	library(multiColl)
	set.seed(2025)
	obs = 100
	cte = rep(1, obs)
	x2 = rnorm(obs, 5, 0.01)
	x3 = rnorm(obs, 5, 10)
	x4 = x3 + rnorm(obs, 5, 1)
	x5 = rnorm(obs, -1, 30)
	u = rnorm(obs, 0, 2)
	y = 5 + 4*x2 - 5*x3 + 2*x4 - x5 + u
	reg = lm(y~x2+x3+x4+x5)
	x = model.matrix(reg)
	cv_vif(x) # identical to Example 2

### Example 4
### Computationally singular system

	head(soil, n=5)
	y = soil[,16]
	x = soil[,-16]
	cv_vif(x)

Scatterplot of CV vs VIF

Description

This function provides a graphical representation of a scatter plot showing the Coefficient of Variation (CV) and the Variance Inflation Factor (VIF) for the independent variables (excluding the intercept) of a multiple linear regression model.

Usage

cv_vif_plot(x, limit = 40)

Arguments

x

This is the output of the function cv_vif.

limit

A real number that indicates the lower limit of the vertical axis. The default value is limit=40.

Details

The distinction between essential and non-essential multicollinearity and the limitations of each measure (CV and VIF) for detecting the different kinds of multicollinearity can be very useful for identifying whether there is a troubling degree of multicollinearity, and for determining the kind of multicollinearity present and the variables causing it.

For this purpose, it is important to include the lines corresponding to the established thresholds for each measure in the representation of the scatter plot of the CV and VIF: a dashed vertical line for 0.1002506 (CV) and a dotted horizontal line for 10 (VIF). These lines determine four regions (see Example 1), which can be interpreted as follows:

A: existence of troubling non-essential and non-troubling essential multicollinearity.
B: existence of troubling essential and non-essential multicollinearity.
C: existence of non-troubling non-essential and troubling essential multicollinearity.
D: non-troubling degree of existing multicollinearity (essential and non-essential).
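
The four regions can also be assigned programmatically. The function below is a hypothetical helper (classify_region is not part of rvif) that applies the two thresholds quoted above to pairs of CV and VIF values, following the layout drawn in Example 1.

```r
# Hypothetical helper, not part of rvif
classify_region = function(cv, vif, cv_limit = 0.1002506, vif_limit = 10) {
  non_essential = cv < cv_limit    # left of the vertical line: troubling non-essential
  essential = vif > vif_limit      # above the horizontal line: troubling essential
  ifelse(non_essential & !essential, "A",
  ifelse(non_essential & essential, "B",
  ifelse(!non_essential & essential, "C", "D")))
}

# Illustrative values only, one pair per region
classify_region(cv = c(0.002, 0.002, 2, 2), vif = c(1, 50, 50, 1))
```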

Author(s)

R. Salmerón (romansg@ugr.es) and C.B. García (cbgarcia@ugr.es).

Examples

### Example 1

	plot(-2:20, -2:20, type = "n", xlab="Coefficient of Variation", 
	                              ylab="Variance Inflation Factor")
	abline(h=10, col="red", lwd=3, lty=2)
	abline(h=0, col="black", lwd=1)
	abline(v=0.1002506, col="red", lwd=3, lty=3)
	#abline(v=0, col="red", lwd=1)
	text(-1.25, 2, "A", pos=3, col="blue")
	text(-1.25, 12, "B", pos=3, col="blue")
	text(10, 12, "C", pos=3, col="blue")
	text(10, 2, "D", pos=3, col="blue")

### Example 2

	library(multiColl)
	set.seed(2025)
	obs = 100
	cte = rep(1, obs)
	x2 = rnorm(obs, 5, 0.01)
	x3 = rnorm(obs, 5, 10)
	x4 = x3 + rnorm(obs, 5, 1)
	x5 = rnorm(obs, -1, 30)
	x = cbind(cte, x2, x3, x4, x5)
	cv_vif_plot(cv_vif(x))
	cv_vif_plot(cv_vif(x), limit=0) # note the effect of the 'limit' argument

### Example 3
### Graphical representation is not possible
	
	head(SLM2, n=5)
	x = SLM2[,2:3]
	cv_vif_plot(cv_vif(x))
	
### Example 4
### Computationally singular system
	
	head(soil, n=5)
	x = soil[,-16]
	cv_vif_plot(cv_vif(x))

Spanish company employee data

Description

Data used in example 3 of Salmerón, García and García (2024) (subsection 4.3) on the number of employees of Spanish companies.

Usage

data("employees")

Format

A data frame with 15 observations on the following 5 variables:

NE: Number of employees (dependent variable).
cte: Intercept.
FA: Fixed assets (in euros).
OI: Operating income (in euros).
S: Sales (in euros).

Details

This dataset was originally used by Salmerón, Rodríguez, García and García (2020).

References

Salmerón, R., Rodríguez, A., García, C.B. and García, J. (2020). The VIF and MSE in raise regression. Mathematics, 8(4), doi: https://doi.org/10.3390/math8040605.

Examples

  head(employees, n=5)
  y = employees[,1]
  x = employees[,3:5]
  multicollinearity(y, x)

Euribor data

Description

Data used in example 1 of Salmerón, García and García (2024) (subsection 4.1) on Euribor data.

Usage

data("euribor")

Format

A data frame with 47 observations on the following 5 variables:

E: Euribor (dependent variable, in percentage).
cte: Intercept.
HIPC: Harmonized index of consumer prices (in percentage).
BC: Balance of payments to net current account (millions of euros).
GD: Government deficit to net nonfinancial accounts (millions of euros).

Details

This dataset was originally used by Salmerón, Rodríguez and García (2020).

Examples

  head(euribor, n=5)
  y = euribor[,1]
  x = euribor[,2:5]
  multicollinearity(y, x)

Decision Rule to Detect Troubling Multicollinearity

Description

Given a multiple linear regression model with n observations and k independent variables, the degree of near-multicollinearity affects its statistical analysis (at a significance level of alpha) if there is a variable i, with i = 1,...,k, for which the null hypothesis of the individual significance test is not rejected in the original model but is rejected in the orthonormal model of reference.

Usage

multicollinearity(y, x, alpha = 0.05)

Arguments

y

A numerical vector representing the dependent variable of the model.

x

A numerical design matrix that should contain more than one regressor (intercept included in the first column).

alpha

Significance level (by default, 5%).
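
Because the intercept is expected in the first column of x, it can help to validate the design matrix before calling the function. The check below is a hypothetical helper (check_design is not part of rvif) written with base R only.

```r
# Hypothetical helper, not part of rvif
check_design = function(x) {
  x = as.matrix(x)
  stopifnot(is.numeric(x), ncol(x) > 1)   # numeric, more than one regressor
  all(x[, 1] == 1)                        # TRUE if the first column is the intercept
}

set.seed(2025)
obs = 100
x2 = rnorm(obs, 5, 10)
check_design(cbind(rep(1, obs), x2))  # intercept in the first column
check_design(cbind(x2, rep(1, obs)))  # intercept misplaced
```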

Details

This function compares the individual inference of the original model with that of the orthonormal model taken as reference.

Thus, if the null hypothesis is rejected in the individual significance tests in the model where there are no linear relationships between the independent variables (orthonormal) and is not rejected in the original model, the non-rejection is due to the existing linear relationships between the independent variables (multicollinearity) in the original model.

The orthonormal model is obtained from the original model by performing a QR decomposition, which eliminates the initial linear relationships.
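
The idea of an orthonormal reference model can be illustrated with base R. This is a sketch only: it uses qr() and lm() on simulated data to compare individual p-values, whereas multicollinearity bases its decision on the RVIF and its thresholds, so the two outputs are not interchangeable. All object names are illustrative.

```r
set.seed(2024)
obs = 100
cte = rep(1, obs)
x3 = rnorm(obs, 5, 10)
x4 = x3 + rnorm(obs, 5, 0.5)             # strongly related to x3
x = cbind(cte, x3, x4)
y = 4 - 9*x3 - 2*x4 + rnorm(obs, 0, 2)

q = qr.Q(qr(x))                           # orthonormal columns spanning the same space as x
round(crossprod(q), 10)                   # identity matrix: no linear relationships remain

p_original = summary(lm(y ~ x + 0))$coefficients[, 4]     # p-values with the original regressors
p_orthonormal = summary(lm(y ~ q + 0))$coefficients[, 4]  # p-values with the orthonormal regressors
cbind(p_original, p_orthonormal)
```

Both fits span the same column space, so they share fitted values and residuals; only the individual inference changes.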

Value

The function returns the value of the RVIF and the established thresholds, as well as indicating whether or not the individual significance analysis is affected by multicollinearity at the chosen significance level.

Author(s)

Román Salmerón Gómez (University of Granada) and Catalina B. García García (University of Granada).

Maintainer: Román Salmerón Gómez (romansg@ugr.es)

References

Salmerón, R., García, C.B. and García, J. (2025). A Redefined Variance Inflation Factor: overcoming the limitations of the Variance Inflation Factor. Computational Economics, 65, 337-363, doi: https://doi.org/10.1007/s10614-024-10575-8.

Examples

### Example 1
	
  set.seed(2024)
  obs = 100
  cte = rep(1, obs)
  x2 = rnorm(obs, 5, 0.01)  # related to intercept: non essential
  x3 = rnorm(obs, 5, 10)
  x4 = x3 + rnorm(obs, 5, 0.5) # related to x3: essential
  x5 = rnorm(obs, -1, 3)
  x6 = rnorm(obs, 15, 0.5)
  y = 4 + 5*x2 - 9*x3 -2*x4 + 2*x5 + 7*x6 + rnorm(obs, 0, 2)
  x = cbind(cte, x2, x3, x4, x5, x6)
  multicollinearity(y, x)

### Example 2
### Effect of sample size
  
  obs = 25 # decreasing the number of observations affects x4
  cte = rep(1, obs)
  x2 = rnorm(obs, 5, 0.01)  # related to intercept: non essential
  x3 = rnorm(obs, 5, 10)
  x4 = x3 + rnorm(obs, 5, 0.5) # related to x3: essential
  x5 = rnorm(obs, -1, 3)
  x6 = rnorm(obs, 15, 0.5)
  y = 4 + 5*x2 - 9*x3 -2*x4 + 2*x5 + 7*x6 + rnorm(obs, 0, 2)
  x = cbind(cte, x2, x3, x4, x5, x6)
  multicollinearity(y, x)

### Example 3
  
  y = 4 - 9*x3 - 2*x5 + rnorm(obs, 0, 2)
  x = cbind(cte, x3, x5) # independently generated
  multicollinearity(y, x)
  
### Example 4
### Detection of multicollinearity in Wissel data
  
  head(Wissel, n=5)
  y = Wissel[,2]
  x = Wissel[,3:6]
  multicollinearity(y, x)
  
### Example 5
### Detection of multicollinearity in euribor data
  
  head(euribor, n=5)
  y = euribor[,1]
  x = euribor[,2:5]
  multicollinearity(y, x)
  
### Example 6
### Detection of multicollinearity in Cobb-Douglas production function data

  head(CDpf, n=5)
  y = CDpf[,1]
  x = CDpf[,2:4]  
  multicollinearity(y, x)
  
### Example 7
### Detection of multicollinearity in number of employees of Spanish companies data
  
  head(employees, n=5)
  y = employees[,1]
  x = employees[,3:5]
  multicollinearity(y, x)
  
### Example 8
### Detection of multicollinearity in simple linear model simulated data
  
  head(SLM1, n=5)
  y = SLM1[,1]
  x = SLM1[,2:3]
  multicollinearity(y, x)

  head(SLM2, n=5)
  y = SLM2[,1]
  x = SLM2[,2:3]
  multicollinearity(y, x)
    
### Example 9
### Detection of multicollinearity in soil characteristics data

  head(soil, n=5)
  y = soil[,16]
  x = soil[,-16] 
  x = cbind(rep(1, length(y)), x) # the design matrix has to have the intercept in the first column
  multicollinearity(y, x)
  multicollinearity(y, x[,-3]) # eliminating the problematic variable (SumCation)
  
### Example 10
### The intercept must be in the first column of the design matrix
  
  set.seed(2025)
  obs = 100
  cte = rep(1, obs)
  x2 = sample(1:500, obs)
  x3 = sample(1:500, obs)
  x4 = rep(4, obs)
  x = cbind(cte, x2, x3, x4)
  u = rnorm(obs, 0, 2)
  y = 5 + 2*x2 - 3*x3 + 10*x4 + u
  multicollinearity(y, x)
  multicollinearity(y, x[,-4]) # the constant variable is removed

RVIF calculation

Description

This function provides the values of the Redefined Variance Inflation Factor (RVIF) and the percentage of near multicollinearity due to each independent variable.

Usage

rvifs(x, ul = TRUE, intercept = TRUE, tol = 1e-30)

Arguments

x

A numerical design matrix that should contain more than one regressor. If it has an intercept, this must be in the first column of the matrix.

ul

A logical value that indicates if the variables in the design matrix x are transformed to unit length. By default ul=TRUE.

intercept

A logical value that indicates if the design matrix x has an intercept. By default intercept=TRUE.

tol

Value determining whether the system is computationally singular. By default tol=1e-30.

Details

The Redefined Variance Inflation Factor (RVIF) can detect both kinds of multicollinearity: essential (approximate linear relationship between at least two independent variables excluding the intercept) and non-essential (approximate linear relationship between the intercept and at least one of the remaining independent variables). This measure also quantifies the percentage of near multicollinearity due to each independent variable.
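
As a rough illustration of the unit-length transformation controlled by ul, the sketch below scales each column of a simulated design matrix to unit length and then takes the main diagonal of the inverse cross-product matrix. Treating that diagonal as the RVIF is an assumption made here for illustration; the exact definition and the percentage are given in the references, and the helper name unit_length is hypothetical.

```r
# Hypothetical helper, not part of rvif
unit_length = function(x) {
  x = as.matrix(x)
  sweep(x, 2, sqrt(colSums(x^2)), "/")   # divide each column by its Euclidean norm
}

set.seed(2025)
obs = 100
cte = rep(1, obs)
x2 = rnorm(obs, 5, 0.01)                 # almost constant: related to the intercept
x3 = rnorm(obs, 5, 10)
x = cbind(cte, x2, x3)

x_ul = unit_length(x)
colSums(x_ul^2)                          # every column now has unit length
# Assumption: diagonal of the inverse cross-product matrix of the unit-length design
rvif_sketch = diag(solve(crossprod(x_ul), tol = 1e-30))
rvif_sketch
```

Because the intercept column is kept in the design matrix, this quantity reacts to a near-constant regressor such as x2.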

Value

RVIF

Redefined Variance Inflation Factor of each independent variable.

%

Percentage of near multicollinearity due to each independent variable.

Author(s)

R. Salmerón (romansg@ugr.es) and C. García (cbgarcia@ugr.es).

References

Salmerón, R., García, C.B. and García, J. (2018). Variance inflation factor and condition number in multiple linear regression. Journal of Statistical Computation and Simulation, 88:2365-2384, doi: https://doi.org/10.1080/00949655.2018.1463376.

Salmerón, R., García, C.B. and García, J. (2025). A redefined Variance Inflation Factor: overcoming the limitations of the Variance Inflation Factor. Computational Economics, 65, 337-363, doi: https://doi.org/10.1007/s10614-024-10575-8.

Examples

### Example 1
	
	library(multiColl)
	set.seed(2025)
	obs = 100
	cte = rep(1, obs)
	x2 = rnorm(obs, 5, 0.01)
	x3 = rnorm(obs, 5, 10)
	x4 = x3 + rnorm(obs, 5, 1)
	x5 = rnorm(obs, -1, 30)
	x = cbind(cte, x2, x3, x4, x5)
	rvifs(x)
	
### Example 2
### The special case of the simple linear regression model
	
	head(SLM1, n=5)
	x = SLM1[,2:3]
	rvifs(x)
	
### Example 3
### The intercept must be in the first column of the design matrix
	
	set.seed(2025)
	obs = 100
	cte = rep(1, obs)
	x2 = sample(1:500, obs)
	x3 = sample(1:500, obs)
	x4 = rep(4, obs)
	x = cbind(cte, x2, x3, x4)
	rvifs(x) # also: perfect multicollinearity between the intercept and the constant variable
	rvifs(x[,-1], intercept = FALSE) # removing the constant from the design matrix
	
### Example 4
### Cases of perfect multicollinearity or computationally singular systems
	
	head(soil, n=5)
	x = soil[,-16]
	rvifs(x)

Soil characteristics data

Description

Data used in Bondell and Reich's paper on soil characteristics used as predictors of forest diversity.

Usage

data("soil")

Format

A data frame with 20 observations on the following 16 variables.

BaseSat: % Base Saturation.
SumCation: Sum Cations (sums of cations like calcium, magnesium, potassium and sodium).
CECbuffer: CEC.
Ca: Calcium.
Mg: Magnesium.
K: Potassium.
Na: Sodium.
P: Phosphorus.
Cu: Copper.
Zn: Zinc.
Mn: Manganese.
HumicMatter: Humic Matter.
Density: Density.
pH: pH.
ExchAc: Exchangeable Acidity.
Diversity: Forest diversity (dependent variable).

Details

This dataset was originally used by Bondell and Reich (2008).

References

Bondell, H.D. and Reich, B.J. (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64 (1), 115–23, doi: https://doi.org/10.1111/j.1541-0420.2007.00843.x.

Examples

  head(soil, n=5)
  y = soil[,16]
  x = soil[,-16] 
  x = cbind(rep(1, length(y)), x) # the design matrix has to have the intercept in the first column
  multicollinearity(y, x)
  multicollinearity(y, x[,-3]) # eliminating the problematic variable (SumCation)
