Help for package ecm

Type: Package
Title: Build Error Correction Models
Imports: stats, utils, car, earth
Version: 7.2.0
Author: Gaurav Bansal
Maintainer: Gaurav Bansal <gaurbans@gmail.com>
Description: Functions for easy building of error correction models (ECM) for time series regression.
URL: https://github.com/gaurbans/ecm
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
LazyData: TRUE
RoxygenNote: 7.2.3
Encoding: UTF-8
NeedsCompilation:
Packaged: 2024-01-22 21:18:42 UTC; GauravBansal
Repository: CRAN
Date/Publication: 2024-01-22 22:32:50 UTC

FRED data on the Wilshire 5000 index and other economic factors

Description

A dataset containing quarterly performance of the Wilshire 5000 index, corporate profits, Federal Reserve funds rate, and the unemployment rate.

Usage

data(Wilshire)

Format

A data frame with 188 rows and 5 variables:

date: observation date (quarterly)
Wilshire5000: quarterly Wilshire 5000 index, in value
CorpProfits: quarterly corporate profits, in value
FedFundsRate: quarterly federal funds rate, in percent
UnempRate: quarterly unemployment rate, in percent

Source

https://fred.stlouisfed.org/

Get cumulative count

Description

Get the cumulative count of a variable of interest

Usage

cumcount(x)

Arguments

x

A vector for which to get cumulative count

Value

The cumulative count of all items in x
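
As a rough illustration, and assuming that "cumulative count" means the number of times each value has occurred up to and including its position, the idea can be sketched in base R. The helper name cumcountSketch is hypothetical and is not part of the package; the package's own implementation may differ.

```r
# Hypothetical helper, not the package's implementation:
# for each element, count how many times its value has appeared so far.
cumcountSketch <- function(x) {
  # ave() applies seq_along within each group of equal values
  # and returns the results in the original order.
  ave(seq_along(x), x, FUN = seq_along)
}

x <- c("a", "b", "a", "a", "b")
counts <- cumcountSketch(x)
print(counts)
```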

Calculate Durbin's h-statistic

Description

Calculates Durbin's h-statistic for autoregressive models.

Usage

durbinH(model, ylag1var)

Arguments

model

The model being assessed

ylag1var

The variable in the model that represents the lag of the y-term

Details

The Durbin-Watson (DW) test is inappropriate for autoregressive models: when a lag of the target variable is among the predictors, the DW statistic is biased toward 2, so the test tends to miss first-order autocorrelation. This doesn't apply to an ECM, for which the DW test is still valid, but the durbinH function is included here in case an autoregressive model has been built. If Durbin's h-statistic is greater than 1.96 in absolute value, it is likely that autocorrelation exists.

Value

Numeric Durbin's h statistic
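
For orientation, the textbook form of the statistic is h = (1 - d/2) * sqrt(n / (1 - n * Var(b))), where d is the Durbin-Watson statistic, n is the number of observations, and Var(b) is the estimated variance of the coefficient on the lagged target. The base R sketch below computes it by hand on simulated data. It is an illustration under that assumed formula, not the package's code; all variable names are made up for the example.

```r
# Simulate an AR(1) series (assumed example data, not the Wilshire dataset).
set.seed(1)
n <- 200
y <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))

# Regress y on its own first lag.
d <- data.frame(y = y[-1], yLag1 = y[-n])
fit <- lm(y ~ yLag1, data = d)

e <- residuals(fit)
nObs <- length(e)

# Durbin-Watson statistic computed from the residuals.
dw <- sum(diff(e)^2) / sum(e^2)

# Estimated variance of the coefficient on the lagged target.
varLag <- vcov(fit)["yLag1", "yLag1"]

# The statistic is undefined when nObs * varLag >= 1, so guard that case.
h <- if (nObs * varLag < 1) {
  (1 - dw / 2) * sqrt(nObs / (1 - nObs * varLag))
} else {
  NA_real_
}
print(h)
```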

Examples

##Not run

#Build a simple AR1 model to predict performance of the Wilshire 5000 Index
data(Wilshire)
Wilshire$Wilshire5000Lag1 <- c(NA, Wilshire$Wilshire5000[1:(nrow(Wilshire)-1)])
Wilshire <- Wilshire[complete.cases(Wilshire),]
AR1model <- lm(Wilshire5000 ~ Wilshire5000Lag1, data=Wilshire)

#Check Durbin's h-statistic on AR1model
durbinH(AR1model, "Wilshire5000Lag1")
#The h-statistic is 4.23, which means there is likely autocorrelation in the data.

Build an error correction model

Description

Builds an lm object that represents an error correction model (ECM) by automatically differencing and lagging predictor variables according to ECM methodology.

Usage

ecm(
  y,
  xeq,
  xtr,
  includeIntercept = TRUE,
  weights = NULL,
  linearFitter = "lm",
  ...
)

Arguments

y

The target variable

xeq

The variables to be used in the equilibrium term of the error correction model

xtr

The variables to be used in the transient term of the error correction model

includeIntercept

Boolean whether the y-intercept should be included (should be set to TRUE if using 'earth' as linearFitter)

weights

Optional vector of weights to be passed to the fitting process

linearFitter

Whether to use 'lm' or 'earth' to fit the model

...

Additional arguments to be passed to the 'lm' or 'earth' function (careful that some arguments may not be appropriate for ecm!)

Details

The general format of an ECM is

\Delta y_{t} = \beta_{0} + \beta_{1}\Delta x_{1,t} +...+ \beta_{i}\Delta x_{i,t} + \gamma(y_{t-1} - (\alpha_{1}x_{1,t-1} +...+ \alpha_{i}x_{i,t-1})).

The ecm function here modifies the equation to the following:

\Delta y_{t} = \beta_{0} + \beta_{1}\Delta x_{1,t} +...+ \beta_{i}\Delta x_{i,t} + \gamma y_{t-1} + \gamma_{1}x_{1,t-1} +...+ \gamma_{i}x_{i,t-1},

where \gamma_{i} = -\gamma \alpha_{i},

so it can be modeled as a simpler ordinary least squares (OLS) function using R's lm function.
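
The reparameterised equation can be fitted by hand with lm, which may help make the mapping between the two forms concrete. The sketch below simulates one predictor with assumed true values gamma = -0.3, alpha_1 = 2 and beta_1 = 0.5, builds the differenced and lagged columns manually, and recovers alpha_1 as -gamma_1/gamma. The column names are illustrative and are not necessarily those the ecm function generates.

```r
# Simulated data from a known ECM (assumed parameters for illustration).
set.seed(42)
n <- 300
x <- cumsum(rnorm(n))  # random-walk predictor
y <- numeric(n)
for (t in 2:n) {
  y[t] <- y[t - 1] +
    0.5 * (x[t] - x[t - 1]) -              # transient term, beta_1 = 0.5
    0.3 * (y[t - 1] - 2 * x[t - 1]) +      # equilibrium term, gamma = -0.3, alpha_1 = 2
    rnorm(1, sd = 0.1)
}

# Difference the target and transient predictor; lag the equilibrium terms.
# One observation is lost in the process.
d <- data.frame(
  dy    = diff(y),
  dx    = diff(x),
  yLag1 = y[-n],
  xLag1 = x[-n]
)

fit <- lm(dy ~ dx + yLag1 + xLag1, data = d)

gammaHat <- unname(coef(fit)["yLag1"])
alphaHat <- unname(-coef(fit)["xLag1"] / gammaHat)  # since gamma_1 = -gamma * alpha_1
print(c(gamma = gammaHat, alpha = alphaHat))
```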

Ordinarily, the ECM uses lag=1 when differencing the transient term and lagging the equilibrium term, as specified in the equation above. However, the ecm function here gives the user the ability to specify a lag greater than 1.

Notice that an ECM models the change in the target variable (y). This means that the predictors will be lagged and differenced, and the model will be built on one observation fewer than the user inputs for y, xeq, and xtr. If these arguments contain too few observations (e.g., a single observation), the function will not work. For the same reason, if using weights in the ecm function, the length of weights should be one less than the number of rows in xeq or xtr.

When inputting a single variable for xeq or xtr in base R, it is important to input it in the format "xeq=df['col1']" so that it inherits the class 'data.frame'. Inputs such as "xeq=df[,'col1']" or "xeq=df$col1" will result in errors in the ecm function. You can load data via other R packages that store data in other formats, as long as those formats also inherit the 'data.frame' class.
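
The difference between these subsetting forms is standard base R behaviour and can be checked directly. The data frame below is a made-up example.

```r
# Made-up data frame for illustration.
df <- data.frame(col1 = c(1.2, 3.4, 5.6), col2 = c(7, 8, 9))

keepsFrame  <- df["col1"]     # single-bracket column selection keeps the data frame
dropsFrame1 <- df[, "col1"]   # matrix-style selection drops to a plain vector
dropsFrame2 <- df$col1        # $ extraction also returns a plain vector

print(class(keepsFrame))
print(class(dropsFrame1))
print(class(dropsFrame2))

# In base R, drop = FALSE also preserves the data frame in matrix-style selection.
keepsFrame2 <- df[, "col1", drop = FALSE]
print(class(keepsFrame2))
```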

By default, base R's 'lm' is used to fit the model. However, users can opt to use 'earth', which builds the regression with Jerome Friedman's Multivariate Adaptive Regression Splines (MARS). MARS transforms each continuous variable into piece-wise linear hinge functions, which allows for non-linear features in both the transient and equilibrium terms.
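
To give a feel for what a hinge function is, the base R sketch below fits a single hinge term with lm on simulated data. This is only an illustration of the basis function: the knot is fixed by hand at an assumed value, whereas 'earth' searches for knots and terms automatically. The helper name hinge is hypothetical.

```r
# Hypothetical helper: a hinge is zero below the knot and linear above it.
hinge <- function(x, knot) pmax(0, x - knot)

# Simulated data whose slope changes at x = 5 (assumed for illustration).
set.seed(7)
x <- seq(0, 10, length.out = 200)
y <- 1 + 0.5 * x + 2 * hinge(x, 5) + rnorm(200, sd = 0.2)

# With the knot known, the piece-wise linear fit is an ordinary linear model.
fit <- lm(y ~ x + hinge(x, 5))

# Third coefficient: estimated change in slope after the knot.
slopeChange <- unname(coef(fit)[3])
print(slopeChange)
```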

ECM models are used for time series data. This means the user may need to consider stationarity and/or cointegration before using the model.

Value

an lm object representing an error correction model

Examples

##Not run

#Use ecm to predict Wilshire 5000 index based on corporate profits, 
#Federal Reserve funds rate, and unemployment rate.
data(Wilshire)

#Use 2015-12-01 and earlier data to build models
trn <- Wilshire[Wilshire$date<='2015-12-01',]

#Assume all predictors are needed in the equilibrium and transient terms of ecm.
xeq <- xtr <- trn[c('CorpProfits', 'FedFundsRate', 'UnempRate')]
model1 <- ecm(trn$Wilshire5000, xeq, xtr, includeIntercept=TRUE)

#Assume CorpProfits and FedFundsRate are in the equilibrium term, 
#UnempRate has only transient impacts.
xeq <- trn[c('CorpProfits', 'FedFundsRate')]
xtr <- trn['UnempRate']
model2 <- ecm(trn$Wilshire5000, xeq, xtr, includeIntercept=TRUE)

Build an averaged error correction model

Description

Builds multiple ECM models on subsets of the data and averages them. See the lmave function for more details on the methodology and use cases for this approach.

Usage

ecmave(
  y,
  xeq,
  xtr,
  includeIntercept = TRUE,
  k,
  method = "boot",
  seed = 5,
  weights = NULL,
  ...
)

Arguments

y

The target variable

xeq

The variables to be used in the equilibrium term of the error correction model

xtr

The variables to be used in the transient term of the error correction model

includeIntercept

Boolean whether the y-intercept should be included

k

The number of models or data partitions desired

method

Whether to split data by folds ("fold"), nested folds ("nestedfold"), or bootstrapping ("boot")

seed

Seed for reproducibility (only needed if method is "boot")

weights

Optional vector of weights to be passed to the fitting process

...

Additional arguments to be passed to the 'lm' function (careful in that these may need to be modified for ecm or may not be appropriate!)

Details

In some cases, instead of building an ECM on the entire dataset, it may be preferable to build k ECMs on k subsets of the data, each subset containing (k-1)/k*nrow(data) observations of the full dataset, and then average their coefficients. Reasons to do this include controlling for overfitting or extending the training sample. For example, in many time series modeling exercises, the holdout test sample is the latest few months' or years' worth of data. Ideally, these data would be included in training, since they likely have more predictive power for the future than older observations. However, including the entire dataset in the training sample could result in overfitting, while using a different time period as the test sample may be even less representative of future performance. One potential solution is to build multiple ECMs using the entire dataset, each with a different holdout test sample, and then average them to get a final ECM. This approach is somewhat similar to the idea of random forest regression, in which multiple regression trees are built on subsets of the data and then averaged.

This function only works with the 'lm' linear fitter.

Value

an lm object representing an error correction model

Examples

##Not run

#Use ecm to predict Wilshire 5000 index based on corporate profits, 
#Federal Reserve funds rate, and unemployment rate
data(Wilshire)

#Use 2015-12-01 and earlier data to build models
trn <- Wilshire[Wilshire$date<='2015-12-01',]

#Build five ECM models and average them to get one model
xeq <- xtr <- trn[c('CorpProfits', 'FedFundsRate', 'UnempRate')]
model1 <- ecmave(trn$Wilshire5000, xeq, xtr, includeIntercept=TRUE, k=5)

Backwards selection using an averaged error correction model

Description

Much like the ecmback function, ecmaveback uses backwards selection to build an error correction model. However, it uses the averaging method of ecmave to build models and then choose variables based on lowest AIC or BIC, or highest adjusted R-squared.

Usage

ecmaveback(
  y,
  xeq,
  xtr,
  includeIntercept = T,
  criterion = "AIC",
  k,
  method = "boot",
  seed = 5,
  weights = NULL,
  keep = NULL,
  ...
)

Arguments

y

The target variable

xeq

The variables to be used in the equilibrium term of the error correction model

xtr

The variables to be used in the transient term of the error correction model

includeIntercept

Boolean whether the y-intercept should be included

criterion

Whether AIC (default), BIC, or adjustedR2 should be used to select variables

k

The number of models or data partitions desired

method

Whether to split data by folds ("fold"), nested folds ("nestedfold"), or bootstrapping ("boot")

seed

Seed for reproducibility (only needed if method is "boot")

weights

Optional vector of weights to be passed to the fitting process

keep

Optional character vector of variables to forcibly retain

...

Additional arguments to be passed to the 'lm' function (careful in that these may need to be modified for ecm or may not be appropriate!)

Details

When inputting a single variable for xeq or xtr, it is important to input it in the format "xeq=df['col1']" in order to retain the data frame class. Inputs such as "xeq=df[,'col1']" or "xeq=df$col1" will result in errors in the ecm function.

If using weights, the length of weights should be one less than the number of rows in xeq or xtr.

This function only works with the 'lm' linear fitter.

Value

an lm object representing an error correction model using backwards selection

Examples

##Not run

#Use ecm to predict Wilshire 5000 index based on corporate profits, 
#Federal Reserve funds rate, and unemployment rate
data(Wilshire)

#Use 2015-12-01 and earlier data to build models
trn <- Wilshire[Wilshire$date<='2015-12-01',]

#Use backwards selection to choose which predictors are needed 
xeq <- xtr <- trn[c('CorpProfits', 'FedFundsRate', 'UnempRate')]
modelaveback <- ecmaveback(trn$Wilshire5000, xeq, xtr, k = 5)
print(modelaveback)
#Backwards selection chose CorpProfits and FedFundsRate in the equilibrium term, 
#CorpProfits and UnempRate in the transient term.

modelavebackFFR <- ecmaveback(trn$Wilshire5000, xeq, xtr, k = 5, keep = 'UnempRate')
print(modelavebackFFR)
#Backwards selection was forced to retain UnempRate in both terms.

Backwards selection to build an error correction model

Description

Much like the ecm function, this builds an error correction model. However, it uses backwards selection to select the optimal predictors based on lowest AIC or BIC, or highest adjusted R-squared, rather than using all predictors.

Usage

ecmback(
  y,
  xeq,
  xtr,
  includeIntercept = T,
  criterion = "AIC",
  weights = NULL,
  keep = NULL,
  ...
)

Arguments

y

The target variable

xeq

The variables to be used in the equilibrium term of the error correction model

xtr

The variables to be used in the transient term of the error correction model

includeIntercept

Boolean whether the y-intercept should be included

criterion

Whether AIC (default), BIC, or adjustedR2 should be used to select variables

weights

Optional vector of weights to be passed to the fitting process

keep

Optional character vector of variables to forcibly retain

...

Additional arguments to be passed to the 'lm' function (careful in that these may need to be modified for ecm or may not be appropriate!)

Details

If using weights, the length of weights should be one less than the number of rows in xeq or xtr.

This function only works with the 'lm' linear fitter. The 'earth' linear fitter already does some variable selection, so one can use that via the 'ecm' function.
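
For readers unfamiliar with backwards selection, the general idea can be tried with base R's step function on an ordinary lm. This is a stand-in technique shown on simulated data, not the ecmback implementation: step drops terms one at a time while the criterion improves, using AIC by default and a BIC-style penalty when k = log(n).

```r
# Simulated data: x1 and x2 matter, 'noise' does not (assumed for illustration).
set.seed(11)
n <- 150
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
dat$y <- 3 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# Start from the full model and remove predictors backwards.
full <- lm(y ~ x1 + x2 + noise, data = dat)

selAIC <- step(full, direction = "backward", trace = 0)              # AIC penalty
selBIC <- step(full, direction = "backward", k = log(n), trace = 0)  # BIC penalty

print(names(coef(selAIC)))
print(names(coef(selBIC)))
```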

Value

an lm object representing an error correction model using backwards selection

Examples

##Not run

#Use ecm to predict Wilshire 5000 index based on corporate profits, 
#Federal Reserve funds rate, and unemployment rate
data(Wilshire)

#Use 2015-12-01 and earlier data to build models
trn <- Wilshire[Wilshire$date<='2015-12-01',]

#Use backwards selection to choose which predictors are needed 
xeq <- xtr <- trn[c('CorpProfits', 'FedFundsRate', 'UnempRate')]
modelback <- ecmback(trn$Wilshire5000, xeq, xtr)
print(modelback)
#Backwards selection chose CorpProfits and FedFundsRate in the equilibrium term, 
#CorpProfits and UnempRate in the transient term.

modelbackFFR <- ecmback(trn$Wilshire5000, xeq, xtr, keep = 'UnempRate')
print(modelbackFFR)
#Backwards selection was forced to retain UnempRate in both terms.

Predict using an ecm object

Description

Takes an ecm object and uses it to predict based on new data. This prediction does the undifferencing required to transform the change in y back to y itself.

Usage

ecmpredict(model, newdata, init)

Arguments

model

ecm object used to make predictions

newdata

Data frame on which to predict

init

Initial value(s) for prediction

Details

Since error correction models only model the change in the target variable, an initial value must be specified. Additionally, the 'newdata' parameter should have at least 3 rows of data.
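
The role of the initial value can be seen in a hand-rolled version of the prediction loop. The sketch below fits a one-predictor ECM with lm on simulated data and then rebuilds the level of y recursively: each step predicts the change in y from the previous predicted level and adds it on. It illustrates the undifferencing idea only, under assumed simulation parameters and made-up column names; it is not the ecmpredict source.

```r
# Simulate data from a known ECM and fit it by hand (assumed parameters).
set.seed(3)
n <- 120
x <- cumsum(rnorm(n))
y <- numeric(n)
for (t in 2:n) {
  y[t] <- y[t - 1] + 0.5 * (x[t] - x[t - 1]) -
    0.3 * (y[t - 1] - 2 * x[t - 1]) + rnorm(1, sd = 0.1)
}
d <- data.frame(dy = diff(y), dx = diff(x), yLag1 = y[-n], xLag1 = x[-n])
fit <- lm(dy ~ dx + yLag1 + xLag1, data = d)

# Recursive prediction: start from an initial level, then accumulate
# the predicted changes, feeding each predicted level into the next step.
init <- y[1]
pred <- numeric(n)
pred[1] <- init
for (t in 2:n) {
  step <- data.frame(
    dx    = x[t] - x[t - 1],
    yLag1 = pred[t - 1],   # previous prediction, not the observed y
    xLag1 = x[t - 1]
  )
  pred[t] <- pred[t - 1] + predict(fit, newdata = step)
}

print(head(cbind(actual = y, predicted = pred)))
```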

Value

Numeric predictions on new data based on the ecm object

Examples

##Not run

data(Wilshire)

#Rebuilding model1 from ecm example
trn <- Wilshire[Wilshire$date<='2015-12-01',]
xeq <- xtr <- trn[c('CorpProfits', 'FedFundsRate', 'UnempRate')]
model1 <- ecm(trn$Wilshire5000, xeq, xtr)
model2 <- ecm(trn$Wilshire5000, xeq, xtr, linearFitter='earth')

#Use 2015-12-01 and onwards data as test data to predict
tst <- Wilshire[Wilshire$date>='2015-12-01',]

#Predict on tst using model1 and model2, with the first Wilshire5000 value as the initial value
tst$model1Pred <- ecmpredict(model1, tst, tst$Wilshire5000[1])
tst$model2Pred <- ecmpredict(model2, tst, tst$Wilshire5000[1])

Lag a vector

Description

Create a vector of the lag of a variable and fill missing values with NA's.

Usage

lagpad(x, k = 1)

Arguments

x

A vector to be lagged

k

The number of lags to output

Value

The lagged vector with NA's in missing values
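
A base R sketch of the same idea is shown below: pad the front of the vector with k NA values and trim it back to its original length. The helper name lagSketch is hypothetical, and the package's own implementation may differ in detail.

```r
# Hypothetical helper, not the package's implementation:
# shift x forward by k positions and fill the gap with NA.
lagSketch <- function(x, k = 1) {
  head(c(rep(NA, k), x), length(x))
}

lagged <- lagSketch(1:5, k = 2)
print(lagged)
```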

Build multiple lm models and average them

Description

Builds k lm models on k partitions of the data and averages their coefficients to create one model. Each partition excludes nrow(data)/k observations. See links in the References section for further details on this methodology.

Usage

lmave(formula, data, k, method = "boot", seed = 5, weights = NULL, ...)

Arguments

formula

The formula to be passed to lm

data

The data to be used

k

The number of models or data partitions desired

method

Whether to split data by folds ("fold"), nested folds ("nestedfold"), or bootstrapping ("boot")

seed

Seed for reproducibility (only needed if method is "boot")

weights

Optional vector of weights to be passed to the fitting process

...

Additional arguments to be passed to the 'lm' function

Details

In some cases, especially in some time series modeling (see ecmave function), rather than building one model on the entire dataset, it may be preferable to build multiple models on subsets of the data and average them. The lmave function splits the data into k partitions of size (k-1)/k*nrow(data), builds k models, and then averages the coefficients of these models to get a final model. This is similar to averaging multiple tree regression models in algorithms like random forest.
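
The fold-based variant of this procedure can be sketched in a few lines of base R. The example assumes simulated data and assigns rows to folds in rotation; how lmave itself assigns folds, and how it packages the averaged coefficients into an lm object, is not shown here.

```r
# Simulated data (assumed for illustration).
set.seed(5)
n <- 100
k <- 5
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n, sd = 0.5)

# Assign each row to one of k folds in rotation (an assumption of this sketch).
fold <- rep(seq_len(k), length.out = n)

# Fit one model per fold, each leaving that fold out, so every model
# is trained on (k-1)/k * n rows. Columns of 'coefs' are the k models.
coefs <- sapply(seq_len(k), function(i) {
  coef(lm(y ~ x1 + x2, data = dat[fold != i, ]))
})

# Average each coefficient across the k models.
avgCoef <- rowMeans(coefs)
print(avgCoef)
```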

Unlike the 'ecm' function, this function only works with the 'lm' linear fitter.

Value

an lm object

References

Jung, Y. & Hu, J. (2016). "A K-fold Averaging Cross-validation Procedure". https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5019184/

Cochrane, C. (2018). "Time Series Nested Cross-Validation". https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9

Examples

##Not run

#Build linear models to predict Wilshire 5000 index based on corporate profits, 
#Federal Reserve funds rate, and unemployment rate
data(Wilshire)

#Build one model on the entire dataset
modelall <- lm(Wilshire5000 ~ ., data = Wilshire[-1])

#Build a five fold averaged linear model on the entire dataset
modelave <- lmave('Wilshire5000 ~ .', data = Wilshire[-1], k = 5)
