Type: | Package |
Title: | Multivariate Outlier Explanations using Shapley Values and Mahalanobis Distances |
Version: | 0.1.2 |
Depends: | R (≥ 4.0.0) |
Maintainer: | Marcus Mayrhofer <marcus.mayrhofer@tuwien.ac.at> |
Description: | Based on Shapley values to explain multivariate outlyingness and to detect and impute cellwise outliers. Includes implementations of methods described in Mayrhofer and Filzmoser (2023) <doi:10.1016/j.ecosta.2023.04.003>. |
License: | GPL-3 |
Imports: | dplyr, Rdpack, stats, tibble, tidyr, robustbase, forcats, egg, ggplot2, gridExtra, RColorBrewer, magrittr |
Suggests: | grDevices, cellWise, robustHD, knitr, MASS, rmarkdown |
RdMacros: | Rdpack |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.3.2 |
VignetteBuilder: | knitr |
Repository: | CRAN |
NeedsCompilation: | no |
Packaged: | 2024-10-17 11:38:48 UTC; Marcus Mayrhofer |
Author: | Marcus Mayrhofer [aut, cre], Peter Filzmoser [aut] |
Date/Publication: | 2024-10-17 12:00:34 UTC |
ShapleyOutlier: Multivariate Outlier Explanations using Shapley Values and Mahalanobis Distances
Description
Based on Shapley values to explain multivariate outlyingness and to detect and impute cellwise outliers. Includes implementations of methods described in Mayrhofer and Filzmoser (2023) doi:10.1016/j.ecosta.2023.04.003.
Author(s)
Maintainer: Marcus Mayrhofer marcus.mayrhofer@tuwien.ac.at
Authors:
Peter Filzmoser
Pipe operator
Description
See magrittr::%>%
for details.
Usage
lhs %>% rhs
Arguments
lhs |
A value or the magrittr placeholder. |
rhs |
A function call using the magrittr semantics. |
Value
The result of calling 'rhs(lhs)'.
Detecting cellwise outliers using Shapley values based on local outlyingness.
Description
The MOE
function indicates outlying cells for
a data vector with p
entries or data matrix with n \times p
entries containing only numeric entries x
for a given center mu
and covariance matrix Sigma
using the Shapley value.
It is a more sophisticated alternative to the SCD
algorithm,
which uses the information of the regular cells to derive an alternative reference point (Mayrhofer and Filzmoser 2023).
Usage
MOE(
x,
mu,
Sigma,
Sigma_inv = NULL,
step_size = 0.1,
min_deviation = 0,
max_step = NULL,
local = TRUE,
max_iter = 1000,
q = 0.99,
check_outlyingness = FALSE,
check = TRUE,
cells = NULL,
method = "cellMCD"
)
Arguments
x |
Data vector with |
mu |
Either |
Sigma |
Either |
Sigma_inv |
Either |
step_size |
Numeric. Step size for the imputation of outlying cells, with |
min_deviation |
Numeric. Detection threshold, with |
max_step |
Either |
local |
Logical. If TRUE (default), the non-central Chi-Squared distribution is used to determine the cutoff value based on |
max_iter |
Integer. The maximum number of iterations. |
q |
Numeric. The quantile of the Chi-squared distribution for detection and imputation of outliers. Defaults to |
check_outlyingness |
Logical. If TRUE (default), the outlyingness is rechecked after applying |
check |
Logical. If |
cells |
Either |
method |
Either "cellMCD" (default) or "MCD". Specifies the method used for parameter estimation if |
Value
A list of class shapley_algorithm
(new_shapley_algorithm
) containing the following:
x |
A |
phi |
A |
mu_tilde |
A |
x_original |
A |
x_original |
The non-centrality parameters for the Chi-Squared distribution |
x_history |
A list with |
phi_history |
A list with |
mu_tilde_history |
A list with |
S_history |
A list with |
References
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
Examples
p <- 5
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
Sigma_inv <- solve(Sigma)
x <- c(0,1,2,2.3,2.5)
MOE_x <- MOE(x = x, mu = mu, Sigma = Sigma)
plot(MOE_x)
library(MASS)
set.seed(1)
n <- 100; p <- 10
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
X <- mvrnorm(n, mu, Sigma)
X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50)
MOE_X <- MOE(X, mu, Sigma)
plot(MOE_X, subset = 20)
Detecting cellwise outliers using Shapley values.
Description
The SCD
function indicates outlying cells for
a data vector with p
entries or data matrix with n \times p
entries containing only numeric entries x
for a given center mu
and covariance matrix Sigma
using the Shapley value (Mayrhofer and Filzmoser 2023).
Usage
SCD(
x,
mu,
Sigma,
Sigma_inv = NULL,
step_size = 0.1,
min_deviation = 0,
max_step = NULL,
max_iter = 1000,
q = 0.99,
method = "cellMCD",
check = TRUE,
cells = NULL
)
Arguments
x |
Data vector with |
mu |
Either |
Sigma |
Either |
Sigma_inv |
Either |
step_size |
Numeric. Step size for the imputation of outlying cells, with |
min_deviation |
Numeric. Detection threshold, with |
max_step |
Either |
max_iter |
Integer. The maximum number of iterations. |
q |
Numeric. The quantile of the Chi-squared distribution for detection and imputation of outliers. Defaults to |
method |
Either "cellMCD" (default) or "MCD". Specifies the method used for parameter estimation if |
check |
Logical. If |
cells |
Either |
Value
A list of class shapley_algorithm
(new_shapley_algorithm
) containing the following:
x |
A |
phi |
A |
x_original |
A |
x_history |
The path of how the original data vector was modified. |
phi_history |
The Shapley values corresponding to |
S_history |
The indices of the outlying cells in each iteration. |
References
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
Examples
p <- 5
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
Sigma_inv <- solve(Sigma)
x <- c(0,1,2,2.3,2.5)
SCD_x <- SCD(x = x, mu = mu, Sigma = Sigma)
plot(SCD_x)
library(MASS)
set.seed(1)
n <- 100; p <- 10
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
X <- mvrnorm(n, mu, Sigma)
X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50)
SCD_X <- SCD(X, mu, Sigma)
plot(SCD_X, subset = 20)
Weather data from Vienna
Description
Monthly data from the weather station Hohe Warte since April 1872 - Vienna (Stadt Wien 2022).
Usage
WeatherVienna
Format
A data frame with 1,804 rows and 25 columns:
year
Year
month
Month
t
Daily mean air temperature in °C, (t7 mean + t19 mean + tmax mean + tmin mean)/4; before 1971: t7 mean + t14 mean + 2 x t21 mean)
t_max
Absolute maximum air temperature in °C
t_min
Absolute air temperature minimum in °C
avg_t_max
Mean daily maximum air temperature in °C
avg_t_min
Mean daily minimum air temperature in °C
num_frost
Number of frost days (days with a temperature maximum tmin < 0.0 °C)
num_ice
Number of ice days (days with a temperature maximum tmax < 0.0 °C)
num_summer
Number of summer days (days with a temperature maximum tmax >= 25.0 °C)
num_heat
Number of hot days (days with a temperature maximum tmax >= 30.0 °C)
p
Daily mean air pressure in hPa (mean of all measurements at 7 a.m., 2 p.m., 7 p.m. CET; before 1971 9 p.m. instead of 7 p.m.)
p_max
Maximum air pressure in hPa (maximum of all measurements7 am, 2 pm, 7 pm CET; before 1971 9 pm instead of 7 pm)
p_min
Minimum air pressure in hPa (minimum of all measurements7 am, 2 pm, 7 pm CET; before 1971 9 pm instead of 7 pm)
sun_h
Monthly total sunshine duration in hours
num_clear
Number of clear days (daily mean cloudiness < 20/100)
num_cloud
Number of cloudy days (daily mean cloudiness > 80/100)
rel_hum
Daily mean relative humidity in percent (2 x RH7 mean + RH14 mean + RH19 mean)/4; before 1971 9 p.m. instead of 7 p.m.)
rel_hum_max
Relative humidity maximum in percent
rel_hum_min
Relative humidity minimum in percent
wind_v
Monthly average wind speed in km/h
num_wind_v60
Number of days with wind peaks >= 60 km/h
wind_v_max
Maximum wind speed in km/h
precp_sum
Monthly total precipitation in mm
num_precp_01
Number of days with precipitation >= 0.1 mm
Source
The data were downloaded from https://www.data.gv.at/katalog/dataset/wetter-seit-1872-hohe-warte-wien
in September 2022.
References
Stadt Wien (2022). “Monthly data from the weather station Hohe Warte since April 1872 - Vienna.” https://www.data.gv.at/katalog/dataset/wetter-seit-1872-hohe-warte-wien.
Examples
data("WeatherVienna")
summary(WeatherVienna)
Class constructor for class shapley
.
Description
This function creates an object of class shapley
that is returned by the shapley
function.
Usage
new_shapley(phi = numeric(), mu_tilde = NULL, non_centrality = NULL)
Arguments
phi |
A |
mu_tilde |
Optional. A |
non_centrality |
Optional. The non-centrality parameters for the Chi-Squared distribution,
which are given by |
Value
Named list of class shapley
, containing the input parameters.
Class constructor for class shapley_algorithm
.
Description
This function creates an object of class shapley_algorithm
that is returned
by the SCD
and MOE
functions.
Usage
new_shapley_algorithm(
x = numeric(),
phi = numeric(),
x_original = numeric(),
mu_tilde = NULL,
non_centrality = NULL,
x_history = NULL,
phi_history = NULL,
mu_tilde_history = NULL,
S_history = NULL
)
Arguments
x |
A |
phi |
A |
x_original |
A |
mu_tilde |
Optional. A |
non_centrality |
Optional. The non-centrality parameters for the Chi-Squared distribution,
which are given by |
x_history |
Optional. A list with |
phi_history |
Optional. A list with |
mu_tilde_history |
Optional. A list with |
S_history |
Optional. A list with |
Value
Named list of class shapley_algorithm
, containing the input parameters.
Class constructor for class shapley_interaction
.
Description
This function creates an object of class shapley_interaction
that is returned
by the shapley_interaction
function.
Usage
new_shapley_interaction(PHI = numeric())
Arguments
PHI |
A |
Value
Matrix of class shapley_interaction
, containing input matrix PHI
.
Barplot of Shapley values
Description
Barplot of Shapley values
Usage
## S3 method for class 'shapley'
plot(
x,
subset = NULL,
chi2.q = 0.99,
abbrev.var = 3,
abbrev.obs = 10,
sort.var = FALSE,
sort.obs = FALSE,
plot_md = TRUE,
md_squared = TRUE,
rotate_x = TRUE,
...
)
Arguments
x |
A list of class |
subset |
Either an integer, |
chi2.q |
Quantile, only used if |
abbrev.var |
Integer. If |
abbrev.obs |
Integer. If |
sort.var |
Logical. If |
sort.obs |
Logical. If |
plot_md |
Logical. If |
md_squared |
Logical. If |
rotate_x |
Logical. If |
... |
Optional arguments passed to methods. |
Value
Returns a barplot that displays the Shapley values (shapley
)for each observation and optionally (plot_md = TRUE
)
includes the squared Mahalanobis distance (black bar) and the corresponding (non-)central chi-square quantile (dotted line).
Examples
library(MASS)
set.seed(1)
n <- 100; p <- 10
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
X <- mvrnorm(n, mu, Sigma)
X_clean <- X
X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50)
call_shapley <- shapley(X, mu, Sigma)
plot(call_shapley, subset = 1:20)
plot(call_shapley, subset = 5, rotate_x = FALSE)
plot(call_shapley, subset = 5, md_squared = FALSE, rotate_x = FALSE)
Barplot and tileplot of Shapley values.
Description
Barplot and tileplot of Shapley values.
Usage
## S3 method for class 'shapley_algorithm'
plot(
x,
type = "both",
subset = NULL,
abbrev.var = FALSE,
abbrev.obs = FALSE,
sort.var = FALSE,
sort.obs = FALSE,
n_digits = 2,
rotate_x = TRUE,
continuous_rowname = FALSE,
...
)
Arguments
x |
A list of class |
type |
Either |
subset |
Either an integer, |
abbrev.var |
Integer. If |
abbrev.obs |
Integer. If |
sort.var |
Logical. If |
sort.obs |
Logical. If |
n_digits |
Integer. If |
rotate_x |
Logical. If |
continuous_rowname |
Logical. If |
... |
Arguments passed on to |
Value
Returns plots for a list of class shapley_algorithm
.
If type
is "bar"
, a barplot is generated. It displays the Shapley values (shapley
)
for each observation and optionally (plot_md = TRUE
) includes the squared Mahalanobis distance (black bar)
and the corresponding (non-)central chi-square quantile (dotted line).
If type
is "cell"
a tileplot is generated. It displays each cells of the dataset and shows the original value from the observations,
color coding indicates whether those values were higher (red) or lower (blue) than the imputed values,
and the color intensity is based on the magnitude of the Shapley value.
If type
is "both"
, the barplot and the tileplot are generated.
Examples
library(MASS)
set.seed(1)
n <- 100; p <- 10
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
X <- mvrnorm(n, mu, Sigma)
X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50)
MOE_X <- MOE(X, mu, Sigma)
plot(MOE_X, subset = 20, n_digits = 0)
Plot of Shapley interaction indices
Description
Plot of Shapley interaction indices
Usage
## S3 method for class 'shapley_interaction'
plot(
x,
abbrev = 4,
title = "Shapley Interaction",
legend = TRUE,
text_size = 22,
...
)
Arguments
x |
A |
abbrev |
Integer. If |
title |
Character. Title of the plot. |
legend |
Logical. If TRUE (default), a legend is plotted. |
text_size |
Integer. Size of the text in the plot |
... |
Optional arguments passed to methods. |
Value
Returns a figure consisting of two panels. The upper panel shows the Shapley values, and the lower panel the Shapley interaction indices.
Examples
p <- 5
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
Sigma_inv <- solve(Sigma)
x <- c(0,1,2,2.3,2.5)
PHI <- shapley_interaction(x, mu, Sigma)
plot(PHI)
Print function for class shapley
.
Description
Print function for class shapley
.
Usage
## S3 method for class 'shapley'
print(x, ...)
Arguments
x |
List of class |
... |
Optional arguments passed to methods. |
Value
Prints the list entries of x
that are not NULL
.
Print function for class shapley_algorithm
.
Description
Print function for class shapley_algorithm
.
Usage
## S3 method for class 'shapley_algorithm'
print(x, ...)
Arguments
x |
List of class |
... |
Optional arguments passed to methods. |
Value
Prints the imputed data and the Shapley values.
Print function for class shapley_interaction
.
Description
Print function for class shapley_interaction
.
Usage
## S3 method for class 'shapley_interaction'
print(x, ...)
Arguments
x |
Matrix of class |
... |
Optional arguments passed to methods. |
Value
Prints the Shapley interaction indices.
Decomposition of squared Mahalanobis distance using Shapley values.
Description
The shapley
function computes a p
-dimensional vector containing the decomposition of the
squared Mahalanobis distance of x
(with respect to mu
and Sigma
)
into outlyingness contributions of the individual variables (Mayrhofer and Filzmoser 2023).
The value of the j
-th coordinate of this vector represents the
average marginal contribution of the j
-th variable to the squared Mahalanobis distance of
the individual observation x
.
If cells
is provided, Shapley values of x
are computed with respect to a local reference point,
that is based on a cellwise prediction of each coordinate, using the information of the regular cells of x
, see (Mayrhofer and Filzmoser 2023).
If x
is a n \times p
matrix, a n \times p
matrix is returned, containing the decomposition for each row.
Usage
shapley(
x,
mu = NULL,
Sigma = NULL,
inverted = FALSE,
method = "cellMCD",
check = TRUE,
cells = NULL
)
Arguments
x |
Data vector with |
mu |
Either |
Sigma |
Either |
inverted |
Logical. If |
method |
Either "cellMCD" (default) or "MCD". Specifies the method used for parameter estimation if |
check |
Logical. If |
cells |
Either |
Value
phi |
A |
mu_tilde |
A |
non_centrality |
The non-centrality parameters for the Chi-Squared distribution, given by |
References
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
Examples
## Without outlying cells as input in the 'cells' argument#'
# Single observation
p <- 5
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
Sigma_inv <- solve(Sigma)
x <- c(0,1,2,2.3,2.5)
shapley(x, mu, Sigma)
phi <- shapley(x, mu, Sigma_inv, inverted = TRUE)
plot(phi)
# Multiple observations
library(MASS)
set.seed(1)
n <- 100; p <- 10
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
X <- mvrnorm(n, mu, Sigma)
X_clean <- X
X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50)
call_shapley <- shapley(X, mu, Sigma)
plot(call_shapley, subset = 20)
## Giving outlying cells as input in the 'cells' argument
# Single observation
p <- 5
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
Sigma_inv <- solve(Sigma)
x <- c(0,1,2,2.3,2.5)
call_shapley <- shapley(x, mu, Sigma_inv, inverted = TRUE,
method = "cellMCD", check = TRUE, cells = c(1,1,0,0,0))
plot(call_shapley)
# Multiple observations
library(MASS)
set.seed(1)
n <- 100; p <- 10
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
X <- mvrnorm(n, mu, Sigma)
X_clean <- X
X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50)
call_shapley <- shapley(X, mu, Sigma, cells = (X_clean - X)!=0)
plot(call_shapley, subset = 20)
Decomposition of squared Mahalanobis distance using Shapley interaction indices.
Description
The shapley_interaction
function computes a p \times p
matrix
containing pairwise outlyingness scores based on Shapley interaction indices.
It decomposes the squared Mahalanobis distance of x
(with respect to mu
and Sigma
)
into outlyingness contributions of pairs of variables (Mayrhofer and Filzmoser 2023).
Usage
shapley_interaction(x, mu, Sigma, inverted = FALSE)
Arguments
x |
Data vector with |
mu |
Either |
Sigma |
Either |
inverted |
Logical. If |
Value
A p \times p
matrix containing the decomposition of the squared Mahalanobis distance of x
into outlyingness scores for pairs of variables with respect to mu
and Sigma
.
References
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
Examples
p <- 5
mu <- rep(0,p)
Sigma <- matrix(0.9, p, p); diag(Sigma) = 1
Sigma_inv <- solve(Sigma)
x <- c(0,1,2,2.3,2.5)
shapley_interaction(x, mu, Sigma)
PHI <- shapley_interaction(x, mu, Sigma_inv, inverted = TRUE)
plot(PHI)