Title: Basic Robust Statistics
Description: "Essential" Robust Statistics. Tools allowing one to analyze data with robust methods. This includes regression methodology, including model selection, and multivariate statistics, where we strive to cover the book "Robust Statistics, Theory and Methods" by Maronna, Martin and Yohai; Wiley 2006.
Authors: Martin Maechler [aut, cre], Peter Rousseeuw [ctb] (Qn and Sn), Christophe Croux [ctb] (Qn and Sn), Valentin Todorov [aut] (most robust Cov), Andreas Ruckstuhl [aut] (nlrob, anova, glmrob), Matias Salibian-Barrera [aut] (lmrob orig.), Tobias Verbeke [ctb, fnd] (mc, adjbox), Manuel Koller [aut] (mc, lmrob, psi-func.), Eduardo L. T. Conceicao [aut] (MM-, tau-, CM-, and MTL- nlrob), Maria Anna di Palma [ctb] (initial version of Comedian)
Maintainer: Martin Maechler <[email protected]>
License: GPL (>= 2)
Version: 0.99-5
Built: 2024-11-01 13:27:41 UTC
Source: https://github.com/r-forge/robustbase
Produces boxplots adjusted for skewed distributions as proposed in Hubert and Vandervieren (2008).
adjbox(x, ...)

## S3 method for class 'formula'
adjbox(formula, data = NULL, ..., subset, na.action = NULL)

## Default S3 method:
adjbox(x, ..., range = 1.5, doReflect = FALSE,
       width = NULL, varwidth = FALSE,
       notch = FALSE, outline = TRUE, names, plot = TRUE,
       border = par("fg"), col = NULL, log = "",
       pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),
       horizontal = FALSE, add = FALSE, at = NULL)
formula: a formula, such as y ~ grp, where y is a numeric vector of data values to be split into groups according to the grouping variable grp (usually a factor).

data: a data.frame (or list) from which the variables in formula should be taken.

subset: an optional vector specifying a subset of observations to be used for plotting.

na.action: a function which indicates what should happen when the data contain NAs. The default is to ignore missing values in either the response or the group.

x: for specifying data from which the boxplots are to be produced. Either a numeric vector, or a single list containing such vectors. Additional unnamed arguments specify further data as separate vectors (each corresponding to a component boxplot). NAs are allowed in the data.

...: for the formula method, named arguments to be passed to the default method. For the default method, unnamed arguments are additional data vectors (unless x is a list, when they are ignored), and named arguments are arguments and graphical parameters to be passed to bxp() in addition to the ones given by argument pars (and which override those in pars).

range: this determines how far the plot whiskers extend out from the box, and is simply passed as argument coef to adjboxStats().

doReflect: logical indicating if the medcouple mc() should also be computed on the reflected sample -x; see mc.

width: a vector giving the relative widths of the boxes making up the plot.

varwidth: if varwidth is TRUE, the boxes are drawn with widths proportional to the square roots of the number of observations in the groups.

notch: if notch is TRUE, a notch is drawn in each side of the boxes. If the notches of two plots do not overlap, this is 'strong evidence' that the two medians differ.

outline: if outline is not true, the outliers are not drawn.

names: group labels which will be printed under each boxplot.

boxwex: a scale factor to be applied to all boxes. When there are only a few groups, the appearance of the plot can be improved by making the boxes narrower.

staplewex: staple line width expansion, proportional to box width.

outwex: outlier line width expansion, proportional to box width.

plot: if TRUE (the default), a boxplot is produced. If not, the summaries which the boxplots are based on are returned.

border: an optional vector of colors for the outlines of the boxplots. The values in border are recycled if the length of border is less than the number of plots.

col: if col is non-null, it is assumed to contain colors to be used to colour the bodies of the box plots.

log: character indicating if x or y or both coordinates should be plotted in log scale.

pars: a list of (potentially many) more graphical parameters; these are passed to bxp() (if plot is true); for details, see there.

horizontal: logical indicating if the boxplots should be horizontal; default FALSE means vertical boxes.

add: logical; if true, add the boxplot to the current plot.

at: numeric vector giving the locations where the boxplots should be drawn, particularly when add = TRUE; defaults to 1:n where n is the number of boxes.
The generic function adjbox currently has a default method (adjbox.default) and a formula interface (adjbox.formula).

If multiple groups are supplied either as multiple arguments or via a formula, parallel boxplots will be plotted, in the order of the arguments or the order of the levels of the factor (see factor).

Missing values are ignored when forming boxplots.

Extremes of the upper and lower whiskers of the adjusted boxplots are computed using the medcouple (mc()), a robust measure of skewness; for details, see Hubert and Vandervieren (2008) in the references.
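As a minimal illustration (a sketch; it only assumes robustbase is attached), the summaries described under 'Value' below can be obtained without drawing anything, via plot = FALSE:

## Inspect the returned components without plotting:
set.seed(1)
str(adjbox(rexp(50), plot = FALSE))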
A list with the following components:

stats: a matrix, each column of which contains the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker for one group/plot. If all the inputs have the same class attribute, so will this component.

n: a vector with the number of observations in each group.

conf: a matrix where each column contains the lower and upper extremes of the notch.

out: the values of any data points which lie beyond the extremes of the whiskers.

group: a vector of the same length as out whose elements indicate to which group the outlier belongs.

names: a vector of names for the groups.
The code and documentation are only slight modifications of those of boxplot.default, boxplot.formula and boxplot.stats.

R Core Development Team; slightly adapted by Tobias Verbeke.
Hubert, M. and Vandervieren, E. (2008). An adjusted boxplot for skewed distributions, Computational Statistics and Data Analysis 52, 5186–5201. doi:10.1016/j.csda.2007.11.008
if(require("boot")) { ### Hubert and Vandervieren (2008), Fig. 5.%(2006): p. 10, Fig. 4. data(coal, package = "boot") coaldiff <- diff(coal$date) op <- par(mfrow = c(1,2)) boxplot(coaldiff, main = "Original Boxplot") adjbox(coaldiff, main = "Adjusted Boxplot") par(op) } ### Hubert and Vandervieren (2008), p. 11, Fig. 7a -- enhanced op <- par(mfrow = c(2,2), mar = c(1,3,3,1), oma = c(0,0,3,0)) with(condroz, { boxplot(Ca, main = "Original Boxplot") adjbox (Ca, main = "Adjusted Boxplot") boxplot(Ca, main = "Original Boxplot [log]", log = "y") adjbox (Ca, main = "Adjusted Boxplot [log]", log = "y") }) mtext("'Ca' from data(condroz)", outer=TRUE, font = par("font.main"), cex = 2) par(op)
if(require("boot")) { ### Hubert and Vandervieren (2008), Fig. 5.%(2006): p. 10, Fig. 4. data(coal, package = "boot") coaldiff <- diff(coal$date) op <- par(mfrow = c(1,2)) boxplot(coaldiff, main = "Original Boxplot") adjbox(coaldiff, main = "Adjusted Boxplot") par(op) } ### Hubert and Vandervieren (2008), p. 11, Fig. 7a -- enhanced op <- par(mfrow = c(2,2), mar = c(1,3,3,1), oma = c(0,0,3,0)) with(condroz, { boxplot(Ca, main = "Original Boxplot") adjbox (Ca, main = "Adjusted Boxplot") boxplot(Ca, main = "Original Boxplot [log]", log = "y") adjbox (Ca, main = "Adjusted Boxplot [log]", log = "y") }) mtext("'Ca' from data(condroz)", outer=TRUE, font = par("font.main"), cex = 2) par(op)
Computes the "statistics" for producing boxplots adjusted for skewed distributions as proposed in Hubert and Vandervieren (2008), see adjbox.
adjboxStats(x, coef = 1.5, a = -4, b = 3, do.conf = TRUE, do.out = TRUE, ...)
x: a numeric vector for which adjusted boxplot statistics are computed.

coef: number determining how far 'whiskers' extend out from the box, see boxplot.stats.

a, b: scaling factors multiplied by the medcouple mc() to determine the fence, see Details.

do.conf, do.out: logicals; if FALSE, the conf or out component, respectively, will be empty in the result.

...: further optional arguments to be passed to mc(), such as doReflect.
Given the quartiles Q1 and Q3, the interquartile range IQR = Q3 - Q1, the medcouple M = mc(x), and c = coef, the "fence" is defined, for M >= 0, as

  [ Q1 - c * exp(a * M) * IQR,  Q3 + c * exp(b * M) * IQR ],

and for M < 0 as

  [ Q1 - c * exp(-b * M) * IQR,  Q3 + c * exp(-a * M) * IQR ],

and all observations x outside the fence, the "potential outliers", are returned in out.
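As a sketch of this definition, the fence can be recomputed by hand and compared with the fence component returned by adjboxStats(); the use of fivenum()-style hinges (as in boxplot.stats()) is an assumption here, so treat this as illustrative only:

## Recompute the fence by hand (illustrative sketch; assumes fivenum() hinges):
set.seed(7)
x  <- rexp(100)                       # a right-skewed sample
fn <- fivenum(x)
Q1 <- fn[2]; Q3 <- fn[4]; IQR <- Q3 - Q1
M  <- mc(x); cf <- 1.5; a <- -4; b <- 3   # the default tuning values
fence <- if (M >= 0) {
  c(Q1 - cf * exp( a * M) * IQR, Q3 + cf * exp( b * M) * IQR)
} else {
  c(Q1 - cf * exp(-b * M) * IQR, Q3 + cf * exp(-a * M) * IQR)
}
rbind(byHand = fence, adjboxStats = adjboxStats(x)$fence)  # should closely agree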
Note that a typo in robustbase versions up to 0.7-8, for the (rare, left-skewed) case where mc(x) < 0, led to a "fence" not wide enough in the upper part, and hence fewer outliers there.
A list with the components

stats: a vector of length 5, containing the extreme of the lower whisker, the lower hinge, the median, the upper hinge and the extreme of the upper whisker.

n: the number of observations.

conf: the lower and upper extremes of the 'notch' (if do.conf is true).

fence: a length-2 vector of interval boundaries which define the non-outliers, and hence the whiskers of the plot.

out: the values of any data points which lie beyond the fence, and hence beyond the extremes of the whiskers.
The code is only a slight modification of the code of R's boxplot.stats.

R Core Development Team (boxplot.stats); adapted by Tobias Verbeke and Martin Maechler.

See also adjbox(), the function which mainly uses this one (and also for references); further boxplot.stats.
data(condroz)
adjboxStats(ccA <- condroz[,"Ca"])
adjboxStats(ccA, doReflect = TRUE)  # small difference in fence

## Test reflection invariance [was not ok, up to and including robustbase_0.7-8]
a1 <- adjboxStats( ccA, doReflect = TRUE)
a2 <- adjboxStats(-ccA, doReflect = TRUE)
nm1 <- c("stats", "conf", "fence")
stopifnot(all.equal(a1[nm1], lapply(a2[nm1], function(u) rev(-u))),
          all.equal(a1[["out"]], -a2[["out"]]))
For an n-by-p data matrix (or data frame) x, compute the "outlyingness" of all n observations. Outlyingness here is a generalization of the Donoho-Stahel outlyingness measure, where skewness is taken into account via the medcouple, mc().
adjOutlyingness(x, ndir = 250, p.samp = p, clower = 4, cupper = 3,
                IQRtype = 7,
                alpha.cutoff = 0.75, coef = 1.5,
                qr.tol = 1e-12, keep.tol = 1e-12,
                only.outlyingness = FALSE, maxit.mult = max(100, p),
                trace.lev = 0,
                mcReflect = n <= 100, mcScale = TRUE, mcMaxit = 2*maxit.mult,
                mcEps1 = 1e-12, mcEps2 = 1e-15, mcTrace = max(0, trace.lev-1))
x: a numeric matrix or data.frame of dimension n x p.

ndir: positive integer specifying the number of directions that should be searched.

p.samp: the sample size to use for finding good random directions, must be at least p. The default is p.

clower, cupper: the constants to be used for the lower and upper tails, in order to transform the data towards symmetry. You can set clower = 0, cupper = 0 to get the non-adjusted, i.e., classical ("central") outlyingness; in that case, mc() is not used.

IQRtype: a number from 1 to 9, selecting the quantile type used for the IQR computation, see quantile.

alpha.cutoff: number in (0,1) specifying the quantiles (alpha, 1 - alpha) which determine the "outlier" cutoff.

coef: positive number specifying the factor with which the interquartile range (IQR) is multiplied to determine the boundary of the outlier cutoff.

qr.tol: positive tolerance to be used for qr() and solve.qr() for determining the rank of x.

keep.tol: positive tolerance to determine which of the sample directions should be kept, namely only those that are numerically non-degenerate.

only.outlyingness: logical indicating if the final outlier determination should be skipped. In that case, a vector is returned, see 'Value:' below.

maxit.mult: integer factor; maxit.mult * ndir determines the maximal number of direction-searching iterations.

trace.lev: an integer; if positive, allows one to monitor the direction search.

mcReflect: passed as doReflect to mc().

mcScale: passed as doScale to mc().

mcMaxit: passed as maxit to mc().

mcEps1: passed as eps1 to mc().

mcEps2: passed as eps2 to mc().

mcTrace: passed as trace.lev to mc().
FIXME: Details are in the comments of the original Matlab code, and in the reference(s).

The method as described can be useful as preprocessing in FASTICA (http://research.ics.aalto.fi/ica/fastica/; see also the R package fastICA).
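To convey the idea, here is a sketch of the skewness-adjusted outlyingness along a single projection direction; the overall adjusted outlyingness is then the maximum over all searched directions. proj1out() is a hypothetical helper name, and the package's actual algorithm differs in details (tail constants clower/cupper, IQRtype, scaling), so this is illustrative only:

## Hypothetical helper: adjusted outlyingness along one direction 'd'
## (a sketch, not the package's internal code):
proj1out <- function(x, d) {
  z <- drop(as.matrix(x) %*% (d / sqrt(sum(d^2))))  # project onto unit vector
  f <- adjboxStats(z)$fence                         # skewness-adjusted bounds
  m <- median(z)
  ifelse(z >= m, (z - m) / (f[2] - m), (m - z) / (m - f[1]))
}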
If only.outlyingness is true, a vector adjout; otherwise, as by default, a list with components

adjout: numeric vector of length n giving the adjusted outlyingness of each observation.

cutoff: cutoff for an "outlier" with respect to the adjusted outlyingnesses, depending on alpha.cutoff.

nonOut: logical vector of length n, TRUE when the corresponding observation is a non-outlier, i.e., when its adjusted outlyingness is smaller than the cutoff.
If there are too many degrees of freedom for the projections, i.e., when n is not sufficiently larger than p, the current definition of adjusted outlyingness is ill-posed, as one of the projections may lead to a denominator (quartile difference) of zero, and hence formally to an adjusted outlyingness of infinity. The current implementation avoids Inf results, but will then return seemingly random, very large adjout values which may be completely misleading; see, e.g., the longley data example.
The result is random as it depends on the sample of ndir directions chosen; specifically, to get subsamples the algorithm uses sample.int(n, p.samp), which from R version 3.6.0 on depends on RNGkind(*, sample.kind). Exact reproducibility of results from R versions 3.5.3 and earlier requires setting RNGversion("3.5.0"). In any case, do use set.seed() yourself for reproducibility!
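For example (a minimal sketch; any fixed seed will do), two runs with the same seed use the same random directions and hence reproduce the same result:

set.seed(2024)
ao1 <- adjOutlyingness(longley)
set.seed(2024)
ao2 <- adjOutlyingness(longley)
stopifnot(all.equal(ao1$adjout, ao2$adjout))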
Till Aug./Oct. 2014, the default values for clower and cupper were accidentally reversed, and the signs inside exp(.) were swapped in the (now corrected) two expressions

  tup <- Q3 + coef * IQR * exp(.... + clower * tmc * (tmc < 0))
  tlo <- Q1 - coef * IQR * exp(.... - cupper * tmc * (tmc < 0))

already in the code from Antwerpen ('mcrsoft/adjoutlingness.R'), contrary to the published reference.

Further, the original algorithm had not been scale-equivariant in the direction construction; this has been amended in 2014-10 as well.

The results, including the diagnosed outliers, have therefore changed, typically slightly, since robustbase version 0.92-0.
Guy Brys; help page and improvements by Martin Maechler.

Brys, G., Hubert, M., and Rousseeuw, P.J. (2005) A Robustification of Independent Component Analysis; Journal of Chemometrics, 19, 1–12.

Hubert, M. and Van der Veeken, S. (2008) Outlier detection for skewed data; Journal of Chemometrics 22, 235–246; doi:10.1002/cem.1123.

For the up-to-date reference, please consult https://wis.kuleuven.be/statdatascience/robust

See also the adjusted boxplot, adjbox, and the medcouple, mc.
## An example with bad condition number and "border case" outliers
dim(longley)  # 16 x 7 // set seed, as result is random :
set.seed(31)
ao1 <- adjOutlyingness(longley, mcScale=FALSE)
## which are outlying ?
which(!ao1$nonOut)  ## for this seed, two: "1956", "1957"; (often: none)
## For seeds 1:100, we observe (Linux 64b)
if(FALSE) {
  adjO <- sapply(1:100, function(iSeed) {
            set.seed(iSeed); adjOutlyingness(longley)$nonOut })
  table(nrow(longley) - colSums(adjO))
}
## #{outl.}: 0  1 2 3
## #{cases}: 74 17 6 3

## An example with outliers :
dim(hbk)
set.seed(1)
ao.hbk <- adjOutlyingness(hbk)
str(ao.hbk)
hist(ao.hbk $adjout)  ## really two groups
table(ao.hbk$nonOut)  ## 14 outliers, 61 non-outliers:
## outliers are :
which(! ao.hbk$nonOut)  # 1 .. 14 --- but not for all random seeds!

## here, they are(*) the same as found by (much faster) MCD:
## *) only "almost", since the 2023-05 change to covMcd()
cc <- covMcd(hbk)
table(cc = cc$mcd.wt, ao = ao.hbk$nonOut)  # one differs..:
stopifnot(sum(cc$mcd.wt != ao.hbk$nonOut) <= 1)

## This is revealing: about 1--2 cases, where outliers are *not* == 1:14
## (2023: ~ 1/8 [sec] per call)
if(interactive()) {
  for(i in 1:30) {
    print(system.time(ao.hbk <- adjOutlyingness(hbk)))
    if(!identical(iout <- which(!ao.hbk$nonOut), 1:14)) {
      cat("Outliers:\n"); print(iout)
    }
  }
}

## "Central" outlyingness: *not* calling mc() anymore, since 2014-12-11:
trace(mc)
out <- capture.output( oo <- adjOutlyingness(hbk, clower=0, cupper=0) )
untrace(mc)
stopifnot(length(out) == 0)

## A rank-deficient case
T <- tcrossprod(data.matrix(toxicity))
try(adjOutlyingness(T, maxit. = 20, trace.lev = 2))  # fails and recommends:
T. <- fullRank(T)
aT <- adjOutlyingness(T.)
plot(sort(aT$adjout, decreasing=TRUE), log="y")
plot(T.[,9:10], col = (1:2)[1 + (aT$adjout > 10000)])
## .. (not conclusive; directions are random, more 'ndir' makes a difference!)
The Aircraft Data deal with 23 single-engine aircraft built over the years 1947-1979, from the Office of Naval Research. The dependent variable is cost (in units of $100,000) and the explanatory variables are aspect ratio, lift-to-drag ratio, weight of the plane (in pounds) and maximal thrust.
data(aircraft, package="robustbase")
A data frame with 23 observations on the following 5 variables.

X1: Aspect Ratio
X2: Lift-to-Drag Ratio
X3: Weight
X4: Thrust
Y: Cost
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, page 154, table 22.
data(aircraft)
summary( lm.airc <-      lm(Y ~ ., data = aircraft))
summary(rlm.airc <- MASS::rlm(Y ~ ., data = aircraft))

aircraft.x <- data.matrix(aircraft[,1:4])
c_air <- covMcd(aircraft.x)
c_air
Air Quality Data Set for May 1973, from Chambers et al. (1983). The whole data set consists of daily readings of air quality values from May 1, 1973 to September 30, 1973, but only the values for May are included here. This data set serves as an example of the special treatment of missing values.
data(airmay, package="robustbase")
A data frame with 31 observations on the following 4 variables.

X1: Solar radiation in Langleys in the frequency band 4000-7700, from 0800 to 1200 hours at Central Park
X2: Average windspeed (in miles per hour) at 0700 and 1000 hours at La Guardia Airport
X3: Maximum daily temperature (in degrees Fahrenheit) at La Guardia Airport
Y: Mean ozone concentration (in parts per billion) from 1300 to 1500 hours at Roosevelt Island
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, p.86, table 6.
data(airmay)
summary(lm.airmay <- lm(Y ~ ., data = airmay))

airmay.x <- data.matrix(airmay[,1:3])
The solubility of alcohols in water is important in understanding alcohol transport in living organisms. This dataset from Romanelli et al. (2001) contains physicochemical characteristics of 44 aliphatic alcohols. The aim of the experiment was the prediction of the solubility on the basis of molecular descriptors.
data(alcohol, package="robustbase")
A data frame with 44 observations on the following 7 numeric variables.

SAG: solvent-accessible surface-bounded molecular volume
V: volume
logPC: Log(PC), where PC = octanol-water partition coefficient
P: polarizability
RM: molar refractivity
Mass: the mass
logSolubility: ln(Solubility), the response
The website accompanying the MMY-book: https://www.wiley.com/legacy/wileychi/robust_statistics/
Maronna, R.A., Martin, R.D. and Yohai, V.J. (2006) Robust Statistics, Theory and Methods, Wiley.
data(alcohol)
## version of data set with trivial names:
s.alcohol <- alcohol
names(s.alcohol) <- paste("Col", 1:7, sep='')
This dataset contains daily means (from midnight to midnight) of NOx, i.e., mono-nitrogen oxides, in [ppb] at 13 sites in central Switzerland and Aarau for the year 2004.
data(ambientNOxCH, package="robustbase")
A data frame with 366 observations on the following 14 variables.

date: date of day, of class "Date".

ad: Site is located north of Altdorf, 100 meters east of motorway A2, on an open field at the beginning of a more than 2000 m deep valley (690.175, 193.55; 438; inLuft).

ba: Site is located in the centre of the little town of Baden in a residential area. Baden has 34'000 inhabitants and is situated on the Swiss plateau (666.075, 257.972; 377; inLuft).

ef: Site is located 6 km south of Altdorf and 800 m north of the village of Erstfeld. The motorway A2 passes 5 m west of the measuring site. Over 8 million vehicles passed Erstfeld in 2004, 13% of the counts being attributed to trucks (691.43, 187.69; 457; MFM-U).

la: Site is located on a wooded hill in a rural area called Laegern, about 190 m above Baden, which is about 5 km away (669.8, 259; 690; NABEL).

lu: Site is located in the center of the town of Lucerne, which has 57'000 inhabitants (666.19, 211.975; 460; inLuft).

re: Site is located 1 km west of Reiden on the Swiss plateau. The motorway A2 passes 5 m west of the measuring site (639.56, 232.11; 462; MFM-U).

ri: Site is located at Rigi Seebodenalp, 649 m above the lake of Lucerne, on an alp with half a dozen small houses (677.9, 213.5; 1030; NABEL).

se: Site is located in Sedel next to the town of Lucerne, 35 m above and 250 m south of motorway A14 from Zug to Lucerne, on a low hill with a free 360° panorama (665.5, 213.41; 484; inLuft).

si: Site is located at the border of a small industrial area in Sisseln, 300 m east of a main road (640.725, 266.25; 305; inLuft).

st: Site is located at the south-east border of Stans with 7'000 inhabitants (670.85, 201.025; 438; inLuft).

su: Site is located in the center of Suhr (8'700 inhabitants), 10 m from the main road (648.49, 246.985; 403; inLuft).

sz: Site is located in Schwyz (14'200 inhabitants) near a shopping center (691.92, 208.03; 470; inLuft).

zg: Site is located in the centre of Zug with 22'000 inhabitants, 24 m from the main road (681.625, 224.625; 420; inLuft).
The 13 sites are part of one of the three air quality monitoring networks:

inLuft (regional authorities of central Switzerland and canton Aargau)

NABEL (Swiss federal network)

MFM-U (Monitoring flankierende Massnahmen Umwelt), a special Swiss federal network along the transit motorways A2 and A13 from Germany to Italy through Switzerland

The information within the parentheses above is: Swiss coordinates (km east, km north); m above sea level; network.
When the measuring sites are exposed to the same atmospheric conditions and there is no singular emission event at any site, log(mean(NOx) of a specific day at each site) is a linear function of log(yearly.mean(NOx) at the corresponding site). The offset and the slope of this straight line reflect the atmospheric conditions on that specific day. During winter time, an inversion often prevents the emissions from being diluted vertically, so that two separate atmospheric compartments evolve: one below the inversion boundary with polluted air and one above it with relatively clean air. In the example below, Rigi Seebodenalp is above the inversion boundary between December 10th and 12th.
http://www.in-luft.ch/
http://www.empa.ch/plugin/template/empa/*/6794
http://www.bafu.admin.ch/umweltbeobachtung/02272/02280
See also another NOx dataset, NOxEmissions.
data(ambientNOxCH)
str (ambientNOxCH)
yearly <- log(colMeans(ambientNOxCH[,-1], na.rm=TRUE))
xlim <- range(yearly)
lNOx <- log(ambientNOxCH[, -1])
days <- ambientNOxCH[, "date"]

## Subset of 9 days starting at December 4:
idays <- seq(which(ambientNOxCH$date=="2004-12-04"), length=9)
ylim <- range(lNOx[idays,], na.rm=TRUE)
op <- par(mfrow=c(3,3), mar=rep(1,4), oma = c(0,0,2,0))
for (id in idays) {
  daily <- unlist(lNOx[id,])
  plot(NA, xlim=xlim, ylim=ylim, ann=FALSE, type = "n")
  abline(0:1, col="light gray")
  abline(lmrob(daily~yearly, na.action=na.exclude), col="red", lwd=2)
  text(yearly, daily, names(yearly), col="blue")
  mtext(days[id], side=1, line=-1.2, cex=.75, adj=.98)
}
mtext("Daily ~ Yearly log( NOx mean values ) at 13 Swiss locations",
      outer=TRUE)
par(op)

## do all 366 regressions: Least Squares and Robust:
LS <- lapply(1:nrow(ambientNOxCH), function(id)
             lm(unlist(lNOx[id,]) ~ yearly, na.action = na.exclude))
R <- lapply(1:nrow(ambientNOxCH), function(id)
            lmrob(unlist(lNOx[id,]) ~ yearly, na.action = na.exclude))
## currently 4 warnings about non-convergence;
## which ones?
days[notOk <- ! sapply(R, `[[`, "converged") ]
## "2004-01-10" "2004-05-12" "2004-05-16" "2004-11-16"

## first problematic case:
daily <- unlist(lNOx[which(notOk)[1],])
plot(daily ~ yearly, main = paste("lmrob() non-convergent:", days[notOk[1]]))
rr <- lmrob(daily ~ yearly, na.action = na.exclude,
            control = lmrob.control(trace=3, max.it = 100))  ##-> 53 iter.

## Look at all coefficients:
R.cf <- t(sapply(R, coef))
C.cf <- t(sapply(LS, coef))
plot(C.cf, xlim=range(C.cf[,1], R.cf[,1]), ylim=range(C.cf[,2], R.cf[,2]))
mD1 <- rowMeans(abs(C.cf - R.cf))
lrg <- mD1 > quantile(mD1, 0.80)
arrows(C.cf[lrg,1], C.cf[lrg,2], R.cf[lrg,1], R.cf[lrg,2],
       length=.1, col="light gray")
points(R.cf, col=2)

## All robustness weights
aW <- t(sapply(R, weights, type="robustness"))
colnames(aW) <- names(yearly)
summary(aW)
sort(colSums(aW < 0.05, na.rm = TRUE))  # how often "clear outlier":
# lu st zg ba se sz su si re la ef ad ri
#  0  0  0  1  1  1  2  3  4 10 14 17 48
lattice::levelplot(aW, asp=1/2, main="Robustness weights",
                   xlab= "day", ylab= "site")
A data frame with average brain and body weights for 62 species of land mammals and three others. Note that this is simply the union of MASS's Animals and mammals.

data(Animals2, package="robustbase")
body: body weight in kg
brain: brain weight in g
After loading the MASS package, the data set is simply constructed by

  Animals2 <- local({D <- rbind(Animals, mammals);
                     unique(D[order(D$body, D$brain),])})

Rousseeuw and Leroy (1987)'s 'brain' data is the same as MASS's Animals (with Rat and Brachiosaurus interchanged, see the example below).
Weisberg, S. (1985) Applied Linear Regression. 2nd edition. Wiley, pp. 144–5.

P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection. Wiley, p. 57.

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer.
data(Animals2)
## Sensible plot needs doubly logarithmic scale
plot(Animals2, log = "xy")

## Regression example plot:
plotbb <- function(bbdat) {
  d.name <- deparse(substitute(bbdat))
  plot(log(brain) ~ log(body), data = bbdat, main = d.name)
  abline(       lm(log(brain) ~ log(body), data = bbdat))
  abline(MASS::rlm(log(brain) ~ log(body), data = bbdat), col = 2)
  legend("bottomright", leg = c("lm", "rlm"), col=1:2, lwd=1, inset = 1/20)
}
plotbb(bbdat = Animals2)

## The `same' plot for Rousseeuw's subset:
data(Animals, package = "MASS")
brain <- Animals[c(1:24, 26:25, 27:28),]
plotbb(bbdat = brain)

lbrain <- log(brain)
plot(mahalanobis(lbrain, colMeans(lbrain), var(lbrain)),
     main = "Classical Mahalanobis Distances")
mcd <- covMcd(lbrain)
plot(mahalanobis(lbrain, mcd$center, mcd$cov),
     main = "Robust (MCD) Mahalanobis Distances")
Compute an analysis of robust quasi-deviance table for one or more generalized linear models fitted by glmrob.
## S3 method for class 'glmrob' anova(object, ..., test = c("Wald", "QD", "QDapprox"))
object, ...: objects of class "glmrob", typically the result of a call to glmrob().

test: a character string specifying the test statistic to be used, (partially) matching one of "Wald", "QD" or "QDapprox". See Details.
Specifying a single object gives a sequential analysis of robust quasi-deviance table for that fit. That is, the reductions in the robust residual quasi-deviance as each term of the formula is added in turn are given as the rows of a table. (Currently not yet implemented.)

If more than one object is specified, the table has a row for the residual quasi-degrees of freedom (however, this information is never used in the asymptotic tests). For all but the first model, the change in degrees of freedom and robust quasi-deviance is also given. (This only makes statistical sense if the models are nested.) It is conventional to list the models from smallest to largest, but this is up to the user.
In addition, the table will contain test statistics and P values comparing the reduction in robust quasi-deviance for the model on each row to that of the model above it. For all robust fitting methods, the "Wald"-type test between two models can be applied (test = "Wald").

When using Mallows or Huber type robust estimators (method = "Mqle" in glmrob), there are additional test methods. One is the robust quasi-deviance test (test = "QD"), as described by Cantoni and Ronchetti (2001). The asymptotic distribution is approximated by a chi-square distribution. Another test (test = "QDapprox") is based on a quadratic approximation of the robust quasi-deviance test statistic. Its asymptotic distribution is chi-square (see the reference).

The comparison between two or more models by anova.glmrob will only be valid if they are fitted to the same dataset and by the same robust fitting method using the same tuning constant (tcc in glmrob).
Basically, an object of class "anova" inheriting from class "data.frame".
Andreas Ruckstuhl
E. Cantoni and E. Ronchetti (2001) Robust Inference for Generalized Linear Models. JASA 96 (455), 1022–1030.
E. Cantoni (2004) Analysis of Robust Quasi-deviances for Generalized Linear Models. Journal of Statistical Software 10, https://www.jstatsoft.org/article/view/v010i04
## Binomial response -----------
data(carrots)

Cfit2 <- glmrob(cbind(success, total-success) ~ logdose + block,
                family = binomial, data = carrots, method = "Mqle",
                control = glmrobMqle.control(tcc = 1.2))
summary(Cfit2)

Cfit4 <- glmrob(cbind(success, total-success) ~ logdose * block,
                family = binomial, data = carrots, method = "Mqle",
                control = glmrobMqle.control(tcc = 1.2))
anova(Cfit2, Cfit4, test = "Wald")
anova(Cfit2, Cfit4, test = "QD")
anova(Cfit2, Cfit4, test = "QDapprox")

## Poisson response ------------
data(epilepsy)

Efit2 <- glmrob(Ysum ~ Age10 + Base4*Trt, family = poisson, data = epilepsy,
                method = "Mqle",
                control = glmrobMqle.control(tcc = 1.2, maxit = 100))
summary(Efit2)

Efit3 <- glmrob(Ysum ~ Age10 + Base4 + Trt, family = poisson, data = epilepsy,
                method = "Mqle",
                control = glmrobMqle.control(tcc = 1.2, maxit = 100))
anova(Efit3, Efit2, test = "Wald")
anova(Efit3, Efit2, test = "QD")

## trivial intercept-only-model:
E0 <- update(Efit3, . ~ 1)
anova(E0, Efit3, Efit2, test = "QDapprox")
Compute an analysis of robust Wald-type or deviance-type test tables for one or more linear regression models fitted by lmrob.
## S3 method for class 'lmrob' anova(object, ..., test = c("Wald", "Deviance"), verbose = getOption("verbose"))
object, ...: objects of class "lmrob", typically the result of a call to lmrob().

test: a character string specifying the test statistic to be used; one of "Wald" or "Deviance", with partial matching.

verbose: logical; if true, some informative messages are printed.
Specifying a single object gives a sequential analysis of a robust quasi-deviance table for that fit. That is, the reductions in the robust residual deviance as each term of the formula is added in turn are given as the rows of a table. (Currently not yet implemented.)

If more than one object is specified, the table has a row for the residual quasi-degrees of freedom (however, this information is never used in the asymptotic tests). For all but the first model, the change in degrees of freedom and robust deviance is also given. (This only makes statistical sense if the models are nested.) As opposed to the usual convention, the models must be listed from largest to smallest, for computational reasons.
In addition, the table will contain test statistics and P values comparing the reduction in robust deviances for the model on each row to that of the model above it. There are two different robust tests available: the "Wald"-type test (test = "Wald") and the deviance-type test (test = "Deviance"). When using a formula description of the nested models in the dot arguments together with test = "Deviance", an error message may urge you to supply an lmrob fit for these models. This happens when the coefficients of the largest model, reduced to the nested models, result in invalid initial estimates for the nested models (indicated by robustness weights which are all 0).

The comparison between two or more models by anova.lmrob will only be valid if they are fitted to the same dataset.

Basically, an object of class "anova" inheriting from class "data.frame".
Andreas Ruckstuhl
data(salinity)
summary(m0.sali <- lmrob(Y ~ . , data = salinity))
anova(m0.sali, Y ~ X1 + X3)
## -> X2 is not needed
(m1.sali <- lmrob(Y ~ X1 + X3, data = salinity))
anova(m0.sali, m1.sali)  # the same as before
anova(m0.sali, m1.sali, test = "Deviance")
## whereas 'X3' is highly significant:
m2 <- update(m0.sali, ~ . -X3)
anova(m0.sali, m2)
anova(m0.sali, m2, test = "Deviance")

## Global test [often not interesting]:
anova(m0.sali, update(m0.sali, . ~ 1), test = "Wald")
anova(m0.sali, update(m0.sali, . ~ 1), test = "Deviance")

if(require("MPV")) {  ## Montgomery, Peck & Vining datasets
  Jet <- table.b13
  Jet.rflm1 <- lmrob(y ~ ., data = Jet,
                     control = lmrob.control(max.it = 500))
  summary(Jet.rflm1)

  anova(Jet.rflm1, y ~ x1 + x5 + x6, test = "Wald")

  try( anova(Jet.rflm1, y ~ x1 + x5 + x6, test = "Deviance") )
  ## -> Error in anovaLm....  Please fit the nested models by lmrob
  ## {{ since all robustness weights become 0 in the nested model ! }}

  ## Ok: Do as the error message told us:
  ## test by comparing the two *fitted* models:
  Jet.rflm2 <- lmrob(y ~ x1 + x5 + x6, data = Jet,
                     control = lmrob.control(max.it = 100))
  anova(Jet.rflm1, Jet.rflm2, test = "Deviance")
} # end{"MPV" data}
An agricultural experiment in which different tillage methods were implemented. The effects of tillage on plant (maize) biomass were subsequently determined by modeling biomass accumulation for each tillage treatment using a 3-parameter Weibull function.

A dataset where the total biomass is modeled conditional on a three-level factor, and hence vector parameters are used.
data("biomassTill", package="robustbase")
data("biomassTill", package="robustbase")
A data frame with 58 observations on the following 3 variables.

Tillage: tillage treatments, a factor with levels

  CA-: a no-tillage system with plant residues removed
  CA+: a no-tillage system with plant residues retained
  CT:  a conventionally tilled system with residues incorporated

DVS: the development stage of the maize crop. A DVS of 1 represents maize anthesis (flowering), and a DVS of 2 represents physiological maturity. For the data, a numeric vector with 5 different values between 0.5 and 2.

Biomass: accumulated biomass of maize plants from each tillage treatment.

Biom.2: the same as Biomass, but with three values replaced by "gross errors".
From Strahinja Stepanovic and John Laborde, Department of Agronomy & Horticulture, University of Nebraska-Lincoln, USA
data(biomassTill)
str(biomassTill)
require(lattice)
## With long tailed errors
xyplot(Biomass ~ DVS | Tillage, data = biomassTill, type = c("p","smooth"))
## With additional 2 outliers:
xyplot(Biom.2 ~ DVS | Tillage, data = biomassTill, type = c("p","smooth"))

### Fit nonlinear Regression models: -----------------------------------

## simple starting values, needed:
m00st <- list(Wm = rep(300, 3), a = rep(1.5, 3), b = rep(2.2, 3))

robm <- nlrob(Biomass ~ Wm[Tillage] * (-expm1(-(DVS/a[Tillage])^b[Tillage])),
              data = biomassTill, start = m00st, maxit = 200)
summary(robm)  ## ... 103 IRWLS iterations
plot(sort(robm$rweights), log = "y",
     main = "ordered robustness weights (log scale)")
mtext(getCall(robm))

## the classical (only works for the mild outliers):
cl.m <- nls(Biomass ~ Wm[Tillage] * (-expm1(-(DVS/a[Tillage])^b[Tillage])),
            data = biomassTill, start = m00st)

## now for the extra-outlier data: -- fails with singular gradient !!
try( rob2 <- nlrob(Biom.2 ~ Wm[Tillage] * (-expm1(-(DVS/a[Tillage])^b[Tillage])),
                   data = biomassTill, start = m00st) )
## use better starting values:
m1st <- setNames(as.list(as.data.frame(matrix(coef(robm), 3))),
                 c("Wm", "a", "b"))
try(  # just breaks a bit later!
  rob2 <- nlrob(Biom.2 ~ Wm[Tillage] * (-expm1(-(DVS/a[Tillage])^b[Tillage])),
                data = biomassTill, start = m1st, maxit = 200, trace = TRUE)
)

## Comparison {more to come}
rbind(start = unlist(m00st),
      class = coef(cl.m),
      rob   = coef(robm))
This data set was used by Campbell (1984) to locate bushfire scars. The dataset contains satellite measurements on five frequency bands, corresponding to each of 38 pixels.
data(bushfire, package="robustbase")
A data frame with 38 observations on 5 variables.
Maronna, R.A. and Yohai, V.J. (1995) The Behaviour of the Stahel-Donoho Robust Multivariate Estimator. Journal of the American Statistical Association 90, 330–341.
data(bushfire)
plot(bushfire)
covMcd(bushfire)
Computation of the estimator of Bianco and Yohai (1996) in logistic regression. Now provides both the weighted and regular (unweighted) BY-estimator.
By default, an intercept term is included and p parameters are estimated. For more details, see the reference.
Note: This function is for "back-compatibility" with the BYlogreg() code web-published at KU Leuven, Belgium, and also available as file 'FunctionsRob/BYlogreg.ssc' from https://www.wiley.com/legacy/wileychi/robust_statistics/robust.html. However, instead of using this function, the recommended interface is glmrob(*, method = "BY") or glmrob(*, method = "WBY"), see glmrob.
BYlogreg(x0, y, initwml = TRUE, addIntercept = TRUE,
         const = 0.5, kmax = 1000, maxhalf = 10, sigma.min = 1e-4,
         trace.lev = 0)
x0: a numeric matrix of explanatory variables.

y: numeric vector of binary (0 or 1) responses.

initwml: logical for selecting one of the two possible methods for computing the initial value of the optimization process. If initwml is TRUE (the default), a weighted ML estimator is computed, with weights derived from the MCD estimator computed on the explanatory variables. If initwml is FALSE, a classical ML fit is performed.

addIntercept: logical indicating that a column of 1s should be prepended to x0, i.e., that an intercept term is added.

const: tuning constant used in the computation of the estimator (default = 0.5).

kmax: maximum number of iterations before convergence (default = 1000).

maxhalf: maximal number of step-halvings (default = 10).

sigma.min: smallest value of the scale parameter before implosion (and hence non-convergence) is assumed.

trace.lev: logical (or integer) indicating if intermediate results should be printed; defaults to 0 (the same as FALSE).
a list with components

convergence: logical indicating if convergence was achieved.

objective: the value of the objective function at the minimum.

coefficients: vector of parameter estimates.

vcov: variance-covariance matrix of the coefficients (if convergence is TRUE).

sterror: standard errors of the coefficients (if convergence is TRUE).
Originally, Christophe Croux and Gentiane Haesbroeck, with thanks to Kristel Joossens and Valentin Todorov for improvements.
Speedup, tweaks, more “control” arguments: Martin Maechler.
Croux, C., and Haesbroeck, G. (2003) Implementing the Bianco and Yohai estimator for Logistic Regression, Computational Statistics and Data Analysis 44, 273–295.
Ana M. Bianco and Víctor J. Yohai (1996) Robust estimation in the logistic regression model. In Helmut Rieder, Robust Statistics, Data Analysis, and Computer Intensive Methods, Lecture Notes in Statistics 109, pages 17–34.
The more typical way to compute BY-estimates (via formula and methods): glmrob(*, method = "WBY") and glmrob(*, method = "BY").
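As a sketch of that recommended interface (using the 'vaso' data that also appears in the Examples below):

data(vaso)
m.wby <- glmrob(Y ~ log(Volume) + log(Rate), family = binomial,
                data = vaso, method = "WBY")
coef(m.wby)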
set.seed(17)
x0 <- matrix(rnorm(100,1))
y <- rbinom(100, size=1, prob= 0.5)  # ~= as.numeric(runif(100) > 0.5)
BY <- BYlogreg(x0, y)
BY <- BYlogreg(x0, y, trace.lev=TRUE)

## The "Vaso Constriction" aka "skin" data:
data(vaso)
vX <- model.matrix( ~ log(Volume) + log(Rate), data=vaso)
vY <- vaso[,"Y"]
head(cbind(vX, vY))  # 'X' does include the intercept

vWBY <- BYlogreg(x0 = vX, y = vY, addIntercept=FALSE)  # as 'vX' has it already
v.BY <- BYlogreg(x0 = vX, y = vY, addIntercept=FALSE, initwml=FALSE)
## they are relatively close; well, they used to be closer than now,
## with the (2023-05, VT) change of covMcd() scale-correction
stopifnot( all.equal(vWBY, v.BY, tolerance = 0.008) )  # was ~ 1e-4 till 2023-05
The damage carrots data set from Phelps (1982) was used by McCullagh and Nelder (1989) in order to illustrate diagnostic techniques because of the presence of an outlier. In a soil experiment trial with three blocks, eight levels of insecticide were applied and the carrots were tested for insect damage.
data(carrots, package="robustbase")
A data frame with 24 observations on the following 4 variables.

success: integer giving the number of carrots with insect damage.
total: integer giving the total number of carrots per experimental unit.
logdose: a numeric vector giving log(dose) values (eight different levels only).
block: factor with levels B1 to B3.
Phelps, K. (1982) Use of the complementary log-log function to describe dose-response relationships in insecticide evaluation field trials. In R. Gilchrist (Ed.), Lecture Notes in Statistics, No. 14. GLIM.82: Proceedings of the International Conference on Generalized Linear Models; Springer-Verlag.
McCullagh P. and Nelder, J. A. (1989) Generalized Linear Models. London: Chapman and Hall.
Eva Cantoni and Elvezio Ronchetti (2001), JASA, and Eva Cantoni (2004), JSS; see glmrob.
data(carrots)
str(carrots)
plot(success/total ~ logdose, data = carrots, col = as.integer(block))
coplot(success/total ~ logdose | block, data = carrots)

## Classical glm
Cfit0 <- glm(cbind(success, total-success) ~ logdose + block,
             data = carrots, family = binomial)
summary(Cfit0)

## Robust Fit (see help(glmrob)) ....
To modify an object of class psi_func, i.e., typically to change the tuning parameters, the generic function chgDefaults() is called and works via the corresponding method.

For signature object = "psi_func", the method is used to change the default values of the tuning parameters, and returns an object of class psi_func, a copy of the input object with the slot tDefs possibly changed.
## Hampel's psi and rho:
H.38 <- chgDefaults(hampelPsi, k = c(1.5, 3.5, 8))
H.38
plot(H.38)
## for more see ?psiFunc
Compute classical principal components (PC) via SVD (svd) or eigenvalue decomposition (eigen) with non-trivial rank determination.
classPC(x, scale = FALSE, center = TRUE, signflip = TRUE, via.svd = n > p, scores = FALSE)
x: a numeric matrix.

scale: logical indicating if the matrix should be scaled; it is mean centered in any case.

center: logical or numeric vector for "centering" the matrix.

signflip: logical indicating if the sign of the loadings should be flipped such that the absolutely largest value is always positive.

via.svd: logical indicating if the computation is via SVD or eigen decomposition; the latter typically makes sense only for n <= p.

scores: logical indicating if the scores should be returned.
a list with components

rank: the (numerical) matrix rank of x; an integer, say k.

eigenvalues: the k eigenvalues, proportional to the variances, where k is the rank above.

loadings: the loadings, a p x k matrix.

scores: if the scores argument was true, the n x k matrix of scores.

center: a numeric p-vector of means, unless center was false.

scale: if the scale argument was not false, the scales used, a p-vector.
Valentin Todorov; efficiency tweaks by Martin Maechler.

In spirit very similar to R's standard prcomp and princomp, one of the main differences being how the rank is determined, namely via a non-trivial tolerance.
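For instance (a sketch; the comparison is up to column signs because of signflip, and the close agreement is an expectation under default centering and no scaling, not a documented guarantee), the loadings should essentially match prcomp()'s rotation:

## Compare loadings with prcomp() up to column signs (illustrative only):
set.seed(42)
X  <- matrix(rnorm(60), 12, 5)
cp <- classPC(X)
pr <- prcomp(X)   # both center, neither scales, by default
all.equal(abs(unname(cp$loadings)),
          abs(unname(pr$rotation[, seq_len(cp$rank)])), tolerance = 1e-7)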
set.seed(17)
x <- matrix(rnorm(120), 10, 12)  # n < p  {the unusual case}
pcx <- classPC(x)
(k <- pcx$rank)  # = 9  [after centering!]
pc2 <- classPC(x, scores=TRUE)
pcS <- classPC(x, via.svd=TRUE)
all.equal(pcx, pcS, tol = 1e-8)  ## TRUE: eigen() & svd() based PC are close here
pc0 <- classPC(x, center=FALSE, scale=TRUE)
pc0$rank  # = 10 here, *no* centering (as E[.] = 0)

## Loadings are orthonormal:
zapsmall( crossprod( pcx$loadings ) )

## PC Scores are roughly orthogonal:
S.S <- crossprod(pc2$scores)
print.table(signif(zapsmall(S.S), 3), zero.print=".")
stopifnot(all.equal(pcx$eigenvalues, diag(S.S)/k))

## the usual n > p case :
pc.x <- classPC(t(x))
pc.x$rank  # = 10, full rank in the n > p case

cpc1 <- classPC(cbind(1:3))  # 1-D matrix
stopifnot(cpc1$rank == 1,
          all.equal(cpc1$eigenvalues, 1),
          all.equal(cpc1$loadings, 1))
This data set contains measurements concerning the cloud point of a liquid, from Draper and Smith (1969). The cloud point is a measure of the degree of crystallization in a stock.
data(cloud, package="robustbase")
A data frame with 19 observations on the following 2 variables.

Percentage: percentage of I-8
CloudPoint: cloud point
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, p.96, table 10.
data(cloud)
summary(lm.cloud <- lm(CloudPoint ~ ., data = cloud))
Contains information on 20 schools from the Mid-Atlantic and New England states, drawn from a population studied by Coleman et al. (1966). Mosteller and Tukey (1977) analyze this sample consisting of measurements on six different variables, one of which will be treated as a response.
data(coleman, package="robustbase")
A data frame with 20 observations on the following 6 variables.

salaryP: staff salaries per pupil
fatherWc: percent of white-collar fathers
sstatus: socioeconomic status composite deviation: means for family size, family intactness, father's education, mother's education, and home items
teacherSc: mean teacher's verbal test score
motherLev: mean mother's educational level, one unit being equal to two school years
Y: verbal mean test score (y, all sixth graders)
Valentin Todorov
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection Wiley, p.79, table 2.
data(coleman)
pairs(coleman)
summary( lm.coleman <-     lm(Y ~ . , data = coleman))
summary(lts.coleman <- ltsReg(Y ~ . , data = coleman))

coleman.x <- data.matrix(coleman[, 1:6])
(Cc <- covMcd(coleman.x))
Calculates the median for each row (column) of a matrix x. This is the same as, but more efficient than, apply(x, MM, median) for MM = 1 (MM = 2), respectively.
colMedians(x, na.rm = FALSE, hasNA = TRUE, keep.names=TRUE) rowMedians(x, na.rm = FALSE, hasNA = TRUE, keep.names=TRUE)
x: a numeric n x p matrix.

na.rm: logical; if TRUE, NAs are excluded before the medians are computed, see also hasNA.

hasNA: logical indicating if x may contain NAs; if set to FALSE, no internal NA handling is performed, see Value.

keep.names: logical indicating if row or column names of x should become names of the result.
The implementation of rowMedians() and colMedians() is optimized for both speed and memory. To avoid coercion to doubles (and hence memory allocation), there is a special implementation for integer matrices. That is, if x is an integer matrix, then rowMedians(as.double(x)) (colMedians(as.double(x))) would require three times the memory of rowMedians(x) (colMedians(x)), but all this is avoided.
a numeric vector of length n or p, respectively.

Missing values are excluded before calculating the medians unless hasNA is false. Note that na.rm has no effect and is automatically false when hasNA is false, i.e., internally, before computations start, the following is executed:

  if (!hasNA)    ## If there are no NAs, don't try to remove them
      narm <- FALSE
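As a small sketch of this integer fast path (illustrative; actual memory use and timings are platform dependent), integer input gives the same medians as the explicitly coerced double matrix:

im <- matrix(1:12, 3, 4)                 # an integer matrix
stopifnot(all.equal(colMedians(im), colMedians(im + 0.0)))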
Henrik Bengtsson, Harris Jaffee, Martin Maechler
See wgt.himedian()
for a weighted hi-median, and
colWeightedMedians()
etc from package
matrixStats for weighted medians.
For mean estimates, see rowMeans() on the colSums() help page.
set.seed(1); n <- 234; p <- 543 # n*p = 127'062
x <- matrix(rnorm(n*p), n, p)
x[sample(seq_along(x), size= n*p / 256)] <- NA
R1 <- system.time(r1 <- rowMedians(x, na.rm=TRUE))
C1 <- system.time(y1 <- colMedians(x, na.rm=TRUE))
R2 <- system.time(r2 <- apply(x, 1, median, na.rm=TRUE))
C2 <- system.time(y2 <- apply(x, 2, median, na.rm=TRUE))
R2 / R1 # speedup factor: ~= 4   {platform dependent}
C2 / C1 # speedup factor: ~= 5.8 {platform dependent}
stopifnot(all.equal(y1, y2, tol=1e-15),
          all.equal(r1, r2, tol=1e-15))
(m <- cbind(x1=3, x2=c(4:1, 3:4, 4)))
stopifnot(colMedians(m) == 3,
          all.equal(colMeans(m), colMedians(m)), # <- including names!
          all.equal(rowMeans(m), rowMedians(m)))
Dataset with pH-value and Calcium content in soil samples, collected in different communities of the Condroz region in Belgium. The data pertain to a subset of 428 samples with a pH-value between 7.0 and 7.5.
data(condroz, package="robustbase")
A data frame with 428 observations on the following 2 variables.
Ca
Calcium content of the soil sample
pH
pH value of the soil sample
For more information on the dataset, cf. Goegebeur et al. (2005).
Hubert and Vandervieren (2006), p. 10. This dataset is also studied in Vandewalle et al. (2004).
See also those for adjbox
.
Goegebeur, Y., Planchon, V., Beirlant, J., Oger, R. (2005). Quality Assessment of Pedochemical Data Using Extreme Value Methodology, Journal of Applied Science, 5, p. 1092-1102.
Vandewalle, B., Beirlant, J., Hubert, M. (2004). A robust estimator of the tail index based on an exponential regression model, in Hubert, M., Pison G., Struyf, A. and S. Van Aelst, ed., Theory and Applications of Recent Robust Methods, Birkhäuser, Basel, p. 367-376.
adjbox(condroz$Ca)
Compute (versions of) the (multivariate) “Comedian” covariance, i.e., a multivariate location and scatter estimator.
covComed(X, n.iter = 2, reweight = FALSE, tolSolve = control$tolSolve,
         trace = control$trace, wgtFUN = control$wgtFUN,
         control = rrcov.control())
X |
data matrix of dimension, say |
n.iter |
number of comedian() iterations. Can be as low as zero. |
reweight |
logical indicating if the final distances and weights
should be recomputed from the final |
tolSolve |
a numerical tolerance passed to |
trace |
logical (or integer) indicating if intermediate results
should be printed; defaults to |
wgtFUN |
a character string or |
control |
a list with estimation options - this includes those
above provided in the function specification, see
|
.. not yet ..
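Until these details are filled in, a minimal sketch of the basic pairwise “comedian” of Falk (1997), on which covComed() builds, may help; the helpers comedianPair() and comedianMat() are our own illustration, not part of robustbase:

## COM(X,Y) = med{ (X - med X)(Y - med Y) }   (Falk 1997)
comedianPair <- function(x, y)
    median((x - median(x)) * (y - median(y)))
## p x p comedian matrix of a data matrix X (columns = variables):
comedianMat <- function(X) {
    p <- ncol(X)
    S <- matrix(NA_real_, p, p)
    for (j in 1:p) for (k in 1:p)
        S[j, k] <- comedianPair(X[, j], X[, k])
    S
}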
an object of class "covComed"
which is basically a list with components
comp1 |
Description of 'comp1' |
comp2 |
Description of 'comp2' |
... FIXME ...
Maria Anna di Palma (initial), Valentin Todorov and Martin Maechler
Falk, M. (1997) On mad and comedians. Annals of the Institute of Statistical Mathematics 49, 615–644.
Falk, M. (1998). A note on the comedian for elliptical distributions. Journal of Multivariate Analysis 67, 306–317.
covMcd
, etc
data(hbk)
hbk.x <- data.matrix(hbk[, 1:3])
(cc1 <- covComed(hbk.x))
(ccW <- covComed(hbk.x, reweight=TRUE))
cc0  <- covComed(hbk.x, n.iter=0)
cc0W <- covComed(hbk.x, n.iter=0, reweight=TRUE)
stopifnot(all.equal(unclass(cc0), # here, the 0-1 weights don't change:
                    cc0W[names(cc0)], tol=1e-12, check.environment = FALSE),
          which(cc1$weights == 0) == 1:14,
          which(ccW$weights == 0) == 1:14,
          which(cc0$weights == 0) == 1:14)
## Martin's smooth reweighting:
## List of experimental pre-specified wgtFUN() creators:
## Cutoffs may depend on (n, p, control$beta):
str(.wgtFUN.covComed)
Compute the Minimum Covariance Determinant (MCD) estimator, a robust multivariate location and scale estimate with a high breakdown point, via the ‘Fast MCD’ or ‘Deterministic MCD’ (“DetMcd”) algorithm.
covMcd(x, cor = FALSE, raw.only = FALSE,
       alpha =, nsamp =, nmini =, kmini =,
       scalefn =, maxcsteps =,
       initHsets = NULL, save.hsets = FALSE, names = TRUE,
       seed =, tolSolve =, trace =,
       use.correction =, wgtFUN =, control = rrcov.control())
x |
a matrix or data frame. |
cor |
should the returned result include a correlation matrix?
Default is |
raw.only |
should only the “raw” estimate be returned, i.e., no (re)weighting step be performed; default is false. |
alpha |
numeric parameter controlling the size of the subsets
over which the determinant is minimized; roughly |
nsamp |
number of subsets used for initial estimates or For |
nmini , kmini
|
for |
scalefn |
for the deterministic MCD: |
maxcsteps |
maximal number of concentration steps in the deterministic MCD; should not be reached. |
initHsets |
NULL or a |
save.hsets |
(for deterministic MCD) logical indicating if the
initial subsets should be returned as |
names |
logical; if true (as by default), several parts of the
result have a |
seed |
initial seed for random generator, like
|
tolSolve |
numeric tolerance to be used for inversion
( |
trace |
logical (or integer) indicating if intermediate results
should be printed; defaults to |
use.correction |
whether to use finite sample correction
factors; defaults to |
wgtFUN |
a character string or |
control |
a list with estimation options - this includes those
above provided in the function specification, see
|
The minimum covariance determinant estimator of location and scatter implemented in covMcd() is similar to R function cov.mcd() in MASS. The MCD method looks for the h (= h.alpha.n(alpha, n, p)) observations (out of n) whose classical covariance matrix has the lowest possible determinant. The raw MCD estimate of location is then the average of these h points, whereas the raw MCD estimate of scatter is their covariance matrix, multiplied by a consistency factor (.MCDcons(p, h/n)) and (if use.correction is true) a finite sample correction factor (.MCDcnp2(p, n, alpha)), to make it consistent at the normal model and unbiased at small samples. Both rescaling factors (consistency and finite sample) are returned in the length-2 vector raw.cnp2.
The implementation of covMcd
uses the Fast MCD algorithm of
Rousseeuw and Van Driessen (1999) to approximate the minimum
covariance determinant estimator.
Based on these raw MCD estimates, (unless argument raw.only
is
true), a reweighting step is performed, i.e., V <- cov.wt(x,w)
,
where w
are weights determined by “outlyingness” with
respect to the scaled raw MCD. Again, a consistency factor and
(if use.correction
is true) a finite sample correction factor
(.MCDcnp2.rew(p, n, alpha)
) are applied.
The reweighted covariance is typically considerably more efficient
than the raw one, see Pison et al. (2002).
The two rescaling factors for the reweighted estimates are returned in
cnp2
. Details for the computation of the finite sample
correction factors can be found in Pison et al. (2002).
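As a sketch, that reweighting can be re-done “by hand” from the raw estimates; here we assume hard-rejection 0/1 weights with the usual chi-squared cutoff qchisq(0.975, p) (the package's exact default wgtFUN may differ, and the rescaling by cnp2 is omitted):

data(hbk)
X  <- data.matrix(hbk[, 1:3])
cc <- covMcd(X)
D2 <- mahalanobis(X, cc$raw.center, cc$raw.cov)
w  <- as.numeric(D2 <= qchisq(0.975, df = ncol(X))) # 0/1 "outlyingness" weights
V  <- cov.wt(X, wt = w) # reweighted location and scatter, before rescaling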
An object of class "mcd"
which is basically a
list
with components
center |
the final estimate of location. |
cov |
the final estimate of scatter. |
cor |
the (final) estimate of the correlation matrix (only if
|
crit |
the value of the criterion, i.e., the logarithm of the determinant. Prior to Nov. 2014, it contained the determinant itself, which can under- or overflow relatively easily. |
best |
the best subset found and used for computing the raw
estimates, with |
mah |
mahalanobis distances of the observations using the final estimate of the location and scatter. |
mcd.wt |
weights of the observations using the final estimate of the location and scatter. |
cnp2 |
a vector of length two containing the consistency correction factor and the finite sample correction factor of the final estimate of the covariance matrix. |
raw.center |
the raw (not reweighted) estimate of location. |
raw.cov |
the raw (not reweighted) estimate of scatter. |
raw.mah |
mahalanobis distances of the observations based on the raw estimate of the location and scatter. |
raw.weights |
weights of the observations based on the raw estimate of the location and scatter. |
raw.cnp2 |
a vector of length two containing the consistency correction factor and the finite sample correction factor of the raw estimate of the covariance matrix. |
X |
the input data as numeric matrix, without |
n.obs |
total number of observations. |
alpha |
the size of the subsets over which the determinant is
minimized (the default is |
quan |
the number of observations, |
method |
character string naming the method (Minimum Covariance
Determinant), starting with |
iBest |
(for the deterministic MCD) contains indices from 1:6 denoting which of the (six) initial subsets lead to the best set found. |
n.csteps |
(for the deterministic MCD) for each of the initial subsets, the number of C-steps executed till convergence. |
call |
the call used (see |
Valentin Todorov [email protected], based on work written for S-plus by Peter Rousseeuw and Katrien van Driessen from University of Antwerp.
Visibility of (formerly internal) tuning parameters, notably
wgtFUN()
: Martin Maechler
Rousseeuw, P. J. and Leroy, A. M. (1987) Robust Regression and Outlier Detection. Wiley.
Rousseeuw, P. J. and van Driessen, K. (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212–223.
Pison, G., Van Aelst, S., and Willems, G. (2002) Small Sample Corrections for LTS and MCD, Metrika 55, 111–123.
Hubert, M., Rousseeuw, P. J. and Verdonck, T. (2012) A deterministic algorithm for robust location and scatter. Journal of Computational and Graphical Statistics 21, 618–637.
cov.mcd from package MASS; covOGK as a cheaper alternative for larger dimensions; BACON and covNNC from package robustX.
data(hbk)
hbk.x <- data.matrix(hbk[, 1:3])
set.seed(17)
(cH <- covMcd(hbk.x))
cH0 <- covMcd(hbk.x, nsamp = "deterministic")
with(cH0, stopifnot(quan == 39,
                    iBest == c(1:4, 6), # 5 out of 6 gave the same
                    identical(raw.weights, mcd.wt),
                    identical(which(mcd.wt == 0), 1:14),
                    all.equal(crit, -1.045500594135)))
## the following three statements are equivalent
c1 <- covMcd(hbk.x, alpha = 0.75)
c2 <- covMcd(hbk.x, control = rrcov.control(alpha = 0.75))
## direct specification overrides control one:
c3 <- covMcd(hbk.x, alpha = 0.75, control = rrcov.control(alpha = 0.95))
c1
## Martin's smooth reweighting:
## List of experimental pre-specified wgtFUN() creators:
## Cutoffs may depend on (n, p, control$beta):
str(.wgtFUN.covMcd)
cMM <- covMcd(hbk.x, wgtFUN = "sm1.adaptive")
ina <- which(names(cH) == "call")
all.equal(cMM[-ina], cH[-ina]) # *some* differences, not huge (same 'best'):
stopifnot(all.equal(cMM[-ina], cH[-ina], tol = 0.2))
Computes the orthogonalized pairwise covariance matrix estimate described in Maronna and Zamar (2002). The pairwise proposal goes back to Gnanadesikan and Kettenring (1972).
covOGK(X, n.iter = 2, sigmamu, rcov = covGK,
       weight.fn = hard.rejection, keep.data = FALSE, ...)
covGK (x, y, scalefn = scaleTau2, ...)
s_mad(x, mu.too = FALSE, na.rm = FALSE)
s_IQR(x, mu.too = FALSE, na.rm = FALSE)
X |
data in something that can be coerced into a numeric matrix. |
n.iter |
number of orthogonalization iterations. Usually 1 or 2; values greater than 2 are unlikely to have any significant effect on the estimate (other than increasing the computing time). |
sigmamu , scalefn
|
a function that computes univariate robust
location and scale estimates. By default it should return a single
numeric value containing the robust scale (standard deviation)
estimate. When |
rcov |
function that computes a robust covariance estimate
between two vectors. The default, Gnanadesikan-Kettenring's
|
weight.fn |
a function of the robust distances and the number of
variables |
keep.data |
logical indicating if the (untransformed) data matrix
|
... |
additional arguments; for |
x , y
|
numeric vectors of the same length, the covariance of which
is sought in |
mu.too |
logical indicating if both location and scale should be
returned or just the scale (when |
na.rm |
if |
Typical default values for the function arguments sigmamu, rcov, and weight.fn are available as well, see the Examples below, but their names and calling sequences are still subject to discussion and may be changed in the future.

The current default, weight.fn = hard.rejection, corresponds to the proposal in the literature, but Martin Maechler strongly believes that the hard threshold currently in use is too arbitrary, and that soft thresholding should be used instead.
covOGK()
currently returns a list with components
center |
robust location: numeric vector of length |
cov |
robust covariance matrix estimate: |
wcenter , wcov
|
re-weighted versions of |
weights |
the robustness weights used. |
distances |
the mahalanobis distances computed using
|
......
but note that this might be radically changed to returning an
S4 classed object!
covGK() is a trivial one-line function returning the covariance estimate (s(x+y)^2 - s(x-y)^2)/4, where s() is the scale estimate specified by scalefn.
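Written out as code, this identity is a one-liner (covGKsketch() is our name for illustration; covGK() itself is the package version):

covGKsketch <- function(x, y, scalefn = scaleTau2)
    (scalefn(x + y)^2 - scalefn(x - y)^2) / 4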
s_mad()
, and s_IQR()
return the
scale estimates mad
or IQR
respectively, where the s_*
functions return a length-2 vector
(mu, sig) when mu.too = TRUE
, see also scaleTau2
.
Kjell Konis [email protected], with modifications by Martin Maechler.
Maronna, R.A. and Zamar, R.H. (2002) Robust estimates of location and dispersion of high-dimensional datasets; Technometrics 44(4), 307–317.
Gnanadesikan, R. and John R. Kettenring (1972) Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 28, 81–124.
data(hbk)
hbk.x <- data.matrix(hbk[, 1:3])
cO1 <- covOGK(hbk.x, sigmamu = scaleTau2)
cO2 <- covOGK(hbk.x, sigmamu = s_Qn)
cO3 <- covOGK(hbk.x, sigmamu = s_Sn)
cO4 <- covOGK(hbk.x, sigmamu = s_mad)
cO5 <- covOGK(hbk.x, sigmamu = s_IQR)
data(toxicity)
cO1tox <- covOGK(toxicity, sigmamu = scaleTau2)
cO2tox <- covOGK(toxicity, sigmamu = s_Qn)
## nice formatting of correlation matrices:
as.dist(round(cov2cor(cO1tox$cov), 2))
as.dist(round(cov2cor(cO2tox$cov), 2))
## "graphical"
symnum(cov2cor(cO1tox$cov))
symnum(cov2cor(cO2tox$cov), legend=FALSE)
Data set issued from a study of the adverse events of a drug on 117 patients affected by Crohn's disease (a chronic inflammatory disease of the intestines).
data(CrohnD, package="robustbase")
A data frame with 117 observations on the following 9 variables.
ID
the numeric patient IDs
nrAdvE
the number of adverse events
BMI
Body Mass Index (BMI), i.e., weight[kg] / (height[m])^2.
height
in cm
country
a factor with levels 0
and 1
sex
the person's gender, a binary factor with levels M and F
age
in years, a numeric vector
weight
in kilograms, a numeric vector
treat
how CD was treated: a factor with levels 0, 1 and 2, meaning placebo, drug 1 and drug 2, respectively.
from the authors of the reference, with permission of the original data collecting agency.
Serigne N. Lô and Elvezio Ronchetti (2006). Robust Second Order Accurate Inference for Generalized Linear Models. Technical report, University of Geneva, Switzerland.
data(CrohnD)
str(CrohnD)
with(CrohnD, ftable(table(sex, country, treat)))
The original data set was bivariate and recorded for ten subjects the
prolongation of sleep caused by two different drugs. These data were
used by Student as the first illustration of the paired t-test which
only needs the differences of the two measurements. These
differences are the values of cushny
.
data(cushny, package="robustbase")
numeric vector, sorted increasingly:
0 0.8 1 1.2 1.3 1.3 1.4 1.8 2.4 4.6
Cushny, A.R. and Peebles, A.R. (1905) The action of optical isomers. II. Hyoscines. J. Physiol. 32, 501–510.
These data were used by Student (1908) as the first illustration of the paired t-test, see also sleep; they were then cited by Fisher (1925) and thenceforth copied in numerous books as an example of a normally distributed sample, see, e.g., Anderson (1958).
Student (1908) The probable error of a mean. Biometrika 6, 1–25.
Fisher, R.A. (1925) Statistical Methods for Research Workers; Oliver & Boyd, Edinburgh.
Anderson, T.W. (1958) An Introduction to Multivariate Statistical Analysis; Wiley, N.Y.
Hampel, F., Ronchetti, E., Rousseeuw, P. and Stahel, W. (1986) Robust Statistics: The Approach Based on Influence Functions; Wiley, N.Y.
data(cushny)
plot(cushny, rep(0, 10), pch = 3, cex = 3, ylab = "", yaxt = "n")
plot(jitter(cushny), rep(0, 10), pch = 3, cex = 2,
     main = "'cushny' data (n = 10)", ylab = "", yaxt = "n")
abline(h = 0, col = "gray", lty = 3)
myPt <- function(m, lwd = 2, ..., e = 1.5*par("cxy")[2])
    segments(m, +e, m, -e, lwd = lwd, ...)
myPt( mean(cushny), col = "pink3")
myPt(median(cushny), col = "light blue")
legend("topright", c("mean", "median"), lwd = 2,
       col = c("pink3", "light blue"), inset = .01)
## The 'sleep' data from the standard 'datasets' package:
d.sleep <- local({
    gr <- with(datasets::sleep, split(extra, group))
    gr[[2]] - gr[[1]]
})
stopifnot(all.equal(cushny, sort(d.sleep), tolerance = 1e-15))
Delivery Time Data, from Montgomery and Peck (1982). The aim is to explain the time required to service a vending machine (Y) by means of the number of products stocked (X1) and the distance walked by the route driver (X2).
data(delivery, package="robustbase")
A data frame with 25 observations on the following 3 variables.
n.prod
Number of Products
distance
Distance
delTime
Delivery time
Montgomery and Peck (1982, p.116)
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, page 155, table 23.
data(delivery)
summary(lm.deli <- lm(delTime ~ ., data = delivery))
delivery.x <- as.matrix(delivery[, 1:2])
c_deli <- covMcd(delivery.x)
c_deli
Education Expenditure Data, from Chatterjee and Price (1977, p.108). This data set, representing the education expenditure variables in the 50 US states, provides an interesting example of heteroscedasticity.
data(education, package="robustbase")
A data frame with 50 observations on the following 6 variables.
State
State
Region
Region (1=Northeastern, 2=North central, 3=Southern, 4=Western)
X1
Number of residents per thousand residing in urban areas in 1970
X2
Per capita personal income in 1973
X3
Number of residents per thousand under 18 years of age in 1974
Y
Per capita expenditure on public education in a state, projected for 1975
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, p.110, table 16.
data(education)
education.x <- data.matrix(education[, 3:5])
summary(lm.education <- lm(Y ~ Region + X1 + X2 + X3, data = education))
## See example(lmrob.M.S) for how robust regression is used
Data from a clinical trial of 59 patients with epilepsy (Breslow, 1996) in order to illustrate diagnostic techniques in Poisson regression.
data(epilepsy, package="robustbase")
A data frame with 59 observations on the following 11 variables.
ID
Patient identification number
Y1
Number of epilepsy attacks patients have during the first follow-up period
Y2
Number of epilepsy attacks patients have during the second follow-up period
Y3
Number of epilepsy attacks patients have during the third follow-up period
Y4
Number of epilepsy attacks patients have during the fourth follow-up period
Base
Number of epileptic attacks recorded during an 8-week period prior to randomization
Age
Age of the patients
Trt
a factor with levels placebo
progabide
indicating whether the anti-epilepsy
drug Progabide has been applied or not
Ysum
Total number of epilepsy attacks patients have during the four follow-up periods
Age10
Age of the patients divided by 10
Base4
Variable Base divided by 4
Thall and Vail reported data from a clinical trial of 59 patients with epilepsy, 31 of whom were randomized to receive the anti-epilepsy drug Progabide and 28 of whom received a placebo. Baseline data consisted of the patient's age and the number of epileptic seizures recorded during an 8-week period prior to randomization. The response consisted of counts of seizures occurring during the four consecutive two-week follow-up periods.
Thall, P.F. and Vail S.C. (1990) Some covariance models for longitudinal count data with overdispersion. Biometrics 46, 657–671.
Diggle, P.J., Liang, K.Y., and Zeger, S.L. (1994) Analysis of Longitudinal Data; Clarendon Press.
Breslow N. E. (1996) Generalized linear models: Checking assumptions and strengthening conclusions. Statistica Applicata 8, 23–41.
data(epilepsy)
str(epilepsy)
pairs(epilepsy[, c("Ysum", "Base4", "Trt", "Age10")])
Efit1 <- glm(Ysum ~ Age10 + Base4*Trt, family = poisson, data = epilepsy)
summary(Efit1)
## Robust Fit:
Efit2 <- glmrob(Ysum ~ Age10 + Base4*Trt, family = poisson, data = epilepsy,
                method = "Mqle", tcc = 1.2, maxit = 100)
summary(Efit2)
Extract the estimation method as a character
string from a fitted model.
estimethod(object, ...)
object |
a fitted model. |
... |
additional, optional arguments. (None are used in our methods) |
This is an (S3) generic function for which we provide methods, currently for nlrob only.
a character
string, the estimation method used.
nlrob
, and nlrob.MM
, notably for examples.
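A minimal usage sketch, borrowing the DNase1 fit from the nlrob() examples (the default nlrob() estimator is "M"):

DNase1 <- DNase[DNase$Run == 1, ]
fm <- nlrob(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
            data = DNase1, start = list(Asym = 3, xmid = 0, scal = 1))
estimethod(fm) # |--> "M"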
This is an artificial data set, cleverly constructed and used by Antille and May to demonstrate ‘problems’ with LMS and LTS.
data(exAM, package="robustbase")
A data frame with 12 observations on 2 variables, x
and y
.
Because the points are not in general position, both LMS and LTS
typically fail; however, e.g., rlm(*,
method="MM")
“works”.
Antille, G. and El May, H. (1992)
The use of slices in the LMS and the method of density slices:
Foundation and comparison.
In Yadolah Dodge and Joe Whittaker, editors, COMPSTAT:
Proc. 10th Symp. Computat. Statist., Neuchatel, 1, 441–445;
Physica-Verlag.
data(exAM)
plot(exAM)
summary(ls <- lm(y ~ x, data = exAM))
abline(ls)
This data set consists of 150 randomly selected persons from a survey of over 2000 elderly US citizens, where the response indicates participation in the U.S. Food Stamp Program.
data(foodstamp, package="robustbase")
A data frame with 150 observations on the following 4 variables.
participation
participation in U.S. Food Stamp Program; yes = 1, no = 0
tenancy
tenancy, indicating home ownership; yes = 1, no = 0
suppl.income
supplemental income, indicating whether some form of supplemental security income is received; yes = 1, no = 0
income
monthly income (in US dollars)
Data description and first analysis: Stefanski et al. (1986), who indicate Rizek (1978) as the original source of the larger study.
Electronic version from CRAN package catdata.
Rizek, R. L. (1978) The 1977-78 Nationwide Food Consumption Survey. Family Econ. Rev., Fall, 3–7.
Stefanski, L. A., Carroll, R. J. and Ruppert, D. (1986) Optimally bounded score functions for generalized linear models with applications to logistic regression. Biometrika 73, 413–424.
Künsch, H. R., Stefanski, L. A., Carroll, R. J. (1989) Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models. J. American Statistical Association 84, 460–466.
data(foodstamp)
(T123 <- xtabs(~ participation + tenancy + suppl.income, data = foodstamp))
summary(T123) ## ==> the binary variables are clearly not independent
foodSt <- within(foodstamp, {
    logInc <- log(1 + income)
    rm(income)
})
m1 <- glm(participation ~ ., family = binomial, data = foodSt)
summary(m1)
rm1 <- glmrob(participation ~ ., family = binomial, data = foodSt)
summary(rm1)
## Now use robust weights.on.x:
rm2 <- glmrob(participation ~ ., family = binomial, data = foodSt,
              weights.on.x = "robCov")
summary(rm2) ## aha, now the weights are different:
which(weights(rm2, type = "robust") < 0.5)
From the QR decomposition with pivoting (qr(x, tol) if n >= p), if the matrix is not of full rank, the corresponding columns (when n >= p) or rows (when n < p) are omitted to form a full rank matrix.
fullRank(x, tol = 1e-7, qrx = qr(x, tol=tol))
x |
a numeric matrix of dimension |
tol |
tolerance for determining rank (deficiency). Currently is
simply passed to |
qrx |
optionally may be used to pass a |
a version of the matrix x, with fewer columns or rows if x's rank was smaller than min(n, p). If x is of full rank, it is returned unchanged.
This is useful for robustness algorithms that rely on matrices
of full rank, e.g.,
adjOutlyingness
.
This also works for numeric data frames and whenever qr()
works correctly.
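For the n >= p case, the column-dropping idea can be sketched in a few lines (dropRankDef() is a hypothetical helper, not the package's implementation, which also handles n < p by dropping rows):

dropRankDef <- function(x, tol = 1e-7) {
    q <- qr(x, tol = tol)
    if (q$rank == ncol(x)) x
    else x[, q$pivot[seq_len(q$rank)], drop = FALSE]
}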
Martin Maechler
qr
; for more sophisticated rank determination,
rankMatrix
from package Matrix.
stopifnot(identical(fullRank(wood), wood))
## More sophisticated and delicate
dim(T <- tcrossprod(data.matrix(toxicity))) # 38 x 38
dim(T. <- fullRank(T)) # 38 x 10
if(requireNamespace("Matrix")) {
  rMmeths <- eval(formals(Matrix::rankMatrix)$method)
  rT. <- sapply(rMmeths, function(.m.) Matrix::rankMatrix(T., method = .m.))
  print(rT.) # "qr" (= "qrLinpack"): 13, others rather 10
}
dim(T.2 <- fullRank(T, tol = 1e-15)) # 38 x 18
dim(T.3 <- fullRank(T, tol = 1e-12)) # 38 x 13
dim(T.3 <- fullRank(T, tol = 1e-10)) # 38 x 13
dim(T.3 <- fullRank(T, tol = 1e-8 )) # 38 x 12
dim(T.) # default from above 38 x 10
dim(T.3 <- fullRank(T, tol = 1e-5 )) # 38 x 10 -- still
plot(svd(T, 0, 0)$d, log = "y", main = "singular values of T", yaxt = "n")
axis(2, at = 10^(-14:5), las = 1)
## pretty clearly indicates that rank 10 is "correct" here.
The class "functionX"
of vectorized functions of one argument
x
and typically further tuning parameters.
Objects can be created by calls of the form new("functionX", ...)
.
.Data: directly extends class "function".
Class "function"
, from data part.
Class "OptionalFunction"
, by class "function"
.
Class "PossibleMethod"
, by class "function"
.
No methods defined with class "functionX" in the signature.
Martin Maechler
psiFunc()
, and class descriptions of
functionXal
for functionals of
"functionX"
, and psi_func
which has
several functionX
slots.
The class "functionXal"
is a class of functionals (typically
integrals) typically of functionX
functions.
Since the functionX
functions typically also depend on tuning
parameters, objects of this class ("functionXal"
) are functions
of these tuning parameters.
.Data: directly extends class "function".
Class "function"
, from data part.
Class "OptionalFunction"
, by class "function"
.
Class "PossibleMethod"
, by class "function"
.
psiFunc()
and the class definitions of
functionX
and
psi_func
which has several functionXal
slots.
glmrob
is used to fit generalized linear models by robust
methods. The models are specified by giving a symbolic description of
the linear predictor and a description of the error distribution.
Currently, robust methods are implemented for family =
binomial
, = poisson
, = Gamma
and = gaussian
.
glmrob(formula, family, data, weights, subset, na.action,
       start = NULL, offset, method = c("Mqle", "BY", "WBY", "MT"),
       weights.on.x = c("none", "hat", "robCov", "covMcd"),
       control = NULL, model = TRUE, x = FALSE, y = TRUE,
       contrasts = NULL, trace.lev = 0, ...)
formula |
a |
family |
a description of the error distribution and link function to
be used in the model. This can be a character string naming a
family function, a family |
data |
an optional data frame containing the variables
in the model. If not found in |
weights |
an optional vector of weights to be used in the fitting process. |
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
na.action |
a function which indicates what should happen
when the data contain |
start |
starting values for the parameters in the linear
predictor. Note that specifying |
offset |
this can be used to specify an a priori known component to be included in the linear predictor during fitting. |
method |
a character string specifying the robust fitting method. The details of method specification are given below. |
weights.on.x |
a character string (can be abbreviated), a If If it is a |
control |
a list of parameters for controlling the fitting process.
See the documentation for |
model |
a logical value indicating whether model frame should be included as a component of the returned value. |
x , y
|
logical values indicating whether the response vector and model matrix used in the fitting process should be returned as components of the returned value. |
contrasts |
an optional list. See the |
trace.lev |
logical (or integer) indicating if intermediate results
should be printed; defaults to |
... |
arguments passed to |
method="model.frame"
returns the model.frame()
,
the same as glm()
.
method="Mqle"
fits a generalized linear
model using Mallows or Huber type robust estimators, as described in
Cantoni and Ronchetti (2001) and Cantoni and Ronchetti (2006). In
contrast to the implementation
described in Cantoni (2004), the pure influence algorithm is
implemented.
method="WBY"
and method="BY"
,
available for logistic regression (family = binomial
) only, call
BYlogreg(*, initwml= . )
for the (weighted) Bianco-Yohai
estimator, where initwml
is true for "WBY"
, and false
for "BY"
.
method="MT"
, currently only implemented for family = poisson
,
computes an “[M]-Estimator based on [T]ransformation”,
by Valdora and Yohai (2013), via (hidden internal) glmrobMT()
; as
that uses sample()
, from R version 3.6.0 it depends on
RNGkind(*, sample.kind)
. Exact reproducibility of results
from R versions 3.5.3 and earlier, requires setting
RNGversion("3.5.0")
.
weights.on.x= "robCov"
makes sense if all explanatory variables
are continuous.
In the cases where weights.on.x is "covMcd" or "robCov", or a list with a “robCov” function, the mahalanobis distances D^2 are computed with respect to the covariance (location and scatter) estimate, and the weights are 1/sqrt(1 + pmax.int(0, 8*(D2 - p)/sqrt(2*p))), where D2 = D^2 and p = ncol(X).
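These x-weights can be reproduced by hand; a minimal sketch for a purely continuous design, here using covMcd() for the robust location and scatter (the variable names are ours, and glmrob() builds its design matrix internally):

data(vaso)
X  <- cbind(lVol = log(vaso$Volume), lRate = log(vaso$Rate))
cX <- covMcd(X)  # robust location and scatter of the design
D2 <- mahalanobis(X, cX$center, cX$cov)
p  <- ncol(X)
w.x <- 1/sqrt(1 + pmax.int(0, 8*(D2 - p)/sqrt(2*p)))
summary(w.x)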
glmrob
returns an object of class "glmrob"
and is also
inheriting from glm
.
The summary
method, see summary.glmrob
, can
be used to obtain or print a summary of the results.
The generic accessor functions coefficients
,
effects
, fitted.values
and residuals
(see
residuals.glmrob
) can be used to extract various useful
features of the value returned by glmrob()
.
An object of class "glmrob"
is a list with at least the
following components:
coefficients |
a named vector of coefficients |
residuals |
the working residuals, that is the (robustly “huberized”) residuals in the final iteration of the IWLS fit. |
fitted.values |
the fitted mean values, obtained by transforming the linear predictors by the inverse of the link function. |
w.r |
robustness weights for each observation; i.e.,
|
w.x |
weights used to down-weight observations based on the position of the observation in the design space. |
dispersion |
robust estimate of the dispersion parameter, if appropriate |
cov |
the estimated asymptotic covariance matrix of the estimated coefficients. |
tcc |
the tuning constant c in Huber's psi-function. |
family |
the |
linear.predictors |
the linear fit on link scale. |
deviance |
NULL; exists for compatibility reasons. |
iter |
the number of iterations used by the influence algorithm. |
converged |
logical. Was the IWLS algorithm judged to have converged? |
call |
the matched call. |
formula |
the formula supplied. |
terms |
the |
data |
the |
offset |
the offset vector used. |
control |
the value of the |
method |
the name of the robust fitter function used. |
contrasts |
(where relevant) the contrasts used. |
xlevels |
(where relevant) a record of the levels of the factors used in fitting. |
Andreas Ruckstuhl ("Mqle") and Martin Maechler
Eva Cantoni and Elvezio Ronchetti (2001) Robust Inference for Generalized Linear Models. JASA 96 (455), 1022–1030.
Eva Cantoni (2004) Analysis of Robust Quasi-deviances for Generalized Linear Models. Journal of Statistical Software 10, https://www.jstatsoft.org/article/view/v010i04

Eva Cantoni and Elvezio Ronchetti (2006) A robust approach for skewed and heavy-tailed outcomes in the analysis of health care expenditures. Journal of Health Economics 25, 198–213.
S. Heritier, E. Cantoni, S. Copt, M.-P. Victoria-Feser (2009) Robust Methods in Biostatistics. Wiley Series in Probability and Statistics.
Marina Valdora and Víctor J. Yohai (2013) Robust estimators for Generalized Linear Models. In progress.
predict.glmrob
for prediction;
glmrobMqle.control
## Binomial response --------------
data(carrots)
Cfit1 <- glm(cbind(success, total-success) ~ logdose + block,
             data = carrots, family = binomial)
summary(Cfit1)
Rfit1 <- glmrob(cbind(success, total-success) ~ logdose + block,
                family = binomial, data = carrots, method = "Mqle",
                control = glmrobMqle.control(tcc = 1.2))
summary(Rfit1)
Rfit2 <- glmrob(success/total ~ logdose + block, weights = total,
                family = binomial, data = carrots, method = "Mqle",
                control = glmrobMqle.control(tcc = 1.2))
coef(Rfit2) ## The same as Rfit1

## Binary response --------------
data(vaso)
Vfit1 <- glm(Y ~ log(Volume) + log(Rate), family = binomial, data = vaso)
coef(Vfit1)
Vfit2 <- glmrob(Y ~ log(Volume) + log(Rate), family = binomial, data = vaso,
                method = "Mqle", control = glmrobMqle.control(tcc = 3.5))
coef(Vfit2) # c = 3.5 ==> not much different from classical
## Note the problems with tcc <= 3 %% FIXME algorithm ???
Vfit3 <- glmrob(Y ~ log(Volume) + log(Rate), family = binomial, data = vaso,
                method = "BY")
coef(Vfit3) ## note that results differ much.
## That's not unreasonable however, see Kuensch et al.(1989), p.465

## Poisson response --------------
data(epilepsy)
Efit1 <- glm(Ysum ~ Age10 + Base4*Trt, family = poisson, data = epilepsy)
summary(Efit1)
Efit2 <- glmrob(Ysum ~ Age10 + Base4*Trt, family = poisson, data = epilepsy,
                method = "Mqle", control = glmrobMqle.control(tcc = 1.2))
summary(Efit2)
## 'x' weighting:
(Efit3 <- glmrob(Ysum ~ Age10 + Base4*Trt, family = poisson, data = epilepsy,
                 method = "Mqle", weights.on.x = "hat",
                 control = glmrobMqle.control(tcc = 1.2)))
try( # gives singular cov matrix: 'Trt' is binary factor -->
     # affine equivariance and subsampling are problematic
  Efit4 <- glmrob(Ysum ~ Age10 + Base4*Trt, family = poisson, data = epilepsy,
                  method = "Mqle", weights.on.x = "covMcd",
                  control = glmrobMqle.control(tcc = 1.2, maxit = 100))
)
##--> See example(possumDiv) for another Poisson-regression

### -------- Gamma family -- data from example(glm) ---
clotting <- data.frame(
    u    = c(5, 10, 15, 20, 30, 40, 60, 80, 100),
    lot1 = c(118, 58, 42, 35, 27, 25, 21, 19, 18),
    lot2 = c(69, 35, 26, 21, 18, 16, 13, 12, 12))
summary(cl <- glm   (lot1 ~ log(u), data = clotting, family = Gamma))
summary(ro <- glmrob(lot1 ~ log(u), data = clotting, family = Gamma))

clotM5.high <- within(clotting, { lot1[5] <- 60 })
op <- par(mfrow = 2:1, mgp = c(1.6, 0.8, 0), mar = c(3, 3:1))
plot(  lot1 ~ log(u), data = clotM5.high)
plot(1/lot1 ~ log(u), data = clotM5.high)
par(op)
## Obviously, the modified (5th) observation is an outlier with respect to
## both representations!
cl5.high <- glm   (lot1 ~ log(u), data = clotM5.high, family = Gamma)
ro5.high <- glmrob(lot1 ~ log(u), data = clotM5.high, family = Gamma)
with(ro5.high, cbind(w.x, w.r)) ## the 5th obs. is downweighted heavily!
plot(1/lot1 ~ log(u), data = clotM5.high)
abline(cl5.high, lty = 2, col = "red")
abline(ro5.high, lwd = 2, col = "blue") ## result is ok (but not "perfect")
These are auxiliary functions serving as the user interface for glmrob fitting when the different methods, "Mqle", "BY", or "MT", are used. They are typically only used when calling glmrob.
glmrobMqle.control(acc = 1e-04, test.acc = "coef", maxit = 50, tcc = 1.345)
glmrobBY.control  (maxit = 1000, const = 0.5, maxhalf = 10)
glmrobMT.control  (cw = 2.1, nsubm = 500, acc = 1e-06, maxit = 200)
acc |
positive convergence tolerance; the iterations converge when ??? |
test.acc |
Only "coef" is currently implemented |
maxit |
integer giving the maximum number of iterations. |
tcc |
tuning constant c for Huber's psi-function |
const |
for "BY", the normalizing constant .. |
maxhalf |
for "BY"; the number of halving steps when the gradient
itself no longer improves. We have seen examples when increasing
|
cw |
tuning constant c for Tukey's biweight psi-function |
nsubm |
the number of subsamples to take for finding an initial
estimate for |
A list
with the arguments as components.
Andreas Ruckstuhl and Martin Maechler
str(glmrobMqle.control())
str(glmrobBY.control())
str(glmrobMT.control())
Compute h(alpha), the size of the subsamples to be used for MCD and LTS. Given alpha, n and p, h(alpha) is an integer, roughly alpha * n, where the exact formula also depends on p. For alpha = 1/2, h == floor((n+p+1)/2); for the general case, it is simply
n2 <- (n+p+1) %/% 2; floor(2*n2 - n + 2*(n-n2)*alpha).
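That formula can be re-derived directly (h.formula() is our own check, not part of the package) and should agree with h.alpha.n():

h.formula <- function(alpha, n, p) {
    n2 <- (n + p + 1) %/% 2
    floor(2*n2 - n + 2*(n - n2)*alpha)
}
stopifnot(h.formula(3/4, 15:100, p = 6) == h.alpha.n(3/4, 15:100, p = 6))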
h.alpha.n(alpha, n, p)
alpha |
fraction, numeric (vector) in [0.5, 1], see, e.g.,
|
n |
integer (valued vector), the sample size. |
p |
integer (valued vector), the dimension. |
numeric vector of h(alpha, n, p); when any of the arguments is of length greater than one, the usual R arithmetic (recycling) rules are used.
covMcd and ltsReg, which are defined via subsamples of size h and hence both use h.alpha.n.
n <- c(10:20, 50, 100)
p <- 5
## show the simple "alpha = 1/2" case:
cbind(n = n, h = h.alpha.n(1/2, n, p), n2p = floor((n+p+1)/2))
## alpha = 3/4 is recommended by some authors:
n <- c(15, 20, 25, 30, 50, 100)
cbind(n = n, h = h.alpha.n(3/4, n, p = 6))
Artificial Data Set generated by Hawkins, Bradu, and Kass (1984). The data set consists of 75 observations in four dimensions (one response and three explanatory variables). It provides a good example of the masking effect. The first 14 observations are outliers, created in two groups: 1–10 and 11–14. Only observations 12, 13 and 14 appear as outliers when using classical methods, but can be easily unmasked using robust distances computed by, e.g., MCD - covMcd().
data(hbk, package="robustbase")
A data frame with 75 observations on 4 variables, where the last variable is the dependent one.
x[,1]
x[,2]
x[,3]
y
This data set is also available in package wle as
artificial
.
Hawkins, D.M., Bradu, D., and Kass, G.V. (1984) Location of several outliers in multiple regression data using elemental sets. Technometrics 26, 197–208.
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, p.94.
data(hbk)
plot(hbk)
summary(lm.hbk <- lm(Y ~ ., data = hbk))
hbk.x <- data.matrix(hbk[, 1:3])
(cHBK <- covMcd(hbk.x))
This data set was analyzed by Weisberg (1980) and Chambers et al. (1983). A catheter is passed into a major vein or artery at the femoral region and moved into the heart. The proper length of the introduced catheter has to be guessed by the physician. The aim of the data set is to describe the relation between the catheter length and the patient's height (X1) and weight (X2).
This data set is used to demonstrate the effects caused by collinearity: the correlation between height and weight is so high that either variable almost completely determines the other.
data(heart)
A data frame with 12 observations on the following 3 variables.
height
Patient's height in inches
weight
Patient's weights in pounds
clength
Y: Catheter Length (in centimeters)
There are other heart datasets in other R packages, notably survival; hence consider specifying package = "robustbase", see the examples.
Weisberg (1980)
Chambers et al. (1983)
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, p.103, table 13.
data(heart, package = "robustbase")
heart.x <- data.matrix(heart[, 1:2]) # the X-variables
plot(heart.x)
covMcd(heart.x)
summary( lm.heart <- lm(clength ~ . , data = heart))
summary(lts.heart <- ltsReg(clength ~ . , data = heart))
Huberization (named after Peter Huber's M-estimation algorithm for location, originally) replaces outlying values in a sample x by their respective boundary: a value larger than M + c*s is replaced by M + c*s, and a value smaller than M - c*s is replaced by M - c*s. Consequently, values inside the interval [M - c*s, M + c*s] remain unchanged. Here, s is the robust scale estimate Qn(x) if that is positive, and M, by default, is the robust Huber M-estimate of location huberM(x, k)$mu (with tuning constant k).

In the degenerate case where Qn(x) == 0, trimmed means of abs(x - M) are tried as scale estimate s, with decreasing trimming proportions specified by the decreasing trim vector.
huberize(x, M = huberM(x, k = k)$mu, c = k,
         trim = (5:1)/16, k = 1.5,
         warn0 = getOption("verbose"), saveTrim = TRUE)
x |
numeric vector which is to be huberized. |
M |
a number; defaulting to |
c |
a positive number, the tuning constant for huberization of the
sample |
trim |
a decreasing vector of trimming proportions in
|
k |
used if |
warn0 |
|
saveTrim |
a |
In regular cases, s = Qn(x) is positive and is used to huberize values of x outside [M - c*s, M + c*s].

In degenerate cases where Qn(x) == 0, we search for a positive scale estimate s by trying the trimmed means s := mean(abs(x - M), trim = trim[j]) with less and less trimming (as the trimming proportions trim[] must decrease). If even the last one, trim[length(trim)], leads to s = 0, a warning is printed when warn0 is true.
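In the regular case s = Qn(x) > 0, the rule reduces to simple clamping; a minimal sketch ignoring the degenerate fallback (huberizeSimple() is our name, not the package function):

huberizeSimple <- function(x, c = 1.5, M = huberM(x, k = c)$mu, s = Qn(x))
    pmin(pmax(x, M - c*s), M + c*s)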
a numeric vector as x; in case Qn(x) was zero and saveTrim is true, also containing the (last) trim proportion used (to compute the scale s) as attribute "trim" (see attr(), attributes).
For the use in mc()
and similar cases where mainly numerical
stabilization is necessary, a large c = 1e12
will lead to no
huberization, i.e., all y == x
for y <- huberize(x, c)
for typical non-degenerate samples.
Martin Maechler
huberM
and mc
which is now stabilized by
default via something like huberize(*, c=1e11)
.
## For non-degenerate data and large c, nothing is huberized,
## as there are *no* really extreme outliers:
set.seed(101)
x <- rnorm(1000)
stopifnot(all.equal(x, huberize(x, c = 100)))
## OTOH, the "extremes" are shrunken towards the boundaries for smaller c:
xh <- huberize(x, c = 2)
table(x != xh) ## 45 out of a 1000:
table(xh[x != xh]) # 26 on the left boundary -2.098 and 19 on the right = 2.081
## visualization:
stripchart(x); text(0, 1, "x {original}", pos = 3); yh <- 0.9
stripchart(xh, at = yh, add = TRUE, col = 2)
text(0, yh, "huberize(x, c=2)", col = 2, pos = 1)
arrows( x[x != xh], 1, xh[x != xh], yh, length = 1/8,
       col = adjustcolor("pink", 1/2))
(Generalized) Huber M-estimator of location with MAD scale, being
sensible also when the scale is zero where huber()
returns an error.
huberM(x, k = 1.5, weights = NULL, tol = 1e-06,
       mu = if(is.null(weights)) median(x) else wgt.himedian(x, weights),
       s = if(is.null(weights)) mad(x, center=mu)
           else wgt.himedian(abs(x - mu), weights),
       se = FALSE, warn0scale = getOption("verbose"))
x |
numeric vector. |
k |
positive factor; the algorithm winsorizes at |
weights |
numeric vector of non-negative weights of same length
as |
tol |
convergence tolerance. |
mu |
initial location estimator. |
s |
scale estimator held constant through the iterations. |
se |
logical indicating if the standard error should be computed
and returned (as |
warn0scale |
logical; if true, and |
Note that currently, when non-NULL
weights
are
specified, the default for initial location mu
and scale
s
is wgt.himedian
, where strictly speaking a
weighted “non-hi” median should be used for consistency.
Since s
is not updated, the results slightly differ, see the
examples below.
When se = TRUE, the standard error is computed using the tau correction factor (but no finite sample correction).
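The iteration itself is short; a minimal unweighted sketch with fixed scale (huberMSketch() is our illustration, not robustbase's exact implementation or stopping rule):

huberMSketch <- function(x, k = 1.5, tol = 1e-6) {
    mu <- median(x); s <- mad(x, center = mu)
    if (s == 0) return(list(mu = mu, s = s, it = 0L))
    it <- 0L
    repeat {
        y <- pmin(pmax(x, mu - k*s), mu + k*s) # winsorize at mu +/- k*s
        mu1 <- mean(y); it <- it + 1L
        if (abs(mu1 - mu) < tol*s) break
        mu <- mu1
    }
    list(mu = mu1, s = s, it = it)
}
huberMSketch(c(1:9, 1000)) # compare with huberM(c(1:9, 1000))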
list of location and scale parameters, and number of iterations used.
mu |
location estimate |
s |
the |
it |
the number of “Huber iterations” used. |
Martin Maechler, building on the MASS code mentioned.
Huber, P. J. (1981) Robust Statistics. Wiley.
hubers
(and huber
) in package MASS;
mad
.
huberM(c(1:9, 1000))
mad   (c(1:9, 1000))

mad   (rep(9, 100))
huberM(rep(9, 100))

## When you have "binned" aka replicated observations:
set.seed(7)
x <- c(round(rnorm(1000), 1), round(rnorm(50, m = 10, sd = 10)))
t.x <- table(x) # -> unique values and multiplicities
x.uniq <- as.numeric(names(t.x)) ## == sort(unique(x))
x.mult <- unname(t.x)
str(Hx  <- huberM(x.uniq, weights = x.mult), digits = 7)
str(Hx. <- huberM(x, s = Hx$s, se = TRUE), digits = 7) ## should be ~= Hx
stopifnot(all.equal(Hx[-4], Hx.[-4]))
str(Hx2 <- huberM(x, se = TRUE), digits = 7) ## somewhat different, since 's' differs

## Confirm correctness of std.error:
system.time(
  SS <- replicate(10000, vapply(huberM(rnorm(400), se = TRUE), as.double, 1.))
) # ~ 2.8 seconds (was 12.2 s)
rbind(mean(SS["SE", ]), sd(SS["mu", ])) # both ~ 0.0508
stopifnot(all.equal(mean(SS["SE", ]),
                    sd ( SS["mu", ]), tolerance = 0.002))
The original data set is the waterflow in January of the Kootenay river, measured at two locations, namely, Libby (Montana) and Newgate (British Columbia) for 13 consecutive years, 1931–1943.
The data set is mostly of interest because it has been used as an example in innumerable didactic treatments of robust regression. To this end, one number (in observation 4) has been modified from the original data: originally 44.9, it is 15.7 here.
data(kootenay, package="robustbase")
A data frame with 13 observations on the following 2 variables.
Libby
a numeric vector
Newgate
a numeric vector
The original (unmodified) version of the data is easily obtainable as kootenay0 from the examples; other modified versions of the data set are also used in different places, see the examples below.
Original Data, p.58f of Ezekiel and Fox (1959), Methods of Correlation and Regression Analysis. Wiley, N.Y.
Hampel, F., Ronchetti, E., Rousseeuw, P. and Stahel, W. (1986) Robust Statistics: The Approach Based on Influence Functions; Wiley, N.Y.
Rousseeuw, P. J. and Leroy, A. M. (1987) Robust Regression & Outlier Detection, Wiley, N. Y.
data(kootenay)
plot(kootenay, main = "'kootenay' data")
points(kootenay[4,], col = 2, cex = 2, pch = 3)
abline(lm   (Newgate ~ Libby, data = kootenay), col = "pink")
abline(lmrob(Newgate ~ Libby, data = kootenay), col = "blue")

## The original version of Ezekiel & Fox:
kootenay0 <- kootenay
kootenay0[4, "Newgate"] <- 44.9
plot(kootenay0, main = "'kootenay0': the original data")
abline(lm   (Newgate ~ Libby, data = kootenay0), col = "pink")
abline(lmrob(Newgate ~ Libby, data = kootenay0), col = "blue")

## The version with "milder" outlier -- Hampel et al., p.310
kootenay2 <- kootenay0
kootenay2[4, "Libby"] <- 20.0 # instead of 77.6
plot(kootenay2, main = "The 'kootenay2' data",
     xlim = range(kootenay[,"Libby"]))
points(kootenay2[4,], col = 2, cex = 2, pch = 3)
abline(lm   (Newgate ~ Libby, data = kootenay2), col = "pink")
abline(lmrob(Newgate ~ Libby, data = kootenay2), col = "blue")
Data on the Calibration of an Instrument that Measures Lactic Acid Concentration in Blood, from Afifi and Azen (1979), comparing the true concentration X with the measured value Y.
data(lactic, package="robustbase")
A data frame with 20 observations on the following 2 variables.
X
True Concentration
Y
Instrument
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, p.62, table 10.
data(lactic)
summary(lm.lactic <- lm(Y ~ ., data = lactic))
Compute the left and right ‘medcouple’, robust estimators of tail weight, in some sense robust versions of the kurtosis, the very unrobust centralized 4th moment.
lmc(x, mx = median(x, na.rm=na.rm), na.rm = FALSE, doReflect = FALSE, ...) rmc(x, mx = median(x, na.rm=na.rm), na.rm = FALSE, doReflect = FALSE, ...)
x |
a numeric vector |
mx |
number, the “center” of x with respect to which the left and right medcouples are computed: lmc(x, mx, *) := mc(x[x <= mx], *) and rmc(x, mx, *) := mc(x[x >= mx], *) |
na.rm |
logical indicating how missing values ( |
doReflect |
logical indicating if |
... |
further arguments to |
each a number (unless ... contains full.result = TRUE).
Brys, G., Hubert, M. and Struyf, A. (2006). Robust measures of tail weight, Computational Statistics and Data Analysis 50(3), 733–759.
and those in ‘References’ of mc.
mc(1:5)  # 0 for a symmetric sample
lmc(1:5) # 0
rmc(1:5) # 0

x1 <- c(1, 2, 7, 9, 10)
mc(x1) # = -1/3
c( lmc( x1),  lmc( x1, doReflect=TRUE)) #  0    -1/3
c( rmc( x1),  rmc( x1, doReflect=TRUE)) # -1/3  -1/6
c(-rmc(-x1), -rmc(-x1, doReflect=TRUE)) #  2/3   1/3

data(cushny)
lmc(cushny) # 0.2
rmc(cushny) # 0.45

isSym_LRmc <- function(x, tol = 1e-14)
    all.equal(lmc(-x, doReflect=TRUE),
              rmc( x, doReflect=TRUE), tolerance = tol)

sym <- c(-20, -5, -2:2, 5, 20)
stopifnot(exprs = {
    lmc(sym) == 0.5
    rmc(sym) == 0.5
    isSym_LRmc(cushny)
    isSym_LRmc(x1)
})

## Susceptibility to large outliers:
## "Sensitivity Curve" := empirical influence function
dX10 <- function(X) c(1:5,7,10,15,25, X) # generate skewed size-10 with 'X'
x <- c(26:40, 45, 50, 60, 75, 100)
(lmc10N <- vapply(x, function(X) lmc(dX10(X)), 1))
(rmc10N <- vapply(x, function(X) rmc(dX10(X)), 1))
cols <- adjustcolor(2:3, 3/4)
plot(x, lmc10N, type="o", cex=1/2,
     main = "lmc & rmc( c(1:5,7,10,15,25, X) )",
     xlab = quote(X), log="x", col=cols[1])
lines(x, rmc10N, col=cols[2], lwd=3)
legend("top", paste0(c("lmc", "rmc"), "(X)"), col=cols,
       lty=1, lwd=c(1,3), pch = c(1, NA), bty="n")

n <- length(x)
stopifnot(exprs = {
    all.equal(current = lmc10N, target = rep(0, n))
    all.equal(current = rmc10N,
              target = c(3/19, 1/5, 5/21, 3/11, 7/23, rep(1/3, n-5)))
    ## and it stays stable with outlier X --> oo :
    lmc(dX10(1e300)) == 0
    rmc(dX10(1e300)) == rmc10N[6]
})
Computes fast MM-type estimators for linear (regression) models.
lmrob(formula, data, subset, weights, na.action, method = "MM",
      model = TRUE, x = !control$compute.rd, y = FALSE,
      singular.ok = TRUE, contrasts = NULL, offset = NULL,
      control = NULL, init = NULL, ...)
formula |
a symbolic description of the model to be fit. See
|
data |
an optional data frame, list or environment (or object
coercible by |
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
weights |
an optional vector of weights to be used in the fitting process (in addition to the robustness weights computed in the fitting process). |
na.action |
a function which indicates what should happen
when the data contain |
method |
string specifying the estimator-chain. |
model , x , y
|
logicals. If |
singular.ok |
logical. If |
contrasts |
an optional list. See the |
offset |
this can be used to specify an a priori
known component to be included in the linear predictor
during fitting. An |
control |
a |
init |
an optional argument to specify or supply the initial estimate. See Details. |
... |
additional arguments can be used to specify control
parameters directly instead of (but not in addition to!) via |
This function computes an MM-type regression estimator as described in Yohai (1987) and Koller and Stahel (2011). By default it uses a bi-square redescending score function, and it returns a highly robust and highly efficient estimator (with 50% breakdown point and 95% asymptotic efficiency for normal errors). The computation is carried out by a call to lmrob.fit().
The argument setting of lmrob.control is provided to set alternative defaults as suggested in Koller and Stahel (2011) (setting="KS2011"; nowadays, rather use its extension setting="KS2014"). For further details, see lmrob.control.
init: The initial estimator may be specified using the argument init. This can either be a string specifying one of the built-in internal estimators (currently "S" and "M-S", see See also below); a function taking arguments x, y, control, mf (where mf stands for model.frame) and returning a list containing at least the initial coefficients as component "coefficients" and the initial scale estimate as "scale"; or a list giving the initial coefficients and scale as components "coefficients" and "scale". See also the Examples.

Note that when init is a function or list, the method argument must not contain the initial estimator, e.g., use MDM instead of SMDM.
The default, equivalent to init = "S", uses as initial estimator an S-estimator (Rousseeuw and Yohai, 1984), which is computed using the Fast-S algorithm of Salibian-Barrera and Yohai (2006), calling lmrob.S(). That function, since March 2012, by default uses nonsingular subsampling, which makes the Fast-S algorithm feasible for categorical data as well, see Koller (2012). Note that convergence problems may still show up as warnings, e.g., “S refinements did not converge (to refine.tol=1e-07) in 200 (= k.max) steps”, and can often be remedied simply by increasing (i.e., weakening) refine.tol or by increasing the allowed number of iterations k.max, see lmrob.control.
method: The chain of estimates computed is customizable via the method argument. There are currently two types of estimates available: "M" corresponds to the standard M-regression estimate, and "D" stands for the Design Adaptive Scale estimate as proposed in Koller and Stahel (2011). The method argument takes a string that specifies the estimates to be calculated as a chain. Setting method='SMDM' will result in an initial S-estimate, followed by an M-estimate, a Design Adaptive Scale estimate and a final M-step. For methods involving a D-step, the default value of psi (see lmrob.control) is changed to "lqq".
By default, standard errors are computed using the formulas of Croux, Dhaene and Hoorelbeke (2003) (lmrob.control option cov=".vcov.avar1"). This method, however, works only for MM-estimates (i.e., method = "MM" or = "SM"). For other method arguments, the covariance matrix estimate used is based on the asymptotic normality of the estimated coefficients (cov=".vcov.w") as described in Koller and Stahel (2011). The var-cov computation can be skipped by cov = "none" and (re)done later by, e.g., vcov(<obj>, cov = ".vcov.w").

As of robustbase version 0.91-0 (April 2014), the computation of robust standard errors for method="SMDM" has been changed. The old behaviour can be restored by setting the control parameter cov.corrfact = "tauold".
An object of class lmrob
; a list including the following
components:
coefficients |
The estimate of the coefficient vector |
scale |
The scale as used in the M estimator. |
residuals |
Residuals associated with the estimator. |
converged |
|
iter |
number of IRWLS iterations |
rweights |
the “robustness weights” |
fitted.values |
Fitted values associated with the estimator. |
init.S |
The |
init |
A similar list that contains the results of intermediate estimates (not for MM-estimates). |
rank |
the numeric rank of the fitted linear model. |
cov |
The estimated covariance matrix of the regression coefficients |
df.residual |
the residual degrees of freedom. |
weights |
the specified weights (missing if none were used). |
na.action |
(where relevant) information returned by
|
offset |
the offset used (missing if none were used). |
contrasts |
(only where relevant) the contrasts used. |
xlevels |
(only where relevant) a record of the levels of the factors used in fitting. |
call |
the matched call. |
terms |
the |
model |
if requested (the default), the model frame used. |
x |
if requested, the model matrix used. |
y |
if requested, the response used. |
In addition, non-null fits will have components assign and qr relating to the linear fit, for use by extractor functions such as summary.
(mainly:) Matias Salibian-Barrera and Manuel Koller
Croux, C., Dhaene, G. and Hoorelbeke, D. (2003) Robust standard errors for robust estimators, Discussion Papers Series 03.16, K.U. Leuven, CES.
Koller, M. (2012) Nonsingular subsampling for S-estimators with categorical predictors, ArXiv e-prints https://arxiv.org/abs/1208.5595; extended version published as Koller and Stahel (2017), see lmrob.control.
Koller, M. and Stahel, W.A. (2011) Sharpening Wald-type inference in robust regression for small samples. Computational Statistics & Data Analysis 55(8), 2504–2515.
Maronna, R. A., and Yohai, V. J. (2000) Robust regression with both continuous and categorical predictors. Journal of Statistical Planning and Inference 89, 197–214.
Rousseeuw, P.J. and Yohai, V.J. (1984) Robust regression by means of S-estimators, In Robust and Nonlinear Time Series, J. Franke, W. Härdle and R. D. Martin (eds.). Lecture Notes in Statistics 26, 256–272, Springer Verlag, New York.
Salibian-Barrera, M. and Yohai, V.J. (2006) A fast algorithm for S-regression estimates, Journal of Computational and Graphical Statistics 15(2), 414–427. doi:10.1198/106186006X113629
Yohai, V.J. (1987) High breakdown-point and high efficiency estimates for regression. The Annals of Statistics 15, 642–656.
Yohai, V., Stahel, W. A. and Zamar, R. (1991) A procedure for robust estimation and inference in linear regression; in Stahel and Weisberg (eds), Directions in Robust Statistics and Diagnostics, Part II, Springer, New York, 365–374; doi:10.1007/978-1-4612-4444-8_20.
lmrob.control; for the algorithms lmrob.S, lmrob.M.S and lmrob.fit; and for methods, summary.lmrob, for the extra “statistics”, notably R^2 (“R squared”); predict.lmrob, print.lmrob, plot.lmrob, and weights.lmrob.
data(coleman)
set.seed(0)
## Default for a very long time:
summary( m1 <- lmrob(Y ~ ., data=coleman) )
## Nowadays **strongly recommended** for routine use:
summary(m2 <- lmrob(Y ~ ., data=coleman, setting = "KS2014") )
## ------------------
plot(residuals(m2) ~ weights(m2, type="robustness")) ##-> weights.lmrob()
abline(h=0, lty=3)

data(starsCYG, package = "robustbase")
## Plot simple data and fitted lines
plot(starsCYG)
lmST <- lm(log.light ~ log.Te, data = starsCYG)
(RlmST <- lmrob(log.light ~ log.Te, data = starsCYG))
abline(lmST, col = "red")
abline(RlmST, col = "blue")
## --> Least squares: negative slope; robust: slope ~= 2.2
summary(RlmST) # -> 4 outliers; rest perfect
vcov(RlmST)
stopifnot(all.equal(fitted(RlmST),
                    predict(RlmST, newdata = starsCYG), tol = 1e-14))
## FIXME: setting = "KS2011" or setting = "KS2014" **FAIL** here

##--- 'init' argument -----------------------------------
## 1) string
set.seed(0)
m3 <- lmrob(Y ~ ., data=coleman, init = "S")
stopifnot(all.equal(m1[-18], m3[-18]))
## 2) function
initFun <- function(x, y, control, ...) { # no 'mf' needed
    init.S <- lmrob.S(x, y, control)
    list(coefficients = init.S$coef, scale = init.S$scale)
}
set.seed(0)
m4 <- lmrob(Y ~ ., data=coleman, method = "M", init = initFun)
## 3) list
m5 <- lmrob(Y ~ ., data=coleman, method = "M",
            init = list(coefficients = m3$init$coef, scale = m3$scale))
stopifnot(all.equal(m4[-17], m5[-17]))
This function calculates a Design Adaptive Scale estimate for a given MM-estimate. This is supposed to be a part of a chain of estimates like SMD or SMDM.
lmrob..D..fit(obj, x=obj$x, control = obj$control, mf, method = obj$control$method)
obj |
|
x |
the design matrix; if |
control |
list of control parameters, as returned
by |
mf |
defunct. |
method |
optional; the |
This function is used by lmrob.fit and typically not to be used on its own. Note that lmrob.fit() specifies control potentially differently than the default, but does use the default for method.
The given lmrob-object with the following elements updated:
scale |
The Design Adaptive Scale estimate |
converged |
|
Manuel Koller
Koller, M. and Stahel, W.A. (2011), Sharpening Wald-type inference in robust regression for small samples, Computational Statistics & Data Analysis 55(8), 2504–2515.
data(stackloss)
## Compute manual SMD-estimate:
## 1) MM-estimate
m1 <- lmrob(stack.loss ~ ., data = stackloss)
## 2) Add Design Adaptive Scale estimate
m2 <- lmrob..D..fit(m1)
print(c(m1$scale, m2$scale))
summary(m1)
summary(m2) ## the covariance matrix estimate is also updated
This function performs RWLS iterations to find an M-estimator of regression. When started from an S-estimated beta.initial, this results in an MM-estimator.
lmrob..M..fit(x = obj$x, y = obj$y, beta.initial = obj$coef,
              scale = obj$scale, control = obj$control,
              obj, mf, method = obj$control$method)
x |
design matrix ( |
y |
numeric response vector (of length |
beta.initial |
numeric vector (of length |
scale |
robust residual scale estimate. Usually an S-scale estimator. |
control |
list of control parameters, as returned
by |
obj |
an optional |
mf |
defunct. |
method |
optional; the |
This function is used by lmrob.fit (and anova(<lmrob>, type = "Deviance")) and typically not to be used on its own.
A list with the following elements:
coef |
the M-estimator (or MM-estim.) of regression |
control |
the |
scale |
The residual scale estimate |
seed |
The random number generator seed |
converged |
|
Matias Salibian-Barrera and Martin Maechler
Yohai, V.J. (1987); see the references in lmrob.
lmrob.fit, lmrob; rlm from package MASS.
data(stackloss)
X <- model.matrix(stack.loss ~ . , data = stackloss)
y <- stack.loss
## Compute manual MM-estimate:
## 1) initial LTS:
m0 <- ltsReg(X[,-1], y)
## 2) M-estimate started from LTS:
m1 <- lmrob..M..fit(X, y, beta.initial = coef(m0), scale = m0$scale,
                    method = "SM",
                    control = lmrob.control(tuning.psi = 1.6, psi = 'bisquare'))
## no 'method' (nor 'obj'):
m1. <- lmrob..M..fit(X, y, beta.initial = coef(m0), scale = m0$scale,
                     control = m1$control)
stopifnot(all.equal(m1, m1., tol = 1e-15)) # identical {call *not* stored!}
cbind(m0$coef, m1$coef)
## the scale is kept fixed:
stopifnot(identical(unname(m0$scale), m1$scale))

## robustness weights:
r.s <- with(m1, residuals/scale) # scaled residuals
m1.wts <- Mpsi(r.s, cc = 1.6, psi="tukey") / r.s
summarizeRobWeights(m1.wts) ##--> outliers 1,3,4,13,21
which(m0$lts.wt == 0) # 1,3,4,21 but not 13

## Manually add M-step to SMD-estimate (=> equivalent to "SMDM"):
m2 <- lmrob(stack.loss ~ ., data = stackloss, method = 'SMD')
m3 <- lmrob..M..fit(obj = m2)

## Simple function that allows custom initial estimates
## (Deprecated; use the 'init' argument to lmrob() instead.)
lmrob.custom <- function(x, y, beta.initial, scale, terms) {
    ## initialize object
    obj <- list(control = lmrob.control("KS2011"),
                terms = terms) ## terms is needed for summary()
    ## M-step
    obj <- lmrob..M..fit(x, y, beta.initial, scale, obj = obj)
    ## D-step
    obj <- lmrob..D..fit(obj, x)
    ## Add some missing elements
    obj$cov <- TRUE ## enables calculation of cov matrix
    obj$p <- obj$qr$rank
    obj$degree.freedom <- length(y) - obj$p
    ## M-step
    obj <- lmrob..M..fit(x, y, obj = obj)
    obj$control$method <- ".MDM"
    obj
}
m4 <- lmrob.custom(X, y, m2$init$init.S$coef, m2$init$scale, m2$terms)
stopifnot(all.equal(m4$coef, m3$coef))

## Start from ltsReg:
m5 <- ltsReg(stack.loss ~ ., data = stackloss)
m6 <- lmrob.custom(m5$X, m5$Y, coef(m5), m5$scale, m5$terms)
Tuning parameters for lmrob, the MM-type regression estimator and the associated S-, M- and D-estimators. Using setting="KS2011" sets the defaults as suggested by Koller and Stahel (2011), and analogously for "KS2014". The .M*.default functions and .M*.defaults lists contain default tuning parameters for all the predefined psi functions, see also Mpsi, etc.
lmrob.control(setting, seed = NULL, nResample = 500,
              tuning.chi = NULL, bb = 0.5, tuning.psi = NULL,
              max.it = 50, groups = 5, n.group = 400,
              k.fast.s = 1, best.r.s = 2,
              k.max = 200, maxit.scale = 200, k.m_s = 20,
              refine.tol = 1e-7, rel.tol = 1e-7, scale.tol = 1e-10,
              solve.tol = 1e-7, zero.tol = 1e-10,
              trace.lev = 0, mts = 1000,
              subsampling = c("nonsingular", "simple"),
              compute.rd = FALSE, method = "MM", psi = "bisquare",
              numpoints = 10, cov = NULL,
              split.type = c("f", "fi", "fii"),
              fast.s.large.n = 2000,
              ## only for outlierStats() :
              eps.outlier = function(nobs) 0.1 / nobs,
              eps.x = function(maxx) .Machine$double.eps^(.75)*maxx,
              compute.outlier.stats = method,
              warn.limit.reject = 0.5,
              warn.limit.meanrw = 0.5, ...)

## S3 method for class 'lmrobCtrl'
update(object, ...)

.Mchi.tuning.defaults
.Mchi.tuning.default(psi)
.Mpsi.tuning.defaults
.Mpsi.tuning.default(psi)
setting |
a string specifying alternative default values. Leave
empty for the defaults or use |
seed |
|
nResample |
number of re-sampling candidates to be used to find the initial S-estimator. Currently defaults to 500 which works well in most situations (see references). |
tuning.chi |
tuning constant vector for the S-estimator. If
|
bb |
expected value under the normal model of the
“chi” (rather |
tuning.psi |
tuning constant vector for the redescending
M-estimator. If |
max.it |
integer specifying the maximum number of IRWLS iterations. |
groups |
(for the fast-S algorithm): Number of random subsets to use when the data set is large. |
n.group |
(for the fast-S algorithm): Size of each of the
|
k.fast.s |
(for the fast-S algorithm): Number of local improvement steps (“I-steps”) for each re-sampling candidate. |
k.m_s |
(for the M-S algorithm): specifies after how many unsuccessful refinement steps the algorithm stops. |
best.r.s |
(for the fast-S algorithm): Number of
of best candidates to be iterated further (i.e.,
“refined”); is denoted |
k.max |
(for the fast-S algorithm): maximal number of refinement steps for the “fully” iterated best candidates. |
maxit.scale |
integer specifying the maximum number of C level
|
refine.tol |
(for the fast-S algorithm): relative convergence tolerance for the fully iterated best candidates. |
rel.tol |
(for the RWLS iterations of the MM algorithm): relative convergence tolerance for the parameter vector. |
scale.tol |
(for the scale estimation iterations of the S algorithm): relative
convergence tolerance for the |
solve.tol |
(for the S algorithm): relative
tolerance for inversion. Hence, this corresponds to
|
zero.tol |
for checking 0-residuals in the S algorithm, non-negative number
|
trace.lev |
integer indicating if the progress of the MM-algorithm
and the fast-S algorithms, see |
mts |
maximum number of samples to try in subsampling algorithm. |
subsampling |
type of subsampling to be used, a string:
|
compute.rd |
logical indicating if robust distances (based on
the MCD robust covariance estimator |
method |
string specifying the estimator-chain. |
psi |
string specifying the type |
numpoints |
number of points used in Gauss quadrature. |
cov |
function or string with function name to be used to
calculate covariance matrix estimate. The default is
|
split.type |
determines how categorical and continuous variables
are split. See |
fast.s.large.n |
minimum number of observations required to switch from ordinary “fast S” algorithm to an efficient “large n” strategy. |
eps.outlier |
limit on the robustness weight below which an observation
is considered to be an outlier.
Either a |
eps.x |
limit on the absolute value of the elements of the design matrix
below which an element is considered zero.
Either a |
compute.outlier.stats |
vector of |
warn.limit.reject |
limit of ratio
|
warn.limit.meanrw |
limit of the mean robustness per factor level
below which ( |
object |
an |
... |
for
|
The option setting="KS2011" alters the default arguments: they are changed to method = "SMDM", psi = "lqq", max.it = 500, k.max = 2000, cov = ".vcov.w". The defaults of all the remaining arguments are not changed.

The option setting="KS2014" builds upon setting="KS2011"; three more arguments are changed, to best.r.s = 20, k.fast.s = 2, nResample = 1000. This setting should produce more stable estimates for designs with factors.
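A small sketch showing the five defaults that "KS2011" changes (the component names are as documented above):

C11 <- lmrob.control("KS2011")
str(C11[c("method", "psi", "max.it", "k.max", "cov")])
## "SMDM", "lqq", 500, 2000, ".vcov.w"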
By default, and in .Mpsi.tuning.default() and .Mchi.tuning.default(), tuning.chi and tuning.psi are set to yield an MM-estimate with breakdown point 1/2 and efficiency of 95% at the normal.

If numeric tuning.chi or tuning.psi are specified, say cc, for psi = "ggw" or "lqq", .psi.const(cc, psi) is used, see its help page.
To get the defaults, e.g., .Mpsi.tuning.default(psi) is equivalent to, but more efficient than, the formerly widely used lmrob.control(psi = psi)$tuning.psi.
These defaults are:
psi | tuning.chi | tuning.psi
---|---|---
bisquare | 1.54764 | 4.685061
welsh | 0.5773502 | 2.11
ggw | c(-0.5, 1.5, NA, 0.5) | c(-0.5, 1.5, 0.95, NA)
lqq | c(-0.5, 1.5, NA, 0.5) | c(-0.5, 1.5, 0.95, NA)
optimal | 0.4047 | 1.060158
hampel | c(1.5, 3.5, 8)*0.2119163 | c(1.5, 3.5, 8)*0.9014
The values of the tuning constants for the ggw and lqq psi functions are specified differently here, namely by a vector with four elements: minimal slope, b (controlling the bend at the maximum of the curve), efficiency, and breakdown point. Use NA for an unspecified value of either efficiency or breakdown point, see the examples in the tables (above and below). For these table examples, the respective “inner constants” are stored precomputed, see .psi.lqq.findc for more.
The constants for the "hampel" psi function are chosen to have a redescending slope of -1/3. Constants for a slope of -1/2 would be

psi | tuning.chi | tuning.psi
---|---|---
"hampel" | c(2, 4, 8) * 0.1981319 | c(2, 4, 8) * 0.690794
Alternative coefficients for an efficiency of 85% at the normal are given in the table below.

psi | tuning.psi
---|---
bisquare | 3.443689
welsh | 1.456
ggw, lqq | c(-0.5, 1.5, 0.85, NA)
optimal | 0.8684
hampel (-1/3) | c(1.5, 3.5, 8) * 0.5704545
hampel (-1/2) | c(2, 4, 8) * 0.4769578
.Mchi.tuning.default(psi) and .Mpsi.tuning.default(psi) return a short numeric vector of tuning constants which are defaults for the corresponding psi function, see the Details. They are based on the named lists .Mchi.tuning.defaults and .Mpsi.tuning.defaults, respectively.

lmrob.control() returns a named list with over twenty components, corresponding to the arguments, where tuning.psi and tuning.chi are typically computed, as .Mpsi.tuning.default(psi) or .Mchi.tuning.default(psi), respectively.
It is of class "lmrobCtrl" and we provide print(), update() and within methods. Note that update(<lmrobCtrl>, ...) does not allow a setting="<...>" argument in '...'.
Matias Salibian-Barrera, Martin Maechler and Manuel Koller
Koller, M. and Stahel, W.A. (2011) Sharpening Wald-type inference in robust regression for small samples. Computational Statistics & Data Analysis 55(8), 2504–2515.
Koller, M. and Stahel, W.A. (2017)
Nonsingular subsampling for regression S estimators with categorical predictors,
Computational Statistics 32(2): 631–646.
doi:10.1007/s00180-016-0679-x.
Referred to as "KS2014" everywhere in robustbase; a shorter first version, Koller (2012), has been available from https://arxiv.org/abs/1208.5595.
Mpsi, etc, for the (fast!) psi function computations; lmrob, also for references and examples.
## Show the default settings:
str(lmrob.control())

## Artificial data for a simple "robust t test":
set.seed(17)
y <- y0 <- rnorm(200)
y[sample(200,20)] <- 100*rnorm(20)
gr <- as.factor(rbinom(200, 1, prob = 1/8))
lmrob(y0 ~ 0+gr)

## Use Koller & Stahel(2011)'s recommendation but a larger 'max.it':
str(ctrl <- lmrob.control("KS2011", max.it = 1000))
str(.Mpsi.tuning.defaults)
stopifnot(identical(.Mpsi.tuning.defaults,
                    sapply(names(.Mpsi.tuning.defaults), .Mpsi.tuning.default)))
## Containing (names!) all our (pre-defined) redescenders:
str(.Mchi.tuning.defaults)

## Difference between settings:
Cdef <- lmrob.control()
C11  <- lmrob.control("KS2011")
C14  <- lmrob.control("KS2014")
str(C14)
## Differences:
diffD <- names(which(!mapply(identical, Cdef, C11, ignore.environment=TRUE)))
diffC <- names(which(!mapply(identical, C11,  C14, ignore.environment=TRUE)))
## KS2011 vs KS2014: Apart from `setting` itself, they only differ in three places:
cbind(KS11 = unlist(C11[diffC[-1]]),
      KS14 = unlist(C14[diffC[-1]]))
##           KS11 KS14
## nResample  500 1000
## best.r.s     2   20
## k.fast.s     1    2
## default vs KS2011: a bit more: setting + 8
str2simpLang <- function(x) {
    r <- if(is.null(x)) quote((NULL)) else str2lang(deparse(x))
    if(is.call(r)) format(r) else r
}
cbind(deflt = lapply(Cdef[diffD], str2simpLang),
      KS11  = lapply(C11 [diffD], str2simpLang))

## update()ing a lmrob.control(), e.g.,
C14mod <- update(C14, trace.lev = 2) # the same as
C14m.d <- C14; C14m.d$trace.lev <- 2
stopifnot(identical(C14mod, C14m.d))
## changing psi --> updates tuning.{psi,chi}:
C14mp <- update(C14, psi = "hampel", seed=101)
## updating 'method' is "smart":
C.SMDM <- update(Cdef, method="SMDM")
all.equal(Cdef, C.SMDM) # changed also psi, tuning.{psi,chi} and cov !
chgd <- c("method", "psi", "tuning.chi", "tuning.psi", "cov")
str(Cdef  [chgd])
str(C.SMDM[chgd])
C14m <- update(C14, method="SMM")
(ae <- all.equal(C14, C14mp)) # changed tuning.psi & tuning.chi, too
stopifnot(exprs = {
    identical(C14, update(C14, method="SMDM")) # no change!
    identical(c("psi", "seed", "tuning.chi", "tuning.psi"),
              sort(gsub("[^.[:alpha:]]", "", sub(":.*", "", sub("^Component ", "", ae)))))
    identical(C14m, local({C <- C14; C$method <- "SMM"; C}))
})
## try( update(C14, setting="KS2011") ) #--> Error: .. not allowed
Compute MM-type estimators of regression: An S-estimator is used as starting value, and an M-estimator with fixed scale and redescending psi-function is used from there. Optionally a D-step (Design Adaptive Scale estimate) as well as a second M-step is calculated.
lmrob.fit(x, y, control, init = NULL, mf = NULL, bare.only = FALSE)
x |
design matrix ( |
y |
numeric response vector (of length |
control |
a list of control parameters as returned
by |
init |
optional |
mf |
defunct. |
bare.only |
logical indicating if the result should be
|
This function is the basic fitting function for MM-type estimation, called by lmrob and typically not to be used on its own. If given, init must be a list of initial estimates containing at least the initial coefficients and scale as components "coefficients" and "scale". Otherwise it calls lmrob.S(..) and uses that as the initial estimator.
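A minimal sketch of calling lmrob.fit() directly (using the stackloss data, as in other examples in this manual):

data(stackloss)
x <- model.matrix(stack.loss ~ ., data = stackloss)
y <- stackloss$stack.loss
f <- lmrob.fit(x, y, control = lmrob.control())
f$coefficients ; f$scale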
A list with components (some missing in case bare.only is true):
fitted.values |
|
residuals |
the raw residuals, |
rweights |
robustness weights derived from the final M-estimator residuals (even when not converged). |
rank |
|
degree.freedom |
|
coefficients |
estimated regression coefficient vector |
scale |
the robustly estimated error standard deviation |
cov |
variance-covariance matrix of |
control |
|
iter |
|
converged |
logical indicating if the RWLS iterations have converged. |
init.S |
the whole initial S-estimator result, including its own
|
init |
A similar list that contains the results of intermediate estimates (not for MM-estimates). |
Matias Salibian-Barrera, Martin Maechler and Manuel Koller
lmrob, lmrob..M..fit, lmrob..D..fit, lmrob.S.
To compute least absolute residuals (LAR) or “L1” regression, lmrob.lar implements the routine L1 in Barrodale and Roberts (1974), which is based on the simplex method of linear programming. It is a copy of lmRob.lar (in early 2012) from the robust package.
lmrob.lar(x, y, control, ...)
x |
numeric matrix for the predictors. |
y |
numeric vector for the response. |
control |
|
... |
(unused but needed when called as |
This method is used for computing the M-S estimate and typically not to be used on its own. A description of the Fortran subroutines used can be found in Marazzi (1993). In the book, the main method is named RILARS.
A list that includes the following components:
coef |
The L1-estimate of the coefficient vector |
scale |
The residual scale estimate (mad) |
resid |
The residuals |
iter |
The number of iterations required by the simplex algorithm |
status |
Return status (0: optimal, but non unique solution, 1: optimal unique solution) |
converged |
Convergence status (always |
Manuel Koller
Marazzi, A. (1993). Algorithms, routines, and S functions for robust statistics. Wadsworth & Brooks/Cole, Pacific Grove, CA.
rq from CRAN package quantreg.
data(stackloss)
X <- model.matrix(stack.loss ~ . , data = stackloss)
y <- stack.loss
(fm.L1 <- lmrob.lar(X, y))
with(fm.L1, stopifnot(converged,
                      status == 1L,
                      all.equal(scale, 1.5291576438),
                      sum(abs(residuals) < 1e-15) == 4)) # p=4 exactly fitted obs.
Computes an M-S-estimator for linear regression using the “M-S” algorithm.
lmrob.M.S(x, y, control, mf, split = splitFrame(mf, x, control$split.type))
x |
numeric matrix (a |
y |
numeric vector for the response |
control |
|
mf |
a model frame as returned by |
split |
(optional) list as returned by |
This function is used by lmrob and not intended to be used on its own (because an M-S-estimator has too low efficiency ‘on its own’). An M-S estimator is a combination of an S-estimator for the continuous variables and an L1-estimator (i.e., an M-estimator with ψ(x) = sign(x)) for the categorical variables. The S-estimator is estimated using a subsampling algorithm. If the model includes interactions between categorical (factor) and continuous variables, the subsampling algorithm might fail. In this case, one can choose to assign the interaction to the categorical side of variables rather than to the continuous side. This can be accomplished via the control argument split.type or by specifying split, see splitFrame.
Note that the return status converged does not refer to the actual convergence status. The algorithm used does not guarantee convergence, and thus true convergence is almost never reached. This is, however, not a problem if the estimate is only used as the initial estimate of an MM or SMDM estimate.
The algorithm sometimes produces the warning message “Skipping design matrix equilibration (dgeequ): row ?? is exactly zero.”. This is just an artifact of the algorithm and can be ignored safely.
A list with components
coefficients |
numeric vector (length |
scale |
the M-S-scale residual estimate |
residuals |
numeric vector (legnth |
rweights |
numeric vector (length |
control |
the same list as the |
converged |
Convergence status (always |
descent.cov |
|
Manuel Koller
Maronna, R. A., and Yohai, V. J. (2000). Robust regression with both continuous and categorical predictors. Journal of Statistical Planning and Inference 89, 197–214.
lmrob; for a description of the available split types, see splitFrame. lmRob in package robust uses a version of the M-S algorithm automatically when the formula contains factors. Our version, however, follows Maronna and Yohai (2000) more closely.
data(education)
education <- within(education, Region <- factor(Region))
flm <- lm(Y ~ Region + X1 + X2 + X3, education)
x <- model.matrix(flm)
y <- education$Y # == model.response(model.frame(flm))
set.seed(17)
f.MS <- lmrob.M.S(x, y, control = lmrob.control(),
                  mf = model.frame(flm))
## The typical use of the "M-S" estimator -- as initial estimate:
fmMS <- lmrob(Y ~ Region + X1 + X2 + X3, education, init = "M-S")
Computes an S-estimator for linear regression, using the “fast S” algorithm.
lmrob.S(x, y, control, trace.lev = control$trace.lev, only.scale = FALSE, mf)
x |
design matrix ( |
y |
numeric vector of responses (or residuals for |
control |
list as returned by |
trace.lev |
integer indicating if the progress of the algorithm
should be traced (increasingly); default |
only.scale |
|
mf |
defunct. |
This function is used by lmrob.fit and typically not to be used on its own (because an S-estimator has too low efficiency ‘on its own’). By default, the subsampling algorithm uses a customized LU decomposition which ensures a non-singular subsample (if this is at all possible). This makes the Fast-S algorithm feasible also for categorical and mixed continuous-categorical data. One can revert to the old subsampling scheme by setting the parameter subsampling in control to "simple".
By default (when only.scale is false), a list with components:
coefficients |
numeric vector (length |
scale |
the S-scale residual estimate |
fitted.values |
numeric vector (length |
residuals |
numeric vector (length |
rweights |
numeric vector (length |
k.iter |
(maximal) number of refinement iterations used. |
converged |
logical indicating if all refinement iterations had converged. |
control |
the same list as the |
If only.scale is true, the computed scale (a number) is returned.
Matias Salibian-Barrera and Manuel Koller; Martin Maechler for minor new options and more documentation.
lmrob, also for references.
set.seed(33)
x1 <- sort(rnorm(30)); x2 <- sort(rnorm(30)); x3 <- sort(rnorm(30))
X. <- cbind(x1, x2, x3)
y <- 10 + X. %*% (10*(2:4)) + rnorm(30)/10
y[1] <- 500   # a moderate outlier
X.[2,1] <- 20 # an X outlier
X1 <- cbind(1, X.)
(m.lm <- lm(y ~ X.))
set.seed(12)
m.lmS <- lmrob.S(x=X1, y=y, control = lmrob.control(nRes = 20), trace.lev=1)
m.lmS[c("coefficients","scale")]
all.equal(unname(m.lmS$coef), 10 * (1:4), tolerance = 0.005)
stopifnot(all.equal(unname(m.lmS$coef), 10 * (1:4), tolerance = 0.005),
          all.equal(m.lmS$scale, 1/10, tolerance = 0.09))

## only.scale = TRUE: Compute the S scale, given residuals;
s.lmS <- lmrob.S(X1, y=residuals(m.lmS), only.scale = TRUE,
                 control = lmrob.control(trace.lev = 3))
all.equal(s.lmS, m.lmS$scale) # close: 1.89e-6 [64b Lnx]
Length of stay for 201 patients that stayed at the University Hospital of Lausanne during the year 2000.
data(los, package="robustbase")
Vector of integer values giving the length of stay (days):
int [1:201] 16 13 17 4 15 24 59 18 33 8 ...
These data may be used to estimate and predict the total resource consumption of this group of patients.
Cf. Ruffieux, Paccaud and Marazzi (2000).
The data were kindly provided by A. Marazzi.
Cf. Hubert, M. and Vandervieren, E. (2006), p. 13–15.
Ruffieux, C., Paccaud, F. and A. Marazzi (2000) Comparing rules for truncating hospital length of stay; Casemix Quarterly 2, n. 1.
See also those for adjbox.
summary(los) # quite skewed, with median(.) = 8
plot(table(los))
boxplot(los, horizontal=TRUE, add=TRUE, col = "red", axes=FALSE)
##-> "outliers" instead of "just skewed"

hist(log(los))
boxplot(log(los), add=TRUE, col=2, border=2, horizontal = TRUE, at = -1)

## Hubert and Vandervieren (2006), p. 15, Fig. 11.
adjbox(los, col = "gray", staplecol="red", outcol = "red",
       main = "(Skewness-)Adjusted and original boxplot for 'los' data")
boxplot(los, add = TRUE, staplewex= 0.2, outcex= 0.5, outpch= 4,
        staplecol = "blue", outcol = "blue", staplelwd=2)
legend("topright", c("adjbox(los)", "boxplot(los)"),
       col=c("red","blue"), lwd = 1:2, bty="n")
Carries out least trimmed squares (LTS) robust (high breakdown point) regression.
ltsReg(x, ...)

## S3 method for class 'formula'
ltsReg(formula, data, subset, weights, na.action,
       model = TRUE, x.ret = FALSE, y.ret = FALSE,
       contrasts = NULL, offset, ...)

## Default S3 method:
ltsReg(x, y, intercept = TRUE, alpha = , nsamp = , adjust = ,
       mcd = TRUE, qr.out = FALSE, yname = NULL,
       seed = , trace = , use.correction = , wgtFUN = ,
       control = rrcov.control(), ...)
formula |
a |
data |
data frame from which variables specified in
|
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
weights |
an optional vector of weights to be used in the fitting process. NOT USED YET. |
na.action |
a function which indicates what should happen
when the data contain |
model , x.ret , y.ret
|
|
contrasts |
an optional list. See the |
offset |
this can be used to specify an a priori
known component to be included in the linear predictor
during fitting. An |
x |
a matrix or data frame containing the explanatory variables. |
y |
the response: a vector of length the number of rows of x. |
intercept |
if true, a model with constant term will be
estimated; otherwise no constant term will be included. Default is
|
alpha |
the percentage (roughly) of squared residuals whose sum will be
minimized, by default 0.5. In general, |
nsamp |
number of subsets used for initial estimates or
|
adjust |
whether to perform intercept adjustment at each step.
Since this can be time consuming, the default is |
mcd |
whether to compute robust distances using Fast-MCD. |
qr.out |
whether to return the QR decomposition (see
|
yname |
the name of the dependent variable. Default is |
seed |
initial seed for random generator, like
|
trace |
logical (or integer) indicating if intermediate results
should be printed; defaults to |
use.correction |
whether to use finite sample correction factors.
Default is |
wgtFUN |
a character string or |
control |
a list with estimation options - same as these provided in the function specification. If the control object is supplied, the parameters from it will be used. If parameters are passed also in the invocation statement, they will override the corresponding elements of the control object. |
... |
arguments passed to or from other methods. |
The LTS regression method minimizes the sum of the h smallest squared residuals, where h > n/2, i.e., at least half the number of observations must be used. The default value of h (when alpha = 1/2) is roughly n/2, more precisely (n+p+1) %/% 2 where n is the total number of observations, but by setting alpha, the user may choose higher values up to n, where h = h(alpha, n, p) = h.alpha.n(alpha, n, p). The LTS estimate of the error scale is given by the minimum of the objective function multiplied by a consistency factor and a finite sample correction factor; see Pison et al. (2002) for details. The rescaling factors for the raw and final estimates are also returned in the vectors raw.cnp2 and cnp2, each of length 2. The finite sample corrections can be suppressed by setting use.correction=FALSE. The computations are performed using the Fast LTS algorithm proposed by Rousseeuw and Van Driessen (1999).
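To make the role of h concrete (a sketch; h.alpha.n is the helper named above):

n <- 21; p <- 4       # e.g., the 'stackloss' dimensions (with intercept)
(n + p + 1) %/% 2     # 13, the default h for alpha = 1/2
h.alpha.n(1/2, n, p)  # the same value, via the helper
h.alpha.n(3/4, n, p)  # 17: larger h, less robust but more efficient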
As always, the formula interface has an implied intercept term which can be removed either by y ~ x - 1 or y ~ 0 + x. See formula for more details.
The function ltsReg returns an object of class "lts". The summary method function is used to obtain (and print) a summary table of the results, and plot() can be used to plot them; see the specific help pages. The generic accessor functions coefficients, fitted.values and residuals extract various useful features of the value returned by ltsReg.
An object of class lts is a list containing at least the following components:
crit |
the value of the objective function of the LTS regression method, i.e., the sum of the h smallest squared raw residuals. |
coefficients |
vector of coefficient estimates (including the intercept by default when
|
best |
the best subset found and used for computing the raw estimates, with
|
fitted.values |
vector like |
residuals |
vector like |
scale |
scale estimate of the reweighted residuals. |
alpha |
same as the input parameter |
quan |
the number |
intercept |
same as the input parameter |
cnp2 |
a vector of length two containing the consistency correction factor and the finite sample correction factor of the final estimate of the error scale. |
raw.coefficients |
vector of raw coefficient estimates (including
the intercept, when |
raw.scale |
scale estimate of the raw residuals. |
raw.resid |
vector like |
raw.cnp2 |
a vector of length two containing the consistency correction factor and the finite sample correction factor of the raw estimate of the error scale. |
lts.wt |
vector like y containing weights that can be used in a weighted least squares. These weights are 1 for points with reasonably small residuals, and 0 for points with large residuals. |
raw.weights |
vector containing the raw weights based on the raw residuals and raw scale. |
method |
character string naming the method (Least Trimmed Squares). |
X |
the input data as a matrix (including intercept column if applicable). |
Y |
the response variable as a vector. |
Valentin Todorov [email protected], based on work written for S-plus by Peter Rousseeuw and Katrien van Driessen from University of Antwerp.
Peter J. Rousseeuw (1984), Least Median of Squares Regression. Journal of the American Statistical Association 79, 871–881.
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection. Wiley.
P. J. Rousseeuw and K. van Driessen (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212–223.
Pison, G., Van Aelst, S., and Willems, G. (2002) Small Sample Corrections for LTS and MCD. Metrika 55, 111-123.
covMcd; summary.lts for summaries, lmrob() for an alternative robust estimator with HBDP. The generic functions coef, residuals, fitted.
data(heart)
## Default method works with 'x'-matrix and y-var:
heart.x <- data.matrix(heart[, 1:2]) # the X-variables
heart.y <- heart[,"clength"]
ltsReg(heart.x, heart.y)

data(stackloss)
ltsReg(stack.loss ~ ., data = stackloss)
Compute the ‘medcouple’, a robust concept and estimator of skewness. The medcouple is defined as a scaled median difference of the left and right halves of the distribution, and hence, unlike the classical skewness coefficient, is not based on the third moment.
mc(x, na.rm = FALSE, doReflect = (length(x) <= 100),
   doScale = FALSE,   # was hardwired=TRUE, then default=TRUE
   c.huberize = 1e11, # was implicitly = Inf originally
   eps1 = 1e-14, eps2 = 1e-15, # << new in 0.93-2 (2018-07..)
   maxit = 100, trace.lev = 0, full.result = FALSE)
x |
a numeric vector |
na.rm |
logical indicating how missing values ( |
doReflect |
logical indicating if the internal MC should also be
computed on the reflected sample |
doScale |
logical indicating if the internal algorithm should
also scale the data (using the most distant value from the
median which is unrobust and numerically dangerous); scaling has been
hardwired in the original algorithm and R's |
c.huberize |
a positive number (default: |
eps1 , eps2
|
tolerance in the algorithm; |
maxit |
maximal number of iterations; typically a few should be sufficient. |
trace.lev |
integer specifying how much diagnostic output the
algorithm (in C) should produce. No output by default, most output
for |
full.result |
logical indicating if the full return values (from
C) should be returned as a list via |
a number between -1 and 1, which is the medcouple, MC(x).

For r <- mc(x, full.result = TRUE, ....), attr(r, "mcComp") is a list with components
medc |
the medcouple MC(x). |
medc2 |
the medcouple computed on the reflected sample (only when doReflect=TRUE). |
eps |
tolerances used. |
iter , iter2
|
number of iterations used. |
converged , converged2
|
logical specifying “convergence”. |
For extreme cases there were convergence problems which should no longer happen, as we now use doScale=FALSE and huberization (when c.huberize < Inf).

The original algorithm and mc(*, doScale=TRUE) not only center the data around the median but also scale them by the extremes, which may have adverse effects: e.g., changing an extreme outlier to an even more extreme value wrongly changes the result; see the 'mc10x' example.
Guy Brys; modifications by Tobias Verbeke; bug fixes and extensions by Manuel Koller and Martin Maechler.
The new default doScale=FALSE and the new c.huberize argument were introduced as a consequence of Lukas Graz' BSc thesis.
Guy Brys, Mia Hubert and Anja Struyf (2004) A Robust Measure of Skewness; JCGS 13 (4), 996–1017.
Hubert, M. and Vandervieren, E. (2008). An adjusted boxplot for skewed distributions, Computational Statistics and Data Analysis 52, 5186–5201.
Lukas Graz (2021). Improvement of the Algorithms for the Medcouple and the Adjusted Outlyingness; unpublished BSc thesis, supervised by M. Maechler, ETH Zurich.
Qn
for a robust measure of scale (aka
“dispersion”), ....
mc(1:5) # 0 for a symmetric sample
x1 <- c(1, 2, 7, 9, 10)
mc(x1) # = -1/3
data(cushny)
mc(cushny) # 0.125
stopifnot(mc(c(-20, -5, -2:2, 5, 20)) == 0,
          mc(x1, doReflect=FALSE) == -mc(-x1, doReflect=FALSE),
          all.equal(mc(x1, doReflect=FALSE), -1/3, tolerance = 1e-12))

## Susceptibility of the current algorithm to large outliers :
dX10 <- function(X) c(1:5,7,10,15,25, X) # generate skewed size-10 with 'X'
x <- c(10,20,30, 100^(1:20))
## (doScale=TRUE, c.huberize=Inf) were (implicit) defaults in earlier {robustbase}:
(mc10x <- vapply(x, function(X) mc(dX10(X), doScale=TRUE, c.huberize=Inf), 1))
## limit X -> Inf should be 7/12 = 0.58333... but that "breaks down a bit" :
plot(x, mc10x, type="b", main = "mc( c(1:5,7,10,15,25, X) )", xlab="X", log="x")
## The new behavior is much preferable {shows message about new 'doScale=FALSE'}:
(mc10N <- vapply(x, function(X) mc(dX10(X)), 1))
lines(x, mc10N, col=adjustcolor(2, 3/4), lwd=3)
mtext("mc(*, c.huberize=1e11)", col=2)
stopifnot(all.equal(c(4, 6, rep(7, length(x)-2))/12, mc10N))
## Here, huberization already solves the issue:
mc10NS <- vapply(x, function(X) mc(dX10(X), doScale=TRUE), 1)
stopifnot(all.equal(mc10N, mc10NS))
Daudin et al. (1988) give 8 readings on the composition of 86 containers of milk. They speak of 85 observations, which is explained by the fact that observations 63 and 64 are identical (as noted by Rocke (1996)).

The data set was used for analysing the stability of principal component analysis by the bootstrap method. In the same context, but using high breakdown point robust PCA, these data were analysed by Todorov et al. (1994). Atkinson (1994) used these data for illustration of the forward search algorithm for identifying multiple outliers.
data(milk, package="robustbase")
A data frame with 86 observations on the following 8 variables, all but the first measured in grams per liter.
X1
density
X2
fat content
X3
protein content
X4
casein content
X5
cheese dry substance measured in the factory
X6
cheese dry substance measured in the laboratory
X7
milk dry substance
X8
cheese product
Daudin, J.J., Duby, C. and Trecourt, P. (1988) Stability of Principal Component Analysis Studied by the Bootstrap Method; Statistics 19, 241–258.
Todorov, V., Neyko, N., Neytchev, P. (1994) Stability of High Breakdown Point Robust PCA, in Short Communications, COMPSTAT'94; Physica Verlag, Heidelberg.
Atkinson, A.C. (1994) Fast Very Robust Methods for the Detection of Multiple Outliers. J. Amer. Statist. Assoc. 89 1329–1339.
Rocke, D. M. and Woodruff, D. L. (1996) Identification of Outliers in Multivariate Data; J. Amer. Statist. Assoc. 91 (435), 1047–1061.
data(milk)
(c.milk <- covMcd(milk))
summarizeRobWeights(c.milk $ mcd.wt) # 19..20 outliers
umilk <- unique(milk) # dropping obs.64 (== obs.63)
summary(cumilk <- covMcd(umilk, nsamp = "deterministic")) # 20 outliers
Compute Psi / Chi / Wgt / Rho functions for M-estimation, i.e., including MM, etc. For definitions and details, please use the vignette “ψ-Functions Available in Robustbase”.

MrhoInf(x) computes ρ(∞), i.e., the normalizing or scaling constant for the transformation from ρ(·) to ρ~(·), where the latter, also known as χ(·), fulfills ρ~(∞) = 1; this only makes sense for “redescending” psi functions, i.e., not for "huber".

Mwgt(x, *) computes ψ(x)/x (fast and numerically accurately).
Mpsi(x, cc, psi, deriv = 0)
Mchi(x, cc, psi, deriv = 0)
Mwgt(x, cc, psi)
MrhoInf(cc, psi)

.Mwgt.psi1(psi, cc = .Mpsi.tuning.default(psi))
.regularize.Mpsi(psi, redescending = TRUE)
x |
numeric (“abscissa” values) vector, possibly with
|
cc |
numeric tuning constant, for some |
psi |
a string specifying the psi / chi / rho / wgt function;
either |
deriv |
an integer, specifying the order of derivative to
consider; particularly, |
redescending |
logical indicating in |
Theoretically, Mchi() would not be needed explicitly as it can be computed from Mpsi() and MrhoInf(), namely, by

Mchi(x, *, deriv = d) == Mpsi(x, *, deriv = d-1) / MrhoInf(*)

for d = 0, 1 (and ‘*’ containing par, psi), where equality is in the sense of all.equal(x, y, tol) with a small tol.
Similarly, Mwgt would not be needed strictly, as it could be defined via Mpsi; but the explicit definition takes care of 0/0 and typically is of a simpler form.
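For instance, away from zero the two agree, while Mwgt() stays well-defined at x = 0 (a small sketch):

x <- c(-3, -1, 0, 1, 3)
w <- Mwgt(x, cc = 4, psi = "bisquare")
stopifnot(all.equal(w[x != 0],
                    (Mpsi(x, cc = 4, psi = "bisquare") / x)[x != 0]))
w[x == 0]  # finite: no 0/0 at the origin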
For experts, there are even slightly faster versions, .Mpsi(), .Mwgt(), etc.
.Mwgt.psi1(), mainly a utility for nlrob(), returns a function with similar semantics as psi.hampel, psi.huber, or psi.bisquare from package MASS; namely, a function with arguments (x, deriv=0), which for deriv=0 computes Mwgt(x, cc, psi) and otherwise computes Mpsi(x, cc, psi, deriv=deriv).
.Mpsi()
, .Mchi()
, .Mwgt()
, and .MrhoInf()
are
low-level versions of
Mpsi()
, Mchi()
, Mwgt()
, and MrhoInf()
, respectively,
and .psi2ipsi() provides the psi-function integer codes needed for the ipsi argument of the .M*() functions.
For psi = "ggw", the ρ() function has no closed form and must be computed via numerical integration, apart from 6 special cases including the defaults; see the ‘Details’ in help(.psi.ggw.findc).
.Mpsi.regularize()
may (rarely) be used to regularize a psi function.
a numeric vector of the same length as x
, with corresponding
function (or derivative) values.
Manuel Koller, notably for the original C implementation;
tweaks and speedup via .Call
and .M*()
etc by
Martin Maechler.
See the vignette about “ψ-Functions Available in Robustbase”.
psiFunc
and the psi_func
class, both
of which provide considerably more on the R side, but are less
optimized for speed.
.Mpsi.tuning.defaults, etc, for tuning constants' defaults for lmrob(), and .psi.ggw.findc() utilities to construct such constants' vectors.
x <- seq(-5, 7, by = 1/8)
matplot(x, cbind(Mpsi(x, 4, "biweight"),
                 Mchi(x, 4, "biweight"),
                 Mwgt(x, 4, "biweight")), type = "l")
abline(h=0, v=0, lty=2, col=adjustcolor("gray", 0.6))

hampelPsi
(ccHa <- hampelPsi @ xtras $ tuningP $ k)
psHa <- hampelPsi@psi(x)
## using Mpsi():
Mp.Ha <- Mpsi(x, cc = ccHa, psi = "hampel")
stopifnot(all.equal(Mp.Ha, psHa, tolerance = 1e-15))

psi.huber <- .Mwgt.psi1("huber")
if(getRversion() >= "3.0.0")
  stopifnot(identical(psi.huber, .Mwgt.psi1("huber", 1.345),
                      ignore.env=TRUE))
curve(psi.huber(x), -3, 5, col=2, ylim = 0:1)
curve(psi.huber(x, deriv=1), add=TRUE, col=3)

## and show that this is indeed the same as MASS::psi.huber() :
x <- runif(256, -2,3)
stopifnot(all.equal(psi.huber(x), MASS::psi.huber(x)),
          all.equal(psi.huber(x, deriv=1),
                    as.numeric(MASS::psi.huber(x, deriv=1))))

## and how to get MASS::psi.hampel():
psi.hampel <- .Mwgt.psi1("Hampel", c(2,4,8))
x <- runif(256, -4, 10)
stopifnot(all.equal(psi.hampel(x), MASS::psi.hampel(x)),
          all.equal(psi.hampel(x, deriv=1),
                    as.numeric(MASS::psi.hampel(x, deriv=1))))

## "lqq" / "LQQ" and its tuning constants:
ctl0 <- lmrob.control(psi = "lqq", tuning.psi=c(-0.5, 1.5, 0.95, NA))
ctl  <- lmrob.control(psi = "lqq", tuning.psi=c(-0.5, 1.5, 0.90, NA))
ctl0$tuning.psi ## keeps the vector _and_ has "constants" attribute:
## [1] -0.50  1.50  0.95    NA
## attr(,"constants")
## [1] 1.4734061 0.9822707 1.5000000
ctl$tuning.psi ## ditto:
## [1] -0.5  1.5  0.9   NA   \ .."constants" 1.213726 0.809151 1.500000
stopifnot(all.equal(Mpsi(0:2, cc = ctl$tuning.psi, psi = ctl$psi),
                    c(0, 0.977493, 1.1237), tol = 6e-6))

x <- seq(-4, 8, by = 1/16)
## Show how you can use .Mpsi() equivalently to Mpsi()
stopifnot(all.equal(Mpsi(x, cc = ctl$tuning.psi, psi = ctl$psi),
                    .Mpsi(x, ccc = attr(ctl$tuning.psi, "constants"),
                          ipsi = .psi2ipsi("lqq"))))
stopifnot(all.equal(Mpsi(x, cc = ctl0$tuning.psi, psi = ctl0$psi, deriv=1),
                    .Mpsi(x, ccc = attr(ctl0$tuning.psi, "constants"),
                          ipsi = .psi2ipsi("lqq"), deriv=1)))

## M*() preserving attributes :
x <- matrix(x, 32, 8, dimnames=list(paste0("r",1:32), col=letters[1:8]))
comment(x) <- "a vector which is a matrix"
px <- Mpsi(x, cc = ccHa, psi = "hampel")
stopifnot(identical(attributes(x), attributes(px)))

## The "optimal" psi exists in two versions "in the literature": ---
## Maronna et al. 2006, 5.9.1, p.144f:
psi.M2006 <- function(x, c = 0.013)
  sign(x) * pmax(0, abs(x) - c/dnorm(abs(x)))
## and the other is the one in robustbase from 'robust': via Mpsi(.., "optimal")
## Here are both for 95% efficiency:
(c106 <- .Mpsi.tuning.default("optimal"))
c1 <- curve(Mpsi(x, cc = c106, psi="optimal"), -5, 7, n=1001)
c2 <- curve(psi.M2006(x), add=TRUE, n=1001, col=adjustcolor(2,0.4), lwd=2)
abline(0,1, v=0, h=0, lty=3)
## the two psi's are similar, but really quite different
## a zoom into Maronna et al's:
c3 <- curve(psi.M2006(x), -.5, 1, n=1001); abline(h=0,v=0, lty=3); abline(0,1, lty=2)
nlrob
fits a nonlinear regression model by robust methods.
By default, an M-estimator is used, computed via iterated reweighted least squares (called “IRLS” or also “IWLS”).
nlrob(formula, data, start, lower, upper,
      weights = NULL, na.action = na.fail,
      method = c("M", "MM", "tau", "CM", "mtl"),
      psi = .Mwgt.psi1("huber", cc=1.345),
      scale = NULL, test.vec = c("resid", "coef", "w"),
      maxit = 20, tol = 1e-06, acc, algorithm = "default",
      doCov = FALSE, model = FALSE,
      control = if(method == "M") nls.control() else
                nlrob.control(method, optArgs = list(trace=trace), ...),
      trace = FALSE, ...)

## S3 method for class 'nlrob'
fitted(object, ...)
## S3 method for class 'nlrob'
residuals(object, type = , ...)
## S3 method for class 'nlrob'
predict(object, newdata, ...)
formula |
a nonlinear |
data |
an optional data frame containing the variables
in the model. If not found in |
start |
a named numeric vector of starting parameters estimates,
only for |
lower , upper
|
numeric vectors of lower and upper bounds; if
needed, will be replicated to be as long as the longest of For all other methods, currently these bounds must be
specified as finite values, and one of them must have
For methods |
weights |
an optional vector of weights to be used in the fitting
process (for intrinsic weights, not the weights |
na.action |
a function which indicates what should happen when the data
contain |
method |
a character string specifying which method to use. The
default is
Note that all methods but |
psi |
a function (possibly by name) of the form Note that this has been a deliberately non-backcompatible change for robustbase version 0.90-0 (summer 2013 – early 2014). |
scale |
when not |
test.vec |
character string specifying the convergence
criterion. The relative change is tested for residuals with a value
of |
maxit |
maximum number of iterations in the robust loop. |
tol |
non-negative convergence tolerance for the robust fit. |
acc |
previous name for |
algorithm |
character string specifying the algorithm to use for
|
doCov |
a logical specifying if |
model |
a |
control |
an optional list of control settings.
|
trace |
logical value indicating if a “trace” of
the |
object |
an R object of class |
... |
for |
type |
a string specifying the type of residuals desired.
Currently, |
newdata |
a data frame (or list) with the same names as the
original |
For method = "M", iterated reweighted least squares (“IRLS” or “IWLS”) is used, calling nls(*, weights = .), where the weights are proportional to ψ(r_i/ŝ)/(r_i/ŝ), with residuals r_i and robust scale estimate ŝ.
All other methods minimize differently, and work without
nls
. See nlrob.algorithms
for details.
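To illustrate, a single IRLS step for method = "M" could be sketched as follows (a simplification, not nlrob()'s exact code: we use mad() for the residual scale, whereas nlrob() updates its scale estimate within the iteration):

DNase1 <- DNase[DNase$Run == 1, ]
fit <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
           data = DNase1, start = list(Asym = 3, xmid = 0, scal = 1))
r <- residuals(fit)
s <- mad(r, center = 0)                     # robust residual scale
w <- Mwgt(r / s, cc = 1.345, psi = "huber") # psi(r/s) / (r/s)
fit2 <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
            data = DNase1, start = coef(fit), weights = w) # one reweighted LS step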
nlrob() returns an object of S3 class "nlrob", for method = "M" also inheriting from class "nls" (see nls).
It is a list with several components; they are not documented yet,
as some of them will probably change.
Instead, rather use “accessor” methods, where possible:
There are methods (at least) for the generic accessor functions
summary()
, coefficients()
(aka coef()
)
fitted.values()
, residuals()
, sigma()
and
vcov()
, the latter for the variance-covariance matrix of
the estimated parameters, as returned by coef()
, i.e., not
including the variance of the errors.
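For instance (a small sketch, using the DNase1 model from the Examples below):

DNase1 <- DNase[DNase$Run == 1, ]
fm <- nlrob(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
            data = DNase1, start = list(Asym = 3, xmid = 0, scal = 1),
            doCov = TRUE)  # ask for the covariance matrix directly
coef(fm)   # parameter estimates
sigma(fm)  # robust scale estimate of the errors
vcov(fm)   # covariance of coef(fm) only, see above
head(residuals(fm))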
For nlrob()
results, estimethod()
returns the
“estimation method”, which coincides with the method
argument used.
residuals(.), by default type = "response", returns the residuals e_i, defined as e_i = y_i − f(x_i, θ̂). These differ from the standardized or weighted residuals which, e.g., are assumed to be normally distributed, and a version of which is returned in the working.residuals component.
This function (with its only method "M") used to be named rnls and was part of package sfsmisc in the past, but has been dropped there.
method = "M": Andreas Ruckstuhl (inspired by rlm() and nls()), in July 1994 for S-plus. Christian Sangiorgio did the update to R and corrected some errors from June 2002 to January 2005, and Andreas contributed slight changes and the first methods in August 2005.

method = "MM", etc: originally all by Eduardo L. T. Conceicao, see nlrob.algorithms. Since then: help page, testing, more cleanup, and new methods by Martin Maechler.
DNase1 <- DNase[ DNase$Run == 1, ]
## note that selfstarting models don't work yet % <<< FIXME !!!

##--- without conditional linearity ---
## classical
fmNase1 <- nls(density ~ Asym/(1 + exp(( xmid - log(conc) )/scal ) ),
               data = DNase1,
               start = list( Asym = 3, xmid = 0, scal = 1 ),
               trace = TRUE )
summary( fmNase1 )
## robust
RmN1 <- nlrob(density ~ Asym/(1 + exp(( xmid - log(conc) )/scal ) ),
              data = DNase1, trace = TRUE,
              start = list( Asym = 3, xmid = 0, scal = 1 ))
summary( RmN1 )

##--- using conditional linearity ---
## classical
fm2DNase1 <- nls(density ~ 1/(1 + exp(( xmid - log(conc) )/scal ) ),
                 data = DNase1,
                 start = c( xmid = 0, scal = 1 ),
                 alg = "plinear", trace = TRUE )
summary( fm2DNase1 )
## robust
frm2DNase1 <- nlrob(density ~ 1/(1 + exp(( xmid - log(conc) )/scal ) ),
                    data = DNase1, start = c( xmid = 0, scal = 1 ),
                    alg = "plinear", trace = TRUE )
summary( frm2DNase1 )

## Confidence for linear parameter is quite smaller than "Asym" above
c1 <- coef(summary(RmN1))
c2 <- coef(summary(frm2DNase1))
rownames(c2)[rownames(c2) == ".lin"] <- "Asym"
stopifnot(all.equal(c1[,1:2], c2[rownames(c1), 1:2], tol = 0.09)) # 0.07315

### -- new examples -- "moderate outlier":
DN2 <- DNase1
DN2[10,"density"] <- 2*DN2[10,"density"]
fm3DN2 <- nls(density ~ Asym/(1 + exp(( xmid - log(conc) )/scal ) ),
              data = DN2, trace = TRUE,
              start = list( Asym = 3, xmid = 0, scal = 1 ))
## robust
Rm3DN2 <- nlrob(density ~ Asym/(1 + exp(( xmid - log(conc) )/scal ) ),
                data = DN2, trace = TRUE,
                start = list( Asym = 3, xmid = 0, scal = 1 ))
Rm3DN2
summary(Rm3DN2) # -> robustness weight of obs. 10 ~= 0.037
confint(Rm3DN2, method = "Wald")
stopifnot(identical(Rm3DN2$dataClasses,
                    c(density = "numeric", conc = "numeric")))

## utility function sfsmisc::lseq() :
lseq <- function (from, to, length)
  2^seq(log2(from), log2(to), length.out = length)
## predict() {and plot}:
h.x <- lseq(min(DN2$conc), max(DN2$conc), length = 100)
nDat <- data.frame(conc = h.x)
h.p  <- predict(fm3DN2, newdata = nDat) # classical
h.rp <- predict(Rm3DN2, newdata = nDat) # robust
plot(density ~ conc, data=DN2, log="x",
     main = format(formula(Rm3DN2)))
lines(h.x, h.p,  col="blue")
lines(h.x, h.rp, col="magenta")
legend("topleft", c("classical nls()", "robust nlrob()"),
       lwd = 1, col= c("blue", "magenta"), inset = 0.05)

## See ?nlrob.algorithms for examples
DNase1 <- DNase[DNase$Run == 1,]
form <- density ~ Asym/(1 + exp(( xmid -log(conc) )/scal ))
gMM <- nlrob(form, data = DNase1, method = "MM",
             lower = c(Asym = 0, xmid = 0, scal = 0),
             upper = 3, trace = TRUE)
## "CM" (and "mtl") additionally need bounds for "sigma" :
gCM <- nlrob(form, data = DNase1, method = "CM",
             lower = c(Asym = 0, xmid = 0, scal = 0, sigma = 0),
             upper = c(3,3,3, sigma = 0.8))
summary(gCM) # did fail; note it has NA NA NA (std.err, t val, P val)
stopifnot(identical(Rm3DN2$dataClasses, gMM$dataClasses),
          identical( gCM$dataClasses, gMM$dataClasses))
Compute an MM-estimator for nonlinear robust (constrained) regression.
Compute a Tau-estimator for nonlinear robust (constrained) regression.
Compute a “Constrained M” (=: CM) estimator for nonlinear robust (constrained) regression.
Compute a “Maximum Trimmed Likelihood” (=: MTL) estimator for nonlinear robust (constrained) regression.
## You can *not* call the nlrob(*, method = <M>) like this ==> see help(nlrob)
## -------  =====  ------------------------------------------
nlrob.MM(formula, data, lower, upper, tol = 1e-06,
         psi = c("bisquare", "lqq", "optimal", "hampel"),
         init = c("S", "lts"),
         ctrl = nlrob.control("MM", psi = psi, init = init, fnscale = NULL,
                    tuning.chi.scale = .psi.conv.cc(psi, .Mchi.tuning.defaults[[psi]]),
                    tuning.psi.M     = .psi.conv.cc(psi, .Mpsi.tuning.defaults[[psi]]),
                    optim.control = list(), optArgs = list(...)),
         ...)

nlrob.tau(formula, data, lower, upper, tol = 1e-06,
          psi = c("bisquare", "optimal"),
          ctrl = nlrob.control("tau", psi = psi, fnscale = NULL,
                     tuning.chi.scale = NULL, tuning.chi.tau = NULL,
                     optArgs = list(...)),
          ...)

nlrob.CM(formula, data, lower, upper, tol = 1e-06,
         psi = c("bisquare", "lqq", "welsh", "optimal", "hampel", "ggw"),
         ctrl = nlrob.control("CM", psi = psi, fnscale = NULL,
                    tuning.chi = NULL, optArgs = list(...)),
         ...)

nlrob.mtl(formula, data, lower, upper, tol = 1e-06,
          ctrl = nlrob.control("mtl", cutoff = 2.5, optArgs = list(...)),
          ...)
formula |
nonlinear regression |
data |
data to be used, a |
lower , upper
|
bounds aka “box constraints” for all the
parameters, in the case "CM" and "mtl" these must include the error
standard deviation as Note that one of these two must be a properly “named”, e.g.,
|
tol |
numerical convergence tolerance. |
psi , init
|
see |
ctrl |
a |
tuning.psi.M |
.. |
optim.control |
.. |
optArgs |
a |
... |
alternative way to pass the |
Copyright 2013, Eduardo L. T. Conceicao. Available under the GPL (>= 2).
Currently, all four methods use JDEoptim()
from DEoptimR, which subsamples using sample()
.
From R version 3.6.0, sample
depends on
RNGkind(*, sample.kind)
, such that exact reproducibility of
results from R versions 3.5.3 and earlier requires setting
RNGversion("3.5.0")
.
In any case, do use set.seed()
additionally for reproducibility!
an R object of class
"nlrob.<meth>"
, basically a
list with components
Eduardo L. T. Conceicao; compatibility (to nlrob
)
tweaks and generalizations, inference, by Martin Maechler.
For "MTL":
Maronna, Ricardo A., Martin, R. Douglas, and Yohai, Victor J. (2006) Robust Statistics: Theory and Methods; Wiley, Chichester, p. 133.
Yohai, V.J. (1987) High breakdown-point and high efficiency robust estimates for regression. The Annals of Statistics 15, 642–656.
Yohai, V.J., and Zamar, R.H. (1988). High breakdown-point estimates of regression by means of the minimization of an efficient scale. Journal of the American Statistical Association 83, 406–413.
Mendes, B.V.M., and Tyler, D.E. (1996) Constrained M-estimation for regression.
In: Robust Statistics, Data Analysis and Computer Intensive Methods, Lecture Notes in Statistics 109, Springer, New York, 299–320.
Hadi, Ali S., and Luceno, Alberto (1997). Maximum trimmed likelihood estimators: a unified approach, examples, and algorithms. Computational Statistics & Data Analysis 25, 251–272.
Gervini, Daniel, and Yohai, Victor J. (2002). A class of robust and fully efficient regression estimators. The Annals of Statistics 30, 583–616.
DNase1 <- DNase[DNase$Run == 1,]
form <- density ~ Asym/(1 + exp(( xmid -log(conc) )/scal ))
pnms <- c("Asym", "xmid", "scal")
set.seed(47) # as these by default use randomized optimization:
fMM <- robustbase:::nlrob.MM(form, data = DNase1,
                             lower = setNames(c(0,0,0), pnms), upper = 3,
                             ## call to nlrob.control to pass 'optim.control':
                             ctrl = nlrob.control("MM",
                                        optim.control = list(trace = 1),
                                        optArgs = list(trace = TRUE)))
## The same via nlrob() {recommended; same random seed to necessarily give the same}:
set.seed(47)
gMM <- nlrob(form, data = DNase1, method = "MM",
             lower = setNames(c(0,0,0), pnms), upper = 3, trace = TRUE)
gMM
summary(gMM)
## and they are the same {apart from 'call' and 'ctrl' and new stuff in gMM}:
ni <- names(fMM); ni <- ni[is.na(match(ni, c("call","ctrl")))]
stopifnot(all.equal(fMM[ni], gMM[ni]))
Allow the user to specify details for the different nonlinear robust
regression algorithms in nlrob
.
nlrob.control(method,
              psi = c("bisquare", "lqq", "welsh", "optimal", "hampel", "ggw"),
              init = c("S", "lts"),
              optimizer = "JDEoptim", optArgs = list(),
              ...)
method |
|
psi |
string specifying the psi-function which defines the estimator. |
init |
for some methods, currently, |
optimizer |
currently only |
optArgs |
a |
... |
optional arguments depending on |
a list
with several named components.
The contents depend quite a bit on the method
.
nlrob
; for some details, nlrob.algorithms
.
## Show how the different 'method's have different smart defaults :
str(nlrob.control("MM"))
str(nlrob.control("MM", psi = "hampel")) # -> other tuning.psi.M and tuning.chi.scale defaults
str(nlrob.control("MM", psi = "lqq", tol = 1e-10)) # other tuning.psi.M & tuning.chi.scale defaults
str(nlrob.control("tau"))
str(nlrob.control("tau", psi = "lqq"))
str(nlrob.control("CM")) # tuning.chi undefined, unneeded
str(nlrob.control("CM", psi = "optimal"))
str(nlrob.control("mtl"))
A typical medium-sized environmental data set with hourly measurements of pollution content in the ambient air.
data(NOxEmissions, package="robustbase")
A data frame with 8088 observations on the following 4 variables.
julday
day number, a factor with levels 373
... 730
, typically with 24 hourly measurements.
LNOx
log of hourly mean of NOx concentration in ambient air [ppb] next to a highly frequented motorway.
LNOxEm
log of hourly sum of NOx emission of cars on this motorway in arbitrary units.
sqrtWS
square root of wind speed [m/s].
The original data set had more observations, but here all cases with missing values were omitted (na.omit(.)), and then only those cases were retained that belonged to days with at least 20 (fully) observed hourly measurements.
René Locher (at ZHAW, Switzerland).
another NOx dataset, ambientNOxCH
.
data(NOxEmissions)
plot(LNOx ~ LNOxEm, data = NOxEmissions, cex = 0.25, col = "gray30")

## Not run:
## these take too much time --
## p = 340  ==> already Least Squares is not fast
(lmNOx <- lm(LNOx ~ . , data = NOxEmissions))
plot(lmNOx) #-> indication of 1 outlier

M.NOx <- MASS::rlm(LNOx ~ . , data = NOxEmissions)
## M-estimation works
## whereas MM-estimation fails:
try(MM.NOx <- MASS::rlm(LNOx ~ . , data = NOxEmissions, method = "MM"))
## namely because S-estimation fails:
try(lts.NOx <- ltsReg(LNOx ~ . , data = NOxEmissions))
try(lmR.NOx <- lmrob (LNOx ~ . , data = NOxEmissions))
## End(Not run)
Simple statistics about observations with robustness weight of almost zero for models that include factor terms. The number of rejected observations and the mean robustness weights are computed for each level of each factor included in the model.
outlierStats(object, x = object$x, control = object$control,
             epsw = control$eps.outlier,
             epsx = control$eps.x,
             warn.limit.reject = control$warn.limit.reject,
             warn.limit.meanrw = control$warn.limit.meanrw,
             shout = NA)
object |
object of class |
x |
design matrix |
control |
list as returned by |
epsw |
limit on the robustness weight below which an observation is considered
to be an outlier. Either a |
epsx |
limit on the absolute value of the elements of the design matrix below which an element is considered zero. Either a numeric(1) or a function that takes the maximum absolute value in the design matrix as an argument. |
warn.limit.reject |
limit of ratio
|
warn.limit.meanrw |
limit of the mean robustness per factor level
below which ( |
shout |
a |
For models that include factors, the fast S-algorithm used by
lmrob
can produce “bad” fits for some of the
factor levels, especially if there are many levels with only a
few observations. Such a “bad” fit is characterized as a
fit where most of the observations in a level of a factor are
rejected, i.e., are assigned robustness weights of zero or nearly
zero. We call such a fit a “local exact fit”.
If a local exact fit is detected, then we recommend increasing some of the control parameters of the “fast S”-algorithm. As a first-aid solution in such cases, one can use setting="KS2014"; see also lmrob.control.
This function is called internally by lmrob
to issue a
warning if a local exact fit is detected. The output is available as
ostats
in objects of class "lmrob"
(only if the statistic
is computed).
A data frame for each column with any zero elements, as well as an overall statistic. The data frame consists of the names of the coefficients in question, the number of non-zero observations in that
level (N.nonzero
), the number of rejected observations
(N.rejected
), the ratio of rejected observations to the
number of observations in that level (Ratio
) and the mean
robustness weight of all the observations in the corresponding level
(Mean.RobWeight
).
Manuel Koller
Koller, M. and Stahel, W.A. (2017) Nonsingular subsampling for regression S estimators with categorical predictors, Computational Statistics 32(2): 631–646. doi:10.1007/s00180-016-0679-x
lmrob.control
for the default values of the control
parameters; summarizeRobWeights
.
## artificial data example
data <- expand.grid(grp1 = letters[1:5], grp2 = letters[1:5], rep = 1:3)
set.seed(101)
data$y <- c(rt(nrow(data), 1))
## compute outlier statistics for all the estimators
control <- lmrob.control(method = "SMDM",
                         compute.outlier.stats = c("S", "MM", "SMD", "SMDM"))
## warning is only issued for some seeds
set.seed(2)
fit1 <- lmrob(y ~ grp1*grp2, data, control = control)
## do as suggested:
fit2 <- lmrob(y ~ grp1*grp2, data, setting = "KS2014")
## the plot function should work for such models as well
plot(fit1)

## Not run:
## access statistics:
fit1$ostats                  ## SMDM
fit1$init$ostats             ## SMD
fit1$init$init$ostats        ## SM
fit1$init$init$init.S$ostats ## S
## End(Not run)
The total 1981 premium income of pension funds of Dutch firms, for 18 Professional Branches, from de Wit (1982).
data(pension, package="robustbase")
A data frame with 18 observations on the following 2 variables.
Income
Premium Income (in millions of guilders)
Reserves
Premium Reserves (in millions of guilders)
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, p.76, table 13.
data(pension)
plot(pension)
summary(lm.p  <- lm(Reserves ~ ., data = pension))
summary(lmR.p <- lmrob(Reserves ~ ., data = pension))
summary(lts.p <- ltsReg(Reserves ~ ., data = pension))
abline( lm.p)
abline(lmR.p, col = 2)
abline(lts.p, col = 2, lty = 2)

## MM: "the" solution is much simpler:
plot(pension, log = "xy")
lm.lp  <- lm(log(Reserves) ~ log(Income), data = pension)
lmR.lp <- lmrob(log(Reserves) ~ log(Income), data = pension)
plot(log(Reserves) ~ log(Income), data = pension)
## no difference between LS and robust:
abline( lm.lp)
abline(lmR.lp, col = 2)
This dataset investigates the effect of inorganic and organic phosphorus in the soil on the phosphorus content of the corn grown in this soil, from Prescott (1975).
data(phosphor, package="robustbase")
A data frame with 18 observations on the following 3 variables.
inorg
Inorganic soil Phosphorus
organic
Organic soil Phosphorus
plant
Plant Phosphorus content
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection. Wiley, p.156, table 24.
data(phosphor)
plot(phosphor)
summary(lm.phosphor  <- lm(plant ~ ., data = phosphor))
summary(lts.phosphor <- ltsReg(plant ~ ., data = phosphor))
phosphor.x <- data.matrix(phosphor[, 1:2])
cPh <- covMcd(phosphor.x)
plot(cPh, "dd")
Pilot-Plant data from Daniel and Wood (1971). The response variable corresponds to the acid content determined by titration, and the explanatory variable is the organic acid content determined by extraction and weighing. This data set was also analyzed by Yale and Forsythe (1976).
data(pilot, package="robustbase")
A data frame with 20 observations on the following 2 variables.
X
Organic acid content - extraction
Y
Acid content - titration
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, page 21, table 1.
data(pilot)
summary(lm.pilot <- lm(Y ~ ., data = pilot))
The plot method for objects of class psi_func simply visualizes the ρ(), ψ(), and weight functions and their derivatives.
## S4 method for signature 'psi_func'
plot(x, y, which = c("rho", "psi", "Dpsi", "wgt", "Dwgt"),
     main = "full",
     col = c("black", "red3", "blue3", "dark green", "light green"),
     leg.loc = "right", ...)
x |
object of class |
y |
(optional) vector of abscissa values (to plot object at). |
which |
|
main |
string or logical indicating the kind of plot title;
either |
col |
colors to be used for the different slots |
leg.loc |
legend placement, see also |
... |
passed to |
An earlier version had an argument shortMain, which is now deprecated; use main = "short" instead of shortMain = TRUE.
If you want to specify your own title, use main=FALSE
, and a
subsequent title(...)
call.
psiFunc()
and the class
psi_func
.
plot(huberPsi)
plot(huberPsi, which = c("psi", "Dpsi", "wgt"),
     main = "short", leg = "topleft")
plot(hampelPsi)
## Plotting aspect ratio = 1:1 :
plot(hampelPsi, asp = 1, main = "short",
     which = c("psi", "Dpsi", "wgt", "Dwgt"))
Diagnostic plots for objects of class lmrob.
## S3 method for class 'lmrob'
plot(x, which = 1:5,
     caption = c("Standardized residuals vs. Robust Distances",
                 "Normal Q-Q vs. Residuals",
                 "Response vs. Fitted Values",
                 "Residuals vs. Fitted Values",
                 "Sqrt of abs(Residuals) vs. Fitted Values"),
     panel = if(add.smooth) panel.smooth else points,
     sub.caption = deparse(x$call), main = "",
     compute.MD = TRUE,
     ask = prod(par("mfcol")) < length(which) && dev.interactive(),
     id.n = 3, labels.id = names(residuals(x)), cex.id = 0.75,
     label.pos = c(4,2), qqline = TRUE,
     add.smooth = getOption("add.smooth"),
     ..., p = 0.025)
x |
an object as created by |
which |
integer number between 1 and 5 to specify which plot is desired |
caption |
Caption for the different plots |
panel |
panel function. The useful alternative to
|
main |
main title |
sub.caption |
sub titles |
compute.MD |
logical indicating if the robust Mahalanobis
distances should be recomputed, using |
ask |
waits for user input before displaying each plot |
id.n |
number of points to be labelled in each plot, starting with the most extreme. |
labels.id |
vector of labels, from which the labels for extreme
points will be chosen. |
cex.id |
magnification of point labels. |
label.pos |
positioning of labels, for the left half and right half of the graph respectively. |
qqline |
logical indicating if a |
add.smooth |
logical indicating if a smoother should be added to
most plots; see also |
... |
|
p |
threshold for distance-distance plot |
if compute.MD = TRUE
and the robust Mahalanobis distances need
to be computed, they are stored (“cached”) with the object
x
when this function has been called from top-level.
Robust diagnostic plots as in Rousseeuw and van Zomeren (1990), see
‘References’ in ltsPlot
.
lmrob
, also for examples, plot.lm
.
data(starsCYG)
## Plot simple data and fitted lines
plot(starsCYG)
lmST  <- lm(log.light ~ log.Te, data = starsCYG)
RlmST <- lmrob(log.light ~ log.Te, data = starsCYG)
RlmST
abline(lmST, col = "red")
abline(RlmST, col = "blue")
op <- par(mfrow = c(2,2), mgp = c(1.5, 0.6, 0), mar = .1 + c(3,3,3,1))
plot(RlmST, which = c(1:2, 4:5))
par(op)
Four plots (selectable by which
) are currently provided:
a plot of the standardized residuals versus their index,
a plot of the standardized residuals versus fitted values,
a Normal Q-Q plot of the standardized residuals, and
a regression diagnostic plot (standardized residuals versus robust distances of the predictor variables).
## S3 method for class 'lts'
plot(x, which = c("all", "rqq", "rindex", "rfit", "rdiag"),
     classic = FALSE,
     ask = (which[1] == "all" && dev.interactive()),
     id.n, ...)
x |
a |
which |
string indicating which plot to show. See the
Details section for a description of the options. Defaults
to "all". |
classic |
whether to plot the classical distances too. Default is
FALSE. |
ask |
logical indicating if the user should be asked
before each plot, see |
id.n |
number of observations to be identified by a label starting with the most extreme. Default is the number of identified outliers (can be different for the different plots - see Details). |
... |
other parameters to be passed through to plotting functions. |
This function produces several plots based on the robust and classical
regression estimates.
Which of them to select is specified by the argument which.
The possible options are:
rqq
:Normal Q-Q plot of the standardized residuals;
rindex
:plot of the standardized residuals versus their index;
rfit
:plot of the standardized residuals versus fitted values;
rdiag
:regression diagnostic plot.
The normal quantile plot produces a normal Q-Q plot of the
standardized residuals.
A line is drawn which passes through the first and third quartiles.
The id.n
residuals with largest distances from this line are
identified by labels (the observation number). The default for
id.n
is the number of regression outliers (lts.wt==0).
In the Index plot and in the Fitted values plot, the standardized residuals are displayed against the observation number or the fitted value, respectively. A horizontal dashed line is drawn at 0 and two solid horizontal lines are located at +2.5 and -2.5. The id.n residuals with largest absolute values are identified by labels (the observation number). The default for id.n is the number of regression outliers (lts.wt==0).
The regression diagnostic plot, introduced by Rousseeuw and van Zomeren (1990), displays the standardized residuals versus robust distances. Following Rousseeuw and van Zomeren (1990), the horizontal dashed lines are located at +2.5 and -2.5, and the vertical line is located at the square root of the 97.5% quantile of the chi-squared distribution with p degrees of freedom. The id.n residuals with largest absolute values and/or largest robust Mahalanobis distances are identified by labels (the observation number). The default for id.n is the number of all outliers: regression outliers (lts.wt==0) plus (bad and good) leverage points, i.e., robust distances beyond that cutoff.
P. J. Rousseeuw and van Zomeren, B. C. (1990). Unmasking Multivariate Outliers and Leverage Points. Journal of the American Statistical Association 85, 633–639.
P. J. Rousseeuw and K. van Driessen (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212–223.
data(hbk)
lts <- ltsReg(Y ~ ., data = hbk)
lts
plot(lts, which = "rqq")
Shows the Mahalanobis distances based on robust and classical estimates of the location and the covariance matrix in different plots. The following plots are available:
index plot of the robust and classical Mahalanobis distances
distance-distance plot
chi-square Q-Q plot of the robust and classical Mahalanobis distances
plot of the tolerance ellipses (robust and classic)
scree plot - eigenvalues comparison plot
## S3 method for class 'mcd'
plot(x, which = c("all", "dd", "distance", "qqchi2",
                  "tolEllipsePlot", "screeplot"),
     classic = FALSE,
     ask = (which[1] == "all" && dev.interactive()),
     cutoff, id.n, labels.id = rownames(x$X),
     cex.id = 0.75, label.pos = c(4,2), tol = 1e-7, ...)

covPlot(x, which = c("all", "dd", "distance", "qqchi2",
                     "tolEllipsePlot", "screeplot"),
        classic = FALSE,
        ask = (which[1] == "all" && dev.interactive()),
        m.cov = covMcd(x),
        cutoff = NULL, id.n, labels.id = rownames(x),
        cex.id = 0.75, label.pos = c(4,2), tol = 1e-07, ...)
x |
For the |
which |
string indicating which plot to show. See the
Details section for a description of the options. Defaults to "all". |
classic |
whether to plot the classical distances too.
Defaults to FALSE. |
ask |
logical indicating if the user should be asked
before each plot, see |
cutoff |
the cutoff value for the distances. |
id.n |
number of observations to be identified by a label. If
not supplied, the number of observations with distance larger than
|
labels.id |
vector of labels, from which the labels for extreme
points will be chosen. |
cex.id |
magnification of point labels. |
label.pos |
positioning of labels, for the left half and right
half of the graph respectively (used as |
tol |
tolerance to be used for computing the inverse, see
|
m.cov |
an object similar to those of class |
... |
other parameters to be passed through to plotting functions. |
These functions produce several plots based on the robust and classical
location and covariance matrix. Which of them to select is specified
by the argument which. The plot method for "mcd" objects
calls covPlot() directly, whereas covPlot() should also be useful
for plotting other (robust) covariance estimates. The possible options are:
distance
index plot of the robust distances
dd
distance-distance plot
qqchi2
a qq-plot of the robust distances versus the quantiles of the chi-squared distribution
tolEllipsePlot
a tolerance ellipse plot, via
tolEllipsePlot()
screeplot
an eigenvalues comparison plot - screeplot
The Distance-Distance Plot, introduced by Rousseeuw and van Zomeren (1990), displays the robust distances versus the classical Mahalanobis distances. The dashed line is the set of points where the robust distance equals the classical distance. The horizontal and vertical lines are drawn at the cutoff value, which defaults to the square root of the 97.5% quantile of a chi-squared distribution with p degrees of freedom. Points beyond these lines can be considered outliers.
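In R, that default cutoff is simply (for p-dimensional data, e.g., p = 3 for hbk.x below):

p <- 3
sqrt(qchisq(0.975, df = p)) # default cutoff for the robust/classical distances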
P. J. Rousseeuw and van Zomeren, B. C. (1990). Unmasking Multivariate Outliers and Leverage Points. Journal of the American Statistical Association 85, 633–639.
P. J. Rousseeuw and K. van Driessen (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41, 212–223.
data(Animals, package = "MASS")
brain <- Animals[c(1:24, 26:25, 27:28),]
mcd <- covMcd(log(brain))
plot(mcd, which = "distance", classic = TRUE) # 2 plots
plot(mcd, which = "dd")
plot(mcd, which = "tolEllipsePlot", classic = TRUE)
op <- par(mfrow = c(2,3))
plot(mcd) ## -> which = "all" (5 plots)
par(op)

## same plots for another robust Cov estimate:
data(hbk)
hbk.x <- data.matrix(hbk[, 1:3])
cOGK <- covOGK(hbk.x, n.iter = 2, sigmamu = scaleTau2,
               weight.fn = hard.rejection)
covPlot(hbk.x, m.cov = cOGK, classic = TRUE)
Possum diversity data: stemming from a study of the diversity of possums (arboreal marsupials) in the Montane ash forest (Australia), this dataset was collected in view of the management of hardwood forest to take conservation and recreation values, as well as wood production, into account.
The study is fully described in the two references.
The number of different species of arboreal marsupials (possums) was observed on 151 different 3 ha sites with uniform vegetation. For each site, the nine variables listed below were recorded. The problem is to model the relationship between diversity and the other variables.
data(possumDiv, package="robustbase")
Two different representations of the same data are available:
possumDiv
is a data frame of 151 observations
of 9 variables, where the last two are factors, eucalyptus
with
3 levels and aspect
with 4 levels.
possum.mat is a numeric (integer) matrix of 151 rows (observations) and 14 columns (variables), where the last seven are 0-1 dummy variables: three (E.*) code the kind of eucalyptus, and the last four are 0-1 codings of the aspect factor.
The variables have the following meaning:
Diversity
main variable of interest: the number of different species of arboreal marsupial (possum) observed, with values in 0:5.
Shrubs
the number of shrubs.
Stumps
the number of cut stumps from past logging operations.
Stags
the number of stags (hollow-bearing trees).
Bark
bark index (integer) reflecting the quantity of decorticating bark.
Habitat
an integer score indicating the suitability of nesting and foraging habitat for Leadbeater's possum.
BAcacia
a numeric vector giving the basal area of acacia species.
eucalyptus
a 3-level factor specifying the species of eucalypt with the greatest stand basal area. This has the same information as the following three variables:
E.regnans
0-1 indicator for Eucalyptus regnans
E.delegatensis
0-1 indicator for Eucalyptus deleg.
E.nitens
0-1 indicator for Eucalyptus nitens
aspect
a 4-level factor specifying the aspect of the site. It is the same information as the following four variables:
NW-NE
0-1 indicator
NW-SE
0-1 indicator
SE-SW
0-1 indicator
SW-NW
0-1 indicator
Eva Cantoni (2004) Analysis of Robust Quasi-deviances for Generalized Linear Models. Journal of Statistical Software 10, 04, https://www.jstatsoft.org/article/view/v010i04
Lindenmayer, D. B., Cunningham, R. B., Tanton, M. T., Nix, H. A. and Smith, A. P. (1991) The conservation of arboreal marsupials in the montane ash forests of the central highlands of victoria, south-east australia: III. The habitat requirements of leadbeater's possum gymnobelideus leadbeateri and models of the diversity and abundance of arboreal marsupials. Biological Conservation 56, 295–315.
Lindenmayer, D. B., Cunningham, R. B., Tanton, M. T., Smith, A. P. and Nix, H. A. (1990) The conservation of arboreal marsupials in the montane ash forests of the victoria, south-east australia, I. Factors influencing the occupancy of trees with hollows, Biological Conservation 54, 111–131.
See also the references in glmrob.
data(possumDiv)
head(possum.mat)
str(possumDiv)
## summarize all variables as multilevel factors:
summary(as.data.frame(lapply(possumDiv, function(v)
    if(is.integer(v)) factor(v) else v)))
## Following Cantoni & Ronchetti (2001), JASA, p.1026 f.; cf. ../tests/poisson-ex.R
pdFit <- glmrob(Diversity ~ . , data = possumDiv, family = poisson,
                tcc = 1.6, weights.on.x = "hat", acc = 1e-15)
summary(pdFit)
summary(pdF2 <- update(pdFit, ~ . -Shrubs))
summary(pdF3 <- update(pdF2, ~ . -eucalyptus))
summary(pdF4 <- update(pdF3, ~ . -Stumps))
summary(pdF5 <- update(pdF4, ~ . -BAcacia))
summary(pdF6 <- update(pdF5, ~ . -aspect)) # too much ..
anova(pdFit, pdF3, pdF4, pdF5, pdF6, test = "QD")
## indeed, the last simplification is too much
possumD.2 <- within(possumDiv, levels(aspect)[1:3] <- rep("other", 3))
## and use this binary 'aspect' instead of the 4-level one:
summary(pdF5.1 <- update(pdF5, data = possumD.2))
if(FALSE) # not ok, as formally not nested:
  anova(pdF5, pdF5.1)
summarizeRobWeights(weights(pdF5.1, type = "rob"), eps = 0.73)
##-> "outliers" (1, 59, 110)
wrob <- setNames(weights(pdF5.1, type = "rob"), rownames(possumDiv))
head(sort(wrob))
Obtains predictions and optionally estimates standard errors of those predictions from a fitted robust generalized linear model (GLM) object.
## S3 method for class 'glmrob'
predict(object, newdata = NULL,
        type = c("link", "response", "terms"),
        se.fit = FALSE, dispersion = NULL, terms = NULL,
        na.action = na.pass, ...)
object |
a fitted object of class inheriting from |
newdata |
optionally, a data frame in which to look for variables with which to predict. If omitted, the fitted linear predictors are used. |
type |
the type of prediction required. The default is on the
scale of the linear predictors; the alternative "response" is on the scale of
the response variable. The "terms" option returns a matrix giving the fitted
values of each term in the model formula on the linear-predictor scale.
The value of this argument can be abbreviated. |
se.fit |
logical switch indicating if standard errors are required. |
dispersion |
the dispersion of the GLM fit to be assumed in
computing the standard errors. If omitted, that returned by
|
terms |
with |
na.action |
function determining what should be done with missing
values in |
... |
optional further arguments, currently simply passed to
|
If se.fit = FALSE, a vector or matrix of predictions.
If se.fit = TRUE, a list with components
fit |
Predictions |
se.fit |
Estimated standard errors |
residual.scale |
A scalar giving the square root of the dispersion used in computing the standard errors. |
Andreas Ruckstuhl
glmrob()
to fit these robust GLM models,
residuals.glmrob()
and other methods;
predict.lm()
, the method used for a non-robust fit.
data(carrots)
## simplistic testing & training:
i.tr <- sample(24, 20)
fm1 <- glmrob(cbind(success, total-success) ~ logdose + block,
              family = binomial, data = carrots, subset = i.tr)
fm1
predict(fm1, carrots[-i.tr, ]) # --> numeric vector
predict(fm1, carrots[-i.tr, ], type = "response", se = TRUE) # -> a list

data(vaso)
Vfit <- glmrob(Y ~ log(Volume) + log(Rate), family = binomial, data = vaso)
newd <- expand.grid(Volume = (V. <- seq(.5, 4, by = 0.5)),
                    Rate   = (R. <- seq(.25, 4, by = 0.25)))
p <- predict(Vfit, newd)
filled.contour(V., R., matrix(p, length(V.), length(R.)),
               main = "predict(glmrob(., data=vaso))",
               xlab = "Volume", ylab = "Rate")
Predicted values based on robust linear model object.
## S3 method for class 'lmrob'
predict(object, newdata, se.fit = FALSE,
        scale = NULL, df = NULL,
        interval = c("none", "confidence", "prediction"), level = 0.95,
        type = c("response", "terms"), terms = NULL, na.action = na.pass,
        pred.var = res.var/weights, weights = 1, ...)
object |
object of class inheriting from |
newdata |
an optional data frame in which to look for variables with which to predict. If omitted, the fitted values are used. |
se.fit |
a switch indicating if standard errors are required. |
scale |
scale parameter for std.err. calculation |
df |
degrees of freedom for scale |
interval |
type of interval calculation. |
level |
tolerance/confidence level |
type |
Type of prediction (response or model term). |
terms |
if |
na.action |
function determining what should be done with missing
values in |
pred.var |
the variance(s) for future observations to be assumed for prediction intervals. See ‘Details’. |
weights |
variance weights for prediction. This can be a numeric
vector or a one-sided model formula. In the latter case, it is
interpreted as an expression evaluated in |
... |
further arguments passed to or from other methods. |
Note that this lmrob method for predict is closely modeled after the method for lm(), predict.lm, so see there for caveats with missing value treatment.
The prediction intervals are for a single observation at each case in newdata (or by default, the data used for the fit) with error variance(s) pred.var. This can be a multiple of res.var, the estimated value of σ²: the default is to assume that future observations have the same error variance as those used for fitting. If weights is supplied, the inverse of this is used as a scale factor. For a weighted fit, if the prediction is for the original data frame, weights defaults to the weights used for the model fit, with a warning since it might not be the intended result. If the fit was weighted and newdata is given, the default is to assume constant prediction variance, with a warning.
predict.lmrob
produces a vector of predictions or a matrix of
predictions and bounds with column names fit
, lwr
, and
upr
if interval
is set. If se.fit
is
TRUE
, a list with the following components is returned:
fit |
vector or matrix as above |
se.fit |
standard error of predicted means |
residual.scale |
residual standard deviations |
df |
degrees of freedom for residual |
Andreas Ruckstuhl
lmrob
and the (non-robust) traditional
predict.lm
method.
## Predictions --- artificial example -- closely following example(predict.lm)
set.seed(5)
n <- length(x <- sort(c(round(rnorm(25), 1), 20)))
y <- x + rnorm(n)
iO <- c(sample(n-1, 3), n)
y[iO] <- y[iO] + 10*rcauchy(iO)
p.ex <- function(...) {
  plot(y ~ x, ...); abline(0, 1, col = "sky blue")
  points(y ~ x, subset = iO, col = "red", pch = 2)
  abline(lm   (y ~ x), col = "gray40")
  abline(lmrob(y ~ x), col = "forest green")
  legend("topleft", c("true", "Least Squares", "robust"),
         col = c("sky blue", "gray40", "forest green"), lwd = 1.5, bty = "n")
}
p.ex()
fm <- lmrob(y ~ x)
predict(fm)
new <- data.frame(x = seq(-3, 10, 0.25))
str(predict(fm, new, se.fit = TRUE))
pred.w.plim <- predict(fm, new, interval = "prediction")
pred.w.clim <- predict(fm, new, interval = "confidence")
pmat <- cbind(pred.w.clim, pred.w.plim[,-1])
matlines(new$x, pmat, lty = c(1,2,2,3,3)) # add to first plot
## show zoom-in region :
rect(xleft = -3, ybottom = -20, xright = 10, ytop = 40,
     lty = 3, border = "orange4")
## now zoom in :
p.ex(xlim = c(-3,10), ylim = c(-20, 40))
matlines(new$x, pmat, lty = c(1,2,2,3,3))
box(lty = 3, col = "orange4", lwd = 3)
legend("bottom", c("fit", "lwr CI", "upr CI", "lwr Pred.I", "upr Pred.I"),
       col = 1:5, lty = c(1,2,2,3,3), bty = "n")

## Prediction intervals, special cases
## The first three of these throw warnings
w <- 1 + x^2
fit  <- lmrob(y ~ x)
wfit <- lmrob(y ~ x, weights = w)
predict(fit,  interval = "prediction")
predict(wfit, interval = "prediction")
predict(wfit, new, interval = "prediction")
predict(wfit, new, interval = "prediction", weights = (new$x)^2) -> p.w2
p.w2
stopifnot(identical(p.w2, ## the same as using formula:
          predict(wfit, new, interval = "prediction", weights = ~x^2)))
Print method for objects of class "lmrob".
## S3 method for class 'lmrob'
print(x, digits = max(3, getOption("digits") - 3), ...)
x |
an R object of class |
digits |
number of digits for printing, see |
... |
potentially more arguments passed to methods. |
lmrob
, summary.lmrob
,
print
and summary
.
data(coleman)
( m1 <- lmrob(Y ~ ., data = coleman) ) # -> print.lmrob() method
The class "psi_func"
is used to store
functions for M-estimation. In particular, an object of the class
contains
, its derivative
, the weight function
, and
first derivative of
,
Dpsi =
.
Objects can be created by calls of the form new("psi_func", ...)
,
but preferably by psiFunc(...)
.
rho
:the ρ() function, an object of class "functionX". This is used to formulate the objective function; ρ() can be regarded as a generalized negative log-likelihood.
psi
:the derivative of ρ, ψ(x) = ρ'(x); also of class "functionX".
wgt
:the weight function ψ(x)/x, of class "functionX".
Dpsi
:the derivative of ψ, Dpsi(x) = ψ'(x); of class "functionX".
Dwgt
:the derivative of the weight function, of class "functionX"; it is generated automatically if the psiFunc constructor is used.
tDefs
:named numeric vector of tuning parameter default values.
Erho
:a function of class "functionXal" for computing E[ρ(X)] when X is standard normal.
Epsi2
:a function of class "functionXal" for computing E[ψ(X)^2] when X is standard normal.
EDpsi
:a function of class "functionXal" for computing E[ψ'(X)] when X is standard normal.
name
:name of the ψ-function, used for printing.
xtras
:potentially further information.
Currently, the only methods are chgDefaults(), plot() and show().
Martin Maechler
str(huberPsi, give.attr = FALSE)
plot(hampelPsi) # calling the plot method (nicely showing "all" !)
Find ψ function tuning constant sets for "LQQ" and "GGW" ψ (psi) functions by specifying the largest descent (minimal slope), efficiency and/or breakdown point.
.psi.const() is called from lmrob.control() to set the tuning constants for psi and chi for "LQQ" and "GGW" psi. Unless the specified tuning constants belong to the fixed small set for which the results are stored precomputed, .psi.const() calls the corresponding .psi.<psi>.findc().
.psi.ggw.findc(ms, b, eff = NA, bp = NA,
               subdivisions = 100L,
               rel.tol = .Machine$double.eps^0.25, abs.tol = rel.tol,
               tol = .Machine$double.eps^0.25, ms.tol = tol/64,
               maxiter = 1000)
.psi.lqq.findc(ms, b.c, eff = NA, bp = NA, interval = c(0.1, 4),
               subdivisions = 100L,
               rel.tol = .Machine$double.eps^0.25, abs.tol = rel.tol,
               tol = .Machine$double.eps^0.25, maxiter = 1000)
.psi.const(cc, psi)
ms |
number, the minimal slope, typically negative. |
b , b.c
|
number, specifying |
eff |
a number (or |
bp |
a number (or |
interval |
for finding |
subdivisions |
passed to |
rel.tol , abs.tol
|
relative and absolute tolerance for
|
tol |
relative tolerance for |
ms.tol |
relative tolerance for the internal
|
maxiter |
maximal number of iterations for
|
cc |
(for |
psi |
a string, either |
For some important special cases, the results of .psi.*.findc() are stored precomputed for efficiency reasons. These cases are the defaults for tuning.chi and tuning.psi, respectively, in lmrob.control()'s result:

tuning.chi | tuning.psi
c(-0.5, 1.5, NA, 0.5) | c(-0.5, 1.5, 0.95, NA)

and for "ggw" additionally, these four cases:
c(-0.5, 1.5, 0.85, NA), c(-0.5, 1, NA, 0.5),
c(-0.5, 1, 0.95, NA), and c(-0.5, 1, 0.85, NA).
Note that for "ggw", exactly these cases also allow fast ρ() and χ() (aka the standardized ρ, see Mchi) function evaluations. For all other tuning constant settings, rho() evaluations are based on numerical integration via R's own Rdqags() C function (part of R's official API).
a numeric vector of constants, for "lqq" or "ggw" psi functions, respectively:

"lqq"
:c(b*c, c, s), where s = 1 - min.slope;
"ggw"
:c(0, a, b, c, rho(Inf)).
.psi.const(cc, psi) returns the argument cc with the above constant vector as attribute "constants"; for psi = "lqq" this happens in all cases (since robustbase version >= 0.93), for psi = "ggw" only in the non-standard cases.
Manuel Koller (original) and Martin Maechler (arguments, export, docs).
See the vignette about “ψ-Functions Available in Robustbase”.
Mpsi()
etc for the psi function definitions;
.Mpsi.tuning.defaults
, etc, for tuning constants'
defaults for lmrob()
.
(c.ge95 <- .psi.ggw.findc(ms = -0.5, b = 1.5, eff = 0.95))
(c.ge90 <- .psi.ggw.findc(ms = -0.5, b = 1.5, eff = 0.90))
(c.gb50 <- .psi.ggw.findc(ms = -0.5, b = 1.5, bp  = 0.50))
stopifnot(all.equal(c.ge95, c(0, 1.386362,   1.5, 1.0628199,  4.7773893),  tol = 1e-5),
          all.equal(c.ge90, c(0, 1.0282811,  1.5, 0.87086259, 3.2075233),  tol = 1e-5),
          all.equal(c.gb50, c(0, 0.20367394, 1.5, 0.29591308, 0.37033962), tol = 1e-5))
(cl.e.95 <- .psi.lqq.findc(ms = -0.5, b.c = 1.5, eff = .95))
(cl.b.50 <- .psi.lqq.findc(ms = -0.5, b.c = 1.5, bp  = .50))
stopifnot(all.equal(cl.e.95, c(1.4734061,  0.98227073, 1.5), tol = 1e-5),
          all.equal(cl.b.50, c(0.40154568, 0.26769712, 1.5), tol = 1e-5))
psiFunc(..)
is a convenience interface to
new("psi_func",..)
, i.e. for constructing objects of class
"psi_func"
.
psiFunc(rho, psi, wgt, Dpsi, Dwgt,
        Erho = NULL, Epsi2 = NULL, EDpsi = NULL, name, ...)

huberPsi
hampelPsi
rho , psi , wgt , Dpsi , Dwgt
|
each a |
Erho , Epsi2 , EDpsi
|
see |
name |
Name of |
... |
potential further arguments for specifying tuning parameter names and defaults. |
Martin Maechler
The description of class psi_func
.
plot(huberPsi) # => shows "all" {as an object with a smart plot() method}
## classical (Gaussian / "least-squares") psi {trivial}:
F1 <- function(x, .) rep.int(1, length(x))
FF <- function(.) rep.int(1, length(.))
cPsi <- psiFunc(rho = function(x, .) x^2 / 2, psi = function(x, .) x,
                wgt = F1, Dpsi = F1,
                Erho = function(.) rep.int(1/2, length(.)),
                Epsi2 = FF, EDpsi = FF, name = "classic", . = Inf)
show(cPsi)
plot(cPsi)
## is the same as the limit of Huber's:
plot(chgDefaults(huberPsi, k = Inf))
## Hampel's psi and rho:
H.38 <- chgDefaults(hampelPsi, k = c(1.5, 3.5, 8))
k. <- H.38@xtras$tuningP$k ; k.. <- as.vector(outer(c(-1,1), k.))
c.t <- adjustcolor("skyblue3", .8)
.ax.k <- function(side) {
  abline(h = 0, v = 0, lty = 2)
  axis(side, at = k.., labels = formatC(k..), pos = 0, col = c.t, col.axis = c.t)
}
op <- par(mfrow = c(2,1), mgp = c(1.5, .6, 0), mar = .6 + c(2,2,1,.5))
curve(H.38@psi(x), -10, 10, col = 2, lwd = 2, n = 512)
lines(k.., H.38@psi(k..), type = "h", lty = 3, col = c.t); .ax.k(1)
curve(H.38@rho(x), -10, 10, col = 2, lwd = 2, n = 512); abline(h = 0, v = 0, lty = 2)
lines(k.., H.38@rho(k..), type = "h", lty = 3, col = c.t); .ax.k(1)
title(expression("Hampel's " ~~~ psi(x) ~~ "and" ~~ rho(x) ~~~ " functions"))
par(op)
## Not the same, but similar, directly using the plot() method:
plot(H.38)
Measurements of aspects of pulp fibers and of the paper produced from them. Four properties of each are measured in sixty-two samples.
data(pulpfiber, package="robustbase")
A data frame with 62 observations on the following 8 variables.
X1
numeric vector of arithmetic fiber length
X2
numeric vector of long fiber fraction
X3
numeric vector of fine fiber fraction
X4
numeric vector of zero span tensile
Y1
numeric vector of breaking length
Y2
numeric vector of elastic modulus
Y3
numeric vector of stress at failure
Y4
numeric vector of burst strength
Cited from the reference article:
The dataset contains measurements of properties of pulp fibers and the
paper made from them. The aim is to investigate relations between pulp
fiber properties and the resulting paper properties. The dataset
contains measurements of the following four pulp fiber
characteristics: arithmetic fiber length, long fiber fraction, fine
fiber fraction, and zero span tensile. The four paper properties that
have been measured are breaking length, elastic modulus, stress at
failure, and burst strength.
The goal is to predict the paper properties from the
fiber characteristics.
port to R and this help page: Martin Maechler
Rousseeuw, P. J., Van Aelst, S., Van Driessen, K., and Agulló, J. (2004) Robust multivariate regression; Technometrics 46, 293–305.
Till 2016 available from http://users.ugent.be/~svaelst/data/pulpfiber.txt
Lee, J. (1992) Relationships Between Properties of Pulp-Fibre and Paper, unpublished doctoral thesis, U. Toronto, Faculty of Forestry.
data(pulpfiber)
str(pulpfiber)
pairs(pulpfiber, gap = .1) ## 2 blocks of 4 ..
c1 <- cov(pulpfiber)
cR <- covMcd(pulpfiber)
## how different are they: The robust estimate has more clear high correlations:
symnum(cov2cor(c1))
symnum(cov2cor(cR$cov))
Compute the robust scale estimator Q_n, an efficient alternative to the MAD.

By default, Q_n is the k-th order statistic (a quantile) of the choose(n, 2) absolute differences |x_i - x_j| (for i < j), where by default (originally the only possible value) k = choose(n %/% 2 + 1, 2), which is about the first quartile (25% quantile) of these pairwise differences. See the references for more.
Qn(x, constant = NULL, finite.corr = is.null(constant) && missing(k),
   na.rm = FALSE, k = choose(n %/% 2 + 1, 2), warn.finite.corr = TRUE)
s_Qn(x, mu.too = FALSE, ...)
x |
numeric vector of observations. |
constant |
number by which the result is multiplied; the default
achieves consistency for normally distributed data. Note that until
Nov. 2010, “thanks” to a typo in the very first papers, a slightly
wrong default constant, 2.2219, was used instead of the correct one
which is equal to 1/(sqrt(2) * qnorm(5/8)) = 2.21914... If you need the old slightly off version for historical
reproducibility, you can use constant = 2.2219. Note that the relative difference is only about 1 in 1000, and that
the correction should not affect the finite sample corrections for small n. |
finite.corr |
logical indicating if the finite sample bias
correction factor should be applied. Defaults to |
na.rm |
logical specifying if missing values ( |
k |
integer, typically half of n, specifying the “quantile”, i.e., rather the
order statistic that Qn() should return. |
warn.finite.corr |
logical indicating if a |
mu.too |
logical indicating if the median(x) should also be returned. |
... |
potentially further arguments for |
As the (default, consistency) constant needed to be corrected, the finite sample correction has been based on a much more extensive simulation, and on a 3rd or 4th degree polynomial model in 1/n for odd or even n, respectively.
Qn() returns a number, the Q_n robust scale estimator, scaled to be consistent for σ and i.i.d. Gaussian observations, optionally bias corrected for finite samples.

s_Qn(x, mu.too = TRUE) returns a length-2 vector with location (μ) and scale; this is typically only useful for covOGK(*, sigmamu = s_Qn).
Original Fortran code:
Christophe Croux and Peter Rousseeuw [email protected].
Port to C and R: Martin Maechler, [email protected]
Rousseeuw, P.J. and Croux, C. (1993) Alternatives to the Median Absolute Deviation, Journal of the American Statistical Association 88, 1273–1283. doi:10.2307/2291267
Christophe Croux and Peter J. Rousseeuw (1992) A class of high-breakdown scale estimators based on subranges , Communications in Statistics - Theory and Methods 21, 1935–1951; doi:10.1080/03610929208830889
Christophe Croux and Peter J. Rousseeuw (1992) Time-Efficient Algorithms for Two Highly Robust Estimators of Scale, Computational Statistics, Vol. 1, ed. Dodge and Whittaker, Physica-Verlag Heidelberg, 411–428; available via Springer Link.
About the typo in the constant:
Christophe Croux (2010), private e-mail, Fri Jul 16, with subject
“Re: Slight inaccuracy of Qn implementation .......”
mad
for the ‘most robust’ but much less efficient
scale estimator; Sn
for a similar faster but less
efficient alternative. Finally, scaleTau2
which some
consider “uniformly” better than Qn or competitors.
set.seed(153)
x <- sort(c(rnorm(80), rt(20, df = 1)))
s_Qn(x, mu.too = TRUE)
Qn(x, finite.corr = FALSE)

## A simple pure-R version of Qn() -- slow and memory-rich for large n: O(n^2)
Qn0R <- function(x, k = choose(n %/% 2 + 1, 2)) {
  n <- length(x <- sort(x))
  if(n == 0) return(NA) else if(n == 1) return(0.)
  stopifnot(is.numeric(k), k == as.integer(k), 1 <= k, k <= n*(n-1)/2)
  m <- outer(x, x, "-") # abs not needed as x[] is sorted
  sort(m[lower.tri(m)], partial = k)[k]
}
(Qx1 <- Qn(x, constant = 1)) # 0.5498463
## the C-algorithm "rounds" to 'float' single precision ..
stopifnot(all.equal(Qx1, Qn0R(x), tol = 1e-6))

(qn <- Qn(c(1:4, 10, Inf, NA), na.rm = TRUE))
stopifnot(is.finite(qn), all.equal(4.075672524, qn, tol = 1e-10))

## -- compute for different 'k' :
n <- length(x) # = 100 here
(k0 <- choose(floor(n/2) + 1, 2)) # 51*50/2 == 1275
stopifnot(identical(Qx1, Qn(x, constant = 1, k = k0)))
nn2 <- n*(n-1)/2
all.k <- 1:nn2
system.time(Qss <- sapply(all.k, function(k) Qn(x, 1, k = k)))
system.time(Qs  <- Qn  (x, 1, k = all.k))
system.time(Qs0 <- Qn0R(x, k = all.k))
stopifnot(exprs = {
  Qs[1]   == min(diff(x))
  Qs[nn2] == diff(range(x))
  all.equal(Qs,  Qss, tol = 1e-15) # even exactly
  all.equal(Qs0, Qs,  tol = 1e-7)  # see 2.68e-8, as Qn() C-code rounds to (float)
})
plot(2:nn2, Qs[-1], type = "b", log = "y", main = "Qn(*, k), k = 2..n(n-1)/2")
Compute six initial robust estimators of multivariate location and “scatter” (scale); then, for each, compute the distances d_i and take the h (> n/2) observations with smallest distances. Then compute the statistical distances based on these h observations.

Return the indices of the observations sorted in increasing order.
r6pack(x, h, full.h, scaled = TRUE, scalefn = rrcov.control()$scalefn)
x |
n x p data matrix |
h |
integer, typically around (and slightly larger than) n/2. |
full.h |
logical specifying if the full (length n) observation
ordering should be returned; otherwise only the first h are returned. |
scaled |
logical indicating if the data |
scalefn |
a |
The six initial estimators are
Hyperbolic tangent of standardized data
Spearman correlation matrix
Tukey normal scores
Spatial sign covariance matrix
BACON
Raw OGK estimate for scatter
a h' x 6 matrix of observation indices, i.e., with values from 1:n. If full.h is true, h' = n; otherwise h' = h.
Valentin Todorov, based on the original Matlab code by
Tim Verdonck and Mia Hubert. Martin Maechler for tweaks
(performance etc), and full.h
.
Hubert, M., Rousseeuw, P. J. and Verdonck, T. (2012) A deterministic algorithm for robust location and scatter. Journal of Computational and Graphical Statistics 21, 618–637.
covMcd(*, nsamp = "deterministic")
;
CovSest(*, nsamp = "sdet")
from package rrcov.
data(pulpfiber)
dim(m.pulp <- data.matrix(pulpfiber)) # 62 x 8
dim(fr6  <- r6pack(m.pulp, h = 40, full.h = FALSE)) # h x 6 = 40 x 6
dim(fr6F <- r6pack(m.pulp, h = 40, full.h = TRUE )) # n x 6 = 62 x 6
stopifnot(identical(fr6, fr6F[1:40,]))
The data were supplied by A. Frery. They are a part of a synthetic aperture satellite radar image corresponding to a suburb of Munich. Provided are coordinates and values corresponding to three frequency bands for each of 1573 pixels.
data(radarImage, package="robustbase")
A data frame with 1573 observations on the following 5 variables.
X.coord
a numeric vector
Y.coord
a numeric vector
Band.1
a numeric vector
Band.2
a numeric vector
Band.3
a numeric vector
The website accompanying the MMY-book: https://www.wiley.com/legacy/wileychi/robust_statistics/
data(radarImage)
plot(Y.coord ~ X.coord, data = radarImage)
## The 8 "clear" outliers (see also below)
ii8 <- c(1548:1549, 1553:1554, 1565:1566, 1570:1571)
outF <- 1 + (seq_len(nrow(radarImage)) %in% ii8)
pairs(radarImage[, 3:5], main = "radarImage (n = 1573)", col = outF, pch = outF)

## Finding outliers -----------------------------------------
set.seed(1)
system.time(cc.ri <- covMcd(radarImage)) # ~ 0.1 sec
## check for covMcd() consistency:
iiO <- as.integer(
  c(262, 450:451, 480:481, 509, 535, 542, 597, 643, 669, 697, 803:804,
    832:834, 862:864, 892, 989, 1123, 1145, 1223:1224, 1232:1233,
    1249:1250, 1267, 1303, 1347, 1357, 1375, 1411, 1419:1420, 1443,
    1453, 1504, 1510:1512, 1518:1521, 1525:1526, 1543:1544, 1546:1555,
    1557:1558, 1561:1562, 1564:1566, 1569:1571, 1573))
length(iiO) # 73 -- other seeds sometimes give 72, rarely 71 "outliers"
table(isO <- cc.ri$mcd.wt == 0) # 2023-05: 118
stopifnot(exprs = {
  ## identical(iiO, which(isO)) -- TRUE before 2023-05 covMcd() change
  ii8 %in% which(isO) # ii8 is subset of isO
  identical(ii8, which(cc.ri$mah > 200))
  length(intersect(cc.ri$best, iiO)) == 0
})
cc <- c(adjustcolor("black", 0.4), adjustcolor("tomato", 0.8))
pairs(radarImage, main = "radarImage (n = 1573) + Outliers", gap = 0,
      col = cc[1 + isO], pch = c(1,8)[1 + isO], cex = 0.8)
Compute the rank of a matrix A in a simple way, based on the SVD, svd(), computing it “the same as Matlab” does.
rankMM(A, tol = NULL, sv = svd(A, 0, 0)$d)
A |
a numerical matrix, maybe non-square. When |
tol |
numerical tolerance (compared to singular values). By
default, when |
sv |
vector of non-increasing singular values of |
an integer from the set 0:min(dim(A))
.
Martin Maechler, Date: 7 Apr 2007
There are more sophisticated proposals for computing the rank of a
matrix; for a couple of those, see rankMatrix
in the
Matrix package.
rankMM # - note the simple function definition
hilbert <- function(n) { i <- seq_len(n); 1/outer(i - 1L, i, "+") }
hilbert(4)
H12 <- hilbert(12)
rankMM(H12)          # 11 - numerically more realistic
rankMM(H12, tol = 0) # -> 12
## explanation :
round(log10(svd(H12, 0, 0)$d), 1)
Compute residuals of a fitted glmrob
model, i.e., robust
generalized linear model fit.
## S3 method for class 'glmrob'
residuals(object,
          type = c("deviance", "pearson", "working", "response", "partial"),
          ...)
object |
an object of class |
type |
the type of residuals which should be returned.
The alternatives are: |
... |
further arguments passed to or from other methods. |
The references in glm
define the types of residuals:
Davison & Snell is a good reference for the usages of each.
The partial residuals are a matrix of working residuals, with each column formed by omitting a term from the model.
The residuals (S3) method (see methods) for glmrob models has been modeled to closely follow the method for classical (non-robust) glm fitted models; see its documentation, residuals.glm, for further details.
See the references for the classical GLMs in glm.
glmrob
for computing object
, anova.glmrob
;
the corresponding generic functions, summary.glmrob
,
coef
,
fitted
,
residuals
.
### -------- Gamma family -- data from example(glm) ---
clotting <- data.frame(
  u    = c(5,10,15,20,30,40,60,80,100),
  lot1 = c(118,58,42,35,27,25,21,19,18),
  lot2 = c(69,35,26,21,18,16,13,12,12))
summary(cl <- glm   (lot1 ~ log(u), data = clotting, family = Gamma))
summary(ro <- glmrob(lot1 ~ log(u), data = clotting, family = Gamma))

clotM5.high <- within(clotting, { lot1[5] <- 60 })
cl5.high <- glm   (lot1 ~ log(u), data = clotM5.high, family = Gamma)
ro5.high <- glmrob(lot1 ~ log(u), data = clotM5.high, family = Gamma)
rr <- range(residuals(ro), residuals(cl), residuals(ro5.high))
plot(residuals(ro5.high) ~ residuals(cl5.high), xlim = rr, ylim = rr, asp = 1)
abline(0, 1, col = 2, lty = 3)
points(residuals(ro) ~ residuals(cl), col = "gray", pch = 3)

## Show all kinds of residuals:
r.types <- c("deviance", "pearson", "working", "response")
sapply(r.types, residuals, object = ro5.high)
Auxiliary function for passing the estimation options as parameters to the estimation functions.
rrcov.control(alpha = 1/2, method = c("covMcd", "covComed", "ltsReg"),
              nsamp = 500, nmini = 300, kmini = 5, seed = NULL,
              tolSolve = 1e-14, scalefn = "hrv2012", maxcsteps = 200,
              trace = FALSE, wgtFUN = "01.original", beta,
              use.correction = identical(wgtFUN, "01.original"),
              adjust = FALSE)
alpha |
This parameter controls the size of the subsets over
which the determinant is minimized, i.e., |
method |
a string specifying the “main” function for which
|
nsamp |
number of subsets used for initial estimates or |
nmini , kmini
|
for |
seed |
initial seed for R's random number generator; see
|
tolSolve |
numeric tolerance to be used for inversion
( |
scalefn |
(for deterministic |
maxcsteps |
integer specifying the maximal number of concentration steps for the deterministic MCD. |
trace |
logical or integer indicating whether to print
intermediate results. Default is |
wgtFUN |
a character string or |
beta |
a quantile, experimentally used for some of the prespecified
|
use.correction |
whether to use finite sample correction factors.
Defaults to |
adjust |
(for |
A list whose components are the parameters as passed by the invocation.
Valentin Todorov
For details, see the documentation about ltsReg
and
covMcd
, respectively.
data(Animals, package = "MASS") brain <- Animals[c(1:24, 26:25, 27:28),] data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) ctrl <- rrcov.control(alpha=0.75, trace=TRUE) covMcd(hbk.x, control = ctrl) covMcd(log(brain), control = ctrl)
data(Animals, package = "MASS") brain <- Animals[c(1:24, 26:25, 27:28),] data(hbk) hbk.x <- data.matrix(hbk[, 1:3]) ctrl <- rrcov.control(alpha=0.75, trace=TRUE) covMcd(hbk.x, control = ctrl) covMcd(log(brain), control = ctrl)
This is a data set consisting of measurements of water salinity (i.e., its salt concentration) and river discharge taken in North Carolina's Pamlico Sound, recording some bi-weekly averages in March, April, and May from 1972 to 1977. This dataset was listed by Ruppert and Carroll (1980). In Carroll and Ruppert (1985) the physical background of the data is described. They indicated that observations 5 and 16 correspond to periods of very heavy discharge, and showed that the discrepant observation 5 was masked by observations 3 and 16, i.e., only after deletion of these observations was it possible to identify the influential observation 5.
This data set is a prime example of the masking effect.
data(salinity, package="robustbase")
A data frame with 28 observations on the following 4 variables (in parentheses are the names used in the 1980 reference).
X1
:Lagged Salinity (‘SALLAG’)
X2
:Trend (‘TREND’)
X3
:Discharge (‘H2OFLOW’)
Y
:Salinity (‘SALINITY’)
The boot package contains another version of this salinity data set, also attributed to Ruppert and Carroll (1980), but with two clear transcription errors, see the examples.
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, p.82, table 5.
Ruppert, D. and Carroll, R.J. (1980) Trimmed least squares estimation in the linear model. JASA 75, 828–838; table 3, p.835.
Carroll, R.J. and Ruppert, D. (1985) Transformations in regression: A robust analysis. Technometrics 27, 1–12
data(salinity)
summary(lm.sali  <- lm       (Y ~ . , data = salinity))
summary(rlm.sali <- MASS::rlm(Y ~ . , data = salinity))
summary(lts.sali <- ltsReg   (Y ~ . , data = salinity))
salinity.x <- data.matrix(salinity[, 1:3])
c_sal <- covMcd(salinity.x)
plot(c_sal, "tolEllipsePlot")

## Connection with boot package's version :
if(requireNamespace("boot")) { ## 'always'
  print( head(boot.sal <- boot::salinity ) )
  print( head(robb.sal <- salinity [, c(4, 1:3)]) ) # difference: has one digit more
  ## Otherwise the same ?
  dimnames(robb.sal) <- dimnames(boot.sal)
  ## apart from the 4th column, they are "identical":
  stopifnot( all.equal(boot.sal[, -4], robb.sal[, -4], tol = 1e-15) )
  ## But the discharge ('X3', 'dis' or 'H2OFLOW') __differs__ in two places:
  plot(cbind(robustbase = robb.sal[,4], boot = boot.sal[,4]))
  abline(0, 1, lwd = 3, col = adjustcolor("red", 1/4))
  D.sal <- robb.sal[,4] - boot.sal[,4]
  stem(robb.sal[,4] - boot.sal[,4])
  which(abs(D.sal) > 0.01) ## 2 8
  ## *two* typos (=> difference ~= 1) in the version of 'boot': obs. 2 & 8 !!!
  cbind(robb = robb.sal[,4], boot = boot.sal[,4], D.sal)
} # boot
Computes the robust τ (tau) estimate of univariate scale, as proposed by Maronna and Zamar (2002), improved by a consistency factor.
scaleTau2(x, c1 = 4.5, c2 = 3.0, na.rm = FALSE, consistency = TRUE,
          mu0 = median(x), sigma0 = median(x.),
          mu.too = FALSE, iter = 1, tol.iter = 1e-7)
x |
numeric vector |
c1 , c2
|
non-negative numbers, specifying cutoff values for the biweighting of the mean and the rho function respectively. |
na.rm |
a logical value indicating whether |
consistency |
logical indicating if the consistency correction factor (for the scale) should be applied. |
mu0 |
the initial location estimate |
sigma0 |
the initial scale estimate |
mu.too |
logical indicating if both location and scale should be
returned or just the scale (when |
iter |
positive integer or logical indicating if and how many
iterations should be done. The default, |
tol.iter |
if |
First, s0 := MAD(x), i.e., the equivalent of mad(x, constant=1), is computed. Robustness weights w_i := w_{c1}((x_i - mu0)/s0) are computed, where w_c(u) = (1 - (u/c)^2)^2 for |u| <= c and w_c(u) = 0 otherwise. The robust location measure is defined as mu(X) := (sum_i w_i x_i) / (sum_i w_i), and the robust τ-estimate is s(X)^2 := s0^2 * (1/n) * sum_i rho_{c2}((x_i - mu(X)) / s0), where rho_c(u) = min(c^2, u^2).
When iter=TRUE or iter > 1, the above estimate is iterated in a fixpoint iteration, setting s0 to the current estimate s(X) and iterating until the number of iterations is larger than iter or the fixpoint is found, in the sense that the relative change of s(X) is smaller than tol.iter.
scaleTau2(*, consistency=FALSE) returns s(X), whereas this value is divided by its asymptotic limit when consistency = TRUE, as by default.

Note that for n = length(x) == 2, all equivariant scale estimates are proportional, and specifically, scaleTau2(x, consistency=FALSE) == mad(x, constant=1). See also the reference.
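To make the formulas above concrete, here is a minimal pure-R sketch of the non-iterated estimate (scaleTau2R is an illustrative name, not part of the package; under the formulas as reconstructed above it should agree with scaleTau2(*, consistency = FALSE)):

scaleTau2R <- function(x, c1 = 4.5, c2 = 3.0) {
  mu0 <- median(x)
  s0  <- median(abs(x - mu0))                     # MAD with constant = 1
  w   <- pmax(0, 1 - ((x - mu0)/(s0 * c1))^2)^2   # robustness weights w_i
  mu  <- sum(w * x) / sum(w)                      # robust location mu(X)
  s0 * sqrt(mean(pmin(c2^2, ((x - mu)/s0)^2)))    # tau-scale s(X)
}
x <- c(1:7, 1000)
c(R = scaleTau2R(x), C = scaleTau2(x, consistency = FALSE))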
numeric vector of length one (if mu.too is FALSE, as by default) or two (when mu.too = TRUE), with the robust scale or (location, scale) estimators.
Original by Kjell Konis with substantial modifications by Martin Maechler.
Maronna, R.A. and Zamar, R.H. (2002) Robust estimates of location and dispersion of high-dimensional datasets; Technometrics 44(4), 307–317.
Yohai, V.J., and Zamar, R.H. (1988). High breakdown-point estimates of regression by means of the minimization of an efficient scale. Journal of the American Statistical Association 83, 406–413.
Sn
, Qn
, mad
;
further covOGK
for which scaleTau2
was designed.
x <- c(1:7, 1000)
sd(x) # non-robust std.deviation
scaleTau2(x)
scaleTau2(x, mu.too = TRUE)
(sI  <- scaleTau2(c(x, Inf),     mu.too = TRUE))
(sIN <- scaleTau2(c(x, Inf, NA), mu.too = TRUE, na.rm = TRUE))
stopifnot({
  identical(sI, sIN)
  all.equal(scaleTau2(c(x, 999), mu.too = TRUE), sIN, tol = 1e-15)
})
if(doExtras <- robustbase:::doExtras()) {
  set.seed(11)
  ## show how much faster this is, compared to Qn
  x <- sample(c(rnorm(1e6), rt(5e5, df = 3)))
  (system.time(Qx  <- Qn(x)))        ## 2.04 [2017-09, lynne]
  (system.time(S2x <- scaleTau2(x))) ## 0.25 (ditto)
  cbind(Qn = Qx, sTau2 = S2x)
} ##        Qn     sTau2
  ##  1.072556  1.071258
A small counterexample data set devised by Andrew Siegel. Six (out of nine) data points lie on a common line, such that some robust regression estimators exhibit the “exact fit” property.
data(SiegelsEx, package="robustbase")
A data frame with 9 observations on the following 2 variables.
x
a numeric vector
y
a numeric vector
Emerson and Hoaglin (1983, p.139)
Peter J. Rousseeuw and Annick M. Leroy (1987) Robust Regression and Outlier Detection Wiley, p.60–61
data(SiegelsEx)
plot(SiegelsEx, main = "Siegel's example for 'exact fit'")
abline(          lm(y ~ x, data = SiegelsEx))
abline(MASS::lqs(y ~ x, data = SiegelsEx, method = "lms"), col = 2)
legend("topright", leg = c("lm", "LMS"), col = 1:2, lwd = 1, inset = 1/20)
Extract the estimated standard deviation of the errors, the “residual standard deviation” (also misnamed “residual standard error”), from a fitted model.
## S3 method for class 'lmrob'
sigma(object, ...)
object |
a fitted model. |
... |
additional, optional arguments. (None are used in our methods) |
For R <= 3.2.x
, we provide an (S3) generic function (as e.g.,
package lme4) and methods for lmrob
,
nlrob
, and nls
.
From R >= 3.3.0
, we provide methods for our
lmrob
and nlrob
models.
the residual standard error as a scalar
m.cl <- lm(Y ~ ., data = coleman)
if(getRversion() >= "3.3.0") sigma(m.cl) else summary(m.cl)$sigma
sigma( m1 <- lmrob(Y ~ ., data = coleman) )
sigma( m2 <- lmrob(Y ~ ., data = coleman, setting = "KS2014") )
“The Biweight on a Stick” —
Compute a smooth (when h > 0) weight function, typically for computing weights from large (robust) “distances”, using a piecewise polynomial function which in fact is a 2-parameter generalization of Tukey's 1-parameter “biweight”.
smoothWgt(x, c, h)
x |
numeric vector of abscissa values |
c |
“cutoff”, a typically positive number. |
h |
“bandwidth”, a positive number. |
Let w := smoothWgt(x, c, h). Then w = 1 for |x| <= c - h/2, w = 0 for |x| >= c + h/2, and w = (1 - ((|x| - (c - h/2))/h)^2)^2 in between, whence the name “biweight on a stick” (cf. the Examples, where c = h/2 recovers Tukey's biweight). smoothWgt() is scale invariant in the sense that smoothWgt(s*x, s*c, s*h) == smoothWgt(x, c, h) when s > 0.
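A quick numerical check of this piecewise form (sWgt is an illustrative re-implementation inferred from the biweight equivalence shown in the Examples, not the package's own code):

sWgt <- function(x, c, h) {
  ## rescale |x| so the flat core maps to 0 and the zero region to 1:
  u <- pmax(0, pmin(1, (abs(x) - (c - h/2)) / h))
  (1 - u^2)^2
}
xx <- seq(-3, 3, by = 0.01)
stopifnot(all.equal(sWgt(xx, c = 1/2, h = 1), pmax(0, 1 - xx^2)^2))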
a numeric vector of the same length as x
with weights between
zero and one. Currently all attributes
including
dim
and names
are dropped.
Martin Maechler
Mwgt(.., psi = "bisquare")
of which smoothWgt()
is a generalization, and
Mwgt(.., psi = "optimal")
which looks similar for larger
c
with its constant one part around zero, but also has only
one parameter.
## a somewhat typical picture:
curve(smoothWgt(x, c = 3, h = 1), -5, 7, n = 1000)
csW <- curve(smoothWgt(x, c = 1/2, h = 1), -2, 2) # cutoff 1/2, bandwidth 1
## Show that the above is the same as
## Tukey's "biweight" or "bi-square" weight function:
bw  <- function(x) pmax(0, (1 - x^2))^2
cbw <- curve(bw, col = adjustcolor(2, 1/2), lwd = 2, add = TRUE)
cMw <- curve(Mwgt(x, c = 1, "biweight"), col = adjustcolor(3, 1/2), lwd = 2, add = TRUE)
stopifnot(## proving they are all the same:
  all.equal(csW, cbw, tol = 1e-15),
  all.equal(csW, cMw, tol = 1e-15))
Compute the robust scale estimator S_n, an efficient alternative to the MAD.
Sn(x, constant = 1.1926, finite.corr = missing(constant), na.rm = FALSE)
s_Sn(x, mu.too = FALSE, ...)
x |
numeric vector of observations. |
constant |
number by which the result is multiplied; the default achieves consistency for normally distributed data. |
finite.corr |
logical indicating if the finite sample bias
correction factor should be applied. Defaults to missing(constant), i.e., TRUE unless constant is specified. |
na.rm |
logical specifying if missing values ( |
mu.too |
logical indicating if the median(x) should also be returned. |
... |
potentially further arguments for |
S_n is, up to the consistency factor constant, the “low” median over i of the “high” median over j of the absolute differences |x_i - x_j|, i.e., S_n = constant * lomed_i (himed_j |x_i - x_j|); see Rousseeuw and Croux (1993).
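A minimal pure-R sketch of this nested-medians definition (Sn0R is an illustrative name; O(n^2), without the finite-sample correction):

Sn0R <- function(x, constant = 1.1926) {
  n <- length(x)
  himed <- function(v) sort(v)[n %/% 2 + 1]             # "high" median of n values
  m <- vapply(x, function(xi) himed(abs(xi - x)), 0.0)  # inner high medians
  constant * sort(m)[(n + 1) %/% 2]                     # outer "low" median
}
x <- c(1:10, 100 + 1:9)
Sn0R(x, constant = 1) ## 9, matching Sn(x, constant = 1) in the Examples below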
Sn() returns a number, the S_n robust scale estimator, scaled to be consistent for σ and i.i.d. Gaussian observations, optionally bias corrected for finite samples.

s_Sn(x, mu.too = TRUE) returns a length-2 vector with location (μ) and scale; this is typically only useful for covOGK(*, sigmamu = s_Sn).
Original Fortran code:
Christophe Croux and Peter Rousseeuw [email protected].
Port to C and R: Martin Maechler, [email protected]
Rousseeuw, P.J. and Croux, C. (1993) Alternatives to the Median Absolute Deviation, Journal of the American Statistical Association 88, 1273–1283.
mad
for the ‘most robust’ but much
less efficient scale estimator;
Qn
for a similar more efficient but slower alternative;
scaleTau2
.
x <- c(1:10, 100 + 1:9) # 9 outliers out of 19
Sn(x)
Sn(x, c = 1)       # 9
Sn(x[1:18], c = 1) # 9
set.seed(153)
x <- sort(c(rnorm(80), rt(20, df = 1)))
s_Sn(x, mu.too = TRUE)
(s <- Sn(c(1:4, 10, Inf, NA), na.rm = TRUE))
stopifnot(is.finite(s), all.equal(3.5527554, s, tol = 1e-10))
Splits the design matrix into categorical and continuous predictors. Categorical variables are variables that are factors, ordered factors, or character vectors.
splitFrame(mf, x = model.matrix(mt, mf), type = c("f","fi", "fii"))
mf |
model frame (as returned by |
x |
(optional) design matrix, defaulting to the derived
|
type |
a character string specifying the split type (see details). |
Which split type is used can be controlled with the setting
split.type
in lmrob.control
.
There are three split types. The only differences between the types are how interactions between categorical and continuous variables are handled. The extra types of splitting can be used to avoid ‘too many singular resamples’ errors.
Type "f"
, the default, assigns only the intercept, categorical and
interactions of categorical variables to x1
. Interactions of
categorical and continuous variables are assigned to x2
.
Type "fi"
assigns also interactions between categorical and
continuous variables to x1
.
Type "fii"
assigns not only interactions between categorical and
continuous variables to x1
, but also the (corresponding)
continuous variables themselves.
A list that includes the following components:
x1 |
design matrix containing only categorical variables |
x1.idx |
logical vectors of the variables considered categorical in the original design matrix |
x2 |
design matrix containing the continuous variables |
Manuel Koller
Maronna, R. A., and Yohai, V. J. (2000). Robust regression with both continuous and categorical predictors. Journal of Statistical Planning and Inference 89, 197–214.
data(education)
education <- within(education, Region <- factor(Region))
educaCh   <- within(education, Region <- as.character(Region))
## no interactions -- same split for all types:
fm1 <- lm(Y ~ Region + X1 + X2 + X3, education)
fmC <- lm(Y ~ Region + X1 + X2 + X3, educaCh)
splt <- splitFrame(fm1$model) ; str(splt)
splC <- splitFrame(fmC$model)
stopifnot(identical(splt, splC))
## with interactions:
fm2 <- lm(Y ~ Region:X1:X2 + X1*X2, education)
s1 <- splitFrame(fm2$model, type = "f"  )
s2 <- splitFrame(fm2$model, type = "fi" )
s3 <- splitFrame(fm2$model, type = "fii")
cbind(s1$x1.idx, s2$x1.idx, s3$x1.idx)
rbind(p.x1 = c(ncol(s1$x1), ncol(s2$x1), ncol(s3$x1)),
      p.x2 = c(ncol(s1$x2), ncol(s2$x2), ncol(s3$x2)))
Data for the Hertzsprung-Russell Diagram of the Star Cluster CYG OB1, which contains 47 stars in the direction of Cygnus, from C. Doom. The first variable is the logarithm of the effective temperature at the surface of the star (Te) and the second one is the logarithm of its light intensity (L/L_s).
In the Hertzsprung-Russell diagram, which is the scatterplot of these data points, where the log temperature is plotted from left to right, two groups of points are seen: the majority, which tends to follow a steep band, and four stars in the upper corner. In astronomy, the 43 stars are said to lie on the main sequence and the four remaining stars are called “giants” (the points 11, 20, 30, 34).
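A robust method should flag exactly those giants; a quick sketch, using the mcd.wt component of covMcd() that holds the 0/1 reweighting weights (the flagged set may contain a few more points than the four giants):

data(starsCYG)
cst <- covMcd(starsCYG)
## observations given weight 0 by the MCD reweighting step;
## the giants 11, 20, 30, 34 should be among them:
which(cst$mcd.wt == 0)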
data(starsCYG, package="robustbase")
A data frame with 47 observations on the following 2 variables
log.Te
Logarithm of the effective temperature at the surface of the star (Te).
log.light
Logarithm of its light intensity (L/L_s).
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, p.27, table 3.
data(starsCYG)
plot(starsCYG)
cst <- covMcd(starsCYG)
lm.stars <- lm(log.light ~ log.Te, data = starsCYG)
summary(lm.stars)
plot(lm.stars)
lts.stars <- ltsReg(log.light ~ log.Te, data = starsCYG)
plot(lts.stars)
The monthly use of steam (Steam) in a factory may be modeled and described as a function of the operating days per month (Operating.Days) and the mean outside temperature per month (Temperature).
data("steamUse", package="robustbase")
A data frame with 25 observations on the following 9 variables.
Steam
: regression response, the pounds of steam used monthly.
fattyAcid
: pounds of Real Fatty Acid in storage per month.
glycerine
: pounds of crude glycerine made.
wind
: average wind velocity in miles per hour (a numeric vector).
days
: an integer vector with the number of days of that month, i.e., in 28..31.
op.days
: the number of operating days for the given month (integer).
freeze.d
: the number of days below 32 degrees Fahrenheit (0 °C, the freezing temperature of water).
temperature
: a numeric vector of average outside temperature in Fahrenheit (F).
startups
: the number of startups (of production in that month).
No further information is given in Draper and Smith about the place and exact years of the measurements, though some educated guesses are possible; see the examples.
Data from Draper and Smith, 1st ed, 1966; appendix A.
A version of this has been used in teaching at SfS ETH Zurich, since at least 1996, https://stat.ethz.ch/Teaching/Datasets/NDK/dsteam.dat
The package aprean3 contains all data sets from the 3rd edition of Draper and Smith (1998), among them this data set, with variable names x1 .. x10 (x9 being wind^2, hence extraneous).
Draper and Smith (1981) Applied Regression Analysis (2nd ed., p. 615 ff)
## Not run:
if(require("aprean3")) { # show how 'steamUse' is related to 'dsa01a'
  stm <- dsa01a
  names(stm) <- c("Steam", "fattyAcid", "glycerine", "wind",
                  "days", "op.days", "freeze.d", "temperature",
                  "wind.2", "startups")
  ## prove that wind.2 is wind^2, "traditionally" rounded to 1 digit:
  stopifnot(all.equal(floor(0.5 + 10*stm[,"wind"]^2)/10,
                      stm[,"wind.2"], tol = 1e-14))
  ## hence drop it
  steamUse <- stm[, names(stm) != "wind.2"]
}
## End(Not run)
data(steamUse)
str(steamUse)
## Looking at this,
cbind(M = rep_len(month.abb, 25), steamUse[, 5:8, drop=FALSE])
## one will conjecture that these were 25 months, Jan--Jan in a row,
## starting in a leap year (perhaps 1960 ?).
plot(steamUse)
summary(fm1 <- lmrob(Steam ~ temperature + op.days, data=steamUse))
## diagnoses 2 outliers: month of July, maybe company-wide summer vacations
## KS2014 alone seems not robust enough:
summary(fm.14 <- lmrob(Steam ~ temperature + op.days, data=steamUse,
                       setting="KS2014"))
pairs(Steam ~ temperature + op.days, steamUse)
Print a nice “summary” about a numeric vector of robustness weights. Observations with weights around zero are marked as outliers.
summarizeRobWeights(w, digits = getOption("digits"),
                    header = "Robustness weights:",
                    eps = 0.1 / length(w), eps1 = 1e-3, ...)
w |
numeric vector of robustness weights. |
digits |
digits to be used for |
header |
string to be printed as header line. |
eps |
numeric tolerance: observations with weights smaller than eps are marked as outliers. |
eps1 |
numeric tolerance: values of w within eps1 of 0 or 1 are considered to be 0 or 1, respectively. |
... |
potential further arguments, passed to
|
none; the function is used for its side effect of printing.
Martin Maechler
The summary methods for lmrob and glmrob make use of summarizeRobWeights().

Our methods for weights(), weights.lmrob(*, type="robustness") and weights.glmrob(*, type="robustness").
w <- c(1,1,1,1,0, 1,1,1,1,0, 1,1, .9999, .99999, .5, .6, 1e-12)
summarizeRobWeights(w)              # two outside ~= {0,1}
summarizeRobWeights(w, eps1 = 5e-5) # now three outside {0,1}
## See the summary(<lmrob>) outputs
The summary method for class "glmrob" summarizes robust fits of (currently only discrete) generalized linear models.
## S3 method for class 'glmrob'
summary(object, correlation = FALSE, symbolic.cor = FALSE, ...)
## S3 method for class 'glmrob'
vcov(object, ...)
## S3 method for class 'summary.glmrob'
print(x, digits = max(3, getOption("digits") - 3),
      symbolic.cor = x$symbolic.cor,
      signif.stars = getOption("show.signif.stars"), ...)
object |
an object of class "glmrob". |
correlation |
logical; if TRUE, the correlation matrix of the estimated coefficients is returned and printed. |
symbolic.cor |
logical. If TRUE, print the correlations in a symbolic form rather than as numbers. |
... |
further arguments passed to or from other methods. |
x |
an object of class "summary.glmrob". |
digits |
the number of digits to use for printing. |
signif.stars |
logical indicating if the P-values should be visualized by so-called “significance stars”. |
summary.glmrob returns an object of class "summary.glmrob".

Its print() method tries to be smart about formatting the coefficients, standard errors, etc., and gives “significance stars” if signif.stars is TRUE (the default when options were not changed).
The function summary.glmrob computes and returns a list of summary statistics of the robustly fitted generalized linear model given in object. The following elements are in the list:
... |
FIXME |
Andreas Ruckstuhl
glmrob; the generic summary and also summary.glm.
data(epilepsy)
Rmod <- glmrob(Ysum ~ Age10 + Base4*Trt, family = poisson,
               data = epilepsy, method = "Mqle")
ss <- summary(Rmod)
ss      ## calls print.summary.glmrob()
str(ss) ## internal STRucture of summary object
Summary method for R objects of class "lmrob", and print method for the summary object.

Further, the methods fitted() and residuals() work (via the default methods), and predict() (see predict.lmrob), vcov(), weights() (see weights.lmrob), model.matrix(), confint(), dummy.coef(), hatvalues(), etc., have explicitly defined lmrob methods. .lmrob.hat() is the lower level “work horse” of the hatvalues() method.
## S3 method for class 'lmrob'
summary(object, correlation = FALSE, symbolic.cor = FALSE, ...)
## S3 method for class 'summary.lmrob'
print(x, digits = max(3, getOption("digits") - 3),
      symbolic.cor = x$symbolic.cor,
      signif.stars = getOption("show.signif.stars"),
      showAlgo = TRUE, ...)
## S3 method for class 'lmrob'
vcov(object, cov = object$control$cov, complete = TRUE, ...)
## S3 method for class 'lmrob'
model.matrix(object, ...)
object |
an R object of class "lmrob". |
correlation |
logical variable indicating whether to compute the correlation matrix of the estimated coefficients. |
symbolic.cor |
logical indicating whether to use symbols to display the above correlation matrix. |
x |
an R object of class "summary.lmrob". |
digits |
number of digits for printing, see |
signif.stars |
logical variable indicating whether to use stars to display different levels of significance in the individual t-tests. |
showAlgo |
optional logical indicating if the algorithmic parameters should be shown. |
cov |
covariance estimation function to use, a
object$cov <- vcov(object, cov = ".vcov.w") allows one to update the fitted object. |
complete |
(mainly for R |
... |
potentially more arguments passed to methods. |
summary(object) returns an object of S3 class "summary.lmrob", basically a list with components "call", "terms", "residuals", "scale", "rweights", "converged", "iter", "control", all copied from object, and further components, partly for compatibility with summary.lm:
coefficients |
a |
df |
degrees of freedom, in an |
sigma |
identical to |
aliased |
.. |
cov |
derived from |
r.squared |
robust “R squared” or |
adj.r.squared |
an adjusted R squared, see |
Renaud, O. and Victoria-Feser, M.-P. (2010). A robust coefficient of determination for regression, Journal of Statistical Planning and Inference 140, 1852-1862.
lmrob, predict.lmrob, weights.lmrob, summary.lm, print, summary.
mod1 <- lmrob(stack.loss ~ ., data = stackloss)
sa <- summary(mod1) # calls summary.lmrob(....)
sa # dispatches to call print.summary.lmrob(....)
## correlation between estimated coefficients:
cov2cor(vcov(mod1))
cbind(fit   = fitted(mod1),
      resid = residuals(mod1),
      wgts  = weights(mod1, type="robustness"),
      predict(mod1, interval="prediction"))
data(heart)
sm2 <- summary( m2 <- lmrob(clength ~ ., data = heart) )
sm2
summary method for class "lts".
## S3 method for class 'lts'
summary(object, correlation = FALSE, ...)
## S3 method for class 'summary.lts'
print(x, digits = max(3, getOption("digits") - 3),
      signif.stars = getOption("show.signif.stars"), ...)
object |
an object of class "lts". |
correlation |
logical; if TRUE, the correlation matrix of the estimated coefficients is returned and printed. |
x |
an object of class "summary.lts". |
digits |
the number of significant digits to use when printing. |
signif.stars |
logical indicating if “significance stars” should be printed, see printCoefmat. |
... |
further arguments passed to or from other methods. |
These functions compute and print summary statistics for weighted least squares estimates with weights based on LTS estimates. The statistics are therefore similar to those for LS, but all terms are multiplied by the corresponding weight.
Correlations are printed to two decimal places; to see the actual correlations, print summary(object)$correlation directly.
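The weighted-least-squares connection can be checked numerically; a small sketch, assuming the fitted "lts" object stores its 0/1 weights in the lts.wt component (the same data as in the Examples below):

data(Animals2)
ltsA <- ltsReg(log(brain) ~ log(body), data = Animals2)
## refit by ordinary LS using the LTS-based 0/1 weights:
lmW <- lm(log(brain) ~ log(body), data = Animals2,
          weights = ltsA$lts.wt)
## the reweighted LTS coefficients should agree with the weighted LS fit:
rbind(lts = coef(ltsA), wls = coef(lmW))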
The function summary.lts computes and returns a list of summary statistics of the fitted linear model given in object, using the components of this object (list elements).
residuals |
the residuals - a vector like the response |
coefficients |
a |
sigma |
the estimated scale of the reweighted residuals
where |
df |
degrees of freedom, a 3-vector |
fstatistic |
(for models including non-intercept terms) a 3-vector with the value of the F-statistic with its numerator and denominator degrees of freedom. |
r.squared |
where |
adj.r.squared |
the above |
cov.unscaled |
a |
correlation |
the correlation matrix corresponding to the above
|
data(Animals2)
ltsA <- ltsReg(log(brain) ~ log(body), data = Animals2)
(slts <- summary(ltsA))
## non-default options for printing the summary:
print(slts, digits = 5, signif.stars = FALSE)
summary method for class "mcd".
## S3 method for class 'mcd'
summary(object, ...)
## S3 method for class 'summary.mcd'
print(x, digits = max(3, getOption("digits") - 3), print.gap = 2, ...)
object , x
|
an object of class "mcd" (for summary) or "summary.mcd" (for print). |
digits |
the number of significant digits to use when printing. |
print.gap |
number of horizontal spaces between numbers; see also print.default. |
... |
further arguments passed to or from other methods. |
summary.mcd(), the S3 method, simply returns an (S3) object of class "summary.mcd" for which there's a print method:

print.summary.mcd prints summary statistics for the weighted covariance matrix and location estimates with weights based on MCD estimates. While the function print.mcd prints only the robust estimates of the location and the covariance matrix, print.summary.mcd will also print the correlation matrix (if requested in the call to covMcd with cor=TRUE), the eigenvalues of the covariance or correlation matrix, and the robust (“Mahalanobis”) distances.
summary.mcd returns a summary.mcd object, whereas the print methods return their first argument via invisible, as all print methods do.
data(Animals, package = "MASS")
brain <- Animals[c(1:24, 26:25, 27:28), ]
lbrain <- log(brain)
summary(cLB <- covMcd(lbrain))
summary method for objects of class "nlrob", i.e., nlrob() results. Currently it only works for nlrob(*, method="M").
## S3 method for class 'nlrob'
summary(object, correlation = FALSE, symbolic.cor = FALSE, ...)
object |
an object of class "nlrob". |
correlation |
logical variable indicating whether to compute the correlation matrix of the estimated coefficients. |
symbolic.cor |
logical indicating whether to use symbols to display the above correlation matrix. |
... |
further arguments passed to or from other methods. |
The function summary.nlrob computes and returns an object of class "summary.nlrob" of summary statistics of the robustly fitted nonlinear model given in object.

There is a print method, print.summary.lmrob(), which nicely formats the output.

The result keeps a large part of object's components, such as residuals, cov or w, and additionally contains:
coefficients |
the matrix of coefficients, standard errors and p-values. |
correlation |
if the correlation argument is TRUE, the correlation matrix of the estimated coefficients. |
Andreas Ruckstuhl
nlrob(), also for examples.
Number of international calls from Belgium, taken from the Belgian Statistical Survey, published by the Ministry of Economy.
data(telef, package="robustbase")
A data frame with 24 observations on the following 2 variables.
Calls
Number of Calls (in tens of millions)
Year
Year (1950 - 1973)
P. J. Rousseeuw and A. M. Leroy (1987) Robust Regression and Outlier Detection; Wiley, page 26, table 2.
data(telef)
summary(lm.telef <- lm(Year ~ ., data = telef))
Plots the 0.975 tolerance ellipse of the bivariate data set x. The ellipse is defined by those data points whose distance is equal to the square root of the 0.975 chi-square quantile with 2 degrees of freedom.
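The corresponding cutoff distance is easy to compute directly:

## radius of the 97.5% tolerance ellipse in (robust) distance:
sqrt(qchisq(0.975, df = 2))  # ~ 2.7162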
tolEllipsePlot(x, m.cov = covMcd(x), cutoff = NULL, id.n = NULL,
               classic = FALSE, tol = 1e-07,
               xlab = "", ylab = "", main = "Tolerance ellipse (97.5%)",
               txt.leg = c("robust", "classical"),
               col.leg = c("red", "blue"),
               lty.leg = c("solid","dashed"))
x |
a two dimensional matrix or data frame. |
m.cov |
an object similar to those of class |
cutoff |
numeric distance needed to flag data points outside the ellipse. |
id.n |
number of observations to be identified by a label. If
not supplied, the number of observations with distance larger than
cutoff is used. |
classic |
whether to plot the classical distances as well; FALSE by default. |
tol |
tolerance to be used for computing the inverse, see solve. |
xlab , ylab , main
|
passed to plot(). |
txt.leg , col.leg , lty.leg
|
character vectors of length 2 for the legend, only used if classic = TRUE. |
Peter Filzmoser, Valentin Todorov and Martin Maechler
covPlot, which calls tolEllipsePlot() when desired. ellipsoidhull and predict.ellipsoid from package cluster.
data(hbk)
hbk.x <- data.matrix(hbk[, 1:3])
mcd <- covMcd(hbk.x) # compute mcd in advance
## must be a 2-dimensional data set: take the first two columns :
tolEllipsePlot(hbk.x[, 1:2])
## an "impressive" example:
data(telef)
tolEllipsePlot(telef, classic = TRUE)
The aim of the experiment was to predict the toxicity of carboxylic acids on the basis of several molecular descriptors.
data(toxicity, package="robustbase")
A data frame with 38 observations on the following 10 variables which are attributes for carboxylic acids:
toxicity
aquatic toxicity, defined as log(IGC50^(-1)); typically the “response”.
logKow
log Kow, the partition coefficient
pKa
pKa: the dissociation constant
ELUMO
Energy of the lowest unoccupied molecular orbital
Ecarb
Electrotopological state of the carboxylic group
Emet
Electrotopological state of the methyl group
RM
Molar refractivity
IR
Refraction index
Ts
Surface tension
P
Polarizability
The website accompanying the MMY-book: https://www.wiley.com/legacy/wileychi/robust_statistics/
Maguna, F.P., Núñez, M.B., Okulik, N.B. and Castro, E.A. (2003) Improved QSAR analysis of the toxicity of aliphatic carboxylic acids; Russian Journal of General Chemistry 73, 1792–1798.
data(toxicity)
summary(toxicity)
plot(toxicity)
plot(toxicity ~ pKa, data = toxicity)
## robustly scale the data (to scale 1) using Qn
(scQ.tox <- sapply(toxicity, Qn))
scTox <- scale(toxicity, center = FALSE, scale = scQ.tox)
csT <- covOGK(scTox, n.iter = 2, sigmamu = s_Qn,
              weight.fn = hard.rejection)
as.dist(round(cov2cor(csT$cov), 2))
These are deprecated, replaced by Mchi(*, psi="tukey") and Mpsi(*, psi="tukey").

tukeyPsi1() computes Tukey's bi-square score (psi) function, its first derivative, or its integral/“principal function”. This is scaled such that psi'(0) = 1, i.e., psi(x) ~= x around 0.

tukeyChi() computes Tukey's bi-square loss function, chi(x), and its first two derivatives. Note that in the general context of M-estimators, these loss functions are called rho-functions.
tukeyPsi1(x, cc, deriv = 0)
tukeyChi (x, cc, deriv = 0)
x |
numeric vector. |
cc |
tuning constant |
deriv |
integer specifying the order of the derivative: in -1:1 for tukeyPsi1 (where -1 gives the integral) and in 0:2 for tukeyChi. |
a numeric vector of the same length as x.

tukeyPsi1(x, d) and tukeyChi(x, d+1) are just re-scaled versions of each other (for d in -1:1), i.e., tukeyChi(x, cc, deriv = d+1) == (6/cc^2) * tukeyPsi1(x, cc, deriv = d).
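A quick numerical check of this re-scaling identity (a small sketch; the factor 6/cc^2 follows from the normalizations psi'(0) = 1 and chi(x) = 1 for |x| >= cc):

x  <- seq(-3, 3, by = 0.25)
cc <- 1.55
for(d in -1:1)  # tukeyChi(*, deriv = d+1) vs tukeyPsi1(*, deriv = d)
  stopifnot(all.equal(tukeyChi (x, cc, deriv = d + 1),
                      6/cc^2 * tukeyPsi1(x, cc, deriv = d)))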
We use the name ‘tukeyPsi1’, because tukeyPsi is reserved for a future “Psi Function” class object, see psiFunc.
Matias Salibian-Barrera, Martin Maechler and Andreas Ruckstuhl
lmrob and Mpsi; further anova.lmrob, which needs deriv = -1.
op <- par(mfrow = c(3,1), oma = c(0,0, 2, 0), mgp = c(1.5, 0.6, 0),
          mar = .1 + c(3,4,3,2))
x <- seq(-2.5, 2.5, length = 201)
cc <- 1.55 # as set by default in lmrob.control()
plot. <- function(...) { plot(...); abline(h=0, v=0, col="gray", lty=3) }
plot.(x, tukeyChi(x, cc), type = "l", col = 2)
plot.(x, tukeyChi(x, cc, deriv = 1), type = "l", col = 2)
plot.(x, tukeyChi(x, cc, deriv = 2), type = "l", col = 2)
mtext(sprintf("tukeyChi(x, c = %g, deriv), deriv = 0,1,2", cc),
      outer = TRUE, font = par("font.main"), cex = par("cex.main"))
par(op)

op <- par(mfrow = c(3,1), oma = c(0,0, 2, 0), mgp = c(1.5, 0.6, 0),
          mar = .1 + c(3,4,1,1))
x <- seq(-5, 5, length = 201)
cc <- 4.69 # as set by default in lmrob.control()
plot. <- function(...) { plot(..., asp = 1); abline(h=0, v=0, col="gray", lty=3) }
plot.(x, tukeyPsi1(x, cc), type = "l", col = 2)
abline(0:1, lty = 3, col = "light blue")
plot.(x, tukeyPsi1(x, cc, deriv = -1), type = "l", col = 2)
plot.(x, tukeyPsi1(x, cc, deriv = 1), type = "l", col = 2); abline(h=1, lty=3)
mtext(sprintf("tukeyPsi1(x, c = %g, deriv), deriv = 0, -1, 1", cc),
      outer = TRUE, font = par("font.main"), cex = par("cex.main"))
par(op)
Finney's data on vaso constriction in the skin of the digits.
data(vaso, package="robustbase")
A data frame with 39 observations on the following 3 variables.
Volume
Inhaled volume of air
Rate
Rate of inhalation
Y
vector of 0 or 1 values.
The data taken from Finney (1947) were obtained in a carefully controlled study in human physiology where a reflex “vaso constriction” may occur in the skin of the digits after taking a single deep breath. The response y is the occurrence (y = 1) or non-occurrence (y = 0) of vaso constriction in the skin of the digits of a subject after he or she inhaled a certain volume of air at a certain rate. The responses of three subjects are available. The first contributed 9 responses, the second contributed 8 responses, and the third contributed 22 responses.
Although the data represent repeated measurements, an analysis that assumes independent observations may be applied, as claimed by Pregibon (1981).
Finney, D.J. (1947) The estimation from individual records of the relationship between dose and quantal response. Biometrika 34, 320–334
Atkinson, A.C. and Riani, M. (2000) Robust Diagnostic Regression Analysis, First Edition. New York: Springer, Table A.23.
Fahrmeir, L. and Tutz, G. (2001) Multivariate Statistical Modelling Based on Generalized Linear Models, Springer, Table 4.2.
Kuensch, H.R., Stefanski, L.A. and Carroll, R.J. (1989) Conditionally unbiased bounded influence estimation in general regression models, with applications to generalized linear models, JASA 84, 460–466.
Pregibon, D. (1981) Logistic regression diagnostics, Annals of Statistics 9, 705–724.
data(vaso)
str(vaso)
pairs(vaso)
glmV <- glm(Y ~ log(Volume) + log(Rate), family = binomial, data = vaso)
summary(glmV)
## --> example(glmrob) showing classical & robust GLM
Wagner (1994) investigates the rate of employment growth (y) as a function of the percentage of people engaged in production activities (PA) and higher services (HS), and of the growth of these percentages (GPA, GHS) during three time periods in 21 geographical regions of the greater Hannover area.
data(wagnerGrowth, package="robustbase")
A data frame with 63 observations (one per Region x Period) on the following 7 variables.
Region
a factor with 21 levels, denoting the corresponding region in Hannover (conceptually a “block factor”).
PA
numeric: percent of people involved in production activities.
GPA
growth of PA.
HS
a numeric vector
GHS
a numeric vector
y
a numeric vector
Period
a factor with levels 1:3, denoting the time period: 1 = 1979-1982, 2 = 1983-1988, 3 = 1989-1992.
Hubert, M. and Rousseeuw, P. J. (1997). Robust regression with both continuous and binary regressors, Journal of Statistical Planning and Inference 57, 153–163.
Wagner J. (1994). Regionale Beschäftigungsdynamik und höherwertige Produktionsdienste: Ergebnisse für den Grossraum Hannover (1979-1992). Raumforschung und Raumordnung 52, 146–150.
data(wagnerGrowth)
## maybe str(wagnerGrowth)
require(lattice)
(xyplot(y ~ Period | Region, data = wagnerGrowth,
        main = "wagnerGrowth: 21 regions @ Hannover"))
(dotplot(y ~ reorder(Region, y, median), data = wagnerGrowth,
         main = "wagnerGrowth",
         xlab = "Region [ordered by median(y | Region) ]"))
weights() extracts robustness weights or fitting (or prior) weights from a lmrob or glmrob object.
## S3 method for class 'lmrob'
weights(object, type = c("prior", "robustness"), ...)
object |
an object of class "lmrob" or "glmrob". |
type |
the type of weights to be returned. Either
|
... |
not used currently. |
The “prior weights” correspond to the weights specified via the “weights” argument when calling lmrob. The “robustness weights” are the weights psi(r_i/S) / (r_i/S) assigned by the M-estimator of regression. The robust coefficient estimate then numerically corresponds to a weighted least squares fit using the product of both types of weights as weights.
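That correspondence can be checked numerically; a small sketch (stackloss is an arbitrary example; with the default all-one prior weights, the product reduces to the robustness weights):

m  <- lmrob(stack.loss ~ ., data = stackloss)
## weighted LS with the robustness weights reproduces the robust fit:
mW <- lm(stack.loss ~ ., data = stackloss,
         weights = weights(m, type = "robustness"))
all.equal(coef(m), coef(mW), tolerance = 1e-6)  # TRUE, up to IRLS tolerance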
Weights extracted from the object object.
Manuel Koller and Martin Maechler.
Compute the weighted Hi-Median of x.
wgt.himedian(x, weights = rep(1, n))
x |
numeric vector |
weights |
numeric vector of weights; of the same length as x. |
This is rather a by-product of the code used in Sn and Qn. We currently plan to replace it with more general weighted quantiles.
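For integer weights, the weighted hi-median can presumably be read as the high median (the order statistic at position n %/% 2 + 1) of the sample expanded by its weights; a small sketch of that reading:

x <- c(1:6, 20)
w <- c(rep(1, 6), 5)
y <- sort(rep(x, w))     # expand by the integer weights (length 11)
y[length(y) %/% 2 + 1]   # high median: 6
wgt.himedian(x, w)       # should agree: 6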
median; also wtd.quantile from package Hmisc.
x <- c(1:6, 20)
median(x) ## 4
stopifnot(all.equal(4, wgt.himedian(x)),
          all.equal(6, wgt.himedian(x, c(rep(1,6), 5))))
The original data are from Draper and Smith (1966) and were used to determine the influence of anatomical factors on wood specific gravity, with five explanatory variables and an intercept. These data were contaminated by replacing a few observations with outliers.
data(wood, package="robustbase")
A data frame with 20 observations on the following 6 variables.
x1, x2, x3, x4, x5
explanatory “anatomical” wood variables.
y
wood specific gravity, the target variable.
Draper and Smith (1966, p.227)
Peter J. Rousseeuw and Annick M. Leroy (1987) Robust Regression and Outlier Detection Wiley, p.243, table 8.
data(wood)
plot(wood)
summary( lm.wood <- lm(y ~ ., data = wood))
summary(rlm.wood <- MASS::rlm(y ~ ., data = wood))
summary(lts.wood <- ltsReg(y ~ ., data = wood))
wood.x <- as.matrix(wood)[, 1:5]
c_wood <- covMcd(wood.x)
c_wood
x30o50, called ‘'XX'’ in the thesis, has been a running example for which mc() failed to converge.

A numeric vector of 50 values, 30 of which are very close to zero; specifically, their absolute values are less than 1.5e-15. The remaining 20 values (11 negative, 9 positive) have absolute values between 0.0022 and 1.66.
data(x30o50, package="robustbase")
A summary is

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-1.66006  0.00000  0.00000 -0.04155  0.00000  1.29768

notably, the 1st to 3rd quartiles are all very close to zero.
A good robust method will treat the 60% “almost zero” values as “good” data and all others as outliers. This is somewhat counter-intuitive to typical human perception, where the 30 almost-zero numbers would be considered inliers and the remaining 20 as “good” data.
The original mc() algorithm, and also the amendments up to 2022 (robustbase versions before 0.95), would fail to converge unless (in newer versions) eps1 was increased, e.g., only by a factor of 10, to eps1 = 1e-13.
Lukas Graz (2021); unpublished BSc thesis, see mc.
data(x30o50)
## have 4 duplicated values :
table(dX <- duplicated(x30o50))
x30o50[dX] # 0 2.77e-17 4.16e-17 2.08e-16
sort(x30o50[dX]) * 2^56 # 0 2 3 15
## and they are c(0,2,3,15)*2^-56
table(sml <- abs(x30o50) < 1e-11)# 20 30
summary(x30o50[ sml]) # -1.082e-15 ... 1.499e-15 ; mean = 9.2e-19 ~~ 0
summary(x30o50[!sml])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -1.6601 -0.4689 -0.0550 -0.1039  0.3986  1.2977
op <- par(mfrow=c(3,1), mgp=c(1.5, .6, 0), mar = .3+c(2,3:1))
(Fn. <- ecdf(x30o50)) # <- only 46 knots (as have 4 duplications)
plot(Fn.)
## and zoom in (*drastically*) to around x=0 :
for(f in c(1e-13, 1.5e-15)) {
  plot(Fn., xval=f*seq(-1,1, length.out = 1001), ylim=c(0,1),
       main="[zoomed in]")
  if(f == 1e-13) rect(-1e-15,0, +1e-15, 1, col="thistle", border=1)
  plot(Fn., add=TRUE)
}
par(op)
mcOld <- function(x, ..., doScale=TRUE) mc(x, doScale=doScale, c.huberize=Inf, ...)
try( mcOld(x30o50) ) # Error: .. not 'converged' in 100 iteration
mcOld(x30o50, eps1 = 1e-12) # -0.152
(mcX <- mc(x30o50)) # -7.10849e-13
stopifnot(exprs = {
  all.equal(-7.10848988e-13, mcX, tol = 1e-9)
  all.equal(mcX, mc(1e30*x30o50), tol = 4e-4) # not so close
})
table(sml <- abs(x30o50) < 1e-8)# 20 30
range(x30o50[sml])
x0o50 <- x30o50; x0o50[sml] <- 0
(mcX0 <- mc(x0o50))
stopifnot(exprs = {
  all.equal(-0.378445401788, mcX0, tol=1e-12)
  all.equal(-0.099275805349, mc(x30o50[!sml]) -> mcL, tol=2e-11)
  all.equal(mcL, mcOld(x30o50[!sml]))
})
## -- some instability also wrt c.huberize:
mcHubc <- function(dat, ...)
  function(cc) vapply(cc, function(c) mc(dat, c.huberize = c, ...), -1.)
mcH50 <- mcHubc(x30o50)
head(cHs <- c(sort(outer(c(1, 2, 5), 10^(2:15))), Inf), 9)
mcXc <- mcH50(cHs)
plot(  mcXc  ~ cHs, type="b", log="x" , xlab=quote(c[huberize]))
plot((-mcXc) ~ cHs, type="b", log="xy", xlab=quote(c[huberize]))
## but for "regular" outlier skew data, there's no such dependency:
mcXcu <- mcHubc(cushny)(cHs)
stopifnot( abs(mcXcu - mcXcu[1]) < 1e-15)