Title: | Infinitesimally Robust Estimators for Preprocessing -Omics Data |
---|---|
Description: | Functions for the determination of optimally robust influence curves and estimators for preprocessing omics data, in particular gene expression data (Kohl and Deigner (2010), <doi:10.1186/1471-2105-11-583>). |
Authors: | Matthias Kohl [aut, cre, cph] |
Maintainer: | Matthias Kohl <[email protected]> |
License: | LGPL-3 |
Version: | 1.2.2 |
Built: | 2024-11-03 05:59:20 UTC |
Source: | https://github.com/r-forge/robast |
Functions for the determination of optimally robust influence curves and estimators for preprocessing omics data, in particular gene expression data (Kohl and Deigner (2010)).
Note: The first two numbers of package versions do not necessarily reflect package-individual development, but rather are chosen for the RobAStXXX family as a whole in order to ease updating "depends" information.
Matthias Kohl [email protected]
Maintainer: Matthias Kohl [email protected]
Kohl, M. (2005) Numerical Contributions to the Asymptotic Theory of Robustness. Bayreuth: Dissertation.
Kohl M. and Deigner H.P. (2010). Preprocessing of gene expression data by optimally robust estimators. BMC Bioinformatics, 11:583.
M. Kohl, P. Ruckdeschel, and H. Rieder (2010). Infinitesimally Robust Estimation in General Smoothly Parametrized Models. Statistical Methods and Application, 19(3):333-354.
Rieder, H. (1994) Robust Asymptotic Statistics. New York: Springer.
Rieder, H., Kohl, M. and Ruckdeschel, P. (2008) The Costs of not Knowing the Radius. Statistical Methods and Applications 17(1) 13-40.
library(RobLoxBioC)
library(RobLoxBioC)
Generic function for computing minimum Kolmogorov distance for biological data.
KolmogorovMinDist(x, D, ...) ## S4 method for signature 'matrix,Norm' KolmogorovMinDist(x, D, mad0 = 1e-4) ## S4 method for signature 'AffyBatch,AbscontDistribution' KolmogorovMinDist(x, D, bg.correct = TRUE, pmcorrect = TRUE, verbose = TRUE) ## S4 method for signature 'beadLevelData,AbscontDistribution' KolmogorovMinDist(x, D, log = FALSE, what = "Grn", probes = NULL, arrays = NULL)
KolmogorovMinDist(x, D, ...) ## S4 method for signature 'matrix,Norm' KolmogorovMinDist(x, D, mad0 = 1e-4) ## S4 method for signature 'AffyBatch,AbscontDistribution' KolmogorovMinDist(x, D, bg.correct = TRUE, pmcorrect = TRUE, verbose = TRUE) ## S4 method for signature 'beadLevelData,AbscontDistribution' KolmogorovMinDist(x, D, log = FALSE, what = "Grn", probes = NULL, arrays = NULL)
x |
biological data. |
D |
object of class |
... |
additional parameters. |
mad0 |
scale estimate used if computed MAD is equal to zero. Median and MAD are used as start parameter for optimization. |
bg.correct |
if |
pmcorrect |
if |
verbose |
logical: if |
log |
if |
what |
character string specifying which intensities/values to summarize;
see |
probes |
Specify particular probes to summarize. If left |
arrays |
integer (scalar or vector) specifying the strips/arrays to summarize.
If |
The minimum Kolmogorov distance is computed for each row of a matrix, each Affymetrix probe, or each Illumina bead, respectively.
So far, only the minimum distance to the set of normal distributions can be computed.
List with components dist
containing a numeric vector
or matrix with minimum Kolmogorov distances and n
a numeric vector
or matrix with the corresponding sample sizes.
Matthias Kohl [email protected]
Huber, P.J. (1981) Robust Statistics. New York: Wiley.
Rieder, H. (1994) Robust Asymptotic Statistics. New York: Springer.
set.seed(123) # to have reproducible results for package checking ## matrix method for KolmogorovMinDist ind <- rbinom(200, size=1, prob=0.05) X <- matrix(rnorm(200, mean=ind*3, sd=(1-ind) + ind*9), nrow = 2) KolmogorovMinDist(X, D = Norm()) ## using Affymetrix data data(SpikeIn) probes <- log2(pm(SpikeIn)) (res <- KolmogorovMinDist(probes, Norm())) boxplot(res$dist) ## \donttest because of check time ## using Affymetrix data library(affydata) data(Dilution) res <- KolmogorovMinDist(Dilution[,1], Norm()) summary(res$dist) boxplot(res$dist) plot(res$n, res$dist, pch = 20, main = "Kolmogorov distance vs. sample size", xlab = "sample size", ylab = "Kolmogorov distance", ylim = c(0, max(res$dist))) uni.n <- min(res$n):max(res$n) lines(uni.n, 1/(2*uni.n), col = "orange", lwd = 2) legend("topright", legend = "minimal possible distance", fill = "orange") ## Illumina bead level data library(beadarrayExampleData) data(exampleBLData) res <- KolmogorovMinDist(exampleBLData, Norm(), arrays = 1) res1 <- KolmogorovMinDist(exampleBLData, Norm(), log = TRUE, arrays = 1) summary(cbind(res$dist, res1$dist)) boxplot(list(res$dist, res1$dist), names = c("raw", "log-raw")) sort(unique(res1$n)) plot(res1$n, res1$dist, pch = 20, main = "Kolmogorov distance vs. sample size", xlab = "sample size", ylab = "Kolmogorov distance", ylim = c(0, max(res1$dist)), xlim = c(min(res1$n), 56)) uni.n <- min(res1$n):56 lines(uni.n, 1/(2*uni.n), col = "orange", lwd = 2) legend("topright", legend = "minimal possible distance", fill = "orange")
set.seed(123) # to have reproducible results for package checking ## matrix method for KolmogorovMinDist ind <- rbinom(200, size=1, prob=0.05) X <- matrix(rnorm(200, mean=ind*3, sd=(1-ind) + ind*9), nrow = 2) KolmogorovMinDist(X, D = Norm()) ## using Affymetrix data data(SpikeIn) probes <- log2(pm(SpikeIn)) (res <- KolmogorovMinDist(probes, Norm())) boxplot(res$dist) ## \donttest because of check time ## using Affymetrix data library(affydata) data(Dilution) res <- KolmogorovMinDist(Dilution[,1], Norm()) summary(res$dist) boxplot(res$dist) plot(res$n, res$dist, pch = 20, main = "Kolmogorov distance vs. sample size", xlab = "sample size", ylab = "Kolmogorov distance", ylim = c(0, max(res$dist))) uni.n <- min(res$n):max(res$n) lines(uni.n, 1/(2*uni.n), col = "orange", lwd = 2) legend("topright", legend = "minimal possible distance", fill = "orange") ## Illumina bead level data library(beadarrayExampleData) data(exampleBLData) res <- KolmogorovMinDist(exampleBLData, Norm(), arrays = 1) res1 <- KolmogorovMinDist(exampleBLData, Norm(), log = TRUE, arrays = 1) summary(cbind(res$dist, res1$dist)) boxplot(list(res$dist, res1$dist), names = c("raw", "log-raw")) sort(unique(res1$n)) plot(res1$n, res1$dist, pch = 20, main = "Kolmogorov distance vs. sample size", xlab = "sample size", ylab = "Kolmogorov distance", ylim = c(0, max(res1$dist)), xlim = c(min(res1$n), 56)) uni.n <- min(res1$n):56 lines(uni.n, 1/(2*uni.n), col = "orange", lwd = 2) legend("topright", legend = "minimal possible distance", fill = "orange")
Generic function for preprocessing biological data using optimally robust (rmx) estimators; confer Rieder (1994), Kohl (2005), Rieder et al (2008).
robloxbioc(x, ...) ## S4 method for signature 'matrix' robloxbioc(x, eps = NULL, eps.lower = 0, eps.upper = 0.05, steps = 3L, fsCor = TRUE, mad0 = 1e-4) ## S4 method for signature 'AffyBatch' robloxbioc(x, bg.correct = TRUE, pmcorrect = TRUE, normalize = FALSE, add.constant = 32, verbose = TRUE, eps = NULL, eps.lower = 0, eps.upper = 0.05, steps = 3L, fsCor = TRUE, mad0 = 1e-4, contrast.tau = 0.03, scale.tau = 10, delta = 2^(-20), sc = 500) ## S4 method for signature 'beadLevelData' robloxbioc(x, channelList = list(greenChannel), probeIDs = NULL, useSampleFac = FALSE, sampleFac = NULL, weightNames = "wts", removeUnMappedProbes = TRUE, eps = NULL, eps.lower = 0, eps.upper = 0.05, steps = 3L, fsCor = TRUE, mad0 = 1e-4)
robloxbioc(x, ...) ## S4 method for signature 'matrix' robloxbioc(x, eps = NULL, eps.lower = 0, eps.upper = 0.05, steps = 3L, fsCor = TRUE, mad0 = 1e-4) ## S4 method for signature 'AffyBatch' robloxbioc(x, bg.correct = TRUE, pmcorrect = TRUE, normalize = FALSE, add.constant = 32, verbose = TRUE, eps = NULL, eps.lower = 0, eps.upper = 0.05, steps = 3L, fsCor = TRUE, mad0 = 1e-4, contrast.tau = 0.03, scale.tau = 10, delta = 2^(-20), sc = 500) ## S4 method for signature 'beadLevelData' robloxbioc(x, channelList = list(greenChannel), probeIDs = NULL, useSampleFac = FALSE, sampleFac = NULL, weightNames = "wts", removeUnMappedProbes = TRUE, eps = NULL, eps.lower = 0, eps.upper = 0.05, steps = 3L, fsCor = TRUE, mad0 = 1e-4)
x |
biological data. |
... |
additional parameters. |
eps |
positive real (0 < |
eps.lower |
positive real (0 <= |
eps.upper |
positive real ( |
steps |
positive integer. k-step is used to compute the optimally robust estimator. |
fsCor |
logical: perform finite-sample correction. See function |
mad0 |
scale estimate used if computed MAD is equal to zero |
bg.correct |
if |
pmcorrect |
method used for PM correction; |
normalize |
logical: if |
add.constant |
constant added to the MAS 5.0 expression values before the normalization step. Improves the variance of the measure one no longer devides by numbers close to 0 when computing fold-changes. |
verbose |
logical: if |
contrast.tau |
a number denoting the contrast tau parameter; confer the MAS 5.0 PM correction algorithm. |
scale.tau |
a number denoting the scale tau parameter; confer the MAS 5.0 PM correction algorithm. |
delta |
a number denoting the delta parameter; confer the MAS 5.0 PM correction algorithm. |
sc |
value at which all arrays will be scaled to. |
channelList |
List of objects of class illuminaChannel that defines the
summarisation to be performed where in our setup only the slots |
probeIDs |
Vector of ArrayAddressIDs to be included in the summarized object. The default is to summarize all probes. |
useSampleFac |
if |
sampleFac |
optional character vector giving which a sample identifer for each section |
weightNames |
name of column in the |
removeUnMappedProbes |
if TRUE and annotation information is stored in the
|
The optimally-robust resp. the radius-minimax (rmx) estimator for normal location and scale is used to preprocess biological data. The computation uses a k-step construction with median and MAD as starting estimators; cf. Rieder (1994) and Kohl (2005).
If the amount of gross errors (contamination) is known, it can be
specified by eps
. The radius of the corresponding infinitesimal
contamination neighborhood (infinitesimal version of Tukey's gross error model)
is obtained by multiplying eps
by the square root of the sample size.
If the amount of gross errors (contamination) is unknown, which is typically
the case, try to find a rough estimate for the amount of gross errors, such that
it lies between eps.lower
and eps.upper
.
If eps
is NULL
, the radius-minimax (rmx) estimator in sense of
Rieder et al. (2001, 2008), respectively Section 2.2 of Kohl (2005) is used.
The algorithm used for Affymetrix data is similar to MAS 5.0 (cf. Affymetrix (2002)). The main difference is the substitution of the Tukey one-step estimator by our rmx k-step (k >= 1) estimator in the PM/MM correction step. The optional scale normalization is performed as given in Affymetrix (2002).
In case of Illumina data, the rmx estimator is used to summarize the bead types.
The implementation for the most part copies summarize
from beadarray.
For sample size <= 2, median and MAD are used for estimation.
If eps = 0
, mean and sd are computed.
Return value depends on the class of x
.
In case of "matrix"
a matrix with columns "mean" and "sd" is returned.
In case of "AffyBatch"
an object of class "ExpressionSet"
is returned.
In case of "BeadLevelData"
an object of class "ExpressionSetIllumina"
is returned.
Matthias Kohl [email protected],
update for beadarray versions >= 2.0.0 with support by Mark Dunnings and Andy Lynch
Affymetrix, Inc. (2002). Statistical Algorithms Description Document. Affymetrix, Santa Clara.
Kohl, M. (2005) Numerical Contributions to the Asymptotic Theory of Robustness. Bayreuth: Dissertation.
Kohl M. and Deigner H.P. (2010). Preprocessing of gene expression data by optimally robust estimators. BMC Bioinformatics, 11:583.
M. Kohl, P. Ruckdeschel, and H. Rieder (2010). Infinitesimally Robust Estimation in General Smoothly Parametrized Models. Statistical Methods and Application, 19(3):333-354.
Rieder, H. (1994) Robust Asymptotic Statistics. New York: Springer.
Rieder, H., Kohl, M. and Ruckdeschel, P. (2008) The Costs of not Knowing the Radius. Statistical Methods and Applications 17(1) 13-40. Extended version: http://r-kurs.de/RRlong.pdf
roblox
, rowRoblox
,
AffyBatch-class
,
generateExprVal.method.mas
,
ExpressionSet-class
,
summarize
set.seed(123) # to have reproducible results for package checking ## similar to rowRoblox of package RobLox ind <- rbinom(200, size=1, prob=0.05) X <- matrix(rnorm(200, mean=ind*3, sd=(1-ind) + ind*9), nrow = 2) robloxbioc(X) robloxbioc(X, steps = 5) robloxbioc(X, eps = 0.05) robloxbioc(X, eps = 0.05, steps = 5) ## \donttest to reduce check time ## the function is designed for large scale problems X <- matrix(rnorm(50000*20, mean = 1), nrow = 50000) system.time(robloxbioc(X)) ## using Affymetrix data ## confer example to generateExprVal.method.mas ## A more worked out example can be found in the scripts folder ## of the package. data(SpikeIn) probes <- pm(SpikeIn) mas <- generateExprVal.method.mas(probes) rl <- 2^robloxbioc(log2(t(probes))) concentrations <- as.numeric(colnames(SpikeIn)) plot(concentrations, mas$exprs, log="xy", ylim=c(50,10000), type="b", ylab = "expression measures") points(concentrations, rl[,1], pch = 20, col="orange", type="b") legend("topleft", c("MAS", "roblox"), pch = c(1, 20)) ## Affymetrix dilution data library(affydata) data(Dilution) eset <- robloxbioc(Dilution) ## Affymetrix scale normalization eset1 <- robloxbioc(Dilution, normalize = TRUE) ## Illumina bead level data library(beadarrayExampleData) data(exampleBLData) res <- robloxbioc(exampleBLData, eps.upper = 0.5) res
set.seed(123) # to have reproducible results for package checking ## similar to rowRoblox of package RobLox ind <- rbinom(200, size=1, prob=0.05) X <- matrix(rnorm(200, mean=ind*3, sd=(1-ind) + ind*9), nrow = 2) robloxbioc(X) robloxbioc(X, steps = 5) robloxbioc(X, eps = 0.05) robloxbioc(X, eps = 0.05, steps = 5) ## \donttest to reduce check time ## the function is designed for large scale problems X <- matrix(rnorm(50000*20, mean = 1), nrow = 50000) system.time(robloxbioc(X)) ## using Affymetrix data ## confer example to generateExprVal.method.mas ## A more worked out example can be found in the scripts folder ## of the package. data(SpikeIn) probes <- pm(SpikeIn) mas <- generateExprVal.method.mas(probes) rl <- 2^robloxbioc(log2(t(probes))) concentrations <- as.numeric(colnames(SpikeIn)) plot(concentrations, mas$exprs, log="xy", ylim=c(50,10000), type="b", ylab = "expression measures") points(concentrations, rl[,1], pch = 20, col="orange", type="b") legend("topleft", c("MAS", "roblox"), pch = c(1, 20)) ## Affymetrix dilution data library(affydata) data(Dilution) eset <- robloxbioc(Dilution) ## Affymetrix scale normalization eset1 <- robloxbioc(Dilution, normalize = TRUE) ## Illumina bead level data library(beadarrayExampleData) data(exampleBLData) res <- robloxbioc(exampleBLData, eps.upper = 0.5) res
The function AffySimStudy
can be used to perform Monte-Carlo studies
comparing Tukey's biweight and rmx estimators for normal location and scale.
The function IlluminaSimStudy
can be used to perform Monte-Carlo studies
comparing Illumina's default method - a Huber-type skipped mean and sd
(cf. Hampel (1985)) - and rmx estimators for normal location and scale.
In addition, maximum likelihood (ML) estimators (mean and sd) and median and
MAD are computed. The comparison is based on the empirical MSE.
AffySimStudy(n, M, eps, seed = 123, eps.lower = 0, eps.upper = 0.05, steps = 3L, fsCor = TRUE, contD, plot1 = FALSE, plot2 = FALSE, plot3 = FALSE) IlluminaSimStudy(n, M, eps, seed = 123, eps.lower = 0, eps.upper = 0.05, steps = 3L, fsCor = TRUE, contD, plot1 = FALSE, plot2 = FALSE, plot3 = FALSE)
AffySimStudy(n, M, eps, seed = 123, eps.lower = 0, eps.upper = 0.05, steps = 3L, fsCor = TRUE, contD, plot1 = FALSE, plot2 = FALSE, plot3 = FALSE) IlluminaSimStudy(n, M, eps, seed = 123, eps.lower = 0, eps.upper = 0.05, steps = 3L, fsCor = TRUE, contD, plot1 = FALSE, plot2 = FALSE, plot3 = FALSE)
n |
integer; sample size, should be at least 3. |
M |
integer; Monte-Carlo replications. |
eps |
amount of contamination in [0, 0.5]. |
seed |
random seed. |
eps.lower |
used by rmx estimator. |
eps.upper |
used by rmx estimator. |
steps |
integer; steps used for estimator construction. |
fsCor |
logical; use finite-sample correction. |
contD |
object of class |
plot1 |
logical; plot cdf of ideal and real distribution. |
plot2 |
logical; plot 20 (or M if M < 20) randomly selected samples. |
plot3 |
logical; generate boxplots of the results. |
Normal location and scale with mean = 0 and sd = 1 is used as ideal model (without restriction due to equivariance).
Since there is no estimator which yields reliable results if 50 percent or more of the observations are contaminated, we use a modification where we re-simulate all samples including at least 50 percent contaminated data.
We use funtion rowRoblox
for the computation of the rmx estimator.
Data.frame including empirical MSE (standardized by sample size n) and relMSE with respect to the rmx estimator.
Matthias Kohl [email protected]
Affymetrix, Inc. (2002). Statistical Algorithms Description Document. Affymetrix, Santa Clara.
Hampel F.R. (1985). The breakdown points of the mean combined with some rejection rules. Technometrics, 27(2):95-107.
set.seed(123) # to have reproducible results for package checking AffySimStudy(n = 11, M = 100, eps = 0.02, contD = Norm(mean = 0, sd = 3), plot1 = TRUE, plot2 = TRUE, plot3 = TRUE) IlluminaSimStudy(n = 30, M = 100, eps = 0.02, contD = Norm(mean = 0, sd = 3), plot1 = TRUE, plot2 = TRUE, plot3 = TRUE)
set.seed(123) # to have reproducible results for package checking AffySimStudy(n = 11, M = 100, eps = 0.02, contD = Norm(mean = 0, sd = 3), plot1 = TRUE, plot2 = TRUE, plot3 = TRUE) IlluminaSimStudy(n = 30, M = 100, eps = 0.02, contD = Norm(mean = 0, sd = 3), plot1 = TRUE, plot2 = TRUE, plot3 = TRUE)