Title: | Classes and Methods for "Class Comparison" Problems on Microarrays |
---|---|
Description: | Defines the classes used for "class comparison" problems in the OOMPA project (<http://oompa.r-forge.r-project.org/>). Class comparison includes tests for differential expression; see Simon's book for details on typical problem types. |
Authors: | Kevin R. Coombes [aut, cre] |
Maintainer: | Kevin R. Coombes <[email protected]> |
License: | Apache License (== 2.0) |
Version: | 3.3.2 |
Built: | 2024-12-16 03:58:55 UTC |
Source: | https://github.com/r-forge/oompa |
The Bum
class is used to fit a beta-uniform mixture model to a
set of p-values.
Bum(pvals, ...) ## S4 method for signature 'Bum' summary(object, tau=0.01, ...) ## S4 method for signature 'Bum' hist(x, res=100, xlab='P Values', main='', ...) ## S4 method for signature 'Bum' image(x, ...) ## S4 method for signature 'Bum' cutoffSignificant(object, alpha, by='FDR', ...) ## S4 method for signature 'Bum' selectSignificant(object, alpha, by='FDR', ...) ## S4 method for signature 'Bum' countSignificant(object, alpha, by='FDR', ...) likelihoodBum(object)
Bum(pvals, ...) ## S4 method for signature 'Bum' summary(object, tau=0.01, ...) ## S4 method for signature 'Bum' hist(x, res=100, xlab='P Values', main='', ...) ## S4 method for signature 'Bum' image(x, ...) ## S4 method for signature 'Bum' cutoffSignificant(object, alpha, by='FDR', ...) ## S4 method for signature 'Bum' selectSignificant(object, alpha, by='FDR', ...) ## S4 method for signature 'Bum' countSignificant(object, alpha, by='FDR', ...) likelihoodBum(object)
pvals |
numeric vector containing values between |
object |
object of class |
tau |
numeric scalar between |
x |
object of class |
res |
positive integer scalar specifying the resolution at which to plot the fitted distribution curve |
xlab |
character string specifying the label for the x axis |
main |
character string specifying the graph title |
alpha |
Either the false discovery rate (if |
by |
character string denoting the method to use for determining cutoffs. Valid values are:
|
... |
extra arguments for generic or plotting routines |
The BUM method was introduced by Stan Pounds and Steve Morris, although it was simultaneously discovered by several other researchers. It is generally applicable to any analysis of microarray or proteomics data that performs a separate statistical hypothesis test for each gene or protein, where each test produces a p-value that would be valid if the analyst were only performing one statistical test. When performing thousands of statistical tests, however, those p-values no longer have the same interpretation as Type I error rates. The idea behind BUM is that, under the null hypothesis that none of the genes or proteins is interesting, the expected distribution of the set of p-values is uniform. By contrast, if some of the genes are interesting, then we should see an overabundance of small p-values (or a spike in the histogram near zero). We can model the alternative hypothesis with a beta distribution, and view the set of all p-values as a mixture distribution.
Fitting the BUM model is straightforward, using a nonlinear optimizer to compute the maximum likelihood parameters. After the model has been fit, one can easily determine cutoffs on the p-values that correspond to desired false discovery rates. Alternatively, the original Pounds and Morris paper shows that their results can be reinterpreted to recover the empirical Bayes method introduced by Efron and Tibshirani. Thus, one can also determine cutoffs by specifying a desired posterior probability of significance.
Graphical functions (hist
and image
) invisibly return the
object on which they were invoked.
The cutoffSignificant
method returns a real number between zero
and one. P-values below this cutoff are considered statistically
significant at either the specified false discovery rate or at the
specified posterior probability.
The selectSignificant
method returns a vector of logical values
whose length is equal to the length of the vector of p-values that was
used to construct the Bum
object. True values in the return
vector mark the statistically significant p-values.
The countSignificant
method returns an integer, the number of
statistically significant p-values.
The summary
method returns an object of class
BumSummary
.
Although objects can be created directly using new
, the most
common usage will be to pass a vector of p-values to the
Bum
function.
pvals
:numeric vector of p-values used to construct the object.
ahat
:Model parameter
lhat
:Model parameter
pihat
:Model parameter
For each value of the p-value
cutoff tau
, computes estimates of the fraction of true
positives (TP), false negatives (FN), false positives (FP), and
true negatives (TN).
Plots a
histogram of the object, and overlays (1) a straight line to indicate
the contribution of the uniform component and (2) the fitted
beta-uniform distribution from the observed values. Colors in the
plot are controlled by
oompaColor$EXPECTED
and
oompaColor$OBSERVED
.
Produces four plots in a 2x2 layout: (1) the
histogram produced by hist
; (2) a plot of cutoffs against
the desired false discovery rate; (3) a plot of cutoffs against
the posterior probability of coming from the beta component; and
(4) an ROC curve.
Computes the
cutoff needed for significance, which in this case means arising
from the beta component rather than the uniform component of the
mixture. Significance is specified either by the false discovery
rate (when by = 'FDR'
or by = 'FalseDiscovery'
) or
by the posterior probability (when by = 'EmpiricalBayes'
)
Uses
cutoffSignificant
to determine a logical vector that
indicates which of the p-values are significant.
Uses
selectSignificant
to count the number of significant
p-values.
Kevin R. Coombes [email protected]
Pounds S, Morris SW.
Estimating the occurrence of false positives and false negatives in
microarray studies by approximating and partitioning the empirical
distribution of p-values.
Bioinformatics. 2003 Jul 1;19(10):1236-42.
Benjamini Y, Hochberg Y.
Controlling the false discovery rate: a practical and powerful approach
to multiple testing.
J Roy Statist Soc B, 1995; 57: 289-300.
Efron B, Tibshirani R.
Empirical bayes methods and false discovery rates for microarrays.
Genet Epidemiol 2002, 23: 70-86.
Two classes that produce lists of p-values that can (and often
should) be analyzed using BUM are MultiTtest
and
MultiLinearModel
. Also see BumSummary
.
showClass("Bum") fake.data <- c(runif(700), rbeta(300, 0.3, 1)) a <- Bum(fake.data) hist(a, res=200) alpha <- (1:25)/100 plot(alpha, cutoffSignificant(a, alpha, by='FDR'), xlab='Desired False Discovery Rate', type='l', main='FDR Control', ylab='Significant P Value') GAMMA <- 5*(10:19)/100 plot(GAMMA, cutoffSignificant(a, GAMMA, by='EmpiricalBayes'), ylab='Significant P Value', type='l', main='Empirical Bayes', xlab='Posterior Probability') b <- summary(a, (0:100)/100) be <- b@estimates sens <- be$TP/(be$TP+be$FN) spec <- be$TN/(be$TN+be$FP) plot(1-spec, sens, type='l', xlim=c(0,1), ylim=c(0,1), main='ROC Curve') points(1-spec, sens) abline(0,1) image(a) countSignificant(a, 0.05, by='FDR') countSignificant(a, 0.99, by='Emp')
showClass("Bum") fake.data <- c(runif(700), rbeta(300, 0.3, 1)) a <- Bum(fake.data) hist(a, res=200) alpha <- (1:25)/100 plot(alpha, cutoffSignificant(a, alpha, by='FDR'), xlab='Desired False Discovery Rate', type='l', main='FDR Control', ylab='Significant P Value') GAMMA <- 5*(10:19)/100 plot(GAMMA, cutoffSignificant(a, GAMMA, by='EmpiricalBayes'), ylab='Significant P Value', type='l', main='Empirical Bayes', xlab='Posterior Probability') b <- summary(a, (0:100)/100) be <- b@estimates sens <- be$TP/(be$TP+be$FN) spec <- be$TN/(be$TN+be$FP) plot(1-spec, sens, type='l', xlim=c(0,1), ylim=c(0,1), main='ROC Curve') points(1-spec, sens) abline(0,1) image(a) countSignificant(a, 0.05, by='FDR') countSignificant(a, 0.99, by='Emp')
An implementation class. Users are not expected to create these objects
directly; they are produced as return objects from the summary method for
Bum
.
bum
:object of class Bum
estimates
:data.frame
Fhat
:numeric
signature(object = "BumSummary")
: Print the
object, which contains a summary of the underlying Bum
object. The summary contains a data frame with estimates of the
fraction of true positives (TP), false positives (FP), true negatives
(TN) and false negatives (FN) at the set of p-value cutoffs
specified in the call to the summary method.
Kevin R. Coombes [email protected]
showClass("BumSummary")
showClass("BumSummary")
An implementation of the method of Dudoit and colleagues to apply the Westfall-Young adjustment to p-values to control the family-wise error rate when analyzing microarray data.
Dudoit(data, classes, nPerm=1000, verbose=TRUE) ## S4 method for signature 'Dudoit,missing' plot(x, y, xlab='T-Statistic', ylab='P-Value', ...) ## S4 method for signature 'Dudoit' cutoffSignificant(object, alpha, ...) ## S4 method for signature 'Dudoit' selectSignificant(object, alpha, ...) ## S4 method for signature 'Dudoit' countSignificant(object, alpha, ...)
Dudoit(data, classes, nPerm=1000, verbose=TRUE) ## S4 method for signature 'Dudoit,missing' plot(x, y, xlab='T-Statistic', ylab='P-Value', ...) ## S4 method for signature 'Dudoit' cutoffSignificant(object, alpha, ...) ## S4 method for signature 'Dudoit' selectSignificant(object, alpha, ...) ## S4 method for signature 'Dudoit' countSignificant(object, alpha, ...)
data |
either a data frame or matrix with numeric values, or an
|
classes |
If |
nPerm |
integer scalar specifying the number of permutations to perform |
verbose |
logical scalar. If |
object |
object of class |
alpha |
numeric scalar specifying the target family-wise error rate |
x |
object of class |
y |
Nothing, since it is supposed to be missing. Changes to the Rd processor require documenting the missing entry. |
xlab |
character string specifying label for the x axis |
ylab |
character string specifying label for the y axis |
... |
extra arguments for generic or plotting routines |
In 2002, Dudoit and colleagues introduced a method to adjust the p-values when performing gene-by-gene tests for differential expression. The adjustment was based on the method of Westfall and Young, with the goal of controlling the family-wise error rate.
The standard method for plot
returns what you would expect.
The cutoffSignificant
method returns a real number (its input
value alpha
). The selectSignificant
method returns a
vector of logical values identifying the significant test results, and
countSignificant
returns an integer counting the number of
significant test results.
As usual, objects can be created by new
, but better methods are
available in the form of the Dudoit
function. The basic
inputs to this function are the same as those used for row-by-row
statistical tests throughout the ClassComparison package; a detailed
description can be found in the MultiTtest
class.
The additional input determines the number, nPerm
, of
permutations to perform. The accuracy of the p-value adjustment
depends on this value. Since the implementation is in R (and does not
call out to something compiled like C or FORTRAN), however, the
computations are slow. The default value of 1000 can take a long
time with modern microarrays that contain 40,000 spots.
adjusted.p
:numeric vector of adjusted p-values.
t.statistics
:Object of class numeric
containing the computed t-statistics.
p.values
:Object of class numeric
containing
the computed p-values.
groups
:Object of class character
containing
the names of the classes being compared.
call
:Object of class call
containing the
function call that created the object.
Class MultiTtest
, directly. In particular, objects of this class
inherit methods for summary
, hist
, and plot
from
the base class.
Determine cutoffs on
the adjusted p-values at the desired significance level. In other
words, this function simply returns alpha
.
Compute a logical vector for selecting significant test results.
Count the number of significant test results.
signature(x=Dudoit, y=missing)
: ...
Kevin R. Coombes [email protected]
Dudoit S, Yang YH, Callow MJ, Speed TP.
Statistical Methods for Identifying Differentially Expressed Genes in
Replicated cDNA Microarray Experiments.
Statistica Sinica (2002), 12(1): 111-139.
Westfall PH, Young SS.
Resampling-based multiple testing: examples and methods for p-value
adjustment.
Wiley series in probability and mathematics statistics.
John Wiley and Sons, 1993.
showClass("Dudoit") ng <- 10000 ns <- 15 nd <- 200 fake.class <- factor(rep(c('A', 'B'), each=ns)) fake.data <- matrix(rnorm(ng*ns*2), nrow=ng, ncol=2*ns) fake.data[1:nd, 1:ns] <- fake.data[1:nd, 1:ns] + 2 fake.data[(nd+1):(2*nd), 1:ns] <- fake.data[(nd+1):(2*nd), 1:ns] - 2 # the permutation test is slow. it really needs many more # than 10 permutations, but this is just an example... dud <- Dudoit(fake.data, fake.class, nPerm=10) summary(dud) plot(dud) countSignificant(dud, 0.05)
showClass("Dudoit") ng <- 10000 ns <- 15 nd <- 200 fake.class <- factor(rep(c('A', 'B'), each=ns)) fake.data <- matrix(rnorm(ng*ns*2), nrow=ng, ncol=2*ns) fake.data[1:nd, 1:ns] <- fake.data[1:nd, 1:ns] + 2 fake.data[(nd+1):(2*nd), 1:ns] <- fake.data[(nd+1):(2*nd), 1:ns] - 2 # the permutation test is slow. it really needs many more # than 10 permutations, but this is just an example... dud <- Dudoit(fake.data, fake.class, nPerm=10) summary(dud) plot(dud) countSignificant(dud, 0.05)
Computes the density function for the Wilcoxon rank-sum distribution without centering.
dwil(q, m, n)
dwil(q, m, n)
q |
vector of quantiles |
m |
number of observations in the first sample |
n |
number of observations in the second sample |
Computes the density function for the Wilcoxon rank-sum distribution, using
exact values when both groups have fewer than 50 items and switching
to a normal approximation otherwise. It was originally written for
S-Plus, which still perversely insists that m
and n
must
be less than 50. The function was retained when the OOMPA library was
ported to R, since S-Plus keeps the actual rank-sum but R centers the
distribution at zero. This function encapsulated the difference, allowing
everything else to continue to work as it had worked previously.
A vector of the same length as q
containing (approximate or
exact) values of the density function.
Kevin R. Coombes [email protected]
dwil(51:60, 9, 3) dwil(51:60, 9, 51)
dwil(51:60, 9, 3) dwil(51:60, 9, 51)
Class to fit multiple (row-by-row) linear (fixed-effects) models on microarray or proteomics data.
MultiLinearModel(form, clindata, arraydata) ## S4 method for signature 'MultiLinearModel' summary(object, ...) ## S4 method for signature 'MultiLinearModel' as.data.frame(x, row.names=NULL, optional=FALSE, ...) ## S4 method for signature 'MultiLinearModel' hist(x, xlab='F Statistics', main=NULL, ...) ## S4 method for signature 'MultiLinearModel,missing' plot(x, y, ylab='F Statistics', ...) ## S4 method for signature 'MultiLinearModel,ANY' plot(x, y, xlab='F Statistics', ylab=deparse(substitute(y)), ...) ## S4 method for signature 'MultiLinearModel' anova(object, ob2, ...) multiTukey(object, alpha)
MultiLinearModel(form, clindata, arraydata) ## S4 method for signature 'MultiLinearModel' summary(object, ...) ## S4 method for signature 'MultiLinearModel' as.data.frame(x, row.names=NULL, optional=FALSE, ...) ## S4 method for signature 'MultiLinearModel' hist(x, xlab='F Statistics', main=NULL, ...) ## S4 method for signature 'MultiLinearModel,missing' plot(x, y, ylab='F Statistics', ...) ## S4 method for signature 'MultiLinearModel,ANY' plot(x, y, xlab='F Statistics', ylab=deparse(substitute(y)), ...) ## S4 method for signature 'MultiLinearModel' anova(object, ob2, ...) multiTukey(object, alpha)
form |
|
clindata |
either a data frame of "clinical" or other
covariates, or an |
arraydata |
matrix or data frame of values to be explained by the model.
If |
object |
object of class |
ob2 |
object of class |
x |
object of class |
y |
optional numeric vector |
xlab |
character string specifying label for the x-axis |
ylab |
character string specifying label for the y-axis |
main |
character string specifying graph title |
... |
extra arguments for generic or plotting functions |
row.names |
see the base version |
optional |
see the base version |
alpha |
numeric scalar between |
The anova
method returns a data frame. The rows in the data
frame corresponds to the rows in the arraydata
object that was
used to construct the MultiLinearModel
objects. The first
column contains the F-statistics and the second column contains the
p-values.
The multiTukey
function returns a vector whose length equals
the number of rows in the arraydata
object used to construct
the MultiLinearModel
. Assuming that the overall F-test was
significant, differences in group means (in each data row) larger than
this value are significant by Tukey's test for honestly significant
difference. (Of course, that statement is incorrect, since we haven't
fully corrected for multiple testing. Our standard practice is to take
the p-values from the row-by-row F-tests and evaluate them using the
beta-uniform mixture model (see Bum
). For the rows that
correspond to models whose p-values are smaller than the Bum
cutoff, we simply use the Tukey HSD values without further
modification.)
Objects should be created by calling the MultiLinearModel
function. The first argument is a formula
specifying
the linear model, in the same manner that it would be passed to
lm
. We will fit the linear model separately for each
row in the arraydata
matrix. Rows of arraydata
are
attached to the clindata
data frame and are always referred to
as "Y" in the formulas. In particular, this implies that
clindata
can not include a column already called "Y". Further,
the implementation only works if "Y" is the response variable in the model.
The BioConductor packages uses an ExpressionSet
to combine microarray
data and clinical covariates (known in their context as
phenoData
objects) into a single structure.
You can call MultiLinearModel
using an ExpressionSet
object
for the clindata
argument. In this case, the function extracts
the phenoData
slot of the ExpressionSet
to use for the
clinical covariates, and extracts the exprs
slot of the
ExpressionSet
object to use for the array data.
call
:A call
object describing how the object
was constructed.
model
:The formula
object specifying the linear
model.
F.statistics
:A numeric vector of F-statistics comparing the linear model to the null model.
p.values
:A numeric vector containing the p-values associated to the F-statistics.
coefficients
:A matrix
of the coefficients in
the linear models.
predictions
:A matrix
of the (Y-hat) values
predicted by the models.
sse
:A numeric vector of the sum of squared error terms from fitting the models.
ssr
:A numeric vector of the sum of squared regression terms from the model.
df
:A numeric vector (of length two) containing the degrees of freedom for the F-tests.
Write out a summary of the object.
Create a histogram of the F-statistics.
Plot the F-statistics as a function of the row index.
Plot the F-statistics against the numeric vector y
.
Perform row-by-row F-tests comparing two different linear models.
The MultiLinearModel
constructor computes row-by-row F-tests
comparing each linear model to the null model Y ~ 1. In many
instances, one wishes to use an F-test to compare two different linear
models. For instance, many standard applications of analysis of
variance (ANOVA) can be described using such a comparison between two
different linear models. The anova
method for the
MultiLinearModel
class performs row-by-row F-tests comparing
two competing linear models.
The implementation of MultiLinearModel
does not take the naive
approach of using either apply
or a
for
-loop to attach rows one at a time and fit separate
linear models. All the models are actually fit simultaneously by a
series of matrix operations, which greatly reduces the amount of time
needed to compute the models. The constraint on the column names in
clindata
still holds, since one row is attached to allow
model.matrix
to determine the contrasts matrix.
Kevin R. Coombes [email protected]
anova
,
lm
,
Bum
,
MultiTtest
,
MultiWilcoxonTest
showClass("MultiLinearModel") ng <- 10000 ns <- 50 dat <- matrix(rnorm(ng*ns), ncol=ns) cla <- factor(rep(c('A', 'B'), 25)) cla2 <- factor(rep(c('X', 'Y', 'Z'), times=c(15, 20, 15))) covars <- data.frame(Grade=cla, Stage=cla2) res <- MultiLinearModel(Y ~ Grade + Stage, covars, dat) summary(res) hist(res, breaks=101) plot(res) plot(res, [email protected]) graded <- MultiLinearModel(Y ~ Grade, covars, dat) summary(graded) hist([email protected], breaks=101) hist([email protected], breaks=101) oop <- anova(res, graded) hist(oop$p.values, breaks=101)
showClass("MultiLinearModel") ng <- 10000 ns <- 50 dat <- matrix(rnorm(ng*ns), ncol=ns) cla <- factor(rep(c('A', 'B'), 25)) cla2 <- factor(rep(c('X', 'Y', 'Z'), times=c(15, 20, 15))) covars <- data.frame(Grade=cla, Stage=cla2) res <- MultiLinearModel(Y ~ Grade + Stage, covars, dat) summary(res) hist(res, breaks=101) plot(res) plot(res, res@p.values) graded <- MultiLinearModel(Y ~ Grade, covars, dat) summary(graded) hist(graded@p.values, breaks=101) hist(res@p.values, breaks=101) oop <- anova(res, graded) hist(oop$p.values, breaks=101)
Class to perform row-by-row t-tests on microarray or proteomics data.
MultiTtest(data, classes, na.rm=TRUE) ## S4 method for signature 'MultiTtest' summary(object, ...) ## S4 method for signature 'MultiTtest' as.data.frame(x, row.names=NULL, optional=FALSE, ...) ## S4 method for signature 'MultiTtest' hist(x, xlab='T Statistics', main=NULL, ...) ## S4 method for signature 'MultiTtest,missing' plot(x, y, ylab='T Statistics', ...) ## S4 method for signature 'MultiTtest,ANY' plot(x, y, xlab='T Statistics', ylab=deparse(substitute(y)), ...)
MultiTtest(data, classes, na.rm=TRUE) ## S4 method for signature 'MultiTtest' summary(object, ...) ## S4 method for signature 'MultiTtest' as.data.frame(x, row.names=NULL, optional=FALSE, ...) ## S4 method for signature 'MultiTtest' hist(x, xlab='T Statistics', main=NULL, ...) ## S4 method for signature 'MultiTtest,missing' plot(x, y, ylab='T Statistics', ...) ## S4 method for signature 'MultiTtest,ANY' plot(x, y, xlab='T Statistics', ylab=deparse(substitute(y)), ...)
data |
either a data frame or matrix with numeric values, or an
|
classes |
If |
na.rm |
logical scalar. If |
object |
object of class |
x |
object of class |
y |
numeric vector |
xlab |
character string specifying the label for the x axis |
ylab |
character string specifying the label for the y axis |
main |
character string specifying the plot title |
row.names |
see the base version |
optional |
see the base version |
... |
extra arguments for generic or plotting routines |
The graphical routines invisibly return the object against which they were invoked.
Although objects can be created using new
, the preferred method is
to use the MultiTtest
generator. In the simplest case, you
simply pass in a data matrix and a logical vector assigning classes to
the columns, and the constructor performs row-by-row two-sample
t-tests and computes the associated (single test) p-values. To adjust
for multiple testing, you can pass the p-values on to the
Bum
class.
If you use a factor instead of a logical vector, then the t-test
compares the first level of the factor to everything else. To handle
the case of multiple classes, see the MultiLinearModel
class.
As with other class comparison functions that are part of the OOMPA,
we can also perform statistical tests on
ExpressionSet
objects from
the BioConductor libraries. In this case, we pass in an
ExpressionSet
object along with the name of a factor to use for
splitting the data.
t.statistics
:Object of class numeric
containing the computed t-statistics.
p.values
:Object of class numeric
containing
the computed p-values.
df
:Numeric vector of the degrees of freedom per gene. Introduced to allow for missing data.
groups
:Object of class character
containing
the names of the classes being compared.
call
:Object of class call
containing the
function call that created the object.
Write out a summary of the object.
Produce a histogram of the t-statistics.
Produces a scatter plot of the t-statistics against their index.
Produces a scatter plot of the t-statistics in the
object x
against the numeric vector y
.
Kevin R. Coombes [email protected]
matrixT
,
Bum
,
Dudoit
,
MultiLinearModel
showClass("MultiTtest") ng <- 10000 ns <- 50 dat <- matrix(rnorm(ng*ns), ncol=ns) cla <- factor(rep(c('A', 'B'), each=25)) res <- MultiTtest(dat, cla) summary(res) hist(res, breaks=101) plot(res) plot(res, [email protected]) hist([email protected], breaks=101) dat[1,1] <- NA mm <- matrixMean(dat, na.rm=TRUE) vv <- matrixVar(dat, mm, na.rm=TRUE) tt <- matrixT(dat, cla, na.rm=TRUE) mtt <- MultiTtest(dat,cla)
showClass("MultiTtest") ng <- 10000 ns <- 50 dat <- matrix(rnorm(ng*ns), ncol=ns) cla <- factor(rep(c('A', 'B'), each=25)) res <- MultiTtest(dat, cla) summary(res) hist(res, breaks=101) plot(res) plot(res, res@p.values) hist(res@p.values, breaks=101) dat[1,1] <- NA mm <- matrixMean(dat, na.rm=TRUE) vv <- matrixVar(dat, mm, na.rm=TRUE) tt <- matrixT(dat, cla, na.rm=TRUE) mtt <- MultiTtest(dat,cla)
The MultiWilcoxonTest
class is used to perform row-by-row Wilcoxon
rank-sum tests on a data matrix. Significance cutoffs are determined by the
empirical Bayes method of Efron and Tibshirani.
MultiWilcoxonTest(data, classes, histsize=NULL) ## S4 method for signature 'MultiWilcoxonTest' summary(object, prior=1, significance=0.9, ...) ## S4 method for signature 'MultiWilcoxonTest' hist(x, xlab='Rank Sum', ylab='Prob(Different | Y)', main='', ...) ## S4 method for signature 'MultiWilcoxonTest,missing' plot(x, prior=1, significance=0.9, ylim=c(-0.5, 1), xlab='Rank Sum', ylab='Prob(Different | Y)', ...) ## S4 method for signature 'MultiWilcoxonTest' cutoffSignificant(object, prior, significance, ...) ## S4 method for signature 'MultiWilcoxonTest' selectSignificant(object, prior, significance, ...) ## S4 method for signature 'MultiWilcoxonTest' countSignificant(object, prior, significance, ...) ## S4 method for signature 'MultiWilcoxonTest' probDiff(object, p0, ...)
MultiWilcoxonTest(data, classes, histsize=NULL) ## S4 method for signature 'MultiWilcoxonTest' summary(object, prior=1, significance=0.9, ...) ## S4 method for signature 'MultiWilcoxonTest' hist(x, xlab='Rank Sum', ylab='Prob(Different | Y)', main='', ...) ## S4 method for signature 'MultiWilcoxonTest,missing' plot(x, prior=1, significance=0.9, ylim=c(-0.5, 1), xlab='Rank Sum', ylab='Prob(Different | Y)', ...) ## S4 method for signature 'MultiWilcoxonTest' cutoffSignificant(object, prior, significance, ...) ## S4 method for signature 'MultiWilcoxonTest' selectSignificant(object, prior, significance, ...) ## S4 method for signature 'MultiWilcoxonTest' countSignificant(object, prior, significance, ...) ## S4 method for signature 'MultiWilcoxonTest' probDiff(object, p0, ...)
data |
either a data frame or matrix with numeric values, or an
|
classes |
If |
histsize |
An integer; the number of bins used for the histogram
summarizing the Wilcoxon statistics. When |
object |
an object of the |
x |
an object of the |
xlab |
character string specifying label for the x axis |
ylab |
character string specifying label for the y axis |
ylim |
Plotting limits on the y-axis |
main |
character string specifying graph title |
p0 |
see prior. |
prior |
Prior probability that an arbitrary gene is not differentially expressed, or that an arbitrary row does not yield a significant Wilcoxon rank-sum statistic. |
significance |
Desired level of posterior probability |
... |
extra arguments for generic or plotting routines |
See the paper by Efron and Tibshirani.
The standard methods summary
, hist
, and plot
return what you would expect.
The cutoffSignificant
method returns a list of two
integers. Rank-sum values smaller than the first value or larger than
the second value are statistically significant in the sense that their
posterior probability exceeds the specified significance
level
given the assumptions about the prior
probability of not being
significant.
The selectSignificant
method returns a vector of logical values
identifying the significant test results, and countSignificant
returns an integer counting the number of significant test results.
As usual, objects can be created by new
, but better methods are
available in the form of the MultiWilcoxonTest
function. The
inputs to this function are the same as those used for row-by-row
statistical tests throughout the ClassComparison package; a detailed
description can be found in the MultiTtest
class.
The constructor computes row-by-row Wilcoxon rank-sum statistics on
the input data
, comparing the two groups defined by the
classes
argument. It also estimates the observed and
theoretical (expected) density functions for the collection of
rank-sum statistics.
The additional input argument, histsize
is usually best left to
its default value. In certain pathological cases, we have found it
necessary to use fewer bins; one suspects that the underlying model
does not adequately capture the complexity of those situations.
statistics
:numeric vector containing the computed rank-sum statistics.
xvals
:numeric vector, best thought of as the vector of possible rank-sum statistics given the sizes of the two groups.
theoretical.pdf
:numeric vector containing the
theoretical density function evaluated at the points of
xvals
.
pdf
:numeric vector containing the empirical density
function computed at the points of xvals
.
unravel
:numeric vector containing a smoothed
estimate (by Poisson regression using B-splines) of the empirical
density function evaluated at xvals
.
groups
:A vector containing the names of the groups
defined by classes
.
call
:object of class call
representing the
function call that created the object.
Write out a summary of the object. For a given value of the
prior
probability of not being differentially expressed and
a given significance cutoff on the posterior probability, reports
the cutoffs and number of items in both tails of the distribution.
Plot a histogram of the rank-sum statistics, with overlaid curves
representing the expected and observed distributions. Colors of
the curves are controlled by
oompaColor$EXPECTED
and
oompaColor$OBSERVED
.
Plots the posterior probability of being differentially expressed
for given values of the prior
. Horizontal lines are added
at each specified significance
level for the posterior
probability.
Determine cutoffs on the rank-sum statistic at the desired significance level.
Compute a logical vector for selecting significant test results.
Count the number of significant test results.
Compute the probabilty that an observed value comes from the "unusual" part of the mixture distribution. Only exported so it can be inherited by other classes....
Kevin R. Coombes [email protected]
Efron B, Tibshirani R.
Empirical bayes methods and false discovery rates for microarrays.
Genet Epidemiol 2002, 23: 70-86.
Pounds S, Morris SW.
Estimating the occurrence of false positives and false negatives in
microarray studies by approximating and partitioning the empirical
distribution of p-values.
Bioinformatics. 2003 Jul 1;19(10):1236-42.
Implementation is handled in part by the functions dwil
and rankSum
. The empirical Bayes results for
alternative tests (such as MultiTtest
) can be obtained
using the beta-uniform mixture model in the Bum
class.
showClass("MultiWilcoxonTest") ng <- 10000 ns <- 15 nd <- 200 fake.class <- factor(rep(c('A', 'B'), each=ns)) fake.data <- matrix(rnorm(ng*ns*2), nrow=ng, ncol=2*ns) fake.data[1:nd, 1:ns] <- fake.data[1:nd, 1:ns] + 2 fake.data[(nd+1):(2*nd), 1:ns] <- fake.data[(nd+1):(2*nd), 1:ns] - 2 a <- MultiWilcoxonTest(fake.data, fake.class) hist(a) plot(a) plot(a, prior=0.85) abline(h=0) cutoffSignificant(a, prior=0.85, signif=0.95) countSignificant(a, prior=0.85, signif=0.95)
showClass("MultiWilcoxonTest") ng <- 10000 ns <- 15 nd <- 200 fake.class <- factor(rep(c('A', 'B'), each=ns)) fake.data <- matrix(rnorm(ng*ns*2), nrow=ng, ncol=2*ns) fake.data[1:nd, 1:ns] <- fake.data[1:nd, 1:ns] + 2 fake.data[(nd+1):(2*nd), 1:ns] <- fake.data[(nd+1):(2*nd), 1:ns] - 2 a <- MultiWilcoxonTest(fake.data, fake.class) hist(a) plot(a) plot(a, prior=0.85) abline(h=0) cutoffSignificant(a, prior=0.85, signif=0.95) countSignificant(a, prior=0.85, signif=0.95)
Compute the Wilcoxon rank-sum statistic.
rankSum(data, selector)
rankSum(data, selector)
data |
numeric vector |
selector |
logical vector the same length as |
This is an efficient function to compute the value of the Wilcoxon
rank-sum statistic without the extra overhead of the full
wilcox.test
function. It is used internally by the
MultiWilcoxonTest
class to perform row-by-row Wilcoxon
tests.
Returns an integer, the rank-sum of the subset of the
data
for which the selector
is true.
Kevin R. Coombes [email protected]
dd <- rnorm(100) cc <- rep(c(TRUE, FALSE), each=50) rankSum(dd, cc)
dd <- rnorm(100) cc <- rep(c(TRUE, FALSE), each=50) rankSum(dd, cc)
Implements the "Significance Analysis of Microarrays" approach to detecting differentially expressed genes.
Sam(data, classes, nPerm=100, verbose=TRUE) ## S4 method for signature 'Sam,missing' plot(x, y, tracks=NULL, xlab='Expected T Statistics (Empirical)', ylab='Observed T Statistics', ...) ## S4 method for signature 'Sam' summary(object, cutoff=1, ...) ## S4 method for signature 'Sam' selectSignificant(object, cutoff=1, ...) ## S4 method for signature 'Sam' countSignificant(object, cutoff=1, ...)
Sam(data, classes, nPerm=100, verbose=TRUE) ## S4 method for signature 'Sam,missing' plot(x, y, tracks=NULL, xlab='Expected T Statistics (Empirical)', ylab='Observed T Statistics', ...) ## S4 method for signature 'Sam' summary(object, cutoff=1, ...) ## S4 method for signature 'Sam' selectSignificant(object, cutoff=1, ...) ## S4 method for signature 'Sam' countSignificant(object, cutoff=1, ...)
data |
Either a data frame or matrix with numeric values or an
|
classes |
If |
nPerm |
An integer; the number of permutations |
verbose |
A logical flag |
x |
A |
y |
Nothing, since it is supposed to be missing. Changes to the Rd processor require documenting the missing entry. |
tracks |
a numeric vector |
xlab |
Label for the x axis |
ylab |
Label for the y axis |
object |
A |
cutoff |
A numeric value |
... |
The usual extra arguments to generic functions |
The SAM approach to analyzing microarray data was developed by Tusher
and colleagues; their implementation is widely available. This is an
independent implementation based on the description in their original
paper, customized to use the same interface (and thus work with
ExpressionSet
objects) used
by the rest of the ClassComparison package. The fundamental idea
behind SAM is that the observed distribution of row-by-row two-sample
t-tests should be compared not to the theoretical null distribution
but to a null distribution estimated by a permutation test. The
Sam
constructor performs the permutation test.
summary
returns an object of class SamSummary
.
selectSignificant
returns a vector of logical values.
countSignificant
returns an integer.
As usual, objects can be created by new
, but better methods are
available in the form of the Sam
function. The inputs to this
function are the same as those used for row-by-row statistical tests
throughout the ClassComparison package; a detailed description can be
found in the MultiTtest
class.
t.statistics
:numeric vector containing the observed t-statistics.
observed
:numeric vector containing the sorted observed t-statistics.
expected
:numeric vector of the expected distribution of t-statistics based on a permutation test.
sim.data
:numeric matrix containing all the t-statistics from all the permutations.
call
:object of class call
specifying the function
call that was used to create this object.
Compute a summary of the object.
Plot the observed and expected
t-statistics. The tracks
argument causes parallel lines to be
drawn on either side of the quantile-quantile central line, at the
specified offsets. Colors in the plot are controlled by the current
values of oompaColor$CENTRAL.LINE
and
oompaColor$CONFIDENCE.CURVE
Compute a vector that selects significant values
Count the number of significant values
Kevin R. Coombes [email protected]
Tusher VG, Tibshirani R, Chu G.
Significance analysis of microarrays applied to the ionizing radiation
response.
Proc Natl Acad Sci U S A (2001) 98, 5116-5121.
showClass("Sam") ng <- 10000 ns <- 50 nd <- 100 dat <- matrix(rnorm(ng*ns), ncol=ns) dat[1:nd, 1:(ns/2)] <- dat[1:nd, 1:(ns/2)] + 2 dat[(nd+1):(2*nd), 1:(ns/2)] <- dat[(nd+1):(2*nd), 1:(ns/2)] - 2 cla <- factor(rep(c('A', 'B'), each=25)) res <- Sam(dat, cla) plot(res) plot(res, tracks=1:3) summary(res) summary(res, cutoff=2) a <- summary(res) plot([email protected]) plot([email protected][1:300]) countSignificant(res, 1)
showClass("Sam") ng <- 10000 ns <- 50 nd <- 100 dat <- matrix(rnorm(ng*ns), ncol=ns) dat[1:nd, 1:(ns/2)] <- dat[1:nd, 1:(ns/2)] + 2 dat[(nd+1):(2*nd), 1:(ns/2)] <- dat[(nd+1):(2*nd), 1:(ns/2)] - 2 cla <- factor(rep(c('A', 'B'), each=25)) res <- Sam(dat, cla) plot(res) plot(res, tracks=1:3) summary(res) summary(res, cutoff=2) a <- summary(res) plot(a@significant.calls) plot(a@significant.calls[1:300]) countSignificant(res, 1)
An implementation class. Users are not expected to create these objects
directly; they are produced as return objects from the summary method
for Sam
.
fdr
:numeric scalar between 0
and 1
specifying
the expected false discovery rate
hi
:Upper threshold for significance
lo
:Lower threshold for significance
cutoff
:numeric scalar specified in the call to the
Sam
summary
method.
significant.calls
:vector of logical values
average.false.count
:The average number of false positives in the permuted data at this cutoff level.
signature(object = SamSummary)
: Print the
object, which contains a summary of the underlying Sam
object.
Kevin R. Coombes [email protected]
showClass("SamSummary")
showClass("SamSummary")
In the world of multiple testing that is inhabited by most microarray or protein profiling experiments, analysts frequently perform separate statistical tests for each gene or protein in the experiment. Determining cutoffs that achieve statistical significance (in a meaningful way) is an inherent part of the procedure. It is then common to select the significant items for further processing or for preparing reports, or at least to count the number of significant items. These generic functions provide a standard set of tools for selecting and counting the significant items, which can be used with various statistical tests and various ways to account for multiple testing.
## S4 method for signature 'ANY' cutoffSignificant(object, ...) ## S4 method for signature 'ANY' selectSignificant(object, ...) ## S4 method for signature 'ANY' countSignificant(object, ...) ## S4 method for signature 'ANY' probDiff(object, p0, ...)
## S4 method for signature 'ANY' cutoffSignificant(object, ...) ## S4 method for signature 'ANY' selectSignificant(object, ...) ## S4 method for signature 'ANY' countSignificant(object, ...) ## S4 method for signature 'ANY' probDiff(object, p0, ...)
object |
an object that performs multiple statistical tests on microarray or proteomics data |
p0 |
Prior probability that an observed value comes from the lnown distribution. |
... |
additional arguments affecting these generic methods |
cutoffSignificant
returns appropriate cutoff values that achieve
specified significance criteria.
selectSignificant
returns a logical vector, with TRUE
values
indicating items that satisfy the cutoff making them statistically
significant.
countSignificant
returns an integer, representing the number of
significant items.
Kevin R. Coombes [email protected]
Preliminary analysis of one group of samples for use in
the SmoothTtest
class. A key feature is the standard
quality control plot.
SingleGroup(avg, sd, span=0.5, name='') ## S4 method for signature 'SingleGroup' as.data.frame(x, row.names=NULL, optional=FALSE) ## S4 method for signature 'SingleGroup' summary(object, ...) ## S4 method for signature 'SingleGroup' print(x, ...) ## S4 method for signature 'SingleGroup' show(object) ## S4 method for signature 'SingleGroup,missing' plot(x, multiple=3, ccl=0, main=x@name, xlab='Mean', ylab='Std Dev', xlim=0, ylim=0, ...)
SingleGroup(avg, sd, span=0.5, name='') ## S4 method for signature 'SingleGroup' as.data.frame(x, row.names=NULL, optional=FALSE) ## S4 method for signature 'SingleGroup' summary(object, ...) ## S4 method for signature 'SingleGroup' print(x, ...) ## S4 method for signature 'SingleGroup' show(object) ## S4 method for signature 'SingleGroup,missing' plot(x, multiple=3, ccl=0, main=x@name, xlab='Mean', ylab='Std Dev', xlim=0, ylim=0, ...)
avg |
numeric vector of mean values |
sd |
numeric vector of standard deviations |
span |
parameter is passed onto |
name |
character string specifying the name of this object |
object |
object of class |
x |
object of class |
multiple |
numeric scalar specifying the multiple of the smoothed standard deviation to call significant |
ccl |
list containing objects of the
|
main |
character string specifying plot title |
xlab |
character string specifying label for the x axis |
ylab |
character string specifying label for the y axis |
xlim |
Plotting limits for the x axis. If left at the default value of zero, then the limits are automatically generated |
ylim |
Plotting limits for the y axis. If left at the default value of zero, then the limits are automatically generated |
row.names |
See the base version of |
optional |
See the base version of |
... |
extra arguments for generic or plotting routines |
In 2001 and 2002, Baggerly and Coombes developed the smooth t-test for
finding differentially expressed genes in microarray data. Along with
many others, they began by log-transforming the data as a reasonable
step in the direction of variance stabilization. They observed,
however, that the gene-by-gene standard deviations still seemed to
vary in a systematic way as a function of the mean log intensity. By
borrowing strength across genes and using loess
to fit
the observed standard deviations as a function of the mean, one
presumably got a better estimate of the true standard deviation.
Objects can be created by calls to the SingleGroup
constructor.
Users rarely have need to create these objects directly; they are
usually created as a consequence of the construction of an object of
the SmoothTtest
class.
name
:character string specifying the name of this object
avg
:numeric vector of mean values
sd
:numeric vector of standard deviations
span
:parameter used in the loess
function
to fit sd
as a function of avg
.
fit
:list containing components x
and
y
resulting from the loess
fit
score
:numeric vector specifying the ratio of the pointwise standard deviations to their smooth (loess) estimates
Combine the slots containing numeric vectors into a data frame, suitable for printing or exporting.
Write out a summary of the object.
Print the entire object. You never want to do this.
Print the entire object. You never want to do this.
Produce a scatter plot of the standard
deviations (x@sd
) as a function of the means (x@avg
).
The appropriate multiple of the loess
fit is overlaid, and
points that exceed this multiple are flagged in a different
color. Colors in the plot are controlled by the current values of
oompaColor$CENTRAL.LINE
,
oompaColor$CONFIDENCE.CURVE
,
oompaColor$BORING
,
oompaColor$BAD.REPLICATE
, and
oompaColor$WORST.REPLICATE
.
Kevin R. Coombes [email protected]
Baggerly KA, Coombes KR, Hess KR, Stivers DN, Abruzzo LV, Zhang W.
Identifying differentially expressed genes in cDNA microarray
experiments.
J Comp Biol. 8:639-659, 2001.
Coombes KR, Highsmith WE, Krogmann TA, Baggerly KA, Stivers DN, Abruzzo LV.
Identifying and quantifying sources of variation in microarray data
using high-density cDNA membrane arrays.
J Comp Biol. 9:655-669, 2002.
showClass("SingleGroup") m <- rnorm(1000, 8, 2.5) v <- rnorm(1000, 0.7) plot(m, v) x <- SingleGroup(m, v, name='bogus') summary(x) plot(x) plot(x, multiple=2)
showClass("SingleGroup") m <- rnorm(1000, 8, 2.5) v <- rnorm(1000, 0.7) plot(m, v) x <- SingleGroup(m, v, name='bogus') summary(x) plot(x) plot(x, multiple=2)
Implements the smooth t-test for differential expression as developed by Baggerly and Coombes.
SmoothTtest(stats, aname='Group One', bname='Group Two', name=paste(aname, 'vs.', bname)) ## S4 method for signature 'SmoothTtest' as.data.frame(x, row.names=NULL, optional=FALSE) ## S4 method for signature 'SmoothTtest' summary(object, ...) ## S4 method for signature 'SmoothTtest,missing' plot(x, folddiff=3, goodflag=2, badch=4, ccl=0, name=x@name, pch='.', xlab='log intensity', ylab='log ratio', ...)
SmoothTtest(stats, aname='Group One', bname='Group Two', name=paste(aname, 'vs.', bname)) ## S4 method for signature 'SmoothTtest' as.data.frame(x, row.names=NULL, optional=FALSE) ## S4 method for signature 'SmoothTtest' summary(object, ...) ## S4 method for signature 'SmoothTtest,missing' plot(x, folddiff=3, goodflag=2, badch=4, ccl=0, name=x@name, pch='.', xlab='log intensity', ylab='log ratio', ...)
stats |
object of class |
aname |
character string specifying the name of the first group |
bname |
character string specifying the name of the second group |
name |
character string specifying the name of this object |
object |
object of class |
x |
object of class |
row.names |
See the base version of |
optional |
See the base version of |
folddiff |
numeric scalar specifying the level of fold difference considered large enough to be indicated in the plots |
goodflag |
numeric scalar specifying the level (in standard deviation units) of the smooth t-statistic considered large enough to be indicated in the plot |
badch |
numeric scalar specifying the level of variability in single
groups considered large enough to be worrisome. See the |
ccl |
list containing objects of class
|
pch |
default plotting character |
xlab |
character string specifying label for the x axis |
ylab |
character string specifying label for the y axis |
... |
extra arguments for generic or plotting routines |
In 2001 and 2002, Baggerly and Coombes developed the smooth t-test for
finding differentially expressed genes in microarray data. Along with
many others, they began by log-transforming the data as a reasonable
step in the direction of variance stabilization. They observed,
however, that the gene-by-gene standard deviations still seemed to
vary in a systematic way as a function of the mean log intensity. By
borrowing strength across genes and using loess
to fit
the observed standard deviations as a function of the mean, one
presumably got a better estimate of the true standard deviation.
These smooth estimates are computed for each of two groups of samples being compared. They are then combined (gene-by-gene using the usual univariate formulas) to compute pooled "smooth" estimates of the standard deviation. These smooth estimates are then used in gene-by-gene t-tests.
The interesting question then arises of how to compute and interpret
p-values associated to these individual tests. The liberal
argument asserts that, because smoothing uses data from hundreds
of measurements to estimate the standard deviation, it can effectively
be treated as "known" in the t-tests, which should thus be compared
against the normal distribution. A conservative argument claims
that the null distribution should still be the t-distribution with the
degrees of freedom determined in the usual way by the number of
samples. The truth probably lies somewhere in between, and is
probably best approximated by some kind of permutation test. In this
implementation, we take the coward's way out and don't provide any of
those alternatives. You have to extract the t-statistics (from the
smooth.t.statistics
slot of the object) and compute your own
p-values in your favorite way. If you base the computations on a
theoretical model rather than a permutation test, then the
Bum
class provides a convenient way to account for
multiple testing.
In practice, users will first use a data frame and a classification
vector (or an ExpressionSet
) to construct an object of the
TwoGroupStats
object. This object can then be handed
directly to the SmoothTtest
function to perform the smooth
t-test.
one
:object of class SingleGroup
representing a
loess smooth of standard deviation as a function of the mean in the
first group of samples
two
:object of class SingleGroup
representing a
loess smooth of standard deviation as a function of the mean in the
second group of samples
smooth.t.statistics
:numeric vector containing the smooth t-statistics
fit
:data.frame with two columns, x
and y
,
containing the smooth estimates of the pooled standard deviation
dif
:numeric vector of the differences in mean values between the two groups
avg
:numeric vector of the overall mean value
aname
:character string specifying the name of the first group
bname
:character string specifying the name of the second group
name
:character string specifying the name of this object
stats
:object of class TwoGroupStats
that was used
to create this object
Convert the object into a data frame suitable for printing or exporting.
Write out a summary of the object.
Create a
set of six plots. The first two plots are the QC plots from the
SingleGroup
objects representing the two groups of
samples. The third plot is a scatter plot comparing the means in
the two groups. The fourth plot is Bland-Altman plot of the
overall mean against the difference in means (also known colloquially
as an M-vs-A plot). The fifth plot is a histogram of the smooth
t-statistics. The final plot is a scatter plot of the smooth
t-statistics as a function of the mean intensity.
Colors in the plots are controlled by the current values of
oompaColor$BORING
,
oompaColor$SIGNIFICANT
,
oompaColor$BAD.REPLICATE
,
oompaColor$WORST.REPLICATE
,
oompaColor$FOLD.DIFFERENCE
,
oompaColor$CENTRAL.LINE
, and
oompaColor$CONFIDENCE.CURVE
.
Kevin R. Coombes [email protected]
Baggerly KA, Coombes KR, Hess KR, Stivers DN, Abruzzo LV, Zhang W.
Identifying differentially expressed genes in cDNA microarray
experiments.
J Comp Biol. 8:639-659, 2001.
Coombes KR, Highsmith WE, Krogmann TA, Baggerly KA, Stivers DN, Abruzzo LV.
Identifying and quantifying sources of variation in microarray data
using high-density cDNA membrane arrays.
J Comp Biol. 9:655-669, 2002.
Altman DG, Bland JM.
Measurement in Medicine: the Analysis of Method Comparison Studies.
The Statistician, 1983; 32: 307-317.
Bum
,
MultiTtest
,
SingleGroup
,
TwoGroupStats
showClass("SmoothTtest") bogus <- matrix(rnorm(30*1000, 8, 3), ncol=30, nrow=1000) splitter <- rep(FALSE, 30) splitter[16:30] <- TRUE x <- TwoGroupStats(bogus, splitter) y <- SmoothTtest(x) opar <- par(mfrow=c(2, 3), pch='.') plot(y, badch=2, goodflag=1) par(opar)
showClass("SmoothTtest") bogus <- matrix(rnorm(30*1000, 8, 3), ncol=30, nrow=1000) splitter <- rep(FALSE, 30) splitter[16:30] <- TRUE x <- TwoGroupStats(bogus, splitter) y <- SmoothTtest(x) opar <- par(mfrow=c(2, 3), pch='.') plot(y, badch=2, goodflag=1) par(opar)
Implements the "Total Number of Misclassifications" method for finding differentially expressed genes.
TNoM(data, classes, verbose=TRUE) ## S4 method for signature 'TNoM' summary(object, ...) ## S4 method for signature 'TNoM' update(object, nPerm, verbose=FALSE, ...) ## S4 method for signature 'TNoM' selectSignificant(object, cutoff, ...) ## S4 method for signature 'TNoM' countSignificant(object, cutoff, ...) ## S4 method for signature 'fullTNoM,missing' plot(x, y, ...) ## S4 method for signature 'fullTNoM' hist(x, ...)
TNoM(data, classes, verbose=TRUE) ## S4 method for signature 'TNoM' summary(object, ...) ## S4 method for signature 'TNoM' update(object, nPerm, verbose=FALSE, ...) ## S4 method for signature 'TNoM' selectSignificant(object, cutoff, ...) ## S4 method for signature 'TNoM' countSignificant(object, cutoff, ...) ## S4 method for signature 'fullTNoM,missing' plot(x, y, ...) ## S4 method for signature 'fullTNoM' hist(x, ...)
data |
Either a data frame or matrix with numeric values or an
|
classes |
If |
verbose |
logical scalar. If |
object |
object of class |
nPerm |
integer scalar specifying the number of permutations to perform |
cutoff |
integer scalar |
x |
object of class |
y |
Nothing, since it is supposed to be missing. Changes to the Rd processor require documenting the missing entry. |
... |
extra arguments to generic or plotting routines |
The TNoM method was developed by Yakhini and Ben-Dor and first applied in the melanoma microarray study by Bittner and colleagues (see references). The goal of the method is to detect genes that are differentially expressed between two groups of samples. The idea is that each gene serves as a potential classifier to distinguish the two groups. One starts by determining an optimal cutoff on the expression of each gene and counting the number of misclassifications that gene makes. Next, we bin genes based on the total number of misclassifications. This distribution can be compared with the expected value (by simulating normal data sets of the same size). Alternatively, one can estimate the null distribution directly by scrambling the sample labels to perform a permutation test.
The TNoM
constructor computes the optimal cutoffs and the
misclassification rates. The update
method performs the
simulations and permutation tests, producing an object of the
fullTNoM
class.
summary
returns a TNoMSummary
object.
update
returns a fullTNoM
object.
selectSignificant
returns a vector of logical values.
countSignificant
returns an integer.
Although objects of the class can be created by a direct call to
new, the preferred method is to use the
TNoM
generator. The inputs to this function are the same as those
used for row-by-row statistical tests throughout the ClassComparison
package; a detailed description can be found in the MultiTtest
class.
Objects of the TNoM
class have the following slots:
data
:The data matrix used to construct the object
tnomData
:numeric vector, whose length is the number
of rows in data
, recording the minimum number of
misclassification achieved using this data row.
nCol
:The number of columns in data
nRow
:The number of rows in data
classifier
:The classification vector used to create the object.
call
:The function call
that created the object
Objects of the fullTNoM
class have the following slots:
dex
:Numeric vector of the different possible numbers of misclassifications
fakir
:Numeric vector of expected values based on simulations
obs
:Numeric vector of observed values
scr
:Numeric vector of values based on a permutation test
name
:A character string with a name for the object
Objects of the TNoM
class have the following methods:
Write out a summary of the object, including the number of genes achieving each possible number of misclassifications.
Count the number of
significant genes at the given cutoff
.
Get a vector for
selecting the number of significant genes at the given
cutoff
.
Perform
simulation and permutation tests on the TNoM
object.
Objects of the fullTNoM
class have the following methods:
Plot a summary of the TNoM object. This consists
of three curves: the observed cumulative number of genes at each
misclassification level, along with the corresponding numbers
expected based on simulations or permutation tests.
The colors of the curves are controlled by the values of
oompaColor$OBSERVED
,
oompaColor$EXPECTED
, and
oompaColor$PERMTEST
Plot a not terribly useful nor informative histogram of the results.
Kevin R. Coombes [email protected]
Bittner M, Meltzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Radmacher M,
Simon R, Yakhini Z, Ben-Dor A, Sampas N, Dougherty E, Wang E, Marincola
F, Gooden C, Lueders J, Glatfelter A, Pollock P, Carpten J, Gillanders
E, Leja D, Dietrich K, Beaudry C, Berens M, Alberts D, Sondak V.
Molecular classification of cutaneous malignant melanoma by gene
expression profiling.
Nature. 2000 Aug 3;406(6795):536-40.
Bum
,
MultiTtest
,
MultiWilcoxonTest
showClass("TNoM") showClass("fullTNoM") n.genes <- 200 n.samples <- 10 bogus <- matrix(rnorm(n.samples*n.genes, 0, 3), ncol=n.samples) splitter <- rep(FALSE, n.samples) splitter[sample(1:n.samples, trunc(n.samples/2))] <- TRUE tn <- TNoM(bogus, splitter) summary(tn) tnf <- update(tn) plot(tnf) hist(tnf)
showClass("TNoM") showClass("fullTNoM") n.genes <- 200 n.samples <- 10 bogus <- matrix(rnorm(n.samples*n.genes, 0, 3), ncol=n.samples) splitter <- rep(FALSE, n.samples) splitter[sample(1:n.samples, trunc(n.samples/2))] <- TRUE tn <- TNoM(bogus, splitter) summary(tn) tnf <- update(tn) plot(tnf) hist(tnf)
An implementation class. Users are not expected to create these objects
directly; they are produced as return objects from the summary method for
TNoM
.
TNoM
:object of class TNoM
~~
counts
:object of class numeric
~~
signature(object = TNoMSummary)
: Print the
object, which contains a summary of the underlying TNoM
object. In particular, the summary reports the number of genes
achieving each possible number of misclassifications.
Kevin R. Coombes [email protected]
showClass("TNoMSummary")
showClass("TNoMSummary")
Compute row-by-row means and variances for a data matrix whose columns belong to two different groups of interest.
TwoGroupStats(data, classes, name=comparison, name1=A, name2=B) ## S4 method for signature 'TwoGroupStats' as.data.frame(x, row.names=NULL, optional=FALSE) ## S4 method for signature 'TwoGroupStats' summary(object, ...) ## S4 method for signature 'TwoGroupStats' print(x, ...) ## S4 method for signature 'TwoGroupStats' show(object) ## S4 method for signature 'TwoGroupStats,missing' plot(x, main=x@name, useLog=FALSE, ...)
TwoGroupStats(data, classes, name=comparison, name1=A, name2=B) ## S4 method for signature 'TwoGroupStats' as.data.frame(x, row.names=NULL, optional=FALSE) ## S4 method for signature 'TwoGroupStats' summary(object, ...) ## S4 method for signature 'TwoGroupStats' print(x, ...) ## S4 method for signature 'TwoGroupStats' show(object) ## S4 method for signature 'TwoGroupStats,missing' plot(x, main=x@name, useLog=FALSE, ...)
data |
Either a data frame or matrix with numeric values or an
|
classes |
If |
name |
A character string; the name of this object |
name1 |
A character string; the name of the first group |
name2 |
A character string; the name of the second group |
x |
A |
row.names |
See the base version of |
optional |
See the base version of |
object |
A |
main |
Plot title |
useLog |
a logical flag; should the values be log-transformed before plotting? |
... |
The usual extra arguments to generic functions |
This class was one of the earliest developments in our suite of tools to analyze microarrays. Its main purpose is to segregate out the preliminary computation of summary statistics on a row-by-row basis, along with a set of plots that could be generated automatically and used for quality control.
Although objects of the class can be created by a direct call to
new, the preferred method is to use the
TwoGroupStats
generator. The inputs to this
function are the same as those used for row-by-row statistical tests
throughout the ClassComparison package; a detailed description can be
found in the MultiTtest
class.
One should note that this class serves as the front end to the
SmoothTtest
class, providing it with an interface that
accepts ExpressionSet
objects compatible with the other statistical tests in the
ClassComparison package.
mean1
:numeric vector of means in the first group
mean2
:numeric vector of means in the second group
overallMean
:numeric vector of overall row means
var1
:numeric vector of variances in the first group
var2
:numeric vector of variances in the second group
overallVar
:numeric vector of variances assuming the two groups have the same mean
pooledVar
:numeric vector of row-by-row pooled variances, assuming the two groups have the same variance but different means
n1
:numeric scalar specifying number of items in the first group
n2
:numeric scalar specifying number of items in the second group
name1
:character string specifying name of the first group
name2
:character string specifying name of the second group
name
:character string specifying name of the object
Collect the numeric vectors from the object into a single dat fame, suitable for printing or exporting.
Write out a summary of the object.
Print the object. (Actually, it only prints a
summary, since the whole object is almost always more than you
really want to see. If you insist on printing everything, use
as.data.frame
.)
Print the object (same as print method).)
This function
actually produces six different plots of the data, so it is
usually wrapped by a graphical layout command like
par(mfrow=c(2,3))
. The first two plots show the relation
between the mean and standard deviation for the two groups
separately; the third plot does the same for the overall mean and
variance. The fourth plot is a Bland-Altman plot of the difference
between the means against the overall mean. (In the microarray
world, this is usually called an M-vs-A plot.) A loess fit is
overlaid on the scatter plot, and points outside confidence bounds
based on the fit are printed in a different color to flag them as
highly variable. The fifth plot shows a loess fit (with confidence
bounds) of the difference as a function of the row index (which
often is related to the geometric position of spots on a
microarray). Thus, this plot gives a possible indication of regions
of an array where unusual things happen. The final plot compares
the overall variances to the pooled variances.
Kevin R. Coombes [email protected]
Altman DG, Bland JM.
Measurement in Medicine: the Analysis of Method Comparison Studies.
The Statistician, 1983; 32: 307-317.
showClass("TwoGroupStats") bogus <- matrix(rnorm(30*1000, 8, 3), ncol=30, nrow=1000) splitter <- rep(FALSE, 30) splitter[16:30] <- TRUE x <- TwoGroupStats(bogus, splitter) summary(x) opar<-par(mfrow=c(2,3), pch='.') plot(x) par(opar)
showClass("TwoGroupStats") bogus <- matrix(rnorm(30*1000, 8, 3), ncol=30, nrow=1000) splitter <- rep(FALSE, 30) splitter[16:30] <- TRUE x <- TwoGroupStats(bogus, splitter) summary(x) opar<-par(mfrow=c(2,3), pch='.') plot(x) par(opar)
Classes to perform row-by-row paired or unequal variance t-tests on microarray or proteomics data.
MultiTtestPaired(data, classes, pairing) MultiTtestUnequal(data, classes) ## S4 method for signature 'MultiTtestPaired' summary(object, ...) ## S4 method for signature 'MultiTtestUnequal' summary(object, ...)
MultiTtestPaired(data, classes, pairing) MultiTtestUnequal(data, classes) ## S4 method for signature 'MultiTtestPaired' summary(object, ...) ## S4 method for signature 'MultiTtestUnequal' summary(object, ...)
data |
Either a data frame or matrix with numeric values or an
|
classes |
If |
pairing |
A numerical vector indicating which samples are paired. |
object |
A |
... |
Unused; optional extra parameters for |
Although objects can be created using new
, the better method is
to use the MultiTtestPaired
or MultiTtestUnequal
functions. In the simplest case, you simply pass in a data matrix
and a logical vector assigning classes to the columns (and, in the
case of a paired t-test, a numeric vector describing the pairing), and
the constructor performs row-by-row two-sample t-tests and computes
the associated (single test) p-values. To adjust for multiple
testing, you can pass the p-values on to the Bum
class.
If you use a factor instead of a logical vector, then the t-test
compares the first level of the factor to everything else. To handle
the case of multiple classes, see the MultiLinearModel
class.
As with other class comparison functions that are part of the OOMPA,
we can also perform statistical tests on
ExpressionSet
objects from
the BioConductor libraries. In this case, we pass in an
ExpressionSet
object along with the name of a factor to use for
splitting the data.
Both classes extend class MultiTtest
, directly. See that
class for descriptions of the inherited methods and slots.
df
:The MultiTtestUnequal
class adds a slot to
record e gene-by-gene degrees of freedom, which can change along
with the variances.
signature(object = MultiTtestPaired)
:
Write out a summary of the object.
signature(object = MultiTtestUnequal)
:
Write out a summary of the object.
Kevin R. Coombes [email protected]
OOMPA
showClass("MultiTtestPaired") showClass("MultiTtestUnequal") ng <- 10000 ns <- 50 dat <- matrix(rnorm(ng*ns), ncol=ns) cla <- factor(rep(c('A', 'B'), each=25)) res <- MultiTtestUnequal(dat, cla) summary(res) hist(res, breaks=101) plot(res, [email protected]) pairing <- rep(1:25, 2) res <- MultiTtestPaired(dat, cla, pairing) summary(res) plot(res) hist([email protected], breaks=101)
showClass("MultiTtestPaired") showClass("MultiTtestUnequal") ng <- 10000 ns <- 50 dat <- matrix(rnorm(ng*ns), ncol=ns) cla <- factor(rep(c('A', 'B'), each=25)) res <- MultiTtestUnequal(dat, cla) summary(res) hist(res, breaks=101) plot(res, res@p.values) pairing <- rep(1:25, 2) res <- MultiTtestPaired(dat, cla, pairing) summary(res) plot(res) hist(res@p.values, breaks=101)