Title: | The Tail-Rank Statistic |
---|---|
Description: | Implements the tail-rank statistic for selecting biomarkers from a microarray data set, an efficient nonparametric test focused on the distributional tails. See <https://gitlab.com/krcoombes/coombeslab/-/blob/master/doc/papers/tolstoy-new.pdf>. |
Authors: | Kevin R. Coombes |
Maintainer: | Kevin R. Coombes <[email protected]> |
License: | Apache License (== 2.0) |
Version: | 3.2.3 |
Built: | 2025-01-09 03:08:44 UTC |
Source: | https://github.com/r-forge/oompa |
Density, distribution function, quantile function, and random generation
for the beta-binomial distribution. A variable with a beta-binomial
distribution is distributed as binomial distribution with parameters
N
and p
, where the probability p
of success iteself
has a beta distribution with parameters u
and v
.
dbb(x, N, u, v, log = FALSE) pbb(q, N, u, v) qbb(p, N, u, v) rbb(n, N, u, v)
dbb(x, N, u, v, log = FALSE) pbb(q, N, u, v) qbb(p, N, u, v) rbb(n, N, u, v)
x |
vector of qauntiles |
q |
vector of quantiles |
p |
vector of probabilities |
n |
number of observations |
N |
number of trials ( a positive integer) |
u |
first positive parameter of the beta distribution |
v |
second positive parameter of the beta distribution |
log |
A logical value; if true, values are returned on the log scale |
The beta-binomial distribution with parameters ,
, and
has density given by
for ,
, a positive integer
, and any
nonnegative integer
. Although one can express the integral
in closed form using generalized hypergeometric functions, the
implementation of distribution function used here simply relies on the
the cumulative sum of the density.
The mean and variance of the beta-binomial distribution can be computed explicitly as
and
dbb
gives the density, pbb
gives the distribution function,
qbb
gives the quantile function, and rbb
generates random
deviates.
Kevin R. Coombes <[email protected]>
dbeta
for the beta distribution and
dbinom
for the binomial distribution.
# set up parameters w <- 10 u <- 0.3*w v <- 0.7*w N <- 12 # generate random values from the beta-binomial x <- rbb(1000, N, u, v) # check that the empirical summary matches the theoretical one summary(x) qbb(c(0.25, 0.50, 0.75), N, u, v) # check that the empirpical histogram matches te theoretical density hist(x, breaks=seq(-0.5, N + 0.55), prob=TRUE) lines(0:N, dbb(0:N, N, u,v), type='b')
# set up parameters w <- 10 u <- 0.3*w v <- 0.7*w N <- 12 # generate random values from the beta-binomial x <- rbb(1000, N, u, v) # check that the empirical summary matches the theoretical one summary(x) qbb(c(0.25, 0.50, 0.75), N, u, v) # check that the empirpical histogram matches te theoretical density hist(x, breaks=seq(-0.5, N + 0.55), prob=TRUE) lines(0:N, dbb(0:N, N, u,v), type='b')
Compute an array of power tables for the tail-rank.test.
biomarkerPowerTable(G, N1=20, N2=seq(25, 250, by=25), psi = c(0.95, 0.99), conf=0.99, phi = seq(0.10, 0.50, by = 0.05), model="bb")
biomarkerPowerTable(G, N1=20, N2=seq(25, 250, by=25), psi = c(0.95, 0.99), conf=0.99, phi = seq(0.10, 0.50, by = 0.05), model="bb")
G |
An integer; the number of genes being assessed as potential biomarkers. Statistically, the number of hypotheses being tested. |
N1 |
An integer; the number of "train" or "healthy" samples used. |
N2 |
An integer; the number of "test" or "cancer" samples used. |
psi |
A real number between 0 and 1; the desired specificity of the test. |
conf |
A real number between 0 and 1; the confidence level of the results. Can be obtained by subtracting the family-wise Type I error from 1. |
phi |
A real number between 0 and 1; the sensitivity that one would like to be able to detect, conditional on the specificity. |
model |
A character string that determines whether power and significance are computed from abinomial or a beta-binomial (bb) model. |
Returns a list of objects of the BMPT
class. Each item
in the list consists of a two-dimensional table (indexed by the sample
sizes N
and the sensitivities phi
) with scalars
recording the values of G
, conf
, and psi
that
were used to generate it.
Default values of the optional arguments (N
, psi
,
conf
, phi
)are included in the usage examples.
Kevin R. Coombes <[email protected]>
TailRankTest
,
tailRankPower
,
biomarkerPowerTable
,
matrixMean
,
toleranceBound
stuff <- biomarkerPowerTable(10000, 20, c(10, 20, 50, 100, 250, 500), c(0.95, 0.99), c(0.99, 0.95), seq(0.1, 0.7, by=0.1)) lapply(stuff, summary)
stuff <- biomarkerPowerTable(10000, 20, c(10, 20, 50, 100, 250, 500), c(0.95, 0.99), c(0.99, 0.95), seq(0.1, 0.7, by=0.1)) lapply(stuff, summary)
A class for producing BioMarker Power Tables (BMPT), and methods for
accessing them. This class is primarily an implementation detail for
the function biomarkerPowerTable
.
BMPT(G, psi, conf, power) ## S4 method for signature 'BMPT' print(x,...) ## S4 method for signature 'BMPT' summary(object,...)
BMPT(G, psi, conf, power) ## S4 method for signature 'BMPT' print(x,...) ## S4 method for signature 'BMPT' summary(object,...)
G |
A positive integer. |
psi |
A real number between 0 and 1. |
conf |
A real number between 0 and 1. |
power |
A data frame. |
x |
A |
object |
A |
... |
Extra graphical parameters |
Although objects can be created using new
, the preferred method
is to use the constructor function BMPT
. In practice, these
objects are most likely to be created using the more general interface
through biomarkerPowerTable
.
G
:A positive integer; the number of genes being assessed as potential biomarkers. Statistically, the number of hypotheses being tested.
psi
:A real number between 0 and 1; the desired specificity of the test.
conf
:A real number between 0 and 1; the confidence level of the results. Can be obtained by subtracting the family-wise Type I error from 1.
power
:A data frame containing the power computations. The rows are indexed by the sample size and the columns by the sensitivity.
Print the power table x
.
Summarize the power table object
.
See biomarkerPowerTable
for examples.
Kevin R. Coombes <[email protected]>
TailRankTest
,
tailRankPower
,
biomarkerPowerTable
Compute the significance level and the power of a tail-rank test.
tailRankPower(G, N1, N2, psi, phi, conf = 0.95, model=c("bb", "betabinom", "binomial")) tailRankCutoff(G, N1, N2, psi, conf, model=c("bb", "betabinom", "binomial"), method=c('approx', 'exact'))
tailRankPower(G, N1, N2, psi, phi, conf = 0.95, model=c("bb", "betabinom", "binomial")) tailRankCutoff(G, N1, N2, psi, conf, model=c("bb", "betabinom", "binomial"), method=c('approx', 'exact'))
G |
An integer; the number of genes being assessed as potnetial biomarkers. Statistically, the number of hypotheses being tested. |
N1 |
An integer; the number of "train" or "healthy" samples used. |
N2 |
An integer; the number of "test" or "cancer" samples used. |
psi |
A real number between 0 and 1; the desired specificity of the test. |
phi |
A real number between 0 and 1; the sensitivity that one would like to be able to detect, conditional on the specificity. |
conf |
A real number between 0 and 1; the confidence level of the results. Can be obtained by subtracting the family-wise Type I error from 1. |
model |
A character string that determines whether significance and power are computed based on a binomial or a beta-binomial (bb) model. |
method |
A character string; either "exact" or "approx". The deafult is to use a Bonferroni approximation. |
A power estimate for the tail-rank test can be obtained as follows.
First, let X ~ Binom(N,p) denote a binomial random variable. Under the
null hypotheis that cancer is not different from normal, we let
be the expected proportion of successes in a test of
whether the value exceeds the psi-th quantile. Now let
be one such binomial measurement. When we make independent
binomial measurements, we take
(In our paper on the tail-rank statistic, we write everything in terms
of .) Then we have
Using a Bonferroni-like approximation, we can take
Solving for , we find that
So, the cutoff that ensures that in multiple experiments, each looking
at genes in
samples, we have confidence level
(or significance level
) of no false positives is
computed by the function
tailRankCutoff
.
The final point to note is that the quantiles are also defined
in terms of , so there are lots of disfiguring "1's"
in the implementation.
Now we set to be the significance cutoff using the procedure
detailed above. A gene with sensitivity
gets detected if the
observed number of cases above the threshold is greater than or equal to
. The
tailRankPower
function implements formula (1.3) of
our paper on the tail-rank test.
tailRankCutoff
returns an integer that is the
maximum expected value of the tail rank statistic under the null
hypothesis.
tailRankPower
returns a real numbe between 0 and 1
that is the power of the tail-rank test to detect a marker with
true sensitivity equal to .
Kevin R. Coombes <[email protected]>
TailRankTest
,
tailRankPower
,
biomarkerPowerTable
,
matrixMean
,
toleranceBound
psi.0 <- 0.99 confide <- rev(c(0.8, 0.95, 0.99)) nh <- 20 ng <- c(100, 1000, 10000, 100000) ns <- c(10, 20, 50, 100, 250, 500) formal.cut <- array(0, c(length(ns), length(ng), length(confide))) for (i in 1:length(ng)) { for (j in 1:length(ns)) { formal.cut[j, i, ] <- tailRankCutoff(ng[i], nh, ns[j], psi.0, confide) } } dimnames(formal.cut) <- list(ns, ng, confide) formal.cut phi <- seq(0.1, 0.7, by=0.1) N <- c(10, 20, 50, 100, 250, 500) pows <- matrix(0, ncol=length(phi), nrow=length(N)) for (ph in 1:length(phi)) { pows[, ph] <- tailRankPower(10000, nh, N, 0.95, phi[ph], 0.9) } pows <- data.frame(pows) dimnames(pows) <- list(as.character(N), as.character(round(100*phi))) pows
psi.0 <- 0.99 confide <- rev(c(0.8, 0.95, 0.99)) nh <- 20 ng <- c(100, 1000, 10000, 100000) ns <- c(10, 20, 50, 100, 250, 500) formal.cut <- array(0, c(length(ns), length(ng), length(confide))) for (i in 1:length(ng)) { for (j in 1:length(ns)) { formal.cut[j, i, ] <- tailRankCutoff(ng[i], nh, ns[j], psi.0, confide) } } dimnames(formal.cut) <- list(ns, ng, confide) formal.cut phi <- seq(0.1, 0.7, by=0.1) N <- c(10, 20, 50, 100, 250, 500) pows <- matrix(0, ncol=length(phi), nrow=length(N)) for (ph in 1:length(phi)) { pows[, ph] <- tailRankPower(10000, nh, N, 0.95, phi[ph], 0.9) } pows <- data.frame(pows) dimnames(pows) <- list(as.character(N), as.character(round(100*phi))) pows
Perform a tail-rank test to find candidate biomarkers in a microarray data set.
TailRankTest(data, classes, specificity = 0.95, tolerance = 0.50, model=c("bb", "betabinomial", "binomial"), confidence = 0.95, direction = "up")
TailRankTest(data, classes, specificity = 0.95, tolerance = 0.50, model=c("bb", "betabinomial", "binomial"), confidence = 0.95, direction = "up")
data |
A matrix or data.frame containing numerical measurements on which to perform the tail-rank test. |
classes |
A logical vector or factor splitting the data into two
parts. The length of this vector should equal the number of columns
in the |
specificity |
a real number between 0 and 1; the desired specificity used in the test to estimate a quantile from the "base" group. This is an optional argument with default value 0.95. |
tolerance |
a real number between 0 and 1; the upper tolerance bound used to estimate the threshold. This is an optional argument with default value 0.90. |
model |
a character string that determines whther significance comes from a binomial model or a beta-binomial (bb) model. |
confidence |
a real number between 0 and 1; the confidence level that there are no false positives. This is an optional argument with default value 0.50, which is equivalent to ignoring the tolerance. |
direction |
a character string representing the direction of the test; can be "up", "down", or "two-sided". The default value is "up". |
This function computes the tail rank statistic for each gene (viewed as one row of the data matrix). The data is split into two groups. The first ("base") group is used to estimate a tolerance bound (defaults to 50%) on a specific quantile (defaults to 95%) of the distribution of each gene. The tail-rank statistic is the defined as the number of samples in the second ("test") group that lie outside the bound. The test can be applied in the "up", "down", or "two-sided" direction, depending on the kinds of markers being sought. Also computes the cutoff for significance based on a confidence level that is "1 - FWER" for a desired family-wise error rate.
The return value is an object of class TailRankTest.
Kevin R. Coombes <[email protected]>
http://bioinformatics.mdanderson.org
TailRankTest-class
,
tailRankPower
,
biomarkerPowerTable
,
toleranceBound
# generate some fake data to use in the example nr <- 40000 nc <- 110 fake.data <- matrix(rnorm(nr*nc), ncol=nc) fake.class <- rep(c(TRUE, FALSE), c(40, 70)) # perform the tail-rank test null.tr <- TailRankTest(fake.data, fake.class) # get a summary of the results summary(null.tr) # plot a histogram of the statistics hist(null.tr, overlay=TRUE) # get the actual statistics stats <- getStatistic(null.tr) # get a vector that selects the "positive" calls for the test is.marker <- as.logical(null.tr) # the following line should evaluate to the number of rows, nr = 40000 sum( is.marker == (stats > null.tr@cutoff) )
# generate some fake data to use in the example nr <- 40000 nc <- 110 fake.data <- matrix(rnorm(nr*nc), ncol=nc) fake.class <- rep(c(TRUE, FALSE), c(40, 70)) # perform the tail-rank test null.tr <- TailRankTest(fake.data, fake.class) # get a summary of the results summary(null.tr) # plot a histogram of the statistics hist(null.tr, overlay=TRUE) # get the actual statistics stats <- getStatistic(null.tr) # get a vector that selects the "positive" calls for the test is.marker <- as.logical(null.tr) # the following line should evaluate to the number of rows, nr = 40000 sum( is.marker == (stats > null.tr@cutoff) )
This is the class representation for the results of a tail-rank test to find biomarkers in a microarray data set. It includes methods for summarizing and plotting the results of the test.
Although objects can be created, as usual, using new
, the only
reliable way to create valid objects is to use the
TailRankTest
function. See the description of that
function for details on how the tail-rank test works.
statistic
:a numeric vector containng the tail-rank statistic for each row (gene) in a microarray data set
direction
:a character string representing the direction of the test; can be "up", "down", or "two-sided"
N1
:an integer; the numnber of samples in the "base" or "healthy" group
N2
:an integer; the number of samples in the "test" or "cancer" group
specificity
:a real number between 0 and 1; the desired specificity used in the test to estimate a quantile from the "base" group
tolerance
:a real number between 0 and 1; the upper tolerance bound used to estimate the threshold
confidence
:a real number between 0 and 1; the confidence level that there are no false positives
cutoff
:an integer; the maximum expected value of the statistic under the null hypothesis
model
:a character string describing the model (binomial or beta-binomial) used to decide on cutoffs for significance
tau
:a numeric vector or NULL; gene-by-gene upper bounds for significance
rho
:a numeric vector or NULL; gene-by-gene lower bounds for significance
Display a summary of the TailRankTest object
Plot a histogram of the statistic in
the TailRankTest object x
. The optional argument
overlay
is a logical flag. If overlay=TRUE
, then the
histogram is overlain with a curve representing the null
distribution. The default value of overlay
is FALSE
.
Convert the TailRankTest object
x
into a logical vector, which takes on a TRUE
value
whenever the tail-rank statistic exceeds the significance cutoff.
Obtain the vector of tail-rank
statistics contained in object
.
Kevin R. Coombes <[email protected]>
TailRankTest
,
tailRankPower
,
biomarkerPowerTable
,
matrixMean
,
toleranceBound
# generate some fake data to use in the example nr <- 40000 nc <- 110 fake.data <- matrix(rnorm(nr*nc), ncol=nc) fake.class <- rep(c(TRUE, FALSE), c(40, 70)) # perform the tail-rank test null.tr <- TailRankTest(fake.data, fake.class) # get a summary of the results summary(null.tr) # plot a histogram of the statistics hist(null.tr, overlay=TRUE) # get the actual statistics stats <- getStatistic(null.tr) # get a vector that selects the "positive" calls for the test is.marker <- as.logical(null.tr) # the following line should evaluate to the number of rows, nr = 40000 sum( is.marker == (stats > null.tr@cutoff) )
# generate some fake data to use in the example nr <- 40000 nc <- 110 fake.data <- matrix(rnorm(nr*nc), ncol=nc) fake.class <- rep(c(TRUE, FALSE), c(40, 70)) # perform the tail-rank test null.tr <- TailRankTest(fake.data, fake.class) # get a summary of the results summary(null.tr) # plot a histogram of the statistics hist(null.tr, overlay=TRUE) # get the actual statistics stats <- getStatistic(null.tr) # get a vector that selects the "positive" calls for the test is.marker <- as.logical(null.tr) # the following line should evaluate to the number of rows, nr = 40000 sum( is.marker == (stats > null.tr@cutoff) )
This file describes the methods for an object of the class
TailRankTest
class.
## S4 method for signature 'TailRankTest' summary(object, ...) ## S4 method for signature 'TailRankTest' hist(x, overlay = FALSE, xlab = "tail-rank statistic", main = "", ...) ## S4 method for signature 'TailRankTest' as.logical(x, ...) ## S4 method for signature 'TailRankTest' getStatistic(object,...)
## S4 method for signature 'TailRankTest' summary(object, ...) ## S4 method for signature 'TailRankTest' hist(x, overlay = FALSE, xlab = "tail-rank statistic", main = "", ...) ## S4 method for signature 'TailRankTest' as.logical(x, ...) ## S4 method for signature 'TailRankTest' getStatistic(object,...)
x |
A |
object |
A |
overlay |
An optional logical flag; defaults to |
xlab |
A character string |
main |
A character string |
... |
Extra graphical parameters |
as.logical |
Returns a logical vector. |
getStatistic |
Returns the vector of tail-rank statistics
contained in |
hist |
Invisibly returns the TailRankTest object. |
summary |
Invisibly returns the TailRankTest object. |
Kevin R. Coombes <[email protected]>
TailRankTest-class
,
TailRankTest
,
tailRankPower
# generate some fake data to use in the example nr <- 40000 nc <- 110 fake.data <- matrix(rnorm(nr*nc), ncol=nc) fake.class <- rep(c(TRUE, FALSE), c(40, 70)) # build an object null.tr <- TailRankTest(fake.data, fake.class) # summarize the object summary(null.tr) # plot a histogram hist(null.tr) hist(null.tr, breaks=70, col='blue', overlay=TRUE) # get a logical vector that can select those markers # identified by the test selector <- as.logical(null.tr)
# generate some fake data to use in the example nr <- 40000 nc <- 110 fake.data <- matrix(rnorm(nr*nc), ncol=nc) fake.class <- rep(c(TRUE, FALSE), c(40, 70)) # build an object null.tr <- TailRankTest(fake.data, fake.class) # summarize the object summary(null.tr) # plot a histogram hist(null.tr) hist(null.tr, breaks=70, col='blue', overlay=TRUE) # get a logical vector that can select those markers # identified by the test selector <- as.logical(null.tr)
The function toleranceBound
computes theoretical upper tolerance
bounds on the quantiles of the standard normal distribution. These can
be used to produce reliable data-driven estimates of the quantiles in
any normal distribution.
toleranceBound(psi, gamma, N)
toleranceBound(psi, gamma, N)
psi |
A real number between 0 and 1 giving the desired quantile |
gamma |
A real number between 0 and 1 giving the desired tolerance bound |
N |
An integer giving the number of observations used to estimate the quantile |
Suppose that we collect observations from a normal distribution
with unknown mean and variance, and wish to estimate the
th
percentile of the distribution. A simple point estimate is given by
. However, only the mean of the distribution is
less than this value
of the time. When
, for example,
almost half of the time (
), fewer than
of the
observed values will be less than
. This problem is addressed by
constructing a statistical tolerance interval (more precisely, a one-sided
tolerance bound) that contains a given fraction,
, of the
population with a given confidence level,
[Hahn and Meeker,
1991]. With enough samples, one can obtain distribution-free tolerance
bounds [op.\ cit., Chapter 5]. For instance, one can use bootstrap or
jackknife methods to estimate these bounds empirically.
Here, however, we assume that the measurements are normally distributed. We
let denote the sample mean and let
denote the sample
standard deviation. The upper tolerance bound that,
of
the time, exceeds
of
values from a normal
distribution is approximated by
,
where
and, for any ,
is the critical value of the normal
distribution that is exceeded with probability
[Natrella, 1963].
Returns the value of with the property that the
th quantile will be less than the estimate
(based on
data points) at least
of the time.
Lower tolerance bounds on quantiles with psi
less than
one-half can be obtained as ,
Kevin R. Coombes <[email protected]>
Natrella, M.G. (1963) Experimental Statistics. NBS Handbook 91, National Bureau of Standards, Washington DC.
Hahn, G.J. and Meeker, W.Q. (1991) Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, Inc., New York.
N <- 50 x <- rnorm(N) tolerance <- 0.90 quant <- 0.95 tolerance.factor <- toleranceBound(quant, tolerance, N) # upper 90% tolerance bound for 95th percentile tau <- mean(x) + sd(x)*tolerance.factor # lower 90% tolerance bound for 5th percentile rho <- mean(x) - sd(x)*tolerance.factor # behavior of the tolerance bound as N increases nn <- 10:100 plot(nn, toleranceBound(quant, tolerance, nn)) # behavior of the bound as the tolerance varies xx <- seq(0.5, 0.99, by=0.01) plot(xx, toleranceBound(quant, xx, N))
N <- 50 x <- rnorm(N) tolerance <- 0.90 quant <- 0.95 tolerance.factor <- toleranceBound(quant, tolerance, N) # upper 90% tolerance bound for 95th percentile tau <- mean(x) + sd(x)*tolerance.factor # lower 90% tolerance bound for 5th percentile rho <- mean(x) - sd(x)*tolerance.factor # behavior of the tolerance bound as N increases nn <- 10:100 plot(nn, toleranceBound(quant, tolerance, nn)) # behavior of the bound as the tolerance varies xx <- seq(0.5, 0.99, by=0.01) plot(xx, toleranceBound(quant, xx, N))