Title: Statistical Models for Word Frequency Distributions
Description: Statistical models and utilities for the analysis of word frequency distributions. The utilities include functions for loading, manipulating and visualizing word frequency data and vocabulary growth curves. The package also implements several statistical models for the distribution of word frequencies in a population. (The name of this package derives from the most famous word frequency distribution, Zipf's law.)
Authors: Stefan Evert <[email protected]>, Marco Baroni <[email protected]>
Maintainer: Stefan Evert <[email protected]>
License: GPL-3
Version: 0.6-71
Built: 2024-12-18 04:54:06 UTC
Source: https://github.com/r-forge/zipfr
The zipfR package performs Large-Number-of-Rare-Events (LNRE) modeling of (linguistic) type frequency distributions (Baayen 2001) and provides utilities to run various forms of lexical statistics analysis in R.
The best way to get started with zipfR is to read the tutorial, which is available as a package vignette via the HTML documentation; you can also download it from https://zipfr.r-forge.r-project.org/#start
zipfR is released under the GNU General Public License (http://www.gnu.org/copyleft/gpl.html)
Stefan Evert <[email protected]> and Marco Baroni <[email protected]>
Maintainer: Stefan Evert <[email protected]>
zipfR Website: https://zipfR.r-forge.r-project.org/
Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer, Dordrecht.
Baroni, Marco (2008). Distributions in text. In: A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, article 37. Mouton de Gruyter, Berlin.
Evert, Stefan (2004). A simple LNRE model for random character sequences. Proceedings of JADT 2004, 411-422.
Evert, Stefan (2004b). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, IMS, University of Stuttgart. URN urn:nbn:de:bsz:93-opus-23714 http://dx.doi.org/10.18419/opus-2556
Evert, Stefan and Baroni, Marco (2006). Testing the extrapolation quality of word frequency models. Proceedings of Corpus Linguistics 2005.
Evert, Stefan and Baroni, Marco (2006). The zipfR library: Words and other rare events in R. useR! 2006: The second R user conference.
The zipfR tutorial: available as a package vignette and online from https://zipfr.r-forge.r-project.org/#start.
Some good entry points into the zipfR documentation are spc, vgc, tfl, read.spc, read.tfl, read.vgc, lnre, lnre.vgc, plot.spc and plot.vgc.
Harald Baayen's LEXSTATS tools, which implement a wider range of LNRE models: https://www.springer.com/de/book/9780792370178
Stefan Evert's UCS tools for collocation analysis, which include functions that have been integrated into zipfR: http://www.collocations.de/software.html
## load Oliver Twist and Great Expectations frequency spectra
data(DickensOliverTwist.spc)
data(DickensGreatExpectations.spc)

## check sample size and vocabulary and hapax counts
N(DickensOliverTwist.spc)
V(DickensOliverTwist.spc)
Vm(DickensOliverTwist.spc, 1)
N(DickensGreatExpectations.spc)
V(DickensGreatExpectations.spc)
Vm(DickensGreatExpectations.spc, 1)

## compute binomially interpolated growth curves
ot.vgc <- vgc.interp(DickensOliverTwist.spc, (1:100)*1570)
ge.vgc <- vgc.interp(DickensGreatExpectations.spc, (1:100)*1865)

## plot them
plot(ot.vgc, ge.vgc, legend=c("Oliver Twist", "Great Expectations"))

## load Dickens' works frequency spectrum
data(Dickens.spc)

## compute Zipf-Mandelbrot model from Dickens data
## and look at model summary
zm <- lnre("zm", Dickens.spc)
zm

## plot observed and expected spectrum
zm.spc <- lnre.spc(zm, N(Dickens.spc))
plot(Dickens.spc, zm.spc)

## obtain expected V and V1 values at arbitrary sample sizes
EV(zm, 1e+8)
EVm(zm, 1, 1e+8)

## generate expected V and V1 growth curves up to a sample size
## of 10 million tokens and plot them, with vertical line at
## estimation size
ext.vgc <- lnre.vgc(zm, (1:100)*1e+5, m.max=1)
plot(ext.vgc, N0=N(zm), add.m=1)
Frequency spectra included as examples in Baayen (2001).
Baayen2001
A list of 23 frequency spectra, i.e. objects of class spc. List elements are named according to the original files, but without the extension .spc.
See Baayen (2001, pp. 249-277) for details.
In particular, the following spectra are included:
alice: Lewis Carroll, Alice's Adventures in Wonderland
through: Lewis Carroll, Through the Looking-Glass and What Alice Found There
war: H. G. Wells, War of the Worlds
hound: Arthur Conan Doyle, Hound of the Baskervilles
havelaar: E. Douwes Dekker, Max Havelaar
turkish: An archeology text (Turkish)
estonian: A. H. Tammsaare, Truth and Justice (Estonian)
bnc: The context-governed subcorpus of the British National Corpus (BNC)
in1: Sample of 1 million tokens from The Independent
in8: Sample of 8 million tokens from The Independent
heid: Nouns in -heid in the CELEX database (Dutch)
iteit: Nouns in -iteit in the CELEX database (Dutch)
ster: Nouns in -ster in the CELEX database (Dutch)
in: Nouns in -in in the CELEX database (Dutch)
nouns: Simplex nouns in the CELEX database (Dutch)
sing: Singular nouns in M. Innes, The Bloody Wood
plur: Plural nouns in M. Innes, The Bloody Wood
nessw: Nouns in -ness in the written subcorpus of the BNC
nesscg: Nouns in -ness in the context-governed subcorpus of the BNC
nessd: Nouns in -ness in the demographic subcorpus of the BNC
filarial: Counts of filarial worms in mites on rats
cv: Context-vowel patterns in the TIMIT speech database
pairs: Word pairs in E. Douwes Dekker, Max Havelaar
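Baayen2001 is an ordinary R list of spc objects, so the individual spectra can be inspected directly; a minimal sketch using accessors documented elsewhere in this package:

names(Baayen2001)          # the 23 spectra listed above
summary(Baayen2001$alice)  # one individual spectrum
N(Baayen2001$alice)        # its sample size
V(Baayen2001$alice)        # and vocabulary size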
Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer, Dordrecht.
Baayen2001$alice
The functions documented here compute incomplete and regularized Beta and Gamma functions as well as their logarithms and the corresponding inverse functions. These functions will be of interest to developers, not users of the toolkit.
Cgamma(a, log=!missing(base), base=exp(1))
Igamma(a, x, lower=TRUE, log=!missing(base), base=exp(1))
Igamma.inv(a, y, lower=TRUE, log=!missing(base), base=exp(1))
Rgamma(a, x, lower=TRUE, log=!missing(base), base=exp(1))
Rgamma.inv(a, y, lower=TRUE, log=!missing(base), base=exp(1))

Cbeta(a, b, log=!missing(base), base=exp(1))
Ibeta(x, a, b, lower=TRUE, log=!missing(base), base=exp(1))
Ibeta.inv(y, a, b, lower=TRUE, log=!missing(base), base=exp(1))
Rbeta(x, a, b, lower=TRUE, log=!missing(base), base=exp(1))
Rbeta.inv(y, a, b, lower=TRUE, log=!missing(base), base=exp(1))
a, b: non-negative numeric vectors, the parameters of the Gamma and Beta functions (b applies to the Beta functions only)
x: a non-negative numeric vector, the point at which the incomplete or regularized Gamma or Beta function is evaluated (for the Beta functions, x must be in the range $[0,1]$)
y: a non-negative numeric vector, the values of the Gamma or Beta function on linear or logarithmic scale
lower: whether to compute the lower (lower=TRUE) or upper (lower=FALSE) incomplete or regularized Gamma or Beta function
log: if TRUE, return function values (or interpret the argument y) on a logarithmic scale
base: a positive number, specifying the base of the logarithmic scale for values of the Gamma and Beta functions (default: natural logarithm). Setting the base argument implies log=TRUE (cf. the default log=!missing(base)).
Cgamma returns the (complete) Gamma function $\Gamma(a)$ evaluated at a. Igamma returns the (lower or upper) incomplete Gamma function with parameter a evaluated at point x, i.e. $\gamma(a,x)$ (lower=TRUE) or $\Gamma(a,x)$ (lower=FALSE). Rgamma returns the corresponding regularized Gamma function, $P(a,x) = \gamma(a,x)/\Gamma(a)$ (lower=TRUE) or $Q(a,x) = \Gamma(a,x)/\Gamma(a)$ (lower=FALSE). If log=TRUE, the returned values are on logarithmic scale as specified by the base parameter.

Igamma.inv and Rgamma.inv compute the inverse of the incomplete and regularized Gamma functions with respect to the parameter x. I.e., Igamma.inv(a,y) returns the point x at which the (lower or upper) incomplete Gamma function with parameter a evaluates to y, and mutatis mutandis for Rgamma.inv(a,y). If log=TRUE, the parameter y is taken to be on a logarithmic scale as specified by base.
Cbeta returns the (complete) Beta function $B(a,b)$ with arguments a and b. Ibeta returns the (lower or upper) incomplete Beta function with parameters a and b, evaluated at point x, i.e. $B(x; a, b)$ (lower=TRUE) and $B^*(x; a, b)$ (lower=FALSE). Note that in contrast to the Gamma functions, the capital letter $B$ refers to the lower incomplete Beta function, and there is no standardized notation for the upper incomplete Beta function, so $B^*$ is used here as an ad-hoc symbol. Rbeta returns the corresponding regularized Beta function, $I_x(a,b) = B(x; a, b) / B(a,b)$ (lower=TRUE) or $1 - I_x(a,b)$ (lower=FALSE). If log=TRUE, the returned values are on logarithmic scale as specified by the base parameter.

Ibeta.inv and Rbeta.inv compute the inverse of the incomplete and regularized Beta functions with respect to the parameter x. I.e., Ibeta.inv(y,a,b) returns the point x at which the (lower or upper) incomplete Beta function with parameters a and b evaluates to y, and mutatis mutandis for Rbeta.inv(y,a,b). If log=TRUE, the parameter y is taken to be on a logarithmic scale as specified by base.
All Gamma and Beta functions can be vectorized in the arguments x, y, a and b, with the usual R value recycling rules in the case of multiple vectorizations.
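A brief sketch of the vectorized interface; these calls rely only on the signatures shown above, with standard R recycling:

Igamma(1:3, 2)                   # vectorized in the parameter a
Rbeta(seq(0, 1, 0.25), 2, 3)     # vectorized in the evaluation point x
Rgamma(2, c(1, 5), lower=FALSE)  # upper regularized Gamma at two points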
The upper incomplete Gamma function is defined by the Gamma integral
$$\Gamma(a,x) = \int_x^\infty t^{a-1} e^{-t} \,dt .$$
The lower incomplete Gamma function is defined by the complementary Gamma integral
$$\gamma(a,x) = \int_0^x t^{a-1} e^{-t} \,dt .$$
The complete Gamma function calculates the full Gamma integral, i.e. $\Gamma(a) = \Gamma(a,0)$. The regularized Gamma functions scale the corresponding incomplete Gamma functions to the interval $[0,1]$, by dividing through $\Gamma(a)$. Thus, the lower regularized Gamma function is given by
$$P(a,x) = \frac{\gamma(a,x)}{\Gamma(a)}$$
and the upper regularized Gamma function is given by
$$Q(a,x) = \frac{\Gamma(a,x)}{\Gamma(a)} .$$

The lower incomplete Beta function is defined by the Beta integral
$$B(x; a, b) = \int_0^x t^{a-1} (1-t)^{b-1} \,dt$$
and the upper incomplete Beta function is defined by the complementary integral
$$B^*(x; a, b) = \int_x^1 t^{a-1} (1-t)^{b-1} \,dt .$$
The complete Beta function calculates the full Beta integral, i.e. $B(a,b) = B(1; a, b)$. The regularized Beta function scales the incomplete Beta function to the interval $[0,1]$, by dividing through $B(a,b)$. The lower regularized Beta function is thus given by
$$I_x(a,b) = \frac{B(x; a, b)}{B(a,b)}$$
and the upper regularized Beta function is given by
$$1 - I_x(a,b) = \frac{B^*(x; a, b)}{B(a,b)} .$$
See also gamma and lgamma, which are fully equivalent to Cgamma, as well as beta and lbeta, which are fully equivalent to Cbeta.

The implementations of the incomplete and regularized Gamma functions are based on the Gamma distribution (see pgamma), and those of the Beta functions are based on the Beta distribution (see pbeta).
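Since the implementations are based on pgamma and pbeta, the correspondence can be verified directly; a small consistency check:

all.equal(Rgamma(3, 5), pgamma(5, shape=3))  # lower regularized Gamma
all.equal(Rgamma(3, 5, lower=FALSE), pgamma(5, shape=3, lower.tail=FALSE))
all.equal(Rbeta(.3, 2, 4), pbeta(.3, 2, 4))  # lower regularized Beta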
Cgamma(5 + 1)  # = factorial(5)

## P(X >= k) for Poisson distribution with mean alpha
alpha <- 5
k <- 10
Rgamma(k, alpha)  # == ppois(k-1, alpha, lower=FALSE)

n <- 49
k <- 6
1 / ((n+1) * Cbeta(n-k+1, k+1))  # == choose(n, k)

## P(X >= k) for binomial distribution with parameters n and p
n <- 100
p <- .1
k <- 15
Rbeta(p, k, n-k+1)  # == pbinom(k-1, n, p, lower=FALSE)
Estimate confidence intervals for empirical distributions obtained by parametric bootstrapping. The input data must contain a sufficient number of bootstrap replicates for the desired confidence level.
bootstrap.confint(x, level=0.95, method=c("normal", "mad", "empirical"), data.frame=FALSE)
x: a numeric matrix, with rows corresponding to bootstrap replicates and columns corresponding to different statistics or coefficients. The matrix should have column labels, which will be preserved in the result. A data frame with numeric columns is automatically converted to a matrix.
level: desired confidence level (two-sided)
method: type of confidence interval to be estimated (see "Details" below)
data.frame: if TRUE, return the result as a data frame rather than a matrix
This function can compute three different types of confidence intervals, selected by the method parameter. In addition, corresponding estimates of central tendency (center) and spread (spread) of the distribution are returned.

normal: Wald-type confidence interval based on normal approximation of the bootstrapped distribution (default). Central tendency is given by the sample mean, spread by standard deviation. This method is unreliable if the bootstrapping produces outlier results and can report confidence limits outside the feasible range of a parameter or coefficient (e.g. a negative population diversity $S$). For this reason, it is strongly recommended to use a more robust type of confidence interval.

mad: Robust asymmetric confidence intervals around the median, using separate estimates for left and right median absolute deviation (MAD) as robust approximations of standard deviation. Central tendency is given by the median, and spread by the average of left and right standard deviation (estimated via MAD). This method is applicable in most situations and requires fewer bootstrap replicates than empirical confidence intervals. Note that the values are different from those returned by the mad function because of the separate left and right estimators.

empirical: The empirical inter-quantile range, e.g. 2.5% to 97.5% for the default level=0.95. Note that the actual range might be slightly different depending on the number of bootstrap replicates available. Central tendency is given by the median, and spread by the inter-quartile range (IQR) re-scaled to be comparable to standard deviation (cf. IQR). This is the only method guaranteed to stay within the feasible range, but it requires a large number of bootstrap replicates for reliable confidence intervals (e.g. at least 120 replicates for the default 95% confidence level).
A numeric matrix with the same number of columns and column labels as x, and four rows:

1. the lower boundary of the confidence interval (labelled with the corresponding quantile, e.g. 2.5%)
2. the upper boundary of the confidence interval (labelled with the corresponding quantile, e.g. 97.5%)
3. an estimate of central tendency (labelled center)
4. an estimate of spread on a scale comparable to standard deviation (labelled spread)

If data.frame=TRUE, the matrix is converted to a data frame for convenient printing and access in interactive sessions.
bootstrap.confint is usually applied to the output of lnre.bootstrap with simplify=TRUE. In particular, confint.lnre uses this function to obtain bootstrapped confidence intervals for LNRE model parameters and other coefficients; lnre.productivity.measures (with bootstrap=TRUE) uses it to determine approximate sampling distributions of productivity measures for a LNRE population.
M <- cbind(alpha=rnorm(200, 10, 5),       # Gaussian distribution around mean = 10
           beta=rlnorm(200, log(10), 1))  # log-normal distribution around median = 10
summary(M)  # overview of the distribution

bootstrap.confint(M, method="normal")     # normal approximation
bootstrap.confint(M, method="mad")        # robust asymmetric MAD estimates
bootstrap.confint(M, method="empirical")  # empirical confidence interval

bootstrap.confint(M, method="normal", data.frame=TRUE)  # as data frame
Brown.tfl, Brown.spc and Brown.emp.vgc are zipfR objects of classes tfl, spc and vgc, respectively.

These data were extracted from the Brown corpus (see Kucera and Francis 1967). Brown.emp.vgc is the empirical vocabulary growth curve, reflecting the V and V(1) development in the non-randomized corpus.

We removed numbers and other forms of non-linguistic material before collecting word counts from the Brown.
Kucera, H. and Francis, W.N. (1967). Computational analysis of present-day American English. Brown University Press, Providence.
The datasets documented in BrownSubsets pertain to various subsets of the Brown (e.g., informative prose, adjectives only, etc.).
data(Brown.tfl)
summary(Brown.tfl)

data(Brown.spc)
summary(Brown.spc)

data(Brown.emp.vgc)
summary(Brown.emp.vgc)
Objects of classes spc and vgc that contain frequency data for various subsets of words from the Brown corpus (see Kucera and Francis 1967).

BrownAdj.spc, BrownNoun.spc and BrownVer.spc are frequency spectra of all the Brown corpus words tagged as adjectives, nouns and verbs, respectively. BrownAdj.emp.vgc, BrownNoun.emp.vgc and BrownVer.emp.vgc are the corresponding observed vocabulary growth curves (tracking the development of V and V(1), like all the files with suffix .emp.vgc below).
BrownImag.spc and BrownInform.spc are frequency spectra of the Brown corpus words subdivided into the two main stylistic partitions of the corpus, i.e., imaginative and informative prose, respectively. BrownImag.emp.vgc and BrownInform.emp.vgc are the corresponding observed vocabulary growth curves.

Brown100k.spc is the spectrum of the first 100,000 tokens in the Brown (useful, e.g., for extrapolation experiments in which we want to estimate a lnre model on a subset of the data available). The corresponding observed growth curve can easily be obtained from the one for the whole Brown (Brown.emp.vgc).
Notice that we removed numbers and other forms of non-linguistic material before collecting any data from the Brown.
Kucera, H. and Francis, W.N. (1967). Computational analysis of present-day American English. Brown University Press, Providence.
The data described in Brown pertain to the Brown as a whole.
data(BrownAdj.spc)
summary(BrownAdj.spc)
data(BrownAdj.emp.vgc)
summary(BrownAdj.emp.vgc)

data(BrownInform.spc)
summary(BrownInform.spc)
data(BrownInform.emp.vgc)
summary(BrownInform.emp.vgc)

data(Brown100k.spc)
summary(Brown100k.spc)
Compute bootstrapped confidence intervals for LNRE model parameters. The supplied model must contain a sufficient number of bootstrapping replicates.
## S3 method for class 'lnre'
confint(object, parm, level=0.95, method=c("mad", "normal", "empirical"),
        plot=FALSE, breaks="Sturges", ...)
object: an LNRE model (i.e. an object belonging to a subclass of lnre) that has been estimated with bootstrap replicates (see lnre)
parm: model parameter(s) for which confidence intervals are desired. If unspecified, all parameters as well as the population diversity $S$ are included.
level: desired confidence level (two-sided)
method: type of confidence interval to be estimated (see bootstrap.confint for details)
plot: if TRUE, display a histogram of the bootstrapped distribution for each selected parameter
breaks: breakpoints for the histogram shown with plot=TRUE
...: all other arguments are ignored
A data frame with one numeric column for each selected model parameter (labelled with the parameter name) and four rows:

1. the lower boundary of the confidence interval (labelled with the corresponding quantile, e.g. 2.5%)
2. the upper boundary of the confidence interval (labelled with the corresponding quantile, e.g. 97.5%)
3. an estimate of central tendency (labelled center)
4. an estimate of spread on a scale comparable to standard deviation (labelled spread)
See also lnre for estimating LNRE models with bootstrap replicates, lnre.bootstrap for the underlying parametric bootstrapping code, and bootstrap.confint for the different methods of estimating confidence intervals.
model <- lnre("fzm", spc=BrownAdj.spc, bootstrap=20)

confint(model, "alpha")  # Zipf slope
confint(model, "S")      # population diversity
confint(model, "S", method="normal")  # Gaussian approx works well in this case

confint(model)  # overview
confint(model, "alpha", plot=TRUE)  # visualize bootstrap distribution
Objects of classes spc and vgc that contain frequency data for a collection of Dickens's works from Project Gutenberg, and for 3 novels (Oliver Twist, Great Expectations and Our Mutual Friend).

Dickens.spc contains a frequency spectrum derived from a collection of Dickens' works downloaded from the Gutenberg archive (A Christmas Carol, David Copperfield, Dombey and Son, Great Expectations, Hard Times, Master Humphrey's Clock, Nicholas Nickleby, Oliver Twist, Our Mutual Friend, Sketches by BOZ, A Tale of Two Cities, The Old Curiosity Shop, The Pickwick Papers, Three Ghost Stories). Dickens.emp.vgc contains the corresponding observed vocabulary growth (V and V(1)).
DickensOliverTwist.spc and DickensOliverTwist.emp.vgc contain the spectrum and observed growth curve (V and V(1)) of the early novel Oliver Twist (1837-1839).

DickensGreatExpectations.spc and DickensGreatExpectations.emp.vgc contain the spectrum and observed growth curve (V and V(1)) of the late novel Great Expectations (1860-1861).

DickensOurMutualFriend.spc and DickensOurMutualFriend.emp.vgc contain the spectrum and observed growth curve (V and V(1)) of Our Mutual Friend, the last novel completed by Dickens (1864-1865).
Notice that we removed numbers and other forms of non-linguistic material before collecting the frequency data.
Project Gutenberg: https://www.gutenberg.org/
Charles Dickens on Wikipedia: https://en.wikipedia.org/wiki/Charles_Dickens
data(Dickens.spc)
summary(Dickens.spc)
data(Dickens.emp.vgc)
summary(Dickens.emp.vgc)

data(DickensOliverTwist.spc)
summary(DickensOliverTwist.spc)
data(DickensOliverTwist.emp.vgc)
summary(DickensOliverTwist.emp.vgc)
Internal function: Generic method for estimation of LNRE model parameters. Based on the class of its first argument, the method dispatches to a suitable implementation of the estimation procedure.
Unless you are a developer working on the zipfR source code, you are probably looking for the lnre manpage.
estimate.model(model, spc, param.names, method, cost.function,
               m.max=15, runs=3, debug=FALSE, ...)
model: LNRE model object of the appropriate class (a subclass of lnre)
spc: an observed frequency spectrum, i.e. an object of class spc, from which parameter values are estimated
param.names: a character vector giving the names of parameters for which values have to be estimated ("missing" parameters)
method: name of the minimization algorithm used for parameter estimation (see lnre for details)
cost.function: cost function to be minimized (see lnre for the available cost functions)
m.max: number of spectrum elements that will be used to compute the cost function (passed on to the cost function)
runs: number of parameter optimization runs with random initialization. Parameters from the run that achieves the smallest value of the cost function will be selected. Some method implementations may not support multiple optimization runs.
debug: if TRUE, print debugging information during parameter estimation
...: additional arguments are passed on and may be used by some implementations
By default, estimate.model dispatches to a generic implementation of the estimation procedure that can be used with all types of LNRE models (estimate.model.lnre).

This generic implementation can be overridden for specific LNRE models, e.g. to calculate better init values or improve the estimation procedure in some other way. To provide a custom implementation for Zipf-Mandelbrot models (of class lnre.zm), for instance, it is sufficient to define the corresponding method implementation estimate.model.lnre.zm. If no custom implementation is provided but the user has selected the Custom method (which is the default), estimate.model falls back on Nelder-Mead for multi-dimensional minimization and NLM for one-dimensional minimization (where Nelder-Mead is considered to be unreliable).
Parameter estimation is performed by minimization of the cost function passed in the cost.function argument (see lnre for details). Depending on the method argument, a range of different minimization algorithms can be used (see lnre for a complete listing). The minimization algorithm always operates on transformed parameter values, making use of the transform utility provided by LNRE models (see lnre.details for more information about utility functions). All parameters are initialized to 0 in the transformed scale, which should translate to sensible starting points.
Note that the estimate.model implementations do not perform any error checking. It is the responsibility of the caller to make sure that the arguments are sensible and complete. In particular, all model parameters that will not be estimated (i.e. are not listed in param.names) must have been initialized to their prespecified values in the model passed to the function.
A modified version of model, where the missing parameters listed in param.names have been estimated from the observed frequency spectrum spc. In addition, goodness-of-fit information is added to the object.
The user-level function for estimating LNRE models is lnre. Its manpage also lists available cost functions and minimization algorithms.

The internal structure of lnre objects (representing LNRE models) is described on the lnre.details manpage, which also outlines the necessary steps for implementing a new LNRE model.

The minimization algorithms used are described in detail on the nlm and optim manpages from R's standard library.
EV and EVm are generic methods for computing the expected vocabulary size $E[V(N)]$ and frequency spectrum $E[V_m(N)]$ according to a LNRE model (i.e. an object belonging to a subclass of lnre).

When applied to a frequency spectrum (i.e. an object of class spc), these methods perform binomial interpolation (see EV.spc for details), although spc.interp and vgc.interp might be more convenient binomial interpolation functions for most purposes.
EV(obj, N, ...)
EVm(obj, m, N, ...)
obj: an LNRE model (i.e. an object belonging to a subclass of lnre) or an observed frequency spectrum (an object of class spc)
m: positive integer value determining the frequency class $m$ for which the expected spectrum element is calculated
N: sample size $N$ for which the expected values are calculated
...: additional arguments passed on to the method implementation (see respective manpages for details)
EV returns the expected vocabulary size $E[V(N)]$ in a sample of $N$ tokens, and EVm returns the expected spectrum elements $E[V_m(N)]$, according to the LNRE model given by obj (or according to binomial interpolation).
See lnre for more information on LNRE models, a listing of available models, and methods for parameter estimation.

The variances of the random variables $V(N)$ and $V_m(N)$ can be computed with the methods VV and VVm.

See EV.spc and EVm.spc for more information about the usage of these methods to perform binomial interpolation (but consider using spc.interp and vgc.interp instead).
## see lnre() documentation for examples
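In addition to the examples under lnre(), a minimal sketch of the spectrum-based methods (binomial interpolation), using the Dickens.spc dataset shipped with the package:

data(Dickens.spc)
N0 <- N(Dickens.spc)
EV(Dickens.spc, round(N0 / 2))      # expected vocabulary size at half the sample size
EVm(Dickens.spc, 1, round(N0 / 2))  # expected number of hapaxes (V1)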
Compute the expected vocabulary size $E[V(N)]$ (with function EV.spc) or expected frequency spectrum $E[V_m(N)]$ (with function EVm.spc) for a random sample of size $N$ from a given frequency spectrum (i.e., an object of class spc). The expectations are calculated by binomial interpolation (following Baayen 2001, pp. 64-69).

Note that these functions are not user-visible. They can be called implicitly through the generic methods EV and EVm, applied to an object of type spc.
## S3 method for class 'spc'
EV(obj, N, allow.extrapolation=FALSE, ...)

## S3 method for class 'spc'
EVm(obj, m, N, allow.extrapolation=FALSE, ...)
obj: an object of class spc, representing an observed frequency spectrum
m: positive integer value determining the frequency class $m$ for which the expected spectrum element is calculated
N: sample size $N$ for which the binomially interpolated values are calculated
allow.extrapolation: if TRUE, the requested sample size $N$ may be larger than the sample size of the spectrum obj, i.e. binomial extrapolation is performed (see "Details" below for caveats)
...: additional arguments passed on from generic methods will be ignored
These functions are naive implementations of binomial interpolation, using Equations (2.41) and (2.43) from Baayen (2001). No guarantees are made concerning their numerical accuracy, especially for extreme values of $m$ and $N$.

According to Baayen (2001, pp. 69-73), the same equations can also be used for binomial extrapolation of a given frequency spectrum to larger sample sizes. However, they become numerically unstable in this case and will typically break down when extrapolating to more than twice the size of the observed sample (Baayen 2001, p. 75). Therefore, extrapolation has to be enabled explicitly with the option allow.extrapolation=TRUE and should be used with great caution.
EV returns the expected vocabulary size $E[V(N)]$ for a random sample of $N$ tokens from the frequency spectrum obj, and EVm returns the expected spectrum elements $E[V_m(N)]$ for a random sample of $N$ tokens from obj, calculated by binomial interpolation.
Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer, Dordrecht.
See EV and EVm for the generic methods and links to other implementations. spc.interp and vgc.interp are convenience functions that compute an expected frequency spectrum or vocabulary growth curve by binomial interpolation.
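A short sketch contrasting safe interpolation with explicitly enabled extrapolation; Dickens.spc serves as example data, and the target size stays below the critical limit of twice the observed sample mentioned above:

data(Dickens.spc)
N0 <- N(Dickens.spc)
EV(Dickens.spc, round(N0 / 2))  # interpolation: always safe
EV(Dickens.spc, round(1.5 * N0), allow.extrapolation=TRUE)  # use with caution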
Corpus data for measuring the productivity of German word formation affixes -bar, -lich, -sam, -ös, -tum, Klein-, -chen and -lein (Evert & Lüdeling 2001). Data were extracted from two volumes of the German daily newspaper Stuttgarter Zeitung, then manually cleaned and normalized.
EvertLuedeling2001
A list of 8 character vectors for the different affixes, with names klein (Klein-), bar (-bar), chen (-chen), lein (-lein), lich (-lich), oes (-ös), sam (-sam) and tum (-tum).
Each vector contains all relevant tokens from the corpus in their original (chronological) ordering, so vocabulary growth curves can be determined from the vectors in addition to type frequency lists and frequency spectra.
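For instance, because the chronological ordering is preserved, an empirical vocabulary growth curve can be built directly from one of the token vectors; a minimal sketch using vec2vgc:

bar.vgc <- vec2vgc(EvertLuedeling2001$bar, m.max=1)  # track V and V(1)
plot(bar.vgc, add.m=1)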
Evert, Stefan and Lüdeling, Anke (2001). Measuring morphological productivity: Is automatic preprocessing sufficient? In Proceedings of the Corpus Linguistics 2001 Conference, pages 167–175, Lancaster, UK.
str(EvertLuedeling2001)

# tokens and type counts for the different affixes
sapply(EvertLuedeling2001, function (x) {
  y <- vec2tfl(x)
  c(N=N(y), V=V(y))
})
ItaRi.spc and ItaRi.emp.vgc are zipfR objects of classes spc and vgc, respectively. They contain frequency data for all verbal lemmas with the prefix ri- (similar to English re-) in the Italian la Repubblica corpus.

ItaUltra.spc and ItaUltra.emp.vgc contain the same kinds of data for the adjectival prefix ultra-.
ItaRi.emp.vgc and ItaUltra.emp.vgc are empirical vocabulary growth curves, reflecting the V and V(1) development in the non-randomized corpus.
The data were manually checked, as described for ri- in Baroni (to appear).
Baroni, M. (to appear) I sensi di ri-: Un'indagine preliminare. In Maschi, R., Penello, N. and Rizzolatti, P. (eds.), Miscellanea di studi linguistici offerti a Laura Vanelli. Udine, Forum.
la Repubblica corpus: http://sslmit.unibo.it/repubblica/
data(ItaRi.spc)
summary(ItaRi.spc)
data(ItaRi.emp.vgc)
summary(ItaRi.emp.vgc)

data(ItaUltra.spc)
summary(ItaUltra.spc)
data(ItaUltra.emp.vgc)
summary(ItaUltra.emp.vgc)
LNRE model constructor, returns an object representing a LNRE model with the specified parameters, or allows parameters to be estimated automatically from an observed frequency spectrum.
lnre(type=c("zm", "fzm", "gigp"), spc=NULL, debug=FALSE,
     cost=c("gof", "chisq", "linear", "smooth.linear", "mse", "exact"),
     m.max=15, runs=5,
     method=c("Nelder-Mead", "NLM", "BFGS", "SANN", "Custom"),
     exact=TRUE, sampling=c("Poisson", "multinomial"),
     bootstrap=0, verbose=TRUE, parallel=1L, ...)
type: class of LNRE model to use (see "LNRE Models" below)
spc: observed frequency spectrum used to estimate model parameters. After parameter optimisation, the goodness-of-fit of the final model is tested against this spectrum.
debug: if TRUE, print debugging information during the parameter estimation procedure
cost: cost function for measuring the "distance" between observed and expected vocabulary size and frequency spectrum. Parameters are estimated by minimizing this cost function (see "Cost Functions" below for a list of built-in cost functions and details on user-defined cost functions).
m.max: number of spectrum elements considered by the cost function (see "Cost Functions" below for more information). If unspecified, the default is automatically adjusted to avoid small spectrum elements that may be mathematically unreliable.
runs: number of parameter optimization runs with random initialization. Parameters from the run that achieves the smallest value of the cost function will be selected. Currently not supported for method="Custom".
method: algorithm used for parameter estimation, by minimizing the value of the cost function (see "Parameter Estimation" below for details, and "Minimization Algorithms" for descriptions of the available algorithms)
exact: if FALSE, certain LNRE models may use approximations to speed up parameter estimation, possibly at the cost of less accurate results
sampling: type of random sampling model to use ("Poisson" or "multinomial")
bootstrap: number of bootstrap samples used to estimate confidence intervals for estimated model parameters. Bootstrapping is disabled by default (bootstrap=0); see lnre.bootstrap for details.
parallel: whether to use parallelisation for the bootstrapping procedure (highly recommended). See lnre.bootstrap for details.
verbose: if TRUE, show progress messages during potentially time-consuming operations such as bootstrapping
...: all further named arguments are interpreted as parameter values for the chosen LNRE model (see the respective manpages for names and descriptions of the model parameters)
Currently, the following LNRE models are supported by the zipfR package:

- the Zipf-Mandelbrot (ZM) LNRE model (see lnre.zm for details)
- the finite Zipf-Mandelbrot (fZM) LNRE model (see lnre.fzm for details)
- the Generalized Inverse Gauss-Poisson (GIGP) LNRE model (see lnre.gigp for details)
If explicit model parameters are specified in addition to an observed frequency spectrum spc, these parameters are fixed to the given values and are excluded from the estimation procedure. This feature can be useful if fully automatic parameter estimation leads to a poor or counterintuitive fit.
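For instance, a minimal sketch of fixing one parameter of a ZM model while the other is estimated from the observed spectrum (the value alpha=.5 is purely illustrative):

data(Dickens.spc)
m.fixed <- lnre("zm", spc=Dickens.spc, alpha=.5)  # alpha held fixed, B estimated
m.fixed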
An object of a suitable subclass of lnre, depending on the type argument (e.g. lnre.fzm for type="fzm"). This object represents a LNRE model of the selected type with the specified parameter values, or with parameter values estimated from the observed frequency spectrum spc.

The internal structure of lnre objects is described on the lnre.details manpage (intended for developers).
Automatic parameter estimation for LNRE models is performed by matching the expected vocabulary size and frequency spectrum of the model against the observed data passed in the spc argument.
For this purpose, a cost function has to be defined as a measure of the "distance" between observed and expected frequency spectrum. Parameters are then estimated by applying a minimization algorithm in order to find those parameter values that lead to the smallest possible cost.
Parameter estimation is a crucial and often also quite critical step in the application of LNRE models. Depending on the shape of the observed frequency spectrum, the automatic estimation procedure may result in a poor and counter-intuitive fit, or may fail altogether.
Usually, multiple runs of the minimization are performed with different random start values. An error will only be reported if all the estimation runs fail. Such multiple runs have not been implemented for the Custom minimization method yet; please specify runs=1 in this case.
Users can influence parameter estimation by choosing from a range of predefined cost functions and from several minimization algorithms, as described in the following sections. Some experimentation with the cost, m.max and method arguments will often help to resolve estimation failures and may result in a considerably better goodness-of-fit.
The following cost functions are available and can be selected with the cost argument. All functions are based on the differences between observed and expected values for the vocabulary size and the first elements of the frequency spectrum ($V_m$ for $m \le m_{\max}$, where $m_{\max}$ is given by the m.max argument):
gof: the multivariate chi-squared statistic used for goodness-of-fit testing (lnre.goodness.of.fit). This cost function corresponds (almost) to maximum-likelihood parameter estimation and is used by default.

chisq: cost function based on a simplified version of the multivariate chi-squared test for goodness-of-fit (assuming independence between the random variables $V_m$).

linear: linear cost function, which sums over the absolute differences between observed and expected values. This cost function puts more weight on fitting the vocabulary size and the first few elements of the frequency spectrum (where absolute differences are much larger than for higher spectrum elements).

smooth.linear: modified version of the linear cost function, which smoothes the kink of the absolute value function at a difference of 0 (since non-differentiable cost functions might be problematic for gradient-based minimization algorithms).

mse: mean squared error cost function, averaging over the squares of differences between observed and expected values. This cost function penalizes large absolute differences more heavily than linear cost (and therefore puts even greater weight on fitting vocabulary size and the first spectrum elements).

exact: this "virtual" cost function attempts to match the observed vocabulary size and first spectrum elements exactly, ignoring differences for all higher spectrum elements. This is achieved by adjusting the value of m.max automatically, depending on the number of free parameters that are estimated (in general, the number of constraints that can be satisfied is the same as the number of free parameters). Having adjusted m.max, the mse cost function is used to determine parameter values, so that the estimation procedure will not fail even if the constraints cannot be matched exactly.
Alternatively, a user-defined cost function can be passed as a function object with signature cost(model, spc, m.max), which compares the LNRE model model against the observed frequency spectrum spc and returns a cost value (i.e. lower cost indicates a better fit). User-defined cost functions are also convenient for setting model parameters based on implicit constraints (such as a desired population diversity $S$). In this case, pass spc=NULL explicitly as a dummy frequency spectrum, skipping the final goodness-of-fit test.
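As an illustration, a minimal sketch of a user-defined cost function that considers only the vocabulary size and the number of hapax legomena (this particular weighting is chosen for demonstration, not as a recommendation):

my.cost <- function (model, spc, m.max) {
  ## absolute deviations of expected from observed V and V1
  abs(EV(model, N(spc)) - V(spc)) + abs(EVm(model, 1, N(spc)) - Vm(spc, 1))
}
data(Dickens.spc)
m <- lnre("zm", spc=Dickens.spc, cost=my.cost)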
Several different minimization algorithms can be used for parameter estimation and are selected with the method argument:
Nelder-Mead: the Nelder-Mead algorithm, implemented by the optim function, performs minimization without using derivatives. Parameter estimation is therefore very robust, while almost as fast and accurate as the NLM method. Nelder-Mead is the default algorithm and is also used internally by most custom minimization procedures (see below).

NLM: a standard Newton-type algorithm for nonlinear minimization, implemented by the nlm function, which makes use of numerical derivatives of the cost function. NLM minimization converges quickly and obtains very precise parameter estimates (for a local minimum of the cost function), but it is not very stable and may cause parameter estimation to fail altogether.

SANN: minimization by simulated annealing, also provided by the optim function. Like Nelder-Mead, this algorithm is very robust because it avoids numerical derivatives, but convergence is extremely slow. In some cases, SANN might produce a better fit than Nelder-Mead (if the latter converges to a suboptimal local minimum).

BFGS: a quasi-Newton method developed by Broyden, Fletcher, Goldfarb and Shanno. This minimization algorithm is efficient, but should be applied with care as it will often overshoot the valid range of parameter values.

Custom: a custom estimation procedure provided for certain types of LNRE model, which may exploit special mathematical properties of the model in order to calculate one or more of the parameter values directly. For example, one parameter of the ZM and fZM models can easily be determined from the constraint $E[V] = V$ (but note that this additional constraint leads to a different fit than is obtained by plain minimization of the cost function!). Custom estimation might also apply special configuration settings to improve convergence of the minimization process, based on knowledge about the valid ranges and "behaviour" of model parameters. If no custom estimation procedure has been implemented for the selected LNRE model, lnre falls back on the Nelder-Mead or NLM algorithm.

See the nlm and optim manpages for more information about the minimization algorithms used and key references.
Detailed descriptions of the different LNRE models provided by zipfR and their parameters can be found on the manpages lnre.zm, lnre.fzm and lnre.gigp.

Useful methods for trained models are lnre.spc, lnre.vgc, EV, EVm, VV and VVm. Suitable implementations of the print and summary methods are also provided (see print.lnre for details), as well as for plotting (see plot.lnre). Note that the methods N, V and Vm can be applied to LNRE models with estimated parameters and return information about the observed frequency spectrum used for parameter estimation.
If bootstrap samples have been generated (bootstrap > 0), confidence intervals for the model parameters can be determined with confint.lnre. See lnre.bootstrap for more information on the bootstrapping procedure and implementation.

The lnre.details manpage gives details about the implementation of LNRE models and the internal structure of lnre objects, while estimate.model has more information on the parameter estimation procedure (both manpages are intended for developers).

See lnre.goodness.of.fit for a complete description of the goodness-of-fit test that is automatically performed after parameter estimation (and which is reported in the summary of the LNRE model). This function can also be used to evaluate the predictions of the LNRE model on a different data set than the one used for parameter estimation.
## load Dickens dataset
data(Dickens.spc)

## estimate parameters of GIGP model and show summary
m <- lnre("gigp", Dickens.spc)
m

## N, V and V1 of spectrum used to compute model
## (should be the same as for Dickens.spc)
N(m)
V(m)
Vm(m, 1)

## expected V and V_m and their variances for arbitrary N
EV(m, 100e6)
VV(m, 100e6)
EVm(m, 1, 100e6)
VVm(m, 1, 100e6)

## use only 10 instead of 15 spectrum elements to estimate model
## (note how fit improves for V and V1)
m.10 <- lnre("gigp", Dickens.spc, m.max=10)
m.10

## experiment with different cost functions
m.mse <- lnre("gigp", Dickens.spc, cost="mse")
m.mse
m.exact <- lnre("gigp", Dickens.spc, cost="exact")
m.exact

## NLM minimization algorithm is faster but less robust
m.nlm <- lnre("gigp", Dickens.spc, method="NLM")
m.nlm

## ZM and fZM LNRE models have special estimation algorithms
m.zm <- lnre("zm", Dickens.spc)
m.zm
m.fzm <- lnre("fzm", Dickens.spc)
m.fzm

## estimation is much faster if approximations are allowed
m.approx <- lnre("fzm", Dickens.spc, exact=FALSE)
m.approx

## specify parameters of LNRE models directly
m <- lnre("zm", alpha=.5, B=.01)
lnre.spc(m, N=1000, m.max=10)
m <- lnre("fzm", alpha=.5, A=1e-6, B=.01)
lnre.spc(m, N=1000, m.max=10)
m <- lnre("gigp", gamma=-.5, B=.01, C=.01)
lnre.spc(m, N=1000, m.max=10)

## bootstrapped confidence intervals for model parameters
## Not run:
model <- lnre("fzm", spc=BrownAdj.spc, bootstrap=40)
confint(model, "alpha")  # Zipf slope
confint(model, "S")      # population diversity
confint(model, "S", method="normal")  # Gaussian approx works well in this case

## speed up with parallelisation (see ?lnre.bootstrap for more information)
model <- lnre("fzm", spc=BrownAdj.spc, bootstrap=40,
              parallel=8)  # on Linux / MacOS with 8 available cores
## End(Not run)
Type density (tdlnre), type distribution (tplnre), type quantiles (tqlnre), probability density (dlnre), distribution function (plnre), quantile function (qlnre), logarithmic type and probability densities (ltdlnre and ldlnre), and random sample generation (rlnre) for LNRE models.
tdlnre(model, x, ...)
tplnre(model, q, lower.tail=FALSE, ...)
tqlnre(model, p, lower.tail=FALSE, ...)

dlnre(model, x, ...)
plnre(model, q, lower.tail=TRUE, ...)
qlnre(model, p, lower.tail=TRUE, ...)

ltdlnre(model, x, base=10, log.x=FALSE, ...)
ldlnre(model, x, base=10, log.x=FALSE, ...)

rlnre(model, n, what=c("tokens", "tfl"), ...)
model: an object belonging to a subclass of lnre, representing an LNRE model
x: vector of type probabilities $\pi$ for which the density functions are evaluated
q: vector of type probability quantiles, i.e. threshold values $\rho$ on the type probability scale
p: vector of tail probabilities
lower.tail: if TRUE, compute lower tails of the distribution functions (see "Details" below for the different default settings)
base: positive number, the base with respect to which the log-transformation is performed (see "Details" below)
log.x: if TRUE, the values in x are interpreted as logarithms (to the base base) of type probabilities
n: size of random sample to generate; if length(n) > 1, the length is taken to be the number required
what: whether to return the sample as a vector of tokens or as a type-frequency list (usually more efficient)
...: further arguments are passed through to the method implementations (currently unused)
Note that the order in which arguments are specified differs from the analogous functions for common statistical distributions in the R standard library. In particular, the LNRE model model always has to be given as the first parameter so that R can dispatch the function call to an appropriate method implementation for the chosen LNRE model.

Some of the functions may not be available for certain types of LNRE models. In particular, no analytical solutions are known for the distribution and quantiles of GIGP models, so the functions tplnre, tqlnre, plnre, qlnre and rlnre (which depends on qlnre and tplnre) are not implemented for objects of class lnre.gigp.
The default tails differ for the distribution function (plnre, qlnre) and the type distribution (tplnre, tqlnre), in order to match the definitions of the distribution function $F(\rho)$ and the type distribution $G(\rho)$. While the distribution function defaults to lower tails (lower.tail=TRUE, corresponding to $F$ and $F^{-1}$), the type distribution defaults to upper tails (lower.tail=FALSE, corresponding to $G$ and $G^{-1}$).
Unlike for standard distributions, logarithmic tail probabilities (log.p=TRUE) are not provided for the LNRE models, since here the focus is usually on the bulk of the distribution rather than on the extreme tails.
The log-transformed density functions returned by ldlnre and ltdlnre, respectively, can be understood as probability and type densities for $\log_{base} \pi$ instead of $\pi$, and are useful for visualization of LNRE populations (with a logarithmic scale for the parameter $\pi$ on the x-axis); see the examples below.
For rlnre, either a factor of length n (what="tokens", the default) or a tfl object (what="tfl"), representing a random sample from the population described by the specified LNRE model. Note that the type-frequency list is a sufficient statistic, i.e. it provides all relevant information from the sample. For large n, type-frequency lists are generated more efficiently and with less memory overhead.
For all other functions, a vector of non-negative numbers of the same length as the second argument (x, p or q).

tdlnre returns the type density $g(\pi)$ for the values of $\pi$ specified in the vector x. tplnre returns the type distribution $G(\rho)$ (default) or its complement (if lower.tail=TRUE), for the values of $\rho$ specified in the vector q. tqlnre returns type quantiles, i.e. the inverse $G^{-1}$ of the type distribution (default) or of its complement (if lower.tail=TRUE), for the type counts specified in the vector p.

dlnre returns the probability density $f(\pi)$ for the values of $\pi$ specified in the vector x. plnre returns the distribution function $F(\rho)$ (default) or its complement (if lower.tail=FALSE), for the values of $\rho$ specified in the vector q. qlnre returns quantiles, i.e. the inverse $F^{-1}$ of the distribution function (default) or of its complement (if lower.tail=FALSE), for the probabilities specified in the vector p.

ldlnre and ltdlnre compute logarithmically transformed versions of the probability and type density functions, respectively, taking logarithms with respect to the base specified in the base argument (default: base=10). See "Details" above for more information.
See lnre for more information about LNRE models and how to initialize them.

Random samples generated with rlnre can be further processed with the functions vec2tfl, vec2spc and vec2vgc (for token vectors) and tfl2spc (for type-frequency lists).
## define ZM and fZM LNRE models
ZM <- lnre("zm", alpha=.8, B=1e-3)
FZM <- lnre("fzm", alpha=.8, A=1e-5, B=.05)

## random samples from the two models
vec2tfl(rlnre(ZM, 10000))
vec2tfl(rlnre(FZM, 10000))
rlnre(FZM, 10000, what="tfl")  # more efficient

## plot logarithmic type density functions
x <- 10^seq(-6, 1, by=.01)  # pi = 10^(-6) .. 10^(-1)
y.zm <- ltdlnre(ZM, x)
y.fzm <- ltdlnre(FZM, x)
plot(x, y.zm, type="l", lwd=2, col="red", log="x", ylim=c(0, 14000))
lines(x, y.fzm, lwd=2, col="blue")
legend("topright", legend=c("ZM", "fZM"), lwd=3, col=c("red", "blue"))

## probability pi_k of k-th type according to FZM model
k <- 10
plnre(FZM, tqlnre(FZM, k-1)) - plnre(FZM, tqlnre(FZM, k))

## number of types with pi >= 1e-6
tplnre(ZM, 1e-6)

## lower tail fails for infinite population size
## Not run:
tplnre(ZM, 1e-3, lower=TRUE)
## End(Not run)

## total probability mass assigned to types with pi <= 1e-6
plnre(ZM, 1e-6)
Posterior distribution over the type probability space of a LNRE model, given the observed frequency $m$ of a type in a sample. Posterior density (postdlnre) and log-transformed density (postldlnre) can be computed for all LNRE models. The distribution function (postplnre) and quantiles (postqlnre) are only available for selected types of models.
postdlnre(model, x, m, N, ...)
postldlnre(model, x, m, N, base=10, log.x=FALSE, ...)

postplnre(model, q, m, N, lower.tail=FALSE, ...)
postqlnre(model, p, m, N, lower.tail=FALSE, ...)
model: an object belonging to a subclass of lnre, representing an LNRE model
m: frequency $m$ of a type in the observed sample
N: sample size $N$ of the observed sample
x: vector of type probabilities $\pi$ for which the posterior density is evaluated
q: vector of type probability quantiles, i.e. threshold values $\rho$ on the type probability scale
p: vector of tail probabilities
base: positive number, the base with respect to which the log-transformation is performed
log.x: if TRUE, the values in x are interpreted as logarithms (to the base base) of type probabilities
lower.tail: if TRUE, compute lower-tail probabilities or quantiles
...: further arguments are passed through to the method implementations (currently unused)
A vector of non-negative numbers of the same length as the second argument (x, p or q).

postdlnre returns the posterior type density for the values of $\pi$ specified in the vector x. postplnre computes the posterior type distribution function (default) or its complement (if lower.tail=TRUE); the underlying definitions are given by Evert (2004, p. 123). postqlnre returns quantiles, i.e. the inverse of the posterior type distribution function.

postldlnre computes a logarithmically transformed version of the posterior type density, taking logarithms with respect to the base specified in the base argument (default: base=10). Such log-transformed densities are useful for visualizing distributions; see ldlnre for more information.
See lnre for more information about LNRE models and how to initialize them, and LNRE for type density and distribution functions (which represent the prior distribution).
## TODO
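Since the packaged examples are still marked TODO, here is a minimal sketch of how the posterior density might be inspected; the ZM parameters and the grid of type probabilities below are arbitrary illustration values.

## posterior density over type probabilities pi for a type observed
## m=1 times in a sample of N=1000 tokens (arbitrary ZM parameters)
ZM <- lnre("zm", alpha=.8, B=.01)
x <- 10^seq(-8, -1, by=.05)           # grid of type probabilities pi
d <- postdlnre(ZM, x, m=1, N=1000)    # posterior type density
ld <- postldlnre(ZM, x, m=1, N=1000)  # log-transformed density (base 10)
plot(x, ld, type="l", log="x")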
This function implements parametric bootstrapping for LNRE models, i.e. it draws a specified number of random samples from the population described by a given lnre
object. For each sample, two callback functions are applied to perform transformations and/or extract statistics. In an important application (bootstrapped confidence intervals for model parameters), the first callback estimates a new LNRE model and the second callback extracts the relevant parameters from this model. See ‘Use Cases’ and ‘Examples’ below for other use cases.
lnre.bootstrap(model, N, ESTIMATOR, STATISTIC,
               replicates=100, sample=c("spc", "tfl", "tokens"),
               simplify=TRUE, verbose=TRUE,
               parallel=1L, seed=NULL, ...)
model |
a trained LNRE model, i.e. an object belonging to a subclass of lnre |
N |
a single positive integer, specifying the size $N$ (number of tokens) of the individual bootstrap samples |
ESTIMATOR |
a callback function, normally used for estimating LNRE models in the bootstrap procedure. It is called once for each bootstrap sample, with the sample as first argument (in the form determined by sample) and any further arguments passed in ...; its return value is passed on to STATISTIC |
STATISTIC |
a callback function, normally used to extract model parameters and other relevant statistics from the bootstrapped LNRE models. It is called once for each bootstrap sample, with the value returned by ESTIMATOR as its only argument |
replicates |
a single positive integer, specifying the number of bootstrap samples to be generated |
sample |
the form in which each sample is passed to ESTIMATOR: a frequency spectrum ("spc", the default), a type-frequency list ("tfl") or a factor vector of tokens ("tokens"). Alternatively, a callback function that will be invoked with arguments model and n and must return a sample of n tokens in the form expected by ESTIMATOR |
simplify |
if TRUE, the values returned by STATISTIC are combined with rbind() into a matrix or data frame; otherwise a list of individual results is returned, with additional metadata attributes (see "Value" below) |
verbose |
if TRUE, display progress information during the bootstrapping procedure |
parallel |
whether to enable parallel processing. Either an integer specifying the number of worker processes to be forked, or a pre-initialised snow cluster created with makeCluster (see "Parallelisation" below) |
seed |
a single integer value used to initialize the RNG in order to generate reproducible results |
... |
any further arguments are passed through to the ESTIMATOR callback |
The parametric bootstrapping procedure works as follows:

1. replicates random samples of N tokens each are drawn from the population described by the LNRE model model (possibly using a callback function provided in the sample argument).

2. Each sample is passed to the callback function ESTIMATOR in the form determined by sample (a frequency spectrum, type-frequency list, or factor vector of tokens). If ESTIMATOR fails, it is re-run with a different sample; otherwise the return value is passed on to STATISTIC. Use ESTIMATOR=identity to pass the original sample through to STATISTIC.

3. The callback function STATISTIC is used to extract relevant information for each sample. If STATISTIC fails, the procedure is repeated from step 2 with a different sample. The callback will typically return a vector of fixed length or a single-row data frame, and the results for all bootstrap samples are combined into a matrix or data frame if simplify=TRUE.
Warning: Keep in mind that sampling a token vector can be slow and consume large amounts of memory for very large N (several million tokens). If possible, use sample="spc" or sample="tfl", which can be generated more efficiently.
Parallelisation

Since bootstrapping is a computationally expensive procedure, it is usually desirable to use parallel processing. lnre.bootstrap supports two types of parallelisation, based on the parallel package:

On Unix platforms, you can set parallel to an integer number in order to fork the specified number of worker processes, utilising multiple cores on the same machine. The detectCores function shows how many cores are available, but due to hyperthreading and memory contention, it is often better to set parallel to a smaller value. Note that forking may be unstable, especially in a GUI environment, as explained on the mcfork manpage.

On all platforms, you can pass a pre-initialised snow cluster in the parallel argument, which consists of worker processes on the same machine or on different machines. A suitable cluster can be created with makeCluster; see the parallel package documentation for further information. It is your responsibility to set up the cluster so that all required data sets, packages and custom functions are available on the worker processes; lnre.bootstrap will only ensure that the zipfR package itself is loaded.

Note that parallel processing is not enabled by default and will only be used if parallel is set accordingly.
If simplify=FALSE, a list of length replicates containing the statistics obtained from each individual bootstrap sample. In addition, the following attributes are set:

N = sample size of the bootstrap replicates
model = the LNRE model from which samples were generated
errors = number of samples for which either the ESTIMATOR or the STATISTIC callback produced an error

If simplify=TRUE, the statistics are combined with rbind(). This is performed unconditionally, so make sure that STATISTIC returns a suitable value for all samples, typically vectors of the same length or single-row data frames with the same columns. The return value is usually a matrix or data frame with replicates rows. No additional attributes are set.
The confint method for LNRE models uses bootstrapping to estimate confidence intervals for the model parameters. For this application, ESTIMATOR=lnre re-estimates the LNRE model from each bootstrap sample. Configuration options such as the model type, cost function, etc. are passed as additional arguments in ..., and the sample must be provided in the form of a frequency spectrum. The return values are successfully estimated LNRE models. STATISTIC extracts the model parameters and other coefficients of interest (such as the population diversity S) from each model and returns them as a named vector or single-row data frame. The results are combined with simplify=TRUE, then empirical confidence intervals are determined for each column. A minimal sketch of this strategy is shown below.
For some of the more complex measures of productivity and lexical richness (see productivity.measures), it is difficult to derive the sampling distribution mathematically. In these cases, an empirical approximation can be obtained by parametric bootstrapping. The most convenient approach is to set ESTIMATOR=productivity.measures, so the desired measures can be passed as an additional argument measures= to lnre.bootstrap. The default sample="spc" is appropriate for most measures and is efficient enough to carry out the procedure for multiple sample sizes. Since the estimator already returns the required statistics for each sample in a suitable format, set STATISTIC=identity and simplify=TRUE (see the sketch below).
Vocabulary growth curves can only be generated from token vectors, so set sample="tokens" and keep N reasonably small. ESTIMATOR=vec2vgc compiles vgc objects for the samples. Pass steps or stepsize as desired, and set m.max if growth curves for spectrum elements $V_m$ are desired.

Either use STATISTIC=identity and simplify=FALSE to return a list of vgc objects, which can be plotted or processed further with sapply(). This strategy is particularly useful if one or more $V_m$ curves are desired in addition to $V$.

Or use STATISTIC=function (x) x$V to extract y-coordinates for the growth curve and combine them into a matrix with simplify=TRUE, so that prediction intervals can be computed directly. Note that the corresponding x-coordinates are not returned and have to be inferred from N and stepsize, as illustrated in the sketch below.
More complex populations and non-random samples can be simulated by providing a user callback function in the sample argument. This callback is invoked with parameters model and n and has to return a sample of size n in the format expected by ESTIMATOR.

For simulating non-randomness, the callback will typically use rlnre to generate a random sample and then apply some transformation.

For simulating mixture distributions, it will typically generate multiple samples from different populations and merge them; the proportion of tokens from each population should be determined by a multinomial random variable. Individual populations might consist of LNRE models, or of a finite number of "lexicalised" types. Note that only a single LNRE model will be passed to the callback; any other parameters have to be injected as bound variables in a local function definition. A sketch of a two-population mixture follows below.
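A hedged illustration of such a callback, mixing the LNRE population with 100 hypothetical lexicalised types; all names, proportions and parameter values below are invented for the example.

## sample callback: mixture of LNRE population and lexicalised types
mix.sample <- function (model, n) {
  n.lex <- rbinom(1, n, .1)  # ca. 10% of tokens from 100 lexicalised types
  lex <- sample(paste0("lex", 1:100), n.lex, replace=TRUE)
  rnd <- as.character(rlnre(model, n - n.lex))
  factor(sample(c(lex, rnd)))  # merge and shuffle into a token vector
}
model <- lnre("zm", alpha=.8, B=.01)
res <- lnre.bootstrap(model, N=1000, replicates=50, sample=mix.sample,
                      ESTIMATOR=vec2spc,
                      STATISTIC=function (x) c(V=V(x), V1=Vm(x, 1)))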
lnre for more information about LNRE models. The high-level estimator function lnre uses lnre.bootstrap to collect data for approximate confidence intervals; lnre.productivity.measures uses it to approximate the sampling distributions of productivity measures.
## parametric bootstrapping from realistic LNRE model
model <- lnre("zm", spc=ItaRi.spc) # has quite a good fit

## estimate distribution of V, V1, V2 for sample size N=1000
res <- lnre.bootstrap(model, N=1000, replicates=200,
                      ESTIMATOR=identity,
                      STATISTIC=function (x) c(V=V(x), V1=Vm(x,1), V2=Vm(x,2)))
bootstrap.confint(res, method="normal")

## compare with theoretical expectations (EV/EVm = center, VV/VVm = spread^2)
lnre.spc(model, 1000, m.max=2, variances=TRUE)

## lnre.bootstrap() also captures and ignores occasional failures
res <- lnre.bootstrap(model, N=1000, replicates=200,
                      ESTIMATOR=function (x) if (runif(1) < .2) stop() else x,
                      STATISTIC=function (x) c(V=V(x), V1=Vm(x,1), V2=Vm(x,2)))

## empirical confidence intervals for vocabulary growth curve
## (this may become expensive because token-level samples have to be generated)
res <- lnre.bootstrap(model, N=1000, replicates=200, sample="tokens",
                      ESTIMATOR=vec2vgc, stepsize=100, # extra args passed to ESTIMATOR
                      STATISTIC=V) # extract vocabulary sizes at equidistant N
bootstrap.confint(res, method="normal")

## parallel processing is highly recommended for expensive bootstrapping
library(parallel)
## adjust number of processes according to available cores on your machine
cl <- makeCluster(2) # PSOCK cluster, should work on all platforms
res <- lnre.bootstrap(model, N=1e4, replicates=200, sample="tokens",
                      ESTIMATOR=vec2vgc, stepsize=1000, STATISTIC=V,
                      parallel=cl) # use cluster for parallelisation
bootstrap.confint(res, method="normal")
stopCluster(cl)

## on MacOS / Linux, simpler fork-based parallelisation also works well
## Not run:
res <- lnre.bootstrap(model, N=1e5, replicates=400, sample="tokens",
                      ESTIMATOR=vec2vgc, stepsize=1e4, STATISTIC=V,
                      parallel=8) # if you have enough cores ...
bootstrap.confint(res, method="normal")
## End(Not run)
This manpage describes technical details of LNRE models and parameter estimation. It is intended for developers who want to implement new LNRE models, improve the parameter estimation algorithms, or work directly with the internals of lnre objects. All information required for standard applications of LNRE models can be found on the lnre manpage.
Most operations on LNRE models (in particular, computation of expected values and variances, distribution function and type distribution, random sampling, etc.) are realized as S3 methods, so they are automatically dispatched to appropriate implementations for the various types of LNRE models (e.g., EV.lnre.zm, EV.lnre.fzm and EV.lnre.gigp for the EV method). For some methods (e.g. estimated variances VV and VVm), a single generic implementation can be used for all model types, provided through the base class (VV.lnre and VVm.lnre for variances).
If you want to implement new LNRE models, have a look at "Implementing LNRE Models" below.
Important note: LNRE model parameters can be passed as named arguments to the lnre constructor function when they are not estimated automatically from an observed frequency spectrum. For this reason, parameter names must be carefully chosen so that they do not clash with other arguments of the lnre function. Note that because of R's argument matching rules, any parameter name that is a prefix of a standard argument name will lead to such a clash. In particular, single-letter parameters (such as $b$ and $c$ for the GIGP model) should always be written in uppercase (B and C in lnre.gigp).
A LNRE model with estimated (or manually specified) parameter values is represented by an object belonging to a suitable subclass of lnre. The specific class depends on the type of LNRE model, as specified in the type argument to the lnre constructor function (e.g. lnre.fzm for a fZM model selected with type="fzm"). All subtypes of lnre objects share the same data format, viz. a list with the following components:
type |
a character string specifying the class of LNRE model, e.g. "fzm" for a finite Zipf-Mandelbrot model |
name |
a character string specifying a human-readable name for the LNRE model, e.g. "finite Zipf-Mandelbrot" |
param |
a list of named model parameters, e.g. list(alpha=.8, B=.01) for a ZM model |
param2 |
a list of "secondary" parameters, i.e. constants that can be determined from the model parameters but are frequently used in the formulae for expected values, variances, etc.; e.g. the normalizing constant $C$ for ZM and fZM models |
S |
population size, i.e. number of types in the population described by the LNRE model (may be Inf, e.g. for a ZM model) |
exact |
whether approximations are allowed when calculating expectations and variances (FALSE) or not (TRUE) |
multinomial |
whether to use equations for multinomial sampling (TRUE) or for the independent Poisson sampling approximation (FALSE) |
spc |
an object of class spc, the observed frequency spectrum from which the model parameters were estimated (only present if the model has been trained on observed data) |
gof |
a data frame with the results of the goodness-of-fit test carried out after parameter estimation (see lnre.goodness.of.fit; only present if the model has been trained on observed data) |
util |
a set of utility functions, given as a list with components update, transform, print and label (see "Implementing LNRE Models" below) |
In order to implement a new class of LNRE models, the following steps are necessary (illustrated on the example of a lognormal type density function, introducing the new LNRE class lnre.lognormal):

1. Provide a constructor function for LNRE models of this type (here, lnre.lognormal), which must accept the parameters of the LNRE model as named arguments with reasonable default values (or alternatively as a list passed in the param argument). The constructor must return a partially initialized object of an appropriate subclass of lnre (lnre.lognormal in our example), and make sure that this object also inherits from the lnre class.

2. Provide the update, transform, print and label utility functions for the LNRE model, which must be returned in the util field of the LNRE model object (see "Value" above).

3. Add the new type of LNRE model to the type argument of the generic lnre constructor, and insert the new constructor function (lnre.lognormal) in the switch call in the body of lnre.

4. As a minimum requirement, implementations of the EV and EVm methods must be provided for the new LNRE model (in our example, they will be named EV.lnre.lognormal and EVm.lnre.lognormal).

5. If possible, provide equations for the type density, probability density, type distribution, distribution function and posterior distribution of the new LNRE model, as implementations of the tdlnre, dlnre, tplnre/tqlnre, plnre/qlnre and postplnre/postqlnre methods for the new LNRE model class. If all these functions are defined, log-scaled densities and random number generation are automatically handled by generic implementations.

6. Optionally, provide a custom function for parameter estimation of the new LNRE model, as an implementation of the estimate.model method (here, estimate.model.lnre.lognormal). Custom parameter estimation can considerably improve convergence and goodness-of-fit if it is possible to obtain direct estimates for one or more of the parameters, e.g. from the condition $E[V] = V$. However, the default Nelder-Mead algorithm is robust and produces satisfactory results, as long as the LNRE model defines an appropriate parameter transformation mapping. It is thus often more profitable to optimize the transform utility than to spend a lot of time implementing a complicated parameter estimation function.
The best way to get started is to take a look at one of the existing implementations of LNRE models. The GIGP model represents a "minimum" implementation (without custom parameter estimation and distribution functions), whereas ZM and fZM provide good examples of custom parameter estimation functions.
User-level information about LNRE models and parameter estimation can be found on the lnre manpage.

Descriptions of the different LNRE models implemented in zipfR and their parameters are given on the separate manpages lnre.zm, lnre.fzm and lnre.gigp. These descriptions are intended for interested end users, but are not required for standard applications of the models.

The estimate.model manpage explains details of the parameter estimation procedure (intended for developers).

See lnre.goodness.of.fit for a description of the goodness-of-fit test performed after parameter estimation of an LNRE model. This function can also be used to evaluate the predictions of the model on a different data set.
The finite Zipf-Mandelbrot (fZM) LNRE model of Evert (2004).
The constructor function lnre.fzm is not user-visible. It is invoked implicitly when lnre is called with LNRE model type "fzm".
lnre.fzm(alpha=.8, A=1e-9, B=.01, param=list())
## user call: lnre("fzm", spc=spc) or lnre("fzm", alpha=.8, A=1e-9, B=.01)
alpha |
the shape parameter $\alpha$, a number in the range $0 < \alpha < 1$ |
A |
the lower cutoff parameter $A$, a positive number with $A < B$ |
B |
the upper cutoff parameter $B$, a positive number with $B > A$ |
param |
a list of parameters given as name-value pairs (alternative method of parameter specification) |
The parameters of the fZM model can either be specified as immediate arguments:
lnre.fzm(alpha=.5, A=5e-12, B=.1)
or as a list of name-value pairs:
lnre.fzm(param=list(alpha=.5, A=5e-12, B=.1))
which is usually more convenient when the constructor is invoked by another function (such as lnre). If both immediate arguments and the param list are given, the immediate arguments override conflicting values in param. For any parameters that are neither specified as immediate arguments nor listed in param, the defaults from the function prototype are inserted.
The lnre.fzm constructor also checks the types and ranges of parameter values and aborts with an error message if an invalid parameter is detected.

NB: parameter estimation is faster and more robust for the inexact fZM model, so you might consider passing the exact=FALSE option to lnre unless you intend to make predictions for small sample sizes and/or high spectrum elements $V_m$ with the model (see the example below).
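For instance (a sketch; ItaUltra.spc is the example data set also used with plot.spc further down):

## faster, approximate parameter estimation for the fZM model
data(ItaUltra.spc)
fzm <- lnre("fzm", spc=ItaUltra.spc, exact=FALSE)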
A partially initialized object of class lnre.fzm, which is completed and passed back to the user by the lnre function. See lnre for a detailed description of lnre.fzm objects (as a subclass of lnre).
Similar to ZM, the fZM model is a LNRE re-formulation of the Zipf-Mandelbrot law for a population with a finite vocabulary size $S$, i.e.

$$\pi_k = \frac{C}{(k + b)^a}$$

for $k = 1, \ldots, S$. The parameters of the Zipf-Mandelbrot law are $a > 1$, $b \ge 0$ and $S$ (see also Baayen 2001, 101ff). The fZM model is given by the type density function

$$g(\pi) := C \cdot \pi^{-\alpha - 1}$$

for $A \le \pi \le B$ (and $g(\pi) = 0$ otherwise), and has three parameters $0 < \alpha < 1$ and $0 < A < B$. The normalizing constant is

$$C = \frac{1 - \alpha}{B^{1 - \alpha} - A^{1 - \alpha}}$$

and the population vocabulary size is

$$S = \frac{1 - \alpha}{\alpha} \cdot \frac{A^{-\alpha} - B^{-\alpha}}{B^{1 - \alpha} - A^{1 - \alpha}}$$

See Evert (2004) and the lnre.zm manpage for further details.
Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer, Dordrecht.
Evert, Stefan (2004). A simple LNRE model for random character sequences. Proceedings of JADT 2004, 411-422.
lnre for pointers to relevant methods and functions for objects of class lnre, as well as a complete listing of LNRE models implemented in the zipfR library.
The Generalized Inverse Gauss-Poisson (GIGP) LNRE model of Sichel (1971).
The constructor function lnre.gigp is not user-visible. It is invoked implicitly when lnre is called with LNRE model type "gigp".
lnre.gigp(gamma=-.5, B=.01, C=.01, param=list())
## user call: lnre("gigp", spc=spc) or lnre("gigp", gamma=-.5, B=.01, C=.01)
gamma |
the shape parameter $\gamma$, a negative number in the range $-1 < \gamma < 0$ |
B |
the low-frequency decay parameter $b$, a non-negative number |
C |
the high-frequency decay parameter $c$, a non-negative number |
param |
a list of parameters given as name-value pairs (alternative method of parameter specification) |
The parameters of the GIGP model can either be specified as immediate arguments:
lnre.gigp(gamma=-.47, B=.001, C=.001)
or as a list of name-value pairs:
lnre.gigp(param=list(gamma=-.47, B=.001, C=.001))
which is usually more convenient when the constructor is invoked by another function (such as lnre). If both immediate arguments and the param list are given, the immediate arguments override conflicting values in param. For any parameters that are neither specified as immediate arguments nor listed in param, the defaults from the function prototype are inserted.
The lnre.gigp constructor also checks the types and ranges of parameter values and aborts with an error message if an invalid parameter is detected.

Notice that the implementation of GIGP leads to numerical problems when estimating the expected frequency of high spectrum elements (you might start worrying if you need to go much above $m = 100$).

Note that the parameters $b$ and $c$ are normally written in lowercase (e.g. Baayen 2001). For technical reasons, it was necessary to use the uppercase letters B and C in this implementation.
A partially initialized object of class lnre.gigp, which is completed and passed back to the user by the lnre function. See lnre for a detailed description of lnre.gigp objects (as a subclass of lnre).
Despite its fancy name, the Generalized Inverse Gauss-Poisson or GIGP model belongs to the same class of LNRE models as ZM and fZM. This class of models is characterized by a power-law in the type density function and derives from the Zipf-Mandelbrot law (see lnre.zm for details on the relationship between power-law LNRE models and the Zipf-Mandelbrot law). The GIGP model is given by the type density function

$$g(\pi) := C \cdot \pi^{\gamma - 1} \cdot e^{-\frac{\pi}{c} - \frac{b^2 c}{4\pi}}$$

with parameters $-1 < \gamma < 0$ and $b, c \ge 0$. The normalizing constant is

$$C = \frac{(2 / (b c))^{\gamma + 1}}{2 K_{\gamma + 1}(b)}$$

(where $K_\nu$ denotes the modified Bessel function of the second kind) and the population vocabulary size is

$$S = \frac{2 K_{\gamma}(b)}{b c \, K_{\gamma + 1}(b)}$$

Note that the "shape" parameter $\gamma$ corresponds to $-\alpha$ in the ZM and fZM models. The GIGP model was introduced by Sichel (1971). See Baayen (2001, 89-93) for further details.
Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer, Dordrecht.
Sichel, H. S. (1971). On a family of discrete distributions particularly suited to represent long-tailed frequency data. Proceedings of the Third Symposium on Mathematical Statistics, 51-97.
lnre for pointers to relevant methods and functions for objects of class lnre, as well as a complete listing of LNRE models implemented in the zipfR library.
This function measures the goodness-of-fit of a LNRE model compared to an observed frequency spectrum, using a multivariate chi-squared test (Baayen 2001, p. 119ff).
lnre.goodness.of.fit(model, spc, n.estimated=0, m.max=15)
model |
an LNRE model object, belonging to a suitable subclass of lnre |
spc |
an observed frequency spectrum, i.e. an object of class spc |
n.estimated |
number of parameters of the LNRE model that have been estimated on spc (these are subtracted from the degrees of freedom of the chi-squared statistic) |
m.max |
number of spectrum elements that will be used to compute the chi-squared statistic. The default value of 15 is also used by Baayen (2001). For small samples, it may be sensible to use fewer spectrum elements, e.g. by setting m.max=10. |
By default, the number of spectrum elements included in the
calculation of the chi-squared statistic may be reduced automatically
in order to ensure that it is not dominated by the sampling error of
spectrum elements with very small expected frequencies (which are
scaled up due to the small variance of these random variables). As an
ad-hoc rule of thumb, spectrum elements with variance less
than 5 are excluded, since the normal approximation to their discrete
distribution is likely to be inaccurate in this case.
Automatic reduction is disabled when the parameter m.max is specified explicitly (use m.max=15 to disable automatic reduction without changing the default value).
A data frame with one row and the following variables:
X2 |
value of the multivariate chi-squared statistic $X^2$ |
df |
number of degrees of freedom of $X^2$, adjusted for the number of estimated parameters |
p |
p-value corresponding to $X^2$ |
Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer, Dordrecht.
lnre for more information about LNRE models
## load spectrum of first 100k Brown tokens
data(Brown100k.spc)

## use this spectrum to compute zm and gigp models
zm <- lnre("zm", Brown100k.spc)
gigp <- lnre("gigp", Brown100k.spc)

## lnre.goodness.of.fit with appropriate n.estimated value produces the
## same multivariate chi-squared test that is reported in a model summary
## compare:
zm
lnre.goodness.of.fit(zm, Brown100k.spc, n.estimated=2)
gigp
lnre.goodness.of.fit(gigp, Brown100k.spc, n.estimated=3)

## goodness of fit of the 100k models calculated on the whole Brown
## spectrum (although this is a superset of the 100k spectrum, let's
## pretend it is an independent spectrum, and set n.estimated to 0)
data(Brown.spc)
lnre.goodness.of.fit(zm, Brown.spc, n.estimated=0)
lnre.goodness.of.fit(gigp, Brown.spc, n.estimated=0)
Compute expectations of various measures of productivity and lexical richness for a LNRE population.
lnre.productivity.measures(model, N=NULL, measures, data.frame=TRUE,
                           bootstrap=FALSE, method="normal", conf.level=.95,
                           sample=NULL, replicates=1000, parallel=1L,
                           verbose=TRUE, seed=NULL)
model |
an object belonging to a subclass of lnre, representing an LNRE model |
measures |
character vector naming the productivity measures to be computed (see productivity.measures for details); if unspecified, all supported measures are computed |
N |
an integer vector, specifying the sample size(s) $N$ for which the productivity measures will be computed (with the default N=NULL, the sample size on which the model was trained is used). If bootstrap=TRUE, only a single sample size may be specified. |
data.frame |
if TRUE (the default), the result is returned as a data frame; otherwise as a numeric matrix or vector |
bootstrap |
if TRUE, the sampling distribution of the productivity measures is estimated by parametric bootstrapping rather than by the approximations described below |
method , conf.level |
type of confidence interval to be estimated by parametric bootstrapping and the requested confidence level; see bootstrap.confint for details |
sample |
optional callback function to generate bootstrapping samples; see lnre.bootstrap for details |
replicates , parallel , seed , verbose |
if bootstrap=TRUE, these arguments are passed on to lnre.bootstrap (see there for details) |
If bootstrap=FALSE, expected values of the productivity measures are computed based on the following approximations:

V, TTR, R and P are linear transformations of $V$ or $V_1$, so expectations can be obtained directly from the EV and EVm methods.

C, k, U and W are nonlinear transformations of $V$. In this case, the transformation function is approximated by a linear function around $E[V]$, which is reasonable under typical circumstances.

Hapax, S, alpha2 and H are based on ratios of two spectrum elements, in some cases with an additional nonlinear transformation. Expectations are based on normal approximations for the two spectrum elements involved, together with a generalisation of Díaz-Francés and Rubio's (2013: 313) result on the ratio of two independent normal distributions; for a nonlinear transformation, the same linear approximation is made as above.

K and D are (nearly) unbiased estimators of the corresponding population coefficient (Simpson 1949: 688).

Approximations used for expected values are explained in detail in Sec. 2.2 of the technical report Inside zipfR.
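For example (a sketch with arbitrary ZM parameters; the bootstrap call may take a few seconds):

## approximate expectations vs. bootstrapped sampling distribution
model <- lnre("zm", alpha=.5, B=.05)
lnre.productivity.measures(model, N=1000, measures=c("V", "TTR", "P"))
lnre.productivity.measures(model, N=1000, measures=c("V", "TTR", "P"),
                           bootstrap=TRUE, replicates=100)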
If bootstrap=FALSE, a numeric matrix or data frame listing approximate expectations of the selected productivity measures, with one row for each sample size $N$ and one column for each measure. Rows and columns are labelled.

If bootstrap=TRUE, a numeric matrix or data frame with one column for each productivity measure and four rows giving the lower and upper bound of the confidence interval, an estimate of central tendency, and an estimate of spread. See bootstrap.confint for details.

See productivity.measures for a list of supported measures with equations and references. The measures Entropy and eta are only supported for bootstrap=TRUE.
Díaz-Francés, Eloísa and Rubio, Francisco J. (2013). On the existence of a normal approximation to the distribution of the ratio of two independent normal random variables. Statistical Papers, 54(2), 309–323.
Simpson, E. H. (1949). Measurement of diversity. Nature, 163, 688.
productivity.measures computes productivity measures from observed data sets. See lnre for further information on LNRE models, and lnre.bootstrap and bootstrap.confint for details on the bootstrapping procedure.
## plausible model for an author's vocabulary
model <- lnre("fzm", alpha=0.4, B=0.06, A=1e-12)

## approximate expectation for different sample sizes
lnre.productivity.measures(model, N=c(1000, 10000, 50000))

## estimate sampling distribution: 95% interval, mean, s.d.
## (using parametric bootstrapping, only one sample size at a time)
lnre.productivity.measures(model, N=1000, bootstrap=TRUE)
lnre.spc computes the expected frequency spectrum of a LNRE model at specified sample size N, returning an object of class spc. Since almost all expected spectrum elements are non-zero, only an incomplete spectrum can be generated.
lnre.spc(model, N=NULL, variances=FALSE, m.max=100)
model |
an object belonging to a subclass of lnre, representing an LNRE model |
N |
a single positive integer, specifying the sample size $N$ for which the expected frequency spectrum is calculated (defaults to the sample size on which the model was trained) |
variances |
if TRUE, include variance estimates for the expected spectrum elements in the returned object |
m.max |
number of spectrum elements listed in the frequency spectrum. The default of 100 is chosen to avoid numerical problems that certain LNRE models (in particular, GIGP) have for higher $m$. |
An object of class spc, representing the incomplete expected frequency spectrum of the LNRE model model at sample size N. If variances=TRUE, the spectrum also includes variance data.
spc for more information about frequency spectra and links to relevant functions; lnre for more information about LNRE models and how to initialize them
## load Dickens dataset and compute lnre models
data(Dickens.spc)
zm <- lnre("zm", Dickens.spc)
fzm <- lnre("fzm", Dickens.spc, exact=FALSE)
gigp <- lnre("gigp", Dickens.spc)

## calculate the corresponding expected frequency spectra
## at the Dickens size
zm.spc <- lnre.spc(zm, N(Dickens.spc))
fzm.spc <- lnre.spc(fzm, N(Dickens.spc))
gigp.spc <- lnre.spc(gigp, N(Dickens.spc))

## comparative plot
plot(Dickens.spc, zm.spc, fzm.spc, gigp.spc, m.max=10)

## expected spectra at N=1e+8 and comparative plot
zm.spc <- lnre.spc(zm, 1e+8)
fzm.spc <- lnre.spc(fzm, 1e+8)
gigp.spc <- lnre.spc(gigp, 1e+8)
plot(zm.spc, fzm.spc, gigp.spc, m.max=10)

## with variances
zm.spc <- lnre.spc(zm, 1e+8, variances=TRUE)
head(zm.spc)

## asking for more than 100 spectrum elements (increasing m.max will
## eventually lead to an error, at different thresholds for the
## different models)
zm.spc <- lnre.spc(zm, 1e+8, m.max=1000)
fzm.spc <- lnre.spc(fzm, 1e+8, m.max=1000)
gigp.spc <- lnre.spc(gigp, 1e+8, m.max=100) ## gigp breaks first!
lnre.vgc computes expected vocabulary growth curves according to a LNRE model, returning an object of class vgc. Data points are returned for the specified sample sizes $N$, optionally including estimated variances and/or growth curves for the spectrum elements $V_m$.
lnre.vgc(model, N, m.max=0, variances=FALSE)
model |
an object belonging to a subclass of lnre, representing an LNRE model |
N |
an increasing sequence of non-negative integers, specifying the sample sizes $N$ for which expected values are calculated |
m.max |
if specified, include vocabulary growth curves $V_m(N)$ for spectrum elements up to m.max (a single integer in the range 1 ... 9) |
variances |
if TRUE, include variance estimates for all growth curves |
An object of class vgc, representing the expected vocabulary growth curve of the LNRE model model, with data points at the sample sizes N.

If m.max is specified, expected growth curves $V_m(N)$ for spectrum elements (hapax legomena, dis legomena, etc.) up to m.max are also computed.

If variances=TRUE, the vgc object includes variance data for all growth curves.
vgc for more information about vocabulary growth curves and links to relevant functions; lnre for more information about LNRE models and how to initialize them
## load Dickens dataset and estimate lnre models
data(Dickens.spc)
zm <- lnre("zm", Dickens.spc)
fzm <- lnre("fzm", Dickens.spc, exact=FALSE)
gigp <- lnre("gigp", Dickens.spc)

## compute expected V and V_1 growth up to 100 million tokens
## in 100 steps of 1 million tokens
zm.vgc <- lnre.vgc(zm, (1:100)*1e6, m.max=1)
fzm.vgc <- lnre.vgc(fzm, (1:100)*1e6, m.max=1)
gigp.vgc <- lnre.vgc(gigp, (1:100)*1e6, m.max=1)

## compare
plot(zm.vgc, fzm.vgc, gigp.vgc, add.m=1, legend=c("ZM", "fZM", "GIGP"))

## load Italian ultra- prefix data
data(ItaUltra.spc)

## compute zm model
zm <- lnre("zm", ItaUltra.spc)

## compute vgc up to about twice the sample size, with variance of V
zm.vgc <- lnre.vgc(zm, (1:100)*70, variances=TRUE)

## plot with confidence intervals derived from variance in vgc
## (with larger datasets, ci will typically be almost invisible)
plot(zm.vgc)
The Zipf-Mandelbrot (ZM) LNRE model of Evert (2004).
The constructor function lnre.zm is not user-visible. It is invoked implicitly when lnre is called with LNRE model type "zm".
lnre.zm(alpha=.8, B=.01, param=list())
## user call: lnre("zm", spc=spc) or lnre("zm", alpha=.8, B=.1)
alpha |
the shape parameter $\alpha$, a number in the range $0 < \alpha < 1$ |
B |
the upper cutoff parameter $B$, a positive number |
param |
a list of parameters given as name-value pairs (alternative method of parameter specification) |
The parameters of the ZM model can either be specified as immediate arguments:
lnre.zm(alpha=.5, B=.1)
or as a list of name-value pairs:
lnre.zm(param=list(alpha=.5, B=.1))
which is usually more convenient when the constructor is invoked by another function (such as lnre). If both immediate arguments and the param list are given, the immediate arguments override conflicting values in param. For any parameters that are neither specified as immediate arguments nor listed in param, the defaults from the function prototype are inserted.
The lnre.zm
constructor also checks the types and ranges of
parameter values and aborts with an error message if an invalid
parameter is detected.
A partially initialized object of class lnre.zm, which is completed and passed back to the user by the lnre function. See lnre for a detailed description of lnre.zm objects (as a subclass of lnre).
The ZM model is a re-formulation of the Zipf-Mandelbrot law

$$\pi_k = \frac{C}{(k + b)^a}$$

with parameters $a > 1$ and $b \ge 0$ (see also Baayen 2001, 101ff) as a LNRE model. It is given by the type density function

$$g(\pi) := C \cdot \pi^{-\alpha - 1}$$

for $0 \le \pi \le B$ (and $g(\pi) = 0$ otherwise), with the parameters $0 < \alpha < 1$ and $B > 0$. The normalizing constant is

$$C = \frac{1 - \alpha}{B^{1 - \alpha}}$$

and the population vocabulary size is $S = \infty$. The parameters of the ZM model are related to those of the original Zipf-Mandelbrot law by $a = 1/\alpha$ and $b = (1 - \alpha) / (B \cdot \alpha)$. See Evert (2004) for further details.
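As a quick sanity check of these formulae (a sketch; tdlnre gives the type density $g(\pi)$, and the integral of $\pi \cdot g(\pi)$ over $[0, B]$ should be 1):

## numerical check of the ZM normalization
ZM <- lnre("zm", alpha=.8, B=.01)
integrate(function (p) p * tdlnre(ZM, p), lower=0, upper=.01)  # approx. 1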
Baayen, R. Harald (2001). Word Frequency Distributions. Kluwer, Dordrecht.
Evert, Stefan (2004). A simple LNRE model for random character sequences. Proceedings of JADT 2004, 411-422.
lnre for pointers to relevant methods and functions for objects of class lnre, as well as a complete listing of LNRE models implemented in the zipfR library.
Merge two or more type frequency lists. Types from the individual lists are pooled and frequencies of types occurring in multiple lists are aggregated.
## S3 method for class 'tfl'
merge(x, y, ...)
x , y |
type frequency lists (i.e. objects of class tfl) |
... |
optional further type frequency lists to be merged |
All type frequency lists to be merged must contain type labels, and none of them may be incomplete.
tfl for more information about type frequency lists.
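A minimal usage sketch (the toy token vectors are invented for illustration):

## merge two type frequency lists; shared types have their frequencies added
tfl1 <- vec2tfl(c("a", "b", "a", "c"))
tfl2 <- vec2tfl(c("b", "c", "c", "d"))
merge(tfl1, tfl2)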
N, V and Vm are generic methods that can (and should) be used to access observed frequency data for objects of class tfl, spc, vgc and lnre. The precise behaviour of the functions depends on the class of the object, but in general N returns the sample size, V the vocabulary size, and Vm one or more selected elements of the frequency spectrum.
N(obj, ...)
V(obj, ...)
Vm(obj, m, ...)
obj |
an object of class tfl, spc, vgc or lnre |
m |
positive integer value determining the frequency class $m$ to be returned (or a vector of such values) |
... |
additional arguments passed on to the method implementation (see respective manpages for details) |
For tfl
and vgc
objects, the Vm
method allows
only a single value m
to be specified.
For a frequency spectrum (class spc
), N
returns the
sample size, V
returns the vocabulary size, and Vm
returns individual spectrum elements.
For a type frequency list (class tfl
), N
returns the
sample size and V
returns the vocabulary size corresponding to
the list. Vm
returns a single spectrum element from the
corresponding frequency spectrum, and may only be called with a single
value m
.
For a vocabulary growth curve (class vgc
), N
returns the
vector of sample sizes and V
the vector of vocabulary sizes.
Vm
may only be called with a single value m
and returns
the corresponding vector from the vgc
object (if present).
For a LNRE model (class lnre
) estimated from an observed
frequency spectrum, the methods N
, V
and Vm
return information about this frequency spectrum.
For details on the implementations of these methods, see N.tfl, N.spc, N.vgc, etc. When applied to an LNRE model, the methods return information about the observed frequency spectrum from which the model was estimated, so the manpages for N.spc are relevant in this case.
Expected vocabulary size and frequency spectrum for a sample of size $N$ according to a LNRE model can be computed with the analogous methods EV and EVm. The corresponding variances are obtained with the VV and VVm methods, which can also be applied to expected or interpolated frequency spectra and vocabulary growth curves.
## load Brown spc and tfl
data(Brown.spc)
data(Brown.tfl)

## you can extract N, V and Vm (for a specific m) from either structure
N(Brown.spc)
N(Brown.tfl)
V(Brown.spc)
V(Brown.tfl)
Vm(Brown.spc, 1)
Vm(Brown.tfl, 1)

## you can extract the same info also from a lnre model estimated
## from these data (NB: these are the observed quantities; for the
## expected values predicted by the model use EV and EVm instead!)
model <- lnre("gigp", Brown.spc)
N(model)
V(model)
Vm(model, 1)

## Baayen's P:
Vm(Brown.spc, 1) / N(Brown.spc)

## when input is a spectrum (and only then) you can specify a vector
## of m's; e.g., to obtain class sizes of first 5 spectrum elements
## you can write:
Vm(Brown.spc, 1:5)

## the Brown vgc
data(Brown.emp.vgc)

## with a vgc as input, N, V and Vm return vectors of the respective
## values for each sample size listed in the vgc
Ns <- N(Brown.emp.vgc)
Vs <- V(Brown.emp.vgc)
V1s <- Vm(Brown.emp.vgc, 1)
head(Ns)
head(Vs)
head(V1s)

## since the last sample size in Brown.emp.vgc corresponds to the full
## Brown, the last elements of the Ns, Vs and V1s vectors are the same
## as the quantities extracted from the spectrum and tfl:
Ns[length(Ns)]
Vs[length(Vs)]
V1s[length(V1s)]
Return the sample size (N.spc), vocabulary size (V.spc) and class sizes (Vm.spc) of the frequency spectrum represented by a spc object. For an expected spectrum with variance information, VV.spc returns the variance of the expected spectrum size and VVm.spc the variances of individual spectrum elements.

Note that these functions are not user-visible. They can be called implicitly through the generic methods N, V, Vm, VV and VVm, applied to an object of type spc.
## S3 method for class 'spc'
N(obj, ...)
## S3 method for class 'spc'
V(obj, ...)
## S3 method for class 'spc'
Vm(obj, m, ...)
## S3 method for class 'spc'
VV(obj, N=NA, ...)
## S3 method for class 'spc'
VVm(obj, m, N=NA, ...)
obj |
an object of class spc, representing an observed or expected frequency spectrum |
m |
positive integer value determining the frequency class $m$ to be returned (or a vector of such values) |
N |
not applicable (this argument of the generic method is not used by the implementation for spc objects) |
... |
additional arguments passed on from generic method will be ignored |
VV.spc and VVm.spc will fail if the object obj is not an expected frequency spectrum with variance data.

For an incomplete frequency spectrum, Vm.spc (and VVm.spc) will return NA for all spectrum elements that are not listed in the object (i.e. for m > m.max).
N.spc returns the sample size $N$, V.spc returns the vocabulary size $V$ (or expected vocabulary size $E[V]$), and Vm.spc returns a vector of class sizes $V_m$ (or the expected spectrum elements $E[V_m]$).

For an expected spectrum with variances, VV.spc returns the variance $\mathrm{Var}[V]$ of the expected vocabulary size, and VVm.spc returns variances $\mathrm{Var}[V_m]$ of the spectrum elements.
N, V, Vm, VV, VVm for the generic methods and links to other implementations

spc for details on frequency spectrum objects and links to other relevant functions
Return the sample size (N.tfl) and vocabulary size (V.tfl) of the type frequency list represented by a tfl object, as well as class sizes (Vm.tfl) of the corresponding frequency spectrum.

Note that these functions are not user-visible. They can be called implicitly through the generic methods N, V and Vm, applied to an object of type tfl.
## S3 method for class 'tfl'
N(obj, ...)
## S3 method for class 'tfl'
V(obj, ...)
## S3 method for class 'tfl'
Vm(obj, m, ...)
obj |
an object of class tfl, representing a type frequency list |
m |
non-negative integer value determining the frequency class $m$ to be returned |
... |
additional arguments passed on from generic method will be ignored |
Only a single value is allowed for $m$, which may also be 0. In order to obtain multiple class sizes $V_m$, convert the type frequency list to a frequency spectrum with tfl2spc first.

For an incomplete type frequency list, Vm.tfl will return NA if m is outside the range of listed frequencies (i.e. for m < f.min or m > f.max).
N.tfl returns the sample size $N$, V.tfl returns the vocabulary size $V$ (or expected vocabulary size $E[V]$), and Vm.tfl returns the number of types that occur exactly $m$ times in the sample, i.e. the class size $V_m$.
N, V, Vm for the generic methods and links to other implementations

tfl for details on type frequency list objects and links to other relevant functions
Return the vector of sample sizes (N.vgc), vocabulary sizes (V.vgc) or class sizes (Vm.vgc) from the vocabulary growth curve (VGC) represented by a vgc object. For an expected or interpolated VGC with variance information, VV.vgc returns the vector of variances of the vocabulary size and VVm.vgc the variance vectors for individual spectrum elements.

Note that these functions are not user-visible. They can be called implicitly through the generic methods N, V, Vm, VV and VVm, applied to an object of type vgc.
## S3 method for class 'vgc'
N(obj, ...)
## S3 method for class 'vgc'
V(obj, ...)
## S3 method for class 'vgc'
Vm(obj, m, ...)
## S3 method for class 'vgc'
VV(obj, N=NA, ...)
## S3 method for class 'vgc'
VVm(obj, m, N=NA, ...)
obj |
an object of class vgc, representing a vocabulary growth curve |
m |
positive integer value determining the frequency class $m$ to be returned |
N |
not applicable (this argument of the generic method is not used by the implementation for vgc objects) |
... |
additional arguments passed on from generic method will be ignored |
VV.vgc and VVm.vgc will fail if the object obj does not include variance data. Vm.vgc and VVm.vgc will fail if the selected frequency class is not included in the VGC data.
N.vgc returns the vector of sample sizes $N$, V.vgc returns the corresponding vocabulary sizes $V(N)$ (or expected vocabulary sizes $E[V(N)]$), and Vm.vgc returns the vector of class sizes $V_m(N)$ (or expected spectrum elements $E[V_m(N)]$) for the selected frequency class $m$.

For an expected or interpolated VGC with variance information, VV.vgc returns the vector of variances $\mathrm{Var}[V(N)]$ of the expected vocabulary size, and VVm.vgc returns the vector of variances $\mathrm{Var}[V_m(N)]$ for the selected frequency class $m$.

Except for N.vgc, the vector returned will be labelled with the corresponding sample sizes.
N, V, Vm, VV, VVm for the generic methods and links to other implementations

vgc for details on vocabulary growth curve objects and links to other relevant functions
Visualisation of LNRE population distribution, showing either the (log-transformed) type or probability density function or the cumulative probability distribution function.
## S3 method for class 'lnre'
plot(x, y, ...,
     type=c("types", "probability", "cumulative"),
     xlim=c(1e-9, 1), ylim=NULL, steps=200,
     xlab=NULL, ylab=NULL, legend=NULL, grid=FALSE,
     main="LNRE Population Distribution",
     lty=NULL, lwd=NULL, col=NULL, bw=zipfR.par("bw"))
x , y , ... |
one or more objects of class lnre, or a single list of such objects in x |
type |
what type of plot should be drawn; see "Details" below |
xlim , ylim |
visible range on x- and y-axis. The default ylim is determined automatically to fit the selected curves into the plot |
steps |
number of steps for drawing curves (increase for extra smoothness) |
xlab , ylab |
labels for the x-axis and y-axis (with suitable defaults depending on type) |
legend |
optional vector of character strings or expressions specifying labels for a legend box, which will be drawn in the upper right-hand or left-hand corner of the screen. If legend=TRUE, suitable labels are generated automatically |
grid |
whether to display a suitable grid in the background of the plot |
main |
a character string or expression specifying a main title for the plot |
lty , lwd , col |
style vectors that can be used to override the global styles defined by zipfR.par |
bw |
if TRUE, draw the plot in B/W style (default is the global zipfR.par setting) |
There are three useful ways of visualising a LNRE population distribution, selected with the type argument:

types: A plot of the type density function $g(\pi)$ over the type probability $\pi$ on a log-transformed scale, so that the number of types corresponds to an integral over $\log_{10} \pi$ (see ltdlnre). The log transformation is essential so that the density function remains in a reasonable range; a logarithmic y-axis would be very counter-intuitive. Note that density values correspond to the number of types per order of magnitude on the x-axis.

probability: A plot of the probability density function over the type probability $\pi$ on a log-transformed scale, so that probability mass corresponds to an integral over $\log_{10} \pi$ (see ldlnre). Note that density values correspond to the total probability mass of types across one order of magnitude on the x-axis.

cumulative: A plot of the cumulative probability distribution, i.e. the distribution function showing the total probability mass of types up to a given type probability. The x-axis shows the type probability on a logarithmic scale (but is labelled with the untransformed probabilities by default). No special transformation of the y-axis is required because cumulative probabilities are invariant under a change of scale on the x-axis.
Line styles are defined globally through zipfR.par, but can be overridden with the optional parameters lty, lwd and col. In most cases, it is more advisable to change the global settings temporarily for a sequence of plots, though.

The bw parameter is used to switch between B/W and colour modes. It can also be set globally with zipfR.par.

Other standard graphics parameters (such as cex or mar) cannot be passed to the plot function and need to be set up with par in advance.
lnre, ltdlnre, plnre, zipfR.par, zipfR.plotutils

plot.tfl offers a different visualisation of the LNRE population distribution, in the form of a Zipf-Mandelbrot law rather than a type density.
## visualise three LNRE models trained on same data
m1 <- lnre("zm", Dickens.spc)
m2 <- lnre("fzm", Dickens.spc)
m3 <- lnre("gigp", Dickens.spc)

plot(m1, m2, m3, type="types", xlim=c(1e-8, 1e-2), ylim=c(0, 7.5e4), legend=TRUE)
plot(m1, m2, m3, type="probability", xlim=c(1e-8, 1e-2), grid=TRUE, legend=TRUE)

## cumulative probability distribution is not available for GIGP
plot(m1, m2, type="cumulative", grid=TRUE, xlim=c(1e-8, 1e-2),
     legend=c("ZM", "fZM"))

## first argument can also be a list of models with explicit call
models <- lapply(seq(.1, .9, .2), function (x) lnre("zm", alpha=x, B=.1))
plot.lnre(models, type="cum", grid=TRUE, legend=TRUE)
plot.lnre(models, type="prob", grid=TRUE, legend=TRUE)
Plot a word frequency spectrum, or a comparison of several word frequency spectra, either as a side-by-side barplot or as points and lines on various logarithmic scales.
## S3 method for class 'spc'
plot(x, y, ...,
     m.max=if (log=="") 15 else 50, log="", conf.level=.95,
     bw=zipfR.par("bw"), points=TRUE,
     xlim=NULL, ylim=NULL, xlab="m", ylab="V_m",
     legend=NULL, main="Frequency Spectrum",
     barcol=NULL, pch=NULL, lty=NULL, lwd=NULL, col=NULL)
x , y , ... |
one or more objects of class spc, or a single list of such objects in x |
m.max |
number of frequency classes that will be shown in the plot. The default is 15 on linear scale and 50 when using any type of logarithmic scale. |
log |
a character string specifying the axis or axes for which logarithmic scale is to be used ("x", "y" or "xy"); logarithmic scaling is disabled by default |
conf.level |
confidence level for confidence intervals in logarithmic plots (see "Details" below). The default value of .95 produces 95%-confidence intervals. |
bw |
if TRUE, draw the plot in B/W style (default is the global zipfR.par setting) |
points |
if TRUE (the default), logarithmic plots are drawn as overplotted points and lines; otherwise as lines only |
xlim , ylim |
visible range on x- and y-axis. The default values are automatically determined to fit the selected data in the plot. |
xlab , ylab |
labels for the x-axis and y-axis. The default values nicely typeset mathematical expressions. The y-axis label also distinguishes between observed and expected frequency spectra. |
main |
a character string or expression specifying a main title for the plot |
legend |
optional vector of character strings or expressions, specifying labels for a legend box, which will be drawn in the upper right-hand corner of the screen. If specified, there should be exactly one label for each spectrum shown in the plot. |
barcol , pch , lty , lwd , col |
style vectors that can be used to override the global styles defined by zipfR.par |
By default, the frequency spectrum or spectra are represented as a barplot, with both axes using linear scale. If the log parameter is given, the spectra are shown either as lines in different styles (points=FALSE) or as overplotted points and lines (points=TRUE). The value of log specifies which axes should use logarithmic scale (specify log="" for a points-and-lines plot on linear scale).

In y-logarithmic plots, frequency classes with $V_m = 0$ are drawn outside the plot region (below the bottom margin) rather than skipped.

In all logarithmic plots, confidence intervals are indicated for expected frequency spectra with variance data (by vertical lines with T-shaped hooks at both ends). The size of the confidence intervals is controlled by the conf.level parameter (default: 95%). Set conf.level=NA in order to suppress the confidence interval indicators.
Line and point styles, as well as bar colours in the barplot, can be
defined globally with zipfR.par
. They can be overridden
locally with the optional parameters barcol
, pch
,
lty
, lwd
and col
, but this should only be used
when absolutely necessary. In most cases, it is more advisable to
change the global settings temporarily for a sequence of plots.
The bw
parameter is used to switch between B/W and colour
modes. It can also be set globally with zipfR.par
.
spc, lnre, lnre.spc, plot.tfl, plot.vgc, zipfR.par, zipfR.plotutils
## load Italian ultra- prefix data
data(ItaUltra.spc)

## plot spectrum
plot(ItaUltra.spc)

## logarithmic scale for m (more elements are plotted)
plot(ItaUltra.spc, log="x")

## just lines
plot(ItaUltra.spc, log="x", points=FALSE)

## just the first five elements, then the first 100
plot(ItaUltra.spc, m.max=5)
plot(ItaUltra.spc, m.max=100, log="x")

## compute zm model and expected spectrum
zm <- lnre("zm", ItaUltra.spc)
zm.spc <- lnre.spc(zm, N(ItaUltra.spc))

## compare observed and expected spectra (also
## in black and white to print on paper)
plot(ItaUltra.spc, zm.spc, legend=c("observed", "expected"))
plot(ItaUltra.spc, zm.spc, legend=c("observed", "expected"), bw=TRUE)
plot(ItaUltra.spc, zm.spc, legend=c("observed", "expected"), log="x")
plot(ItaUltra.spc, zm.spc, legend=c("observed", "expected"), log="x", bw=TRUE)

## re-generate expected spectrum with variances
zm.spc <- lnre.spc(zm, N(ItaUltra.spc), variances=TRUE)

## now the 95% confidence intervals are shown in the log plot
plot(zm.spc, log="x")

## different title and labels
plot(zm.spc, log="x", main="Expected Spectrum with Confidence Interval",
     xlab="spectrum elements", ylab="expected type counts")

## can pass a list of spectra in the first argument with an explicit call
plot.spc(Baayen2001[1:7], m.max=6, legend=names(Baayen2001)[1:7])
Zipf ranking plot of a type-frequency list, or comparison of several Zipf rankings, on linear or logarithmic scale.
## S3 method for class 'tfl'
plot(x, y, ..., min.rank=1, max.rank=NULL, log="",
     type=c("p", "l", "b", "o", "s"),
     xlim=NULL, ylim=NULL, freq=TRUE,
     xlab="rank", ylab="frequency", legend=NULL, grid=FALSE,
     main="Type-Frequency List (Zipf ranking)",
     bw=zipfR.par("bw"), cex=1, steps=200,
     pch=NULL, lty=NULL, lwd=NULL, col=NULL)
x, y, ...: one or more objects of class tfl, representing the type-frequency lists to be plotted. Trained LNRE models can also be included, but only with freq=FALSE (see "Details" below).

min.rank, max.rank: range of Zipf ranks to be plotted for each type-frequency list. By default, all ranks are shown.

log: a character string specifying the axis or axes for which logarithmic scale is to be used ("x", "y", or "xy"; default "")

type: what type of plot should be drawn. Types "p", "l", "b", "o" and "s" are supported (see "Details" below).

xlim, ylim: visible range on x- and y-axis. The default values are automatically determined to fit the selected data in the plot.

freq: if TRUE (the default), the y-axis shows absolute frequencies; if FALSE, it shows relative frequencies, so that samples of different size and LNRE populations can be compared

xlab, ylab: labels for the x-axis and y-axis.

legend: optional vector of character strings or expressions, specifying labels for a legend box, which will be drawn in the upper right-hand corner of the screen. If legend is given, its length must correspond to the number of type-frequency lists in the plot.

grid: whether to display a suitable grid in the background of the plot (only for logarithmic axes)

main: a character string or expression specifying a main title for the plot

bw: if TRUE, draw the plot in B/W style; the default is taken from the global zipfR.par setting

cex: scaling factor for plot symbols (for plot types that draw points)

steps: number of steps for drawing the population Zipf rankings of LNRE models. These are always drawn as lines (regardless of type).

pch, lty, lwd, col: style vectors that can be used to override the global styles defined by zipfR.par
The type-frequency lists are shown as Zipf plots, i.e. scatterplots of
the Zipf-ranked frequencies on a linear or logarithmic scale. Only a
sensible subset of the default plotting styles described in plot
are supported: p
(points), l
(lines), b
(both, with a margin around points),
o
(both overplotted) and s
(stair steps, but actually of type S
).
For plotting complete type-frequency lists from larger samples, type "s" is strongly recommended. It aggregates all types with the same frequency and is thus much more efficient than the other plot types. Note that the points shown by the other plot types coincide with the upper right corners of the stair steps.
Trained LNRE models can also be included in the plot, but only with freq=FALSE
.
In this case, the corresponding
population Zipf rankings are displayed as lines (i.e. always type l
, regardless
of the type
parameter). The lines are intended to be smooth and are not aligned
with integer type ranks in order to highlight the fact that LNRE models are continuous
approximations of the discrete population.
Line and point styles are defined globally through zipfR.par
,
but can be overridden with the optional parameters pch
,
lty
, lwd
and col
. In most cases, it is more advisable to
change the global settings temporarily for a sequence of plots, though.
The bw
parameter is used to switch between B/W and colour
modes. It can also be set globally with zipfR.par
.
tfl, vec2tfl, rlnre, spc2tfl, plot.spc, plot.vgc, plot.lnre, zipfR.par, zipfR.plotutils
## plot tiny type-frequency lists (N = 100) for illustration
tfl1 <- vec2tfl(EvertLuedeling2001$bar[1:100])
tfl2 <- vec2tfl(EvertLuedeling2001$lein[1:100])
plot(tfl1, type="b")
plot(tfl1, type="b", log="xy")
plot(tfl1, tfl2, legend=c("bar", "lein"))

## realistic type-frequency lists (type="s" recommended for efficiency)
tfl1 <- spc2tfl(BrownImag.spc)
tfl2 <- spc2tfl(BrownInform.spc)
plot(tfl1, tfl2, log="xy", type="s",
     legend=c("fiction", "non-fiction"), grid=TRUE)

## always use freq=FALSE to compare samples of different size
plot(tfl1, tfl2, log="xy", type="s", freq=FALSE,
     legend=c("fiction", "non-fiction"), grid=TRUE)

## show Zipf-Mandelbrot law fitted to low end of frequency spectrum
m1 <- lnre("zm", BrownInform.spc)
m2 <- lnre("fzm", BrownInform.spc)
plot(tfl1, tfl2, m1, m2, log="xy", type="s", freq=FALSE, grid=TRUE,
     legend=c("fiction", "non-fiction", "ZM", "fZM"))

## call plot.tfl explicitly if only LNRE populations are displayed
plot.tfl(m1, m2, max.rank=1e5, freq=FALSE, log="xy")

## first argument can then also be a list of TFLs and/or LNRE models
plot.tfl(lapply(EvertLuedeling2001, vec2tfl), log="xy", type="s", freq=FALSE,
         legend=names(EvertLuedeling2001))
Plot a vocabulary growth curve (i.e. a graph of V(N) or V_m(N) against the sample size N), or a comparison of several vocabulary growth curves.
## S3 method for class 'vgc'
plot(x, y, ..., m=NULL, add.m=NULL, N0=NULL,
     conf.level=.95, conf.style=c("ticks", "lines"),
     log=c("", "x", "y", "xy"), bw=zipfR.par("bw"),
     xlim=NULL, ylim=NULL, xlab="N", ylab="V(N)",
     legend=NULL, main="Vocabulary Growth",
     lty=NULL, lwd=NULL, col=NULL)
x, y, ...: one or more objects of class vgc, representing the vocabulary growth curves to be plotted

m: a single integer m in the range 1 ... 9. If specified, growth curves V_m(N) for the selected frequency class are plotted instead of the vocabulary growth curves V(N) (see "Details" below).

add.m: a vector of integers in the range 1 ... 9. If specified, growth curves for the corresponding V_m(N) are displayed in addition to the standard vocabulary growth curves (see "Details" below)

N0: if specified, draw a dashed vertical line at N = N0 (e.g. to indicate the sample size at which a LNRE model was estimated)

log: a character string specifying the axis or axes for which logarithmic scale is to be used ("x", "y", or "xy"; default "")

conf.level: confidence level for confidence intervals around expected vocabulary growth curves (see "Details" below). The default value of .95 produces 95%-confidence intervals.

conf.style: if "ticks" (the default), confidence intervals are indicated by short vertical lines; "lines" draws thin curves above and below the growth curve (see "Details" below)

bw: if TRUE, draw the plot in B/W style; the default is taken from the global zipfR.par setting

xlim, ylim: visible range on x- and y-axis. The default values are automatically determined to fit the selected data in the plot.

xlab, ylab: labels for the x-axis and y-axis. The default values nicely typeset mathematical expressions. The y-axis label also distinguishes between observed and expected vocabulary growth curves, as well as between V(N) and V_m(N).

main: a character string or expression specifying a main title for the plot

legend: optional vector of character strings or expressions, specifying labels for a legend box, which will be drawn in the lower right-hand corner of the screen. If legend is given, its length must correspond to the number of growth curves in the plot.

lty, lwd, col: style vectors that can be used to override the global styles defined by zipfR.par
By default, standard vocabulary growth curves are plotted for all specified vgc objects, i.e. graphs of V(N) against N. If m is specified, growth curves for hapax legomena or other frequency classes are shown instead, i.e. graphs of V_m(N) against N. In this case, all vgc objects must contain the necessary data for V_m(N).
Alternatively, the option add.m
can be used to display growth
curves for one or more spectrum elements in addition to the
standard VGCs. These growth curves are plotted as thinner lines,
otherwise matching the styles of the main curves. Since such plots
can become fairly confusing and there is no finer control over the
styles of the additional curves, it is generally not recommended to
make use of the add.m
option.
Confidence intervals are indicated for expected vocabulary growth
curves with variance data, either by short vertical lines
(conf.style="ticks"
, the default) or by thin curves above and
below the main growth curve (conf.style="lines"
). The size of
the confidence intervals is controlled by the conf.level
parameter (default: 95%). Set conf.level=NA
in order to
suppress the confidence interval indicators.
In y-logarithmic plots, data points with V(N) = 0 or V_m(N) = 0 are drawn outside the plot region (below the bottom margin) rather than skipped.
Line and point styles can be defined globally with zipfR.par
.
They can be overridden locally with the optional parameters
lty
, lwd
and col
, but this should only be used
when absolutely necessary. In most cases, it is more advisable to
change the global settings temporarily for a sequence of plots.
The bw
parameter is used to switch between B/W and color
modes. It can also be set globally with zipfR.par
.
vgc, lnre, lnre.vgc, plot.tfl, plot.spc, zipfR.par, zipfR.plotutils
## load Our Mutual Friend spectrum and empirical vgc
data(DickensOurMutualFriend.emp.vgc)
data(DickensOurMutualFriend.spc)

## plot empirical V and V1 growth
plot(DickensOurMutualFriend.emp.vgc, add.m=1)

## use log scale for y-axis
plot(DickensOurMutualFriend.emp.vgc, add.m=1, log="y")

## binomially interpolated vgc at same points as empirical vgc
omf.bin.vgc <- vgc.interp(DickensOurMutualFriend.spc, N(DickensOurMutualFriend.emp.vgc))

## compare empirical and interpolated vgc, also with
## thinner lines, and in black and white
plot(DickensOurMutualFriend.emp.vgc, omf.bin.vgc, legend=c("observed", "interpolated"))
plot(DickensOurMutualFriend.emp.vgc, omf.bin.vgc, legend=c("observed", "interpolated"), lwd=c(1,1))
plot(DickensOurMutualFriend.emp.vgc, omf.bin.vgc, legend=c("observed", "interpolated"), bw=TRUE)

## load Great Expectations spectrum and use it to compute ZM model
data(DickensGreatExpectations.spc)
ge.zm <- lnre("zm", DickensGreatExpectations.spc)

## expected V of Great Expectations at sample sizes of OMF's interpolated vgc
ge.zm.vgc <- lnre.vgc(ge.zm, N(omf.bin.vgc))

## compare interpolated OMF Vs and inter/extra-polated GE Vs, with a
## vertical line at the sample size used to compute the GE model
plot(omf.bin.vgc, ge.zm.vgc, N0=N(ge.zm), legend=c("OMF", "GE"))

## load Italian ultra- prefix data and compute zm model
data(ItaUltra.spc)
ultra.zm <- lnre("zm", ItaUltra.spc)

## compute vgc up to about twice the sample size, with variance of V
ultra.zm.vgc <- lnre.vgc(ultra.zm, (1:100)*70, variances=TRUE)

## plot with confidence intervals derived from variance in vgc (with
## larger datasets, ci will typically be almost invisible)
plot(ultra.zm.vgc)

## use more conservative confidence level, and plot the intervals as lines
plot(ultra.zm.vgc, conf.level=.99, conf.style="lines")

## suppress ci plotting, and insert different title and labels
plot(ultra.zm.vgc, conf.level=NA, main="ultra-", xlab="sample sizes", ylab="types")

## load Brown adjective spectrum (about 80k tokens)
data(BrownAdj.spc)

## binomially interpolated curve of V and V_1 to V_5
BrownAdj.bin.vgc <- vgc.interp(BrownAdj.spc, (1:100)*800, m.max=5)

## plot with V and 5 spectrum elements
plot(BrownAdj.bin.vgc, add.m=c(1:5))

## can pass a list of VGCs in the first argument with an explicit call
plot.vgc(lapply(EvertLuedeling2001, vec2vgc), xlim=c(0, 30000), ylim=c(0, 1200),
         legend=names(EvertLuedeling2001))
Implementations of the print
and summary
methods for LNRE models (subclasses of lnre
).
## S3 method for class 'lnre'
print(x, ...)

## S3 method for class 'lnre'
summary(object, ...)
x, object: an object of class lnre, representing a trained LNRE model

...: other arguments passed on from the generic method will be ignored
NB: implementation details and format of the summary are subject to change in future releases
In the current implementation, print
and summary
produce
the same output for LNRE models.
This summary comprises the type of LNRE model, its parameter values, derived parameters such as normalization constants, and the population size S.
If the model parameters have been estimated from an observed frequency spectrum, a comparison of the observed and expected frequency spectrum is shown, including goodness-of-fit statistics.
NULL
Unlike other implementations of the summary
method,
summary.lnre
only prints a summary on screen and does not return
a special "summary" object.
See the lnre
manpage for more information on LNRE
models.
# load Brown verbs dataset and estimate lnre models
data(BrownVer.spc)
zm <- lnre("zm", BrownVer.spc)
fzm <- lnre("fzm", BrownVer.spc, exact=FALSE)
gigp <- lnre("gigp", BrownVer.spc)

# look at summaries with either summary or print
summary(zm)
print(zm)
summary(fzm)
print(fzm)
summary(gigp)
print(gigp)
Implementations of the print
and summary
methods for frequency spectrum objects (of class spc
).
## S3 method for class 'spc'
print(x, all=FALSE, ...)

## S3 method for class 'spc'
summary(object, ...)
x, object: an object of class spc, representing a frequency spectrum

all: if TRUE, print all non-zero spectrum elements (default: only the first ten)

...: other arguments passed on from the generic method will be ignored
NB: implementation details and format of the summary are subject to change in future releases
print.spc works similarly to the standard print method for data frames, but provides additional information about N and V. Unless all is set to TRUE, only the first ten non-zero spectrum elements will be shown.
summary.spc gives a concise summary of the most important information about the frequency spectrum. In addition to N and V, the first spectrum elements are shown. The summary will also indicate whether the spectrum is incomplete, an expected spectrum, or has variances (but does not show the variances).
NULL
Unlike other implementations of the summary
method,
summary.spc
only prints a summary on screen and does not return
a special "summary" object.
See the spc
manpage for details on spc
objects.
## load Brown verbs dataset
data(BrownVer.spc)

## look at summary and print BrownVer.spc
summary(BrownVer.spc)
print(BrownVer.spc)

## print all non-zero spectrum elements
print(BrownVer.spc, all=TRUE)

## estimate zm model and construct expected spectrum with variances
zm <- lnre("zm", BrownVer.spc)
zm.spc <- lnre.spc(zm, N(zm), variances=TRUE)

## summary and print for the expected spectrum
summary(zm.spc)
print(zm.spc)
Implementations of the print
and summary
methods for type frequency list objects (of class tfl
).
## S3 method for class 'tfl'
print(x, all=FALSE, ...)

## S3 method for class 'tfl'
summary(object, ...)
x, object: an object of class tfl, representing a type frequency list

all: if TRUE, print the entire type frequency list (default: only the twenty most frequent types)

...: other arguments passed on from the generic method will be ignored
NB: implementation details and format of the summary are subject to change in future releases
print.tfl works similarly to the standard print method for data frames, but provides additional information about N and V. Unless all is set to TRUE, only the twenty most frequent types will be shown.
summary.tfl gives a concise summary of the most important information about the type frequency list. In addition to showing N and V, the summary also indicates whether the list is incomplete and shows examples of type representations (if present).
NULL
Unlike other implementations of the summary
method,
summary.tfl
only prints a summary on screen and does not return
a special "summary" object.
See the tfl
manpage for details on tfl
objects.
## load Brown tfl
data(Brown.tfl)

## summary and print most frequent types
summary(Brown.tfl)
print(Brown.tfl)

## the whole type list (don't try this unless you have some time to spare)
## Not run: 
print(Brown.tfl, all=TRUE)
## End(Not run)
Implementations of the print
and summary
methods for vocabulary growth curve objects (of class vgc
).
## S3 method for class 'vgc'
print(x, all=FALSE, ...)

## S3 method for class 'vgc'
summary(object, ...)
x, object: an object of class vgc, representing a vocabulary growth curve

all: if TRUE, print vocabulary growth data for all sample sizes (default: a random subset of at most 25 sample sizes)

...: other arguments passed on from the generic method will be ignored
NB: implementation details and format of the summary are subject to change in future releases
print.vgc
calls the standard print
method for
data frames internally, but reduces the data set randomly to
show at most 25 sample sizes (unless all=TRUE
).
summary.vgc
gives a concise summary of the available vocabulary
growth data in the vgc
object, including the number and range
of sample sizes, whether spectrum elements are included, and whether
variances are included.
NULL
Unlike other implementations of the summary
method,
summary.vgc
only prints a summary on screen and does not return
a special "summary" object.
See the vgc
manpage for details on vgc
objects.
## load Brown "informative" prose empirical vgc data(BrownInform.emp.vgc) ## summary, print (random subset) and print all summary(BrownInform.emp.vgc) print(BrownInform.emp.vgc) print(BrownInform.emp.vgc,all=TRUE) ## load Brown informative prose spectrum ## and get estimate a fzm model data(BrownInform.spc) fzm <- lnre("fzm",BrownInform.spc,exact=FALSE) ## obtain expected vgc up to 2M tokens ## with spectrum elements up to V_3 ## and variances fzm.vgc <- lnre.vgc(fzm,(1:100)*2e+4,m.max=3,variances=TRUE) ## summary and print summary(fzm.vgc) print(fzm.vgc) print(fzm.vgc,all=TRUE)
## load Brown "informative" prose empirical vgc data(BrownInform.emp.vgc) ## summary, print (random subset) and print all summary(BrownInform.emp.vgc) print(BrownInform.emp.vgc) print(BrownInform.emp.vgc,all=TRUE) ## load Brown informative prose spectrum ## and get estimate a fzm model data(BrownInform.spc) fzm <- lnre("fzm",BrownInform.spc,exact=FALSE) ## obtain expected vgc up to 2M tokens ## with spectrum elements up to V_3 ## and variances fzm.vgc <- lnre.vgc(fzm,(1:100)*2e+4,m.max=3,variances=TRUE) ## summary and print summary(fzm.vgc) print(fzm.vgc) print(fzm.vgc,all=TRUE)
Compute various measures of productivity and lexical richness from an observed frequency spectrum or type-frequency list, from an observed vocabulary growth curve, or from a vector of tokens.
productivity.measures(obj, measures, data.frame=TRUE, ...)

## S3 method for class 'tfl'
productivity.measures(obj, measures, data.frame=TRUE, ...)

## S3 method for class 'spc'
productivity.measures(obj, measures, data.frame=TRUE, ...)

## S3 method for class 'vgc'
productivity.measures(obj, measures, data.frame=TRUE, ...)

## Default S3 method:
productivity.measures(obj, measures, data.frame=TRUE, ...)
obj: a suitable data object from which productivity measures can be computed: a frequency spectrum (of class spc), a type-frequency list (of class tfl), a vocabulary growth curve (of class vgc), or a vector of tokens

measures: character vector naming the productivity measures to be computed (see "Productivity Measures" below). Names may be abbreviated as long as they remain unique. If unspecified, all supported measures are computed.

data.frame: if TRUE (the default), the return value is converted to a data frame (see "Value" below)

...: additional arguments passed on to the method implementations (currently, no further arguments are recognized)
This function computes productivity measures based on an observed frequency spectrum, type-frequency list or vocabulary growth curve.
If an expected spectrum or VGC is passed, the expectations E[V], E[V_m] will simply be substituted for the sample values V, V_m in the equations. In most cases, this does not yield the expected value of the productivity measure!
Some measures can only be computed from a complete frequency spectrum. They will return NA if obj is an incomplete spectrum or type-frequency list, or if an expected spectrum or a vocabulary growth curve is passed.
Some other measures can only be computed if a sufficient number of spectrum elements is included in a vocabulary growth curve (usually at least V_1 and V_2), and will return NA otherwise. Such limitations are indicated in the list of measures below (unless the spectrum elements V_1 and V_2 are sufficient).
If obj
is a frequency spectrum, type-frequency list or token vector:
A numeric vector of the same length as measures
with the corresponding observed values of the productivity measures.
If data.frame=TRUE
(the default), a single-row data frame is returned.
If obj
is a vocabulary growth curve:
A numeric matrix with columns corresponding to the selected productivity measures and rows corresponding to the sample sizes of the vocabulary growth curve.
If data.frame=TRUE
(the default), the matrix is converted to a data frame.
The following productivity measures are currently supported:

V: the total number of types V

TTR: the type-token ratio TTR = V / N

R: Guiraud's (1954) R = V / sqrt(N). An equivalent measure is Carroll's (1964) CTTR = V / sqrt(2 N).

C: Herdan's (1964) C = log(V) / log(N)

k: Dugast's (1979) k = log(V) / log(log(N))

U: Dugast's (1978, 1979) U = (log N)^2 / (log N - log V). Maas (1972) proposed an equivalent measure a^2 = (log N - log V) / (log N)^2.

W: Brunet's (1978) W = N ^ (V ^ -a) with a = 0.172.

P: Baayen's (1991) productivity index P = V_1 / N, which corresponds to the slope of the vocabulary growth curve (under random sampling assumptions)

Hapax: the proportion of hapax legomena V_1 / V is a direct estimate for the parameter alpha of a population following the Zipf-Mandelbrot law (Evert 2004b: 130).

H: Honoré's (1979) H = 100 * log(N) / (1 - V_1 / V), a transformation of the proportion of hapax legomena adjusted for sample size

S: Sichel's (1975) S = V_2 / V, i.e. the proportion of dis legomena. Michéa's (1969, 1971) M = V / V_2 is an equivalent measure.

alpha2: Evert's alpha_2 = 1 - 2 * V_2 / V_1 is another direct estimate for the parameter alpha of a Zipf-Mandelbrot population (Evert 2004b: 127).

K: Yule's (1944) K = 10^4 * (sum_m m^2 * V_m - N) / N^2 (only for complete frequency spectrum or type-frequency list). Herdan (1955) proposes an almost equivalent measure based on a different derivation. Both measures converge for large N and V. Yule's K is almost identical to Simpson's D and is an unbiased estimator for the same population coefficient delta under an independent Poisson sampling scheme. A measure of lexical poverty, i.e. smaller values correspond to higher productivity.

D: Simpson's (1949) D = sum_m V_m * (m / N) * ((m - 1) / (N - 1)) (only for complete frequency spectrum or type-frequency list) is a slightly modified version of Yule's K. This measure is an unbiased estimator for a population coefficient delta, representing the probability of picking the same type twice in two consecutive draws from the population. A measure of lexical poverty, i.e. smaller values correspond to higher productivity.

Entropy: the entropy H = - sum_m V_m * (m / N) * log2(m / N) of the sample frequency distribution (only for complete frequency spectrum or type-frequency list). This is not a reliable estimator of population entropy. It is therefore not recommended as a productivity measure and has only been included for evaluation studies. A measure of lexical poverty, i.e. smaller values correspond to higher productivity.

eta: normalised entropy or evenness eta = H / log2(V) (only for complete frequency spectrum or type-frequency list), where log2(V) is the largest possible value for a sample with the observed vocabulary size (obtained for a uniform distribution). Therefore, 0 <= eta <= 1. Not recommended as a productivity measure because it is expected to produce erratic and counterintuitive results.
See Sec. 2.1 of the technical report Inside zipfR for further details and references.
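As a quick sanity check, the simplest measures can be recomputed directly from the definitions above; a minimal sketch using the Brown.spc example data:

## TTR and Guiraud's R from productivity.measures ...
productivity.measures(Brown.spc, measures=c("TTR", "R"))
## ... should match the values computed by hand
c(TTR = V(Brown.spc) / N(Brown.spc),
  R   = V(Brown.spc) / sqrt(N(Brown.spc)))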
Evert, Stefan (2004b). The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD Thesis, IMS, University of Stuttgart. URN urn:nbn:de:bsz:93-opus-23714 http://dx.doi.org/10.18419/opus-2556
lnre.productivity.measures
for parametric bootstrapping and approximate expectations
of productivity measures in random samples from a LNRE population.
rbind(
  AllTexts=productivity.measures(Brown.spc),
  Fiction=productivity.measures(BrownImag.spc),
  NonFiction=productivity.measures(BrownInform.spc))

## can be applied to token vector, type-frequency list, or frequency spectrum
bar.vec <- EvertLuedeling2001$bar
bar1 <- productivity.measures(bar.vec)          # token vector
bar2 <- productivity.measures(vec2tfl(bar.vec)) # type-frequency list
bar3 <- productivity.measures(vec2spc(bar.vec)) # frequency spectrum
print(rbind(tokens=bar1, tfl=bar2, spc=bar3))

## sample-size dependency of productivity measures in Brown corpus
## (note that only a subset of the measures can be computed)
n <- c(10e3, 50e3, 100e3, 200e3, 500e3, 1e6)
idx <- N(Brown.emp.vgc) %in% n
my.vgc <- vgc(N=N(Brown.emp.vgc)[idx], V=V(Brown.emp.vgc)[idx],
              Vm=list(Vm(Brown.emp.vgc, 1)[idx]))
print(my.vgc) # since we don't have a subset method for VGCs yet

productivity.measures(my.vgc)
productivity.measures(my.vgc, measures=c("TTR", "P")) # selected measures

## parametric bootstrapping to obtain sampling distribution of measures
## (much easier with ?lnre.productivity.measures)
model <- lnre("zm", spc=ItaRi.spc) # realistic LNRE model
res <- lnre.bootstrap(model, 1e6, ESTIMATOR=identity,
                      STATISTIC=productivity.measures)
bootstrap.confint(res, method="normal")
read.multiple.objects
constructs a list of spc
,
vgc
or tfl
objects from a set of input
text files in the specified directory
NB: This function is intended for users that want to run
advanced experiments (e.g., handling hundreds of spectra generated in
multiple randomizations experiments). For the standard
one-object-at-a-time reading functionality, look at the documentation
of read.spc
, read.vgc
and
read.tfl
read.multiple.objects(directory, prefix, class=c("spc", "vgc", "tfl"))
directory: character string specifying the directory where the target input files reside (absolute path, or path relative to the current working directory)

prefix: character string specifying the prefix that must be shared by all target input file names

class: one of "spc", "vgc" or "tfl", specifying the class of objects to be read
read.multiple.objects
reads in all files matching the pattern
prefix.id.class
from the specified directory, where the
prefix
and class
strings are passed as arguments, and
id
is an arbitrary string that is used as index of the
corresponding object in the output list
read.multiple.objects
calls the read
function
corresponding to the class
argument. Thus, the input files must
respect the formatting conventions of the relevant reading functions
(see documentation of read.spc
, read.vgc
and read.tfl
)
read.multiple.objects
returns a list of objects of the
specified class; each object is indexed with the id extracted from the
corresponding file name (see section "Format")
See the spc
, vgc
and tfl
manpages for details on the corresponding objects;
read.spc
, read.vgc
and
read.tfl
for the single-file reading functions and input
format details
## Not run: 
## These are just illustrative examples. Users should fill in their
## own files instead of the dummy names used here.

## suppose that the current working directory contains
## 100 spc files named: rand.1.spc, rand.2.spc, ..., rand.100.spc

## read the files in:
spc.list <- read.multiple.objects(".", "rand", "spc")

## you can access each spc using the id extracted from
## the file name, e.g.:
summary(spc.list[["1"]])

## more usefully, you might want to iterate over the
## whole list, e.g., to calculate mean V:
mean(sapply(spc.list, V))

## notice that ids are arbitrary strings

## e.g., suppose that directory /home/me/animals
## contains sounds.dog.vgc and sounds.elephant.vgc

## we read the vgcs in:
vgc.list <- read.multiple.objects("/home/me/animals", "sounds", "vgc")

## accessing the elephant vgc:
V(vgc.list[["elephant"]])

## of course, tfl-reading works in the same way (assuming
## that the animals directory also contains some tfl files):
tfl.list <- read.multiple.objects("/home/me/animals", "sounds", "tfl")

## End(Not run)
read.spc loads a frequency spectrum from a .spc file.

write.spc saves a frequency spectrum object in a .spc file.
read.spc(file)
write.spc(spc, file)
file: character string specifying the pathname of a disk file. Files with extension .gz, .bz2 or .xz will automatically be compressed or decompressed (see "Details").

spc: a frequency spectrum, i.e. an object of class spc
A TAB-delimited text file with column headers but no row names (suitable for reading with read.delim). The file must contain at least the following two columns:

m: frequency class m

Vm: number of types V_m in frequency class m (or expected class size E[V_m])

An optional column labelled VVm can be used to specify variances of expected class sizes (for a frequency spectrum derived from a LNRE model or by binomial interpolation).

These columns may appear in any order in the text file. All other columns will be silently ignored.
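For illustration, the first lines of a valid .spc file might look as follows (TAB-delimited; the counts are invented):

m	Vm
1	1421
2	512
3	274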
If the filename file
ends in the extension .gz
, .bz2
or .xz
,
the disk file will automatically be decompressed (read.spc
) or compressed (write.spc
).
The .spc
file format does not store the values of N
,
V
and VV
explicitly. Therefore, incomplete frequency
spectra and expected spectra with variances cannot be fully
reconstructed from disk files. Saving such frequency spectra (or
loading a spectrum with variance data) will trigger corresponding
warnings.
read.spc
returns an object of class spc
(see the
spc
manpage for details)
See the spc
manpage for details on spc
objects. See read.tfl
and read.vgc
for
import/export of other data structures.
## save Italian ultra- frequency spectrum to an external text file
fname <- tempfile(fileext=".spc")
write.spc(ItaUltra.spc, fname)

## now <fname> is a TAB-delimited text file with columns m and Vm

## we read it back in
New.spc <- read.spc(fname)

## same spectrum as ItaUltra.spc, compare:
summary(New.spc)
summary(ItaUltra.spc)
stopifnot(isTRUE(all.equal(New.spc, ItaUltra.spc))) # should be identical

## Not run: 
## DON'T do the following, an incomplete spectrum will not be restored properly !!!
zm <- lnre("zm", ItaUltra.spc) # estimate model
zm.spc <- lnre.spc(zm, N(zm))  # incomplete spectrum from model
write.spc(zm.spc, fname)       # WARNINGS
bad.spc <- read.spc(fname)     # but this function cannot know something is wrong
summary(zm.spc)
summary(bad.spc) # note that N and V are completely wrong !!!
## End(Not run)
read.tfl loads a type frequency list from a .tfl file.

write.tfl saves a type frequency list object in a .tfl file.
read.tfl(file, encoding=getOption("encoding"))
write.tfl(tfl, file, encoding=getOption("encoding"))
file: character string specifying the pathname of a disk file. Files with extension .gz, .bz2 or .xz will automatically be compressed or decompressed (see "Details").

tfl: a type frequency list, i.e. an object of class tfl

encoding: specifies the character encoding of the disk file to be read or written to. See file for details.
A TAB-delimited text file with column headers but no row names (suitable for reading with read.delim), containing the following columns:

f: type frequencies f_k

k: optional: the corresponding type IDs k. If missing, increasing non-negative integers are automatically assigned as IDs.

type: optional: type representations (such as word forms or lemmas)

These columns may appear in any order in the text file. Only the f column is mandatory and all unrecognized columns will be silently ignored.
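For illustration, the first lines of a valid .tfl file with all three columns might look as follows (TAB-delimited; the entries are invented):

k	f	type
1	2087	the
2	1098	of
3	1045	and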
If the filename file ends in the extension .gz, .bz2 or .xz, the disk file will automatically be decompressed (read.tfl) or compressed (write.tfl).
The .tfl
file format stores neither the values of N
and
V
nor the range of type frequencies explicitly. Therefore,
incomplete type frequency lists cannot be fully reconstructed from
disk files (and will not even be recognized as such). An attempt to
save such a list will trigger a corresponding warning.
read.tfl
returns an object of class tfl
(see the
tfl
manpage for details)
See the tfl
manpage for details on tfl
objects. See read.spc
and read.vgc
for
import/export of other data structures.
## save type-frequency list for Brown corpus to an external file
fname <- tempfile(fileext=".tfl.gz") # automatically compresses file
write.tfl(Brown.tfl, fname)

## file <fname> contains a compressed TAB-delimited table with fields
##   k    ... type ID (usually Zipf rank)
##   f    ... frequency of type
##   type ... the type itself (here a word form)

## read it back in
New.tfl <- read.tfl(fname)

## same as Brown.tfl
summary(New.tfl)
summary(Brown.tfl)
print(New.tfl)
print(Brown.tfl)
head(New.tfl)
head(Brown.tfl)
stopifnot(isTRUE(all.equal(New.tfl, Brown.tfl))) # should be identical

## Not run: 
## suppose you have a text file with a frequency list, one f per line, e.g.:
##   f
##   14
##   12
##   31
##   ...

## you can import this with read.tfl
MyData.tfl <- read.tfl("mylist.txt")
summary(MyData.tfl)
print(MyData.tfl) # ids in column k added by zipfR

## from this you can generate a spectrum with tfl2spc
MyData.spc <- tfl2spc(MyData.tfl)
summary(MyData.spc)
## End(Not run)
read.vgc loads vocabulary growth data from a .vgc file.

write.vgc saves vocabulary growth data in a .vgc file.
read.vgc(file)
write.vgc(vgc, file)
file: character string specifying the pathname of a disk file. Files with extension .gz, .bz2 or .xz will automatically be compressed or decompressed (see "Details").

vgc: a vocabulary growth curve, i.e. an object of class vgc
A TAB-delimited text file with column headers but no row names (suitable for reading with read.delim). The file must contain at least the following two columns:

N: increasing integer vector of sample sizes N

V: corresponding observed vocabulary sizes V(N) or expected vocabulary sizes E[V(N)]

Optionally, columns V1, ..., V9 can be added to specify the number of hapaxes (V_1(N)), dis legomena (V_2(N)), and further spectrum elements up to V_9(N). It is not necessary to include all 9 columns, but for any V_m(N) in the data set, all "lower" spectrum elements V_m'(N) (for m' < m) must also be present. For example, it is valid to have columns V1 V2 V3, but not V1 V3 V5 or V2 V3 V4.

Variances for expected vocabulary sizes and spectrum elements can be given in further columns VV (for Var[V(N)]) and VV1, ..., VV9 (for Var[V_m(N)]). VV is mandatory in this case, and columns VVm must be specified for exactly the same frequency classes m as the Vm above.

These columns may appear in any order in the text file. All other columns will be silently ignored.
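For illustration, the first lines of a valid .vgc file with hapax counts might look as follows (TAB-delimited; the numbers are invented):

N	V	V1
1000	432	296
2000	715	446
3000	950	553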
If the filename file
ends in the extension .gz
, .bz2
or .xz
,
the disk file will automatically be decompressed (read.vgc
) or compressed (write.vgc
).
read.vgc
returns an object of class vgc
(see the
vgc
manpage for details)
See the vgc
manpage for details on vgc
objects.
See read.tfl
and read.spc
for
import/export of other data structures.
## save Italian ultra- prefix VGC to an external text file
fname <- tempfile(fileext=".vgc")
write.vgc(ItaUltra.emp.vgc, fname)

## now <fname> is a TAB-delimited text file with columns N, V and V1

## we read it back in
New.vgc <- read.vgc(fname)

## same vgc as ItaUltra.emp.vgc, compare:
summary(New.vgc)
summary(ItaUltra.emp.vgc)
head(New.vgc)
head(ItaUltra.emp.vgc)
stopifnot(isTRUE(all.equal(New.vgc, ItaUltra.emp.vgc))) # should be identical
Compute incremental random samples from a frequency spectrum (an object
of class spc
).
sample.spc(obj, N, force.list=FALSE)
obj: an object of class spc, representing a frequency spectrum

N: a vector of non-negative integers in increasing order, the sample sizes for which incremental samples will be generated

force.list: if TRUE, the return value is always a list of spc objects, even if N is a single integer
This function is currently implemented as a wrapper around
sample.tfl
, using spc2tfl
and tfl2spc
to convert
between frequency spectra and type frequency lists. A direct
implementation might be slightly more efficient, but would very likely
not make a substantial difference.
If N is a single integer (and the force.list flag is not set), a spc object representing the frequency spectrum of a random sample of size N from obj.

If N is a vector of length greater than one, or if force.list=TRUE, a list of spc objects representing the frequency spectra of incremental random samples of the specified sizes N. Incremental means that each sample is a superset of the preceding sample.
spc
for more information about frequency spectra
sample.tfl
is an analogous function for type frequency
lists (objects of class tfl
)
sample.spc takes a single concrete random subsample from a spectrum and returns the spectrum of the subsample, unlike spc.interp, which computes the expected frequency spectrum for random subsamples of size N by binomial interpolation.
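A minimal sketch of the difference, using the Brown.spc example data:

obs.spc <- sample.spc(Brown.spc, 1e+5)           # concrete sample, varies between runs
exp.spc <- spc.interp(Brown.spc, 1e+5, m.max=10) # deterministic expectation
Vm(obs.spc, 1)
Vm(exp.spc, 1)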
## read Brown spectrum
data(Brown.spc)
summary(Brown.spc)

## sample a spectrum of 100k tokens
MiniBrown.spc <- sample.spc(Brown.spc, 1e+5)
summary(MiniBrown.spc)

## if we repeat, we get a different sample
MiniBrown.spc <- sample.spc(Brown.spc, 1e+5)
summary(MiniBrown.spc)
Compute incremental random samples from a type frequency list (an
object of class tfl
).
sample.tfl(obj, N, force.list=FALSE)
obj: an object of class tfl, representing a type frequency list

N: a vector of non-negative integers in increasing order, the sample sizes for which incremental samples will be generated

force.list: if TRUE, the return value is always a list of tfl objects, even if N is a single integer
The current implementation is reasonably efficient, but will be rather slow when applied to very large type frequency lists.
If N is a single integer (and the force.list flag is not set), a tfl object representing a random sample of size N from the type frequency list obj.

If N is a vector of length greater than one, or if force.list=TRUE, a list of tfl objects representing incremental random samples of the specified sizes N. Incremental means that each sample is a superset of the preceding sample.
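A minimal sketch of the list-valued case, using the Brown.tfl example data:

## incremental samples: each list entry is a superset of the previous one
tfl.list <- sample.tfl(Brown.tfl, N=c(1e+4, 5e+4, 1e+5))
sapply(tfl.list, N)   # the requested sample sizes
sapply(tfl.list, V)   # vocabulary size grows with the sample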
tfl
for more information about type frequency lists
sample.spc
is an analogous function for frequency
spectra (objects of class spc
)
## load Brown tfl
data(Brown.tfl)
summary(Brown.tfl)

## sample a tfl of 100k tokens
MiniBrown.tfl <- sample.tfl(Brown.tfl, 1e+5)
summary(MiniBrown.tfl)

## if we repeat, we get a different sample
MiniBrown.tfl <- sample.tfl(Brown.tfl, 1e+5)
summary(MiniBrown.tfl)
In the zipfR
library, spc
objects are used to represent
a word frequency spectrum (either an observed spectrum or the expected
spectrum of a LNRE model at a given sample size).
With the spc
constructor function, an object can be initialized
directly from the specified data vectors. It is more common to read
an observed spectrum from a disk file with read.spc
or
compute an expected spectrum with lnre.spc
, though.
spc
objects should always be treated as read-only.
spc(Vm, m=1:length(Vm), VVm=NULL, N=NA, V=NA, VV=NA, m.max=0, expected=!missing(VVm))
m: integer vector of frequency classes m (defaults to 1 ... length(Vm))

Vm: vector of corresponding class sizes V_m

VVm: optional vector of estimated variances Var[V_m] for an expected frequency spectrum

N, V: total sample size N and vocabulary size V of the frequency spectrum (required for an incomplete spectrum)

VV: variance Var[V] of the expected vocabulary size

m.max: highest frequency class m listed in an incomplete spectrum; the default of 0 indicates a complete frequency spectrum

expected: set to TRUE if the spectrum lists expected class sizes E[V_m] rather than observed ones (automatically enabled when VVm is specified)
A spc object is a data frame with the following variables:

m: frequency class m, an integer vector

Vm: class size, i.e. number of types V_m in frequency class m (either observed class size from a sample or expected class size E[V_m] based on a LNRE model)

VVm: optional: estimated variance Var[V_m] of expected class size (only meaningful for an expected spectrum derived from a LNRE model)

The following attributes are used to store additional information about the frequency spectrum:

m.max: if non-zero, the frequency spectrum is incomplete and lists only frequency classes up to m.max

N, V: sample size N and vocabulary size V of the frequency spectrum. For a complete frequency spectrum, these values could easily be determined from m and Vm, but they are essential for an incomplete spectrum.

VV: variance Var[V] of the expected vocabulary size; only present if hasVariances is TRUE. Note that VV may have the value NA if the user failed to specify it.

expected: if TRUE, the frequency spectrum lists expected class sizes E[V_m] (rather than observed class sizes V_m). Note that the VVm variable is only allowed for an expected frequency spectrum.

hasVariances: indicates whether or not the VVm variable is present
An object of class spc
representing the specified frequency
spectrum. This object should be treated as read-only (although such
behaviour cannot be enforced in R).
read.spc, write.spc, spc.vector, sample.spc, spc2tfl, tfl2spc, lnre.spc, plot.spc

Generic methods supported by spc objects are print, summary, N, V, Vm, VV, and VVm. Implementation details and non-standard arguments for these methods can be found on the manpages print.spc, summary.spc, N.spc, V.spc, etc.
## load Brown imaginative prose spectrum and inspect it
data(BrownImag.spc)

summary(BrownImag.spc)
print(BrownImag.spc)
plot(BrownImag.spc)

N(BrownImag.spc)
V(BrownImag.spc)
Vm(BrownImag.spc, 1)
Vm(BrownImag.spc, 1:5)

## compute ZM model, and generate PARTIAL expected spectrum
## with variances for a sample of 10 million tokens
zm <- lnre("zm", BrownImag.spc)
zm.spc <- lnre.spc(zm, 1e+7, variances=TRUE)

## inspect extrapolated spectrum
summary(zm.spc)
print(zm.spc)
plot(zm.spc, log="x")

N(zm.spc)
V(zm.spc)
VV(zm.spc)
Vm(zm.spc, 1)
VVm(zm.spc, 1)

## generate an artificial Zipfian-looking spectrum and take a look at it
zipf.spc <- spc(round(1000/(1:1000)^2))
summary(zipf.spc)
plot(zipf.spc)

## see manpages of lnre, and the various *.spc manpages
## for more examples of spc usage
spc.interp computes the expected frequency spectrum for a random sample of specified size N, taken from a data set described by the frequency spectrum object obj.
spc.interp(obj, N, m.max=max(obj$m), allow.extrapolation=FALSE)
obj: an object of class spc, representing the frequency spectrum of the data set

N: a single non-negative integer specifying the sample size for which the expected frequency spectrum is calculated

m.max: number of spectrum elements listed in the expected frequency spectrum. By default, as many spectrum elements are included as the spectrum obj contains (see note in "Details" below).

allow.extrapolation: if TRUE, the expected frequency spectrum may also be extrapolated to a sample size N larger than the sample size of obj. Binomial extrapolation is highly unreliable, so this option is strongly discouraged (see the EVm.spc manpage for details).
See the EVm.spc
manpage for more information, especially
concerning binomial extrapolation.
For large frequency spectra, the default value of m.max
may
lead to very long computation times. It is therefore recommended to
specify m.max
explicitly and calculate only as many spectrum
elements as are actually required.
An object of class spc
, representing the expected frequency
spectrum for a random sample of size N
taken from the data set
that is described by obj
.
spc
for more information about frequency spectra and
links to relevant functions
The implementation of spc.interp
is based on the functions
EV.spc
and EVm.spc
. See the respective
manpages for technical details.
vgc.interp
computes expected vocabulary growth curves by
binomial interpolation from a frequency spectrum
sample.spc
takes a single concrete random
subsample from a spectrum and returns the spectrum of the subsample,
unlike spc.interp
, that computes the expected
frequency spectrum for random subsamples of size N
by
binomial interpolation.
## load the Tiger NP expansion spectrum (sample size: about 109k tokens)
data(TigerNP.spc)

## interpolated expected frequency subspectrum of 50k tokens
TigerNP.sub.spc <- spc.interp(TigerNP.spc, 5e+4)
summary(TigerNP.sub.spc)

## previous is slow since it calculates all expected spectrum
## elements; suppose we only need the first 10 expected
## spectrum element frequencies; then we can do:
TigerNP.sub.spc <- spc.interp(TigerNP.spc, 5e+4, m.max=10) # much faster!
summary(TigerNP.sub.spc)
spc.vector returns a selected range of elements from a frequency spectrum as a plain numeric vector (which may contain entries with V_m = 0, unlike the spc object itself).
spc.vector(obj, m.min=1, m.max=15, all=FALSE)
obj: an object of class spc, representing a frequency spectrum

m.min, m.max: determine the range of frequency classes to be returned (defaulting to 1 ... 15)

all: if TRUE, the entire frequency spectrum is returned, overriding m.min and m.max
spc.vector(obj, a, b) is fully equivalent to Vm(obj, a:b) (and is implemented in this way).
A numeric vector with the selected elements of the frequency spectrum. In this vector, empty frequency classes (V_m = 0) are represented by 0 entries (unlike the spc object, which omits all empty classes).
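A toy illustration (the spectrum below is constructed ad hoc, not a package dataset):

## toy spectrum with empty frequency classes m = 3 and m = 4
toy.spc <- spc(Vm=c(10, 5, 2), m=c(1, 2, 5))
spc.vector(toy.spc, 1, 5)   # 10 5 0 0 2 -- zeroes fill the empty classes
Vm(toy.spc, 1:5)            # the equivalent method call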
spc
for more information about spc
objects and
links to relevant functions
Vm.spc
for an alternative way of extracting spectrum
vectors from an spc
object, and N.spc
,
V.spc
, VV.spc
, VVm.spc
for
extracting related information
## Brown Noun spectrum
data(BrownNoun.spc)

## by default, extract first 15 elements
spc.vector(BrownNoun.spc)

## first five elements
spc.vector(BrownNoun.spc,1,5)

## just frequencies of spc elements 4 and 5
spc.vector(BrownNoun.spc,4,5)
## same as Vm(BrownNoun.spc,4:5)
tfl2spc
computes an observed frequency spectrum from a type
frequency list, while spc2tfl
reconstructs the type frequency
list underlying a frequency spectrum (but without type
representations).
tfl2spc(tfl)
spc2tfl(spc)
tfl | an object of class tfl, representing a type frequency list |
spc | an object of class spc, representing a frequency spectrum |
The current implementation of these functions does not support incomplete type frequency lists and frequency spectra.
spc2tfl
can only convert frequency spectra where all class
sizes are integers. For this reason, expected frequency spectra
(including all spectra with variance data) are not supported.
For tfl2spc, an object of class spc representing the frequency spectrum corresponding to the type frequency list tfl.

For spc2tfl, an object of class tfl representing the type frequency list underlying the observed frequency spectrum spc.
spc
for more information about spc
objects and
links to relevant functions; tfl
for more information
about tfl
objects and links to relevant functions
## Brown tfl and spc
data(Brown.tfl)
data(Brown.spc)

## a spectrum from a tfl
Brown.spc2 <- tfl2spc(Brown.tfl)

## identical to Brown.spc:
summary(Brown.spc)
summary(Brown.spc2)
tail(Brown.spc)
tail(Brown.spc2)

## a tfl from a spectrum
Brown.tfl2 <- spc2tfl(Brown.spc)

## same frequency information as Brown.tfl
## but with different ids and no type labels
summary(Brown.tfl)
summary(Brown.tfl2)
print(Brown.tfl2)
print(Brown.tfl)
In the zipfR
library, tfl
objects are used to represent
a type frequency list, which specifies the observed frequency of each
type in a corpus. For mathematical reasons, expected type frequencies
are rarely considered.
With the tfl
constructor function, an object can be initialized
directly from the specified data vectors. It is more common to read
a type frequency list from a disk file with read.tfl
or,
in some cases, derive it from an observed frequency spectrum with
spc2tfl
.
tfl
objects should always be treated as read-only.
tfl(f, k=seq_along(f), type=NULL, f.min=min(f), f.max=max(f),
    incomplete=!(missing(f.min) && missing(f.max)), N=NA, V=NA,
    delete.zeros=FALSE)
k | integer vector of type IDs |
f | vector of corresponding type frequencies |
type | optional character vector of type representations (e.g. word forms or lemmata), used for informational and printing purposes only |
incomplete | indicates that the type frequency list is incomplete, i.e. it only contains types in a certain frequency range (typically, the lowest-frequency types may be excluded). Incomplete type frequency lists are rarely useful. |
N, V | sample size and vocabulary size corresponding to the type frequency list; these have to be specified explicitly for incomplete lists |
f.min, f.max | frequency range represented in an incomplete type frequency list (see details below) |
delete.zeros | if TRUE, types with frequency f = 0 are deleted from the type frequency list without marking it as incomplete |
If f.min
and f.max
are not specified, but the list is
marked as incomplete (with incomplete=TRUE
), they are
automatically determined from the frequency vector f
(making
the assumption that all types in this frequency range are listed).
Explicit specification of either f.min
or f.max
implies
an incomplete list. In this case, all types outside the specified
range will be deleted from the list. If incomplete=FALSE
is
explicitly given, N
and V
will be determined
automatically from the input data (which is assumed to be complete),
but the resulting type frequency list will still be incomplete.
If you just want to remove types with f = 0 without marking the type frequency list as incomplete, use the option delete.zeros=TRUE.
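A minimal sketch of both cases, using a made-up frequency vector (not a package dataset):

## made-up type frequencies, including two unseen types (f = 0)
f <- c(10, 6, 3, 1, 1, 0, 0)

## drop the f = 0 types without marking the list as incomplete
tfl.clean <- tfl(f, delete.zeros=TRUE)
summary(tfl.clean)

## an incomplete list restricted to f >= 2; N and V of the full
## data set must then be supplied explicitly
tfl.inc <- tfl(f, f.min=2, N=sum(f), V=sum(f > 0))
summary(tfl.inc)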
A tfl
object is a data frame with the following variables:
k
integer type ID
f
corresponding type frequency
type
optional: character vector with type representations used for printing
The data frame always has to be sorted with respect to the k
column (ascending order). If a type
column is present,
rownames are set to the types and can be used for character indexing.
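For instance, with a type frequency list that carries type labels (the word forms picked here are merely assumed to occur in the Brown corpus):

## character indexing via type rownames
data(Brown.tfl)
Brown.tfl[c("time", "money"), ]   # rows for the selected types (hypothetical words)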
The following attributes are used to store additional information about the type frequency list:
N, V
sample size and vocabulary size
corresponding to the type frequency list. For a complete list,
these values could easily be determined from the
f
variable, but they are essential for an incomplete list.
incomplete
if TRUE
, the type frequency list is
incomplete, i.e. it lists only types in the frequency range given
by f.min
and f.max
f.min
, f.max
range of type frequencies
represented in the list (should be ignored unless the
incomplete
flag is set)
hasTypes
indicates whether or not the type
variable is present
An object of class tfl
representing the specified type
frequency list. This object should be treated as read-only (although
such behaviour cannot be enforced in R).
read.tfl
, write.tfl
, plot.tfl
,
sample.tfl
, spc2tfl
, tfl2spc
Generic methods supported by tfl
objects are
print
, summary
, N
,
V
and Vm
.
Implementation details and non-standard arguments for these methods
can be found on the manpages print.tfl
,
summary.tfl
, N.tfl
, V.tfl
,
etc.
## typically, you will read a tfl from a file
## (see examples in the read.tfl manpage)

## or you can load a ready-made tfl
data(Brown.tfl)
summary(Brown.tfl)
print(Brown.tfl)

## or create it from a spectrum (with different ids and
## no type labels)
data(Brown.spc)
Brown.tfl2 <- spc2tfl(Brown.spc)
## same frequency information as Brown.tfl
## but with different ids and no type labels
summary(Brown.tfl2)
print(Brown.tfl2)

## how to draw a Zipf rank/frequency plot
## by extracting frequencies from a tfl
plot(sort(Brown.tfl$f,decreasing=TRUE),log="y",xlab="rank",ylab="frequency")

## simulating a tfl
Zipfian.tfl <- tfl(1000/(1:1000))
plot(Zipfian.tfl$f,log="y")
Objects of classes tfl
, spc
and
vgc
that contain frequency data for the syntactic
expansions of Noun Phrases (NP) and Prepositional Phrases (PP) in
the Tiger German treebank.
TigerNP.tfl
TigerNP.spc
TigerNP.emp.vgc

TigerPP.tfl
TigerPP.spc
TigerPP.emp.vgc
In this dataset, types are not words, but syntactic expansions,
i.e., sequences of syntactic categories that form NPs (in
TigerNP
) or PPs (in TigerPP
), according to the Tiger
annotation scheme for German. Thus, for example, among the expansion
types in the TigerNP
dataset, we find ART_NN
and
ART_ADJA_NN
, whereas among the PP expansions in
TigerPP
we find APPR_ART_NN
and APPR_NN
(APPR
is the tag for prepositions in the Tiger tagset).
The Tiger treebank contains about 900,000 tokens (50,000 sentences) of German newspaper text from the Frankfurter Rundschau. The token frequencies of the expansion types are taken from this corpus.
TigerNP.tfl
and TigerPP.tfl
are the type frequency
lists. TigerNP.spc
and TigerPP.spc
are frequency
spectra. TigerNP.emp.vgc
and TigerPP.emp.vgc
are the
corresponding observed vocabulary growth curves (tracking the
development of V
and V(1)
in the original order of
occurrence of the expansion tokens in the source corpus).
Tiger Project: https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/
TigerNP.tfl
summary(TigerNP.spc)
summary(TigerNP.emp.vgc)

TigerPP.tfl
summary(TigerPP.spc)
summary(TigerPP.emp.vgc)
Compute type-frequency list, frequency spectrum and vocabulary growth curve from a token vector representing a random sample or an observed sequence of tokens.
vec2tfl(x)
vec2spc(x)
vec2vgc(x, steps=200, stepsize=NA, m.max=0)
x | a vector of length N, representing a random sample of N tokens (typically a character vector or factor) |
steps | number of steps, i.e. sample sizes at which vocabulary growth data are calculated (default: 200) |
stepsize | alternative way of specifying the steps of the vocabulary growth curve. In this case, vocabulary growth data will be calculated every stepsize tokens. |
m.max | an integer in the range 1 ... 9, specifying how many spectrum elements V_m to include in the vocabulary growth curve (default: none) |
There are two main applications for the vec2xxx
functions:
They can be used to calculate type-token statistics and
vocabulary growth curves for random samples generated from a LNRE
model (with the rlnre
function).
They provide an easy way to process a user's own data without having to rely on external scripts to compute frequency spectra and vocabulary growth curves. All that is needed is a text file in one-token-per-line format (i.e. where each token is given on a separate line). See "Examples" below for further hints.
Both applications work well for samples of up to approx. 1 million
tokens. For considerably larger data sets, specialized external
software should be used, such as the Perl scripts provided on the
zipfR
homepage.
An object of class tfl
, spc
or vgc
, representing
the type frequency list, frequency spectrum or vocabulary growth curve
of the token vector x
, respectively.
tfl
, spc
and vgc
for more
information about type frequency lists, frequency spectra and
vocabulary growth curves
rlnre
for generating random samples (in the form of the
required token vectors) from a LNRE model
readLines
and scan
for loading token
vectors from disk files
## type-token statistics for random samples from a LNRE distribution
model <- lnre("fzm", alpha=.5, A=1e-6, B=.05)
x <- rlnre(model, 100000)

vec2tfl(x)
vec2spc(x)  # same as tfl2spc(vec2tfl(x))
vec2vgc(x)

sample.spc <- vec2spc(x)
exp.spc <- lnre.spc(model, 100000)
plot(exp.spc, sample.spc)

sample.vgc <- vec2vgc(x, m.max=1, steps=500)
exp.vgc <- lnre.vgc(model, N=N(sample.vgc), m.max=1)
plot(exp.vgc, sample.vgc, add.m=1)

## Not run:
## load token vector from a file in one-token-per-line format
x <- readLines(filename)
x <- readLines(file.choose())  # with file selection dialog

## you can also perform whitespace tokenization and filter the data
brown <- scan("brown.pos", what=character(0), quote="")
nouns <- grep("/NNS?$", brown, value=TRUE)
plot(vec2spc(nouns))
plot(vec2vgc(nouns, m.max=1), add.m=1)
## End(Not run)
In the zipfR
library, vgc
objects are used to represent
a vocabulary growth curve (VGC). This can be an observed VGC from an
incremental set of samples (such as a corpus), a randomized VGC
obtained by binomial interpolation, or the expected VGC according to a
LNRE model.
With the vgc
constructor function, an object can be initialized
directly from the specified data vectors. It is more common to read
an observed VGC from a disk file with read.vgc
, generate
a randomized VGC with vgc.interp
or compute an expected
VGC with lnre.vgc
, though.
vgc
objects should always be treated as read-only.
vgc(N, V, Vm=NULL, VV=NULL, VVm=NULL, expected=FALSE, check=TRUE)
N | integer vector of sample sizes N |
V | vector of corresponding vocabulary sizes V(N) |
Vm | optional list of growth vectors for the lowest spectrum elements V_m(N) (the first list element is the hapax count V_1(N), etc.) |
VV | optional vector of variances Var[V(N)] for an expected or interpolated VGC |
VVm | optional list of variance vectors Var[V_m(N)], one for each growth vector in Vm |
expected | if TRUE, the object represents an interpolated or expected VGC rather than an observed one |
check | by default, various sanity checks are performed on the data supplied to the vgc constructor; set check=FALSE to skip them |
If variances (VV
or VVm
) are specified for an expected
VGC, all relevant vectors must be given. In other words, VV
always has to be present in this case, and VVm
has to be
present whenever Vm
is specified, and must contain vectors for
exactly the same frequency classes.
V and Vm are integer vectors for an observed VGC, but will usually be fractional for an interpolated or expected VGC.
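A minimal sketch of direct construction with toy numbers (in practice, read.vgc, vgc.interp or lnre.vgc are used instead, as noted above):

## toy vocabulary growth data at four sample sizes
N <- c(1000, 2000, 3000, 4000)
V <- c(380, 680, 920, 1120)
V1 <- c(250, 420, 540, 630)           # growth of the hapax count V_1
my.vgc <- vgc(N=N, V=V, Vm=list(V1))
summary(my.vgc)
plot(my.vgc, add.m=1)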
A vgc
object is a data frame with the following variables:
N
sample size
V
corresponding vocabulary size (either the observed vocabulary size V(N) or the expected vocabulary size E[V(N)])
V1 ... V9
optional: observed or expected spectrum elements (V_m(N) or E[V_m(N)]). Not all of these variables have to be present, but there must not be any "gaps" in the spectrum.
VV
optional: variance of the expected vocabulary size, Var[V(N)]
VV1 ... VV9
optional: variances of the expected spectrum elements, Var[V_m(N)]. If variances are present, they must be available for exactly the same frequency classes as the corresponding expected values.
The following attributes are used to store additional information about the vocabulary growth curve:
m.max
if non-zero, the VGC includes spectrum elements V_m(N) for m up to m.max. For m.max=0, no spectrum elements are present.
expected
if TRUE
, the object represents an
interpolated or expected VGC, with expected vocabulary size and
spectrum elements. Otherwise, the object represents an observed
VGC.
hasVariances
indicates whether or not the VV
variable is present (as well as VV1
, VV2
, etc., if
appropriate)
An object of class vgc
representing the specified vocabulary
growth curve. This object should be treated as read-only (although
such behaviour cannot be enforced in R).
read.vgc
, write.vgc
, plot.vgc
,
vgc.interp
, lnre.vgc
Generic methods supported by vgc
objects are
print
, summary
, N
,
V
, Vm
, VV
, and
VVm
.
Implementation details and non-standard arguments for these methods
can be found on the manpages print.vgc
,
summary.vgc
, N.vgc
, V.vgc
,
etc.
## load Dickens' work empirical vgc and take a look at it
data(Dickens.emp.vgc)
summary(Dickens.emp.vgc)
print(Dickens.emp.vgc)
plot(Dickens.emp.vgc,add.m=1)

## vectors of sample sizes in the vgc, and the
## corresponding V and V_1 vectors
Ns <- N(Dickens.emp.vgc)
Vs <- V(Dickens.emp.vgc)
V1s <- Vm(Dickens.emp.vgc,1)

## binomially interpolated V and V_1 at the same sample sizes
## as the empirical curve
data(Dickens.spc)
Dickens.bin.vgc <- vgc.interp(Dickens.spc,N(Dickens.emp.vgc),m.max=1)

## compare observed and interpolated
plot(Dickens.emp.vgc,Dickens.bin.vgc,add.m=1,legend=c("observed","interpolated"))

## load Italian ultra- prefix data
data(ItaUltra.spc)

## compute zm model
zm <- lnre("zm",ItaUltra.spc)

## compute vgc up to about twice the sample size
## with variance of V
zm.vgc <- lnre.vgc(zm,(1:100)*70, variances=TRUE)
summary(zm.vgc)
print(zm.vgc)

## plot with confidence intervals derived from variance in
## vgc (with larger datasets, ci will typically be almost
## invisible)
plot(zm.vgc)

## for more examples of vgc usage, see manpages of lnre.vgc,
## plot.vgc, print.vgc and vgc.interp
vgc.interp computes the expected vocabulary growth curve for a random sample taken from a data set described by the frequency spectrum object obj.
vgc.interp(obj, N, m.max=0, allow.extrapolation=FALSE)
obj | an object of class spc, representing a frequency spectrum |
N | a vector of increasing non-negative integers specifying the sample sizes for which the expected vocabulary size is calculated (as well as expected spectrum elements if requested) |
m.max | an integer in the range 1 ... 9, specifying how many spectrum elements V_m to include in the vocabulary growth curve (default: none) |
allow.extrapolation | if TRUE, the requested sample sizes N may be larger than the sample size of the spectrum obj, so that binomial extrapolation is performed (not recommended; see the EV.spc manpage) |
See the EV.spc
manpage for more information, especially
concerning binomial extrapolation.
Note that the result of vgc.interp
is an object of class
vgc
(a vocabulary growth curve), but its input is an
object of class spc
(a frequency spectrum).
An object of class vgc
, representing the expected vocabulary
growth curves for random samples taken from the data set described by
obj
. Data points will be generated for the specified sample
sizes N
.
vgc
for more information about vocabulary growth curves
and links to relevant functions; spc
for more
information about frequency spectra
The implementation of vgc.interp
is based on the functions
EV.spc
and EVm.spc
. See the respective
manpages for technical details.
spc.interp
computes the expected frequency spectrum for
a random sample by binomial interpolation.
## load the Tiger PP expansion spectrum
## (sample size: about 91k tokens)
data(TigerPP.spc)

## binomially interpolated curve
TigerPP.bin.vgc <- vgc.interp(TigerPP.spc,(1:100)*910)
summary(TigerPP.bin.vgc)

## let's also add growth of V_1 to V_5 and plot
TigerPP.bin.vgc <- vgc.interp(TigerPP.spc,(1:100)*910,m.max=5)
plot(TigerPP.bin.vgc,add.m=c(1:5))
VV
and VVm
are generic methods that can (and should) be
used to compute the variance of the vocabulary size and the variances
of spectrum elements according to an LNRE model (i.e. an object of
class lnre
). These methods are also used to access variance
information stored in some objects of class spc
and vgc
.
VV(obj, N=NA, ...)
VVm(obj, m, N=NA, ...)
obj | an object of class lnre, or an object of class spc or vgc including variance data |
m | positive integer value determining the frequency class m for which the variance is returned (or a vector of such values) |
N | sample size N for which the variances are calculated (only allowed for LNRE models) |
... | additional arguments passed on to the method implementation (see respective manpages for details) |
spc
and vgc
objects must represent an expected or
interpolated frequency spectrum or VGC, and must include variance
data.
For vgc
objects, the VVm
method allows only a single
value m
to be specified.
The argument N
is only allowed for LNRE models and will trigger
an error message otherwise.
For a LNRE model (class lnre), VV computes the variance Var[V(N)] of the random variable V(N) (vocabulary size), and VVm computes the variances Var[V_m(N)] of the random variables V_m(N) (spectrum elements), for a sample of specified size N.
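For instance, along the lines of the lnre examples (assuming the ItaUltra.spc dataset shipped with the package):

## variances under a ZM model estimated from ItaUltra.spc
data(ItaUltra.spc)
zm <- lnre("zm", ItaUltra.spc)
VV(zm, N=1e+6)       # variance of the vocabulary size V at N = 1,000,000
VVm(zm, 1, N=1e+6)   # variance of the hapax count V_1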
For an observed or interpolated frequency spectrum (class spc
),
VV
returns the variance of the expected vocabulary size, and
VVm
returns variances of the spectrum elements. These methods
are only applicable if the spc
object includes variance
information.
For an expected or interpolated vocabulary growth curve (class vgc), VV returns the variance vector of the expected vocabulary sizes, Var[V(N)], and VVm the corresponding vector for Var[V_m(N)]. These methods are only applicable if the vgc object includes variance information.
For details on the implementations of these methods, see VV.spc
, VV.vgc
, etc.
Expected vocabulary size and frequency spectrum for a sample of size N according to a LNRE model can be computed with the analogous methods EV and EVm. For spc and vgc objects, V and V_m are always accessed with the methods V and Vm, even if they represent expected or interpolated values.
## see lnre documentation for examples
Set default graphics parameters for zipfR
high-level plots and
plot utilities, similar to par
for general graphics parameters.
The current parameter values are queried by giving their names as
character strings. The values can be set by specifying them as
arguments in name=value
form, or by passing a single list of
named values.
NB: This is an advanced function to fine-tune zipfR
plots. For basic plotting options (that are likely to be sufficient
for most purposes) see plot.spc
and
plot.vgc
instead.
zipfR.par(..., bw.mode=FALSE)
... | either character strings (or vectors) specifying the names of parameters to be queried, or parameters to be set in name=value form |
bw.mode | if TRUE, the B/W versions of the queried parameters are returned (i.e. the .bw variants such as col.bw and lty.bw) |
Parameters are set by specifying their names and the new values as
name=value
pairs. Such a list can also be passed as a single
argument to zipfR.par
, which is typically used to restore previous
parameter values (that have been saved in a list variable).
Most of the default values can be manually overridden in the high-level plots.
zipfR.par()
shows all parameters with their current values, and
names(zipfR.par())
produces a listing of valid parameter names.
When parameters are set, their former values are returned in an
invisible named list. Such a list can be passed as a single argument
to zipfR.par
to restore the parameter values.
When a single parameter is queried, its value is returned directly. When two or more parameters are queried, the result is a named list.
Note the inconsistency, which is the same as for par
: setting
one parameter returns a list, but querying one parameter returns a
vector (or a scalar, i.e. a vector of length 1).
col
a character or integer vector specifying up to 10
line colours (see the par
manpage for
details). Values of shorter vectors are recycled as necessary.
lty
a character or integer vector specifying up to 10
line styles (see the par
manpage for
details). Values of shorter vectors are recycled as necessary.
lwd
a numeric vector specifying up to 10 line widths
(see the par
manpage for details). Values of shorter
vectors are recycled as necessary.
pch
a character or integer vector specifying up to 10 plot symbols. Values of shorter vectors are recycled as necessary.
barcol
a character or integer vector specifying up to 10 colours for the bars in non-logarithmic spectrum plots. Values of shorter vectors are recycled as necessary.
col.bw
the line colours used in B/W mode
(bw=TRUE
)
lty.bw
the line styles used in B/W mode
(bw=TRUE
)
lwd.bw
the line widths used in B/W mode
(bw=TRUE
)
pch.bw
the plot symbols used in B/W mode
(bw=TRUE
)
barcol.bw
the bar colours used in B/W mode
(bw=TRUE
)
bw
if TRUE, plots are drawn in B/W mode unless specified otherwise (default: FALSE, i.e. colour mode)
device
plot device used by the zipfR
plot utilities
(see zipfR.begin.plot
for details). Currently
supported devices are x11
(default on most platforms), eps
,
pdf
, as well as png
and quartz
where
available (default on Mac OS X).
init.par
list of named graphics parameters passed to
the par
function whenever a new viewport is created with
zipfR.begin.plot
width
, height
default width and height of the
plotting window opened by zipfR.begin.plot
plot.spc
, plot.vgc
,
zipfR.begin.plot
, zipfR.end.plot
print(names(zipfR.par()))          # list available parameters
zipfR.par("col", "lty", "lwd")     # the default line styles
zipfR.par(c("col", "lty", "lwd"))  # works as well

## temporary changes to graphics parameters:
par.save <- zipfR.par(bw=TRUE, lwd.bw=2)
## plots use the modified parameters here
zipfR.par(par.save)                # restore previous values
These functions are deprecated and should not be used in new code.
Conveniently create plots with different layout and in different output formats (both on-screen and various graphics file formats).
Each plot is wrapped in a pair of zipfR.begin.plot
and
zipfR.end.plot
commands, which make sure that a suitable
plotting window / image file is opened and closed as required. Format
and dimensions of the plots are controlled by global settings made
with zipfR.par
, but can be overridden in the
zipfR.begin.plot
call.
zipfR.pick.device
automatically selects a default device by
scanning the specified vector for strings of the form --pdf
,
--eps
, etc.
NB: These are advanced functions intended to make it easier
to produce plots in different formats. Most users will only need
the basic plotting functionalities provided by plot.tfl
,
plot.spc
and plot.vgc
.
zipfR.pick.device(args=commandArgs())

zipfR.begin.plot(device=zipfR.par("device"), filename="",
                 width=zipfR.par("width"), height=zipfR.par("height"),
                 bg=zipfR.par("bg"), pointsize=zipfR.par("pointsize"))
## plotting commands go here
zipfR.end.plot(shutdown=FALSE)
args | a character vector, which will be scanned for strings of the form --pdf, --eps, etc. (defaults to the command-line arguments of the current R session) |
device | name of plotting device to be used (see "Devices" below) |
filename | for graphics file devices, basename of the output file. A suitable extension for the selected file format will be added automatically to filename. |
width, height | width and height of the plotting window or image, in inches |
bg | background colour of the plotting window or image (e.g. bg="transparent") |
pointsize | default point size for text in the plot |
shutdown | if set to FALSE (the default), on-screen plot devices will be kept open for re-use in the next plot. Specify TRUE to close the screen device explicitly. |
zipfR.begin.plot
opens a new plotting window or image file of
the specified dimensions (width
, height
), using the
selected graphics device (device
). Background colour
(bg
) and default point size (pointsize
) are set as
requested. Then, any global graphics parameter settings (defined with
the init.par
option of zipfR.par
) are applied.
See the zipfR.par
manpage for the "factory default"
settings of these options.
zipfR.end.plot
finalizes the current plot. For image file
devices, the device will be closed, writing the generated file to
disk. For screen devices, the plotting window remains visible until a
new plot is started (which will close and re-open the plotting
window).
The main purpose of the zipfR
plotting utilities is to make it
easier to draw plots that are both shown on screen (for interactive
work) and saved to image files in various formats. If an R script
specifies filename
s in all zipfR.begin.plot
commands, a
single global parameter setting at the start of the script is
sufficient to switch from screen graphics to EPS files, or any other
supported file format.
On-screen plotting devices are platform-dependent, and there may be
different devices available depending on which version of R is used.
For this reason, zipfR.begin.plot
no longer allows users to
pick an on-screen device explicitly, but rather opens a default device
with dev.new
. Note that this default device may write
output to a graphics file, but is usually set to a suitable on-screen
device in an interactive R session. In any case, users can change the
default by setting options(device=...)
. For backwards-compatibility,
the device name x11
(and quartz on macOS) is accepted for the default graphics device.
The png
bitmap device may not be available on all platforms,
and may also require access to an X server. Since the width and
height of a PNG device have to be specified in pixels rather than
inches, zipfR.begin.plot
translates the width
and
height
settings, assuming a resolution of 150 dpi. Use of
the png
device is strongly discouraged. A better way of
producing high-quality bitmaps is to generate EPS images (with the
eps
device) and convert them to PNG or JPEG format with the
external pstoimg
program (part of the latex2html
distribution).
zipfR.pick.device
will issue a warning if multiple flags
matching supported graphics devices are found. However, it is not an
error to find no matching flag, and all unrecognized strings are
silently ignored.
zipfR.begin.plot
invisibly returns the ID of the active plot device.
Currently, the following devices are supported (and can be used in the
device
argument).
Screen devices:
x11
opens the default graphic device set by
getOption("device")
. In an interactive R sessions,
this will usually be a suitable on-screen device.
quartz
accepted as an alias for x11
on
macOS platforms
Graphics file devices:
eps
Encapsulated PostScript (EPS) output (using
postscript
device with appropriate settings)
pdf
PDF output
png
PNG bitmap file (may not be available on all platforms)
Devices, dev.new
, postscript
,
pdf
and png
for more information about the
supported graphics devices
zipfR-specific plotting commands are plot.tfl, plot.spc and plot.vgc
## Not run:
## these graphics parameters will be set for every new plot
zipfR.par(init.par=list(bg="lightblue", cex=1.3))
zipfR.par(width=12, height=9)

## will be shown on screen or saved to specified file, depending on
## selected device (eps -> "myplot.eps", pdf -> "myplot.pdf", etc.)
zipfR.begin.plot(filename="myplot")
plot.spc(Brown100k.spc)
zipfR.end.plot()

## By starting an R script "myplots.R" with this command, you can
## display plots on screen when stepping through the script in an
## interactive session, or save them to disk files in various
## graphics formats with "R --no-save --args --pdf < myplots.R" etc.
zipfR.pick.device()
## End(Not run)