Title: Statistics and Data Sets for Corpus Frequency Data
Description: Utility functions for the statistical analysis of corpus frequency data. This package is a companion to the open-source course "Statistical Inference: A Gentle Introduction for Computational Linguists and Similar Creatures" ('SIGIL').
Authors: Stephanie Evert [cre, aut]
Maintainer: Stephanie Evert <[email protected]>
License: GPL-3
Version: 0.6
Built: 2024-10-29 05:46:34 UTC
Source: https://github.com/r-forge/sigil
The corpora
package provides a collection of functions for statistical inference
from corpus frequency data, as well as some convenience functions and example data sets.
It is a companion package to the open-source course Statistical Inference: a Gentle Introduction for Linguists and similar creatures originally developed by Marco Baroni and Stephanie Evert. Statistical methods implemented in the package are described and illustrated in the units of this course.
Starting with version 0.6 the package also includes best-practice implementations of various corpus-linguistic analysis techniques.
An overview of some important functions and data sets included in the corpora
package.
See the package index for a complete listing.
keyness() provides reference implementations for best-practice keyness measures, including the recommended LRC measure (Evert 2022)
binom.pval() is a vectorised function that computes p-values of the binomial test more efficiently than binom.test (using central p-values in the two-sided case)
fisher.pval() is a vectorised function that efficiently computes p-values of Fisher's exact test on contingency tables for large samples (using central p-values in the two-sided case)
prop.cint() is a vectorised function that computes multiple binomial confidence intervals much more efficiently than binom.test
z.score() and z.score.pval() can be used to carry out a z-test for a single proportion (as an approximation to binom.test)
chisq() and chisq.pval() are vectorised functions that compute the test statistic and p-value of a chi-squared test for contingency tables more efficiently than chisq.test
cont.table() creates contingency tables for frequency comparison tests that can be passed to chisq.test and fisher.test
sample.df() extracts random samples of rows from a data frame
qw() splits a string on whitespace or a user-specified regular expression (similar to Perl's qw// construct)
corpora.palette() provides some nice colour palettes (better than R's default colours)
rowVector() and colVector() convert a vector into a single-row or single-column matrix
Several data sets based on the British National Corpus, including complete metadata for all 4048 text files (BNCmeta), per-text frequency counts for a number of linguistic corpus queries (BNCqueries), and relative frequencies of 65 lexico-grammatical features for each text (BNCbiber)
Frequency counts of passive constructions in all texts of the Brown and LOB corpora (BrownLOBPassives) for frequency comparison with regression models, complemented by distributional features (DistFeatBrownFam) as additional predictors
A small text corpus of Very Short Stories in the form of a data frame VSS, with one row for each token in the corpus
Small example tables to illustrate frequency comparison of lexical items (BNCcomparison) and collocation analysis (BNCInChargeOf)
KrennPPV is a data set of German verb-preposition-noun collocation candidates with manual annotation of true positives and pre-computed association scores
Three functions for generating large synthetic data sets used in the SIGIL course: simulated.census(), simulated.language.course() and simulated.wikipedia()
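As a quick illustration (a minimal sketch, not taken from the official examples), a few of the utilities listed above can be combined as follows:
library(corpora)
words <- qw("alpha beta gamma")          # split a string on whitespace
prop.cint(19, 100, method="binomial")    # Clopper-Pearson confidence interval
chisq.pval(99, 1000, 36, 1000)           # chi-squared test for a frequency comparison
corpora.palette("seaborn")               # a colour palette for plots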
Stephanie Evert (https://purl.org/stephanie.evert)
The official homepage of the corpora
package and the SIGIL course is http://SIGIL.R-Forge.R-Project.org/.
This function computes the p-value of a binomial test for frequency
counts. In the two-sided case, a “central” p-value (Fay 2010)
provides better numerical efficiency than the likelihood-based approach
of binom.test
and is always consistent with confidence intervals.
binom.pval(k, n, p = 0.5, alternative = c("two.sided", "less", "greater"))
k |
frequency of a type in the corpus (or an integer vector of frequencies) |
n |
number of tokens in the corpus, i.e. sample size (or an integer vector specifying the sizes of different samples) |
p |
null hypothesis, giving the assumed proportion of this type in the population (or a vector of proportions for different types and/or different populations) |
alternative |
a character string specifying the alternative hypothesis; must be one of "two.sided" (default), "less" or "greater" |
For alternative="two.sided"
(the default), a “central” p-value
is computed (Fay 2010: 53f), which differs from the likelihood-based two-sided
p-value determined by binom.test
(the “minlike” method in Fay's
terminology). This approach has two advantages: (i) it is numerically robust
and efficient, even for very large samples and frequency counts; (ii) it is
always consistent with Clopper-Pearson confidence intervals (see examples below).
The p-value of a binomial test applied to the given data (or a vector of p-values).
Stephanie Evert (https://purl.org/stephanie.evert)
Fay, Michael P. (2010). Two-sided exact tests and matching confidence intervals for discrete data. The R Journal, 2(1), 53-58.
# inconsistency btw likelihood-based two-sided binomial test and confidence interval
binom.test(2, 10, p=0.555)
# central two-sided test as implemented by binom.pval is always consistent
binom.pval(2, 10, p=0.555)
prop.cint(2, 10, method="binomial")
This data set contains a table of the relative frequencies (per 1000 words) of 65 linguistic features (Biber 1988, 1995) for each text document in the British National Corpus (Aston & Burnard 1998).
Biber (1988) introduced these features for the purpose of a multidimensional register analysis. Variables in the data set are numbered according to Biber's list (see e.g. Biber 1995, 95f).
Feature frequencies were automatically extracted from the British National Corpus using query patterns based on part-of-speech tags (Gasthaus 2007). Note that features 60 and 65 had to be omitted because they cannot be identified with sufficient accuracy by the automatic methods. For further information on the extraction methodology, see Gasthaus (2007, 20-21). The original data set and the Python scripts used for feature extraction are available from https://portal.ikw.uni-osnabrueck.de/~CL/download/BSc_Gasthaus2007/; the version included here contains some bug fixes.
BNCbiber
A numeric matrix with 4048 rows and 65 columns, specifying the relative frequencies
(per 1000 words) of 65 linguistic features. Documents are listed in the same order
as the metadata in BNCmeta
and rows are labelled with text IDs, so it
is straightforward to combine the two data sets.
A. Tense and aspect markers | |
f_01_past_tense |
Past tense |
f_02_perfect_aspect |
Perfect aspect |
f_03_present_tense |
Present tense |
B. Place and time adverbials | |
f_04_place_adverbials |
Place adverbials (e.g., above, beside, outdoors) |
f_05_time_adverbials |
Time adverbials (e.g., early, instantly, soon) |
C. Pronouns and pro-verbs | |
f_06_first_person_pronouns |
First-person pronouns |
f_07_second_person_pronouns |
Second-person pronouns |
f_08_third_person_pronouns |
Third-person personal pronouns (excluding it) |
f_09_pronoun_it |
Pronoun it |
f_10_demonstrative_pronoun |
Demonstrative pronouns (that, this, these, those as pronouns) |
f_11_indefinite_pronoun |
Indefinite pronouns (e.g., anybody, nothing, someone) |
f_12_proverb_do |
Pro-verb do |
D. Questions | |
f_13_wh_question |
Direct wh-questions |
E. Nominal forms | |
f_14_nominalization |
Nominalizations (ending in -tion, -ment, -ness, -ity) |
f_15_gerunds |
Gerunds (participial forms functioning as nouns) |
f_16_other_nouns |
Total other nouns |
F. Passives | |
f_17_agentless_passives |
Agentless passives |
f_18_by_passives |
by-passives |
G. Stative forms | |
f_19_be_main_verb |
be as main verb |
f_20_existential_there |
Existential there |
H. Subordination features | |
f_21_that_verb_comp |
that verb complements (e.g., I said that he went.) |
f_22_that_adj_comp |
that adjective complements (e.g., I'm glad that you like it.) |
f_23_wh_clause |
wh-clauses (e.g., I believed what he told me.) |
f_24_infinitives |
Infinitives |
f_25_present_participle |
Present participial adverbial clauses (e.g., Stuffing his mouth with cookies, Joe ran out the door.) |
f_26_past_participle |
Past participial adverbial clauses (e.g., Built in a single week, the house would stand for fifty years.) |
f_27_past_participle_whiz |
Past participial postnominal (reduced relative) clauses (e.g., the solution produced by this process) |
f_28_present_participle_whiz |
Present participial postnominal (reduced relative) clauses (e.g., the event causing this decline) |
f_29_that_subj |
that relative clauses on subject position (e.g., the dog that bit me) |
f_30_that_obj |
that relative clauses on object position (e.g., the dog that I saw) |
f_31_wh_subj |
wh relatives on subject position (e.g., the man who likes popcorn) |
f_32_wh_obj |
wh relatives on object position (e.g., the man who Sally likes) |
f_33_pied_piping |
Pied-piping relative clauses (e.g., the manner in which he was told) |
f_34_sentence_relatives |
Sentence relatives (e.g., Bob likes fried mangoes, which is the most disgusting thing I've ever heard of.) |
f_35_because |
Causative adverbial subordinator (because) |
f_36_though |
Concessive adverbial subordinators (although, though) |
f_37_if |
Conditional adverbial subordinators (if, unless) |
f_38_other_adv_sub |
Other adverbial subordinators (e.g., since, while, whereas) |
I. Prepositional phrases, adjectives and adverbs | |
f_39_prepositions |
Total prepositional phrases |
f_40_adj_attr |
Attributive adjectives (e.g., the big horse) |
f_41_adj_pred |
Predicative adjectives (e.g., The horse is big.) |
f_42_adverbs |
Total adverbs |
J. Lexical specificity | |
f_43_type_token |
Type-token ratio (including punctuation) |
f_44_mean_word_length |
Average word length (across tokens, excluding punctuation) |
K. Lexical classes | |
f_45_conjuncts |
Conjuncts (e.g., consequently, furthermore, however) |
f_46_downtoners |
Downtoners (e.g., barely, nearly, slightly) |
f_47_hedges |
Hedges (e.g., at about, something like, almost) |
f_48_amplifiers |
Amplifiers (e.g., absolutely, extremely, perfectly) |
f_49_emphatics |
Emphatics (e.g., a lot, for sure, really) |
f_50_discourse_particles |
Discourse particles (e.g., sentence-initial well, now, anyway) |
f_51_demonstratives |
Demonstratives |
L. Modals | |
f_52_modal_possibility |
Possibility modals (can, may, might, could) |
f_53_modal_necessity |
Necessity modals (ought, should, must) |
f_54_modal_predictive |
Predictive modals (will, would, shall) |
M. Specialized verb classes | |
f_55_verb_public |
Public verbs (e.g., assert, declare, mention) |
f_56_verb_private |
Private verbs (e.g., assume, believe, doubt, know) |
f_57_verb_suasive |
Suasive verbs (e.g., command, insist, propose) |
f_58_verb_seem |
seem and appear |
N. Reduced forms and dispreferred structures | |
f_59_contractions |
Contractions |
n/a | Subordinator that deletion (e.g., I think [that] he went.) |
f_61_stranded_preposition |
Stranded prepositions (e.g., the candidate that I was thinking of) |
f_62_split_infinitve |
Split infinitives (e.g., He wants to convincingly prove that ...) |
f_63_split_auxiliary |
Split auxiliaries (e.g., They were apparently shown to ...) |
O. Co-ordination | |
f_64_phrasal_coordination |
Phrasal co-ordination (N and N; Adj and Adj; V and V; Adv and Adv) |
n/a | Independent clause co-ordination (clause-initial and) |
P. Negation | |
f_66_neg_synthetic |
Synthetic negation (e.g., No answer is good enough for Jones.) |
f_67_neg_analytic |
Analytic negation (e.g., That's not likely.) |
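For instance, a minimal sketch (not part of the original examples) of combining the feature matrix with the metadata table, relying on the shared row order described above:
# mean relative frequency of agentless passives by text mode (written vs. spoken)
library(corpora)
tapply(BNCbiber[, "f_17_agentless_passives"], BNCmeta$mode, mean)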
Stephanie Evert (https://purl.org/stephanie.evert); feature extractor by Jan Gasthaus (2007).
Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage at http://www.natcorp.ox.ac.uk/.
Biber, Douglas (1988). Variations Across Speech and Writing. Cambridge University Press, Cambridge.
Biber, Douglas (1995). Dimensions of Register Variation: A cross-linguistic comparison. Cambridge University Press, Cambridge.
Gasthaus, Jan (2007). Prototype-Based Relevance Learning for Genre Classification. B.Sc. thesis, Institute of Cognitive Science, University of Osnabrück. Data sets and software available from https://portal.ikw.uni-osnabrueck.de/~CL/download/BSc_Gasthaus2007/.
This data set compares the frequencies of 60 selected nouns in the written and spoken parts of the British National Corpus, World Edition (BNC). Nouns were chosen from three frequency bands, namely the 20 most frequent nouns in the corpus, 20 nouns with approximately 1000 occurrences, and 20 nouns with approximately 100 occurrences.
See Aston & Burnard (1998) for more information about the BNC, or go to http://www.natcorp.ox.ac.uk/.
BNCcomparison
A data frame with 61 rows and the following columns:
noun
:lemmatised noun (aka stem form)
written
:frequency in the written part of the BNC
spoken
:frequency in the spoken part of the BNC
In addition to the 60 nouns, the data set contains a row labelled
OTHER
, which represents the total frequency of all other nouns
in the BNC. This value is needed in order to calculate the sample
sizes of the written and spoken part for frequency comparison tests.
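A hedged sketch of a frequency comparison based on this table (assuming the noun time is among the 60 selected nouns; see also the keyness examples):
# compare the relative frequency of "time" in written vs. spoken BNC
n.written <- sum(BNCcomparison$written)   # sample size, including the OTHER row
n.spoken  <- sum(BNCcomparison$spoken)
k <- subset(BNCcomparison, noun == "time")
chisq.pval(k$written, n.written, k$spoken, n.spoken)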
Stephanie Evert (https://purl.org/stephanie.evert)
Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage at http://www.natcorp.ox.ac.uk/.
This data set gives the number of documents and tokens in each of the 18 domains represented in the British National Corpus, World Edition (BNC). See Aston & Burnard (1998) for more information about the BNC and the domain classification, or go to http://www.natcorp.ox.ac.uk/.
BNCdomains
A data frame with 19 rows and the following columns:
domain
:name of the respective domain in the BNC
documents
:number of documents from this domain
tokens
:total number of tokens in all documents from this domain
For one document in the BNC, the domain classification is missing.
This document is represented by the code Unlabeled
in the data
set.
Marco Baroni <[email protected]>
Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage at http://www.natcorp.ox.ac.uk/.
This data set lists collocations (in the sense of Sinclair 1991) of the phrase in charge of found in the British National Corpus, World Edition (BNC). A span size of 3 and a frequency threshold of 5 were used, i.e. all words that occur at least five times within a distance of three tokens from the key phrase in charge of are listed as collocates. Note that collocations were not allowed to cross sentence boundaries.
See Aston & Burnard (1998) for more information about the BNC, or go to http://www.natcorp.ox.ac.uk/.
BNCInChargeOf
A data frame with 250 rows and the following columns:
collocate
:a collocate of the key phrase in charge of (word form)
f.in
:occurrences of the collocate within a distance of 3 tokens from the key phrase, i.e. inside the span
N.in
:total number of tokens inside the span
f.out
:occurrences of the collocate outside the span
N.out
:total number of tokens outside the span
Punctuation, numbers and any words containing non-alphabetic
characters (except for -
) were not considered as potential
collocates. Likewise, the number of tokens inside / outside the span
given in the columns N.in
and N.out
only includes simple
alphabetic word forms.
Stephanie Evert (https://purl.org/stephanie.evert)
Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage at http://www.natcorp.ox.ac.uk/.
Sinclair, John (1991). Corpus, Concordance, Collocation. Oxford University Press, Oxford.
This data set provides complete metadata for all 4048 texts of the British National Corpus (XML edition). See Aston & Burnard (1998) for more information about the BNC, or go to http://www.natcorp.ox.ac.uk/.
The data have automatically been extracted from the original BNC source files. Some transformations were applied so that all attribute names and their values are given in a human-readable form. The Perl scripts used in the extraction procedure are available from https://cwb.sourceforge.io/install.php#other.
BNCmeta
A data frame with 4048 rows and the columns listed below. Unless specified otherwise, columns are coded as factors.
id
:BNC document ID; character vector
title
:Title of the document; character vector
n_words
:Number of words in the document; integer vector
n_tokens
:Total number of tokens (including punctuation and deleted material); integer vector
n_w
:Number of w-units (words); integer vector
n_c
:Number of c-units (punctuation); integer vector
n_s
:Number of s-units (sentences); integer vector
publication_date
:Publication date
text_type
:Text type
context
:Spoken context
respondent_age
:Age-group of respondent
respondent_class
:Social class of respondent (NRS social grades)
respondent_sex
:Sex of respondent
interaction_type
:Interaction type
region
:Region
author_age
:Author age-group
author_domicile
:Domicile of author
author_sex
:Sex of author
author_type
:Author type
audience_age
:Audience age
domain
:Written domain
difficulty
:Written difficulty
medium
:Written medium
publication_place
:Publication place
sampling_type
:Sampling type
circulation
:Estimated circulation size
audience_sex
:Audience sex
availability
:Availability
mode
:Text mode (written/spoken)
derived_type
:Text class
genre
:David Lee's genre classification
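As a small illustration (not from the package manual), the metadata columns can be summarised or cross-tabulated directly:
# distribution of text mode across David Lee's genre classification (first few rows)
head(with(BNCmeta, table(genre, mode)))
# total number of words per text class
with(BNCmeta, tapply(n_words, derived_type, sum))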
Stephanie Evert (https://purl.org/stephanie.evert)
Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage at http://www.natcorp.ox.ac.uk/.
This data set contains a table of frequency counts obtained with a selection of BNCweb (Hoffmann et al. 2008) queries for each text document in the British National Corpus (Aston & Burnard 1998).
BNCqueries
A data frame with 4048 rows and 12 columns. The first column (id
) contains a character vector of
text IDs, the remaining columns contain integer vectors of the corresponding per-text frequency counts for
various BNCweb queries. Column names ending in .S
indicate sentence counts rather than token counts.
The list below shows the BNCweb query used for each feature in CEQL syntax (Hoffmann et al. 2008, Ch. 6).
id
:text ID
split.inf.S
:number of sentences containing a split infinitive with -ly adverb; query: _TO0 +ly_AV0 _V?I
adv.inf.S
:number of sentences containing a non-split infinitive with -ly adverb; query: +ly_AV0 _TO0 _V?I
superlative.S
:number of sentences containing a superlative adjective; query: the (_AJS | most _AJ0)
past.S
:number of sentences containing a past tense verb; query: _V?D
wh.question.S
:number of wh-questions; query: <s> _[PNQ,AVQ] _{V}
stop.to
:frequency of the expression stop to + verb; query: {stop/V} to _{V}
time
:frequency of the noun time; query: {time/N}
click
:frequency of the verb to click; query: {click/V}
noun
:frequency of common nouns; query: _NN?
nominalization
:frequency of nominalizations; query: +[tion,tions,ment,ments,ity,ities]_NN?
downtoner
:frequency of downtoners; query: [almost,barely,hardly,merely,mildly,nearly,only,partially,partly,practically,scarcely,slightly,somewhat]
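A hedged sketch of how these counts might be normalised with the token counts from BNCmeta (assuming rows can be matched via the id column):
# relative frequency of nominalizations per million words in each text
m <- merge(BNCqueries, BNCmeta[, c("id", "n_words")], by="id")
m$nominalization.pmw <- 1e6 * m$nominalization / m$n_words
head(m[, c("id", "nominalization", "nominalization.pmw")])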
Stephanie Evert (https://purl.org/stephanie.evert)
Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage at http://www.natcorp.ox.ac.uk/.
Hoffmann, Sebastian; Evert, Stefan; Smith, Nicholas; Lee, David; Berglund Prytz, Ylva (2008). Corpus Linguistics with BNCweb – a Practical Guide, volume 6 of English Corpus Linguistics. Peter Lang, Frankfurt am Main. See also http://corpora.lancs.ac.uk/BNCweb/.
This data set contains bigrams of adjacent word forms from the Brown corpus of written American English (Francis & Kucera 1964). Co-occurrence frequencies are specified in the form of an observed contingency table, using the notation suggested by Evert (2008).
Only bigrams that occur at least 5 times in the corpus are included.
BrownBigrams
A data frame with 24167 rows and the following columns:
id
:unique ID of the bigram entry
word1
:the first word form in the bigram (character)
pos1
:part-of-speech category of the first word (factor)
word2
:the second word form in the bigram (character)
pos2
:part-of-speech category of the second word (factor)
O11
:co-occurrence frequency of the bigram (numeric)
O12
:occurrences of the first word without the second (numeric)
O21
:occurrences of the second word without the first (numeric)
O22
:number of bigram tokens containing neither the first nor the second word (numeric)
Part-of-speech categories are identified by single-letter codes, corresponding to the first character of the Penn tagset. Some important POS codes are N (noun), V (verb), J (adjective), R (adverb or particle), I (preposition), D (determiner), W (wh-word) and M (modal).
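For illustration (a sketch, not taken from the package examples), each contingency table can be recast as a frequency comparison between positions after word1 and the rest of the corpus, so the vectorised functions of this package can serve as association measures:
# frequency of word2 after word1 (O11 of O11+O12 tokens) vs. elsewhere (O21 of O21+O22 tokens)
att <- transform(BrownBigrams,
                 score = keyness(O11, O11 + O12, O21, O21 + O22, measure="G2"))
head(att[order(-att$score), c("word1", "word2", "score")])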
Stephanie Evert (https://purl.org/stephanie.evert)
Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58, pages 1212–1248. Mouton de Gruyter, Berlin, New York.
Francis, W. N. and Kucera, H. (1964). Manual of information to accompany a standard sample of present-day edited American English, for use with digital computers. Technical report, Department of Linguistics, Brown University, Providence, RI.
This data set contains frequency counts of passive verb phrases for selected texts from the Brown corpus of written American English (Francis & Kucera 1964) and the comparable LOB corpus of written British English (Johansson et al. 1978).
BrownLOBPassives
A data frame with 622 rows and the following columns:
id
:a unique ID for each text (character)
passive
:number of passive verb phrases
n_w
:total number of words in the text
n_s
:total number of sentences in the text
cat
:genre category code (A
... R
; factor)
genre
:descriptive label for the genre category (factor)
lang
:language variety / source corpus (factor)
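The description above mentions frequency comparison with regression models; a minimal sketch of such a model (not taken from the SIGIL course materials), using a Poisson GLM with the word count as exposure:
# passive counts modelled by language variety and genre, with log word count as offset
mod <- glm(passive ~ lang + genre + offset(log(n_w)),
           data=BrownLOBPassives, family=poisson)
summary(mod)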
Stephanie Evert (https://purl.org/stephanie.evert)
Francis, W. N. and Kucera, H. (1964). Manual of information to accompany a standard sample of present-day edited American English, for use with digital computers. Technical report, Department of Linguistics, Brown University, Providence, RI.
Johansson, Stig; Leech, Geoffrey; Goodluck, Helen (1978). Manual of information to accompany the Lancaster-Oslo/Bergen corpus of British English, for use with digital computers. Technical report, Department of English, University of Oslo, Oslo.
This data set contains frequency counts of passive verb phrases in the Brown corpus of written American English (Francis & Kucera 1964), aggregated by genre category.
BrownPassives
A data frame with 15 rows and the following columns:
cat
:genre category code (A
... R
)
passive
:number of passive verb phrases
n_w
:total number of words in the genre category
n_s
:total number of sentences in the genre category
name
:descriptive label for the genre category
Stephanie Evert (https://purl.org/stephanie.evert)
Francis, W. N. and Kucera, H. (1964). Manual of information to accompany a standard sample of present-day edited American English, for use with digital computers. Technical report, Department of Linguistics, Brown University, Providence, RI.
This data set provides some basic quantitative measures for all texts in the Brown corpus of written American English (Francis & Kucera 1964).
BrownStats
A data frame with 500 rows and the following columns:
ty
:number of distinct types
to
:number of tokens (including punctuation)
se
:number of sentences
towl
:mean word length in characters, averaged over tokens
tywl
:mean word length in characters, averaged over types
Marco Baroni <[email protected]>
Francis, W. N. and Kucera, H. (1964). Manual of information to accompany a standard sample of present-day edited American English, for use with digital computers. Technical report, Department of Linguistics, Brown University, Providence, RI.
This function computes Pearson's chi-squared statistic (often written as X²) for frequency comparison data, with or without Yates' continuity correction. The implementation is based on the formula given by Evert (2004, 82).
chisq(k1, n1, k2, n2, correct = TRUE, one.sided=FALSE)
k1 |
frequency of a type in the first corpus (or an integer vector of type frequencies) |
n1 |
the sample size of the first corpus (or an integer vector specifying the sizes of different samples) |
k2 |
frequency of the type in the second corpus (or an integer
vector of type frequencies, in parallel to k1) |
n2 |
the sample size of the second corpus (or an integer vector
specifying the sizes of different samples, in parallel to n1) |
correct |
if TRUE (the default), apply Yates' continuity correction |
one.sided |
if TRUE, return the signed square root of the chi-squared statistic as a one-sided test statistic (see "Details"); the default is FALSE |
The values returned by this function are identical to those
computed by
chisq.test
. Unlike the latter, chisq
accepts vector arguments so that a large number of frequency
comparisons can be carried out with a single function call.
The one-sided test statistic (for one.sided=TRUE) is the signed square root of X². It is positive for k1/n1 > k2/n2 and negative for k1/n1 < k2/n2. Note that this statistic has a standard normal distribution rather than a chi-squared distribution under the null hypothesis of equal proportions.
The chi-squared statistic X² corresponding to the specified data (or a vector of X² values). This statistic has a chi-squared distribution with one degree of freedom (df = 1) under the null hypothesis of equal proportions.
Stephanie Evert (https://purl.org/stephanie.evert)
Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, Institut für maschinelle Sprachverarbeitung, University of Stuttgart. Published in 2005, URN urn:nbn:de:bsz:93-opus-23714. Available from http://www.collocations.de/phd.html.
chisq.pval
, chisq.test
,
cont.table
chisq.test(cont.table(99, 1000, 36, 1000))
chisq(99, 1000, 36, 1000)
This function computes the p-value of Pearson's chi-squared test for
the comparison of corpus frequency counts (under the null hypothesis
of equal population proportions). It is based on the chi-squared
statistic implemented by the
chisq
function.
chisq.pval(k1, n1, k2, n2, correct = TRUE, alternative = c("two.sided", "less", "greater"))
k1 |
frequency of a type in the first corpus (or an integer vector of type frequencies) |
n1 |
the sample size of the first corpus (or an integer vector specifying the sizes of different samples) |
k2 |
frequency of the type in the second corpus (or an integer
vector of type frequencies, in parallel to k1) |
n2 |
the sample size of the second corpus (or an integer vector specifying the sizes of different samples, in parallel to n1) |
correct |
if TRUE (the default), apply Yates' continuity correction |
alternative |
a character string specifying the alternative hypothesis; must be one of "two.sided" (default), "less" or "greater" |
The p-values returned by this function are identical to those
computed by chisq.test
(two-sided only) and
prop.test
(one-sided and two-sided) for two-by-two
contingency tables.
The p-value of Pearson's chi-squared test applied to the given data (or a vector of p-values).
Stephanie Evert (https://purl.org/stephanie.evert)
chisq
, fisher.pval
,
chisq.test
, prop.test
chisq.test(cont.table(99, 1000, 36, 1000))
chisq.pval(99, 1000, 36, 1000)
This is a convenience function which constructs 2x2 contingency tables
needed for frequency comparisons with chisq.test
, fisher.test
and similar functions.
cont.table(k1, n1, k2, n2, as.list=NA)
k1 |
frequency of a type in the first corpus, a numeric scalar or vector |
n1 |
the size of the first corpus (sample size), a numeric scalar or vector |
k2 |
frequency of the type in the second corpus, a numeric scalar or vector |
n2 |
the size of the second corpus (sample size), a numeric scalar or vector |
as.list |
whether multiple contingency tables can be constructed and are returned as a list (see "Details" below) |
If all four arguments k1 n1 k2 n2
are scalars (vectors of length 1),
cont.table
constructs a single contingency table, i.e. a 2x2 matrix.
If at least one argument has length > 1, shorter vectors are replicated as
necessary, and a list of 2x2 contingency tables is constructed.
With as.list=TRUE
, the return value is always a list, even if it contains
just a single contingency table. With as.list=FALSE
, only scalar arguments
are accepted and the return value is guaranteed to be a 2x2 matrix.
A numeric matrix containing a two-by-two contingency table for the specified frequency comparison, or a list of such matrices (see "Details").
Stephanie Evert (https://purl.org/stephanie.evert)
ct <- cont.table(42, 100, 66, 200)
ct
chisq.test(ct)
Several useful colour palettes for plots and other visualizations.
The function alpha.col can be used to turn colours (partially) translucent for use in crowded scatterplots.
corpora.palette(name=c("seaborn", "muted", "bright", "simple"), n=NULL, alpha=1)
alpha.col(col, alpha)
name |
name of the desired colour palette (see Details below) |
n |
optional: number of colours to return. The palette will be shortened or recycled as necessary. |
col |
a vector of R colour specifications (as accepted by standard R graphics functions) |
alpha |
alpha value between 0 and 1; values below 1 make the colours translucent |
Every colour palette starts with the colours black, red, green and blue in this order.
seaborn
, muted
and bright
are 7-colour palettes inspired by the seaborn data visualization library, but add a shade of dark grey as first colour.
simple
is a 10-colour palette based on R's default palette.
A character vector with colour names or hexadecimal RGB specifications.
Stephanie Evert (https://purl.org/stephanie.evert)
rgb
for R colour specification formats, palette
for setting the default colour palette
par.save <- par(mfrow=c(2, 2))
for (name in qw("seaborn muted bright simple")) {
  barplot(rep(1, 10), col=corpora.palette(name, 10), main=name)
}
par(par.save)
This data frame provides unsupervised distributional features for each text in the extended Brown Family of corpora (Brown, LOB, Frown, FLOB, BLOB), covering edited written American and British English from the 1930s, 1960s and 1990s (see Xiao 2008, 395–397).
Latent topic dimensions were obtained by a method similar to Latent Semantic Indexing (Deerwester et al. 1990), applying singular value decomposition to bag-of-words vectors for the 2500 texts in the extended Brown Family. Register dimensions were obtained with the same methodology, using vectors of part-of-speech frequencies (separately for all verb-related tags and all other tags).
DistFeatBrownFam
A data frame with 2500 rows and the following 23 columns:
id
:A unique ID for each text (also used as row name)
top1, top2, top3, top4, top5, top6, top7, top8, top9
:latent dimension scores for the first 9 topic dimensions
reg1, reg2, reg3, reg4, reg5, reg6, reg7, reg8, reg9
:latent dimension scores for the first 9 register dimensions (excluding verb-related tags)
vreg1, vreg2, vreg3, vreg4
:latent dimension scores for the first 4 register dimensions based only on verb-related tags
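A hedged sketch of the intended use as additional predictors (assuming the text IDs match those in BrownLOBPassives, which covers a subset of the Brown Family):
# add topic dimensions as predictors of passive frequency
d <- merge(BrownLOBPassives, DistFeatBrownFam, by="id")
mod <- glm(passive ~ lang + top1 + top2 + offset(log(n_w)),
           data=d, family=poisson)
summary(mod)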
TODO
Stephanie Evert (https://purl.org/stephanie.evert)
Deerwester, Scott; Dumais, Susan T.; Furnas, George W.; Landauer, Thomas K.; Harshman, Richard (1990). Indexing by latent semantic analysis. Journal of the American Society For Information Science, 41(6), 391–407.
Xiao, Richard (2008). Well-known and influential corpora. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 20, pages 383–457. Mouton de Gruyter, Berlin.
This function computes the p-value of Fisher's exact test (Fisher
1934) for the comparison of corpus frequency counts (under the null
hypothesis of equal population proportions). In the two-sided case,
a “central” p-value (Fay 2010) provides better numerical efficiency
than the likelihood-based approach of fisher.test
and is always
consistent with confidence intervals.
fisher.pval(k1, n1, k2, n2, alternative = c("two.sided", "less", "greater"), log.p = FALSE)
k1 |
frequency of a type in the first corpus (or an integer vector of type frequencies) |
n1 |
the sample size of the first corpus (or an integer vector specifying the sizes of different samples) |
k2 |
frequency of the type in the second corpus (or an integer
vector of type frequencies, in parallel to k1) |
n2 |
the sample size of the second corpus (or an integer vector specifying the sizes of different samples, in parallel to n1) |
alternative |
a character string specifying the alternative hypothesis; must be one of "two.sided" (default), "less" or "greater" |
log.p |
if TRUE, the natural logarithm of the p-value is returned |
For alternative="two.sided"
(the default), the p-value of the
“central” Fisher's exact test (Fay 2010) is computed, which
differs from the more common likelihood-based method implemented by
fisher.test
(and referred to as the “two-sided Fisher's
exact test” by Fay). This approach has two advantages:
(i) it is numerically robust and efficient, even for very large samples and frequency counts;
(ii) it is consistent with Clopper-Pearson type confidence intervals (see examples below).
For one-sided tests, the p-values returned by this function are identical
to those computed by fisher.test
on two-by-two contingency tables.
The p-value of Fisher's exact test applied to the given data (or a vector of p-values).
Stephanie Evert (https://purl.org/stephanie.evert)
Fay, Michael P. (2010). Confidence intervals that match Fisher's exact or Blaker's exact tests. Biostatistics, 11(2), 373-374.
Fisher, R. A. (1934). Statistical Methods for Research Workers. Oliver & Boyd, Edinburgh, 2nd edition (1st edition 1925, 14th edition 1970).
## Fisher's Tea Drinker (see ?fisher.test)
TeaTasting <- matrix(c(3, 1, 1, 3), nrow = 2,
                     dimnames = list(Guess = c("Milk", "Tea"), Truth = c("Milk", "Tea")))
print(TeaTasting)
## - the "corpora" consist of 4 cups of tea each (n1 = n2 = 4)
##   => columns of TeaTasting
## - frequency counts are the number of cups selected by drinker (k1 = 3, k2 = 1)
##   => first row of TeaTasting
## - null hypothesis of equal type probability = drinker makes random guesses
fisher.pval(3, 4, 1, 4, alternative="greater")
fisher.test(TeaTasting, alternative="greater")$p.value # should be the same
fisher.pval(3, 4, 1, 4)          # central Fisher's exact test is equal to
fisher.test(TeaTasting)$p.value  # standard two-sided Fisher's test for symmetric distribution
# inconsistency btw likelihood-based two-sided Fisher's test and confidence interval
# for 4/15 vs. 50/619 successes
fisher.test(cbind(c(4, 11), c(50, 619)))
# central Fisher's exact test is always consistent
fisher.pval(4, 15, 50, 619)
Compute best-practice keyness measures (according to Evert 2022) for the frequency comparison of lexical items in two corpora. The function is fully vectorised and should be applied to a complete data set of candidate items (so statistical analysis can be adjusted to control the family-wise error rate).
keyness(f1, n1, f2, n2, measure=c("LRC", "PositiveLRC", "G2", "LogRatio", "SimpleMaths"), conf.level=.95, alpha=NULL, p.adjust=TRUE, lambda=1)
f1 |
a numeric vector specifying the frequencies of candidate items in corpus A (target corpus) |
n1 |
sample size of target corpus, i.e. the total number of tokens in corpus A (usually a scalar, but can also be a vector parallel to f1) |
f2 |
a numeric vector parallel to f1, specifying the frequencies of the candidate items in corpus B (reference corpus) |
n2 |
sample size of reference corpus, i.e. the total number of tokens in corpus B (usually a scalar, but can also be a vector parallel to f2) |
measure |
the keyness measure to be computed (see “Details” below) |
conf.level |
the desired confidence level for the confidence intervals underlying the LRC and PositiveLRC measures (default: 0.95) |
alpha |
if specified, filter out candidate items whose frequency difference between corpus A and corpus B is not significant at level alpha (see "Details") |
p.adjust |
if TRUE (the default), apply a Bonferroni correction for multiple testing; alternatively, the family size can be given as a number, or the correction disabled with FALSE (see "Details") |
lambda |
parameter lambda of the SimpleMaths measure (see "Details") |
This function computes a range of best-practice keyness measures comparing the relative frequencies π1 and π2 of lexical items in populations (i.e. sublanguages) A and B, based on the observed sample frequencies f1, f2 and the corresponding sample sizes n1, n2.
The function is fully vectorised with respect to arguments f1, f2, n1 and n2, but only a single keyness measure can be selected for each function call.
All implemented measures are robust for the corner cases f1 = 0 and f2 = 0, but f1 = f2 = 0 is not allowed.
Most of the keyness measures are directional, i.e. positive scores indicate positive keyness in A (π1 > π2) and negative scores indicate negative keyness in A (π1 < π2). By contrast, the one-sided measures PositiveLRC and SimpleMaths only detect positive keyness in A, returning small (and possibly negative) scores otherwise, i.e. in case of insufficient evidence for π1 > π2 and in case of strong evidence for π1 < π2. One-sided measures can be useful for ranking the entire data set as positive keyword candidates.
Hardie (2014) and other authors recommend combining effect-size measures (in particular LogRatio) with a significance filter in order to weed out candidate items for which there is no significant evidence against the null hypothesis π1 = π2. Such a filter is activated by specifying the desired significance level alpha, and can be combined with all keyness measures. In this case, the scores of all non-significant candidate items are set to 0. The decision is based on the likelihood-ratio test implemented by the G2 measure and its asymptotic chi-squared distribution under the null hypothesis. Note that the significance filter can also be applied to the G2 measure itself, setting all scores below the critical value for the significance test to 0. For one-sided measures (PositiveLRC and SimpleMaths), candidates with significant evidence for negative keyness are also filtered out (i.e. their scores are set to 0) in order to ensure a consistent ranking.
By default, statistical inference corrects for multiple testing in order to control family-wise error rates.
This applies to the significance filter as well as to the confidence intervals underlying LRC
and PositiveLRC
.
Note that the G2
scores themselves are never adjusted (only the critical value for the significance filter is modified).
Family size is automatically determined from the number of candidate items processed in a single function call.
Alternatively, the family size can be specified explicitly in the
p.adjust
argument, e.g. if a large data set
is processed in multiple batches, or p.adjust=FALSE
can be used to disable the correction.
For the adjustment, a highly conservative Bonferroni correction is applied to significance levels.
Since the large candidate sets and sample sizes often found in corpus linguistics tend to produce large numbers of false positives,
this conservative approach is considered to be useful.
See Evert (2022) and its supplementary materials for a more detailed discussion of the implemented best-practice measures and some alternatives.
G2
The log-likelihood measure (Rayson & Garside 2000: 3) computes the score of a likelihood-ratio test for the null hypothesis π1 = π2. This test is two-sided and always returns positive values, so the sign of its score is inverted for π1 < π2 in order to obtain a directional keyness measure. As a pure significance measure, it tends to prefer high-frequency candidates with large f1.
LogRatio
A point estimate of the log relative risk log2(π1/π2), which has been suggested as an intuitive keyness measure under the name LogRatio by Hardie (2014: 45). The implementation uses Walter's (1975) adjusted estimator, which is less biased and robust against f1 = 0 or f2 = 0. As a pure effect-size measure, LogRatio tends to assign spuriously high scores to low-frequency candidates with small f1 and f2 (due to sampling variation). Combination with a significance filter is strongly recommended.
LRC
(default) A conservative estimate for LogRatio recommended by Evert (2022) in order to combine and balance the advantages of effect-size and significance measures. A confidence interval (according to the specified conf.level) for the relative risk π1/π2 is obtained from an exact conditional Poisson test (Fay 2010: 55), adjusted for multiple testing by default. If a candidate is not significant (i.e. the confidence interval includes 1), its score is set to 0. Otherwise the boundary of the confidence interval closer to 1 is taken as a conservative directional estimate of the relative risk, and its log2 is returned.
PositiveLRC
A one-sided variant of LRC, which returns the lower boundary of a one-sided confidence interval for log2(π1/π2). Scores ≤ 0 indicate that there is no significant evidence for positive keyness. The directional version of LRC is recommended for general use, but PositiveLRC may be preferred if the hermeneutic interpretation should also consider non-significant candidates (especially with small data sets).
SimpleMaths
The simple maths keyness measure (Kilgarriff 2009) used by the commercial corpus analysis platform Sketch Engine:
  SimpleMaths = (10^6 * f1/n1 + lambda) / (10^6 * f2/n2 + lambda)
Its frequency bias can be adjusted with the user parameter lambda. The scaling factor 10^6 corresponds to frequencies per million words and was chosen so that lambda = 1 is a practical default value. There does not appear to be a convincing mathematical justification behind this measure. It is included here only because of the popularity of the Sketch Engine platform.
A numeric vector of the same length as f1
and f2
, containing keyness scores for all candidate lexical items.
For most measures, positive scores indicate positive keywords (i.e. higher frequency in the population underlying corpus A)
and negative scores indicate negative keywords (i.e. higher frequency in the population underlying corpus B).
If alpha
is specified, non-significant candidates always have a score of 0.
Stephanie Evert (https://purl.org/stephanie.evert)
Evert, S. (2022). Measuring keyness. In Digital Humanities 2022: Conference Abstracts, pages 202-205, Tokyo, Japan / online. https://osf.io/cy6mw/
Fay, Michael P. (2010). Two-sided exact tests and matching confidence intervals for discrete data. The R Journal, 2(1), 53-58.
Hardie, A. (2014). A single statistical technique for keywords, lockwords, and collocations. Internal CASS working paper no. 1, unpublished.
Kilgarriff, A. (2009). Simple maths for keywords. In Proceedings of the Corpus Linguistics 2009 Conference, Liverpool, UK.
Rayson, P. and Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the ACL Workshop on Comparing Corpora, pages 1-6, Hong Kong.
Walter, S. D. (1975). The distribution of Levin’s measure of attributable risk. Biometrika, 62(2): 371-374.
prop.cint
, which is used by the exact conditional Poisson test of the LRC measure
# compute all keyness measures for a single candidate item with f1=7, f2=2 and n1=n2=1000
keyness(7, 1000, 2, 1000, measure="G2")        # log-likelihood
keyness(7, 1000, 2, 1000, measure="LogRatio")
keyness(7, 1000, 2, 1000, measure="LogRatio", alpha=0.05)  # with significance filter
keyness(7, 1000, 2, 1000, measure="LRC")       # the default measure
keyness(7, 1000, 2, 1000, measure="PositiveLRC")
keyness(7, 1000, 2, 1000, measure="SimpleMaths")
# a practical example: keywords of spoken British English (from BNC corpus)
n1 <- sum(BNCcomparison$spoken)   # sample sizes
n2 <- sum(BNCcomparison$written)
kw <- transform(BNCcomparison,
                G2 = keyness(spoken, n1, written, n2, measure="G2"),
                LogRatio = keyness(spoken, n1, written, n2, measure="LogRatio"),
                LRC = keyness(spoken, n1, written, n2))
kw <- kw[order(-kw$LogRatio), ]
head(kw, 20)
# collocations of "in charge of" with LRC as an association measure
colloc <- transform(BNCInChargeOf,
                    PosLRC = keyness(f.in, N.in, f.out, N.out, measure="PositiveLRC"))
colloc <- colloc[order(-colloc$PosLRC), ]
head(colloc, 30)
This data set lists 5102 frequent combinations of verbs and prepositional phrases (PP) extracted from a German newspaper corpus. The collocational status of each PP-verb combination was manually annotated by Brigitte Krenn (2000). In addition, pre-computed scores of several standard association measures are provided.
The KrennPPV
candidate set forms part of the data used in the evaluation study
of Evert & Krenn (2005).
KrennPPV
A data frame with 5102 rows and the following columns:
PP
:the prepositional phrase, represented by preposition and lemma of the nominal head (character).
Preposition-article fusion is indicated by a +
sign. For example, the prepositional phrase
im letzten Jahr would appear as in:Jahr
in the data set.
verb
:the verb lemma (character). Separated particle verbs have been recombined.
is.colloc
:whether the PP-verb combination is a lexical collocation (logical)
is.SVC
:whether a PP-verb collocation is a support verb construction (logical)
is.figur
:whether a PP-verb-collocation is a figurative expression (logical)
freq
:co-occurrence frequency of the PP-verb combination within clauses (integer)
MI
:Mutual Information association measure
Dice
:Dice coefficient association measure
z.score
:z-score association measure
t.score
:t-score association measure
chisq
:chi-squared association measure (without Yates' continuity correction)
chisq.corr
:chi-squared association measure (with Yates' continuity correction)
log.like
:log-likelihood association measure
Fisher
:Fisher's exact test as an association measure (negative logarithm of one-sided p-value)
See Evert (2008) and http://www.collocations.de/AM/ for details on these association measures.
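For example (a small sketch, not from the original documentation), the manual annotation can be used to evaluate a ranking by one of the pre-computed association scores:
# precision of true collocations among the 500 highest-ranked candidates by log-likelihood
idx <- order(KrennPPV$log.like, decreasing=TRUE)[1:500]
mean(KrennPPV$is.colloc[idx])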
Stephanie Evert (https://purl.org/stephanie.evert)
Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58, pages 1212–1248. Mouton de Gruyter, Berlin, New York.
Evert, Stefan and Krenn, Brigitte (2005). Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4), 450–466.
Krenn, Brigitte (2000). The Usual Suspects: Data-Oriented Models for the Identification and Representation of Lexical Collocations, volume 7 of Saarbrücken Dissertations in Computational Linguistics and Language Technology. DFKI & Universität des Saarlandes, Saarbrücken, Germany.
This data set contains frequency counts of passive verb phrases in the LOB corpus of written British English (Johansson et al. 1978), aggregated by genre category.
LOBPassives
A data frame with 15 rows and the following columns:
cat
:genre category code (A
... R
)
passive
:number of passive verb phrases
n_w
:total number of words in the genre category
n_s
:total number of sentences in the genre category
name
:descriptive label for the genre category
Stephanie Evert (https://purl.org/stephanie.evert)
Johansson, Stig; Leech, Geoffrey; Goodluck, Helen (1978). Manual of information to accompany the Lancaster-Oslo/Bergen corpus of British English, for use with digital computers. Technical report, Department of English, University of Oslo, Oslo.
BrownPassives
, BrownLOBPassives
This data set provides some basic quantitative measures for all texts in the LOB corpus of written British English (Johansson et al. 1978).
LOBStats
A data frame with 500 rows and the following columns:
ty
:number of distinct types
to
:number of tokens (including punctuation)
se
:number of sentences
towl
:mean word length in characters, averaged over tokens
tywl
:mean word length in characters, averaged over types
Marco Baroni <[email protected]>
Johansson, Stig; Leech, Geoffrey; Goodluck, Helen (1978). Manual of information to accompany the Lancaster-Oslo/Bergen corpus of British English, for use with digital computers. Technical report, Department of English, University of Oslo, Oslo.
This data set specifies the number of passive and active verb phrases for each text in the extended Brown Family of corpora (Brown, LOB, Frown, FLOB, BLOB), covering edited written American and British English from the 1930s, 1960s and 1990s (see Xiao 2008, 395–397).
Verb phrase and passive/active aspect counts are based on a fully automatic analysis of the texts, using the Pro3Gres parser (Schneider et al. 2004).
PassiveBrownFam
A data frame with 2499 rows and the following 11 columns:
id
:A unique ID for each text (also used as row name)
corpus
:Corpus, a factor with five levels BLOB
, Brown
, LOB
, Frown
, FLOB
section
:Genre, a factor with fifteen levels A
, ..., R
(Brown section codes)
genre
:Genre labels, a factor with fifteen levels (e.g. press reportage
)
period
:Date of publication, a factor with three levels (1930
, 1960
, 1990
)
lang
:Language variety / region, a factor with levels AmE
(U.S.) and BrE
(UK)
n.words
:Number of word tokens, an integer vector
act
:Number of active verb phrases, an integer vector
pass
:Number of passive verb phrases, an integer vector
verbs
:Total number of verb phrases, an integer vector
p.pass
:Percentage of passive verb phrases in the text, a numeric vector
No frequency data could be obtained for text N02
in the Frown corpus. This entry has been omitted from the table.
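As a quick illustration (a sketch, not part of the manual), the passive percentage can be summarised across the design factors:
# mean percentage of passive verb phrases by publication period and language variety
with(PassiveBrownFam, tapply(p.pass, list(period, lang), mean))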
Frequency information for this data set was kindly provided by Gerold Schneider, University of Zurich (http://www.cl.uzh.ch/de/people/team/compling/gschneid.html).
Stephanie Evert (https://purl.org/stephanie.evert)
Schneider, Gerold; Rinaldi, Fabio; Dowdall, James (2004). Fast, deep-linguistic statistical dependency parsing. In G.-J. M. Kruijff and D. Duchier (eds.), Proceedings of the COLING 2004 Workshop on Recent Advances in Dependency Grammar, pages 33-40, Geneva, Switzerland. https://files.ifi.uzh.ch/cl/gschneid/parser/
Xiao, Richard (2008). Well-known and influential corpora. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 20, pages 383–457. Mouton de Gruyter, Berlin.
This function computes a confidence interval for a population proportion from the corresponding frequency count in a sample. It either uses the Clopper-Pearson method (inverted exact binomial test) or the Wilson score method (inversion of a z-score test, with or without continuity correction).
prop.cint(k, n, method = c("binomial", "z.score"), correct = TRUE, p.adjust=FALSE, conf.level = 0.95, alternative = c("two.sided", "less", "greater"))
k |
frequency of a type in the corpus (or an integer vector of frequencies) |
n |
number of tokens in the corpus, i.e. sample size (or an integer vector specifying the sizes of different samples) |
method |
a character string specifying whether to compute a Clopper-Pearson confidence interval ("binomial", the default) or a Wilson score confidence interval ("z.score") |
correct |
if TRUE (the default), apply Yates' continuity correction (only relevant for the Wilson score method) |
p.adjust |
if TRUE, apply a Bonferroni correction for multiple testing, with the family size given by the number of intervals computed; alternatively, the family size can be specified as a number (default: FALSE) |
conf.level |
the desired confidence level (defaults to 95%) |
alternative |
a character string specifying the alternative hypothesis, yielding a two-sided ("two.sided", the default) or one-sided ("less", "greater") confidence interval |
The confidence intervals computed by this function correspond to those
returned by binom.test
and prop.test
,
respectively. However, prop.cint
accepts vector arguments,
allowing many confidence intervals to be computed with a single
function call in a computationally efficient manner.
The Clopper-Pearson confidence interval (binomial) is obtained by inverting the exact binomial test at significance level alpha = 1 - conf.level. In the two-sided case, the p-value of the test is computed using the “central” method (Fay 2010: 53), i.e. as twice the tail probability of the matching tail. This corresponds to the algorithm originally proposed by Clopper & Pearson (1934).
The limits of the confidence interval are computed in an efficient and numerically robust manner via (the inverse of) the incomplete Beta function.
The Wilson score confidence interval (z.score) is computed by solving the equation of the z-score test

  (k - n*p) / sqrt(n * p * (1 - p)) = ±z

for p, where z is the z-value corresponding to the chosen confidence level (e.g. z = 1.96 for a two-sided test with 95% confidence). This leads to the quadratic equation

  p^2 * (n + z^2) - p * (2*k + z^2) + k^2 / n = 0

whose two solutions correspond to the lower and upper boundary of the confidence interval.
When Yates' continuity correction is applied, the value k in the numerator of the z-score equation has to be replaced by k ± 1/2: k - 1/2 for the lower boundary of the confidence interval (where k/n > p) and k + 1/2 for the upper boundary of the confidence interval (where k/n < p). In each case, the corresponding solution of the quadratic equation has to be chosen (i.e., the solution with p < k/n for the lower boundary and vice versa).
If a Bonferroni correction is applied, the significance level alpha of the underlying test is divided by the number m of tests carried out (specified explicitly by the user or given implicitly by length(k)): alpha' = alpha / m.
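A small sketch of the vectorised interface together with the Bonferroni adjustment described above (assuming p.adjust=TRUE enables the correction as documented here):
# simultaneous Clopper-Pearson intervals for three frequency counts,
# with significance levels adjusted for a family of three tests
prop.cint(c(19, 5, 42), c(100, 50, 300), method="binomial", p.adjust=TRUE)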
A data frame with two columns, labelled lower
for the lower
boundary and upper
for the upper boundary of the confidence
interval. The number of rows is determined by the length of the
longest input vector (k
, n
and conf.level
).
Stephanie Evert (https://purl.org/stephanie.evert)
Clopper, C. J. & Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4), 404-413.
Fay, Michael P. (2010). Two-sided exact tests and matching confidence intervals for discrete data. The R Journal, 2(1), 53-58.
https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
z.score.pval
, prop.test
,
binom.pval
, binom.test
# Clopper-Pearson confidence interval
binom.test(19, 100)
prop.cint(19, 100, method="binomial")
# Wilson score confidence interval
prop.test(19, 100)
prop.cint(19, 100, method="z.score")
This function splits one or more character strings into words. By default,
the strings are split on whitespace in order to emulate Perl's qw()
(quote words) functionality.
qw(s, sep="\\s+", names=FALSE)
s | one or more strings to be split (a character vector) |
sep | PCRE regular expression on which to split (defaults to whitespace) |
names | if TRUE, the resulting character vector is labelled with itself, which is convenient when iterating over it with lapply or sapply |
A character vector of the resulting words. Multiple strings in s are flattened into a single vector.
If names=TRUE, the words are used both as values and as labels of the character vector, which is convenient when iterating over it with lapply or sapply.
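A small sketch of this convenience (the corpus names and the statistic computed here are arbitrary):

# iterate over a list of labels; sapply() picks up the names automatically
corpus.names <- qw("brown lob frown flob", names=TRUE)
sapply(corpus.names, nchar)   # returns a vector labelled brown, lob, frown, flob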
Stephanie Evert (https://purl.org/stephanie.evert)
qw(c("alpha beta gamma", "42 111" )) qw("alpha beta gamma", names=TRUE) qw("words with blanks, sep by commas", sep="\\s*,\\s*")
qw(c("alpha beta gamma", "42 111" )) qw("alpha beta gamma", names=TRUE) qw("words with blanks, sep by commas", sep="\\s*,\\s*")
This utility function converts a plain vector into a row or column vector, i.e. a single-row or single-column matrix.
rowVector(x, label=NULL) colVector(x, label=NULL)
x | a (typically numeric) vector |
label | an optional character string specifying a label for the single row or column returned |
A single-row or single-column matrix of the same data type as x. Labels of x are preserved as column/row names of the matrix.
See matrix for details on how non-atomic objects are handled.
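As an illustration (not part of the original documentation), row and column vectors can be combined with ordinary matrix multiplication:

x <- rowVector(c(1, 2, 3), "x")   # 1 x 3 matrix
y <- colVector(c(4, 5, 6), "y")   # 3 x 1 matrix
x %*% y                           # inner product as a 1 x 1 matrix (= 32)
y %*% x                           # outer product as a 3 x 3 matrix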
Stephanie Evert (https://purl.org/stephanie.evert)
rowVector(1:5, "myvec") colVector(c(A=1, B=2, C=3), label="myvec")
This function takes a random sample of rows from a data frame,
in analogy to the built-in function sample
(which sadly
does not accept a data frame).
sample.df(df, size, replace=FALSE, sort=FALSE, prob=NULL)
df | a data frame to be sampled from |
size | positive integer giving the number of rows to choose |
replace | Should sampling be with replacement? |
sort | Should the rows in the sample be sorted in their original order? |
prob | a vector of probability weights for obtaining the elements of the vector being sampled |
Internally, rows are selected with the function sample.int. See its manual page for details on the arguments (except for sort) and implementation.
A data frame containing the sampled rows of df, either in their original order (sort=TRUE) or shuffled randomly (sort=FALSE).
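A short sketch with a toy data frame (the column names and probability weights are made up for illustration):

df <- data.frame(id = 1:10, x = round(rnorm(10), 2))
sample.df(df, 5, sort=TRUE)        # 5 random rows, kept in their original order
sample.df(df, 5, prob = df$id)     # rows with larger id are more likely to be drawn
sample.df(df, 20, replace=TRUE)    # with replacement, size may exceed nrow(df)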
Stephanie Evert (https://purl.org/stephanie.evert)
sample.df(BrownLOBPassives, 20, sort=TRUE)
This function generates a large simulated census data frame with body measurements (height, weight, shoe size) for male and female inhabitants of a highly fictitious country.
The generated data set is usually named FakeCensus
(see code examples below)
and is used for various exercises and illustrations in the SIGIL course.
simulated.census(N=502202, p.male=0.55, seed.rng=42)
N | population size, i.e. number of inhabitants of the fictitious country |
p.male | proportion of males in the country |
seed.rng | seed for the random number generator, so that data sets generated with the same parameters are reproducible |
The default population size corresponds to the estimated population of Luxembourg on 1 January 2010 (according to https://en.wikipedia.org/wiki/Luxembourg).
Further parameters of the simulation (standard deviation, correlations, non-linearity) will be exposed as function arguments in future releases.
A data frame with N
rows corresponding to inhabitants and the following columns:
height: body height in cm
weight: body weight in kg
shoe.size: shoe size in Paris points (Continental European scale)
sex: sex of the inhabitant, either m or f
Stephanie Evert (https://purl.org/stephanie.evert)
FakeCensus <- simulated.census()
summary(FakeCensus)
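Building on the example above, a brief exploratory sketch (not part of the original documentation); since the default seed is fixed, the results are reproducible:

table(FakeCensus$sex)                             # roughly 55% male with the default p.male
tapply(FakeCensus$height, FakeCensus$sex, mean)   # average body height by sex
cor(FakeCensus$height, FakeCensus$shoe.size)      # correlation of height and shoe size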
This function generates simulated results of a study measuring the effectiveness of a new corpus-driven foreign language teaching course.
The generated data set is usually named LanguageCourse
(see code examples below)
and is used for various exercises and illustrations in the SIGIL course.
simulated.language.course(n=c(15,20,10,10,14,18,15), mean=c(60,50,30,70,55,50,60), effect=c(5,8,12,-4,2,6,-5), sd.subject=15, sd.effect=5, seed.rng=42)
n | number of participants in each class |
mean | average score of each class before the course |
effect | improvement of each class during the course |
sd.subject | inter-subject variability, may be different in each class |
sd.effect | inter-subject variability of effect size, may also be different in each class |
seed.rng | seed for the random number generator, so data sets with the same parameters are reproducible |
TODO
A data frame with sum(n)
rows corresponding to individual subjects participating in the study and the following columns
id: unique ID code of subject
class: name of the teaching class
pre: score in standardized language test before the course (pre-test)
post: score in standardized language test after the course (post-test)
Stephanie Evert (https://purl.org/stephanie.evert)
LanguageCourse <- simulated.language.course()
head(LanguageCourse, 20)
summary(LanguageCourse)
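For instance, the effectiveness question can be examined with a paired comparison of pre- and post-test scores; this is only a sketch, not part of the original documentation:

# overall improvement across all participants (paired t-test)
t.test(LanguageCourse$post, LanguageCourse$pre, paired=TRUE)
# average improvement per class
with(LanguageCourse, tapply(post - pre, class, mean))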
This function generates type and token counts, type-token ratios (TTR) and average word length for simulated articles from the English Wikipedia. Simulation parameters are based on data from the Wackypedia corpus.
The generated data set is usually named WackypediaStats
(see code examples below)
and is used for various exercises and illustrations in the SIGIL course.
simulated.wikipedia(N=1429649, length=c(100,1000), seed.rng=42)
N | population size, i.e. total number of Wikipedia articles |
length | a numeric vector of length 2, specifying the typical range of Wikipedia article lengths |
seed.rng | seed for the random number generator, so that data sets generated with the same parameters are reproducible |
The default population size corresponds to the subset of the Wackypedia corpus from which the simulation parameters were obtained. This excludes all articles with extreme type-token statistics (very short, very long, extremely long words, etc.).
Article lengths are sampled from a lognormal distribution which is scaled so that the central 95% of the values fall into the range specified by the length argument. The simulated data are surprisingly close to the original Wackypedia statistics.
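A quick sketch of how the claim about article lengths can be checked (the exact quantiles will vary slightly around the nominal range):

WackypediaStats <- simulated.wikipedia()
# central 95% of article lengths should lie roughly within the default range of 100..1000 tokens
quantile(WackypediaStats$tokens, probs=c(0.025, 0.975))
hist(log10(WackypediaStats$tokens), main="Simulated article lengths", xlab="log10(tokens)")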
A data frame with N
rows corresponding to Wikipedia articles and the following columns:
tokens: number of word tokens in the article
types: number of distinct word types in the article
ttr: type-token ratio (TTR) for the article
avglen: average word length in characters (averaged across tokens)
Stephanie Evert (https://purl.org/stephanie.evert)
The Wackypedia corpus can be obtained from https://wacky.sslmit.unibo.it/doku.php?id=corpora.
WackypediaStats <- simulated.wikipedia()
summary(WackypediaStats)
A simple utility function that converts p-values into the customary significance stars.
stars.pval(x)
x | a numeric vector of non-negative p-values |
A character vector with significance stars corresponding to the p-values.
Significance levels are *** (p < .001), ** (p < .01), * (p < .05) and . (p < .1). For non-significant p-values (p >= .1), an empty string is returned.
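A small sketch combining stars.pval() with the vectorised binom.pval() from this package (the counts below are made up):

k <- c(3, 12, 19, 26)
pv <- binom.pval(k, 100, p=.15)   # two-sided binomial tests against pi = 0.15
data.frame(k=k, p.value=round(pv, 4), signif=stars.pval(pv))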
Stephanie Evert (https://purl.org/stephanie.evert)
stars.pval(c(0, .007, .01, .04, .1))
This data set contains a small corpus (8043 tokens) of short stories from the collection Very Short Stories (VSS, see http://www.schtepf.de/History/pages/stories.html). The text was automatically segmented (tokenised) and annotated with part-of-speech tags (from the Penn tagset) and lemmas (base forms), using the IMS TreeTagger (Schmid 1994) and a custom lemmatizer.
VSS
A data set with 8043 rows corresponding to tokens and the following columns:
word: the word form (or surface form) of the token
pos: the part-of-speech tag of the token (Penn tagset)
lemma: the lemma (or base form) of the token
sentence: number of the sentence in which the token occurs (integer)
story: title of the story to which the token belongs (factor)
The Penn tagset defines the following part-of-speech tags:
CC | Coordinating conjunction |
CD | Cardinal number |
DT | Determiner |
EX | Existential there |
FW | Foreign word |
IN | Preposition or subordinating conjunction |
JJ | Adjective |
JJR | Adjective, comparative |
JJS | Adjective, superlative |
LS | List item marker |
MD | Modal |
NN | Noun, singular or mass |
NNS | Noun, plural |
NP | Proper noun, singular |
NPS | Proper noun, plural |
PDT | Predeterminer |
POS | Possessive ending |
PP | Personal pronoun |
PP$ | Possessive pronoun |
RB | Adverb |
RBR | Adverb, comparative |
RBS | Adverb, superlative |
RP | Particle |
SYM | Symbol |
TO | to |
UH | Interjection |
VB | Verb, base form |
VBD | Verb, past tense |
VBG | Verb, gerund or present participle |
VBN | Verb, past participle |
VBP | Verb, non-3rd person singular present |
VBZ | Verb, 3rd person singular present |
WDT | Wh-determiner |
WP | Wh-pronoun |
WP$ | Possessive wh-pronoun |
WRB | Wh-adverb |
Stephanie Evert (https://purl.org/stephanie.evert)
Schmid, Helmut (1994). Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing (NeMLaP), pages 44-49.
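Since this entry has no examples section, here is a brief exploratory sketch (not part of the original documentation):

head(VSS, 10)                                 # first tokens of the corpus
table(VSS$story)                              # number of tokens per story
sort(table(VSS$pos), decreasing=TRUE)[1:10]   # the 10 most frequent part-of-speech tags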
This function computes a z-score statistic for frequency counts, based on a normal approximation to the correct binomial distribution under the random sampling model.
z.score(k, n, p = 0.5, correct = TRUE)
k | frequency of a type in the corpus (or an integer vector of frequencies) |
n | number of tokens in the corpus, i.e. sample size (or an integer vector specifying the sizes of different samples) |
p | null hypothesis, giving the assumed proportion of this type in the population (or a vector of proportions for different types and/or different populations) |
correct | if TRUE, apply Yates' continuity correction (default) |
The z statistic is given by

$$ z = \frac{k - np}{\sqrt{n p (1 - p)}} $$

When Yates' continuity correction is enabled, the absolute value of the numerator is reduced by 1/2, but clamped to a non-negative value.
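The formula can be verified directly against the function; a minimal sketch, with correct=FALSE disabling the continuity correction as described above:

k <- 19; n <- 100; p <- 0.15
(k - n*p) / sqrt(n * p * (1 - p))    # manual computation of the z statistic
z.score(k, n, p=p, correct=FALSE)    # should give the same value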
The z-score corresponding to the specified data (or a vector of z-scores).
Stephanie Evert (https://purl.org/stephanie.evert)
# z-test for H0: pi = 0.15 with observed counts 10..30 in a sample of n=100 tokens
k <- c(10:30)
z <- z.score(k, 100, p=.15)
names(z) <- k
round(z, 3)
abs(z) >= 1.96  # significant results at p < .05
This function computes the p-value of a z-score test for frequency counts, based on the z-score statistic implemented by z.score.
z.score.pval(k, n, p = 0.5, correct = TRUE, alternative = c("two.sided", "less", "greater"))
k | frequency of a type in the corpus (or an integer vector of frequencies) |
n | number of tokens in the corpus, i.e. sample size (or an integer vector specifying the sizes of different samples) |
p | null hypothesis, giving the assumed proportion of this type in the population (or a vector of proportions for different types and/or different populations) |
correct | if TRUE, apply Yates' continuity correction (default) |
alternative | a character string specifying the alternative hypothesis; must be one of two.sided (default), less or greater |
The p-value of a z-score test applied to the given data (or a vector of p-values).
Stephanie Evert (https://purl.org/stephanie.evert)
z.score, binom.pval, prop.cint
# compare z-test for H0: pi = 0.15 against binomial test
# with observed counts 10..30 in a sample of n=100 tokens
k <- c(10:30)
p.compare <- rbind(
  z.score = z.score.pval(k, 100, p=.15),
  binomial = binom.pval(k, 100, p=.15))
colnames(p.compare) <- k
round(p.compare, 4)