Title: Statistics and Data Sets for Corpus Frequency Data
Description: Utility functions for the statistical analysis of corpus frequency data. This package is a companion to the open-source course "Statistical Inference: A Gentle Introduction for Computational Linguists and Similar Creatures" ('SIGIL').
Authors: Stephanie Evert [cre, aut]
Maintainer: Stephanie Evert <[email protected]>
License: GPL-3
Version: 0.6
Built: 2024-10-29 05:46:34 UTC
Source: https://github.com/r-forge/sigil
The corpora
package provides a collection of functions for statistical inference
from corpus frequency data, as well as some convenience functions and example data sets.
It is a companion package to the open-source course Statistical Inference: a Gentle Introduction for Linguists and similar creatures originally developed by Marco Baroni and Stephanie Evert. Statistical methods implemented in the package are described and illustrated in the units of this course.
Starting with version 0.6 the package also includes best-practice implementations of various corpus-linguistic analysis techniques.
An overview of some important functions and data sets included in the corpora
package.
See the package index for a complete listing.
keyness() provides reference implementations for best-practice keyness measures, including the recommended LRC measure (Evert 2022)
binom.pval() is a vectorised function that computes p-values of the binomial test more efficiently than binom.test (using central p-values in the two-sided case)
fisher.pval() is a vectorised function that efficiently computes p-values of Fisher's exact test on contingency tables for large samples (using central p-values in the two-sided case)
prop.cint() is a vectorised function that computes multiple binomial confidence intervals much more efficiently than binom.test
z.score() and z.score.pval() can be used to carry out a z-test for a single proportion (as an approximation to binom.test)
chisq() and chisq.pval() are vectorised functions that compute the test statistic and p-value of a chi-squared test for contingency tables more efficiently than chisq.test
cont.table() creates contingency tables for frequency comparison tests that can be passed to chisq.test and fisher.test
sample.df() extracts random samples of rows from a data frame
qw() splits a string on whitespace or a user-specified regular expression (similar to Perl's qw// construct)
corpora.palette() provides some nice colour palettes (better than R's default colours)
rowVector() and colVector() convert a vector into a single-row or single-column matrix
Several data sets based on the British National Corpus, including complete metadata for all 4048 text files (BNCmeta), per-text frequency counts for a number of linguistic corpus queries (BNCqueries), and relative frequencies of 65 lexico-grammatical features for each text (BNCbiber)
Frequency counts of passive constructions in all texts of the Brown and LOB corpora (BrownLOBPassives) for frequency comparison with regression models, complemented by distributional features (DistFeatBrownFam) as additional predictors
A small text corpus of Very Short Stories in the form of a data frame VSS, with one row for each token in the corpus
Small example tables to illustrate frequency comparison of lexical items (BNCcomparison) and collocation analysis (BNCInChargeOf)
KrennPPV is a data set of German verb-preposition-noun collocation candidates with manual annotation of true positives and pre-computed association scores
Three functions for generating large synthetic data sets used in the SIGIL course: simulated.census(), simulated.language.course() and simulated.wikipedia()
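As a quick illustration (a minimal sketch, not taken from the official examples), a few of the utilities listed above can be combined as follows:
library(corpora)
words <- qw("alpha beta gamma")          # split a string on whitespace
prop.cint(19, 100, method="binomial")    # Clopper-Pearson confidence interval
chisq.pval(99, 1000, 36, 1000)           # chi-squared test for a frequency comparison
corpora.palette("seaborn")               # a colour palette for plots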
Stephanie Evert (https://purl.org/stephanie.evert)
The official homepage of the corpora
package and the SIGIL course is http://SIGIL.R-Forge.R-Project.org/.
This function computes the p-value of a binomial test for frequency
counts. In the two-sided case, a “central” p-value (Fay 2010)
provides better numerical efficiency than the likelihood-based approach
of binom.test
and is always consistent with confidence intervals.
binom.pval(k, n, p = 0.5, alternative = c("two.sided", "less", "greater"))
k |
frequency of a type in the corpus (or an integer vector of frequencies) |
n |
number of tokens in the corpus, i.e. sample size (or an integer vector specifying the sizes of different samples) |
p |
null hypothesis, giving the assumed proportion of this type in the population (or a vector of proportions for different types and/or different populations) |
alternative |
a character string specifying the alternative hypothesis; must be one of "two.sided" (default), "less" or "greater" |
For alternative="two.sided"
(the default), a “central” p-value
is computed (Fay 2010: 53f), which differs from the likelihood-based two-sided
p-value determined by binom.test
(the “minlike” method in Fay's
terminology). This approach has two advantages: (i) it is numerically robust
and efficient, even for very large samples and frequency counts; (ii) it is
always consistent with Clopper-Pearson confidence intervals (see examples below).
The p-value of a binomial test applied to the given data (or a vector of p-values).
Stephanie Evert (https://purl.org/stephanie.evert)
Fay, Michael P. (2010). Two-sided exact tests and matching confidence intervals for discrete data. The R Journal, 2(1), 53-58.
# inconsistency btw likelihood-based two-sided binomial test and confidence interval
binom.test(2, 10, p=0.555)
# central two-sided test as implemented by binom.pval is always consistent
binom.pval(2, 10, p=0.555)
prop.cint(2, 10, method="binomial")
This data set contains a table of the relative frequencies (per 1000 words) of 65 linguistic features (Biber 1988, 1995) for each text document in the British National Corpus (Aston & Burnard 1998).
Biber (1988) introduced these features for the purpose of a multidimensional register analysis. Variables in the data set are numbered according to Biber's list (see e.g. Biber 1995, 95f).
Feature frequencies were automatically extracted from the British National Corpus using query patterns based on part-of-speech tags (Gasthaus 2007). Note that features 60 and 65 had to be omitted because they cannot be identified with sufficient accuracy by the automatic methods. For further information on the extraction methodology, see Gasthaus (2007, 20-21). The original data set and the Python scripts used for feature extraction are available from https://portal.ikw.uni-osnabrueck.de/~CL/download/BSc_Gasthaus2007/; the version included here contains some bug fixes.
BNCbiber
A numeric matrix with 4048 rows and 65 columns, specifying the relative frequencies
(per 1000 words) of 65 linguistic features. Documents are listed in the same order
as the metadata in BNCmeta
and rows are labelled with text IDs, so it
is straightforward to combine the two data sets.
A. Tense and aspect markers | |
f_01_past_tense |
Past tense |
f_02_perfect_aspect |
Perfect aspect |
f_03_present_tense |
Present tense |
B. Place and time adverbials | |
f_04_place_adverbials |
Place adverbials (e.g., above, beside, outdoors) |
f_05_time_adverbials |
Time adverbials (e.g., early, instantly, soon) |
C. Pronouns and pro-verbs | |
f_06_first_person_pronouns |
First-person pronouns |
f_07_second_person_pronouns |
Second-person pronouns |
f_08_third_person_pronouns |
Third-person personal pronouns (excluding it) |
f_09_pronoun_it |
Pronoun it |
f_10_demonstrative_pronoun |
Demonstrative pronouns (that, this, these, those as pronouns) |
f_11_indefinite_pronoun |
Indefinite pronouns (e.g., anybody, nothing, someone) |
f_12_proverb_do |
Pro-verb do |
D. Questions | |
f_13_wh_question |
Direct wh-questions |
E. Nominal forms | |
f_14_nominalization |
Nominalizations (ending in -tion, -ment, -ness, -ity) |
f_15_gerunds |
Gerunds (participial forms functioning as nouns) |
f_16_other_nouns |
Total other nouns |
F. Passives | |
f_17_agentless_passives |
Agentless passives |
f_18_by_passives |
by-passives |
G. Stative forms | |
f_19_be_main_verb |
be as main verb |
f_20_existential_there |
Existential there |
H. Subordination features | |
f_21_that_verb_comp |
that verb complements (e.g., I said that he went.) |
f_22_that_adj_comp |
that adjective complements (e.g., I'm glad that you like it.) |
f_23_wh_clause |
wh-clauses (e.g., I believed what he told me.) |
f_24_infinitives |
Infinitives |
f_25_present_participle |
Present participial adverbial clauses (e.g., Stuffing his mouth with cookies, Joe ran out the door.) |
f_26_past_participle |
Past participial adverbial clauses (e.g., Built in a single week, the house would stand for fifty years.) |
f_27_past_participle_whiz |
Past participial postnominal (reduced relative) clauses (e.g., the solution produced by this process) |
f_28_present_participle_whiz |
Present participial postnominal (reduced relative) clauses (e.g., the event causing this decline) |
f_29_that_subj |
that relative clauses on subject position (e.g., the dog that bit me) |
f_30_that_obj |
that relative clauses on object position (e.g., the dog that I saw) |
f_31_wh_subj |
wh relatives on subject position (e.g., the man who likes popcorn) |
f_32_wh_obj |
wh relatives on object position (e.g., the man who Sally likes) |
f_33_pied_piping |
Pied-piping relative clauses (e.g., the manner in which he was told) |
f_34_sentence_relatives |
Sentence relatives (e.g., Bob likes fried mangoes, which is the most disgusting thing I've ever heard of.) |
f_35_because |
Causative adverbial subordinator (because) |
f_36_though |
Concessive adverbial subordinators (although, though) |
f_37_if |
Conditional adverbial subordinators (if, unless) |
f_38_other_adv_sub |
Other adverbial subordinators (e.g., since, while, whereas) |
I. Prepositional phrases, adjectives and adverbs | |
f_39_prepositions |
Total prepositional phrases |
f_40_adj_attr |
Attributive adjectives (e.g., the big horse) |
f_41_adj_pred |
Predicative adjectives (e.g., The horse is big.) |
f_42_adverbs |
Total adverbs |
J. Lexical specificity | |
f_43_type_token |
Type-token ratio (including punctuation) |
f_44_mean_word_length |
Average word length (across tokens, excluding punctuation) |
K. Lexical classes | |
f_45_conjuncts |
Conjuncts (e.g., consequently, furthermore, however) |
f_46_downtoners |
Downtoners (e.g., barely, nearly, slightly) |
f_47_hedges |
Hedges (e.g., at about, something like, almost) |
f_48_amplifiers |
Amplifiers (e.g., absolutely, extremely, perfectly) |
f_49_emphatics |
Emphatics (e.g., a lot, for sure, really) |
f_50_discourse_particles |
Discourse particles (e.g., sentence-initial well, now, anyway) |
f_51_demonstratives |
Demonstratives |
L. Modals | |
f_52_modal_possibility |
Possibility modals (can, may, might, could) |
f_53_modal_necessity |
Necessity modals (ought, should, must) |
f_54_modal_predictive |
Predictive modals (will, would, shall) |
M. Specialized verb classes | |
f_55_verb_public |
Public verbs (e.g., assert, declare, mention) |
f_56_verb_private |
Private verbs (e.g., assume, believe, doubt, know) |
f_57_verb_suasive |
Suasive verbs (e.g., command, insist, propose) |
f_58_verb_seem |
seem and appear |
N. Reduced forms and dispreferred structures | |
f_59_contractions |
Contractions |
n/a | Subordinator that deletion (e.g., I think [that] he went.) |
f_61_stranded_preposition |
Stranded prepositions (e.g., the candidate that I was thinking of) |
f_62_split_infinitve |
Split infinitives (e.g., He wants to convincingly prove that ...) |
f_63_split_auxiliary |
Split auxiliaries (e.g., They were apparently shown to ...) |
O. Co-ordination | |
f_64_phrasal_coordination |
Phrasal co-ordination (N and N; Adj and Adj; V and V; Adv and Adv) |
n/a | Independent clause co-ordination (clause-initial and) |
P. Negation | |
f_66_neg_synthetic |
Synthetic negation (e.g., No answer is good enough for Jones.) |
f_67_neg_analytic |
Analytic negation (e.g., That's not likely.) |
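For instance, a minimal sketch (not part of the original examples) of combining the feature matrix with the metadata table, relying on the shared row order described above:
# mean relative frequency of agentless passives by text mode (written vs. spoken)
library(corpora)
tapply(BNCbiber[, "f_17_agentless_passives"], BNCmeta$mode, mean)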
Stephanie Evert (https://purl.org/stephanie.evert); feature extractor by Jan Gasthaus (2007).
Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage at http://www.natcorp.ox.ac.uk/.
Biber, Douglas (1988). Variations Across Speech and Writing. Cambridge University Press, Cambridge.
Biber, Douglas (1995). Dimensions of Register Variation: A cross-linguistic comparison. Cambridge University Press, Cambridge.
Gasthaus, Jan (2007). Prototype-Based Relevance Learning for Genre Classification. B.Sc. thesis, Institute of Cognitive Science, University of Osnabrück. Data sets and software available from https://portal.ikw.uni-osnabrueck.de/~CL/download/BSc_Gasthaus2007/.
This data set compares the frequencies of 60 selected nouns in the written and spoken parts of the British National Corpus, World Edition (BNC). Nouns were chosen from three frequency bands, namely the 20 most frequent nouns in the corpus, 20 nouns with approximately 1000 occurrences, and 20 nouns with approximately 100 occurrences.
See Aston & Burnard (1998) for more information about the BNC, or go to http://www.natcorp.ox.ac.uk/.
BNCcomparison
A data frame with 61 rows and the following columns:
noun
:lemmatised noun (aka stem form)
written
:frequency in the written part of the BNC
spoken
:frequency in the spoken part of the BNC
In addition to the 60 nouns, the data set contains a row labelled
OTHER
, which represents the total frequency of all other nouns
in the BNC. This value is needed in order to calculate the sample
sizes of the written and spoken part for frequency comparison tests.
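A hedged sketch of a frequency comparison based on this table (assuming the noun time is among the 60 selected nouns; see also the keyness examples):
# compare the relative frequency of "time" in written vs. spoken BNC
n.written <- sum(BNCcomparison$written)   # sample size, including the OTHER row
n.spoken  <- sum(BNCcomparison$spoken)
k <- subset(BNCcomparison, noun == "time")
chisq.pval(k$written, n.written, k$spoken, n.spoken)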
Stephanie Evert (https://purl.org/stephanie.evert)
Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage at http://www.natcorp.ox.ac.uk/.
This data set gives the number of documents and tokens in each of the 18 domains represented in the British National Corpus, World Edition (BNC). See Aston & Burnard (1998) for more information about the BNC and the domain classification, or go to http://www.natcorp.ox.ac.uk/.
BNCdomains
A data frame with 19 rows and the following columns:
domain
:name of the respective domain in the BNC
documents
:number of documents from this domain
tokens
:total number of tokens in all documents from this domain
For one document in the BNC, the domain classification is missing.
This document is represented by the code Unlabeled
in the data
set.
Marco Baroni <[email protected]>
Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage at http://www.natcorp.ox.ac.uk/.
This data set lists collocations (in the sense of Sinclair 1991) of the phrase in charge of found in the British National Corpus, World Edition (BNC). A span size of 3 and a frequency threshold of 5 were used, i.e. all words that occur at least five times within a distance of three tokens from the key phrase in charge of are listed as collocates. Note that collocations were not allowed to cross sentence boundaries.
See Aston & Burnard (1998) for more information about the BNC, or go to http://www.natcorp.ox.ac.uk/.
BNCInChargeOf
A data frame with 250 rows and the following columns:
collocate
:a collocate of the key phrase in charge of (word form)
f.in
:occurrences of the collocate within a distance of 3 tokens from the key phrase, i.e. inside the span
N.in
:total number of tokens inside the span
f.out
:occurrences of the collocate outside the span
N.out
:total number of tokens outside the span
Punctuation, numbers and any words containing non-alphabetic
characters (except for -
) were not considered as potential
collocates. Likewise, the number of tokens inside / outside the span
given in the columns N.in
and N.out
only includes simple
alphabetic word forms.
Stephanie Evert (https://purl.org/stephanie.evert)
Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage at http://www.natcorp.ox.ac.uk/.
Sinclair, John (1991). Corpus, Concordance, Collocation. Oxford University Press, Oxford.
This data set provides complete metadata for all 4048 texts of the British National Corpus (XML edition). See Aston & Burnard (1998) for more information about the BNC, or go to http://www.natcorp.ox.ac.uk/.
The data have automatically been extracted from the original BNC source files. Some transformations were applied so that all attribute names and their values are given in a human-readable form. The Perl scripts used in the extraction procedure are available from https://cwb.sourceforge.io/install.php#other.
BNCmeta
A data frame with 4048 rows and the columns listed below. Unless specified otherwise, columns are coded as factors.
id
:BNC document ID; character vector
title
:Title of the document; character vector
n_words
:Number of words in the document; integer vector
n_tokens
:Total number of tokens (including punctuation and deleted material); integer vector
n_w
:Number of w-units (words); integer vector
n_c
:Number of c-units (punctuation); integer vector
n_s
:Number of s-units (sentences); integer vector
publication_date
:Publication date
text_type
:Text type
context
:Spoken context
respondent_age
:Age-group of respondent
respondent_class
:Social class of respondent (NRS social grades)
respondent_sex
:Sex of respondent
interaction_type
:Interaction type
region
:Region
author_age
:Author age-group
author_domicile
:Domicile of author
author_sex
:Sex of author
author_type
:Author type
audience_age
:Audience age
domain
:Written domain
difficulty
:Written difficulty
medium
:Written medium
publication_place
:Publication place
sampling_type
:Sampling type
circulation
:Estimated circulation size
audience_sex
:Audience sex
availability
:Availability
mode
:Text mode (written/spoken)
derived_type
:Text class
genre
:David Lee's genre classification
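As a small illustration (not from the package manual), the metadata columns can be summarised or cross-tabulated directly:
# distribution of text mode across David Lee's genre classification (first few rows)
head(with(BNCmeta, table(genre, mode)))
# total number of words per text class
with(BNCmeta, tapply(n_words, derived_type, sum))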
Stephanie Evert (https://purl.org/stephanie.evert)
Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage at http://www.natcorp.ox.ac.uk/.
This data set contains a table of frequency counts obtained with a selection of BNCweb (Hoffmann et al. 2008) queries for each text document in the British National Corpus (Aston & Burnard 1998).
BNCqueries
A data frame with 4048 rows and 12 columns. The first column (id
) contains a character vector of
text IDs, the remaining columns contain integer vectors of the corresponding per-text frequency counts for
various BNCweb queries. Column names ending in .S
indicate sentence counts rather than token counts.
The list below shows the BNCweb query used for each feature in CEQL syntax (Hoffmann et al. 2008, Ch. 6).
id
:text ID
split.inf.S
:number of sentences containing a split infinitive with -ly adverb; query: _TO0 +ly_AV0 _V?I
adv.inf.S
:number of sentences containing a non-split infinitive with -ly adverb; query: +ly_AV0 _TO0 _V?I
superlative.S
:number of sentences containing a superlative adjective; query: the (_AJS | most _AJ0)
past.S
:number of sentences containing a past tense verb; query: _V?D
wh.question.S
:number of wh-questions; query: <s> _[PNQ,AVQ] _{V}
stop.to
:frequency of the expression stop to + verb; query: {stop/V} to _{V}
time
:frequency of the noun time; query: {time/N}
click
:frequency of the verb to click; query: {click/V}
noun
:frequency of common nouns; query: _NN?
nominalization
:frequency of nominalizations; query: +[tion,tions,ment,ments,ity,ities]_NN?
downtoner
:frequency of downtoners; query: [almost,barely,hardly,merely,mildly,nearly,only,partially,partly,practically,scarcely,slightly,somewhat]
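A hedged sketch of how these counts might be normalised with the token counts from BNCmeta (assuming rows can be matched via the id column):
# relative frequency of nominalizations per million words in each text
m <- merge(BNCqueries, BNCmeta[, c("id", "n_words")], by="id")
m$nominalization.pmw <- 1e6 * m$nominalization / m$n_words
head(m[, c("id", "nominalization", "nominalization.pmw")])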
Stephanie Evert (https://purl.org/stephanie.evert)
Aston, Guy and Burnard, Lou (1998). The BNC Handbook. Edinburgh University Press, Edinburgh. See also the BNC homepage at http://www.natcorp.ox.ac.uk/.
Hoffmann, Sebastian; Evert, Stefan; Smith, Nicholas; Lee, David; Berglund Prytz, Ylva (2008). Corpus Linguistics with BNCweb – a Practical Guide, volume 6 of English Corpus Linguistics. Peter Lang, Frankfurt am Main. See also http://corpora.lancs.ac.uk/BNCweb/.
This data set contains bigrams of adjacent word forms from the Brown corpus of written American English (Francis & Kucera 1964). Co-occurrence frequencies are specified in the form of an observed contingency table, using the notation suggested by Evert (2008).
Only bigrams that occur at least 5 times in the corpus are included.
BrownBigrams
A data frame with 24167 rows and the following columns:
id
:unique ID of the bigram entry
word1
:the first word form in the bigram (character)
pos1
:part-of-speech category of the first word (factor)
word2
:the second word form in the bigram (character)
pos2
:part-of-speech category of the second word (factor)
O11
:co-occurrence frequency of the bigram (numeric)
O12
:occurrences of the first word without the second (numeric)
O21
:occurrences of the second word without the first (numeric)
O22
:number of bigram tokens containing neither the first nor the second word (numeric)
Part-of-speech categories are identified by single-letter codes, corresponding to the first character of the Penn tagset. Some important POS codes are N (noun), V (verb), J (adjective), R (adverb or particle), I (preposition), D (determiner), W (wh-word) and M (modal).
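For illustration (a sketch, not taken from the package examples), each contingency table can be recast as a frequency comparison between positions after word1 and the rest of the corpus, so the vectorised functions of this package can serve as association measures:
# frequency of word2 after word1 (O11 of O11+O12 tokens) vs. elsewhere (O21 of O21+O22 tokens)
att <- transform(BrownBigrams,
                 score = keyness(O11, O11 + O12, O21, O21 + O22, measure="G2"))
head(att[order(-att$score), c("word1", "word2", "score")])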
Stephanie Evert (https://purl.org/stephanie.evert)
Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58, pages 1212–1248. Mouton de Gruyter, Berlin, New York.
Francis, W. N. and Kucera, H. (1964). Manual of information to accompany a standard sample of present-day edited American English, for use with digital computers. Technical report, Department of Linguistics, Brown University, Providence, RI.
This data set contains frequency counts of passive verb phrases for selected texts from the Brown corpus of written American English (Francis & Kucera 1964) and the comparable LOB corpus of written British English (Johansson et al. 1978).
BrownLOBPassives
A data frame with 622 rows and the following columns:
id
:a unique ID for each text (character)
passive
:number of passive verb phrases
n_w
:total number of words in the text
n_s
:total number of sentences in the text
cat
:genre category code (A
... R
; factor)
genre
:descriptive label for the genre category (factor)
lang
:language variety / source corpus (factor)
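The description above mentions frequency comparison with regression models; a minimal sketch of such a model (not taken from the SIGIL course materials), using a Poisson GLM with the word count as exposure:
# passive counts modelled by language variety and genre, with log word count as offset
mod <- glm(passive ~ lang + genre + offset(log(n_w)),
           data=BrownLOBPassives, family=poisson)
summary(mod)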
Stephanie Evert (https://purl.org/stephanie.evert)
Francis, W. N. and Kucera, H. (1964). Manual of information to accompany a standard sample of present-day edited American English, for use with digital computers. Technical report, Department of Linguistics, Brown University, Providence, RI.
Johansson, Stig; Leech, Geoffrey; Goodluck, Helen (1978). Manual of information to accompany the Lancaster-Oslo/Bergen corpus of British English, for use with digital computers. Technical report, Department of English, University of Oslo, Oslo.
This data set contains frequency counts of passive verb phrases in the Brown corpus of written American English (Francis & Kucera 1964), aggregated by genre category.
BrownPassives
A data frame with 15 rows and the following columns:
cat
:genre category code (A
... R
)
passive
:number of passive verb phrases
n_w
:total number of words in the genre category
n_s
:total number of sentences in the genre category
name
:descriptive label for the genre category
Stephanie Evert (https://purl.org/stephanie.evert)
Francis, W. N. and Kucera, H. (1964). Manual of information to accompany a standard sample of present-day edited American English, for use with digital computers. Technical report, Department of Linguistics, Brown University, Providence, RI.
This data set provides some basic quantitative measures for all texts in the Brown corpus of written American English (Francis & Kucera 1964).
BrownStats
A data frame with 500 rows and the following columns:
ty
:number of distinct types
to
:number of tokens (including punctuation)
se
:number of sentences
towl
:mean word length in characters, averaged over tokens
tywl
:mean word length in characters, averaged over types
Marco Baroni <[email protected]>
Francis, W. N. and Kucera, H. (1964). Manual of information to accompany a standard sample of present-day edited American English, for use with digital computers. Technical report, Department of Linguistics, Brown University, Providence, RI.
This function computes Pearson's chi-squared statistic (often written as X²) for frequency comparison data, with or without Yates' continuity correction. The implementation is based on the formula given by Evert (2004, 82).
chisq(k1, n1, k2, n2, correct = TRUE, one.sided=FALSE)
k1 |
frequency of a type in the first corpus (or an integer vector of type frequencies) |
n1 |
the sample size of the first corpus (or an integer vector specifying the sizes of different samples) |
k2 |
frequency of the type in the second corpus (or an integer
vector of type frequencies, in parallel to k1) |
n2 |
the sample size of the second corpus (or an integer vector
specifying the sizes of different samples, in parallel to n1) |
correct |
if TRUE (the default), apply Yates' continuity correction |
one.sided |
if TRUE, return the signed square root of the chi-squared statistic as a one-sided test statistic (see "Details"); the default is FALSE |
The values returned by this function are identical to those
computed by
chisq.test
. Unlike the latter, chisq
accepts vector arguments so that a large number of frequency
comparisons can be carried out with a single function call.
The one-sided test statistic (for one.sided=TRUE) is the signed square root of X². It is positive for k1/n1 > k2/n2 and negative for k1/n1 < k2/n2. Note that this statistic has a standard normal distribution rather than a chi-squared distribution under the null hypothesis of equal proportions.
The chi-squared statistic X² corresponding to the specified data (or a vector of X² values). This statistic has a chi-squared distribution with one degree of freedom (df = 1) under the null hypothesis of equal proportions.
Stephanie Evert (https://purl.org/stephanie.evert)
Evert, Stefan (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, Institut für maschinelle Sprachverarbeitung, University of Stuttgart. Published in 2005, URN urn:nbn:de:bsz:93-opus-23714. Available from http://www.collocations.de/phd.html.
chisq.pval
, chisq.test
,
cont.table
chisq.test(cont.table(99, 1000, 36, 1000))
chisq(99, 1000, 36, 1000)
This function computes the p-value of Pearson's chi-squared test for
the comparison of corpus frequency counts (under the null hypothesis
of equal population proportions). It is based on the chi-squared
statistic implemented by the
chisq
function.
chisq.pval(k1, n1, k2, n2, correct = TRUE, alternative = c("two.sided", "less", "greater"))
k1 |
frequency of a type in the first corpus (or an integer vector of type frequencies) |
n1 |
the sample size of the first corpus (or an integer vector specifying the sizes of different samples) |
k2 |
frequency of the type in the second corpus (or an integer
vector of type frequencies, in parallel to k1) |
n2 |
the sample size of the second corpus (or an integer vector specifying the sizes of different samples, in parallel to n1) |
correct |
if TRUE (the default), apply Yates' continuity correction |
alternative |
a character string specifying the alternative hypothesis; must be one of "two.sided" (default), "less" or "greater" |
The p-values returned by this function are identical to those
computed by chisq.test
(two-sided only) and
prop.test
(one-sided and two-sided) for two-by-two
contingency tables.
The p-value of Pearson's chi-squared test applied to the given data (or a vector of p-values).
Stephanie Evert (https://purl.org/stephanie.evert)
chisq
, fisher.pval
,
chisq.test
, prop.test
chisq.test(cont.table(99, 1000, 36, 1000))
chisq.pval(99, 1000, 36, 1000)
This is a convenience function which constructs 2x2 contingency tables
needed for frequency comparisons with chisq.test
, fisher.test
and similar functions.
cont.table(k1, n1, k2, n2, as.list=NA)
k1 |
frequency of a type in the first corpus, a numeric scalar or vector |
n1 |
the size of the first corpus (sample size), a numeric scalar or vector |
k2 |
frequency of the type in the second corpus, a numeric scalar or vector |
n2 |
the size of the second corpus (sample size), a numeric scalar or vector |
as.list |
whether multiple contingency tables can be constructed and are returned as a list (see "Details" below) |
If all four arguments k1 n1 k2 n2
are scalars (vectors of length 1),
cont.table
constructs a single contingency table, i.e. a 2x2 matrix.
If at least one argument has length > 1, shorter vectors are replicated as
necessary, and a list of 2x2 contingency tables is constructed.
With as.list=TRUE
, the return value is always a list, even if it contains
just a single contingency table. With as.list=FALSE
, only scalar arguments
are accepted and the return value is guaranteed to be a 2x2 matrix.
A numeric matrix containing a two-by-two contingency table for the specified frequency comparison, or a list of such matrices (see "Details").
Stephanie Evert (https://purl.org/stephanie.evert)
ct <- cont.table(42, 100, 66, 200)
ct
chisq.test(ct)
Several useful colour palettes for plots and other visualizations.
The function alpha.col can be used to turn colours (partially) translucent for use in crowded scatterplots.
corpora.palette(name=c("seaborn", "muted", "bright", "simple"), n=NULL, alpha=1)
alpha.col(col, alpha)
name |
name of the desired colour palette (see Details below) |
n |
optional: number of colours to return. The palette will be shortened or recycled as necessary. |
col |
a vector of R colour specifications (as accepted by standard R graphics functions) |
alpha |
alpha value between 0 and 1; values below 1 make the colours translucent |
Every colour palette starts with the colours black, red, green and blue in this order.
seaborn
, muted
and bright
are 7-colour palettes inspired by the seaborn data visualization library, but add a shade of dark grey as first colour.
simple
is a 10-colour palette based on R's default palette.
A character vector with colour names or hexadecimal RGB specifications.
Stephanie Evert (https://purl.org/stephanie.evert)
rgb
for R colour specification formats, palette
for setting the default colour palette
par.save <- par(mfrow=c(2, 2))
for (name in qw("seaborn muted bright simple")) {
  barplot(rep(1, 10), col=corpora.palette(name, 10), main=name)
}
par(par.save)
This data frame provides unsupervised distributional features for each text in the extended Brown Family of corpora (Brown, LOB, Frown, FLOB, BLOB), covering edited written American and British English from the 1930s, 1960s and 1990s (see Xiao 2008, 395–397).
Latent topic dimensions were obtained by a method similar to Latent Semantic Indexing (Deerwester et al. 1990), applying singular value decomposition to bag-of-words vectors for the 2500 texts in the extended Brown Family. Register dimensions were obtained with the same methodology, using vectors of part-of-speech frequencies (separately for all verb-related tags and all other tags).
DistFeatBrownFam
A data frame with 2500 rows and the following 23 columns:
id
:A unique ID for each text (also used as row name)
top1, top2, top3, top4, top5, top6, top7, top8, top9
:latent dimension scores for the first 9 topic dimensions
reg1, reg2, reg3, reg4, reg5, reg6, reg7, reg8, reg9
:latent dimension scores for the first 9 register dimensions (excluding verb-related tags)
vreg1, vreg2, vreg3, vreg4
:latent dimension scores for the first 4 register dimensions based only on verb-related tags
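A hedged sketch of the intended use as additional predictors (assuming the text IDs match those in BrownLOBPassives, which covers a subset of the Brown Family):
# add topic dimensions as predictors of passive frequency
d <- merge(BrownLOBPassives, DistFeatBrownFam, by="id")
mod <- glm(passive ~ lang + top1 + top2 + offset(log(n_w)),
           data=d, family=poisson)
summary(mod)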
TODO
Stephanie Evert (https://purl.org/stephanie.evert)
Deerwester, Scott; Dumais, Susan T.; Furnas, George W.; Landauer, Thomas K.; Harshman, Richard (1990). Indexing by latent semantic analysis. Journal of the American Society For Information Science, 41(6), 391–407.
Xiao, Richard (2008). Well-known and influential corpora. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 20, pages 383–457. Mouton de Gruyter, Berlin.
This function computes the p-value of Fisher's exact test (Fisher
1934) for the comparison of corpus frequency counts (under the null
hypothesis of equal population proportions). In the two-sided case,
a “central” p-value (Fay 2010) provides better numerical efficiency
than the likelihood-based approach of fisher.test
and is always
consistent with confidence intervals.
fisher.pval(k1, n1, k2, n2, alternative = c("two.sided", "less", "greater"), log.p = FALSE)
k1 |
frequency of a type in the first corpus (or an integer vector of type frequencies) |
n1 |
the sample size of the first corpus (or an integer vector specifying the sizes of different samples) |
k2 |
frequency of the type in the second corpus (or an integer
vector of type frequencies, in parallel to k1) |
n2 |
the sample size of the second corpus (or an integer vector specifying the sizes of different samples, in parallel to n1) |
alternative |
a character string specifying the alternative hypothesis; must be one of "two.sided" (default), "less" or "greater" |
log.p |
if TRUE, the natural logarithm of the p-value is returned |
For alternative="two.sided"
(the default), the p-value of the
“central” Fisher's exact test (Fay 2010) is computed, which
differs from the more common likelihood-based method implemented by
fisher.test
(and referred to as the “two-sided Fisher's
exact test” by Fay). This approach has two advantages:
(i) it is numerically robust and efficient, even for very large samples and frequency counts;
(ii) it is consistent with Clopper-Pearson type confidence intervals (see examples below).
For one-sided tests, the p-values returned by this function are identical
to those computed by fisher.test
on two-by-two contingency tables.
The p-value of Fisher's exact test applied to the given data (or a vector of p-values).
Stephanie Evert (https://purl.org/stephanie.evert)
Fay, Michael P. (2010). Confidence intervals that match Fisher's exact or Blaker's exact tests. Biostatistics, 11(2), 373-374.
Fisher, R. A. (1934). Statistical Methods for Research Workers. Oliver & Boyd, Edinburgh, 2nd edition (1st edition 1925, 14th edition 1970).
## Fisher's Tea Drinker (see ?fisher.test)
TeaTasting <- matrix(c(3, 1, 1, 3), nrow = 2,
                     dimnames = list(Guess = c("Milk", "Tea"), Truth = c("Milk", "Tea")))
print(TeaTasting)
## - the "corpora" consist of 4 cups of tea each (n1 = n2 = 4)
##   => columns of TeaTasting
## - frequency counts are the number of cups selected by drinker (k1 = 3, k2 = 1)
##   => first row of TeaTasting
## - null hypothesis of equal type probability = drinker makes random guesses
fisher.pval(3, 4, 1, 4, alternative="greater")
fisher.test(TeaTasting, alternative="greater")$p.value # should be the same
fisher.pval(3, 4, 1, 4)          # central Fisher's exact test is equal to
fisher.test(TeaTasting)$p.value  # standard two-sided Fisher's test for symmetric distribution
# inconsistency btw likelihood-based two-sided Fisher's test and confidence interval
# for 4/15 vs. 50/619 successes
fisher.test(cbind(c(4, 11), c(50, 619)))
# central Fisher's exact test is always consistent
fisher.pval(4, 15, 50, 619)
Compute best-practice keyness measures (according to Evert 2022) for the frequency comparison of lexical items in two corpora. The function is fully vectorised and should be applied to a complete data set of candidate items (so statistical analysis can be adjusted to control the family-wise error rate).
keyness(f1, n1, f2, n2, measure=c("LRC", "PositiveLRC", "G2", "LogRatio", "SimpleMaths"), conf.level=.95, alpha=NULL, p.adjust=TRUE, lambda=1)
f1 |
a numeric vector specifying the frequencies of candidate items in corpus A (target corpus) |
n1 |
sample size of target corpus, i.e. the total number of tokens in corpus A (usually a scalar, but can also be a vector parallel to f1) |
f2 |
a numeric vector parallel to f1, specifying the frequencies of the candidate items in corpus B (reference corpus) |
n2 |
sample size of reference corpus, i.e. the total number of tokens in corpus B (usually a scalar, but can also be a vector parallel to f2) |
measure |
the keyness measure to be computed (see “Details” below) |
conf.level |
the desired confidence level for the confidence intervals underlying the LRC and PositiveLRC measures (default: 0.95) |
alpha |
if specified, filter out candidate items whose frequency difference between corpus A and corpus B is not significant at level alpha (see "Details") |
p.adjust |
if TRUE (the default), apply a Bonferroni correction for multiple testing; alternatively, the family size can be given as a number, or the correction disabled with FALSE (see "Details") |
lambda |
parameter lambda of the SimpleMaths measure (see "Details") |
This function computes a range of best-practice keyness measures comparing the relative frequencies π1 and π2 of lexical items in populations (i.e. sublanguages) A and B, based on the observed sample frequencies f1, f2 and the corresponding sample sizes n1, n2.
The function is fully vectorised with respect to arguments f1, f2, n1 and n2, but only a single keyness measure can be selected for each function call.
All implemented measures are robust for the corner cases f1 = 0 and f2 = 0, but f1 = f2 = 0 is not allowed.
Most of the keyness measures are directional, i.e. positive scores indicate positive keyness in A (π1 > π2) and negative scores indicate negative keyness in A (π1 < π2). By contrast, the one-sided measures PositiveLRC and SimpleMaths only detect positive keyness in A, returning small (and possibly negative) scores otherwise, i.e. in case of insufficient evidence for π1 > π2 and in case of strong evidence for π1 < π2. One-sided measures can be useful for ranking the entire data set as positive keyword candidates.
Hardie (2014) and other authors recommend combining effect-size measures (in particular LogRatio) with a significance filter in order to weed out candidate items for which there is no significant evidence against the null hypothesis π1 = π2. Such a filter is activated by specifying the desired significance level alpha, and can be combined with all keyness measures. In this case, the scores of all non-significant candidate items are set to 0. The decision is based on the likelihood-ratio test implemented by the G2 measure and its asymptotic chi-squared distribution under the null hypothesis. Note that the significance filter can also be applied to the G2 measure itself, setting all scores below the critical value for the significance test to 0. For one-sided measures (PositiveLRC and SimpleMaths), candidates with significant evidence for negative keyness are also filtered out (i.e. their scores are set to 0) in order to ensure a consistent ranking.
By default, statistical inference corrects for multiple testing in order to control family-wise error rates.
This applies to the significance filter as well as to the confidence intervals underlying LRC
and PositiveLRC
.
Note that the G2
scores themselves are never adjusted (only the critical value for the significance filter is modified).
Family size is automatically determined from the number of candidate items processed in a single function call.
Alternatively, the family size can be specified explicitly in the
p.adjust
argument, e.g. if a large data set
is processed in multiple batches, or p.adjust=FALSE
can be used to disable the correction.
For the adjustment, a highly conservative Bonferroni correction is applied to significance levels.
Since the large candidate sets and sample sizes often found in corpus linguistics tend to produce large numbers of false positives,
this conservative approach is considered to be useful.
See Evert (2022) and its supplementary materials for a more detailed discussion of the implemented best-practice measures and some alternatives.
G2
The log-likelihood measure (Rayson & Garside 2000: 3) computes the score of a likelihood-ratio test for the null hypothesis π1 = π2. This test is two-sided and always returns positive values, so the sign of its score is inverted for π1 < π2 in order to obtain a directional keyness measure. As a pure significance measure, it tends to prefer high-frequency candidates with large f1.
LogRatio
A point estimate of the log relative risk log2(π1/π2), which has been suggested as an intuitive keyness measure under the name LogRatio by Hardie (2014: 45). The implementation uses Walter's (1975) adjusted estimator, which is less biased and robust against f1 = 0 or f2 = 0. As a pure effect-size measure, LogRatio tends to assign spuriously high scores to low-frequency candidates with small f1 and f2 (due to sampling variation). Combination with a significance filter is strongly recommended.
LRC
(default) A conservative estimate for LogRatio recommended by Evert (2022) in order to combine and balance the advantages of effect-size and significance measures. A confidence interval (according to the specified conf.level) for the relative risk π1/π2 is obtained from an exact conditional Poisson test (Fay 2010: 55), adjusted for multiple testing by default. If a candidate is not significant (i.e. the confidence interval includes 1), its score is set to 0. Otherwise the boundary of the confidence interval closer to 1 is taken as a conservative directional estimate of the relative risk, and its log2 is returned.
PositiveLRC
A one-sided variant of LRC, which returns the lower boundary of a one-sided confidence interval for log2(π1/π2). Scores ≤ 0 indicate that there is no significant evidence for positive keyness. The directional version of LRC is recommended for general use, but PositiveLRC may be preferred if the hermeneutic interpretation should also consider non-significant candidates (especially with small data sets).
SimpleMaths
The simple maths keyness measure (Kilgarriff 2009) used by the commercial corpus analysis platform Sketch Engine:
  SimpleMaths = (10^6 * f1/n1 + lambda) / (10^6 * f2/n2 + lambda)
Its frequency bias can be adjusted with the user parameter lambda. The scaling factor 10^6 corresponds to frequencies per million words and was chosen so that lambda = 1 is a practical default value. There does not appear to be a convincing mathematical justification behind this measure. It is included here only because of the popularity of the Sketch Engine platform.
A numeric vector of the same length as f1
and f2
, containing keyness scores for all candidate lexical items.
For most measures, positive scores indicate positive keywords (i.e. higher frequency in the population underlying corpus A)
and negative scores indicate negative keywords (i.e. higher frequency in the population underlying corpus B).
If alpha
is specified, non-significant candidates always have a score of 0.
Stephanie Evert (https://purl.org/stephanie.evert)
Evert, S. (2022). Measuring keyness. In Digital Humanities 2022: Conference Abstracts, pages 202-205, Tokyo, Japan / online. https://osf.io/cy6mw/
Fay, Michael P. (2010). Two-sided exact tests and matching confidence intervals for discrete data. The R Journal, 2(1), 53-58.
Hardie, A. (2014). A single statistical technique for keywords, lockwords, and collocations. Internal CASS working paper no. 1, unpublished.
Kilgarriff, A. (2009). Simple maths for keywords. In Proceedings of the Corpus Linguistics 2009 Conference, Liverpool, UK.
Rayson, P. and Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the ACL Workshop on Comparing Corpora, pages 1-6, Hong Kong.
Walter, S. D. (1975). The distribution of Levin’s measure of attributable risk. Biometrika, 62(2): 371-374.
prop.cint
, which is used by the exact conditional Poisson test of the LRC measure
# compute all keyness measures for a single candidate item with f1=7, f2=2 and n1=n2=1000
keyness(7, 1000, 2, 1000, measure="G2")        # log-likelihood
keyness(7, 1000, 2, 1000, measure="LogRatio")
keyness(7, 1000, 2, 1000, measure="LogRatio", alpha=0.05)  # with significance filter
keyness(7, 1000, 2, 1000, measure="LRC")       # the default measure
keyness(7, 1000, 2, 1000, measure="PositiveLRC")
keyness(7, 1000, 2, 1000, measure="SimpleMaths")
# a practical example: keywords of spoken British English (from BNC corpus)
n1 <- sum(BNCcomparison$spoken)   # sample sizes
n2 <- sum(BNCcomparison$written)
kw <- transform(BNCcomparison,
                G2 = keyness(spoken, n1, written, n2, measure="G2"),
                LogRatio = keyness(spoken, n1, written, n2, measure="LogRatio"),
                LRC = keyness(spoken, n1, written, n2))
kw <- kw[order(-kw$LogRatio), ]
head(kw, 20)
# collocations of "in charge of" with LRC as an association measure
colloc <- transform(BNCInChargeOf,
                    PosLRC = keyness(f.in, N.in, f.out, N.out, measure="PositiveLRC"))
colloc <- colloc[order(-colloc$PosLRC), ]
head(colloc, 30)
This data set lists 5102 frequent combinations of verbs and prepositional phrases (PP) extracted from a German newspaper corpus. The collocational status of each PP-verb combination was manually annotated by Brigitte Krenn (2000). In addition, pre-computed scores of several standard association measures are provided.
The KrennPPV
candidate set forms part of the data used in the evaluation study
of Evert & Krenn (2005).
KrennPPV
A data frame with 5102 rows and the following columns:
PP
:the prepositional phrase, represented by preposition and lemma of the nominal head (character).
Preposition-article fusion is indicated by a +
sign. For example, the prepositional phrase
im letzten Jahr would appear as in:Jahr
in the data set.
verb
:the verb lemma (character). Separated particle verbs have been recombined.
is.colloc
:whether the PP-verb combination is a lexical collocation (logical)
is.SVC
:whether a PP-verb collocation is a support verb construction (logical)
is.figur
:whether a PP-verb-collocation is a figurative expression (logical)
freq
:co-occurrence frequency of the PP-verb combination within clauses (integer)
MI
:Mutual Information association measure
Dice
:Dice coefficient association measure
z.score
:z-score association measure
t.score
:t-score association measure
chisq
:chi-squared association measure (without Yates' continuity correction)
chisq.corr
:chi-squared association measure (with Yates' continuity correction)
log.like
:log-likelihood association measure
Fisher
:Fisher's exact test as an association measure (negative logarithm of one-sided p-value)
See Evert (2008) and http://www.collocations.de/AM/ for details on these association measures.
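For example (a small sketch, not from the original documentation), the manual annotation can be used to evaluate a ranking by one of the pre-computed association scores:
# precision of true collocations among the 500 highest-ranked candidates by log-likelihood
idx <- order(KrennPPV$log.like, decreasing=TRUE)[1:500]
mean(KrennPPV$is.colloc[idx])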
Stephanie Evert (https://purl.org/stephanie.evert)
Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58, pages 1212–1248. Mouton de Gruyter, Berlin, New York.
Evert, Stefan and Krenn, Brigitte (2005). Using small random samples for the manual evaluation of statistical association measures. Computer Speech and Language, 19(4), 450–466.
Krenn, Brigitte (2000). The Usual Suspects: Data-Oriented Models for the Identification and Representation of Lexical Collocations, volume 7 of Saarbrücken Dissertations in Computational Linguistics and Language Technology. DFKI & Universität des Saarlandes, Saarbrücken, Germany.
This data set contains frequency counts of passive verb phrases in the LOB corpus of written British English (Johansson et al. 1978), aggregated by genre category.
LOBPassives
A data frame with 15 rows and the following columns:
cat
:genre category code (A
... R
)
passive
:number of passive verb phrases
n_w
:total number of words in the genre category
n_s
:total number of sentences in the genre category
name
:descriptive label for the genre category
Stephanie Evert (https://purl.org/stephanie.evert)
Johansson, Stig; Leech, Geoffrey; Goodluck, Helen (1978). Manual of information to accompany the Lancaster-Oslo/Bergen corpus of British English, for use with digital computers. Technical report, Department of English, University of Oslo, Oslo.
BrownPassives
, BrownLOBPassives
This data set provides some basic quantitative measures for all texts in the LOB corpus of written British English (Johansson et al. 1978).
LOBStats
A data frame with 500 rows and the following columns:
ty
:number of distinct types
to
:number of tokens (including punctuation)
se
:number of sentences
towl
:mean word length in characters, averaged over tokens
tywl
:mean word length in characters, averaged over types
Marco Baroni <[email protected]>
Johansson, Stig; Leech, Geoffrey; Goodluck, Helen (1978). Manual of information to accompany the Lancaster-Oslo/Bergen corpus of British English, for use with digital computers. Technical report, Department of English, University of Oslo, Oslo.
This data set specifies the number of passive and active verb phrases for each text in the extended Brown Family of corpora (Brown, LOB, Frown, FLOB, BLOB), covering edited written American and British English from the 1930s, 1960s and 1990s (see Xiao 2008, 395–397).
Verb phrase and passive/active aspect counts are based on a fully automatic analysis of the texts, using the Pro3Gres parser (Schneider et al. 2004).
PassiveBrownFam
A data frame with 2499 rows and the following 11 columns:
id
:A unique ID for each text (also used as row name)
corpus
:Corpus, a factor with five levels BLOB
, Brown
, LOB
, Frown
, FLOB
section
:Genre, a factor with fifteen levels A
, ..., R
(Brown section codes)
genre
:Genre labels, a factor with fifteen levels (e.g. press reportage
)
period
:Date of publication, a factor with three levels (1930
, 1960
, 1990
)
lang
:Language variety / region, a factor with levels AmE
(U.S.) and BrE
(UK)
n.words
:Number of word tokens, an integer vector
act
:Number of active verb phrases, an integer vector
pass
:Number of passive verb phrases, an integer vector
verbs
:Total number of verb phrases, an integer vector
p.pass
:Percentage of passive verb phrases in the text, a numeric vector
No frequency data could be obtained for text N02
in the Frown corpus. This entry has been omitted from the table.
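As a quick illustration (a sketch, not part of the manual), the passive percentage can be summarised across the design factors:
# mean percentage of passive verb phrases by publication period and language variety
with(PassiveBrownFam, tapply(p.pass, list(period, lang), mean))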
Frequency information for this data set was kindly provided by Gerold Schneider, University of Zurich (http://www.cl.uzh.ch/de/people/team/compling/gschneid.html).
Stephanie Evert (https://purl.org/stephanie.evert)
Schneider, Gerold; Rinaldi, Fabio; Dowdall, James (2004). Fast, deep-linguistic statistical dependency parsing. In G.-J. M. Kruijff and D. Duchier (eds.), Proceedings of the COLING 2004 Workshop on Recent Advances in Dependency Grammar, pages 33-40, Geneva, Switzerland. https://files.ifi.uzh.ch/cl/gschneid/parser/
Xiao, Richard (2008). Well-known and influential corpora. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 20, pages 383–457. Mouton de Gruyter, Berlin.
This function computes a confidence interval for a population proportion from the corresponding frequency count in a sample. It either uses the Clopper-Pearson method (inverted exact binomial test) or the Wilson score method (inversion of a z-score test, with or without continuity correction).
prop.cint(k, n, method = c("binomial", "z.score"), correct = TRUE, p.adjust=FALSE, conf.level = 0.95, alternative = c("two.sided", "less", "greater"))
k |
frequency of a type in the corpus (or an integer vector of frequencies) |
n |
number of tokens in the corpus, i.e. sample size (or an integer vector specifying the sizes of different samples) |
method |
a character string specifying whether to compute a Clopper-Pearson confidence interval ("binomial", the default) or a Wilson score confidence interval ("z.score") |
correct |
if TRUE (the default), apply Yates' continuity correction (only relevant for the Wilson score method) |
p.adjust |
if TRUE, apply a Bonferroni correction for multiple testing, with the family size given by the number of intervals computed; alternatively, the family size can be specified as a number (default: FALSE) |
conf.level |
the desired confidence level (defaults to 95%) |
alternative |
a character string specifying the alternative hypothesis, yielding a two-sided ("two.sided", the default) or one-sided ("less", "greater") confidence interval |
The confidence intervals computed by this function correspond to those
returned by binom.test
and prop.test
,
respectively. However, prop.cint
accepts vector arguments,
allowing many confidence intervals to be computed with a single
function call in a computationally efficient manner.
The Clopper-Pearson confidence interval (binomial) is obtained by inverting the exact binomial test at significance level alpha = 1 - conf.level. In the two-sided case, the p-value of the test is computed using the “central” method (Fay 2010: 53), i.e. as twice the tail probability of the matching tail. This corresponds to the algorithm originally proposed by Clopper & Pearson (1934).
The limits of the confidence interval are computed in an efficient and numerically robust manner via (the inverse of) the incomplete Beta function.
The Wilson score confidence interval (z.score) is computed by solving the equation of the z-score test

  (k - n*p) / sqrt(n * p * (1 - p)) = ±z

for p, where z is the z-value corresponding to the chosen confidence level (e.g. z = 1.96 for a two-sided test with 95% confidence). This leads to the quadratic equation

  p^2 * (n + z^2) - p * (2*k + z^2) + k^2 / n = 0

whose two solutions correspond to the lower and upper boundary of the confidence interval.
When Yates' continuity correction is applied, the value k in the numerator of the z-score equation has to be replaced by k ± 1/2: k - 1/2 for the lower boundary of the confidence interval (where k/n > p) and k + 1/2 for the upper boundary of the confidence interval (where k/n < p). In each case, the corresponding solution of the quadratic equation has to be chosen (i.e., the solution with p < k/n for the lower boundary and vice versa).
If a Bonferroni correction is applied, the significance level alpha of the underlying test is divided by the number m of tests carried out (specified explicitly by the user or given implicitly by length(k)): alpha' = alpha / m.
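A small sketch of the vectorised interface together with the Bonferroni adjustment described above (assuming p.adjust=TRUE enables the correction as documented here):
# simultaneous Clopper-Pearson intervals for three frequency counts,
# with significance levels adjusted for a family of three tests
prop.cint(c(19, 5, 42), c(100, 50, 300), method="binomial", p.adjust=TRUE)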
A data frame with two columns, labelled lower
for the lower
boundary and upper
for the upper boundary of the confidence
interval. The number of rows is determined by the length of the
longest input vector (k
, n
and conf.level
).
Stephanie Evert (https://purl.org/stephanie.evert)
Clopper, C. J. & Pearson, E. S. (1934). The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika, 26(4), 404-413.
Fay, Michael P. (2010). Two-sided exact tests and matching confidence intervals for discrete data. The R Journal, 2(1), 53-58.
https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
z.score.pval
, prop.test
,
binom.pval
, binom.test
# Clopper-Pearson confidence interval
binom.test(19, 100)
prop.cint(19, 100, method="binomial")
# Wilson score confidence interval
prop.test(19, 100)
prop.cint(19, 100, method="z.score")
This function splits one or more character strings into words. By default,
the strings are split on whitespace in order to emulate Perl's qw()
(quote words) functionality.
qw(s, sep="\\s+", names=FALSE)
s | one or more strings to be split (a character vector) |
sep | PCRE regular expression on which to split (defaults to whitespace) |
names | if TRUE, the resulting character vector is labelled with itself, which is convenient when iterating over it with lapply or sapply |
A character vector of the resulting words. Multiple strings in s are flattened into a single vector.
If names=TRUE, the words are used both as values and as labels of the character vector, which is convenient when iterating over it with lapply or sapply.
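A small sketch of this convenience (the corpus names and the statistic computed here are arbitrary):

# iterate over a list of labels; sapply() picks up the names automatically
corpus.names <- qw("brown lob frown flob", names=TRUE)
sapply(corpus.names, nchar)   # returns a vector labelled brown, lob, frown, flob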
Stephanie Evert (https://purl.org/stephanie.evert)
qw(c("alpha beta gamma", "42 111" )) qw("alpha beta gamma", names=TRUE) qw("words with blanks, sep by commas", sep="\\s*,\\s*")
qw(c("alpha beta gamma", "42 111" )) qw("alpha beta gamma", names=TRUE) qw("words with blanks, sep by commas", sep="\\s*,\\s*")
This utility function converts a plain vector into a row or column vector, i.e. a single-row or single-column matrix.
rowVector(x, label=NULL) colVector(x, label=NULL)
x | a (typically numeric) vector |
label | an optional character string specifying a label for the single row or column returned |
A single-row or single-column matrix of the same data type as x. Labels of x are preserved as column/row names of the matrix.
See matrix for details on how non-atomic objects are handled.
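As an illustration (not part of the original documentation), row and column vectors can be combined with ordinary matrix multiplication:

x <- rowVector(c(1, 2, 3), "x")   # 1 x 3 matrix
y <- colVector(c(4, 5, 6), "y")   # 3 x 1 matrix
x %*% y                           # inner product as a 1 x 1 matrix (= 32)
y %*% x                           # outer product as a 3 x 3 matrix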
Stephanie Evert (https://purl.org/stephanie.evert)
rowVector(1:5, "myvec") colVector(c(A=1, B=2, C=3), label="myvec")
This function takes a random sample of rows from a data frame,
in analogy to the built-in function sample
(which sadly
does not accept a data frame).
sample.df(df, size, replace=FALSE, sort=FALSE, prob=NULL)
df | a data frame to be sampled from |
size | positive integer giving the number of rows to choose |
replace | Should sampling be with replacement? |
sort | Should the rows in the sample be sorted in their original order? |
prob | a vector of probability weights for obtaining the elements of the vector being sampled |
Internally, rows are selected with the function sample.int. See its manual page for details on the arguments (except for sort) and implementation.
A data frame containing the sampled rows of df, either in their original order (sort=TRUE) or shuffled randomly (sort=FALSE).
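A short sketch with a toy data frame (the column names and probability weights are made up for illustration):

df <- data.frame(id = 1:10, x = round(rnorm(10), 2))
sample.df(df, 5, sort=TRUE)        # 5 random rows, kept in their original order
sample.df(df, 5, prob = df$id)     # rows with larger id are more likely to be drawn
sample.df(df, 20, replace=TRUE)    # with replacement, size may exceed nrow(df)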
Stephanie Evert (https://purl.org/stephanie.evert)
sample.df(BrownLOBPassives, 20, sort=TRUE)
This function generates a large simulated census data frame with body measurements (height, weight, shoe size) for male and female inhabitants of a highly fictitious country.
The generated data set is usually named FakeCensus
(see code examples below)
and is used for various exercises and illustrations in the SIGIL course.
simulated.census(N=502202, p.male=0.55, seed.rng=42)
N | population size, i.e. number of inhabitants of the fictitious country |
p.male | proportion of males in the country |
seed.rng | seed for the random number generator, so that data sets generated with the same parameters are reproducible |
The default population size corresponds to the estimated population of Luxembourg on 1 January 2010 (according to https://en.wikipedia.org/wiki/Luxembourg).
Further parameters of the simulation (standard deviation, correlations, non-linearity) will be exposed as function arguments in future releases.
A data frame with N
rows corresponding to inhabitants and the following columns:
height: body height in cm
weight: body weight in kg
shoe.size: shoe size in Paris points (Continental European scale)
sex: sex of the inhabitant, either m or f
Stephanie Evert (https://purl.org/stephanie.evert)
FakeCensus <- simulated.census()
summary(FakeCensus)
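Building on the example above, a brief exploratory sketch (not part of the original documentation); since the default seed is fixed, the results are reproducible:

table(FakeCensus$sex)                             # roughly 55% male with the default p.male
tapply(FakeCensus$height, FakeCensus$sex, mean)   # average body height by sex
cor(FakeCensus$height, FakeCensus$shoe.size)      # correlation of height and shoe size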
This function generates simulated results of a study measuring the effectiveness of a new corpus-driven foreign language teaching course.
The generated data set is usually named LanguageCourse
(see code examples below)
and is used for various exercises and illustrations in the SIGIL course.
simulated.language.course(n=c(15,20,10,10,14,18,15), mean=c(60,50,30,70,55,50,60), effect=c(5,8,12,-4,2,6,-5), sd.subject=15, sd.effect=5, seed.rng=42)
n | number of participants in each class |
mean | average score of each class before the course |
effect | improvement of each class during the course |
sd.subject | inter-subject variability, may be different in each class |
sd.effect | inter-subject variability of effect size, may also be different in each class |
seed.rng | seed for the random number generator, so data sets with the same parameters are reproducible |
TODO
A data frame with sum(n)
rows corresponding to individual subjects participating in the study and the following columns
id: unique ID code of subject
class: name of the teaching class
pre: score in standardized language test before the course (pre-test)
post: score in standardized language test after the course (post-test)
Stephanie Evert (https://purl.org/stephanie.evert)
LanguageCourse <- simulated.language.course()
head(LanguageCourse, 20)
summary(LanguageCourse)
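For instance, the effectiveness question can be examined with a paired comparison of pre- and post-test scores; this is only a sketch, not part of the original documentation:

# overall improvement across all participants (paired t-test)
t.test(LanguageCourse$post, LanguageCourse$pre, paired=TRUE)
# average improvement per class
with(LanguageCourse, tapply(post - pre, class, mean))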
This function generates type and token counts, type-token ratios (TTR) and average word length for simulated articles from the English Wikipedia. Simulation parameters are based on data from the Wackypedia corpus.
The generated data set is usually named WackypediaStats
(see code examples below)
and is used for various exercises and illustrations in the SIGIL course.
simulated.wikipedia(N=1429649, length=c(100,1000), seed.rng=42)
N | population size, i.e. total number of Wikipedia articles |
length | a numeric vector of length 2, specifying the typical range of Wikipedia article lengths |
seed.rng | seed for the random number generator, so that data sets generated with the same parameters are reproducible |
The default population size corresponds to the subset of the Wackypedia corpus from which the simulation parameters were obtained. This excludes all articles with extreme type-token statistics (very short, very long, extremely long words, etc.).
Article lengths are sampled from a lognormal distribution which is scaled so that the central 95% of the values fall into the range specified by the length argument. The simulated data are surprisingly close to the original Wackypedia statistics.
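A quick sketch of how the claim about article lengths can be checked (the exact quantiles will vary slightly around the nominal range):

WackypediaStats <- simulated.wikipedia()
# central 95% of article lengths should lie roughly within the default range of 100..1000 tokens
quantile(WackypediaStats$tokens, probs=c(0.025, 0.975))
hist(log10(WackypediaStats$tokens), main="Simulated article lengths", xlab="log10(tokens)")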
A data frame with N
rows corresponding to Wikipedia articles and the following columns:
tokens: number of word tokens in the article
types: number of distinct word types in the article
ttr: type-token ratio (TTR) for the article
avglen: average word length in characters (averaged across tokens)
Stephanie Evert (https://purl.org/stephanie.evert)
The Wackypedia corpus can be obtained from https://wacky.sslmit.unibo.it/doku.php?id=corpora.
WackypediaStats <- simulated.wikipedia()
summary(WackypediaStats)
A simple utility function that converts p-values into the customary significance stars.
stars.pval(x)
x | a numeric vector of non-negative p-values |
A character vector with significance stars corresponding to the p-values.
Significance levels are *** (p < .001), ** (p < .01), * (p < .05) and . (p < .1). For non-significant p-values (p >= .1), an empty string is returned.
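A small sketch combining stars.pval() with the vectorised binom.pval() from this package (the counts below are made up):

k <- c(3, 12, 19, 26)
pv <- binom.pval(k, 100, p=.15)   # two-sided binomial tests against pi = 0.15
data.frame(k=k, p.value=round(pv, 4), signif=stars.pval(pv))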
Stephanie Evert (https://purl.org/stephanie.evert)
stars.pval(c(0, .007, .01, .04, .1))
This data set contains a small corpus (8043 tokens) of short stories from the collection Very Short Stories (VSS, see http://www.schtepf.de/History/pages/stories.html). The text was automatically segmented (tokenised) and annotated with part-of-speech tags (from the Penn tagset) and lemmas (base forms), using the IMS TreeTagger (Schmid 1994) and a custom lemmatizer.
VSS
A data set with 8043 rows corresponding to tokens and the following columns:
word: the word form (or surface form) of the token
pos: the part-of-speech tag of the token (Penn tagset)
lemma: the lemma (or base form) of the token
sentence: number of the sentence in which the token occurs (integer)
story: title of the story to which the token belongs (factor)
The Penn tagset defines the following part-of-speech tags:
CC | Coordinating conjunction |
CD | Cardinal number |
DT | Determiner |
EX | Existential there |
FW | Foreign word |
IN | Preposition or subordinating conjunction |
JJ | Adjective |
JJR | Adjective, comparative |
JJS | Adjective, superlative |
LS | List item marker |
MD | Modal |
NN | Noun, singular or mass |
NNS | Noun, plural |
NP | Proper noun, singular |
NPS | Proper noun, plural |
PDT | Predeterminer |
POS | Possessive ending |
PP | Personal pronoun |
PP$ | Possessive pronoun |
RB | Adverb |
RBR | Adverb, comparative |
RBS | Adverb, superlative |
RP | Particle |
SYM | Symbol |
TO | to |
UH | Interjection |
VB | Verb, base form |
VBD | Verb, past tense |
VBG | Verb, gerund or present participle |
VBN | Verb, past participle |
VBP | Verb, non-3rd person singular present |
VBZ | Verb, 3rd person singular present |
WDT | Wh-determiner |
WP | Wh-pronoun |
WP$ | Possessive wh-pronoun |
WRB | Wh-adverb |
Stephanie Evert (https://purl.org/stephanie.evert)
Schmid, Helmut (1994). Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the International Conference on New Methods in Language Processing (NeMLaP), pages 44-49.
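Since this entry has no examples section, here is a brief exploratory sketch (not part of the original documentation):

head(VSS, 10)                                 # first tokens of the corpus
table(VSS$story)                              # number of tokens per story
sort(table(VSS$pos), decreasing=TRUE)[1:10]   # the 10 most frequent part-of-speech tags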
This function computes a z-score statistic for frequency counts, based on a normal approximation to the correct binomial distribution under the random sampling model.
z.score(k, n, p = 0.5, correct = TRUE)
k | frequency of a type in the corpus (or an integer vector of frequencies) |
n | number of tokens in the corpus, i.e. sample size (or an integer vector specifying the sizes of different samples) |
p | null hypothesis, giving the assumed proportion of this type in the population (or a vector of proportions for different types and/or different populations) |
correct | if TRUE, apply Yates' continuity correction (default) |
The z statistic is given by

$$ z = \frac{k - np}{\sqrt{n p (1 - p)}} $$

When Yates' continuity correction is enabled, the absolute value of the numerator is reduced by 1/2, but clamped to a non-negative value.
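The formula can be verified directly against the function; a minimal sketch, with correct=FALSE disabling the continuity correction as described above:

k <- 19; n <- 100; p <- 0.15
(k - n*p) / sqrt(n * p * (1 - p))    # manual computation of the z statistic
z.score(k, n, p=p, correct=FALSE)    # should give the same value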
The z-score corresponding to the specified data (or a vector of z-scores).
Stephanie Evert (https://purl.org/stephanie.evert)
# z-test for H0: pi = 0.15 with observed counts 10..30 in a sample of n=100 tokens
k <- c(10:30)
z <- z.score(k, 100, p=.15)
names(z) <- k
round(z, 3)
abs(z) >= 1.96  # significant results at p < .05
This function computes the p-value of a z-score test for frequency counts, based on the z-score statistic implemented by z.score.
z.score.pval(k, n, p = 0.5, correct = TRUE, alternative = c("two.sided", "less", "greater"))
k | frequency of a type in the corpus (or an integer vector of frequencies) |
n | number of tokens in the corpus, i.e. sample size (or an integer vector specifying the sizes of different samples) |
p | null hypothesis, giving the assumed proportion of this type in the population (or a vector of proportions for different types and/or different populations) |
correct | if TRUE, apply Yates' continuity correction (default) |
alternative | a character string specifying the alternative hypothesis; must be one of two.sided (default), less or greater |
The p-value of a z-score test applied to the given data (or a vector of p-values).
Stephanie Evert (https://purl.org/stephanie.evert)
z.score, binom.pval, prop.cint
# compare z-test for H0: pi = 0.15 against binomial test
# with observed counts 10..30 in a sample of n=100 tokens
k <- c(10:30)
p.compare <- rbind(
  z.score = z.score.pval(k, 100, p=.15),
  binomial = binom.pval(k, 100, p=.15))
colnames(p.compare) <- k
round(p.compare, 4)