Title: Text Mining Package
Description: A framework for text mining applications within R.
Authors: Ingo Feinerer [aut], Kurt Hornik [aut, cre], Artifex Software, Inc. [ctb, cph] (pdf_info.ps taken from GPL Ghostscript)
Maintainer: Kurt Hornik <[email protected]>
License: GPL-3
Version: 0.7-15
Built: 2024-12-09 20:25:19 UTC
Source: https://github.com/r-forge/tm
This data set holds 50 news articles with additional meta information from the Reuters-21578 data set. All documents belong to the topic acq dealing with corporate acquisitions.
data("acq")
A VCorpus of 50 text documents.
Reuters-21578 Text Categorization Collection Distribution 1.0 (XML format).
Lewis, David (1997). Reuters-21578 Text Categorization Collection Distribution. UCI Machine Learning Repository. doi:10.24432/C52G6M.
data("acq") acq
data("acq") acq
Create content transformers, i.e., functions which modify the content of an R object.
content_transformer(FUN)
FUN | a function. |
A function with two arguments:
x: an R object with implemented content getter (content) and setter (content<-) functions.
...: arguments passed over to FUN.
tm_map for an interface to apply transformations to corpora.
data("crude") crude[[1]] (f <- content_transformer(function(x, pattern) gsub(pattern, "", x))) tm_map(crude, f, "[[:digit:]]+")[[1]]
data("crude") crude[[1]] (f <- content_transformer(function(x, pattern) gsub(pattern, "", x))) tm_map(crude, f, "[[:digit:]]+")[[1]]
Representing and computing on corpora.
Corpora are collections of documents containing (natural language) text. In packages which employ the infrastructure provided by package tm, such corpora are represented via the virtual S3 class Corpus: such packages then provide S3 corpus classes extending the virtual base class (such as VCorpus provided by package tm itself).
All extension classes must provide accessors to extract subsets ([), individual documents ([[), and metadata (meta). The function length must return the number of documents, and as.list must construct a list holding the documents.
A corpus can have two types of metadata (accessible via meta). Corpus metadata contains corpus-specific metadata in the form of tag-value pairs. Document level metadata contains document-specific metadata but is stored in the corpus as a data frame. Document level metadata is typically used for semantic reasons (e.g., classifications of documents form an entity of their own due to some high-level information like the range of possible values) or for performance reasons (a single access instead of extracting the metadata of each document).
The function Corpus is a convenience alias to SimpleCorpus or VCorpus, depending on the arguments provided.
SimpleCorpus, VCorpus, and PCorpus for the corpora classes provided by package tm. DCorpus for a distributed corpus class provided by package tm.plugin.dc.
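A short sketch of the two metadata levels, using the bundled crude data (the meta section below documents the full interface):
data("crude")
meta(crude, type = "corpus")   # corpus metadata: tag-value pairs
meta(crude)                    # document level metadata: a data frame
meta(crude[[1]])               # metadata stored locally at a single document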
This data set holds 20 news articles with additional meta information from the Reuters-21578 data set. All documents belong to the topic crude dealing with crude oil.
data("crude")
A VCorpus of 20 text documents.
Reuters-21578 Text Categorization Collection Distribution 1.0 (XML format).
Lewis, David (1997). Reuters-21578 Text Categorization Collection Distribution. UCI Machine Learning Repository. doi:10.24432/C52G6M.
data("crude") crude
data("crude") crude
Create a data frame source.
DataframeSource(x)
x | A data frame giving the texts and metadata. |
A data frame source interprets each row of the data frame x as a document. The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text" and contain a UTF-8 encoded string representing the document's content. Optional additional columns are used as document level metadata.
An object inheriting from DataframeSource, SimpleSource, and Source.
Source for basic information on the source infrastructure employed by package tm, and meta for types of metadata. readtext for reading in text in multiple formats suitable to be processed by DataframeSource.
docs <- data.frame(doc_id = c("doc_1", "doc_2"),
                   text = c("This is a text.", "This another one."),
                   dmeta1 = 1:2, dmeta2 = letters[1:2],
                   stringsAsFactors = FALSE)
(ds <- DataframeSource(docs))
x <- Corpus(ds)
inspect(x)
meta(x)
Create a directory source.
DirSource(directory = ".", encoding = "", pattern = NULL, recursive = FALSE, ignore.case = FALSE, mode = "text")
directory | A character vector of full path names; the default corresponds to the working directory. |
encoding | a character string describing the current encoding. It is passed to iconv to convert the input to UTF-8. |
pattern | an optional regular expression. Only file names which match the regular expression will be returned. |
recursive | logical. Should the listing recurse into directories? |
ignore.case | logical. Should pattern-matching be case-insensitive? |
mode | a character string specifying if and how files should be read in. Available modes are "" (no read), "binary", and "text". |
A directory source acquires a list of files via dir and interprets each file as a document.
An object inheriting from DirSource, SimpleSource, and Source.
Source for basic information on the source infrastructure employed by package tm. Encoding and iconv on encodings.
DirSource(system.file("texts", "txt", package = "tm"))
Access the document IDs and terms of a term-document matrix or document-term matrix, or their number.
Docs(x)
nDocs(x)
nTerms(x)
Terms(x)
x | Either a DocumentTermMatrix or a TermDocumentMatrix. |
For Docs and Terms, a character vector with document IDs and terms, respectively.
For nDocs and nTerms, an integer with the number of document IDs and terms, respectively.
data("crude") tdm <- TermDocumentMatrix(crude)[1:10,1:20] Docs(tdm) nDocs(tdm) nTerms(tdm) Terms(tdm)
data("crude") tdm <- TermDocumentMatrix(crude)[1:10,1:20] Docs(tdm) nDocs(tdm) nTerms(tdm) Terms(tdm)
Find associations in a document-term or term-document matrix.
## S3 method for class 'DocumentTermMatrix'
findAssocs(x, terms, corlimit)
## S3 method for class 'TermDocumentMatrix'
findAssocs(x, terms, corlimit)
x | A DocumentTermMatrix or a TermDocumentMatrix. |
terms | a character vector holding terms. |
corlimit | a numeric vector (of the same length as terms; recycled otherwise) for the inclusive lower correlation limits of each term, in the range from zero to one. |
A named list. Each list component is named after a term in terms and contains a named numeric vector. Each vector holds matching terms from x and their rounded correlations satisfying the inclusive lower correlation limit of corlimit.
data("crude") tdm <- TermDocumentMatrix(crude) findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1))
data("crude") tdm <- TermDocumentMatrix(crude) findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1))
Find frequent terms in a document-term or term-document matrix.
findFreqTerms(x, lowfreq = 0, highfreq = Inf)
x | A DocumentTermMatrix or a TermDocumentMatrix. |
lowfreq | A numeric for the lower frequency bound. |
highfreq | A numeric for the upper frequency bound. |
This method works for all numeric weightings but is probably most meaningful for the standard term frequency (tf) weighting of x.
A character vector of those terms in x which occur at least lowfreq and at most highfreq times.
data("crude") tdm <- TermDocumentMatrix(crude) findFreqTerms(tdm, 2, 3)
data("crude") tdm <- TermDocumentMatrix(crude) findFreqTerms(tdm, 2, 3)
Find most frequent terms in a document-term or term-document matrix, or a vector of term frequencies.
findMostFreqTerms(x, n = 6L, ...)
## S3 method for class 'DocumentTermMatrix'
findMostFreqTerms(x, n = 6L, INDEX = NULL, ...)
## S3 method for class 'TermDocumentMatrix'
findMostFreqTerms(x, n = 6L, INDEX = NULL, ...)
x | A DocumentTermMatrix or TermDocumentMatrix, or a vector of term frequencies as obtained by termFreq(). |
n | A single integer giving the maximal number of terms. |
INDEX | an object specifying a grouping of documents for rollup, or NULL (default), in which case the most frequent terms are computed for each document individually. |
... | arguments to be passed to or from methods. |
Only terms with positive frequencies are included in the results.
For the document-term or term-document matrix methods, a list with the named frequencies of the up to n most frequent terms occurring in each document (group). Otherwise, a single such vector of most frequent terms.
data("crude") ## Term frequencies: tf <- termFreq(crude[[14L]]) findMostFreqTerms(tf) ## Document-term matrices: dtm <- DocumentTermMatrix(crude) ## Most frequent terms for each document: findMostFreqTerms(dtm) ## Most frequent terms for the first 10 the second 10 documents, ## respectively: findMostFreqTerms(dtm, INDEX = rep(1 : 2, each = 10L))
data("crude") ## Term frequencies: tf <- termFreq(crude[[14L]]) findMostFreqTerms(tf) ## Document-term matrices: dtm <- DocumentTermMatrix(crude) ## Most frequent terms for each document: findMostFreqTerms(dtm) ## Most frequent terms for the first 10 the second 10 documents, ## respectively: findMostFreqTerms(dtm, INDEX = rep(1 : 2, each = 10L))
Read document-term matrices stored in special file formats.
read_dtm_Blei_et_al(file, vocab = NULL)
read_dtm_MC(file, scalingtype = NULL)
file | a character string with the name of the file to read. |
vocab | a character string with the name of a vocabulary file (giving the terms, one per line), or NULL. |
scalingtype | a character string specifying the type of scaling to be used, or NULL (default), in which case the scaling is inferred from the names of the files found. |
read_dtm_Blei_et_al reads the (List of Lists type sparse matrix) format employed by the Latent Dirichlet Allocation and Correlated Topic Model C codes by Blei et al (http://www.cs.columbia.edu/~blei/).
MC is a toolkit for creating vector models from text documents (see https://www.cs.utexas.edu/~dml/software/mc/). It employs a variant of Compressed Column Storage (CCS) sparse matrix format, writing data into several files with suitable names: e.g., a file with ‘_dim’ appended to the base file name stores the matrix dimensions. The non-zero entries are stored in a file the name of which indicates the scaling type used: e.g., ‘_tfx_nz’ indicates scaling by term frequency (‘t’), inverse document frequency (‘f’) and no normalization (‘x’). See ‘README’ in the MC sources for more information.
read_dtm_MC reads such sparse matrix information with argument file giving the path with the base file name.
read_stm_MC in package slam.
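A minimal usage sketch; the base file name "mymatrix" and the presence of MC output files (e.g., 'mymatrix_dim' and 'mymatrix_tfx_nz') next to it are assumptions for illustration:
## Scaling inferred from the names of the files found next to the base name:
dtm <- read_dtm_MC("mymatrix")
## Or with the scaling type given explicitly:
dtm <- read_dtm_MC("mymatrix", scalingtype = "tfx")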
Predefined tokenizers.
getTokenizers()
A character vector with tokenizers provided by package tm.
Boost_tokenizer, MC_tokenizer and scan_tokenizer.
getTokenizers()
Predefined transformations (mappings) which can be used with tm_map.
getTransformations()
A character vector with transformations provided by package tm.
removeNumbers, removePunctuation, removeWords, stemDocument, and stripWhitespace. content_transformer to create custom transformations.
getTransformations()
Parallelize applying a function over a list or vector according to the registered parallelization engine.
tm_parLapply(X, FUN, ...)
tm_parLapply_engine(new)
X | A vector (atomic or list), or other objects suitable for the engine in use. |
FUN | the function to be applied to each element of X. |
... | optional arguments to FUN. |
new | an object inheriting from class cluster, a function, or NULL. |
Parallelization can be employed to speed up some of the embarrassingly parallel computations performed in package tm, specifically tm_index(), tm_map() on a non-lazy-mapped VCorpus, and TermDocumentMatrix() on a VCorpus or PCorpus.
Functions tm_parLapply() and tm_parLapply_engine() can be used to customize parallelization according to the available resources.
tm_parLapply_engine() is used for getting (with no arguments) or setting (with argument new) the parallelization engine employed (see below for examples).
If the engine is set to an object inheriting from class cluster, tm_parLapply() calls parLapply() with this cluster and the given arguments. If set to a function, tm_parLapply() calls the function with the given arguments. Otherwise, it simply calls lapply().
Hence, parallelization via parLapply() and a default cluster registered via setDefaultCluster() can be achieved via
tm_parLapply_engine(function(X, FUN, ...) parallel::parLapply(NULL, X, FUN, ...))
or by re-registering the cluster, say cl, using tm_parLapply_engine(cl) (note that since R version 3.5.0, one can use getDefaultCluster() to get the registered default cluster). Using
tm_parLapply_engine(function(X, FUN, ...) parallel::parLapplyLB(NULL, X, FUN, ...))
or
tm_parLapply_engine(function(X, FUN, ...) parallel::parLapplyLB(cl, X, FUN, ...))
gives load-balancing parallelization with the registered default or given cluster, respectively. To achieve parallelization via forking (on Unix-alike platforms), one can use the above with clusters created by makeForkCluster(), or use
tm_parLapply_engine(parallel::mclapply)
or
tm_parLapply_engine(function(X, FUN, ...) parallel::mclapply(X, FUN, ..., mc.cores = n))
to use mclapply() with the default or given number n of cores.
A list the length of X, with the result of applying FUN together with the ... arguments to each element of X.
makeCluster(), parLapply(), parLapplyLB(), and mclapply().
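For instance, registering a PSOCK cluster could look as follows (a sketch; clearing the engine with NULL to fall back to serial lapply() is an assumption consistent with the dispatch rules above):
library("parallel")
cl <- makeCluster(2L)
tm_parLapply_engine(cl)    # tm now parallelizes via parLapply() on cl
data("crude")
tdm <- TermDocumentMatrix(crude)
tm_parLapply_engine(NULL)  # back to serial lapply()
stopCluster(cl)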
Inspect, i.e., display detailed information on a corpus, a term-document matrix, or a text document.
## S3 method for class 'PCorpus'
inspect(x)
## S3 method for class 'VCorpus'
inspect(x)
## S3 method for class 'TermDocumentMatrix'
inspect(x)
## S3 method for class 'TextDocument'
inspect(x)
x | Either a corpus, a term-document matrix, or a text document. |
data("crude") inspect(crude[1:3]) inspect(crude[[1]]) tdm <- TermDocumentMatrix(crude)[1:10, 1:10] inspect(tdm)
data("crude") inspect(crude[1:3]) inspect(crude[[1]]) tdm <- TermDocumentMatrix(crude)[1:10, 1:10] inspect(tdm)
Accessing and modifying metadata of text documents and corpora.
## S3 method for class 'PCorpus'
meta(x, tag = NULL, type = c("indexed", "corpus", "local"), ...)
## S3 replacement method for class 'PCorpus'
meta(x, tag, type = c("indexed", "corpus", "local"), ...) <- value
## S3 method for class 'SimpleCorpus'
meta(x, tag = NULL, type = c("indexed", "corpus"), ...)
## S3 replacement method for class 'SimpleCorpus'
meta(x, tag, type = c("indexed", "corpus"), ...) <- value
## S3 method for class 'VCorpus'
meta(x, tag = NULL, type = c("indexed", "corpus", "local"), ...)
## S3 replacement method for class 'VCorpus'
meta(x, tag, type = c("indexed", "corpus", "local"), ...) <- value
## S3 method for class 'PlainTextDocument'
meta(x, tag = NULL, ...)
## S3 replacement method for class 'PlainTextDocument'
meta(x, tag = NULL, ...) <- value
## S3 method for class 'XMLTextDocument'
meta(x, tag = NULL, ...)
## S3 replacement method for class 'XMLTextDocument'
meta(x, tag = NULL, ...) <- value
DublinCore(x, tag = NULL)
DublinCore(x, tag) <- value
x | For DublinCore, a TextDocument; for meta, a TextDocument or a Corpus. |
tag | a character giving the name of a metadatum. No tag corresponds to all available metadata. |
type | a character specifying the kind of corpus metadata (see Details). |
... | Not used. |
value | replacement value. |
A corpus has two types of metadata. Corpus metadata ("corpus"
)
contains corpus specific metadata in form of tag-value pairs.
Document level metadata ("indexed"
) contains document specific
metadata but is stored in the corpus as a data frame. Document level metadata
is typically used for semantic reasons (e.g., classifications of documents
form an own entity due to some high-level information like the range of
possible values) or for performance reasons (single access instead of
extracting metadata of each document). The latter can be seen as a from of
indexing, hence the name "indexed"
. Document metadata
("local"
) are tag-value pairs directly stored locally at the individual
documents.
DublinCore is a convenience wrapper to access and modify the metadata of a text document using the Simple Dublin Core schema (supporting the 15 metadata elements from the Dublin Core Metadata Element Set, https://dublincore.org/documents/dces/).
Dublin Core Metadata Initiative. https://dublincore.org/
meta for metadata in package NLP.
data("crude") meta(crude[[1]]) DublinCore(crude[[1]]) meta(crude[[1]], tag = "topics") meta(crude[[1]], tag = "comment") <- "A short comment." meta(crude[[1]], tag = "topics") <- NULL DublinCore(crude[[1]], tag = "creator") <- "Ano Nymous" DublinCore(crude[[1]], tag = "format") <- "XML" DublinCore(crude[[1]]) meta(crude[[1]]) meta(crude) meta(crude, type = "corpus") meta(crude, "labels") <- 21:40 meta(crude)
data("crude") meta(crude[[1]]) DublinCore(crude[[1]]) meta(crude[[1]], tag = "topics") meta(crude[[1]], tag = "comment") <- "A short comment." meta(crude[[1]], tag = "topics") <- NULL DublinCore(crude[[1]], tag = "creator") <- "Ano Nymous" DublinCore(crude[[1]], tag = "format") <- "XML" DublinCore(crude[[1]]) meta(crude[[1]]) meta(crude) meta(crude, type = "corpus") meta(crude, "labels") <- 21:40 meta(crude)
Create permanent corpora.
PCorpus(x,
        readerControl = list(reader = reader(x), language = "en"),
        dbControl = list(dbName = "", dbType = "DB1"))
x | A Source object. |
readerControl | a named list of control parameters for reading in content from x: reader is a function capable of reading in and processing the format delivered by x, and language gives the texts' language (preferably as an IETF language tag). |
dbControl | a named list of control parameters for the underlying database storage provided by package filehash: dbName gives the file name of the database, and dbType a valid database type supported by package filehash. |
A permanent corpus stores documents outside of R in a database. Since
multiple PCorpus
R objects with the same underlying database can
exist simultaneously in memory, changes in one get propagated to all
corresponding objects (in contrast to the default R semantics).
An object inheriting from PCorpus and Corpus.
Corpus for basic information on the corpus infrastructure employed by package tm.
VCorpus provides an implementation with volatile storage semantics.
txt <- system.file("texts", "txt", package = "tm") ## Not run: PCorpus(DirSource(txt), dbControl = list(dbName = "pcorpus.db", dbType = "DB1")) ## End(Not run)
txt <- system.file("texts", "txt", package = "tm") ## Not run: PCorpus(DirSource(txt), dbControl = list(dbName = "pcorpus.db", dbType = "DB1")) ## End(Not run)
Create plain text documents.
PlainTextDocument(x = character(0),
                  author = character(0),
                  datetimestamp = as.POSIXlt(Sys.time(), tz = "GMT"),
                  description = character(0),
                  heading = character(0),
                  id = character(0),
                  language = character(0),
                  origin = character(0),
                  ...,
                  meta = NULL,
                  class = NULL)
x | A character string giving the plain text content. |
author | a character string or an object of class person giving the author names. |
datetimestamp | an object of class POSIXt or a character string giving the creation date/time information. |
description | a character string giving a description. |
heading | a character string giving the title or a short heading. |
id | a character string giving a unique identifier. |
language | a character string giving the language (preferably as IETF language tags, see language in package NLP). |
origin | a character string giving information on the source and origin. |
... | user-defined document metadata tag-value pairs. |
meta | a named list or NULL (default) giving all metadata. If set, all other metadata arguments are ignored. |
class | a character vector or NULL (default) giving additional classes to be used for the created plain text document. |
An object inheriting from class, PlainTextDocument, and TextDocument.
TextDocument for basic information on the text document infrastructure employed by package tm.
(ptd <- PlainTextDocument("A simple plain text document",
                          heading = "Plain text document",
                          id = basename(tempfile()),
                          language = "en"))
meta(ptd)
Visualize correlations between terms of a term-document matrix.
## S3 method for class 'TermDocumentMatrix'
plot(x,
     terms = sample(Terms(x), 20),
     corThreshold = 0.7,
     weighting = FALSE,
     attrs = list(graph = list(rankdir = "BT"),
                  node = list(shape = "rectangle", fixedsize = FALSE)),
     ...)
x | A term-document matrix. |
terms | Terms to be plotted. Defaults to 20 randomly chosen terms of the term-document matrix. |
corThreshold | Do not plot correlations below this threshold. Defaults to 0.7. |
weighting | Define whether the line width corresponds to the correlation. |
attrs | Argument passed to the plot method for class graphNEL. |
... | Other arguments passed to the graphNEL plot method. |
Visualization requires that package Rgraphviz is available.
## Not run:
data(crude)
tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         stopwords = TRUE))
plot(tdm, corThreshold = 0.2, weighting = TRUE)
## End(Not run)
Read in a text document from a row in a data frame.
readDataframe(elem, language, id)
elem | a named list with the component content which must hold the document to be read in. |
language | a string giving the language. |
id | Not used. |
A PlainTextDocument representing elem$content.
Reader for basic information on the reader infrastructure employed by package tm.
docs <- data.frame(doc_id = c("doc_1", "doc_2"),
                   text = c("This is a text.", "This another one."),
                   stringsAsFactors = FALSE)
ds <- DataframeSource(docs)
elem <- getElem(stepNext(ds))
result <- readDataframe(elem, "en", NULL)
inspect(result)
meta(result)
Return a function which reads in a Microsoft Word document extracting its text.
readDOC(engine = c("antiword", "executable"), AntiwordOptions = "")
engine | a character string for the preferred DOC extraction engine (see Details). |
AntiwordOptions | Options passed over to the antiword executable. |
Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but can access passed over arguments (e.g., options to antiword) via lexical scoping.
Available DOC extraction engines are as follows.
"antiword" (default) Antiword utility as provided by the function antiword in package antiword.
"executable" command line antiword executable which must be installed and accessible on your system. This can convert documents from Microsoft Word version 2, 6, 7, 97, 2000, 2002 and 2003 to plain text. The character vector AntiwordOptions is passed over to the executable.
A function with the following formals:
elem: a list with the named component uri which must hold a valid file name.
language: a string giving the language.
id: Not used.
The function returns a PlainTextDocument representing the text and metadata extracted from elem$uri.
Reader for basic information on the reader infrastructure employed by package tm.
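This section ships no example; a hypothetical sketch, assuming the command line antiword executable is installed and a Word file "report.doc" exists in the working directory:
reader <- readDOC(engine = "executable")
doc <- reader(elem = list(uri = "report.doc"), language = "en", id = "doc1")
content(doc)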
Creating readers.
getReaders()
Readers are functions for extracting textual content and metadata out of elements delivered by a Source, and for constructing a TextDocument. A reader must accept the following arguments in its signature:
elem: a named list with the components content and uri (as delivered by a Source via getElem or pGetElem).
language: a character string giving the language.
id: a character giving a unique identifier for the created text document.
The element elem is typically provided by a source whereas the language and the identifier are normally provided by a corpus constructor (for the case that elem$content does not give information on these two essential items).
In case a reader expects configuration arguments we can use a function generator. A function generator is indicated by inheriting from class FunctionGenerator and function. It allows us to process additional arguments, store them in an environment, return a reader function with the well-defined signature described above, and still be able to access the additional arguments via lexical scoping. All corpus constructors in package tm check the reader function for being a function generator and if so apply it to yield the reader with the expected signature.
For getReaders(), a character vector with readers provided by package tm.
readDOC, readPDF, readPlain, readRCV1, readRCV1asPlain, readReut21578XML, readReut21578XMLasPlain, and readXML.
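A minimal custom reader following the signature above (a sketch, not part of package tm; it simply upper-cases the delivered content):
myReader <- function(elem, language, id)
    PlainTextDocument(toupper(elem$content), id = id, language = language)
corpus <- VCorpus(VectorSource(c("first text", "second text")),
                  readerControl = list(reader = myReader))
content(corpus[[1]])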
Return a function which reads in a portable document format (PDF) document extracting both its text and its metadata.
readPDF(engine = c("pdftools", "xpdf", "Rpoppler", "ghostscript", "Rcampdf", "custom"),
        control = list(info = NULL, text = NULL))
engine | a character string for the preferred PDF extraction engine (see Details). |
control | a list of control options for the engine with the named components info and text (see Details). |
Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but can access passed over arguments (e.g., the preferred PDF extraction engine and control options) via lexical scoping.
Available PDF extraction engines are as follows.
"pdftools" (default) Poppler PDF rendering library as provided by the functions pdf_info and pdf_text in package pdftools.
"xpdf" command line pdfinfo and pdftotext executables which must be installed and accessible on your system. Suitable utilities are provided by the Xpdf (http://www.xpdfreader.com/) PDF viewer or by the Poppler (https://poppler.freedesktop.org/) PDF rendering library.
"Rpoppler" Poppler PDF rendering library as provided by the functions PDF_info and PDF_text in package Rpoppler.
"ghostscript" Ghostscript using ‘pdf_info.ps’ and ‘ps2ascii.ps’.
"Rcampdf" Perl CAM::PDF PDF manipulation library as provided by the functions pdf_info and pdf_text in package Rcampdf, available from the repository at http://datacube.wu.ac.at.
"custom" custom user-provided extraction engine.
Control parameters for engine "xpdf" are as follows.
info: a character vector specifying options passed over to the pdfinfo executable.
text: a character vector specifying options passed over to the pdftotext executable.
Control parameters for engine "custom" are as follows.
info: a function extracting metadata from a PDF. The function must accept a file path as first argument and must return a named list with the components Author (as character string), CreationDate (of class POSIXlt), Subject (as character string), Title (as character string), and Creator (as character string).
text: a function extracting content from a PDF. The function must accept a file path as first argument and must return a character vector.
A function with the following formals:
elem: a named list with the component uri which must hold a valid file name.
language: a string giving the language.
id: Not used.
The function returns a PlainTextDocument representing the text and metadata extracted from elem$uri.
Reader for basic information on the reader infrastructure employed by package tm.
uri <- paste0("file://", system.file(file.path("doc", "tm.pdf"), package = "tm")) engine <- if(nzchar(system.file(package = "pdftools"))) { "pdftools" } else { "ghostscript" } reader <- readPDF(engine) pdf <- reader(elem = list(uri = uri), language = "en", id = "id1") cat(content(pdf)[1]) VCorpus(URISource(uri, mode = ""), readerControl = list(reader = readPDF(engine = "ghostscript")))
uri <- paste0("file://", system.file(file.path("doc", "tm.pdf"), package = "tm")) engine <- if(nzchar(system.file(package = "pdftools"))) { "pdftools" } else { "ghostscript" } reader <- readPDF(engine) pdf <- reader(elem = list(uri = uri), language = "en", id = "id1") cat(content(pdf)[1]) VCorpus(URISource(uri, mode = ""), readerControl = list(reader = readPDF(engine = "ghostscript")))
Read in a text document without knowledge about its internal structure and possible available metadata.
readPlain(elem, language, id)
elem | a named list with the component content which must hold the document to be read in. |
language | a string giving the language. |
id | a character giving a unique identifier for the created text document. |
A PlainTextDocument representing elem$content. The argument id is used as fallback if elem$uri is null.
Reader for basic information on the reader infrastructure employed by package tm.
docs <- c("This is a text.", "This another one.") vs <- VectorSource(docs) elem <- getElem(stepNext(vs)) (result <- readPlain(elem, "en", "id1")) meta(result)
docs <- c("This is a text.", "This another one.") vs <- VectorSource(docs) elem <- getElem(stepNext(vs)) (result <- readPlain(elem, "en", "id1")) meta(result)
Read in a Reuters Corpus Volume 1 XML document.
readRCV1(elem, language, id)
readRCV1asPlain(elem, language, id)
elem | a named list with the component content which must hold the document to be read in. |
language | a string giving the language. |
id | Not used. |
An XMLTextDocument for readRCV1, or a PlainTextDocument for readRCV1asPlain, representing the text and metadata extracted from elem$content.
Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F (2004). RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5, 361–397. https://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf
Reader for basic information on the reader infrastructure employed by package tm.
f <- system.file("texts", "rcv1_2330.xml", package = "tm")
f_bin <- readBin(f, raw(), file.size(f))
rcv1 <- readRCV1(elem = list(content = f_bin), language = "en", id = "id1")
content(rcv1)
meta(rcv1)
Read in a Reuters-21578 XML document.
readReut21578XML(elem, language, id)
readReut21578XMLasPlain(elem, language, id)
elem | a named list with the component content which must hold the document to be read in. |
language | a string giving the language. |
id | Not used. |
An XMLTextDocument for readReut21578XML, or a PlainTextDocument for readReut21578XMLasPlain, representing the text and metadata extracted from elem$content.
Lewis, David (1997). Reuters-21578 Text Categorization Collection Distribution. UCI Machine Learning Repository. doi:10.24432/C52G6M.
Reader for basic information on the reader infrastructure employed by package tm.
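This section ships no example; a sketch, assuming the Reuters-21578 XML files behind the crude data are installed under 'texts/crude' in package tm:
reut21578 <- system.file("texts", "crude", package = "tm")
reuters <- VCorpus(DirSource(reut21578, mode = "binary"),
                   readerControl = list(reader = readReut21578XMLasPlain))
inspect(reuters[[1]])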
Return a function which reads in a text document containing POS-tagged words.
readTagged(...)
... | Arguments passed to TaggedTextDocument. |
Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but can access passed over arguments (...) via lexical scoping.
A function with the following formals:
elem: a named list with the component content which must hold the document to be read in or the component uri holding a connection object or a character string.
language: a string giving the language.
id: a character giving a unique identifier for the created text document.
The function returns a TaggedTextDocument representing the text and metadata extracted from elem$content or elem$uri. The argument id is used as fallback if elem$uri is null.
Reader for basic information on the reader infrastructure employed by package tm.
# See http://www.nltk.org/book/ch05.html or file ca01 in the Brown corpus
x <- paste("The/at grand/jj jury/nn commented/vbd on/in a/at number/nn of/in",
           "other/ap topics/nns ,/, among/in them/ppo the/at Atlanta/np and/cc",
           "Fulton/np-tl County/nn-tl purchasing/vbg departments/nns which/wdt",
           "it/pps said/vbd ``/`` are/ber well/ql operated/vbn and/cc follow/vb",
           "generally/rb accepted/vbn practices/nns which/wdt inure/vb to/in the/at",
           "best/jjt interest/nn of/in both/abx governments/nns ''/'' ./.")
vs <- VectorSource(x)
elem <- getElem(stepNext(vs))
(doc <- readTagged()(elem, language = "en", id = "id1"))
tagged_words(doc)
Return a function which reads in an XML document. The structure of the XML document is described with a specification.
readXML(spec, doc)
spec | A named list of lists each containing two components. The constructed reader will map each list entry to the content or a metadatum of the text document as specified by the named list entry. Valid names include content to access the document's content, and character strings which are mapped to metadata entries. Each list entry must consist of two components: the first must be a string describing the type of the second argument, and the second is the specification entry. Valid combinations are ("node", an XPath expression), ("attribute", an XPath expression), ("function", a function processing the XML tree), and ("unevaluated", a value returned unchanged). |
doc | An (empty) document of some subclass of TextDocument. |
Formally this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but can access passed over arguments (e.g., the specification) via lexical scoping.
A function with the following formals:
elem: a named list with the component content which must hold the document to be read in.
language: a string giving the language.
id: a character giving a unique identifier for the created text document.
The function returns doc augmented by the parsed information as described by spec out of the XML file in elem$content. The arguments language and id are used as fallback: language if no corresponding metadata entry is found in elem$content, and id if no corresponding metadata entry is found in elem$content and if elem$uri is null.
Reader for basic information on the reader infrastructure employed by package tm. Vignette 'Extensions: How to Handle Custom File Formats', and XMLSource.
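A minimal sketch of a spec (the <doc>, <title> and <body> element names and XPath expressions target a hypothetical XML format and are assumptions, not part of the package docs):
myXMLReader <- readXML(
    spec = list(content = list("node", "/doc/body"),
                heading = list("node", "/doc/title")),
    doc = PlainTextDocument())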
Remove numbers from a text document.
## S3 method for class 'character'
removeNumbers(x, ucp = FALSE, ...)
## S3 method for class 'PlainTextDocument'
removeNumbers(x, ...)
x | a character vector or text document. |
ucp | a logical specifying whether to use Unicode character properties for determining digit characters. If FALSE (default), characters in the ASCII [:digit:] class are used; otherwise, the characters with Unicode general category Nd (Decimal_Number). |
... | arguments to be passed to or from methods; in particular, from the PlainTextDocument method to the character method. |
The text document without numbers.
getTransformations to list available transformation (mapping) functions.
https://unicode.org/reports/tr44/#General_Category_Values.
data("crude") crude[[1]] removeNumbers(crude[[1]])
data("crude") crude[[1]] removeNumbers(crude[[1]])
Remove punctuation marks from a text document.
## S3 method for class 'character'
removePunctuation(x,
                  preserve_intra_word_contractions = FALSE,
                  preserve_intra_word_dashes = FALSE,
                  ucp = FALSE, ...)
## S3 method for class 'PlainTextDocument'
removePunctuation(x, ...)
x | a character vector or text document. |
preserve_intra_word_contractions | a logical specifying whether intra-word contractions should be kept. |
preserve_intra_word_dashes | a logical specifying whether intra-word dashes should be kept. |
ucp | a logical specifying whether to use Unicode character properties for determining punctuation characters. If FALSE (default), characters in the ASCII [:punct:] class are used; otherwise, the characters with Unicode general category P (Punctuation). |
... | arguments to be passed to or from methods; in particular, from the PlainTextDocument method to the character method. |
The character or text document x without punctuation marks (besides intra-word contractions (‘'’) and intra-word dashes (‘-’) if preserve_intra_word_contractions and preserve_intra_word_dashes are set, respectively).
getTransformations to list available transformation (mapping) functions.
regex shows the class [:punct:] of punctuation characters.
https://unicode.org/reports/tr44/#General_Category_Values.
data("crude") inspect(crude[[14]]) inspect(removePunctuation(crude[[14]])) inspect(removePunctuation(crude[[14]], preserve_intra_word_contractions = TRUE, preserve_intra_word_dashes = TRUE))
data("crude") inspect(crude[[14]]) inspect(removePunctuation(crude[[14]])) inspect(removePunctuation(crude[[14]], preserve_intra_word_contractions = TRUE, preserve_intra_word_dashes = TRUE))
Remove sparse terms from a document-term or term-document matrix.
removeSparseTerms(x, sparse)
x | A DocumentTermMatrix or a TermDocumentMatrix. |
sparse | A numeric for the maximal allowed sparsity, in the open interval (0, 1). |
A term-document matrix where those terms from x are removed which have at least a sparse percentage of empty elements (i.e., terms occurring 0 times in a document). In other words, the resulting matrix contains only terms with a sparse factor smaller than sparse.
data("crude") tdm <- TermDocumentMatrix(crude) removeSparseTerms(tdm, 0.2)
data("crude") tdm <- TermDocumentMatrix(crude) removeSparseTerms(tdm, 0.2)
Remove words from a text document.
## S3 method for class 'character'
removeWords(x, words)
## S3 method for class 'PlainTextDocument'
removeWords(x, ...)
x | A character or text document. |
words | A character vector giving the words to be removed. |
... | passed over argument words. |
The character or text document without the specified words.
getTransformations to list available transformation (mapping) functions.
remove_stopwords provided by package tau.
data("crude") crude[[1]] removeWords(crude[[1]], stopwords("english"))
data("crude") crude[[1]] removeWords(crude[[1]], stopwords("english"))
Create simple corpora.
SimpleCorpus(x, control = list(language = "en"))
SimpleCorpus(x, control = list(language = "en"))
x |
a |
control |
a named list of control parameters.
|
A simple corpus is fully kept in memory. Compared to a VCorpus, it is optimized for the most common usage scenario: importing plain texts from files in a directory or directly from a vector in R, preprocessing and transforming the texts, and finally exporting them to a term-document matrix. It adheres to the Corpus API. However, it takes internally various shortcuts to boost performance and minimize memory pressure; consequently it operates only under the following constraints:
only DataframeSource, DirSource and VectorSource are supported,
no custom readers, i.e., each document is read in and stored as plain text (as a string, i.e., a character vector of length one),
transformations applied via tm_map must be able to process character vectors and return character vectors (of the same length),
no lazy transformations in tm_map,
no meta data for individual documents (i.e., no "local" in meta).
An object inheriting from SimpleCorpus and Corpus.
Corpus for basic information on the corpus infrastructure employed by package tm.
VCorpus provides an implementation with volatile storage semantics, and PCorpus provides an implementation with permanent storage semantics.
txt <- system.file("texts", "txt", package = "tm") (ovid <- SimpleCorpus(DirSource(txt, encoding = "UTF-8"), control = list(language = "lat")))
txt <- system.file("texts", "txt", package = "tm") (ovid <- SimpleCorpus(DirSource(txt, encoding = "UTF-8"), control = list(language = "lat")))
Creating and accessing sources.
SimpleSource(encoding = "", length = 0, position = 0, reader = readPlain, ..., class)
getSources()
## S3 method for class 'SimpleSource'
close(con, ...)
## S3 method for class 'SimpleSource'
eoi(x)
## S3 method for class 'DataframeSource'
getMeta(x)
## S3 method for class 'DataframeSource'
getElem(x)
## S3 method for class 'DirSource'
getElem(x)
## S3 method for class 'URISource'
getElem(x)
## S3 method for class 'VectorSource'
getElem(x)
## S3 method for class 'XMLSource'
getElem(x)
## S3 method for class 'SimpleSource'
length(x)
## S3 method for class 'SimpleSource'
open(con, ...)
## S3 method for class 'DataframeSource'
pGetElem(x)
## S3 method for class 'DirSource'
pGetElem(x)
## S3 method for class 'URISource'
pGetElem(x)
## S3 method for class 'VectorSource'
pGetElem(x)
## S3 method for class 'SimpleSource'
reader(x)
## S3 method for class 'SimpleSource'
stepNext(x)
x | A Source. |
con | A Source. |
encoding | a character giving the encoding of the elements delivered by the source. |
length | a non-negative integer denoting the number of elements delivered by the source. If the length is unknown in advance set it to 0. |
position | a numeric indicating the current position in the source. |
reader | a reader function (generator). |
... | For SimpleSource, tag-value pairs for storing additional information; otherwise not used. |
class | a character vector giving additional classes to be used for the created source. |
Sources abstract input locations, like a directory, a connection, or simply an R vector, in order to acquire content in a uniform way. In packages which employ the infrastructure provided by package tm, such sources are represented via the virtual S3 class Source: such packages then provide S3 source classes extending the virtual base class (such as DirSource provided by package tm itself).
All extension classes must provide implementations for the functions close, eoi, getElem, length, open, reader, and stepNext. For parallel element access the (optional) function pGetElem must be provided as well. If document level metadata is available, the (optional) function getMeta must be implemented.
The functions open and close open and close the source, respectively. eoi indicates end of input. getElem fetches the element at the current position, whereas pGetElem retrieves all elements in parallel at once. The function length gives the number of elements. reader returns a default reader for processing elements. stepNext increases the position in the source to acquire the next element.
The function SimpleSource provides a simple reference implementation and can be used when creating custom sources.
For SimpleSource, an object inheriting from class, SimpleSource, and Source.
For getSources, a character vector with sources provided by package tm.
open and close return the opened and closed source, respectively.
For eoi, a logical indicating if the end of input of the source is reached.
For getElem, a named list with the components content holding the document and uri giving a uniform resource identifier (e.g., a file path or URL; NULL if not applicable or unavailable). For pGetElem, a list of such named lists.
For length, an integer for the number of elements.
For reader, a function for the default reader.
DataframeSource, DirSource, URISource, VectorSource, and XMLSource.
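A minimal custom source built on SimpleSource (a sketch in the spirit of the 'Extensions' vignette; the class name exampleSource is an assumption):
exampleSource <- function(x)
    SimpleSource(length = length(x), content = x, class = "exampleSource")
## Method dispatch on the custom class delivers one element per position:
getElem.exampleSource <- function(x)
    list(content = x$content[x$position], uri = NULL)
s <- exampleSource(c("one text", "another text"))
inspect(VCorpus(s))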
Heuristically complete stemmed words.
stemCompletion(x, dictionary,
               type = c("prevalent", "first", "longest", "none", "random", "shortest"))
x | A character vector of stems to be completed. |
dictionary | A Corpus or character vector to be searched for possible completions. |
type | A character naming the heuristics to be used: "prevalent" (default) takes the most frequent completion, "first" the first found completion, "longest" the longest completion, "none" is the identity, "random" takes some completion, and "shortest" the shortest completion. |
A character vector with completed words.
Ingo Feinerer (2010). Analysis and Algorithms for Stemming Inversion. Information Retrieval Technology — 6th Asia Information Retrieval Societies Conference, AIRS 2010, Taipei, Taiwan, December 1–3, 2010. Proceedings, volume 6458 of Lecture Notes in Computer Science, pages 290–299. Springer-Verlag, December 2010.
data("crude") stemCompletion(c("compan", "entit", "suppl"), crude)
data("crude") stemCompletion(c("compan", "entit", "suppl"), crude)
Stem words in a text document using Porter's stemming algorithm.
## S3 method for class 'character'
stemDocument(x, language = "english")
## S3 method for class 'PlainTextDocument'
stemDocument(x, language = meta(x, "language"))
x | A character vector or text document. |
language | A string giving the language for stemming. |
The argument language is passed over to wordStem as the name of the Snowball stemmer.
data("crude") inspect(crude[[1]]) if(requireNamespace("SnowballC")) { inspect(stemDocument(crude[[1]])) }
data("crude") inspect(crude[[1]]) if(requireNamespace("SnowballC")) { inspect(stemDocument(crude[[1]])) }
Return various kinds of stopwords with support for different languages.
stopwords(kind = "en")
stopwords(kind = "en")
kind |
A character string identifying the desired stopword list. |
Available stopword lists are:
catalan: Catalan stopwords (obtained from http://latel.upf.edu/morgana/altres/pub/ca_stop.htm),
romanian: Romanian stopwords (extracted from http://snowball.tartarus.org/otherapps/romanian/romanian1.tgz),
SMART: English stopwords from the SMART information retrieval system (as documented in Appendix 11 of https://jmlr.csail.mit.edu/papers/volume5/lewis04a/) (which coincides with the stopword list used by the MC toolkit (https://www.cs.utexas.edu/~dml/software/mc/)),
and a set of stopword lists from the Snowball stemmer project in different languages (obtained from ‘http://svn.tartarus.org/snowball/trunk/website/algorithms/*/stop.txt’). Supported languages are danish, dutch, english, finnish, french, german, hungarian, italian, norwegian, portuguese, russian, spanish, and swedish. Language names are case sensitive. Alternatively, their IETF language tags may be used.
A character vector containing the requested stopwords. An error is raised if no stopwords are available for the requested kind.
stopwords("en") stopwords("SMART") stopwords("german")
stopwords("en") stopwords("SMART") stopwords("german")
Strip extra whitespace from a text document. Multiple whitespace characters are collapsed to a single blank.
## S3 method for class 'PlainTextDocument'
stripWhitespace(x, ...)
x | A text document. |
... | Not used. |
The text document with multiple whitespace characters collapsed to a single blank.
getTransformations to list available transformation (mapping) functions.
data("crude") crude[[1]] stripWhitespace(crude[[1]])
data("crude") crude[[1]] stripWhitespace(crude[[1]])
Constructs or coerces to a term-document matrix or a document-term matrix.
TermDocumentMatrix(x, control = list())
DocumentTermMatrix(x, control = list())
as.TermDocumentMatrix(x, ...)
as.DocumentTermMatrix(x, ...)
x | for the constructors, a corpus or an R object from which a corpus can be generated via Corpus(VectorSource(x)); for the coercing functions, an R object convertible to a term-document or document-term matrix. |
control | a named list of control options. There are local options which are evaluated for each document and global options which are evaluated once for the constructed matrix. Available local options are documented in termFreq and are internally delegated to a termFreq call. This is different for a SimpleCorpus, where all options are processed in a fixed order in one pass to improve performance. Available global options are bounds (a list with a tag global whose value must be an integer vector of length 2; terms appearing in fewer or in more documents than these bounds are discarded) and weighting (a weighting function capable of handling the constructed matrix; it defaults to weightTf for term frequency weighting). |
... | the additional argument weighting (typically a WeightFunction) is allowed when coercing a simple triplet matrix to a term-document or document-term matrix. |
An object of class TermDocumentMatrix or class DocumentTermMatrix (both inheriting from a simple triplet matrix in package slam) containing a sparse term-document matrix or document-term matrix. The attribute weighting contains the weighting applied to the matrix.
termFreq for available local control options.
data("crude") tdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE, stopwords = TRUE)) dtm <- DocumentTermMatrix(crude, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = TRUE)) inspect(tdm[202:205, 1:5]) inspect(tdm[c("price", "prices", "texas"), c("127", "144", "191", "194")]) inspect(dtm[1:5, 273:276]) if(requireNamespace("SnowballC")) { s <- SimpleCorpus(VectorSource(unlist(lapply(crude, as.character)))) m <- TermDocumentMatrix(s, control = list(removeNumbers = TRUE, stopwords = TRUE, stemming = TRUE)) inspect(m[c("price", "texa"), c("127", "144", "191", "194")]) }
data("crude") tdm <- TermDocumentMatrix(crude, control = list(removePunctuation = TRUE, stopwords = TRUE)) dtm <- DocumentTermMatrix(crude, control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE), stopwords = TRUE)) inspect(tdm[202:205, 1:5]) inspect(tdm[c("price", "prices", "texas"), c("127", "144", "191", "194")]) inspect(dtm[1:5, 273:276]) if(requireNamespace("SnowballC")) { s <- SimpleCorpus(VectorSource(unlist(lapply(crude, as.character)))) m <- TermDocumentMatrix(s, control = list(removeNumbers = TRUE, stopwords = TRUE, stemming = TRUE)) inspect(m[c("price", "texa"), c("127", "144", "191", "194")]) }
Generate a term frequency vector from a text document.
termFreq(doc, control = list())
doc | An object inheriting from TextDocument or a character vector. |
control | A list of control options which override default settings. First, the options tokenize and tolower are processed. Next, a set of options which are sensitive to their order of occurrence in the control list (e.g., stop word removal and stemming). Finally, the options dictionary, bounds, and wordLengths are processed in the given order. |
A table of class c("term_frequency", "integer")
with term frequencies
as values and tokens as names.
data("crude") termFreq(crude[[14]]) if(requireNamespace("SnowballC")) { strsplit_space_tokenizer <- function(x) unlist(strsplit(as.character(x), "[[:space:]]+")) ctrl <- list(tokenize = strsplit_space_tokenizer, removePunctuation = list(preserve_intra_word_dashes = TRUE), stopwords = c("reuter", "that"), stemming = TRUE, wordLengths = c(4, Inf)) termFreq(crude[[14]], control = ctrl) }
data("crude") termFreq(crude[[14]]) if(requireNamespace("SnowballC")) { strsplit_space_tokenizer <- function(x) unlist(strsplit(as.character(x), "[[:space:]]+")) ctrl <- list(tokenize = strsplit_space_tokenizer, removePunctuation = list(preserve_intra_word_dashes = TRUE), stopwords = c("reuter", "that"), stemming = TRUE, wordLengths = c(4, Inf)) termFreq(crude[[14]], control = ctrl) }
Representing and computing on text documents.
Text documents are documents containing (natural language) text. The tm package employs the infrastructure provided by package NLP and represents text documents via the virtual S3 class TextDocument. Actual S3 text document classes then extend the virtual base class (such as PlainTextDocument).
All extension classes must provide an as.character method which extracts the natural language text in documents of the respective classes in a “suitable” (not necessarily structured) form, as well as content and meta methods for accessing the (possibly raw) document content and metadata.
PlainTextDocument and XMLTextDocument for the text document classes provided by package tm. TextDocument for text documents in package NLP.
Combine several corpora into a single one, combine multiple documents into a corpus, combine multiple term-document matrices into a single one, or combine multiple term frequency vectors into a single term-document matrix.
## S3 method for class 'VCorpus'
c(..., recursive = FALSE)
## S3 method for class 'TextDocument'
c(..., recursive = FALSE)
## S3 method for class 'TermDocumentMatrix'
c(..., recursive = FALSE)
## S3 method for class 'term_frequency'
c(..., recursive = FALSE)
... | Corpora, text documents, term-document matrices, or term frequency vectors. |
recursive | Not used. |
VCorpus, TextDocument, TermDocumentMatrix, and termFreq.
data("acq") data("crude") meta(acq, "comment", type = "corpus") <- "Acquisitions" meta(crude, "comment", type = "corpus") <- "Crude oil" meta(acq, "acqLabels") <- 1:50 meta(acq, "jointLabels") <- 1:50 meta(crude, "crudeLabels") <- letters[1:20] meta(crude, "jointLabels") <- 1:20 c(acq, crude) meta(c(acq, crude), type = "corpus") meta(c(acq, crude)) c(acq[[30]], crude[[10]]) c(TermDocumentMatrix(acq), TermDocumentMatrix(crude))
data("acq") data("crude") meta(acq, "comment", type = "corpus") <- "Acquisitions" meta(crude, "comment", type = "corpus") <- "Crude oil" meta(acq, "acqLabels") <- 1:50 meta(acq, "jointLabels") <- 1:50 meta(crude, "crudeLabels") <- letters[1:20] meta(crude, "jointLabels") <- 1:20 c(acq, crude) meta(c(acq, crude), type = "corpus") meta(c(acq, crude)) c(acq[[30]], crude[[10]]) c(TermDocumentMatrix(acq), TermDocumentMatrix(crude))
Interface to apply filter and index functions to corpora.
## S3 method for class 'PCorpus'
tm_filter(x, FUN, ...)
## S3 method for class 'SimpleCorpus'
tm_filter(x, FUN, ...)
## S3 method for class 'VCorpus'
tm_filter(x, FUN, ...)
## S3 method for class 'PCorpus'
tm_index(x, FUN, ...)
## S3 method for class 'SimpleCorpus'
tm_index(x, FUN, ...)
## S3 method for class 'VCorpus'
tm_index(x, FUN, ...)
x | A corpus. |
FUN | a filter function taking a text document or a string (if x is a SimpleCorpus) as input and returning a logical value. |
... | arguments to FUN. |
tm_filter returns a corpus containing documents where FUN matches, whereas tm_index only returns the corresponding indices.
data("crude") # Full-text search tm_filter(crude, FUN = function(x) any(grep("co[m]?pany", content(x))))
data("crude") # Full-text search tm_filter(crude, FUN = function(x) any(grep("co[m]?pany", content(x))))
Interface to apply transformation functions (also denoted as mappings) to corpora.
## S3 method for class 'PCorpus'
tm_map(x, FUN, ...)

## S3 method for class 'SimpleCorpus'
tm_map(x, FUN, ...)

## S3 method for class 'VCorpus'
tm_map(x, FUN, ..., lazy = FALSE)
x |
A corpus. |
FUN |
a transformation function taking a text document (a character vector when x is a SimpleCorpus) as input and returning a text document (a character vector of the same length when x is a SimpleCorpus). |
... |
arguments to FUN. |
lazy |
a logical. Lazy mappings are mappings which are delayed until the content is accessed. This is useful for large corpora when only a few documents will be accessed, as it avoids the computationally expensive application of the mapping to all elements in the corpus. |
A corpus with FUN applied to each document in x. In case of lazy mappings only internal flags are set. Access of individual documents triggers the execution of the corresponding transformation function.

Lazy transformations change R's standard evaluation semantics.
getTransformations for available transformations.
data("crude") ## Document access triggers the stemming function ## (i.e., all other documents are not stemmed yet) if(requireNamespace("SnowballC")) { tm_map(crude, stemDocument, lazy = TRUE)[[1]] } ## Use wrapper to apply character processing function tm_map(crude, content_transformer(tolower)) ## Generate a custom transformation function which takes the heading as new content headings <- function(x) PlainTextDocument(meta(x, "heading"), id = meta(x, "id"), language = meta(x, "language")) inspect(tm_map(crude, headings))
data("crude") ## Document access triggers the stemming function ## (i.e., all other documents are not stemmed yet) if(requireNamespace("SnowballC")) { tm_map(crude, stemDocument, lazy = TRUE)[[1]] } ## Use wrapper to apply character processing function tm_map(crude, content_transformer(tolower)) ## Generate a custom transformation function which takes the heading as new content headings <- function(x) PlainTextDocument(meta(x, "heading"), id = meta(x, "id"), language = meta(x, "language")) inspect(tm_map(crude, headings))
Fold multiple transformations (mappings) into a single one.
tm_reduce(x, tmFuns, ...)
x |
A corpus. |
tmFuns |
A list of tm transformations. |
... |
Arguments to the individual transformations. |
A single tm transformation function obtained by folding tmFuns from right to left (via Reduce(..., right = TRUE)).
Reduce for R's internal folding/accumulation mechanism, and getTransformations to list available transformation (mapping) functions.
data(crude)
crude[[1]]

skipWords <- function(x) removeWords(x, c("it", "the"))
funs <- list(stripWhitespace,
             skipWords,
             removePunctuation,
             content_transformer(tolower))
tm_map(crude, FUN = tm_reduce, tmFuns = funs)[[1]]
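Because the fold is from the right, the last function in tmFuns is applied first and the first one last; a minimal sketch of this equivalence on a single document (reusing skipWords and funs from the example above):

doc <- crude[[1]]
## tolower runs first, then removePunctuation, skipWords, and stripWhitespace
manual <- stripWhitespace(skipWords(removePunctuation(content_transformer(tolower)(doc))))
identical(content(manual), content(tm_reduce(doc, tmFuns = funs)))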
Compute a score based on the number of matching terms.
## S3 method for class 'DocumentTermMatrix'
tm_term_score(x, terms, FUN = row_sums)

## S3 method for class 'PlainTextDocument'
tm_term_score(x, terms, FUN = function(x) sum(x, na.rm = TRUE))

## S3 method for class 'term_frequency'
tm_term_score(x, terms, FUN = function(x) sum(x, na.rm = TRUE))

## S3 method for class 'TermDocumentMatrix'
tm_term_score(x, terms, FUN = col_sums)
x |
Either a DocumentTermMatrix or TermDocumentMatrix, a PlainTextDocument, or a term frequency vector (as returned by termFreq). |
terms |
A character vector of terms to be matched. |
FUN |
A function computing a score from the number of terms
matching in x. |
A score as computed by FUN from the number of matching terms in x.
data("acq") tm_term_score(acq[[1]], c("company", "change")) ## Not run: ## Test for positive and negative sentiments ## install.packages("tm.lexicon.GeneralInquirer", repos="http://datacube.wu.ac.at", type="source") require("tm.lexicon.GeneralInquirer") sapply(acq[1:10], tm_term_score, terms_in_General_Inquirer_categories("Positiv")) sapply(acq[1:10], tm_term_score, terms_in_General_Inquirer_categories("Negativ")) tm_term_score(TermDocumentMatrix(acq[1:10], control = list(removePunctuation = TRUE)), terms_in_General_Inquirer_categories("Positiv")) ## End(Not run)
data("acq") tm_term_score(acq[[1]], c("company", "change")) ## Not run: ## Test for positive and negative sentiments ## install.packages("tm.lexicon.GeneralInquirer", repos="http://datacube.wu.ac.at", type="source") require("tm.lexicon.GeneralInquirer") sapply(acq[1:10], tm_term_score, terms_in_General_Inquirer_categories("Positiv")) sapply(acq[1:10], tm_term_score, terms_in_General_Inquirer_categories("Negativ")) tm_term_score(TermDocumentMatrix(acq[1:10], control = list(removePunctuation = TRUE)), terms_in_General_Inquirer_categories("Positiv")) ## End(Not run)
Tokenize a document or character vector.
Boost_tokenizer(x)

MC_tokenizer(x)

scan_tokenizer(x)
x |
A character vector, or an object that can be coerced to character by as.character. |
The quality and correctness of a tokenization algorithm depend highly on the context and application scenario. Relevant factors are the language of the underlying text and the notions of whitespace (which can vary with the used encoding and the language) and punctuation marks. Consequently, for superior results you probably need a custom tokenization function.
Boost_tokenizer: Uses the Boost (https://www.boost.org) Tokenizer (via Rcpp).

MC_tokenizer: Implements the functionality of the tokenizer in the MC toolkit (https://www.cs.utexas.edu/~dml/software/mc/).

scan_tokenizer: Simulates scan(..., what = "character").
A character vector consisting of tokens obtained by tokenization of x.
getTokenizers to list tokenizers provided by package tm.

Regexp_Tokenizer for tokenizers using regular expressions provided by package NLP.

tokenize for a simple regular expression based tokenizer provided by package tau.

tokenizers for a collection of tokenizers provided by package tokenizers.
data("crude") Boost_tokenizer(crude[[1]]) MC_tokenizer(crude[[1]]) scan_tokenizer(crude[[1]]) strsplit_space_tokenizer <- function(x) unlist(strsplit(as.character(x), "[[:space:]]+")) strsplit_space_tokenizer(crude[[1]])
data("crude") Boost_tokenizer(crude[[1]]) MC_tokenizer(crude[[1]]) scan_tokenizer(crude[[1]]) strsplit_space_tokenizer <- function(x) unlist(strsplit(as.character(x), "[[:space:]]+")) strsplit_space_tokenizer(crude[[1]])
Create a uniform resource identifier source.
URISource(x, encoding = "", mode = "text")
x |
A character vector of uniform resource identifiers (URIs). |
encoding |
A character string describing the current encoding. It is passed to iconv to convert the input to UTF-8. |
mode |
a character string specifying if and how URIs should be read in. Available modes are: "" (no read), "binary" (URIs are read in binary raw mode via readBin), and "text" (URIs are read as text via readLines; the default). |
A uniform resource identifier source interprets each URI as a document.
An object inheriting from URISource, SimpleSource, and Source.
Source for basic information on the source infrastructure employed by package tm.

Encoding and iconv on encodings.
loremipsum <- system.file("texts", "loremipsum.txt", package = "tm")
ovid <- system.file("texts", "txt", "ovid_1.txt", package = "tm")
us <- URISource(sprintf("file://%s", c(loremipsum, ovid)))
inspect(VCorpus(us))
Create volatile corpora.
VCorpus(x, readerControl = list(reader = reader(x), language = "en"))

as.VCorpus(x)
x |
For VCorpus a Source object, for as.VCorpus an R object. |
readerControl |
a named list of control parameters for reading in content from x: reader (a function capable of reading in and processing the format delivered by x) and language (a character giving the language, preferably as an IETF language tag; see language in package NLP). |
A volatile corpus is fully kept in memory and thus all changes only affect the corresponding R object.
An object inheriting from VCorpus and Corpus.
Corpus for basic information on the corpus infrastructure employed by package tm.

PCorpus provides an implementation with permanent storage semantics.
reut21578 <- system.file("texts", "crude", package = "tm")
VCorpus(DirSource(reut21578, mode = "binary"),
        list(reader = readReut21578XMLasPlain))
Create a vector source.
VectorSource(x)
x |
A vector giving the texts. |
A vector source interprets each element of the vector x as a document.
An object inheriting from VectorSource, SimpleSource, and Source.
Source for basic information on the source infrastructure employed by package tm.
docs <- c("This is a text.", "This another one.")
(vs <- VectorSource(docs))
inspect(VCorpus(vs))
Binary weight a term-document matrix.
weightBin(m)
m |
A TermDocumentMatrix in term frequency format. |
Formally this function is of class WeightingFunction with the additional attributes name and acronym.
The weighted matrix.
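A minimal usage sketch (not part of the original documentation): binary weighting maps every nonzero term frequency to 1.

data("crude")
tdm <- TermDocumentMatrix(crude)
inspect(weightBin(tdm)[1:5, 1:3])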
Construct a weighting function for term-document matrices.
WeightFunction(x, name, acronym)
x |
A function which takes a TermDocumentMatrix in term frequency format (and possibly further arguments) and returns the weighted matrix. |
name |
A character naming the weighting function. |
acronym |
A character giving an acronym for the name of the weighting function. |
An object of class WeightFunction which extends the class function representing a weighting function.
weightCutBin <- WeightFunction(function(m, cutoff) m > cutoff,
                               "binary with cutoff", "bincut")
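As a short follow-up sketch (not part of the original example): the constructed object extends class function and carries the name and acronym attributes supplied to the constructor.

class(weightCutBin)
attr(weightCutBin, "name")
attr(weightCutBin, "acronym")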
Weight a term-document matrix according to a combination of weights specified in SMART notation.
weightSMART(m, spec = "nnn", control = list())
m |
A TermDocumentMatrix in term frequency format. |
spec |
a character string consisting of three characters. The first letter specifies a term frequency schema, the second a document frequency schema, and the third a normalization schema. See Details for available built-in schemata. |
control |
a list of control parameters. See Details. |
Formally this function is of class WeightingFunction with the additional attributes name and acronym.
The first letter of spec specifies a weighting schema for term frequencies of m:

"n" (natural) counts the number of occurrences $n_{t,d}$ of a term $t$ in a document $d$. The input term-document matrix m is assumed to be in this standard term frequency format already.

"l" (logarithm) is defined as $1 + \log_2(tf_{t,d})$.

"a" (augmented) is defined as $0.5 + \frac{0.5 \cdot tf_{t,d}}{\max_t(tf_{t,d})}$.

"b" (boolean) is defined as $1$ if $tf_{t,d} > 0$ and $0$ otherwise.

"L" (log average) is defined as $\frac{1 + \log_2(tf_{t,d})}{1 + \log_2(\mathrm{ave}_{t \in d}(tf_{t,d}))}$.

The second letter of spec specifies a weighting schema of document frequencies for m:

"n" (no) is defined as $1$.

"t" (idf) is defined as $\log_2 \frac{N}{df_t}$ where $df_t$ denotes how often term $t$ occurs in all documents.

"p" (prob idf) is defined as $\max(0, \log_2(\frac{N - df_t}{df_t}))$.

The third letter of spec specifies a schema for normalization of m:

"n" (none) is defined as $1$.

"c" (cosine) is defined as $\sqrt{\mathrm{col\_sums}(m^2)}$.

"u" (pivoted unique) is defined as $slope \cdot \sqrt{\mathrm{col\_sums}(m^2)} + (1 - slope) \cdot pivot$ where both slope and pivot must be set via named tags in the control list.

"b" (byte size) is defined as $\frac{1}{CharLength^\alpha}$. The parameter $\alpha$ must be set via the named tag alpha in the control list.
The final result is the product of the chosen term frequency, document frequency, and normalization components.
The weighted matrix.
Christopher D. Manning and Prabhakar Raghavan and Hinrich Schütze (2008). Introduction to Information Retrieval. Cambridge University Press, ISBN 0521865719.
data("crude") TermDocumentMatrix(crude, control = list(removePunctuation = TRUE, stopwords = TRUE, weighting = function(x) weightSMART(x, spec = "ntc")))
data("crude") TermDocumentMatrix(crude, control = list(removePunctuation = TRUE, stopwords = TRUE, weighting = function(x) weightSMART(x, spec = "ntc")))
Weight a term-document matrix by term frequency.
weightTf(m)
m |
A TermDocumentMatrix in term frequency format. |
Formally this function is of class WeightingFunction with the additional attributes name and acronym.
This function acts as the identity function since the input matrix is already in term frequency format.
The weighted matrix.
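A minimal usage sketch (not part of the original documentation): since term-document matrices are constructed in term frequency format by default, the counts pass through unchanged.

data("crude")
tdm <- TermDocumentMatrix(crude)
inspect(weightTf(tdm)[1:5, 1:3])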
Weight a term-document matrix by term frequency - inverse document frequency.
weightTfIdf(m, normalize = TRUE)
m |
A TermDocumentMatrix in term frequency format. |
normalize |
A Boolean value indicating whether the term frequencies should be normalized. |
Formally this function is of class WeightingFunction with the additional attributes name and acronym.
Term frequency $tf_{i,j}$ counts the number of occurrences $n_{i,j}$ of a term $t_i$ in a document $d_j$. In the case of normalization, the term frequency $tf_{i,j}$ is divided by $\sum_k n_{k,j}$.

Inverse document frequency for a term $t_i$ is defined as $idf_i = \log_2 \frac{|D|}{|\{d \mid t_i \in d\}|}$ where $|D|$ denotes the total number of documents and where $|\{d \mid t_i \in d\}|$ is the number of documents where the term $t_i$ appears.

Term frequency - inverse document frequency is now defined as $tf_{i,j} \cdot idf_i$.
The weighted matrix.
Gerard Salton and Christopher Buckley (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24/5, 513–523.
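A minimal usage sketch (not part of the original documentation), applying the weighting after construction and, equivalently, supplying it at construction time:

data("crude")
tdm <- TermDocumentMatrix(crude)
inspect(weightTfIdf(tdm)[1:5, 1:3])

TermDocumentMatrix(crude, control = list(weighting = weightTfIdf))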
Write a plain text representation of a corpus to multiple files on disk corresponding to the individual documents in the corpus.
writeCorpus(x, path = ".", filenames = NULL)
x |
A corpus. |
path |
A character string giving the directory to be written into. |
filenames |
Either NULL (filenames are then derived from the documents' identifiers) or a character vector of filenames to be used. |
The plain text representation of the corpus is obtained by calling as.character on each document.
data("crude") ## Not run: writeCorpus(crude, path = ".", filenames = paste(seq_along(crude), ".txt", sep = "")) ## End(Not run)
data("crude") ## Not run: writeCorpus(crude, path = ".", filenames = paste(seq_along(crude), ".txt", sep = "")) ## End(Not run)
Create an XML source.
XMLSource(x, parser = xml_contents, reader)
x |
a character giving a uniform resource identifier. |
parser |
a function accepting an XML document (as delivered by read_xml in package xml2) and returning XML elements/nodes. |
reader |
a function capable of turning XML elements/nodes as returned by parser into a (subclass of) TextDocument. |
An object inheriting from XMLSource, SimpleSource, and Source.
Source for basic information on the source infrastructure employed by package tm.

Vignette 'Extensions: How to Handle Custom File Formats', and readXML.
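As a hedged sketch only (the element names title and description and the file name feed.xml are illustrative assumptions; see the vignette for a complete treatment), a matching reader could be built with readXML and passed to XMLSource:

myXMLReader <- readXML(
    spec = list(heading = list("node", "title"),
                content = list("node", "description")),
    doc = PlainTextDocument())
## s <- XMLSource("feed.xml", reader = myXMLReader)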
Create XML text documents.
XMLTextDocument(x = xml_missing(),
                author = character(0),
                datetimestamp = as.POSIXlt(Sys.time(), tz = "GMT"),
                description = character(0),
                heading = character(0),
                id = character(0),
                language = character(0),
                origin = character(0),
                ...,
                meta = NULL)
x |
An XML document (as delivered by read_xml in package xml2). |
author |
a character or an object of class person giving the author names. |
datetimestamp |
an object of class POSIXt or a character string giving the creation date/time information. |
description |
a character giving a description. |
heading |
a character giving the title or a short heading. |
id |
a character giving a unique identifier. |
language |
a character giving the language (preferably as IETF language tags, see language in package NLP). |
origin |
a character giving information on the source and origin. |
... |
user-defined document metadata tag-value pairs. |
meta |
a named list or NULL (default) giving all metadata. If set all other metadata arguments are ignored. |
An object inheriting from XMLTextDocument and TextDocument.
TextDocument for basic information on the text document infrastructure employed by package tm.
xml <- system.file("extdata", "order-doc.xml", package = "xml2")
(xtd <- XMLTextDocument(xml2::read_xml(xml),
                        heading = "XML text document",
                        id = xml,
                        language = "en"))
content(xtd)
meta(xtd)
Explore Zipf's law and Heaps' law, two empirical laws in linguistics describing commonly observed characteristics of term frequency distributions in corpora.
Zipf_plot(x, type = "l", ...)

Heaps_plot(x, type = "l", ...)
x |
a document-term matrix or term-document matrix with unweighted term frequencies. |
type |
a character string indicating the type of plot to be
drawn, see plot for details. |
... |
further graphical parameters to be used for plotting. |
Zipf's law (e.g., https://en.wikipedia.org/wiki/Zipf%27s_law) states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table, or, more generally, that the pmf of the term frequencies is of the form $c k^{-\beta}$, where $k$ is the rank of the term (taken from the most to the least frequent one). We can conveniently explore the degree to which the law holds by plotting the logarithm of the frequency against the logarithm of the rank, and inspecting the goodness of fit of a linear model.

Heaps' law (e.g., https://en.wikipedia.org/wiki/Heaps%27_law) states that the vocabulary size $V$ (i.e., the number of different terms employed) grows polynomially with the text size $T$ (the total number of terms in the texts), so that $V = c T^\beta$. We can conveniently explore the degree to which the law holds by plotting $\log(V)$ against $\log(T)$, and inspecting the goodness of fit of a linear model.
The coefficients of the fitted linear model. As a side effect, the corresponding plot is produced.
data("acq") m <- DocumentTermMatrix(acq) Zipf_plot(m) Heaps_plot(m)
data("acq") m <- DocumentTermMatrix(acq) Zipf_plot(m) Heaps_plot(m)
Create a ZIP file source.
ZipSource(zipfile, pattern = NULL, recursive = FALSE, ignore.case = FALSE, mode = "text")
zipfile |
A character string with the full path name of a ZIP file. |
pattern |
an optional regular expression. Only file names in the ZIP file which match the regular expression will be returned. |
recursive |
logical. Should the listing recurse into directories? |
ignore.case |
logical. Should pattern-matching be case-insensitive? |
mode |
a character string specifying if and how files should be read in. Available modes are: "" (no read), "binary" (files are read in binary raw mode via readBin), and "text" (files are read as text via readLines; the default). |
A ZIP file source extracts a compressed ZIP file via unzip and interprets each file as a document.
An object inheriting from ZipSource, SimpleSource, and Source.
Source for basic information on the source infrastructure employed by package tm.
zipfile <- tempfile()
files <- Sys.glob(file.path(system.file("texts", "txt", package = "tm"), "*"))
zip(zipfile, files)
zipfile <- paste0(zipfile, ".zip")
Corpus(ZipSource(zipfile, recursive = TRUE))[[1]]
file.remove(zipfile)