                 Changes in version 0.7-18 (2026-02-18)                 

BUG FIXES

    o   Generate bibliographic citations and references from
	bibentries.

                 Changes in version 0.7-17 (2025-12-10)                 

BUG FIXES

    o   Documentation improvements.

                 Changes in version 0.7-16 (2025-02-19)                 

BUG FIXES

    o   Improvements for Rd cross-references.

                 Changes in version 0.7-15 (2024-11-18)                 

BUG FIXES

    o   Improvements for Rd cross-references.

                 Changes in version 0.7-14 (2024-08-13)                 

BUG FIXES

    o   Use R_Calloc/R_Free instead of the long-deprecated Calloc/Free.

                 Changes in version 0.7-13 (2024-04-20)                 

BUG FIXES

    o   Improvements for Rd cross-references.

                 Changes in version 0.7-12 (2024-03-11)                 

BUG FIXES

    o   Add missing S3 method registration.

                 Changes in version 0.7-11 (2023-02-05)                 

BUG FIXES

    o   Use the default C++ standard instead of C++11.

                 Changes in version 0.7-10 (2022-12-14)                 

NEW FEATURES

    o   All built-in pGetElem() methods now use tm_parLapply().

                 Changes in version 0.7-9 (2022-10-19)                  

BUG FIXES

    o   Compilation fixes.

                 Changes in version 0.7-8 (2020-11-18)                  

BUG FIXES

    o   Fix invalid counting in stemCompletion() with type "prevalent".
	Reported by Bernard Chang.

    o   tm_index() now interprets all non-TRUE logical values returned
	by the filter function as FALSE. This fixes corner cases where
	filter functions return logical(0) or NA. Reported by Tom
	Nicholls.
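
	A minimal sketch of the corner case, assuming a current tm
	installation. The filter has_alpha() and the toy corpus are
	made up for illustration and are not part of the package:

	```r
	library(tm)

	corp <- VCorpus(VectorSource(c("alpha beta", "", "gamma")))

	## Hypothetical filter: yields NA for empty documents instead of
	## a proper TRUE/FALSE.
	has_alpha <- function(doc) {
	    txt <- as.character(doc)
	    if (!any(nzchar(txt))) NA else any(grepl("alpha", txt, fixed = TRUE))
	}

	idx <- tm_index(corp, has_alpha)  # the NA result counts as FALSE
	kept <- corp[idx]                 # only the first document remains
	```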

                 Changes in version 0.7-6 (2018-12-21)                  

NEW FEATURES

    o   TermDocumentMatrix.SimpleCorpus() now also honors a logical
	removePunctuation control option (default: FALSE).
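
	A short sketch of how the option might be used; the two
	example sentences are invented, and the remaining control
	options are left at their defaults:

	```r
	library(tm)

	corp <- SimpleCorpus(VectorSource(c("Hello, world!", "Hello again.")))

	## Strip punctuation while building the term-document matrix.
	tdm <- TermDocumentMatrix(corp,
	                          control = list(removePunctuation = TRUE))

	Terms(tdm)  # terms no longer carry commas, periods, or "!"
	```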

BUG FIXES

    o   Sync encoding fixes in TermDocumentMatrix.SimpleCorpus() with
	Boost_tokenizer().

                 Changes in version 0.7-5 (2018-07-29)                  

BUG FIXES

    o   Handle NAs consistently in tokenizers.

                 Changes in version 0.7-4 (2018-06-19)                  

BUG FIXES

    o   Keep document names in tm_map.SimpleCorpus().

    o   Fix encoding problems in scan_tokenizer() and
	Boost_tokenizer().

                 Changes in version 0.7-3 (2017-12-06)                  

BUG FIXES

    o   scan_tokenizer() now works with character vectors and character
	strings.

    o   removePunctuation() now works again in latin1 locales.

    o   Handle empty term-document matrices gracefully.

                 Changes in version 0.7-2 (2017-11-18)                  

SIGNIFICANT USER-VISIBLE CHANGES

    o   DataframeSource now only processes data frames with the two
	mandatory columns "doc_id" and "text". Additional columns are
	used as document level metadata. This implements compatibility
	with _Text Interchange Formats_ corpora
	(<https://github.com/ropenscilabs/tif>).
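
	For illustration, a data frame in this layout could be turned
	into a corpus as sketched below. The column author and all
	values are made-up example data:

	```r
	library(tm)

	docs <- data.frame(
	    doc_id = c("doc_a", "doc_b"),                       # mandatory
	    text   = c("First document.", "Second document."),  # mandatory
	    author = c("Ann", "Bob"),          # extra column -> metadata
	    stringsAsFactors = FALSE
	)

	corp <- VCorpus(DataframeSource(docs))

	first <- as.character(corp[["doc_a"]])  # access by document ID
	dmeta <- meta(corp)                     # document level metadata
	```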

    o   readTabular() has been removed. Use DataframeSource instead.

    o   removeNumbers() and removePunctuation() now have an argument
	ucp to check for Unicode general categories Nd (decimal digits)
	and P (punctuation), respectively. Contributed by Kurt Hornik.
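
	A small sketch of the ucp argument on a plain character
	string. The input is an invented example using curly quotes
	(U+201C, U+201D) and Arabic-Indic digits (U+0663, U+0664):

	```r
	library(tm)

	x <- "\u201cQuoted\u201d text, room \u0663\u0664"

	## Remove everything in Unicode general category P.
	no_punct <- removePunctuation(x, ucp = TRUE)

	## Remove everything in Unicode general category Nd.
	no_digits <- removeNumbers(x, ucp = TRUE)
	```

	Without ucp = TRUE the result for such characters may depend
	on the locale.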

    o   The package xml2 is now imported for XML functionality instead
	of the (CRAN maintainer orphaned) package XML.

NEW FEATURES

    o   Boost_tokenizer provides a tokenizer based on the Boost
	(<https://www.boost.org>) Tokenizer.

BUG FIXES

    o   Correctly handle the dictionary argument when constructing a
	term-document matrix from a SimpleCorpus (reported by Joe
	Corrigan) or from a VCorpus (reported by Mark Rosenstein).
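
	As a sketch, a dictionary is passed as a character vector via
	the control list; the corpus and the dictionary terms below
	are invented for the example:

	```r
	library(tm)

	corp <- VCorpus(VectorSource(c("apple banana cherry", "banana date")))
	dict <- c("apple", "banana")

	## Only terms from the dictionary are tabulated.
	dtm <- DocumentTermMatrix(corp, control = list(dictionary = dict))

	Terms(dtm)
	```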

                 Changes in version 0.7-1 (2017-03-02)                  

BUG FIXES

    o   Compilation fixes for Clang's libc++.

                  Changes in version 0.7 (2017-02-27)                   

SIGNIFICANT USER-VISIBLE CHANGES

    o   inspect.TermDocumentMatrix() now displays a sample instead of
	the full matrix. The full dense representation is available via
	as.matrix().
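
	A brief sketch of both views on a toy corpus (invented for the
	example):

	```r
	library(tm)

	corp <- VCorpus(VectorSource(c("apple banana apple", "banana cherry")))
	tdm <- TermDocumentMatrix(corp)

	inspect(tdm)         # summary plus a sample of the entries
	m <- as.matrix(tdm)  # full dense terms x documents matrix
	dim(m)
	```

	For large corpora the dense matrix can be very big, so
	as.matrix() is best applied to a subset.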

NEW FEATURES

    o   SimpleCorpus provides a corpus which is optimized for the most
	common usage scenario: importing plain texts from files in a
	directory or directly from a vector in R, preprocessing and
	transforming the texts, and finally exporting them to a
	term-document matrix. The aim is to boost performance and
	minimize memory pressure. It loads all documents into memory,
	and is designed for medium-sized to large data sets.

    o   inspect() on text documents as a shorthand for
	writeLines(as.character()).

    o   findMostFreqTerms() finds most frequent terms in a
	document-term or term-document matrix, or a vector of term
	frequencies.
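
	A minimal sketch on a vector of term frequencies as returned
	by termFreq(); the document text is invented:

	```r
	library(tm)

	doc <- PlainTextDocument("apple banana apple cherry apple banana")
	tf <- termFreq(doc)

	## The two most frequent terms, in decreasing order of frequency.
	top <- findMostFreqTerms(tf, n = 2L)
	```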

    o   tm_parLapply() is now internally used for the parallelization
	of transformations, filters, and term-document matrix
	construction. The preferred parallelization engine can be
	registered via tm_parLapply_engine(). The default is to use no
	parallelization (instead of mclapply (package parallel) in
	previous versions).
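
	One way to register an engine is sketched below, assuming a
	function engine with formals X, FUN and ... is acceptable for
	the use case. mc.cores = 1L is chosen only to keep the example
	portable; larger values require a Unix-alike:

	```r
	library(tm)

	## Register a function engine based on parallel::mclapply().
	tm_parLapply_engine(function(X, FUN, ...)
	    parallel::mclapply(X, FUN, ..., mc.cores = 1L))

	corp <- VCorpus(VectorSource(c("Some TEXT", "More TEXT")))
	lower <- tm_map(corp, content_transformer(tolower))

	## Restore the default of using no parallelization.
	tm_parLapply_engine(NULL)
	```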

                 Changes in version 0.6-2 (2015-07-03)                  

BUG FIXES

    o   format.PlainTextDocument() now reports only one character count
	for a whole document.

                 Changes in version 0.6-1 (2015-05-07)                  

SIGNIFICANT USER-VISIBLE CHANGES

    o   format.PlainTextDocument() now displays a compact
	representation instead of the content. Use as.character() to
	obtain the character content (which in turn can be applied to a
	corpus via lapply()).
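
	For example (toy corpus, invented for illustration):

	```r
	library(tm)

	corp <- VCorpus(VectorSource(c("first text", "second text")))

	format(corp[[1]])                    # compact representation
	one <- as.character(corp[[1]])       # character content of one document
	texts <- lapply(corp, as.character)  # ... of every document
	```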

NEW FEATURES

    o   ZipSource() for processing ZIP files.

    o   Sources now provide open() and close().

    o   termFreq() now accepts Span_Tokenizer and Token_Tokenizer (both
	from package NLP) objects as tokenizers.

    o   readTagged(), a reader for text documents containing POS-tagged
	words.

BUG FIXES

    o   The function removeWords() now correctly processes words being
	truncations of others. Reported by Александр Труфанов.

                  Changes in version 0.6 (2014-06-11)                   

SIGNIFICANT USER-VISIBLE CHANGES

    o   DirSource() and URISource() now use the argument encoding for
	conversion via iconv() to "UTF-8".

    o   termFreq() now uses words() as the default tokenizer.

    o   Text documents now provide the functions content() and
	as.character() to access the (possibly raw) document content
	and the natural language text in a suitable (not necessarily
	structured) form.

    o   The internal representation of corpora, sources, and text
	documents changed. Saved objects created with older tm versions
	are incompatible and need to be rebuilt.

NEW FEATURES

    o   DirSource() and URISource() now have a mode argument specifying
	how elements should be read (no read, binary, text).

    o   Improved high-level documentation on corpora (?Corpus), text
	documents (?TextDocument), sources (?Source), and readers
	(?Reader).

    o   Integration with package NLP.

    o   Romanian stopwords. Suggested by Cristian Chirita.

    o   words.PlainTextDocument() delivers word tokens in the document.

BUG FIXES

    o   The function stemCompletion() now avoids spurious duplicate
	results. Reported by Seong-Hyeon Kim.

DEPRECATED & DEFUNCT

    o   The following functions have been removed:
	
	  • Author(), DateTimeStamp(), CMetaData(), content_meta(),
	    DMetaData(), Description(), Heading(), ID(), Language(),
	    LocalMetaData(), Origin(), prescindMeta(), sFilter() (use
	    meta() instead).
	
	  • dissimilarity() (use proxy::dist() instead).
	
	  • makeChunks() (use [ and [[ manually).
	
	  • summary.Corpus() and summary.TextRepository() (print() now
	    gives a more informative but succinct overview).
	
	  • TextRepository() and RepoMetaData() (use e.g. a list to
	    store multiple corpora instead).
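
	For the removed metadata accessors listed above, a rough
	sketch of the meta() replacement on a single document; the tag
	values are made-up example data:

	```r
	library(tm)

	doc <- PlainTextDocument("Some text.", author = "Ann",
	                         id = "doc1", language = "en")

	author <- meta(doc, "author")       # in place of Author(doc)
	meta(doc, "heading") <- "A title"   # in place of Heading(doc) <- ...
	heading <- meta(doc, "heading")

	meta(doc)                           # all metadata of the document
	```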

                 Changes in version 0.5-10 (2014-01-13)                 

SIGNIFICANT USER-VISIBLE CHANGES

    o   License changed to GPL-3 (from GPL-2 | GPL-3).

    o   The following functions have been renamed:
	
	  • tm_tag_score() to tm_term_score().

DEPRECATED & DEFUNCT

    o   The following functions have been removed:
	
	  • Dictionary() (use a character vector instead; use Terms()
	    to extract terms from a document-term or term-document
	    matrix),
	
	  • GmaneSource() (but still available via an example in
	    XMLSource()),
	
	  • preprocessReut21578XML() (moved to package
	    tm.corpus.Reuters21578),
	
	  • readGmane() (but still available via an example in
	    readXML()),
	
	  • searchFullText() and tm_intersect() (use grep() instead).

    o   The following S3 classes are no longer registered as S4 classes:
	
	  • VCorpus and PlainTextDocument.

                 Changes in version 0.5-9 (2013-06-18)                  

SIGNIFICANT USER-VISIBLE CHANGES

    o   Stemming functionality is now provided by the package SnowballC
	replacing packages Snowball and RWeka.

    o   All stopword lists (besides Catalan and SMART) available via
	stopwords() now come from the Snowball stemmer project.

    o   Transformations, filters, and term-document matrix construction
	now use mclapply (package parallel).  Packages snow and Rmpi
	are no longer used.

DEPRECATED & DEFUNCT

    o   The following functions have been removed:
	
	  • tm_startCluster() and tm_stopCluster().

                 Changes in version 0.5-8 (2012-12-06)                  

SIGNIFICANT USER-VISIBLE CHANGES

    o   The function termFreq() now processes the tolower and tokenize
	options first.

NEW FEATURES

    o   Catalan stopwords. Requested by Xavier Fernández i Marín.

BUG FIXES

    o   The function termFreq() now correctly accepts user-provided
	stopwords. Reported by Bettina Grün.

    o   The function termFreq() now correctly handles the lower bound
	of the option wordLength. Reported by Steven C. Bagley.

                 Changes in version 0.5-7 (2011-12-17)                  

SIGNIFICANT USER-VISIBLE CHANGES

    o   The function termFreq() provides two new arguments for
	generalized bounds checking of term frequencies and word
	lengths. This replaces the arguments minDocFreq and
	minWordLength.
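
	The entry does not name the new arguments. The sketch below
	assumes they correspond to the bounds and wordLengths control
	options documented for termFreq() in current tm versions; the
	document text is invented:

	```r
	library(tm)

	doc <- PlainTextDocument("a bb ccc ccc dddd")

	## Keep only terms with 2 to 3 characters.
	tf_len <- termFreq(doc, control = list(wordLengths = c(2, 3)))

	## Keep only terms occurring at least twice in the document.
	tf_bound <- termFreq(doc,
	                     control = list(wordLengths = c(1, Inf),
	                                    bounds = list(local = c(2, Inf))))
	```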

    o   The function termFreq() is now sensitive to the order of
	control options.

NEW FEATURES

    o   Weighting schemata for term-document matrices in SMART
	notation.

    o   Local and global options for term-document matrix construction.

    o   SMART stopword list was added.

                 Changes in version 0.5-5 (2011-02-20)                  

NEW FEATURES

    o   Access documents in a corpus by names (fallback to IDs if names
	are not set), i.e., allow a string for the corpus operator
	`[[`.

BUG FIXES

    o   The function findFreqTerms() now checks bounds on a global
	level (to comply with the manual page) instead of per document.
	Reported and fixed by Thomas Zapf-Schramm.

                 Changes in version 0.5-4 (2010-08-19)                  

SIGNIFICANT USER-VISIBLE CHANGES

    o   Use IETF language tags for language codes (instead of ISO
	639-2).

NEW FEATURES

    o   The function tm_tag_score() provides functionality to score
	documents based on the number of tags found. This is useful for
	sentiment analysis.

    o   The weighting function for term frequency-inverse document
	frequency, weightTfIdf(), now has an option for term
	normalization.
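
	A sketch of switching the normalization off, assuming the
	option is the normalize argument of weightTfIdf() as in
	current tm versions; the two-document corpus is invented:

	```r
	library(tm)

	corp <- VCorpus(VectorSource(c("apple banana", "apple cherry")))

	## Plain tf-idf without normalizing by document length.
	dtm <- DocumentTermMatrix(
	    corp,
	    control = list(weighting = function(x)
	                       weightTfIdf(x, normalize = FALSE)))

	m <- as.matrix(dtm)
	```

	Here "apple" occurs in both documents, so its weight is
	log2(2/2) = 0, while "banana" gets 1 * log2(2/1) = 1 in the
	first document.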

    o   Plotting functions to test for Zipf's and Heaps' law on a
	term-document matrix were added: Zipf_plot() and Heaps_plot().
	Contributed by Kurt Hornik.

                 Changes in version 0.5-3 (2010-02-19)                  

NEW FEATURES

    o   The reader function readRCV1asPlain() was added and combines
	the functionality of readRCV1() and as.PlainTextDocument().

    o   The function stemCompletion() has a set of new heuristics.

                 Changes in version 0.5-2 (2010-01-09)                  

SIGNIFICANT USER-VISIBLE CHANGES

    o   The function termFreq() which is used for building a
	term-document matrix now uses a whitespace oriented tokenizer
	as default.

NEW FEATURES

    o   A combine method for merging multiple term-document matrices
	was added (c.TermDocumentMatrix()).
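
	For illustration, two single-document matrices (toy data)
	could be merged as follows:

	```r
	library(tm)

	tdm1 <- TermDocumentMatrix(VCorpus(VectorSource("apple banana")))
	tdm2 <- TermDocumentMatrix(VCorpus(VectorSource("banana cherry")))

	## The result covers the union of terms and all documents.
	tdm <- c(tdm1, tdm2)

	nDocs(tdm)
	Terms(tdm)
	```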

    o   The function termFreq() now has an option to remove punctuation
	characters.

DEPRECATED & DEFUNCT

    o   The following functions have been removed:
	
	  • CSVSource() (use DataframeSource(read.csv(...,
	    stringsAsFactors = FALSE)) instead), and
	
	  • TermDocMatrix() (use DocumentTermMatrix() instead).

BUG FIXES

    o   removeWords() no longer skips words at the beginning or the end
	of a line. Reported by Mark Kimpel.

                 Changes in version 0.5-1 (2009-10-27)                  

BUG FIXES

    o   preprocessReut21578XML() no longer generates invalid file
	names.

                  Changes in version 0.5 (2009-09-10)                   

SIGNIFICANT USER-VISIBLE CHANGES

    o   All classes, functions, and generics are reimplemented using
	the S3 class system.

    o   The following functions have been renamed:
	
	  • activateCluster() to tm_startCluster(),
	
	  • asPlain() to as.PlainTextDocument(),
	
	  • deactivateCluster() to tm_stopCluster(),
	
	  • tmFilter() to tm_filter(),
	
	  • tmIndex() to tm_index(),
	
	  • tmIntersect() to tm_intersect(), and
	
	  • tmMap() to tm_map().

    o   Mail handling functionality is factored out to the
	tm.plugin.mail package.

DEPRECATED & DEFUNCT

    o   The following functions have been removed:
	
	  • tmTolower() (use tolower() instead), and
	
	  • replacePatterns() (use gsub() instead).

                  Changes in version 0.4 (2009-07-01)                   

SIGNIFICANT USER-VISIBLE CHANGES

    o   The Corpus class is now virtual providing an abstract
	interface.

    o   VCorpus, the default implementation of the abstract corpus
	interface (by subclassing), provides a corpus with volatile (=
	standard R object) semantics. It loads all documents into
	memory, and is designed for small to medium-sized data sets.

    o   PCorpus, an implementation of the abstract corpus interface (by
	subclassing), provides a corpus with permanent storage
	semantics. The actual data is stored in an external database
	(file) object (as supported by the filehash package), with
	automatic (un-)loading into memory. It is designed for systems
	with small memory.

    o   Language codes are now in ISO 639-2 (instead of ISO 639-1).

    o   Reader functions no longer have a load argument for lazy
	loading.

NEW FEATURES

    o   The reader function readReut21578XMLasPlain() was added and
	combines the functionality of readReut21578XML() and asPlain().

BUG FIXES

    o   weightTfIdf() no longer applies a binary weighting to an input
	matrix in term frequency format (which happened only in 0.3-4).

                 Changes in version 0.3-4 (2009-04-29)                  

SIGNIFICANT USER-VISIBLE CHANGES

    o   .onLoad() no longer tries to start an MPI cluster (which often
	failed in misconfigured environments). Use activateCluster()
	and deactivateCluster() instead.

    o   DocumentTermMatrix (the improved reimplementation of defunct
	TermDocMatrix) does not use the Matrix package anymore.

NEW FEATURES

    o   The DirSource() constructor now accepts the two new (optional)
	arguments pattern and ignore.case. With pattern one can define
	a regular expression for selecting only matching files, and
	ignore.case specifies whether pattern-matching is
	case-sensitive.
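
	A self-contained sketch using a temporary directory; the file
	names and contents are invented for the example:

	```r
	library(tm)

	demo_dir <- tempfile("tm-demo-")
	dir.create(demo_dir)
	writeLines("first", file.path(demo_dir, "a.txt"))
	writeLines("second", file.path(demo_dir, "b.TXT"))
	writeLines("ignored", file.path(demo_dir, "notes.md"))

	## Select files ending in ".txt", regardless of case.
	src <- DirSource(demo_dir, pattern = "\\.txt$", ignore.case = TRUE)
	corp <- VCorpus(src)
	n_docs <- length(corp)

	unlink(demo_dir, recursive = TRUE)  # clean up
	```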

    o   The readNewsgroup() reader function can now be configured for
	custom date formats (via the DateFormat argument).

    o   The readPDF() reader function can now be configured (via the
	PdfinfoOptions and PdftotextOptions arguments).

    o   The readDOC() reader function can now be configured (via the
	AntiwordOptions argument).

    o   Sources now can be vectorized. This allows faster corpus
	construction.

    o   New XMLSource class for arbitrary XML files.

    o   The new readTabular() reader function allows creating a custom
	reader configured via mappings from a tabular data structure.

    o   The new readXML() reader function allows reading arbitrary XML
	files which are described with a specification.

    o   The new tmReduce() transformation allows combining multiple
	maps into one transformation.

DEPRECATED & DEFUNCT

    o   CSVSource is defunct (use DataframeSource instead).

    o   weightLogical is defunct.

    o   TermDocMatrix is defunct (use DocumentTermMatrix or
	TermDocumentMatrix instead).

                 Changes in version 0.3-3 (2008-12-22)                  

NEW FEATURES

    o   The abstract Source class gets a default implementation for the
	stepNext() method. It increments the position counter by one, a
	reasonable value for most sources. For special purposes custom
	methods can be created via overloading stepNext() of the
	subclass.

    o   New URISource class for a single document identified by a
	Uniform Resource Identifier.

    o   New DataframeSource for documents stored in a data frame. Each
	row is interpreted as a single document.

BUG FIXES

    o   Fix off-by-one error in convertMboxEml() function. Reported by
	Angela Bohn.

    o   Sort row indices in sparse term-document matrices. Kudos to
	Martin Mächler for his suggestions.

    o   Sources and readers no longer evaluate calls in a non-standard
	way.

                 Changes in version 0.3-2 (2008-11-12)                  

NEW FEATURES

    o   Weighting functions now have an Acronym slot containing
	abbreviations of the weighting functions' names. This is
	useful when generating tables that indicate which weighting
	scheme was actually used in an experiment.

    o   The functions tmFilter(), tmIndex(), tmMap() and
	TermDocMatrix() now can use an MPI cluster (via the snow and
	Rmpi packages) if available. Use (de)activateCluster() to
	manually override cluster usage settings. Special thanks to
	Stefan Theussl for his constructive comments.

    o   The Source class receives a new Length slot. It contains the
	number of elements provided by the source (in the rare cases
	where the number cannot be determined in advance, it should be
	set to zero).