Title: | Import Articles from 'Europresse' Using the 'tm' Text Mining Framework |
---|---|
Description: | Provides a 'tm' Source to create corpora from articles exported from the 'Europresse' content provider as HTML files. It is able to read both text content and meta-data information (including source, date, title, author and pages). |
Authors: | Milan Bouchet-Valat [aut, cre] |
Maintainer: | Milan Bouchet-Valat <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.4 |
Built: | 2024-10-15 04:15:13 UTC |
Source: | https://github.com/r-forge/r-temis |
This package provides a tm Source to create corpora from articles exported from the Europresse content provider as HTML files.
Typical usage is to create a corpus from HTML files exported from Europresse (here called myEuropresseArticles.html).
Frequently, it is necessary to specify the encoding of the texts via the encoding argument of EuropresseSource.
    library(tm)
    library(tm.plugin.europresse)

    # Import corpus
    source <- EuropresseSource("myEuropresseArticles.html")
    corpus <- Corpus(source)

    # See how many articles were imported
    corpus

    # See the contents of the first article and its meta-data
    inspect(corpus[1])
    meta(corpus[[1]])
See EuropresseSource for more details and real examples.
Milan Bouchet-Valat <[email protected]>
Construct a source for an input containing a set of articles exported from Europresse in the HTML format.
EuropresseSource(x, encoding = "UTF-8")
x | Either a character identifying the file or a connection. |
encoding | A character giving the encoding of x. |
This function imports the body of the articles, but also sets several meta-data variables on individual documents:
datetimestamp: The publication date.
heading: The title of the article.
origin: The newspaper the article comes from.
section: If available, the part of the newspaper containing the article.
pages: If available, the pages where the article appeared.
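As a minimal sketch of how these variables can be read back, the following prints each of them for the first imported article. It assumes the test file shipped with the package (the one used in the examples) is available; format_meta_value is a hypothetical helper introduced here only to print missing values readably, and is not part of tm or tm.plugin.europresse.

```r
# Hypothetical helper (not part of tm or tm.plugin.europresse):
# turn a meta-data value into a single printable string.
format_meta_value <- function(x) {
  if (is.null(x) || length(x) == 0L) {
    return("<not available>")
  }
  paste(trimws(format(x)), collapse = ", ")
}

# Meta-data variables set by EuropresseSource, as listed above.
tags <- c("datetimestamp", "heading", "origin", "section", "pages")

# Only run the import when both packages are installed.
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("tm.plugin.europresse", quietly = TRUE)) {
  library(tm)
  library(tm.plugin.europresse)

  file <- system.file("texts", "europresse_test2.html",
                      package = "tm.plugin.europresse")
  corpus <- Corpus(EuropresseSource(file))

  # Print every documented meta-data variable of the first article.
  first <- corpus[[1]]
  for (tag in tags) {
    cat(tag, ": ", format_meta_value(meta(first, tag)), "\n", sep = "")
  }
}
```

Variables documented as "if available" (section, pages) may be absent for a given article, which is why the helper prints a placeholder instead of failing.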
Please note that the encoding declared in Europresse HTML files often does not correspond to the one actually used in the text: in that case, you will need to find out the correct encoding and specify it manually via the encoding argument.
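One possible way to find a plausible encoding is sketched below. It is only a heuristic under stated assumptions: guess_encoding is a hypothetical helper (not part of tm.plugin.europresse), and the "ISO-8859-1" fallback is an assumption that may be wrong for a given export (for instance, the file could use windows-1252 instead). The helper checks whether the file's bytes are valid UTF-8 with base R's validUTF8() and otherwise returns the fallback.

```r
# Hypothetical helper (not part of tm.plugin.europresse): report "UTF-8" when
# every line of the file is valid UTF-8, and a fallback encoding otherwise.
# The fallback is an assumption and should be checked against the actual text.
guess_encoding <- function(path, fallback = "ISO-8859-1") {
  lines <- readLines(path, warn = FALSE)
  if (all(validUTF8(lines))) "UTF-8" else fallback
}

# Only run the import when both packages are installed.
if (requireNamespace("tm", quietly = TRUE) &&
    requireNamespace("tm.plugin.europresse", quietly = TRUE)) {
  library(tm)
  library(tm.plugin.europresse)

  file <- system.file("texts", "europresse_test2.html",
                      package = "tm.plugin.europresse")

  # Pass the guessed encoding explicitly instead of relying on the default.
  corpus <- Corpus(EuropresseSource(file, encoding = guess_encoding(file)))
  inspect(corpus[1])
}
```

If accented characters still look wrong after the import, try another value for the encoding argument and inspect the first article again.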
An object of class EuropresseSource, which extends the class Source, representing a set of articles from Europresse.
Milan Bouchet-Valat
readEuropresseHTML2 for the function actually parsing individual articles.
getSources to list available sources.
    library(tm)
    file <- system.file("texts", "europresse_test2.html",
                        package = "tm.plugin.europresse")
    corpus <- Corpus(EuropresseSource(file))

    # See the contents of the documents
    inspect(corpus)

    # See meta-data associated with first article
    meta(corpus[[1]])
Read in an article exported from Europresse in the HTML format.
readEuropresseHTML1(elem, language, id)
readEuropresseHTML2(elem, language, id)
elem |
A |
language |
A |
id |
A |
readEuropresseHTML1 reads documents in the old format, while readEuropresseHTML2 reads documents in the new one. EuropresseSource automatically chooses the correct reader based on the structure of the file.
A PlainTextDocument with the contents of the article and the available meta-data set.
Milan Bouchet-Valat
getReaders to list available reader functions.