Error trying to read a PDF using readPDF from the tm package
Interesting: on my machine after a fresh start, pdf is the grDevices function that opens a PDF graphics device for plotting (it is not a PDF reader):
getAnywhere(pdf)
A single object matching ‘pdf’ was found
It was found in the following places
package:grDevices
namespace:grDevices [etc.]
But back to the problem of reading in PDF files as text: fiddling with the PATH is a bit hit-and-miss (and annoying if you work across several different computers), so I think the simplest and safest method is to call pdftotext using system(), as Tony Breyal describes here.
In your case it would be (note the two sets of quotes):
system(paste('"C:/Program Files/xpdf64/pdftotext.exe"',
             '"C:/Users/Raffael/Documents/17214.pdf"'), wait = FALSE)
This could easily be extended with an *apply function or a loop if you have many PDF files.
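For example, a minimal sketch of the batch version (assuming pdftotext.exe lives at the path used above; the folder name is hypothetical and should be adjusted to your setup):

```r
# Convert every PDF in a folder to plain text by calling pdftotext once per file.
pdftotext <- '"C:/Program Files/xpdf64/pdftotext.exe"'   # path is an assumption
pdf_dir   <- "C:/Users/Raffael/Documents"                # hypothetical folder

pdfs <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE)
invisible(sapply(pdfs, function(f) {
  # shQuote() handles spaces in file names; wait = TRUE ensures each
  # conversion finishes before you try to read the resulting .txt file
  system(paste(pdftotext, shQuote(f)), wait = TRUE)
}))
```

Note the switch to wait = TRUE here: if you intend to read the .txt output immediately afterwards, you want each conversion to complete first.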
tm readPDF: Error in file(con, "r") : cannot open the connection
Did some debugging and see that it fails in tm:::pdfinfo():

status <- system2("pdfinfo", shQuote(normalizePath(file)),
                  stdout = outfile)

This command doesn't create the outfile. According to "Redirect system2 stdout to a file on windows", this is a known bug!
R tm package readPDF error in strptime(d, fmt) : input string too long
Based on what I've read, this error has something to do with the way the readPDF function tries to build metadata for the file you're importing. Anyway, you can change the metadata handling by using the info option. For example, I usually circumvent this error by modifying the command in the following way (using your code):
doc <- readPDF(control = list(info = "-f", text = "-layout"))(elem = list(uri = filename), language = "en", id = "id1")
Where the addition of info = "-f" is the only change. This doesn't really "fix" the problem, but it bypasses the error. Cheers :)
Extracting text data from PDF files
Linux systems have pdftotext, which I had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
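A minimal sketch of that workflow from R (assuming pdftotext is installed and on the PATH, and that a file named foo.pdf exists in the working directory):

```r
# Run pdftotext on foo.pdf; by default it writes foo.txt next to the input,
# which can then be read back into R as ordinary text.
system2("pdftotext", "foo.pdf")
txt <- readLines("foo.txt")
```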
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
Corpus reading from pdf OR text in R
Do it the easier way using the readtext package. If your mix of .txt and .pdf files are in the same subdirectory, call this path_to_your_files/, then you can read them all in and make them into a tm Corpus using readtext(). This function automagically recognises different input file types and converts them into UTF-8 text for text analysis in R. (The rtext object created here is a special type of data.frame that includes a document identifier column and a column called text that contains the converted text contents of your input documents.)
rtext <- readtext::readtext("path_to_your_files/*")
tm::Corpus(tm::VectorSource(rtext[["text"]]))
readtext objects can also be used directly with the quanteda package as inputs to quanteda::corpus() if you wanted to try an alternative to tm.
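A minimal sketch of that route (reusing the same hypothetical path_to_your_files/ directory as above):

```r
# Build a quanteda corpus directly from the readtext object;
# corpus() accepts readtext's data.frame-like output as-is.
rtext <- readtext::readtext("path_to_your_files/*")
qcorp <- quanteda::corpus(rtext)
summary(qcorp)
```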