Error trying to read a PDF using readPDF from the tm package

Error trying to read a PDF using readPDF from the tm package

Interesting: on my machine, after a fresh start, pdf is a grDevices function for writing graphics to a PDF file:

 getAnywhere(pdf)
A single object matching ‘pdf’ was found
It was found in the following places
  package:grDevices
  namespace:grDevices [etc.]

But back to the problem of reading in PDF files as text: fiddling with the PATH is a bit hit-and-miss (and annoying if you work across several different computers), so I think the simplest and safest method is to call pdftotext using system, as Tony Breyal describes.

In your case it would be (note the two sets of quotes):

system(paste('"C:/Program Files/xpdf64/pdftotext.exe"', 
             '"C:/Users/Raffael/Documents/17214.pdf"'), wait=FALSE)

This could easily be extended with an *apply function or loop if you have many PDF files.
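As a rough sketch of that extension, the snippet below loops over a vector of PDF paths and runs pdftotext on each one. The helper names txt_path_for and convert_pdfs are made up for illustration, the executable path is simply the one from the example above, and it assumes pdftotext writes foo.txt next to foo.pdf when no output name is given. It uses system2 with shQuote instead of hand-written quotes, and waits for each conversion to finish.

```r
# Hypothetical helper: the .txt path pdftotext is assumed to write by default
txt_path_for <- function(pdf) {
  sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)
}

# Hypothetical helper: convert every PDF in `pdf_files`, one at a time.
# `exe` is an assumption; point it at wherever pdftotext lives on your machine.
convert_pdfs <- function(pdf_files,
                         exe = "C:/Program Files/xpdf64/pdftotext.exe") {
  if (!file.exists(exe)) {
    stop("pdftotext not found at: ", exe)
  }
  for (pdf in pdf_files) {
    # shQuote() protects paths that contain spaces
    system2(exe, args = shQuote(pdf))
  }
  # Return the text files we expect to have been created
  invisible(txt_path_for(pdf_files))
}

# Usage (not run):
# pdfs <- list.files("C:/Users/Raffael/Documents",
#                    pattern = "\\.pdf$", full.names = TRUE)
# txts <- convert_pdfs(pdfs)
# texts <- lapply(txts, readLines)
```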

tm readPDF: Error in file(con, "r") : cannot open the connection

I did some debugging and found that it fails in tm:::pdfinfo():

status <- system2("pdfinfo", shQuote(normalizePath(file)), 
        stdout = outfile)

This command doesn't create the outfile. According to the question "Redirect system2 stdout to a file on Windows", this is a bug!
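One possible way to sidestep the file redirection is to let system2 capture the output as a character vector with stdout = TRUE and parse it in R. This is only a sketch: whether capturing works on the affected Windows setup is an assumption, the helper names pdfinfo_lines and parse_pdfinfo are invented here, and it is not how tm itself does it. It assumes pdfinfo prints lines of the form "Key:   value".

```r
# Hypothetical helper: run pdfinfo and return its output as lines of text,
# without redirecting stdout to a file
pdfinfo_lines <- function(file, exe = "pdfinfo") {
  if (!nzchar(Sys.which(exe))) {
    stop("pdfinfo not found on the PATH")
  }
  system2(exe, shQuote(normalizePath(file)), stdout = TRUE)
}

# Hypothetical helper: split "Key:   value" lines into a named character vector
parse_pdfinfo <- function(lines) {
  keys <- sub(":.*$", "", lines)              # text before the first colon
  vals <- trimws(sub("^[^:]*:", "", lines))   # text after the first colon
  stats::setNames(vals, keys)
}

# Usage (not run):
# info <- parse_pdfinfo(pdfinfo_lines("C:/Users/Raffael/Documents/17214.pdf"))
# info["Pages"]
```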

R tm package readPDF error in strptime(d, fmt) : input string too long

Based on what I've read, this error has something to do with the way the readPDF function tries to build metadata for the file you're importing. Anyway, you can change how the metadata is gathered by using the "info" option. For example, I usually circumvent this error by modifying the command in the following way (using your code):

doc <- readPDF(control = list(info = "-f", text = "-layout"))(elem = list(uri = filename), language = "en", id = "id1")

The addition of info = "-f" is the only change. This doesn't really "fix" the problem, but it bypasses the error. Cheers :)

Extracting text data from PDF files

Linux systems have pdftotext, which I have had reasonable success with. By default, it creates foo.txt from a given foo.pdf.

That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.

Corpus reading from pdf OR text in R

Do it the easier way using the readtext package. If your mix of .txt and .pdf files is in the same subdirectory (call it path_to_your_files/), you can read them all in with readtext() and then make them into a tm Corpus. This function automagically recognises different input file types and converts them into UTF-8 text for text analysis in R. (The rtext object created here is a special type of data.frame that includes a document identifier column and a column called text that contains the converted text contents of your input documents.)

rtext <- readtext::readtext("path_to_your_files/*")
tm::Corpus(tm::VectorSource(rtext[["text"]]))

readtext objects can also be used directly with the quanteda package as inputs to quanteda::corpus() if you wanted to try an alternative to tm.
