Use R to Convert PDF Files to Text Files for Text Mining

Use R to convert PDF files to text files for text mining

Yes, not really an R question as IShouldBuyABoat notes, but something that R can do with only minor contortions...

Use R to convert PDF files to txt files...

# folder with 1000s of PDFs
dest <- "C:\\Users\\Desktop"

# make a vector of PDF file names
myfiles <- list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE)

# convert each PDF file that is named in the vector into a text file
# text file is created in the same directory as the PDFs
# note that my pdftotext.exe is in a different location to yours
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"',
                                         paste0('"', i, '"')), wait = FALSE))
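If you would rather stay inside R and skip the external xpdf binary, a rough alternative (my own sketch, assuming the pdftools package is installed) does the same conversion with pdftools::pdf_text:

# install.packages("pdftools")  # if not already installed
library(pdftools)

# write a .txt file next to each PDF; pdf_text() returns one element per page
lapply(myfiles, function(i) {
  writeLines(pdf_text(i), sub("\\.pdf$", ".txt", i, ignore.case = TRUE))
})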

Extract only abstracts from txt files...

# if you just want the abstracts, we can use regex to extract that part of
# each txt file. This assumes that the abstract is always between the words 'Abstract'
# and 'Introduction'
mytxtfiles <- list.files(path = dest, pattern = "\\.txt$", full.names = TRUE)
abstracts <- lapply(mytxtfiles, function(i) {
  j <- paste0(scan(i, what = character()), collapse = " ")
  regmatches(j, gregexpr("(?<=Abstract).*?(?=Introduction)", j, perl = TRUE))
})

Write abstracts into separate txt files...

# write abstracts as txt files 
# (or use them in the list for whatever you want to do next)
lapply(seq_along(abstracts), function(i) {
  write.table(abstracts[i], file = paste(mytxtfiles[i], "abstract", "txt", sep = "."),
              quote = FALSE, row.names = FALSE, col.names = FALSE, eol = " ")
})

And now you're ready to do some text mining on the abstracts.
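For example (a minimal sketch of one possible next step, assuming the tm package is installed), the abstract files written above can be loaded into a corpus and turned into a document-term matrix:

library(tm)

# read the *.abstract.txt files produced above
abstract_files <- list.files(path = dest, pattern = "abstract\\.txt$", full.names = TRUE)
abstract_text  <- vapply(abstract_files,
                         function(f) paste(readLines(f, warn = FALSE), collapse = " "),
                         character(1))

# build a corpus and a simple document-term matrix as a starting point
abstract_corpus <- VCorpus(VectorSource(abstract_text))
dtm <- DocumentTermMatrix(abstract_corpus,
                          control = list(removePunctuation = TRUE, stopwords = TRUE))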

Convert .pdf to .txt

Late answer:

But I recently discovered that with the current version of tm (0.7-4) you can read PDFs directly into a corpus if you have pdftools installed (install.packages("pdftools")).

library(tm)

directory <- getwd() # change this to the directory where the PDF files are located

# read the PDFs with readPDF; the default engine is pdftools (see ?readPDF for more info)
my_corpus <- VCorpus(DirSource(directory, pattern = "\\.pdf$"),
                     readerControl = list(reader = readPDF))

Extracting text data from PDF files

Linux systems have pdftotext, with which I have had reasonable success. By default, it creates foo.txt from a given foo.pdf.
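Called from R, that might look like the following (a sketch on my part, assuming pdftotext is on your PATH):

# each foo.pdf in the working directory becomes foo.txt
pdfs <- list.files(pattern = "\\.pdf$")
invisible(lapply(pdfs, function(f) system2("pdftotext", args = shQuote(f))))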

That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.

How do I convert multiple PDFs into a corpus for text analysis in R?

There are multiple ways, but if you want to get the documents into a corpus there is a simple way to do it. It requires that the pdftools package is installed (install.packages("pdftools")), as that is the engine used to read the PDFs. Then it is just a question of using the tm package to read everything into a corpus.

library(tm)

directory <- getwd() # change this to the directory where the files are located

# read the PDFs with readPDF; the default engine is pdftools (see ?readPDF for more info)
my_corpus <- VCorpus(DirSource(directory, pattern = "\\.pdf$"),
                     readerControl = list(reader = readPDF))

Text Mining PDFs - Convert List of Character Vectors (Strings) to Dataframe

This should do the trick:

# dummy data generation: file names and a list of strings (your corpus)
files <- paste("file", 1:6)

strings <- list("a","b","c", "d","e","f")
names(strings) <- files
t(as.data.frame(unlist(strings)))

#                 file 1 file 2 file 3 file 4 file 5 file 6
# unlist(strings) "a"    "b"    "c"    "d"    "e"    "f"

Edit, based on the updated data structure:

files <- paste("file", 1:6)

strings <- list(c("a", "b"), c("c", "d"), c("e", "f"),
                c("g", "h"), c("i", "j"), c("k", "l"))

names(strings) <- files
t(data.frame(Doc=sapply(strings, paste0, collapse = " ")))

#     file 1 file 2 file 3 file 4 file 5 file 6
# Doc "a b"  "c d"  "e f"  "g h"  "i j"  "k l"

Corpus reading from pdf OR text in R

Do it the easier way using the readtext package. If your mix of .txt and .pdf files is in the same subdirectory, call this path_to_your_files/, then you can read them all in with readtext() and then make them into a tm Corpus. readtext() automagically recognises different input file types and converts them into UTF-8 text for text analysis in R. (The rtext object created here is a special type of data.frame that includes a document identifier column and a column called text containing the converted text contents of your input documents.)

rtext <- readtext::readtext("path_to_your_files/*")
tm::Corpus(tm::VectorSource(rtext[["text"]]))

readtext objects can also be used directly with the quanteda package as inputs to quanteda::corpus() if you wanted to try an alternative to tm.
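For example (a minimal sketch, assuming quanteda is installed; qcorp is just an illustrative name):

library(quanteda)

qcorp <- corpus(rtext)   # readtext output is accepted directly
summary(qcorp, n = 5)    # quick look at the first few documents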

Convert a two-column text document to a single line for text mining

With the fixed-width left column, we can split each line into the first 37 characters and the rest, appending these to strings for the left and right columns. For instance, with a regex:

use warnings;
use strict;

my $file = 'two_column.txt';
open my $fh, '<', $file or die "Can't open $file: $!";

my ($left_col, $right_col);

while (<$fh>) {
    my ($left, $right) = /(.{37})(.*)/;

    $left =~ s/\s*$/ /;

    $left_col  .= $left;
    $right_col .= $right;
}
close $fh;

print $left_col, $right_col, "\n";

This prints the whole text. Or join the columns: my $text = $left_col . $right_col;

The regex pattern (.{37}) matches any character (.) exactly 37 times ({37}), capturing that with (); the (.*) captures everything remaining. These are returned by the match and assigned. The trailing spaces in $left are condensed into one. Both are then appended (.=) to the running column strings.

Or from the command line

perl -wne'
($l, $r) = /(.{37})(.*)/; $l =~ s/\s*$/ /; $cL .= $l; $cR .= $r;
}{ print $cL,$cR,"\n"
' two_column.txt

where }{ closes the implicit per-line loop and starts a block that runs at the end, before exit (after all lines have been processed).
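Since the rest of this page works in R, here is a rough R equivalent of the same idea (my own sketch, assuming the same 37-character-wide left column and a file called two_column.txt):

lines <- readLines("two_column.txt")

# first 37 characters belong to the left column, the rest to the right column
left  <- trimws(substr(lines, 1, 37), which = "right")
right <- substring(lines, 38)

# append one column after the other, as in the Perl version
text <- paste(c(left, right), collapse = " ")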


