Use R to convert PDF files to text files for text mining
Yes, this is not really an R question, as IShouldBuyABoat notes, but it is something that R can do with only minor contortions...
Use R to convert PDF files to txt files...
# folder with 1000s of PDFs
dest <- "C:\\Users\\Desktop"
# make a vector of PDF file names
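# (pattern is a regular expression, so anchor it to the file extension)
myfiles <- list.files(path = dest, pattern = "\\.pdf$", full.names = TRUE)
# convert each PDF file that is named in the vector into a text file
# text file is created in the same directory as the PDFs
# note that my pdftotext.exe is in a different location to yours
# wait = FALSE starts each conversion without blocking, so let them all
# finish before reading the txt files in the next step
lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"',
                                         paste0('"', i, '"')), wait = FALSE))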
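Because pattern in list.files() is a regular expression, a bare "pdf" also matches names such as notes_pdf_list.txt. Here is a minimal sketch of a stricter filter, and of building the quoted command for each file without running anything. The file names are made up for illustration, and the converter path is only an assumption copied from the answer above.

```r
# Hypothetical file names, standing in for the result of list.files()
candidates <- c("paper one.pdf", "PAPER2.PDF", "notes_pdf_list.txt", "data.csv")

# Anchor the pattern to the extension; ignore.case also catches ".PDF"
is_pdf <- grepl("\\.pdf$", candidates, ignore.case = TRUE)
pdfs <- candidates[is_pdf]

# Assumed location of the converter; adjust to your own installation
pdftotext_exe <- "C:/Program Files/xpdf/bin64/pdftotext.exe"

# Build one command string per file, quoting paths that contain spaces
cmds <- paste0('"', pdftotext_exe, '" "', pdfs, '"')
print(cmds)
```

Printing the commands first is a cheap way to check the quoting before handing them to system().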
Extract only abstracts from txt files...
# if you just want the abstracts, we can use regex to extract that part of
# each txt file. Assumes that the abstract is always between the words
# 'Abstract' and 'Introduction'
mytxtfiles <- list.files(path = dest, pattern = "\\.txt$", full.names = TRUE)
abstracts <- lapply(mytxtfiles, function(i) {
  # read the file as whitespace-separated tokens and join them into one string
  j <- paste0(scan(i, what = character()), collapse = " ")
  regmatches(j, gregexpr("(?<=Abstract).*?(?=Introduction)", j, perl = TRUE))
})
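To see what the lookaround pattern does without touching any files, here is a small sketch on two made-up strings. It uses regexpr() (first match only) rather than gregexpr(), and returns NA when a document has no 'Abstract' ... 'Introduction' span. The helper name extract_abstract is hypothetical.

```r
# Made-up document texts: one with an abstract, one without
docs <- c(
  with_abstract = "A Title Abstract We study PDF conversion. Introduction Text mining is ...",
  no_abstract   = "A Title Summary Nothing matches here. Background More text."
)

# Hypothetical helper: text between 'Abstract' and 'Introduction', or NA
extract_abstract <- function(txt) {
  hits <- regmatches(txt, regexpr("(?<=Abstract).*?(?=Introduction)", txt, perl = TRUE))
  if (length(hits) == 0) NA_character_ else trimws(hits)
}

abstracts_demo <- vapply(docs, extract_abstract, character(1))
print(abstracts_demo)
```

Handling the no-match case explicitly helps when some PDFs use a heading other than 'Introduction'.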
Write abstracts into separate txt files...
# write abstracts as txt files
# (or use them in the list for whatever you want to do next)
# seq_along() is safe even when the list of abstracts is empty
lapply(seq_along(abstracts), function(i) {
  write.table(abstracts[i],
              file = paste(mytxtfiles[i], "abstract", "txt", sep = "."),
              quote = FALSE, row.names = FALSE, col.names = FALSE, eol = " ")
})
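As an alternative sketch, writeLines() can write each abstract as plain text. This example writes to a temporary directory, and both the abstracts and the file names are made up, so it runs without any PDFs present.

```r
# Made-up abstracts and matching (hypothetical) txt file names
abstracts_demo <- list("First abstract text.", "Second abstract text.")
txt_names <- c("paper1.txt", "paper2.txt")

# Write into a temporary directory so nothing real is overwritten
out_dir <- tempdir()
out_files <- file.path(
  out_dir,
  paste0(tools::file_path_sans_ext(txt_names), ".abstract.txt")
)

for (k in seq_along(abstracts_demo)) {
  writeLines(unlist(abstracts_demo[[k]]), out_files[k])
}

# Read one file back to confirm the round trip
print(readLines(out_files[1]))
```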
And now you're ready to do some text mining on the abstracts.
Convert .pdf to .txt
Late answer: I recently discovered that with the current version of tm (0.7-4) you can read PDFs directly into a corpus if you have pdftools installed (install.packages("pdftools")).
library(tm)
directory <- getwd() # change this to the directory where the pdf files are located
# read the pdfs with readPDF; the default engine used is pdftools, see ?readPDF for more info
# pattern is a regular expression, so escape the dot and anchor it to the end of the name
my_corpus <- VCorpus(DirSource(directory, pattern = "\\.pdf$"),
                     readerControl = list(reader = readPDF))
Extracting text data from PDF files
Linux systems have pdftotext, which I had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
How do I convert multiple pdf's into a corpus for text analysis in R?
There are multiple ways, but if you want to get it into a corpus there is a simple way to do it. It does require that the package pdftools is installed (install.packages("pdftools")), as that will be the engine used to read the pdfs. Then it is just a question of using the tm package to read everything into a corpus.
library(tm)
directory <- getwd() # change this to the directory where the files are located
# read the pdfs with readPDF; the default engine used is pdftools, see ?readPDF for more info
# pattern is a regular expression, so escape the dot and anchor it to the end of the name
my_corpus <- VCorpus(DirSource(directory, pattern = "\\.pdf$"),
                     readerControl = list(reader = readPDF))
Text Mining PDFs - Convert List of Character Vectors (Strings) to Dataframe
This should do the trick:
#dummy data generation: file names and a list of strings (your corpus)
files <- paste("file", 1:6)
strings <- list("a","b","c", "d","e","f")
names(strings) <-files
t(as.data.frame(unlist(strings)))
# file 1 file 2 file 3 file 4 file 5 file 6
# unlist(strings) "a" "b" "c" "d" "e" "f"
Edit, based on the updated data structure in the question:
files <- paste("file", 1:6)
strings <- list(c("a","b"),c("c", "d"),c("e","f"),
c("g","h"), c("i","j"), c("k", "l"))
names(strings) <-files
t(data.frame(Doc=sapply(strings, paste0, collapse = " ")))
# file 1 file 2 file 3 file 4 file 5 file 6
# Doc "a b" "c d" "e f" "g h" "i j" "k l"
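If one row per document is more convenient than one column per document, a long-format data frame is a possible alternative. This is only a sketch on the same dummy data. The column names doc and text are arbitrary choices, not something the question requires.

```r
# Same dummy data as above: a named list of character vectors
files <- paste("file", 1:6)
strings <- list(c("a", "b"), c("c", "d"), c("e", "f"),
                c("g", "h"), c("i", "j"), c("k", "l"))
names(strings) <- files

# One row per document: collapse each character vector into a single string
docs_df <- data.frame(
  doc  = names(strings),
  text = unname(vapply(strings, paste, character(1), collapse = " ")),
  stringsAsFactors = FALSE
)
print(docs_df)
```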
Corpus reading from pdf OR text in R
Do it the easier way using the readtext package. If your mix of .txt and .pdf files is in the same subdirectory, call this path_to_your_files/, then you can read them all in with readtext() and then make them into a tm Corpus. This function automagically recognises different input file types and converts them into UTF-8 text for text analysis in R. (The rtext object created here is a special type of data.frame that includes a document identifier column and a column called text that contains the converted text contents of your input documents.)
rtext <- readtext::readtext("path_to_your_files/*")
# VectorSource() also lives in tm, so it needs the tm:: prefix unless tm is attached
tm::Corpus(tm::VectorSource(rtext[["text"]]))
readtext objects can also be used directly with the quanteda package as inputs to quanteda::corpus() if you wanted to try an alternative to tm.
convert two columns text document to single line for text mining
With the fixed-width left column, we can split each line into the first 37 chars and the rest, adding these to strings for the left and right column. For instance, with a regex in Perl:
use warnings;
use strict;
my $file = 'two_column.txt';
open my $fh, '<', $file or die "Can't open $file: $!";
my ($left_col, $right_col);
while (<$fh>)
{
my ($left, $right) = /(.{37})(.*)/;
$left =~ s/\s*$/ /;
$left_col .= $left;
$right_col .= $right;
}
close $fh;
print $left_col, $right_col, "\n";
This prints the whole text. Alternatively, join the columns into one string with my $text = $left_col . $right_col;
The regex pattern (.{37}) matches any character (.) and does this exactly 37 times ({37}), capturing that with (); the (.*) captures all that remains. These captures are returned by the regex and assigned. The trailing spaces in $left are condensed into one. Both are then appended (.=).
Or from the command line
perl -wne'
($l, $r) = /(.{37})(.*)/; $l =~ s/\s*$/ /; $cL .= $l; $cR .= $r;
}{ print $cL,$cR,"\n"
' two_column.txt
where }{ starts the END block, which runs before exit (after all lines have been processed).
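For readers who would rather stay in R, the same fixed-width split can be sketched with substr(). This is not a translation of the Perl above, only the same idea. The function name join_two_columns and the demo lines are made up, and the demo uses a width of 10 instead of 37 to keep it short.

```r
# Hypothetical helper: read the left column top to bottom, then the right one
join_two_columns <- function(lines, width = 37) {
  left  <- trimws(substr(lines, 1, width), which = "right")
  right <- trimws(substring(lines, width + 1), which = "right")
  parts <- c(left, right)
  # Drop empty pieces, e.g. from lines shorter than the left column
  paste(parts[nzchar(parts)], collapse = " ")
}

# Demo lines with a 10-character left column
demo <- c(sprintf("%-10s%s", "alpha", "one"),
          sprintf("%-10s%s", "beta", "two"))

print(join_two_columns(demo, width = 10))
```

With a real file, readLines("two_column.txt") would supply the lines argument.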