Reading Data from PDF Files into R

Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the original document does not contain actual text, as opposed to bitmapped images of text or possibly even uglier things than I can imagine, nothing other than OCR can help you.
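If you do end up needing OCR, here is a minimal sketch of that route in R, assuming the pdftools and tesseract packages (both on CRAN) and a hypothetical scanned.pdf:

library(pdftools)   # renders PDF pages to images
library(tesseract)  # bindings to the Tesseract OCR engine

# Rasterise each page at a resolution OCR can cope with, then OCR the images.
# pdf_convert() returns the file names of the PNGs it writes.
pngs <- pdf_convert("scanned.pdf", format = "png", dpi = 300)
text <- vapply(pngs, ocr, character(1))  # one character string per page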

On top of that, in my sad experience there's no guarantee that the apps which create PDF docs all behave the same, so depending on how the doc was built, the data in your table may or may not come out in the desired order. Be cautious.

Probably better to make a couple grad students transcribe the data for you. They're cheap :-)

importing data from a pdf file into R

I have written a package that can help extract text from PDFs. It's written from scratch in C++ and is fairly fast (usually a bit faster than pdftools). At the moment you still need to wrangle the text into a table, as you would in pdftools. In your case, it would work like this:

library(dplyr)
library(PDFR)

# Read the words on page 4 of the pdf, along with their positions on the page
df <- pdfpage("C:/users/Administrator/Documents/sales.pdf", 4)

df <- df[df$left > 440, ] %>%          # keep only the right-hand columns
  group_by(top) %>%                    # one group per line of text
  arrange(left, .by_group = TRUE) %>%  # left-to-right order within each line
  summarize(text = paste(text, collapse = ",")) %>%
  arrange(-top) %>%                    # sort the lines into page order
  filter(seq(nrow(.)) > 4) %>%         # drop the four header lines
  `[[`(2) %>%                          # pull out the text column
  read.csv(text = ., header = FALSE,
           col.names = c("freq", "cum_freq", "perc", "cum_perc"))

Which gives you:

#>     freq cum_freq perc cum_perc
#> 1    142      142 0.04     0.04
#> 2     15      157 0.00     0.04
#> 3     78      235 0.02     0.06
#> 4    269      504 0.07     0.13
#> 5    840     1344 0.21     0.34
#> 6   1690     3034 0.42     0.76
#> 7   3254     6288 0.81     1.57
#> 8   5413    11701 1.35     2.92
#> 9   7659    19360 1.91     4.83
#> 10  9696    29056 2.42     7.24
#> 11 11529    40585 2.87    10.12
#> 12 13145    53730 3.28    13.39
#> 13 13830    67560 3.45    16.84
#> 14 14844    82404 3.70    20.54
#> 15 15153    97557 3.78    24.32
#> 16 15120   112677 3.77    28.09
#> 17 15347   128024 3.83    31.92
#> 18 15525   143549 3.87    35.79
#> 19 15710   159259 3.92    39.70
#> 20 15596   174855 3.89    43.59
#> 21 15529   190384 3.87    47.46
#> 22 15451   205835 3.85    51.31
#> 23 15259   221094 3.80    55.12
#> 24 15028   236122 3.75    58.86
#> 25 15147   251269 3.78    62.64
#> 26 14683   265952 3.66    66.30
#> 27 14469   280421 3.61    69.91
#> 28 14229   294650 3.55    73.45
#> 29 13523   308173 3.37    76.82
#> 30 13246   321419 3.30    80.13
#> 31 12987   334406 3.24    83.36
#> 32 12264   346670 3.06    86.42
#> 33 11964   358634 2.98    89.40
#> 34 10841   369475 2.70    92.11
#> 35  9958   379433 2.48    94.59
#> 36  8529   387962 2.13    96.72
#> 37  6729   394691 1.68    98.39
#> 38  4437   399128 1.11    99.50
#> 39  2010   401138 0.50   100.00

Although this may seem a bit involved, it is great for PDFs like yours where the tables have the same layout on each page. If you ran the above code inside an lapply loop, it could extract many pages at a time far more quickly than cutting and pasting would, as sketched below.
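Here, the helper name read_sales_page and the page range 4:9 are invented for illustration, assuming the later pages share the layout of page 4:

read_sales_page <- function(page) {
  df <- pdfpage("C:/users/Administrator/Documents/sales.pdf", page)
  df[df$left > 440, ] %>%
    group_by(top) %>%
    arrange(left, .by_group = TRUE) %>%
    summarize(text = paste(text, collapse = ",")) %>%
    arrange(-top) %>%
    filter(seq(nrow(.)) > 4) %>%
    `[[`(2) %>%
    read.csv(text = ., header = FALSE,
             col.names = c("freq", "cum_freq", "perc", "cum_perc"))
}

# One data frame per page, bound into a single table
combined <- do.call(rbind, lapply(4:9, read_sales_page))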

To install the package you need devtools:

install.packages("devtools")
devtools::install_github("AllanCameron/PDFR")

Edit

If there are installation problems, here is the equivalent in pdftools:

install.packages("pdftools")

library(dplyr)

df <- pdftools::pdf_data("https://tea.texas.gov/sites/default/files/Scale%20Score%20Distribution%20Graph%201_Grade%203%20to%208%20English-r2_tagged.pdf")[[4]]

df <- df[df$x > 440, ] %>%          # keep only the right-hand columns
  group_by(y) %>%                   # one group per line of text
  arrange(x, .by_group = TRUE) %>%  # left-to-right order within each line
  summarize(text = paste(text, collapse = ",")) %>%
  arrange(y) %>%                    # top-to-bottom (y grows down the page)
  `[[`(2) %>%                       # pull out the text column
  `[`(3:41) %>%                     # keep only the 39 data rows
  read.csv(text = ., header = FALSE,
           col.names = c("freq", "cum_freq", "perc", "cum_perc"))

Corpus reading from pdf OR text in R

Do it the easier way using the readtext package. If your mix of .txt and .pdf files is in the same subdirectory (call this path_to_your_files/), then you can read them all in and make them into a tm Corpus using readtext(). This function automagically recognises different input file types and converts them into UTF-8 text for text analysis in R. (The rtext object created here is a special type of data.frame that includes a document identifier column and a column called text containing the converted text of your input documents.)

rtext <- readtext::readtext("path_to_your_files/*")
tm::Corpus(tm::VectorSource(rtext[["text"]]))

readtext objects can also be used directly with the quanteda package as inputs to quanteda::corpus() if you wanted to try an alternative to tm.
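For instance, a short sketch reusing the rtext object from above:

library(quanteda)

qcorp <- corpus(rtext)  # corpus() accepts readtext objects directly
summary(qcorp)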

Extracting text data from PDF files

Linux systems have pdftotext, which I have had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
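You can also drive it from R; a rough sketch, assuming pdftotext (part of poppler-utils) is on your PATH and foo.pdf stands in for your file:

# -layout asks pdftotext to preserve the physical layout of the text
system2("pdftotext", args = c("-layout", "foo.pdf"))
txt <- readLines("foo.txt")  # pdftotext writes foo.txt next to foo.pdf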

That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
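pdftools is one such converter that stays inside R; a minimal sketch with a hypothetical foo.pdf:

library(pdftools)

pages <- pdf_text("foo.pdf")  # one character string per page
cat(pages[1])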


