Reading Data from PDF Files into R

Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the original document does not contain actual text, as opposed to bitmapped images of text or possibly even uglier things than I can imagine, nothing other than OCR can help you.
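If you do end up needing OCR, here is a minimal sketch of that route in R, assuming the pdftools and tesseract packages (both on CRAN) and a hypothetical scanned.pdf:

library(pdftools)   # renders PDF pages to images
library(tesseract)  # bindings to the Tesseract OCR engine

# Rasterise each page at a resolution OCR can cope with, then OCR the images.
# pdf_convert() returns the file names of the PNGs it writes.
pngs <- pdf_convert("scanned.pdf", format = "png", dpi = 300)
text <- vapply(pngs, ocr, character(1))  # one character string per page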

On top of that, in my sad experience there's no guarantee that the apps which create PDF docs all behave the same, so depending on how the doc was built, the data in your table may or may not come out in the desired order. Be cautious.

Probably better to make a couple grad students transcribe the data for you. They're cheap :-)

importing data from a pdf file into R

I have written a package that can help extract text from PDFs. It's written from scratch in C++ and is fairly fast (usually a bit faster than pdftools). At the moment you still need to wrangle the text into a table, as you would in pdftools. In your case, it would work like this:

library(dplyr)
library(PDFR)

# Read the words on page 4 of the pdf, along with their positions on the page
df <- pdfpage("C:/users/Administrator/Documents/sales.pdf", 4)

df <- df[df$left > 440, ] %>%          # keep only the right-hand columns
  group_by(top) %>%                    # one group per line of text
  arrange(left, .by_group = TRUE) %>%  # left-to-right order within each line
  summarize(text = paste(text, collapse = ",")) %>%
  arrange(-top) %>%                    # sort the lines into page order
  filter(seq(nrow(.)) > 4) %>%         # drop the four header lines
  `[[`(2) %>%                          # pull out the text column
  read.csv(text = ., header = FALSE,
           col.names = c("freq", "cum_freq", "perc", "cum_perc"))

Which gives you:

#>     freq cum_freq perc cum_perc
#> 1    142      142 0.04     0.04
#> 2     15      157 0.00     0.04
#> 3     78      235 0.02     0.06
#> 4    269      504 0.07     0.13
#> 5    840     1344 0.21     0.34
#> 6   1690     3034 0.42     0.76
#> 7   3254     6288 0.81     1.57
#> 8   5413    11701 1.35     2.92
#> 9   7659    19360 1.91     4.83
#> 10  9696    29056 2.42     7.24
#> 11 11529    40585 2.87    10.12
#> 12 13145    53730 3.28    13.39
#> 13 13830    67560 3.45    16.84
#> 14 14844    82404 3.70    20.54
#> 15 15153    97557 3.78    24.32
#> 16 15120   112677 3.77    28.09
#> 17 15347   128024 3.83    31.92
#> 18 15525   143549 3.87    35.79
#> 19 15710   159259 3.92    39.70
#> 20 15596   174855 3.89    43.59
#> 21 15529   190384 3.87    47.46
#> 22 15451   205835 3.85    51.31
#> 23 15259   221094 3.80    55.12
#> 24 15028   236122 3.75    58.86
#> 25 15147   251269 3.78    62.64
#> 26 14683   265952 3.66    66.30
#> 27 14469   280421 3.61    69.91
#> 28 14229   294650 3.55    73.45
#> 29 13523   308173 3.37    76.82
#> 30 13246   321419 3.30    80.13
#> 31 12987   334406 3.24    83.36
#> 32 12264   346670 3.06    86.42
#> 33 11964   358634 2.98    89.40
#> 34 10841   369475 2.70    92.11
#> 35  9958   379433 2.48    94.59
#> 36  8529   387962 2.13    96.72
#> 37  6729   394691 1.68    98.39
#> 38  4437   399128 1.11    99.50
#> 39  2010   401138 0.50   100.00

Although this may seem a bit involved, it is great for PDFs like yours where the tables have the same layout on each page. If you ran the above code inside an lapply loop, it could extract many pages at a time far more quickly than cutting and pasting would, as sketched below.
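Here, the helper name read_sales_page and the page range 4:9 are invented for illustration, assuming the later pages share the layout of page 4:

read_sales_page <- function(page) {
  df <- pdfpage("C:/users/Administrator/Documents/sales.pdf", page)
  df[df$left > 440, ] %>%
    group_by(top) %>%
    arrange(left, .by_group = TRUE) %>%
    summarize(text = paste(text, collapse = ",")) %>%
    arrange(-top) %>%
    filter(seq(nrow(.)) > 4) %>%
    `[[`(2) %>%
    read.csv(text = ., header = FALSE,
             col.names = c("freq", "cum_freq", "perc", "cum_perc"))
}

# One data frame per page, bound into a single table
combined <- do.call(rbind, lapply(4:9, read_sales_page))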

To install the package you need devtools:

install.packages("devtools")
devtools::install_github("AllanCameron/PDFR")

Edit

If there are installation problems, here is the equivalent in pdftools:

install.packages("pdftools")

library(dplyr)

df <- pdftools::pdf_data("https://tea.texas.gov/sites/default/files/Scale%20Score%20Distribution%20Graph%201_Grade%203%20to%208%20English-r2_tagged.pdf")[[4]]

df <- df[df$x > 440, ] %>%          # keep only the right-hand columns
  group_by(y) %>%                   # one group per line of text
  arrange(x, .by_group = TRUE) %>%  # left-to-right order within each line
  summarize(text = paste(text, collapse = ",")) %>%
  arrange(y) %>%                    # top-to-bottom (y grows down the page)
  `[[`(2) %>%                       # pull out the text column
  `[`(3:41) %>%                     # keep only the 39 data rows
  read.csv(text = ., header = FALSE,
           col.names = c("freq", "cum_freq", "perc", "cum_perc"))

Corpus reading from pdf OR text in R

Do it the easier way using the readtext package. If your mix of .txt and .pdf files is in the same subdirectory (call this path_to_your_files/), then you can read them all in and make them into a tm Corpus using readtext(). This function automagically recognises different input file types and converts them into UTF-8 text for text analysis in R. (The rtext object created here is a special type of data.frame that includes a document identifier column and a column called text containing the converted text of your input documents.)

rtext <- readtext::readtext("path_to_your_files/*")
tm::Corpus(tm::VectorSource(rtext[["text"]]))

readtext objects can also be used directly with the quanteda package as inputs to quanteda::corpus() if you wanted to try an alternative to tm.
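For instance, a short sketch reusing the rtext object from above:

library(quanteda)

qcorp <- corpus(rtext)  # corpus() accepts readtext objects directly
summary(qcorp)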

Extracting text data from PDF files

Linux systems have pdftotext, which I have had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
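You can also drive it from R; a rough sketch, assuming pdftotext (part of poppler-utils) is on your PATH and foo.pdf stands in for your file:

# -layout asks pdftotext to preserve the physical layout of the text
system2("pdftotext", args = c("-layout", "foo.pdf"))
txt <- readLines("foo.txt")  # pdftotext writes foo.txt next to foo.pdf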

That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
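pdftools is one such converter that stays inside R; a minimal sketch with a hypothetical foo.pdf:

library(pdftools)

pages <- pdf_text("foo.pdf")  # one character string per page
cat(pages[1])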


