Extract Text from Two-Column PDF with R

I had the same problem. For each page of my PDFs, I found the character positions where a space occurs most frequently across the lines and stored them in a vector. Then I sliced each line at those positions.

library(pdftools)
src <- ""
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

QTD_COLUMNS <- 2
read_text <- function(text) {
  result <- ''
  # Get the index of every " " on the page.
  lstops <- gregexpr(pattern = " ", text)
  # Put the indices of the most frequent " " in a vector.
  stops <- as.integer(names(sort(table(unlist(lstops)), decreasing = TRUE)[1:2]))
  # Slice based on the specified number of columns (this can be improved).
  for (i in seq_len(QTD_COLUMNS)) {
    temp_result <- sapply(text, function(x) {
      start <- 1
      stop <- stops[i]
      if (i > 1)
        start <- stops[i - 1] + 1
      if (i == QTD_COLUMNS) # last column, read until end.
        stop <- nchar(x) + 1
      substr(x, start = start, stop = stop)
    }, USE.NAMES = FALSE)
    temp_result <- trim(temp_result)
    result <- append(result, temp_result)
  }
  result
}

txt <- pdf_text(src)
result <- ''
for (i in seq_along(txt)) {
  page <- txt[i]
  t1 <- unlist(strsplit(page, "\n"))
  maxSize <- max(nchar(t1))
  # Pad every line to the same width so the space positions line up.
  t1 <- paste0(t1, strrep(" ", maxSize - nchar(t1)))
  result <- append(result, read_text(t1))
}
result
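
To see the idea without a PDF, here is a minimal sketch on a made-up three-line "page". The sample lines and the names page_lines, gutter and reading_order are my own assumptions for illustration, not part of the code above. It only handles two columns with a single gutter.

```r
# Hypothetical sample "page": two columns separated by a one-space gutter
# at character position 11 (made-up data, not taken from any PDF).
page_lines <- c(
  "green tree birds sing",
  "morning in quietly at",
  "the garden the window"
)

# Positions of every space in every line, pooled into one vector.
space_positions <- unlist(gregexpr(" ", page_lines, fixed = TRUE))

# The position that holds a space in the most lines is taken as the gutter.
position_counts <- table(space_positions)
gutter <- as.integer(names(position_counts)[which.max(position_counts)])

# Slice each line at the gutter, then read the left column before the right.
left_column  <- trimws(substr(page_lines, 1, gutter))
right_column <- trimws(substr(page_lines, gutter + 1, nchar(page_lines)))
reading_order <- c(left_column, right_column)
print(reading_order)
```

On real pages the gutter is usually several spaces wide and some positions may tie, so the result is worth checking by eye for a few pages.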

Convert two columns text document to single line for text mining

Since the left column has a fixed width, we can split each line into its first 37 characters and the rest, appending these to separate strings for the left and right columns. For instance, with a regex:

use warnings;
use strict;

my $file = 'two_column.txt';
open my $fh, '<', $file or die "Can't open $file: $!";

my ($left_col, $right_col);

while (<$fh>)
{
    # First 37 characters go to $left, the remainder of the line to $right
    my ($left, $right) = /(.{37})(.*)/;

    # Condense trailing spaces in the left part into a single space
    $left =~ s/\s*$/ /;

    $left_col .= $left;
    $right_col .= $right;
}
close $fh;

print $left_col, $right_col, "\n";

This prints the whole text. Alternatively, join the columns into a single string: my $text = $left_col . $right_col;

The regex pattern (.{37}) matches any character (.) and does this exactly 37 times ({37}), capturing that with (); the (.*) captures all remaining. These are returned by the regex, and assigned. The trailing spaces in $left are condensed into one. Both are then appended (.=).

Or from the command line

perl -wne'
($l, $r) = /(.{37})(.*)/; $l =~ s/\s*$/ /; $cL .= $l; $cR .= $r;
}{ print $cL,$cR,"\n"
' two_column.txt

where }{ closes the implicit loop that -n wraps around the code and opens a block that acts like an END block: it runs once before exit, after all lines have been processed.
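
If you would rather stay in R, the same two-capture regex can be sketched as below. This is only an illustration under my own assumptions: the sample lines are made up, the left column is 12 characters wide instead of 37 to keep it short, and every line is assumed to be at least that long (shorter lines would need padding first).

```r
# Hypothetical sample lines; the left column is padded to 12 characters.
col_width <- 12
lines <- c(
  "first line  third line",
  "second line fourth line"
)

# Capture exactly col_width characters, then everything that remains.
pattern <- sprintf("^(.{%d})(.*)$", col_width)
left  <- trimws(sub(pattern, "\\1", lines))  # first capture group
right <- trimws(sub(pattern, "\\2", lines))  # second capture group

# Left column first, then right column, joined into a single line.
text <- paste(c(left, right), collapse = " ")
print(text)
```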

From 2 columns of text to 1

Would the following approach work for you?

# split character string on line breaks
output.by.line <- strsplit(output, "\n")[[1]]

# consider everything up to the first 42 characters as column 1, everything after as column 2
output.by.line <- c(substring(output.by.line, 1, 42), # column 1
                    substring(output.by.line, 43))    # column 2

# remove leading / trailing whitespace
output.by.line <- trimws(output.by.line)

# remove blank lines
output.by.line <- output.by.line[nchar(output.by.line) > 0]

# preface each section number with \n to facilitate splitting
# (may require some manual check as not every section number appears to be in its own line)
output.by.line <- ifelse(nchar(output.by.line) <= 2 &
                           !is.na(as.integer(output.by.line)),
                         paste0("\n", output.by.line),
                         output.by.line)

# join all lines together & split by section, dropping empty lines if any
output.by.section <- strsplit(paste(output.by.line, collapse = " "), "\n")[[1]]
output.by.section <- output.by.section[nchar(output.by.section) > 0]

# remove repeated white space inside each section, if any
output.by.section <- stringr::str_squish(output.by.section)

Result:

> output.by.section
[1] "16 Then they journeyed from Bethel. And when there was but a little distance to go to Ephrath, Rachel labored in childbirth, and she had hard labor."
[2] "17 Now it came to pass, when she was in hard labor, that the midwife said to her, \"Do not fear; you will have this son also.\""
[3] "18 And so it was, as her soul was departing (for she died), that she called his name Ben-Oni; but his father called him Benjamin."
[4] "19 So Rachel died and was buried on the way to Ephrath (that is, Bethlehem)."
[5] "20 And Jacob set a pillar on her grave, which is the pillar of Rachel's grave to this day."
[6] "21 Then Israel journeyed and pitched his tent beyond the tower of Eder."
[7] "22 And it happened, when Israel dwelt in that land, that Reuben went and lay with Bilhah his father's concubine; and Israel heard about it. Now the sons of Jacob were twelve:"
[8] "23 the sons of Leah were Reuben, Jacob's firstborn, and Simeon, Levi, Judah, Issachar, and Zebulun;"
[9] "24 the sons of Rachel were Joseph and Benjamin;"
[10] "25 the sons of Bilhah, Rachel's maidservant, were Dan and Naphtali;"
[11] "26 and the sons of Zilpah, Leah's maidservant, were Gad and Asher. These were the sons of Jacob who were born to him in Padan Aram."
[12] "27 Then Jacob came to his father Isaac at Mamre, or Kirjath Arba (that is, Hebron), where Abraham and Isaac had dwelt."
[13] "28 Now the days of Isaac were one hundred and eighty years."
[14] "29 So Isaac breathed his last and died, and was gathered to his people, being old and full of days. And his sons Esau and Jacob buried him. 36Now this is the genealogy of Esau, who is Edom."
[15] "2 Esau took his wives from the daughters of Canaan: Adah the daughter of Elon the Hittite; Aholibamah the daughter of Anah, the daughter of Zibeon the Hivite;"
[16] "3 and Basemath, Ishmael's daughter, sister of Nebajoth."
[17] "4 Now Adah bore Eliphaz to Esau, and Basemath bore Reuel."
[18] "5 And Aholibamah bore Jeush, Jaalam, and Korah. These were the sons of Esau who were born to him in the land of Canaan."
[19] "6 Then Esau took his wives, his sons, his daughters, and all the persons of his household, his cattle and all his animals, and all his goods which he had gained in the land of Canaan, and went to a country away from the presence of his brother Jacob."

(Note: Yes, the text starting with 36 hasn't been identified as a new paragraph here, because the number wasn't on its own line. I'm not sure what the best way to deal with this would be. If it's just for a few pages, a manual check and correction would probably be reasonable. Otherwise, it will depend on the numbering logic throughout the text, and is probably worth a question of its own.)
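
One possible heuristic for such fused numbers, offered as a rough sketch rather than a tested solution: treat digits that follow whitespace and are glued directly to a capital letter as the start of a new section. The sample string is taken from element [14] above; the names section and parts are made up for the example, and the rule would misfire on tokens such as "4WD".

```r
# Sample text taken from element [14] of the result above.
section <- "And his sons Esau and Jacob buried him. 36Now this is the genealogy of Esau, who is Edom."

# Put a line break before digits that follow whitespace and are glued to a
# capital letter, and insert the missing space after the number.
fixed <- gsub("(?<=\\s)(\\d+)(?=[A-Z])", "\n\\1 ", section, perl = TRUE)

# Split on the inserted line breaks and tidy up the pieces.
parts <- trimws(strsplit(fixed, "\n", fixed = TRUE)[[1]])
print(parts)
```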

On the file downloading part, you may wish to try the solution from this question (i.e. specify mode = "wb" as one of the arguments for download.file).


