How to Scrape/Automatically Download PDF Files from a Document Search Web Interface in R

How to download a PDF file with R from the web (encoding issue)

There are a couple of problems here. Firstly, the links to some of the files are not properly formatted as URLs: they contain spaces and other special characters. To convert them you can use url_escape(), which is already available to you because loading rvest also loads xml2, the package that provides url_escape().

Secondly, the path you are saving to is relative to your R home directory, but you are not telling R this. You either need the full path, like "C:/Users/Manoel/Documents/downloaded/teste.pdf", or a path expanded from your home directory, like path.expand("~/downloaded/teste.pdf").
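As a quick illustration of what these two helpers do (the file name below is made up, and the expanded path will differ on your machine):

url_escape("decisao LAI 2019.pdf")
#> [1] "decisao%20LAI%202019.pdf"   # spaces are percent-encoded into a valid URL

path.expand("~/downloaded/teste.pdf")
#> e.g. "C:/Users/Manoel/Documents/downloaded/teste.pdf" on Windows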

This code should do what you need:

library(tidyverse)
library(rvest)

# scraping hyperlinks
full_links <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html" %>%
  read_html() %>%
  html_nodes(".borderTD a") %>%
  html_attr("href") %>%
  url_escape() %>%
  {paste0("http://www.ouvidoriageral.sp.gov.br/", .)}

# Look at the first page in Firefox
browseURL(full_links[1], encodeIfNeeded = TRUE, browser = "firefox.exe")

# Save the first PDF to the "downloaded" folder, if that folder exists
download.file(full_links[1], path.expand("~/downloaded/teste.pdf"), mode = "wb")
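To grab every decision rather than just the first, a loop over full_links is the natural extension. This is a sketch that assumes the ~/downloaded folder already exists and names each file after the last part of its URL:

# download every linked PDF into ~/downloaded (assumed to exist already),
# skipping files that were downloaded on a previous run
for (link in full_links) {
  destfile <- path.expand(file.path("~/downloaded", basename(link)))
  if (!file.exists(destfile)) {
    download.file(link, destfile, mode = "wb")
  }
}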

Opening a PDF from a webpage in R

library(stringr)
library(rvest)
library(pdftools)

# Scrape the website with rvest for all href links
p <- rvest::read_html("https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm")
pdfs <- p %>% rvest::html_elements("a") %>% html_attr("href")

# Keep only the fomcminutes PDF paths and rebuild the full URLs
pdfs <- pdfs[stringr::str_detect(pdfs, "fomcminutes.*pdf")]
pdfs <- pdfs[!is.na(pdfs)]
paths <- paste0("https://www.federalreserve.gov/", pdfs)

# Scrape minutes as list of text files
pdf_data <- lapply(paths, pdftools::pdf_text)
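If you also want the raw PDFs on disk, not just their text, the same paths vector can be passed to download.file(). This is only a sketch; the folder name "fomc_minutes" is an arbitrary choice:

# optionally keep local copies of the minutes as well
dir.create("fomc_minutes", showWarnings = FALSE)
for (p in paths) {
  download.file(p, file.path("fomc_minutes", basename(p)), mode = "wb")
}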

How to conditionally name PDFs saved in a for loop in R

for (i in 1:nrow(head_data)) {
  # retrieve the url from head_data
  url <- head_data[i, ]$url_pdf

  # create the filename, including .pdf and the subfolder (pdfs/)
  filename <- paste0("pdfs/", head_data[i, ]$id, ".pdf")

  # check whether the file already exists; if not, download and save it
  if (!file.exists(filename)) {
    download.file(url, destfile = filename, mode = "wb")
  }
}
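If some of the URLs in head_data turn out to be dead, a single failed download will abort the whole loop. A variant wrapped in tryCatch() (a sketch using the same head_data columns as above) keeps going and just reports the failure:

for (i in 1:nrow(head_data)) {
  url <- head_data[i, ]$url_pdf
  filename <- paste0("pdfs/", head_data[i, ]$id, ".pdf")

  if (!file.exists(filename)) {
    # a failed download is reported but does not stop the remaining iterations
    tryCatch(
      download.file(url, destfile = filename, mode = "wb"),
      error = function(e) message("Skipping ", url, ": ", conditionMessage(e))
    )
  }
}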

Webscraping with R

I converted the just_words list into a data frame and then used separate() from the tidyr package to split the column.

library(rvest)
library(dplyr)
library(stringr)
library(tidyr)
words <- read_html("https://www.education.com/magazine/article/Ed_50_Words_SAT_Loves_2/")
just_words <- words %>% html_nodes("ol") %>% html_text()
x <- as.data.frame(strsplit(just_words, "\r\n\t"), col.names = "V1")
head(x)
t <- x %>% separate(V1, c("Word", "Meaning"), extra = "merge", fill = "left")
head(t)
head(t)

Output:

> head(t)
Word Meaning
1 abstract not concrete
2 aesthetic having to do with the appreciation of beauty
3 alleviate to ease a pain or a burden
4 ambivalent simultaneously feeling opposing feelings; uncertain
5 apathetic feeling or showing little emotion
6 auspicious favorable; promising

If you are looking for more nicely formatted output, use the pander package. The output displays as below:

> library(pander)
> pander(head(t))

---------------------------------------
Word Meaning
---------- ----------------------------
abstract not concrete

aesthetic having to do with the
appreciation of beauty

alleviate to ease a pain or a burden

ambivalent simultaneously feeling
opposing feelings; uncertain

apathetic feeling or showing little
emotion

auspicious favorable; promising
---------------------------------------

To remove the line breaks and trim stray whitespace in Meaning:

t <- t %>% mutate(Meaning = str_trim(gsub("[\r\n]", "", Meaning)))

Web-Scraping with R

You picked a tough problem to learn on.

This site uses JavaScript to load the article information. In other words, the link loads a set of scripts which run when the page loads, grab the information (probably from a database), and insert it into the DOM. htmlParse(...) just grabs the base HTML and parses that, so the links you want are simply not present.

AFAIK the only way around this is to use the RSelenium package. It essentially lets you pass the page through what amounts to a browser simulator, which does run the scripts. The catch with RSelenium is that you need to download not only the package but also a separate Selenium Server. The package's introductory vignette is a nice place to start with RSelenium.

Once you've done that, inspection of the source in a browser shows that the article links are all in the href attribute of anchor tags with class="doclink". This is straightforward to extract using XPath. NEVER NEVER NEVER use regex to parse XML.

library(XML)
library(RSelenium)
url <- "http://spiderbook.com/company/17495/details?rel=300795"
checkForServer() # download Selenium Server, if not already present
startServer() # start Selenium Server
remDr <- remoteDriver() # instantiates a new driver
remDr$open() # open connection
remDr$navigate(url) # grab and process the page (including scripts)
doc <- htmlParse(remDr$getPageSource()[[1]])
links <- as.character(doc['//a[@class="doclink"]/@href'])
links
# [1] "http://www.automotiveworld.com/news-releases/volkswagen-selects-bosch-chargepoint-e-golf-charging-solution-providers/"
# [2] "http://insideevs.com/category/vw/"
# [3] "http://www.greencarcongress.com/2014/07/20140711-vw.html"
# [4] "http://www.vdubnews.com/volkswagen-chooses-bosch-and-chargepoint-as-charging-solution-providers-for-its-e-golf-2"
# [5] "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=84543228"
# [6] "http://insideevs.com/volkswagen-selects-chargepoint-bosch-e-golf-charging/"
# [7] "http://www.calcharge.org/2014/07/"
# [8] "http://nl.anygator.com/search/volkswagen+winterbanden"
# [9] "http://nl.anygator.com/search/winterbanden+actie+volkswagen+caddy"
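One caveat: checkForServer() and startServer() were removed from more recent RSelenium releases. A rough equivalent today (a sketch, not tested against your setup; the browser and port are arbitrary choices) is to let rsDriver() start both the server and the client:

# rsDriver() starts a Selenium server plus a browser client in one call
rD <- rsDriver(browser = "firefox", port = 4545L, verbose = FALSE)
remDr <- rD$client
remDr$navigate(url)
doc <- htmlParse(remDr$getPageSource()[[1]])

# shut everything down when finished
remDr$close()
rD$server$stop()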

Web scraping with R and rvest

Here is what I tried for you. I was playing with SelectorGadget and checking the page source. After some inspection, I think you need to use <title> and <div class="article_body">. The map_dfr() part loops through the three articles in article and builds a data frame, with one row per article. I think you will still need some string manipulation to get clean text, but this should help you scrape the content you need.

library(tidyverse)
library(rvest)

article <- c("https://www.americanthinker.com/articles/2019/11/why_rich_people_love_poor_immigrants.html",
             "https://tmp.americanthinker.com/blog/2015/01/california_begins_giving_drivers_licenses_to_illegal_aliens.html",
             "https://www.americanthinker.com/articles/2018/11/immigrants_will_not_fund_our_retirement.html")

map_dfr(.x = article,
        .f = function(x){
          tibble(Title = read_html(x) %>%
                   html_nodes("title") %>%
                   html_text(),
                 Content = read_html(x) %>%
                   html_nodes(xpath = "//div[@class='article_body']") %>%
                   html_text(),
                 Site = "AmericanThinker")}) -> result

# Title Content Site
# <chr> <chr> <chr>
#1 Why Rich People Love Poor Immigra… "Soon after the Immigration Act of 1965 was passed, rea… AmericanT…
#2 California begins giving driver's… "The largest state in the union began handing out driv… AmericanT…
#3 Immigrants Will Not Fund Our Reti… "Ask Democrats why they support open borders, and they … AmericanT…
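As mentioned above, some string clean-up is still needed. A short follow-up (a sketch, assuming you are happy collapsing all runs of whitespace) uses str_squish() from stringr, which library(tidyverse) already attaches:

# collapse newlines, tabs, and repeated spaces into single spaces
result <- result %>%
  mutate(Title = str_squish(Title),
         Content = str_squish(Content))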

