Rvest Not Recognizing CSS Selector

Rvest not recognizing css selector

It's making an XHR request to generate the HTML, so the nodes you're after aren't in the page source that rvest sees. Try requesting that endpoint directly (which should also make it easier to automate the data capture):

library(httr)
library(xml2)
library(rvest)

res <- GET("http://www.racingpost.com/greyhounds/result_by_meeting_full.sd",
           query = list(r_date = "2015-12-26",
                        meeting_id = 18))

doc <- read_html(content(res, as="text"))

html_nodes(doc, ".black")
## {xml_nodeset (56)}
## [1] <span class="black">A9</span>
## [2] <span class="black">£61</span>
## [3] <span class="black">470m</span>
## [4] <span class="black">-30</span>
## [5] <span class="black">H2</span>
## [6] <span class="black">£105</span>
## [7] <span class="black">470m</span>
## [8] <span class="black">-30</span>
## [9] <span class="black">A7</span>
## [10] <span class="black">£61</span>
## [11] <span class="black">470m</span>
## [12] <span class="black">-30</span>
## [13] <span class="black">A5</span>
## [14] <span class="black">£66</span>
## [15] <span class="black">470m</span>
## [16] <span class="black">-30</span>
## [17] <span class="black">A8</span>
## [18] <span class="black">£61</span>
## [19] <span class="black">470m</span>
## [20] <span class="black">-20</span>
## ...
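
To automate the data capture further, you can pull the text out of those spans and reshape it into a table. A minimal sketch, assuming each race header contributes the same four spans in order (the column names are illustrative guesses, not from the original answer):

vals <- html_text(html_nodes(doc, ".black"))

# 56 spans / 4 per race header = one row per race; the names are guesses
races <- as.data.frame(matrix(vals, ncol = 4, byrow = TRUE),
                       stringsAsFactors = FALSE)
names(races) <- c("grade", "prize", "distance", "going")
head(races)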

Rvest is unable to find the node specified by css selector, how do I fix it?

You can easily regex the seller name out of the response, as it is contained in a script tag (the page presumably renders it from there when the browser runs javascript, which rvest does not do).

library(rvest)
library(magrittr)
library(stringr)

p <- read_html('https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946') %>% html_text()
seller_name <- str_match_all(p,'"sellerName":"(.*?)"')[[1]][,2][1]
print(seller_name)

Regex:

The pattern "sellerName":"(.*?)" matches the sellerName key in the embedded JSON and lazily captures its value up to the next double quote.

rvest: how to find required css-selector

tl;dr The problem is not the css selectors. It's the encoding. Specify encoding = 'latin1'

library(rvest) # also provides the %>% pipe

read_html('https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00001/fnameorig_796482.html', encoding = "latin1") %>% 
  html_nodes('[class^=WordSection]') %>%
  html_text() %>%
  length()

Curl:

You could also use curl.

library(rvest)
library(curl)

text_info <- curl_fetch_memory("https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00001/fnameorig_796482.html") %>%
  {rawToChar(.$content)} %>%
  read_html() %>%
  html_nodes("[class^=WordSection]") %>%
  html_text()

CSS Selectors:

Use a css attribute = value selector with the starts with operator (^) to get all the nodes whose class value starts with WordSection.

Given that there is a lot of nesting, you may decide to use nth-child range selectors or other css selector combinations to restrict the match list and avoid repeated material.

Write some custom function(s) to manage string cleaning.
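
For example, a minimal sketch of such a cleaning helper (the function is illustrative, not part of the original answer):

# collapse runs of whitespace (including newlines) and trim each extracted block
clean_text <- function(x) {
  x <- gsub("\u00a0", " ", x, fixed = TRUE) # replace non-breaking spaces
  trimws(gsub("\\s+", " ", x))
}

read_html('https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00001/fnameorig_796482.html', encoding = "latin1") %>%
  html_nodes('[class^=WordSection]') %>%
  html_text() %>%
  clean_text()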

You can of course use different css selectors if you so choose.



Rvest webscraping error - identifying css or xpath?

This has nothing to do with using the wrong selector. The site you're scraping does something super interesting on first-access:

[Screenshot of the proxied first request: the response body is just javascript that sets a cookie and reloads the page]

When you hit the page it sets a cookie then refreshes the page (this is one of the stupidest ways I've seen to force a "session", ever).

Unless you use a proxy server to capture the web requests you'd never really be able to see this even in the network tab of browser developer tools. You could also see it, though, by looking at what came back from the initial read_html() call you did (it just has the javascript+redirect).

Neither read_html() nor httr::GET() can help you with this directly since the way the cookie is set is via javascript.

BUT! All hope is not lost and no silly third-party requirement like Selenium or Splash is required (I'm shocked that wasn't already suggested by the resident experts as it seems to be the default response these days).

Let's get the cookie. Make sure this is a FRESH, RESTARTED, NEW R session, since libcurl (which curl uses, and which is in turn used by httr::GET(), which read_html() ultimately uses) maintains cookies. We'll be relying on that behaviour to keep scraping pages, but if anything goes awry you will likely need to start over with a fresh session.

library(xml2)
library(httr)
library(rvest)
library(janitor)

# Get access cookie

httr::GET(
  url = "http://www.rca.gov.rw/wemis/registration/all.php",
  query = list(
    start = "0",
    status = "approved"
  )
) -> res

# the body of this first response is just the javascript that sets _accessKey2,
# so pull the cookie value out of that line
ckie <- httr::content(res, as = "text", encoding = "UTF-8")
ckie <- unlist(strsplit(ckie, "\r\n"))
ckie <- grep("cookie", ckie, value = TRUE)
ckie <- gsub("^document.cookie = '_accessKey2=|'$", "", ckie)

Now, we're going to set the cookie and get our PHP session cookie, both of which will persist afterwards:

httr::GET(
  url = "http://www.rca.gov.rw/wemis/registration/all.php",
  httr::set_cookies(`_accessKey2` = ckie),
  query = list(
    start = "0",
    status = "approved"
  )
) -> res

Now, there are over 400 pages so we're going to cache the raw HTML in the event you scrape something wrong and need to re-parse the pages. That way you can iterate over files vs hit the site again. To do this we'll make a directory for them:

dir.create("rca-temp-scrape-dir")

Now, create the pagination start numbers:

pgs <- seq(0L, 8920L, 20)
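
For reference, that yields one start offset per page:

length(pgs) # 447 offsets: 0, 20, 40, ..., 8920
head(pgs)   # 0 20 40 60 80 100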

And, iterate over them. NOTE: I don't need all 400+ pages so I just did 10. Remove the [1:10] to get them all. Also, unless you like hurting other people, please keep the sleep in since you don't pay for the cpu/bandwidth and that site is likely very fragile.

lapply(pgs[1:10], function(pg) {

  Sys.sleep(5) # Please don't hammer servers you don't pay for

  httr::GET(
    url = "http://www.rca.gov.rw/wemis/registration/all.php",
    query = list(
      start = pg,
      status = "approved"
    )
  ) -> res

  # YOU SHOULD USE httr FUNCTIONS TO CHECK FOR STATUS
  # SINCE THERE CAN BE HTTP ERRORS THAT YOU MAY NEED TO
  # HANDLE TO AVOID CRASHING THE ITERATION
  # (see the sketch after this block)

  out <- httr::content(res, as = "text", encoding = "UTF-8")

  # THIS CACHES THE RAW HTML SO YOU CAN RE-SCRAPE IT FROM DISK IF NECESSARY

  writeLines(out, file.path("rca-temp-scrape-dir", sprintf("rca-page-%s.html", pg)))

  out <- xml2::read_html(out)
  out <- rvest::html_node(out, "table.primary")
  out <- rvest::html_table(out, header = TRUE, trim = TRUE)

  janitor::clean_names(out) # makes better column names

}) -> recs
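
As the status-check comment above notes, you should guard each iteration against failed requests. A minimal sketch (not part of the original answer) that could sit right after the httr::GET() call inside the loop:

if (httr::http_error(res)) {
  warning(sprintf("start=%s returned HTTP %s; skipping this page",
                  pg, httr::status_code(res)))
  return(NULL) # lapply() just collects a NULL for this page
}

do.call(rbind.data.frame, ...) drops NULL entries, so the final bind below still works.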

Finally, we'll combine those 10 data frames into one:

recs <- do.call(rbind.data.frame, recs)

str(recs)
## 'data.frame': 200 obs. of 9 variables:
## $ s_no : num 1 2 3 4 5 6 7 8 9 10 ...
## $ code : chr "BUG0416" "RBV0494" "GAS0575" "RSZ0375" ...
## $ name : chr "URUMURI RWA NGERUKA" "BADUKANA IBAKWE NYAKIRIBA" "UBUDASA COOPERATIVE" "KODUKB" ...
## $ certificate: chr "RCA/0734/2018" "RCA/0733/2018" "RCA/0732/2018" "RCA/0731/2018" ...
## $ reg_date : chr "10.12.2018" "-" "10.12.2018" "07.12.2018" ...
## $ province : chr "East" "West" "Mvk" "West" ...
## $ district : chr "Bugesera" "Rubavu" "Gasabo" "Rusizi" ...
## $ sector : chr "Ngeruka" "Nyakiliba" "Remera" "Bweyeye" ...
## $ activity : chr "ubuhinzi (Ibigori, Ibishyimbo)" "ubuhinzi (Imboga)" "transformation (Amasabuni)" "ubworozi (Amafi)" ...

If you're a tidyverse user you can also just do:

purrr::map_df(pgs[1:10], ~{

  Sys.sleep(5)

  httr::GET(
    url = "http://www.rca.gov.rw/wemis/registration/all.php",
    httr::set_cookies(`_accessKey2` = ckie),
    query = list(
      start = .x,
      status = "approved"
    )
  ) -> res

  out <- httr::content(res, as = "text", encoding = "UTF-8")
  writeLines(out, file.path("rca-temp-scrape-dir", sprintf("rca-page-%s.html", .x)))

  out <- xml2::read_html(out)
  out <- rvest::html_node(out, "table.primary")
  out <- rvest::html_table(out, header = TRUE, trim = TRUE)

  janitor::clean_names(out)

}) -> recs

vs the lapply/do.call/rbind.data.frame approach.
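
Since the raw HTML was cached to rca-temp-scrape-dir, you can also re-parse everything from disk later without hitting the site again. A minimal sketch (not part of the original answer):

cached <- list.files("rca-temp-scrape-dir", pattern = "\\.html$", full.names = TRUE)

lapply(cached, function(f) {
  out <- xml2::read_html(f)
  out <- rvest::html_node(out, "table.primary")
  out <- rvest::html_table(out, header = TRUE, trim = TRUE)
  janitor::clean_names(out)
}) -> recs_from_disk

recs_from_disk <- do.call(rbind.data.frame, recs_from_disk)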

Cannot find the correct CSS selector for webscraping

Might as well complete the set of recommendations.


You don't need to worry about dynamic classes. Use a stable class from the multi-value class attribute as the parent, then a child combinator to get the child a tag:

library(rvest)
library(magrittr)
library(xml2) # for url_absolute()

url <- "https://www.imdb.com/title/tt0562992/?ref_=ttep_ep1"

link <- read_html(url) %>%
  html_element(".rating-bar__base-button > a") %>%
  html_attr("href") %>%
  url_absolute(url)

Or, as IMDb has a consistent approach to these things, avoid making a request and simply do a substitution on the query string part of the url. You could encapsulate this into a ratings function.

url <- "https://www.imdb.com/title/tt0562992/?ref_=ttep_ep1"
link <- gsub("(\\?ref_=.*)", "ratings/?ref_=tt_ov_rt", url)
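
A minimal sketch of wrapping that substitution in the ratings function mentioned above (the name get_ratings_url() is just illustrative):

get_ratings_url <- function(url) {
  # swap whatever follows ?ref_= for the ratings sub-page
  gsub("(\\?ref_=.*)", "ratings/?ref_=tt_ov_rt", url)
}

get_ratings_url("https://www.imdb.com/title/tt0562992/?ref_=ttep_ep1")
# "https://www.imdb.com/title/tt0562992/ratings/?ref_=tt_ov_rt"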

