How to Close Unused Connections After Read_HTML in R

How do I close unused connections after read_html in R

I haven't found a good answer for this problem. The best work-around that I came up with is to include the function below, with Secs = 3 or 4. I still don't know why the problem occurs or how to stop it without building in a large delay.

CatchupPause <- function(Secs){
Sys.sleep(Secs) #pause to let connection work
closeAllConnections()
gc()
}
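
For context, here is a minimal sketch of how the function might be dropped into a scraping loop (the URLs are placeholders, and CatchupPause() is the function defined above):

library(rvest)

urls <- c("https://www.example.com/page1",
          "https://www.example.com/page2")   # placeholder URLs
pages <- vector("list", length(urls))

for (i in seq_along(urls)) {
  pages[[i]] <- read_html(urls[i])
  CatchupPause(3)   # the 3-4 second pause suggested above
}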

Webscraping with 'rvest' and code keeps stopping

You could catch the error using tryCatch and add NAs to your table in those cases:

library(rvest)

summary2 <- data.frame(matrix(nrow=0, ncol=4))
colnames(summary2) <- c("billnum", "sum", "type", "name_dis_part")
k <- c("0278", "0279", "0280")

for (i in k) {

## First scrape ##

# Sys.sleep(1) # uncomment if necessary

webpage <- read_html(paste0("https://www.hcdn.gob.ar/proyectos/textoCompleto.jsp?exp=", i, "-D-2014"))
billno <- html_nodes(webpage, 'h1')
billno_text <- html_text(billno)

billsum <- html_nodes(webpage, '.interno')
billsum_text <- html_text(billsum)

billsum_text <- gsub("\n", "", billsum_text)
billsum_text <- gsub("\t", "", billsum_text)
billsum_text <- gsub(" ", "", billsum_text)

## Second scrape ##

# Sys.sleep(1) # uncomment if necessary

link <- tryCatch(read_html(paste0("https://www.hcdn.gob.ar/proyectos/proyectoTP.jsp?exp=", i, "-D-2014")),
error = function(e) NA)

if (length(link) == 1 && is.na(link)) {

type_text <- NA
table_text <- NA

} else {

type <- html_nodes(link, 'h3')
type_text <- html_text(type)
table <- html_node(link, "table.table.table-bordered tbody")

table_text <- html_text(table)

table_text <- gsub("\n", "", table_text)
table_text <- gsub("\t", "", table_text)
table_text <- gsub("", "", table_text)

}

## Output ##

summary2[i, 1] <- billno_text
summary2[i, 2] <- billsum_text
summary2[i, 3] <- type_text
summary2[i, 4] <- table_text
}

Output:

tibble::as_tibble(summary2)
# A tibble: 3 × 4
billnum sum type name_…¹
<chr> <chr> <chr> <chr>
1 0278-D-2014 "0278-D-2014 ProyectoSu beneplácito por el reconocimiento que la revista científica Nature realizara a un g… " PR… "ASSEF…
2 0279-D-2014 "0279-D-2014 ProyectoSu Benplacito al conmemorarase el natalicio de el Dr. Joaquin V. Gonzalezel 6 de m… NA NA
3 0280-D-2014 "0280-D-2014 ProyectoLA HONORABLE CAMARA DE DIPUTADOS EXPRESA SU ADHESIÓN AL CONMEMORARSE EL 07 DE MARZO \"… " PR… "GRANA…
# … with abbreviated variable name ¹​name_dis_part

How do I scrape data with R when the same class is used repeatedly?

How difficult this is depends on what info you want from the page. I am working on the assumption that you want a dataframe/tibble detailing everything from PointOfCare (e.g. 1. Hospitals), through ServiceType (e.g. Publicly Funded / Free Services), all the way down to the actual listing details of each service.

There are two immediate problems to overcome if going for all the above info:

  1. The DOM is pretty flat, i.e. the PointOfCare info is at the same level of the DOM as ServiceType, and the start of the service listings is only one level deeper. This means there is no nice logical way to use an HTML parser to select parent nodes and then process their children, and still have the PointOfCare and ServiceType info mapped onto each service listing.
  2. Each listing has a differing number of child nodes holding a given service's info (those with className summaryRecordType), ranging between 3 and 5.

① To deal with the first problem I decide to convert the retrieved HTML to a string and split that string into chunks to process. I retrieve the PointOfCare labels and use those to generate the initial blocks settings_blocks:

all_text <- page %>% toString()
split_nodes <- page %>% html_nodes(xpath = '//*[@class="classyHeading"]/parent::div')
points_of_delivery <- map(split_nodes, point_of_delivery)
matches <- map(split_nodes, delivery_matches)
settings_blocks <- get_blocks(matches)

At this point I have e.g. 1. Hospitals in the first block, 2. Inpatient services in the second block and so on.

I then further split each of those chunks by the ServiceTypes:

service_types <- c("Publicly Funded / Free Services", "Private Practice Professionals and\r\nCommercial Businesses") 

Annoyingly, I had to hardcode the \r\n in the latter string, rather than retrieve it from the relevant node's HTML, because the line break was not otherwise present (and therefore no match was found for the split).

So 1. Hospitals, when processed, ends up with only a sub-chunk for Publicly Funded / Free Services, whereas 2. Inpatient services ends up split in two: Publicly Funded / Free Services and Private Practice Professionals and Commercial Businesses. This all happens in a loop over settings_blocks, shown after the short illustration below.
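
As a toy illustration of that splitting step (the block string here is made up; the real code works on the tidied HTML of each settings block):

library(stringr)

service_types <- c("Publicly Funded / Free Services",
                   "Private Practice Professionals and\r\nCommercial Businesses")

# a made-up chunk standing in for one element of settings_blocks
block <- paste(
  "2. Inpatient services ...",
  "Publicly Funded / Free Services ... listing A ...",
  "Private Practice Professionals and\r\nCommercial Businesses ... listing B ..."
)

# locate where each service-type heading starts (fixed() avoids regex interpretation),
# then cut the block at those positions
starts <- sort(unlist(lapply(service_types, function(s) str_locate(block, fixed(s))[, 1])))
ends   <- c(starts[-1] - 1, nchar(block))
substring(block, starts, ends)   # one sub-chunk per service type present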

for (i in seq_along(settings_blocks)) {
r <- r + 1
point_of_care <- points_of_delivery[[i]]
splits <- split_points(settings_blocks[[i]])
nodes_html <- tryCatch(final_blocks(splits, settings_blocks[[i]]), error = function(e) print(i))

There are a couple of sections with no listings e.g. 3.3 Drop-in centres; in those cases I generate a record as follows:

record <- list(
point_of_care, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
)

A full record has the following info fields:

list(point_of_care, service_type, Title, Url, BusinessName, ServiceDescription, Address, Tel, Website, AreaServed, Ages)

Once at the lowest level block, nodes_html[[j]], provided there are tables (each table is a listing) I retrieve the info for all the fields of interest:

records[[r]] <- map(tables, ~ record_from_table(.x, point_of_care, service_type))

② Now, we still have the issue of differing amounts of info in each listing table. However, it turns out one can map what info is present according to how many child nodes with className summaryRecordType are present. The mapping is as follows:

| Child nodes count | BusinessName | ServiceDescription | Address | Tel | Website | AreaServed |
|-------------------|--------------|--------------------|---------|-----|---------|------------|
| 3 | | | 1 | 2 | | 3 |
| 4 | 1 | | 2 | 3 | | 4 |
| 5 | | 1 | 2 | 3 | 4 | 5 |

From column 2 onwards, the number indicates which node holds the info indicated by the column header. As I loop each bottom level chunk I have a helper function that applies this mapping when retrieving the listing info:

record_from_table <- function(table, point_of_care, service_type) {
info_lines <- table %>% html_nodes(".summaryRecordType")
Title <- table %>%
html_node("a") %>%
html_text() %>%
trimws()
Url <- table %>%
html_node("a") %>%
html_attr("href") %>%
url_absolute(link)
if (length(info_lines) == 3) {
BusinessName <- NA_character_
ServiceDescription <- NA_character_
Address <- info_lines[1] %>%
html_text() %>%
trimws() # etc.........

I pass in PointOfCare and ServiceType so they are mapped to the record level. By the end of

for (i in seq_along(settings_blocks)) {...}

I have a list of records/listings. I then do some tidying of the records. I return tibbles, so I can later use map_dfr to generate my final dataframe structure:

records <- unlist(records, recursive = FALSE) %>% map(clean_record)
listings <- map_dfr(records, unlist)

With the final dataframe structure in place and populated I set about tidying up some other things I noticed:


① During my final_blocks function the encoding of UTF-8 input strings was getting garbled.

For example, the following correctly UTF-8 encoded string (on Windows OS):

Autisme-Asperger-Québec (AAQc)

Ended up as:

Autisme-Asperger-QuÃ©bec (AAQc)

A colleague pointed out that it was actually tidy_html() at fault, and that this was particular to Windows OS (it ran fine on Linux) due to the default encoding on Windows. The mangling is called Mojibake. He pointed me to the following links for further reading:

  • https://en.wikipedia.org/wiki/Mojibake

  • https://www.weblogism.com/item/270/why-does-e-become-a

To quote only a small part of the latter link:

The reason lies in the UTF-8 representation. Characters below or equal to 127 (0x7F) are represented with 1 byte only, and this is equivalent to the ASCII value. Characters below or equal to 2047 are written on two bytes of the form 110yyyyy 10xxxxxx, where the scalar representation of the character is: 00000yyyyyxxxxxx

“é” is U+00E9 (LATIN SMALL LETTER E WITH ACUTE), which in binary representation is: 00000000 11101001. “é” (233) is therefore above 127 and below or equal to 2047, so it will be coded on 2 bytes. Therefore its UTF-8 representation is 11000011 10101001.

Now let’s imagine that this “é” sits in a document that’s believed to
be latin-1, and we want to convert it to UTF-8. iso-8859-1 characters
are coded on 8 bits, so the 2-byte character “é” will become 2
1-byte-long latin-1 characters. The first character is 11000011, i.e.
C3, which, when checking the table, corresponds to “Ã” (U+00C3); the
second one is 10101001, i.e. A9, which corresponds to “©” (U+00A9).

The colleague pointed out that I could fix this by converting from UTF-8 to latin1 twice, because the UTF-8 characters had been encoded as UTF-8 a second time:

iconv(iconv(<mangled_string>, from = "UTF-8", to = "latin1"), "UTF-8", "latin1")
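
To make that concrete, here is a minimal demonstration (it assumes the text really went through exactly one extra round of UTF-8 encoding, as described above):

# "é" has been turned into the two characters "Ã" and "©", i.e. its UTF-8 bytes
# were re-encoded as UTF-8 a second time
mangled <- "Autisme-Asperger-Qu\u00c3\u00a9bec (AAQc)"

repaired <- iconv(iconv(mangled, from = "UTF-8", to = "latin1"),
                  from = "UTF-8", to = "latin1")
repaired   # should display as "Autisme-Asperger-Québec (AAQc)"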

I had introduced tidy_html to ensure sliced text ended up being parsable.


② I chose not to try and fix the mangled strings as per the description above. Instead, as my final dataframe provided the skeleton for where all my data resided, I simply went back to the original HTML document and parsed out the info again (in UTF-8), then mapped it onto my dataframe. This had the added benefit of preserving the spacing between certain words, as well as line breaks.

titles <- page %>%
html_nodes(".emhTip a:nth-of-type(1)") %>%
html_text()
descriptions <- page %>%
html_nodes(".emhTip + .summaryRecordType") %>%
html_text() %>%
trimws()
mixed_nodes <- page %>%
html_nodes(".summaryTitlePrivatePractice > div:nth-child(2)") %>%
html_text() %>%
trimws()

r <- r1 <- 0

# over-write existing values with tidier properly encoded strings
for (i in seq_along(listings$Title)) {
if (!is.na(listings$Title[i])) {
r <- r + 1
listings$Title[i] <- titles[r]
if (!is.na(listings$BusinessName[i])) {
listings$BusinessName[i] <- mixed_nodes[r]
}
}
if (!is.na(listings$ServiceDescription[i])) {
r1 <- r1 + 1
listings$ServiceDescription[i] <- descriptions[r1]
}
}

Last, but not least, I noticed that some service descriptions had a ...more link in the listing, where an additional XHR request would be required to gather the full description. In case you want to obtain the full descriptions, I provide a helper function to retrieve them:

expanded_descriptions <- map2(listings$ServiceDescription, listings$Url, ~ full_description(.x, .y)) %>% unlist()

listings$ServiceDescription <- expanded_descriptions

Now, that did slow the run time, as I needed to add some delays to ensure connections were opened and closed properly.


The full code is below, including a couple of attributions where I borrowed a few lines from other SO contributors.


R:

library(stringr)
library(rvest)
library(htmltidy)
library(tidyverse)

point_of_delivery <- function(node) {
pod <- node %>%
html_node(".classyHeading") %>%
html_text() %>%
str_split("\n") %>%
unlist() %>%
tail(1) %>%
trimws() %>%
str_replace("\xa0", " ")
return(pod)
}

delivery_matches <- function(node) {
dm <- node %>%
html_node(".classyHeading") %>%
html_text() %>%
str_split("\n") %>%
unlist() %>%
tail(1)
return(dm)
}

get_blocks <- function(a_list) {
results <- vector("list", length(a_list))
for (i in seq_along(a_list)) {
start_pos <- str_locate(all_text, gsub("\\)", "\\\\)", gsub("\\(", "\\\\(", a_list[i])))[, 1]
if (i == length(a_list)) {
block <- substring(all_text, start_pos, nchar(all_text)) %>% tidy_html()
} else {
next_start <- str_locate(all_text, gsub("\\)", "\\\\)", gsub("\\(", "\\\\(", a_list[i + 1])))[, 1]
block <- substring(all_text, start_pos, next_start) %>% tidy_html()
}
results[[i]] <- block
}
return(results)
}

split_points <- function(node) {
res <- map(service_types, ~ str_locate_all(node %>% toString(), .)) %>% unlist()
if (length(res) == 0) {
return(c(NA_integer_))
} else {
return(res[seq(1, length(res), 2)]) # https://stackoverflow.com/a/34100009/6241235 @stas g
}
}

final_blocks <- function(splits, block) {
results <- vector("list", length(splits))
if (length(splits) == 1) {
res <- ifelse(is.na(splits), splits, block %>% tidy_html())
} else {
for (i in seq_along(splits)) {
start_pos <- splits[i]
if (i == length(splits)) {
res <- substring(block, start_pos, nchar(block)) %>% tidy_html()
} else {
next_start <- splits[i + 1]
res <- substring(block, start_pos, next_start) %>% tidy_html()
}
results[i] <- res
}
return(results)
}
}

record_from_table <- function(table, point_of_care, service_type) {
info_lines <- table %>% html_nodes(".summaryRecordType")
Title <- table %>%
html_node("a") %>%
html_text() %>%
trimws()
Url <- table %>%
html_node("a") %>%
html_attr("href") %>%
url_absolute(link)
if (length(info_lines) == 3) {
BusinessName <- NA_character_
ServiceDescription <- NA_character_
Address <- info_lines[1] %>%
html_text() %>%
trimws()
Tel <- info_lines[2] %>%
html_text() %>%
trimws()
Website <- NA_character_
AreaServed <- info_lines[3] %>%
html_text() %>%
trimws()
} else if (length(info_lines) == 4) {
BusinessName <- info_lines[1] %>%
html_text() %>%
trimws()
ServiceDescription <- NA_character_
Address <- info_lines[2] %>%
html_text() %>%
trimws()
Tel <- info_lines[3] %>%
html_text() %>%
trimws()
Website <- NA_character_
AreaServed <- info_lines[4] %>%
html_text() %>%
trimws()
} else {
BusinessName <- NA_character_
ServiceDescription <- info_lines[1] %>%
html_text() %>%
trimws()
Address <- info_lines[2] %>%
html_text() %>%
trimws()
Tel <- info_lines[3] %>%
html_text() %>%
trimws()
Website <- info_lines[4] %>%
html_text() %>%
trimws()
AreaServed <- info_lines[5] %>%
html_text() %>%
trimws()
}
Ages <- get_age(table)
return(list(point_of_care, service_type, Title, Url, BusinessName, ServiceDescription, Address, Tel, Website, AreaServed, Ages))
}

get_age <- function(table) {
tryCatch(table %>% html_node(".summaryTitlePrivatePractice + td") %>%
html_text() %>% str_replace("Add to Info Cart", "") %>% trimws(), error = function(e) {
return(NA_character_)
})
}

clean_record <- function(a_record) {
a_record[[7]] <- str_replace(a_record[[7]], " Map", "")
a_record[[10]] <- str_replace(a_record[[10]], "Area[s]? Served: ", "")
a_record <- set_names(a_record, c("PointOfCare", "ServiceType", "Title", "Url", "BusinessName", "ServiceDescription", "Address", "Tel", "Website", "AreaServed", "Ages"))
return(a_record %>% as_tibble())
}

full_description <- function(current_description, current_url) {
if (grepl(" \\.\\.\\.", current_description)) {
content <- read_html(current_url, encoding = "UTF-8") %>%
html_node(".recordSummary") %>%
html_text() %>%
trimws()
CatchupPause(.1)
} else {
content <- gsub("\\s+more", "", current_description) %>% trimws()
}
return(content)
}

CatchupPause <- function(Secs) { # https://stackoverflow.com/a/52758758 @nm200
Sys.sleep(Secs) # pause to let connection work
closeAllConnections()
gc()
}

link <- "https://www.ementalhealth.ca/Winnipeg-Regional-Health-Authority/Mental-Health-Facilities/index.php?m=heading&ID=229&recordType=1&sortBy=0"
page <- read_html(link, encoding = "UTF-8")
all_text <- page %>% toString()
split_nodes <- page %>% html_nodes(xpath = '//*[@class="classyHeading"]/parent::div')
points_of_delivery <- map(split_nodes, point_of_delivery)
matches <- map(split_nodes, delivery_matches)
settings_blocks <- get_blocks(matches)
service_types <- c("Publicly Funded / Free Services", "Private Practice Professionals and\r\nCommercial Businesses") # annoying have to hardcode as \r\n not present in node output

records <- vector("list", 1000) # > max expected num entries when lists unnested
r <- 0

# Generate all records for the final tibble
for (i in seq_along(settings_blocks)) {
r <- r + 1
point_of_care <- points_of_delivery[[i]]
splits <- split_points(settings_blocks[[i]])
nodes_html <- tryCatch(final_blocks(splits, settings_blocks[[i]]), error = function(e) print(i))

if (is.na(nodes_html)[1]) {
record <- list(
point_of_care, NA_character_, NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
)
records[[r]] <- list(record)
} else {
for (j in seq_along(nodes_html)) {
service_type <- if_else(str_detect(nodes_html[[j]], service_types[1]), service_types[1], service_types[2])
tables <- nodes_html[[j]] %>%
read_html() %>%
html_nodes(".condensedViewTable")
records[[r]] <- map(tables, ~ record_from_table(.x, point_of_care, service_type))
r <- r + 1
}
}
}

records <- unlist(records, recursive = FALSE) %>% map(clean_record)
listings <- map_dfr(records, unlist)
#

## Partly due to default Windows encoding, and lack of UTF-8 support in R, causing Mojibake via earlier tidy_html(), we grab the properly encoded info
## to overwrite the mangled text |text lacking spaces
titles <- page %>%
html_nodes(".emhTip a:nth-of-type(1)") %>%
html_text()
descriptions <- page %>%
html_nodes(".emhTip + .summaryRecordType") %>%
html_text() %>%
trimws()
mixed_nodes <- page %>%
html_nodes(".summaryTitlePrivatePractice > div:nth-child(2)") %>%
html_text() %>%
trimws()

r <- r1 <- 0

# over-write existing values with tidier properly encoded strings
for (i in seq_along(listings$Title)) {
if (!is.na(listings$Title[i])) {
r <- r + 1
listings$Title[i] <- titles[r]
if (!is.na(listings$BusinessName[i])) {
listings$BusinessName[i] <- mixed_nodes[r]
}
}
if (!is.na(listings$ServiceDescription[i])) {
r1 <- r1 + 1
listings$ServiceDescription[i] <- descriptions[r1]
}
}

# descriptions_to_expand <- dplyr::filter(listings, grepl(" \\.\\.\\.", ServiceDescription))
expanded_descriptions <- map2(listings$ServiceDescription, listings$Url, ~ full_description(.x, .y)) %>% unlist()
listings$ServiceDescription <- expanded_descriptions
write.csv(listings, "~/data.csv", na = "")

Some example rows of output:

[Screenshot: sample rows of the final listings output]

Iterating rvest scrape function gives: Error in open.connection(x, "rb") : Timeout was reached

With large scraping tasks I would usually do a for-loop, which helps with troubleshooting. Create an empty list for your output:

d <- vector("list", length(links))

Here I do a for-loop, with a tryCatch block so that if the output is an error, we wait a couple of seconds and try again. We also include a counter that moves on to the next link if we're still getting an error after five attempts. In addition, we have if (!(links[i] %in% names(d))) so that if we have to break the loop, we can skip the links we've already scraped when we restart the loop.

for (i in seq_along(links)) {
if (!(links[i] %in% names(d))) {
cat(paste("Doing", links[i], "..."))
ok <- FALSE
counter <- 0
while (ok == FALSE & counter <= 5) {
counter <- counter + 1
out <- tryCatch({
scrape_test(links[i])
},
error = function(e) {
Sys.sleep(2)
e
}
)
if ("error" %in% class(out)) {
cat(".")
} else {
ok <- TRUE
cat(" Done.")
}
}
cat("\n")
d[[i]] <- out
names(d)[i] <- links[i]
}
}
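
scrape_test() and links come from the question itself; if you want to try the retry pattern in isolation, a hypothetical stand-in (purely illustrative) could look like this:

library(rvest)

# hypothetical stand-in for the question's scrape_test()
scrape_test <- function(url) {
  page <- read_html(url)
  data.frame(url   = url,
             title = html_text(html_node(page, "title")),
             stringsAsFactors = FALSE)
}

links <- c("https://www.example.com", "https://www.example.org")   # placeholder URLs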

