How to Extract Text from R's Help Command

How can I extract text from R's help command?

help() itself doesn't return anything useful. To get the help text, you can read the contents of a package's help database and parse it.

extract_help <- function(pkg, fn = NULL, to = c("txt", "html", "latex", "ex"))
{
  to <- match.arg(to)
  #locate the package's Rd database and fetch the parsed Rd object(s)
  rdbfile <- file.path(find.package(pkg), "help", pkg)
  rdb <- tools:::fetchRdDB(rdbfile, key = fn)
  #pick the converter for the requested output format
  convertor <- switch(to,
    txt   = tools::Rd2txt,
    html  = tools::Rd2HTML,
    latex = tools::Rd2latex,
    ex    = tools::Rd2ex
  )
  f <- function(x) capture.output(convertor(x))
  #with no fn, rdb is a list of Rd objects, so convert each one
  if(is.null(fn)) lapply(rdb, f) else f(rdb)
}

pkg is a character string giving the name of a package.

fn is a character string giving the name of a function within that package. If it is left as NULL, the help for every function in that package is returned.

to gives the output format: txt, html, latex, or ex (the last extracts the code from the Examples section).

Example usage:

#Everything in utils
extract_help("utils")

#just one function
extract_help("utils", "browseURL")

#convert to html instead
extract_help("utils", "browseURL", "html")

#a non-base package
extract_help("plyr")

How to write contents of help to a file from within R?

The two functions you need are tools:::Rd2txt and utils:::.getHelpFile. Combined as below, they print the help file to the console; you may need to adjust the arguments to get the output written to a file in the way you want.

For example:

#assumes a help topic named survey is available (e.g. from the survey package)
hs <- help(survey)
tools:::Rd2txt(utils:::.getHelpFile(as.character(hs)))
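
Rd2txt also takes an out argument (a filename or connection), so one way to write straight to a file is a sketch like this, again assuming the survey topic is available:

hs <- help(survey)
tools:::Rd2txt(utils:::.getHelpFile(as.character(hs)), out = "survey_help.txt")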

Since utils:::.getHelpFile isn't exported (note the triple colon), I would not recommend you rely on it for any production code. It would be better to use these functions as a guide to create your own stable implementation.

In R, can I get the help text for a function into a variable?

I think help.search could be of use. For instance, if I wanted everything in the base package:

x <- help.search("*", package = "base")
entries <- data.frame(entry = x$matches$Entry, title = x$matches$Title)
entries[c(1, 100, 1000), ]
#          entry                               title
# 1            +                Arithmetic Operators
# 100  c.POSIXlt                   Date-Time Classes
# 1000  encoding Functions to Manipulate Connections
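
That only gives you the titles, though. To get the full help text for one function into a variable, you can combine capture.output with the Rd tools shown earlier; a minimal sketch:

#character vector holding the rendered help page for mean()
help_text <- capture.output(
  tools:::Rd2txt(utils:::.getHelpFile(as.character(help("mean"))))
)
head(help_text)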

R command to extract text between two strings containing curly braces

Use the following regex, which captures everything between title={ and the matching }:

a2 <- "@article{2020, title={Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR-421 and E-cadherin}, volume={9}, ISSN={2045-7634}, url={http://dx.doi.org/10.1002/cam4.3002}, DOI={10.1002/cam4.3002}, number={11}, journal={Cancer Medicine}, publisher={Wiley}, author={Ji, Yefeng and Feng, Guanying and Hou, Yunwen and Yu, Yang and Wang, Ruixia and Yuan, Hua}, year={2020}, month={Apr}, pages={3954–3963} }"

sub("^.*title=\\{([^{}]+)\\}.*$", "\\1", a2)
#> [1] "Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR-421 and E-cadherin"

Created on 2022-03-19 by the reprex package (v2.0.1)



Edit

An alternative using stringr:

stringr::str_match(a2, "^.*title=\\{([^{}]+)\\}.*$")[,2]
#> [1] "Long noncoding RNA MEG3 decreases the growth of head and neck squamous cell carcinoma by regulating the expression of miR-421 and E-cadherin"

Created on 2022-03-19 by the reprex package (v2.0.1)
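
The same capture-group idea generalizes to other BibTeX fields. A small helper (the name extract_field is made up for illustration); note the field name is matched literally and case-sensitively:

extract_field <- function(x, field) {
  sub(sprintf("^.*%s=\\{([^{}]+)\\}.*$", field), "\\1", x)
}

extract_field(a2, "journal")
#> [1] "Cancer Medicine"
extract_field(a2, "year")
#> [1] "2020"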

Extract text with gsub

We can use str_extract from stringr:

library(stringr)
str_extract(df$column.with.new.names, "KB_*\\d+[_ ]*[^_]*")
#[1] "KB_1813_B" "KB1720_1" "KB1810 mat"

Or the same pattern can be captured as a group with sub:

sub(".*(KB_*\\d+[_ ]*[^_]*).*", "\\1", df$column.with.new.names)
#[1] "KB_1813_B" "KB1720_1" "KB1810 mat"

data

df <- data.frame(
  column.with.new.names = c(
    "Baseline/Cell_Line_2_KB_1813_B_Baseline",
    "Dose 0001/Cell_Line_3_KB1720_1_0001",
    "Dose 0010/Cell_Line_1_KB1810 mat_0010"
  ),
  stringsAsFactors = FALSE
)

Extract text between specific strings in a URL

Split the string on "/" and pull the third- and second-to-last elements:

url = "https://www.somewebsiteLink.com/someDirectory/Directory/ascensor/163235494/d"
url2 = "https://www.somewebsiteLink.com/someDirectory/Directory/aire-acondicionado-calefaccion-ascensor/45837493/d"
urls = c(url, url2)

pieces = strsplit(urls, split = "/")
result = lapply(pieces, \(x) x[length(x) - 2:1])
## for older R versions:
# result = lapply(pieces, function(x) x[length(x) - 2:1])

result
# [[1]]
# [1] "ascensor" "163235494"
#
# [[2]]
# [1] "aire-acondicionado-calefaccion-ascensor" "45837493"

Extract text from search result URLs using R

This is the basic idea of how to go about scraping these pages, though it may be slow in R if there are many pages to scrape.
Your question is a bit ambiguous: you want the end results to be .txt files, but some of the pages may be PDFs. You can still use this code and change the file extension to .pdf for those pages (see the sketch at the end of this answer).

library(xml2)
library(rvest)

urll <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

urll %>%
  read_html() %>%
  html_nodes("div#results a") %>%
  html_attr("href") %>%
  .[!duplicated(.)] %>%
  lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
  Map(function(x, y) write_html(x, tempfile(y, fileext = ".txt"), options = "format"),
      ., paste("tmp", 1:length(.)))

Here is a breakdown of the code above.
The URL you want to scrape from:

urll <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

Get all the URLs that you need:

allurls <- urll %>%
  read_html() %>%
  html_nodes("div#results a") %>%
  html_attr("href") %>%
  .[!duplicated(.)]

Decide where you want to save the text files, and create the temp files:

tmps <- tempfile(paste("tmp", 1:length(allurls)), fileext = ".txt")

At this point, allurls is a character vector. Each URL has to be parsed into an XML document before it can be scraped; then the results are written into the temp files created above:

allurls %>%
  lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
  Map(function(x, y) write_html(x, y, options = "format"), ., tmps)

Copy the code exactly: the lone period after write_html(x, y, options = "format"), for example, is the magrittr placeholder for the piped-in data and must not be dropped.
The files are written to the session's temporary directory; type tempdir() at the console to see where they are. You can also change the location by passing a different directory to tempfile().
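
If some of the result links point to PDFs rather than HTML pages, scraping the body tag won't help for those. A rough sketch of one way to branch on the link's extension (untested against this particular site, and assuming PDF links end in .pdf):

for (u in allurls) {
  if (grepl("\\.pdf$", u, ignore.case = TRUE)) {
    #binary download for PDF documents
    download.file(u, tempfile(fileext = ".pdf"), mode = "wb")
  } else {
    #scrape and save the page body as before
    u %>% read_html() %>% html_nodes("body") %>%
      write_html(tempfile(fileext = ".txt"), options = "format")
  }
}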

Hope this helps.


