Convert HTML Character Entity Encoding in R

convert HTML Character Entity Encoding in R

Update: this answer is outdated. Please check the answer below based on the new xml2 pkg.


Try something along the lines of:

# load XML package
library(XML)

# Convenience function to convert html codes
html2txt <- function(str) {
xpathApply(htmlParse(str, asText=TRUE),
"//body//text()",
xmlValue)[[1]]
}

# html encoded string
( x <- paste("i", "s", "n", "&", "a", "p", "o", "s", ";", "t", sep = "") )
[1] "isn't"

# converted string
html2txt(x)
[1] "isn't"

UPDATE: Edited the html2txt() function so it applies to more situations

Convert HTML Entity to proper character R

Here's one way via the XML package:

txt <- "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"

library("XML")
xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])

> xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])
[1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"

The [[1]] bit is because getNodeSet() returns a list of parsed elements, even if there is only one element as is the case here.

This was taken/modified from a reply to the R-Help list by Henrique Dallazuanna in 2010.

If you want to run this for a character vector of length >1, then lapply() this:

txt <- rep(txt, 2)
decode <- function(x) {
xmlValue(getNodeSet(htmlParse(x, asText = TRUE), "//p")[[1]])
}
lapply(txt, decode)

or if you want it as a vector, vapply():

> vapply(txt, decode, character(1), USE.NAMES = FALSE)
[1] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"
[2] "wine/name: 2003 Karthäuserhof Eitelsbacher Karthäuserhofberg Riesling Kabinett"

For the multi-line example, use the original version, but you have to write the character vector back out to a file if you want it as a multiline document again:

txt <- "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg 
Riesling Spätlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!"

out <- xmlValue(getNodeSet(htmlParse(txt, asText = TRUE), "//p")[[1]])

This gives me

> out
[1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg \nRiesling Spätlese\nwine/wineId: 3058\nwine/variant: Riesling\nwine/year: 2001\nreview/points: N/A\nreview/time: 1095120000\nreview/userId: 1\nreview/userName: Eric\nreview/text: Hideously corked!"

Which if you write out using writeLines()

writeLines(out, "wines.txt")

You'll get a text file, which can be read in again using your other parsing code:

> readLines("wines.txt")
[1] "wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg "
[2] "Riesling Spätlese"
[3] "wine/wineId: 3058"
[4] "wine/variant: Riesling"
[5] "wine/year: 2001"
[6] "review/points: N/A"
[7] "review/time: 1095120000"
[8] "review/userId: 1"
[9] "review/userName: Eric"
[10] "review/text: Hideously corked!"

And it is a file (from my BASH terminal)

$ cat wines.txt 
wine/name: 2001 Karthäuserhof Eitelsbacher Karthäuserhofberg
Riesling Spätlese
wine/wineId: 3058
wine/variant: Riesling
wine/year: 2001
review/points: N/A
review/time: 1095120000
review/userId: 1
review/userName: Eric
review/text: Hideously corked!

Encode character to HTML in R, the CRAN way

Here is something quick (not thoroughly tested). It was inspired by another SO answer.

foo <- function(x) {
splitted <- strsplit(x, "")[[1]]
intvalues <- as.hexmode(utf8ToInt(enc2utf8(x)))
paste(paste0("&#x", intvalues, ";"), collapse = "")
}

all.equal(
foo("<abc at def.gh>"),
"<abc at def.gh>"
)
# [1] TRUE

Convert default html encoding to UTF-8 or latin1 in R

You can do the following, I have used stringi function and a custom function to convert html #& to unicode equivalent, a function called stri_trans_general from stringi helped me translate these unicode converted into english alphabets. I have taken the xml parser from this link on SO itself

library(stringi)
vector_cities = strsplit("Nova Lima,São Paulo,Contagem,Rio de Janeiro,Rio de Janeiro,São Paulo,Castanhal,Diadema,Rio de Janeiro,Rio Verde,Porto Alegre,Maurilândia,Samambaia,Rio de Janeiro,Passo Fundo,São Paulo,Casimiro de Abreu,Rio de Janeiro,Barueri,Santos,São Paulo,São Paulo,Goiânia,Pelotas,Rio de Janeiro", ",")

vector_cities <- vector_cities[[1]]

library(XML)

html_txt <- function(str) {
xpathApply(htmlParse(str, asText=TRUE),
"//body//text()",
xmlValue)[[1]]
}

##The html_txt can parse the ã etc chars to their respective UTF values which can further be taken by stringi functions to convert into english alphabets

x <- vector_cities
txt <- html_txt(x)
Encoding(txt) <- "UTF-8" #encoding to utf-8, It is optional you may avoid it
splt_txt <-strsplit(txt,split="\n")[[1]]
stringi::stri_trans_general(splt_txt, "latin-ascii")

Output:

 [1] "Nova Lima"         "Sao Paulo"        
[3] "Contagem" "Rio de Janeiro"
[5] "Rio de Janeiro" "Sao Paulo"
[7] "Castanhal" "Diadema"
[9] "Rio de Janeiro" "Rio Verde"
[11] "Porto Alegre" "Maurilandia"
[13] "Samambaia" "Rio de Janeiro"
[15] "Passo Fundo" "Sao Paulo"
[17] "Casimiro de Abreu" "Rio de Janeiro"
[19] "Barueri" "Santos"
[21] "Sao Paulo" "Sao Paulo"
[23] "Goiania" "Pelotas"
[25] "Rio de Janeiro"

HTML encode text in R

Here's a function which will encode non-ascii characters as HTML entities.

entity_encode <- function(x) {
cp <- utf8ToInt(x)
rr <- vector("character", length(cp))
ucp <- cp>128
rr[ucp] <- paste0("&#", as.character(cp[ucp]), ";")
rr[!ucp] <- sapply(cp[!ucp], function(z) rawToChar(as.raw(z)))
paste0(rr, collapse="")
}

This returns

[1] "If both #AvengersEndgame and #Joker are nominated for Best Picture, it will be Marvel vs DC for the first time in a Best Picture race. I think both films deserve the nod, but the Twitter discourse leading up to the ceremony will be 🔥 🔥 🔥"

for your input but those seem to be equivalent encodings.

Is there a way to convert HTML back to character in R

You can use gsub to remove the text that you don't want.

gsub("<a href=(.*?)>.*?</a>", "\\1 ",v1)

#[1] "This is the link https://google.com also there is another link, https://yahoo.com "

This removes everything except the link between every <a>..</a> tags.

Convert character to html in R

This question is pretty old but I couldn't find any straightforward answer... So I came up with this simple function which uses the numerical html codes and works for LATIN 1 - Supplement (integer values 161 to 255). There's probably (certainly?) a function in some package that does it more thoroughly, but what follows is probably good enough for many applications...

conv_latinsupp <- function(...) {
out <- character()
for (s in list(...)) {
splitted <- unlist(strsplit(s, ""))
intvalues <- utf8ToInt(enc2utf8(s))
pos_to_modify <- which(intvalues >=161 & intvalues <= 255)
splitted[pos_to_modify] <- paste0("�", intvalues[pos_to_modify], ";")
out <- c(out, paste0(splitted, collapse = ""))
}
out
}

conv_latinsupp("aeiou", "àéïôù12345")
## [1] "aeiou" "àéïôù12345"

Can R read html-encoded emoji characters?

tl;dr: the emoji aren't valid HTML entities; UTF-16 numbers have been used to build them instead of Unicode code points. I describe an algorithm at the bottom of the answer to convert them so that they are valid XML.


Identifying the Problem

R definitely handles emoji:

Sample Image

In fact, a few packages exist for handling emoji in R. For example, the emojifont and emo packages both let you retrieve emoji based on Slack-style keywords. It's just a question of getting your source characters through from the HTML-escaped format so that you can convert them.

xml2::read_xml seems to do fine with other HTML entities, like an ampersand or double quotes. I looked at this SO answer to see whether there were any XML-specific constraints on HTML entities, and it seemed like they were storing emoji fine. So I tried changing the emoji codes in your reprex to the ones in that answer:

body="Hug emoji: 😀😃"

And, sure enough, they were preserved (though they're obviously not the hug emoji anymore):

> test8 = read_html('Desktop/test.xml')
> test8 %>% xml_child() %>% xml_child() %>% xml_child() %>% xml_attr('body')
[1] "Hug emoji: \U0001f600\U0001f603"

I looked up the hug emoji on this page, and the decimal HTML entity given there is not ��. It looks like the UTF-16 decimal codes for the emoji have been wrapped in &# and ;.

In conclusion, I think the answer is that your emoji are, in fact, not valid HTML entities. If you can't control the source, you might need to do some pre-processing to account for these errors.

So, why does the browser convert them properly? I'm wondering if the browser is a little more flexible with these things and is making some guesses about what those codes could be. I'm just speculating, though.


Converting UTF-16 to Unicode code points

After some more investigation, it looks like valid emoji HTML entities use the Unicode code point (in decimal, if it's &#...;, or hex, if it's &#x...;). The Unicode code point is different from the UTF-8 or UTF-16 code. (That link explains a lot about how emoji and other characters are variously encoded, BTW! Good read.)

So we need to convert the UTF-16 codes used in your source data to Unicode code points. Referring to this Wikipedia article on UTF-16, I've verified how it's done. Each Unicode code point (our target) is a 20-bit number, or five hex digits. When going from Unicode to UTF-16, you split it up into two 10-bit numbers (the middle hex digit gets cut in half, with two of its bits going to each block), do some maths on them and get your result).

Going backwards, as you want to, it's done like this:

  • Your decimal UTF-16 number (which is in two separate blocks for now) is 55358 56599
  • Converting those blocks to hex (separately) gives 0x0d83e 0x0dd17
  • You subtract 0xd800 from the first block and 0xdc00 from the second to give 0x3e 0x117
  • Converting them to binary, padding them out to 10 bits and concatenating them, it's 0b0000 1111 1001 0001 0111
  • Then we convert that back to hex, which is 0x0f917
  • Finally, we add 0x10000, giving 0x1f917
  • Therefore, our (hex) HTML entity is 🤗. Or, in decimal, 🤗

So, to preprocess this dataset, you'll need to extract the existing numbers, use the algorithm above, then put the result back in (with one &#...;, not two).


Displaying emoji in R

As far as I'm aware, there's no solution to printing emoji in the R console: they always come out as "U0001f600" (or what have you). However, the packages I described above can help you plot emoji in some circumstances (I'm hoping to expand ggflags to display arbitrary full-colour emoji at some point). They can also help you search for emoji to get their codes, but they can't get names given the codes AFAIK. But maybe you could try importing the emoji list from emojilib into R and doing a join with your data frame, if you've extracted the emoji codes into a column, to get the English names.

Trying to convert the character encoding in a dataset

Two ideas come to mind:

1) you have a simple problem and are asking the wrong question

-- the final line in your Gist is

Encoding(tab$fiche_communale$Nom)
Are you actually wanting:

Encoding(tab$fiche_communale$name)

2) readHTMLTable may not be reading in the character encoding correctly, in which case, you could set it explicitly with Encoding(tab$fiche_communale$Nom) <- "latin-1"

3) try relying on iconv to detect the local encoding:

iconv(tab$fiche_communale$Nom, from="", to="UTF-8")



Related Topics



Leave a reply



Submit