Convert HTML Entities to Unicode and Vice Versa

Convert HTML entities to Unicode and vice versa

You need to have BeautifulSoup installed. The code below uses the BeautifulSoup 3 API and Python 2 syntax.

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode. For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities. For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
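
The snippet above relies on Python 2 and BeautifulSoup 3. On Python 3 the same round trip can probably be done with the standard library alone. The following is a minimal sketch, assuming `html.unescape` and `html.escape` are acceptable stand-ins for `BeautifulStoneSoup` and `cgi.escape`; the function names are illustrative and not from the original answer.

```python
import html


def html_entities_to_unicode(text):
    """Decode named and numeric character references, e.g. '&amp;' -> '&'."""
    return html.unescape(text)


def unicode_to_html_entities(text):
    """Escape &, < and >, then replace non-ASCII characters with numeric references."""
    # quote=False mirrors the default behaviour of the old cgi.escape().
    escaped = html.escape(text, quote=False)
    # 'xmlcharrefreplace' turns every non-ASCII character into '&#NNN;'.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = html_entities_to_unicode(text)
htmlent = unicode_to_html_entities(uni)

print(uni)
print(htmlent)
```

Note that the encoding step produces numeric references (such as `&#174;`) rather than named ones (such as `&reg;`), so the round trip is equivalent in a browser but not byte-for-byte identical to the input.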

How to convert a string from unicode to html entity

The character ص is not special in HTML, so you can include it as-is in the output. Just be sure to declare the proper encoding for the document.

Note that to escape special characters in strings, you may use html.EscapeString(). But because ص is not special in HTML, it is left unchanged.

If for some reason you do need to escape it, you may simply use the decimal representation of the rune:

fmt.Println(html.EscapeString("ص"))
fmt.Printf("&#%d;", 'ص')

Output:

ص
&#1589;
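
For comparison, the same idea can be sketched in Python (used here instead of Go only to keep the added examples in one language). The helper name `to_numeric_reference` is hypothetical; `ord` plays the role of Go's `%d` formatting of a rune.

```python
import html


def to_numeric_reference(ch):
    """Return the decimal numeric character reference for a single character."""
    return "&#%d;" % ord(ch)


# html.escape leaves the character alone because it is not special in HTML.
unchanged = html.escape("ص")

# Two ways to force a numeric reference: format the code point directly,
# or let the 'xmlcharrefreplace' error handler do it for every non-ASCII character.
manual = to_numeric_reference("ص")
via_codec = "ص".encode("ascii", "xmlcharrefreplace").decode("ascii")

print(unchanged)
print(manual)
print(via_codec)
```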

Convert character entities to their unicode equivalents

My first thought is, can your RSS reader accept the actual characters? If so, you can use HtmlDecode and feed it directly in.

If you do need to convert it to the numeric representations, you could parse out each entity, HtmlDecode it, and then cast it to an int to get the base-10 unicode value. Then re-insert it into the string.

EDIT:
Here's some code to demonstrate what I mean (it is untested, but gets the idea across):

string input = "Something with &mdash; or other character entities.";
StringBuilder output = new StringBuilder(input.Length);

// Assumes every '&' in the input starts an entity that is terminated by ';'.
for (int i = 0; i < input.Length; i++)
{
    if (input[i] == '&')
    {
        int startOfEntity = i; // just for easier reading
        int endOfEntity = input.IndexOf(';', startOfEntity);
        // +1 so the substring includes the closing ';', otherwise HtmlDecode will not recognize the entity
        string entity = input.Substring(startOfEntity, endOfEntity - startOfEntity + 1);
        int unicodeNumber = (int)(HttpUtility.HtmlDecode(entity)[0]);
        output.Append("&#" + unicodeNumber + ";");
        i = endOfEntity; // continue parsing after the end of the entity
    }
    else
        output.Append(input[i]);
}

I may have an off-by-one error somewhere in there, but it should be close.
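
The same approach (find each named entity, decode it, and re-insert its base-10 code point) can be sketched in Python. This is an illustration under assumptions rather than a port of the code above: the function name `named_to_numeric` is made up, and the lookup uses the standard `html.entities.html5` table instead of `HtmlDecode`. Unknown entities and stray ampersands are left as they are.

```python
import re
from html.entities import html5

# Matches a named entity such as '&mdash;' and captures the name.
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Replace named entities with decimal numeric character references."""

    def replace(match):
        # The html5 table stores names with their trailing ';'.
        chars = html5.get(match.group(1) + ";")
        if chars is None:
            return match.group(0)  # not a known entity: leave it as-is
        # A few entities expand to more than one code point, so emit one reference per character.
        return "".join("&#%d;" % ord(c) for c in chars)

    return _NAMED_ENTITY.sub(replace, text)


result = named_to_numeric("Something with &mdash; or other character entities.")
print(result)
```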

HTML Entity Decode

You could try something like:

var Title = $('<textarea />').html("Chris&#39; corner").text();
console.log(Title);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>

How do I encode specific characters to HTML in python

I believe the solution to this post would give you what you need:
Convert HTML entities to Unicode and vice versa

Unicode encoding in python

You are looking for HTML entity decoding, not Unicode (or codec) decoding.

See Decode HTML entities in Python string? for ways to do this.
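
To make the distinction concrete, here is a small sketch with a made-up sample string. Codec decoding turns bytes into text and leaves entities untouched; entity decoding is a separate step, done here with `html.unescape` from the Python 3 standard library.

```python
import html

# Hypothetical input: UTF-8 bytes that also contain HTML entities.
raw = b"caf\xc3\xa9 &amp; cr&egrave;me"

# Codec decoding: bytes -> str. The entities are still there afterwards.
text = raw.decode("utf-8")

# Entity decoding: '&amp;' -> '&', '&egrave;' -> 'è'.
decoded = html.unescape(text)

print(text)
print(decoded)
```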


