Convert HTML entities to Unicode and vice versa
You need to have BeautifulSoup (version 3, on Python 2) for this.
from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode. For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities. For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
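On Python 3, where cgi.escape has been removed, roughly the same round trip can be sketched with the standard library's html module alone. This is a minimal sketch, not the code from the answer above; the snake_case function names are made up for illustration.

```python
import html


def html_entities_to_unicode(text):
    """Decode named and numeric character references, e.g. '&amp;' -> '&'."""
    return html.unescape(text)


def unicode_to_html_entities(text):
    """Escape &, <, > and replace non-ASCII characters with numeric references."""
    # quote=False mirrors the default behaviour of the old cgi.escape.
    escaped = html.escape(text, quote=False)
    # 'xmlcharrefreplace' turns each non-ASCII character into &#NNNN;.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"
uni = html_entities_to_unicode(text)
htmlent = unicode_to_html_entities(uni)
print(uni)
print(htmlent)
```

Note that the encoded result uses numeric references (such as &#174;) rather than the original named entities, just like the Python 2 version.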
How to convert a string from unicode to html entity
That character is not special in HTML, so you can include it as-is in the output; just be sure to set the proper encoding of the document.
Note that to escape special characters in strings, you may use html.EscapeString(). But because ص is not special in HTML, it is left unchanged.
If for some reason you do need to escape it, you may simply use the decimal representation of the rune:
fmt.Println(html.EscapeString("ص"))
fmt.Printf("&#%d;", 'ص')
Outputs (try it on the Go Playground):
ص
&#1589;
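For comparison, the same idea can be sketched in Python: the standard escaping function leaves the character alone, and the code point from ord() gives the decimal reference. The variable names here are just for illustration.

```python
import html

ch = "ص"

# html.escape only rewrites &, <, > (and quotes), so this character is unchanged.
escaped = html.escape(ch)
print(escaped)

# ord() returns the code point; format it as a decimal numeric character reference.
ref = "&#%d;" % ord(ch)
print(ref)
```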
Convert character entities to their unicode equivalents
My first thought is, can your RSS reader accept the actual characters? If so, you can use HtmlDecode and feed it directly in.
If you do need to convert it to the numeric representations, you could parse out each entity, HtmlDecode it, and then cast it to an int to get the base-10 Unicode value. Then re-insert it into the string.
EDIT:
Here's some code to demonstrate what I mean (it is untested, but gets the idea across):
// Requires: using System.Text; using System.Web;
string input = "Something with &mdash; or other character entities.";
StringBuilder output = new StringBuilder(input.Length);
for (int i = 0; i < input.Length; i++)
{
    // Index of the ';' that closes the entity, or -1 if this is not the start of one.
    int endOfEntity = (input[i] == '&') ? input.IndexOf(';', i) : -1;
    if (endOfEntity > i)
    {
        int startOfEntity = i; // just for easier reading
        // Include the trailing ';' so HtmlDecode recognizes the entity.
        string entity = input.Substring(startOfEntity, endOfEntity - startOfEntity + 1);
        int unicodeNumber = (int)(HttpUtility.HtmlDecode(entity)[0]);
        output.Append("&#" + unicodeNumber + ";");
        i = endOfEntity; // continue parsing after the end of the entity
    }
    else
        output.Append(input[i]);
}
I may have an off-by-one error somewhere in there, but it should be close.
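The parse, decode, and re-insert approach described above can also be sketched in Python with a regular expression in place of the manual index arithmetic. This is only an illustration under my own assumptions: ENTITY_RE and to_numeric_references are made-up names, and unknown entity names are simply left as they are.

```python
import html
import re

# Matches named (&mdash;), decimal (&#8212;) and hex (&#x2014;) references.
ENTITY_RE = re.compile(r"&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def to_numeric_references(text):
    """Rewrite every recognized character reference as a decimal one."""
    def replace(match):
        decoded = html.unescape(match.group(0))
        if decoded == match.group(0):
            return decoded  # not a known entity: leave it untouched
        # A few entities decode to more than one code point, so emit one
        # reference per decoded character.
        return "".join("&#%d;" % ord(c) for c in decoded)

    return ENTITY_RE.sub(replace, text)


result = to_numeric_references("Something with &mdash; or other character entities.")
print(result)
```

Letting the regex find the terminating semicolon avoids the index bookkeeping where off-by-one mistakes tend to creep in.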
HTML Entity Decode
You could try something like:
var Title = $('<textarea />').html("Chris&#39; corner").text();
console.log(Title);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
How do I encode specific characters to HTML in python
I believe the solution to this post would give you what you need:
Convert HTML entities to Unicode and vice versa
Unicode encoding in python
You are looking for HTML entity decoding, not Unicode (or codec) decoding.
See Decode HTML entities in Python string? for ways to do this.