How to Unescape HTML Entities in a String in Python 3.1

How do I unescape HTML entities in a string in Python 3.1?

You could use the function html.unescape:

In Python3.4+ (thanks to J.F. Sebastian for the update):

import html
html.unescape('Suzy &amp; John')
# 'Suzy & John'

html.unescape('&quot;')
# '"'
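As a quick check of what html.unescape covers, the sketch below runs it on a named entity, a decimal reference, and a hexadecimal reference. The sample strings are made up for illustration. It also shows that each call removes only one level of escaping, so double-escaped text needs a second pass.

```python
import html

# Named, decimal, and hexadecimal references all decode to the same character.
samples = ['Suzy &amp; John', 'Suzy &#38; John', 'Suzy &#x26; John']
decoded = [html.unescape(s) for s in samples]
print(decoded)

# Only one level of escaping is removed per call.
double_escaped = '&amp;lt;b&amp;gt;'
once = html.unescape(double_escaped)   # '&lt;b&gt;'
twice = html.unescape(once)            # '<b>'
print(once, twice)
```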

In Python3.3 or older:

import html.parser
html.parser.HTMLParser().unescape('Suzy &amp; John')

In Python2:

import HTMLParser
HTMLParser.HTMLParser().unescape('Suzy &amp; John')

Decode HTML entities in Python string?

Python 3.4+

Use html.unescape():

import html
print(html.unescape('&pound;682m'))

FYI html.parser.HTMLParser.unescape has been deprecated since Python 3.4 and was supposed to be removed in 3.5, although it was left in by mistake. It was finally removed in Python 3.9.


Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

  • For Python 2.6-2.7 it's in HTMLParser
  • For Python 3 it's in html.parser

>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
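If the same code has to run on both old and new interpreters, one possible approach is to try the imports from newest to oldest and bind whichever is found to a single name. This is only a sketch: the name unescape for the fallback is a choice made here, not something the standard library mandates.

```python
try:
    # Python 3.4+
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    # Bind the old instance method to the same name as the modern function.
    unescape = HTMLParser().unescape

print(unescape('&pound;682m'))
```

Trying html.unescape first matters because HTMLParser.unescape no longer exists on Python 3.9 and later.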

Properly unescape html characters

You can use the HTMLParser to unescape this. Something like:

print HTMLParser.HTMLParser().unescape(synopsis)
"Work Out New York" invites viewers to break a sweat with some of New York City’s hottest personal trainers. They may be friends, but these high-end fitness experts compete against each other to earn the business of wealthy patrons and celebrity clientele. With training techniques and fitness regimens constantly evolving, these trainers better shape up or risk losing their clients to their competitors. Romances, jealousies, and bitter rivalries provide the ultimate test of endurance for these fitness fanatics.

More details here: How do I unescape HTML entities in a string in Python 3.1?.

Strip HTML from strings in Python

I always used this function to strip HTML tags, as it requires only the Python stdlib:

For Python 3:

from io import StringIO
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        # Decode character references before they reach handle_data.
        self.convert_charrefs = True
        self.text = StringIO()

    def handle_data(self, d):
        # Called for text between tags; the tags themselves are ignored.
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
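To see what the stripper does with entities, here is a compact, self-contained variant of the Python 3 version together with a sample call. The input string is invented for illustration, and the extra close() call is an addition in this sketch so that any text still buffered by the parser is flushed.

```python
from io import StringIO
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # convert_charrefs=True makes the parser decode entities in text nodes.
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, data):
        # Only text between tags arrives here; tags are dropped.
        self.text.write(data)

    def get_data(self):
        return self.text.getvalue()

def strip_tags(markup):
    stripper = MLStripper()
    stripper.feed(markup)
    stripper.close()  # flush text the parser may still be holding back
    return stripper.get_data()

stripped = strip_tags('<p>Fish &amp; <b>chips</b></p>')
print(stripped)
```

So this approach both removes the tags and unescapes the entities in the remaining text.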

For Python 2:

from HTMLParser import HTMLParser
from StringIO import StringIO

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Python3 Convert all characters to HTML Entities

If you really want to escape every character, there is no built-in function for that, but you can replace each character with a numeric character reference built from its ordinal:

''.join('&#{};'.format(ord(x)) for x in string)
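A small sketch of that idea, wrapped in a function and checked with a round trip through html.unescape. The function name escape_all and the sample text are made up for this example.

```python
import html

def escape_all(text):
    # Turn every character, ASCII or not, into a decimal character reference.
    return ''.join('&#{};'.format(ord(ch)) for ch in text)

encoded = escape_all('é<a>')
print(encoded)

# html.unescape reverses the encoding.
restored = html.unescape(encoded)
print(restored)
```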

Python - Best way to detect accent HTML escape in a string?

It's not quite clear to me what you're asking, but here is my best try:

  1. &Ccedil; is an HTML escape, which you can unescape like so:

    >>> s = 'Rialta te Venice&Ccedil;'
    >>> import html
    >>> s2 = html.unescape(s); s2
    'Rialta te VeniceÇ'
  2. As you've said, there are libraries for normalizing/removing accents:

    >>> import unidecode
    >>> unidecode.unidecode(s2)
    'Rialta te VeniceC'

    You don't really need to check whether the string contains non-ASCII characters, as this function won't change unaccented ASCII characters. But you could check anyway using s2.isascii() (available in Python 3.7+).

So the complete solution is to use unidecode.unidecode(html.unescape(s)).
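If installing unidecode is not an option, a rough standard-library approximation is to decompose the text with unicodedata and drop the combining marks. This is a different technique from unidecode, not a replacement for it: it only handles characters that decompose into a base letter plus accents, and it leaves letters such as 'ø' or 'ß' untouched. The function name strip_accents is invented for this sketch.

```python
import html
import unicodedata

def strip_accents(text):
    # NFKD splits 'Ç' into 'C' followed by a combining cedilla.
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop the combining marks and keep the base characters.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

s = 'Rialta te Venice&Ccedil;'
result = strip_accents(html.unescape(s))
print(result)
```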

How do I perform HTML decoding/encoding using Python/Django?

Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))

To reverse this, the Cheetah function described in Jake's answer should work, but is missing the single-quote. This version includes an updated tuple, with the order of replacement reversed to avoid symmetric problems:

def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
        ("'", '&#39;'),
        ('"', '&quot;'),
        ('>', '&gt;'),
        ('<', '&lt;'),
        ('&', '&amp;')
    )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s

unescaped = html_decode(my_string)
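To illustrate why the ampersand has to be decoded last, the sketch below runs the same replacement loop with the table in both orders. The helper decode and the two table names are hypothetical, introduced only for this comparison; the input is what escaping the literal text '&lt;p&gt;' would produce.

```python
# Ampersand last: the order used for decoding.
CODES_AMP_LAST = (
    ("'", '&#39;'),
    ('"', '&quot;'),
    ('>', '&gt;'),
    ('<', '&lt;'),
    ('&', '&amp;'),
)
# Ampersand first: the same order the escaping used.
CODES_AMP_FIRST = tuple(reversed(CODES_AMP_LAST))

def decode(s, codes):
    # Replace each entity with its character, in table order.
    for char, entity in codes:
        s = s.replace(entity, char)
    return s

escaped = '&amp;lt;p&amp;gt;'  # escaped form of the literal text '&lt;p&gt;'
correct = decode(escaped, CODES_AMP_LAST)
over_decoded = decode(escaped, CODES_AMP_FIRST)
print(correct)
print(over_decoded)
```

With the ampersand decoded first, the freshly produced '&lt;' and '&gt;' get decoded a second time, which is the problem the reversed order avoids.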

This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library:

# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.0-3.8 (deprecated since 3.4, removed in 3.9):
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# >= Python 3.4:
from html import unescape
unescaped = unescape(my_string)

As a suggestion: it may make more sense to store the HTML unescaped in your database. It'd be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.

With Django, escaping only occurs during template rendering; so to prevent escaping you just tell the templating engine not to escape your string. To do that, use one of these options in your template:

{{ context_var|safe }}

{% autoescape off %}
    {{ context_var }}
{% endautoescape %}

Replace html entities with the corresponding utf-8 characters in Python 2.6

Python >= 3.4

Official documentation for HTMLParser: Python 3

>>> from html import unescape
>>> unescape('&copy; &euro;')
'© €'

Python < 3.5

Official documentation for HTMLParser: Python 3

>>> from html.parser import HTMLParser
>>> pars = HTMLParser()
>>> pars.unescape('&copy; &euro;')
'© €'

Note: this was deprecated in favor of html.unescape() and removed in Python 3.9.

Python 2.7

Official documentation for HTMLParser: Python 2.7

>>> import HTMLParser
>>> pars = HTMLParser.HTMLParser()
>>> pars.unescape('&copy; &euro;')
u'\xa9 \u20ac'
>>> print _
© €

I want to save HTML Entity (hex) from bs4 beautifulSoup object into a file

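Pass encoding='utf-8' when opening the file.

Example:

from bs4 import BeautifulSoup

# Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning.
a = BeautifulSoup('<p class="t5">₹ 10,000 or $ 133.46</p>', 'html.parser')

# filename is the path of the output file, defined elsewhere.
with open(filename, 'w', encoding='utf-8') as outfile:
    outfile.write(str(a))  # OR outfile.write(a.prettify())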

Output:

<p class="t5">₹ 10,000 or $ 133.46</p>
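The encoding step can be tried without BeautifulSoup. The sketch below decodes two hex entities with html.unescape, writes the result with encoding='utf-8', and reads it back. The file name price.html and the temporary directory are placeholders chosen for this example.

```python
import html
import os
import tempfile

# Hex character references for the rupee sign and the dollar sign.
text = html.unescape('&#x20B9; 10,000 or &#x24; 133.46')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'price.html')  # hypothetical output path
    # An explicit encoding avoids depending on the platform default.
    with open(path, 'w', encoding='utf-8') as outfile:
        outfile.write(text)
    with open(path, encoding='utf-8') as infile:
        round_tripped = infile.read()

print(round_tripped == text)
```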

How to decode Angular's custom HTML encoding with Python

Angular encodes transfer state using a special escape function located here:

export function escapeHtml(text: string): string {
  const escapedText: {[k: string]: string} = {
    '&': '&a;',
    '"': '&q;',
    '\'': '&s;',
    '<': '&l;',
    '>': '&g;',
  };
  return text.replace(/[&"'<>]/g, s => escapedText[s]);
}

export function unescapeHtml(text: string): string {
  const unescapedText: {[k: string]: string} = {
    '&a;': '&',
    '&q;': '"',
    '&s;': '\'',
    '&l;': '<',
    '&g;': '>',
  };
  return text.replace(/&[^;]+;/g, s => unescapedText[s]);
}

You can reproduce the unescapeHtml function in Python, and add html.unescape to resolve additional HTML entities:

import json
import requests
from bs4 import BeautifulSoup
import html

unescapedText = {
    '&a;': '&',
    '&q;': '"',
    '&s;': '\'',
    '&l;': '<',
    '&g;': '>',
}

def unescape(text):
    # Undo Angular's custom escapes, then resolve any standard HTML entities.
    for key, value in unescapedText.items():
        text = text.replace(key, value)
    return html.unescape(text)

url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {
    "id": "ng-lseg-state"
})
payload = json.loads(unescape(script.string))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&path=news-article"
article_body = payload[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
print(BeautifulSoup(article_body, "lxml").find_all("p"))

You were missing &s; and &a;.

repl.it: https://replit.com/@bertrandmartel/AngularTransferStateDecode
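A closer mirror of Angular's unescapeHtml is to do all five replacements in a single regex pass instead of one str.replace per code. The sketch below takes that approach as an alternative to the loop above; the function name unescape_transfer_state and the sample payload are made up, and it deliberately leaves out the extra html.unescape step and the network request.

```python
import json
import re

ANGULAR_UNESCAPES = {
    '&a;': '&',
    '&q;': '"',
    '&s;': "'",
    '&l;': '<',
    '&g;': '>',
}
# Match only the five codes Angular's escapeHtml can produce.
ANGULAR_CODE = re.compile(r'&[aqslg];')

def unescape_transfer_state(text):
    # Each code is replaced exactly once, so output is never re-scanned.
    return ANGULAR_CODE.sub(lambda m: ANGULAR_UNESCAPES[m.group(0)], text)

sample = '{&q;title&q;: &q;A &a; B&q;}'
payload = json.loads(unescape_transfer_state(sample))
print(payload)

# Text that originally contained the literal characters '&q;' survives intact.
literal = unescape_transfer_state('&a;q;')
print(literal)
```

A single pass means a decoded '&' cannot combine with the following characters to form a new code, which can happen when the replacements are applied one after another.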


