Python Code to Remove HTML Tags from a String

Strip HTML from strings in Python

I always used this function to strip HTML tags, as it requires only the Python stdlib:

For Python 3:

from io import StringIO
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.reset()
        self.strict = False
        self.convert_charrefs = True
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
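
As a quick check of the Python 3 version, here is a minimal self-contained sketch. The sample string and the explicit close() call are my additions, not part of the original answer; close() asks the parser to process any text it may still be buffering.

```python
from html.parser import HTMLParser
from io import StringIO


class MLStripper(HTMLParser):
    """Collects only the text nodes seen by the parser."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before
        # they reach handle_data.
        super().__init__(convert_charrefs=True)
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # assumption: added here to flush any buffered trailing text
    return s.get_data()


# Hypothetical sample input, used only for illustration.
sample = "<p>Hello, <b>world</b> &amp; friends</p>"
result = strip_tags(sample)
print(result)
```

Note that this approach keeps every text node, so the contents of script and style elements would also end up in the result.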

For Python 2:

from HTMLParser import HTMLParser
from StringIO import StringIO

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.text = StringIO()

    def handle_data(self, d):
        self.text.write(d)

    def get_data(self):
        return self.text.getvalue()

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Python, remove all html tags from string

You could use get_text():

for i in content:
    print(i.get_text())

Example below is from the docs:

>>> markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> soup.get_text()
'\nI linked to example.com\n'
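
get_text() also accepts a separator and a strip flag, which helps when the raw result carries stray newlines. A small sketch reusing the markup from the docs example; the variable names are mine:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# Default behaviour: text nodes are concatenated as-is, newlines included.
raw_text = soup.get_text()

# strip=True trims each text node and skips empty ones;
# the first argument is the string placed between the pieces.
joined = soup.get_text("|", strip=True)

print(repr(raw_text))
print(joined)
```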

How do I remove HTML tags from a list of strings that contain the same HTML tags?

You can create a for-loop and call .get_text() from it:

import requests
from bs4 import BeautifulSoup

URL = "https://www.ebay.com/sch/i.html?_from=R40&_nkw=oneplus%206t&_sacat=0&rt=nc&_udlo=150&_udhi=450"
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

for price in soup.find_all("span", {"class": "s-item__price"}):
    print(price.get_text(strip=True))

Prints:

$449.99
$449.99
$414.46
$399.00
$399.95
$349.99
$449.00
$585.00
... and so on.

EDIT: To print the title and price, you could do, for example:

for tag in soup.select('li.s-item:has(.s-item__title):has(.s-item__price)'):
    print('{: <10} {}'.format(tag.select_one('.s-item__price').get_text(strip=True),
                              tag.select_one('.s-item__title').get_text(strip=True, separator=' ')))

Prints:

$449.99    SPONSORED OnePlus 6T 128GB 8GB RAM A6010 - Midnight Black (Unlocked) Global Version
$449.99    OnePlus 6T 128GB 8GB RAM A6010 - Midnight Black (Unlocked) Global Version
$414.46    Oneplus 6t dual sim 256gb midnight black black 6.41" unlocked ram 8gb a6010
$399.00    SPONSORED OnePlus 6T A6013, Clean ESN, Unknown Carrier, Coffee
$399.95    SPONSORED OnePlus 6T 4G LTE 6.41" 128GB ROM 8GB RAM A6013 (T-Mobile) - Mirror Black
$349.99    ONEPLUS 6T - BLACK - 128GB - (T-MOBILE) ~3841
$449.00    OnePlus 6t McLaren Edition Unlocked 256GB 10GB RAM Original Accessories Included
$434.83    OnePlus 6T 8 GB RAM 128 GB UK SIM-Free Smartphone (ML3658)
$265.74    Oneplus 6t
$241.58    New Listing OnePlus 6T 8GB 128GB UNLOCKED
$419.95    NEW IN BOX Oneplus 6T 128GB Mirror Black (T-mobile/Metro PCS/Mint) 8gb RAM
$435.99    OnePlus 6T - 128GB 6GB RAM - Mirror Black (Unlocked) Global Version

... and so on.
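
If the HTML strings are already in a Python list rather than on a live page, the same get_text() call can be applied to each one without any request. A minimal sketch; the sample list below is hypothetical and only mimics the price spans from the example above:

```python
from bs4 import BeautifulSoup

# Hypothetical sample data standing in for strings that share the same tags.
snippets = [
    '<span class="s-item__price">$449.99</span>',
    '<span class="s-item__price"> $414.46 </span>',
]

# Parse each string on its own and keep only its text.
prices = [BeautifulSoup(s, "html.parser").get_text(strip=True) for s in snippets]
print(prices)
```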

Python: Remove HTML Tags & text in between HTML Tags

You can also do it with an HTML parser such as BeautifulSoup. The idea is to find all the tags and decompose them, which removes each tag together with its contents, then keep what is left:

In [8]: from bs4 import BeautifulSoup

In [9]: price = "12.00 <b>17.50</b>"

In [10]: soup = BeautifulSoup(price, "html.parser")

In [11]: for elm in soup.find_all():
    ...:     elm.decompose()
    ...:

In [12]: print(soup)
12.00
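
The same idea can be wrapped in a small function. This is only a sketch: the function name and sample input are hypothetical, and passing recursive=False is my variation on the session above, so that nested tags are removed once through their parent rather than visited again.

```python
from bs4 import BeautifulSoup


def drop_tags_with_content(fragment):
    """Return the text of fragment with every tag and its contents removed."""
    soup = BeautifulSoup(fragment, "html.parser")
    # recursive=False (not in the original answer) visits only top-level
    # tags; decomposing a parent already discards everything inside it.
    for elm in soup.find_all(recursive=False):
        elm.decompose()
    return str(soup).strip()


remaining = drop_tags_with_content("12.00 <b>17.50 <i>sale</i></b>")
print(remaining)
```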

And here is a well-known topic explaining why you should not process HTML with regular expressions:

  • RegEx match open tags except XHTML self-contained tags

How to remove html tags from text using python?

... when I try to use [the text property] on the definition all of the text disappears...

This is because the tag you're targeting looks like this:

<meta content="foo bar baz..." name="Description" property="og:description">

When you try to access the text property on this object in Beautiful Soup, there isn't any text that's a child of the element. Instead, you're looking to extract the "content" attribute, which you can do with the square bracket "array"-style notation:

definition['content']

This feature is documented in the Attributes section of the Beautiful Soup documentation.
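
To make the difference concrete, here is a short sketch. The content value "foo bar baz" and the variable names are placeholders of mine; the real page's content is not shown in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical document containing a meta tag shaped like the one above.
html_doc = (
    '<head><meta content="foo bar baz" name="Description" '
    'property="og:description"></head>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# attrs= is needed because "name" is also find()'s own first parameter.
definition = soup.find("meta", attrs={"name": "Description"})

empty_text = definition.text          # a meta tag has no child text
content = definition["content"]       # raises KeyError if the attribute is missing
safe_content = definition.get("content", "")  # returns "" instead of raising

print(repr(empty_text), content)
```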

Remove encoded HTML tags from large string in Python

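You could try using a regular expression instead of replace to discard the HTML tags. get_text() decodes the encoded entities (for example, &lt;b&gt; becomes <b>), and re.sub then removes the resulting tags:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'html.parser')
text = soup.get_text()
text = re.sub(r'<.*?>', '', text)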
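
If BeautifulSoup is used here only to decode the entities, the standard library can cover the same two steps. This sketch swaps in html.unescape for get_text(); the function name and sample string are hypothetical.

```python
import html
import re


def strip_encoded_tags(text):
    """Decode HTML entities, then drop anything that looks like a tag."""
    decoded = html.unescape(text)  # "&lt;b&gt;" becomes "<b>"
    # [^>]* rather than .*? so that tags spanning several lines also match.
    return re.sub(r"<[^>]*>", "", decoded)


sample = "Price: &lt;b&gt;17.50&lt;/b&gt; &amp; up"
cleaned = strip_encoded_tags(sample)
print(cleaned)
```

Keep in mind that a regular expression cannot tell a real tag from ordinary text such as "a < b and c > d", so this is only suitable when the input is known to be simple.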

Delete HTML Tags from string Python

Delete all tags:

import re
text = "This is the description of <img alt='' height='1' src='http://linkOfARandomImage.of/the/feed' width='1' /> the <br> text"
text = re.sub(r"<.*?>", "", text)
# text == "This is the description of  the  text" (note the doubled spaces)

Delete unnecessary whitespace (use \s, which matches whitespace; \w matches word characters and would blank out the words themselves):

text = re.sub(r"\s+", " ", text)
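
Putting the two steps together, a minimal sketch might look like this. The function name is hypothetical, and the whitespace step uses \s+ (runs of whitespace) followed by strip() to trim the ends.

```python
import re


def remove_tags_and_squeeze(text):
    """Remove tag-like spans, then collapse runs of whitespace."""
    no_tags = re.sub(r"<[^>]*>", "", text)
    return re.sub(r"\s+", " ", no_tags).strip()


text = (
    "This is the description of <img alt='' height='1' "
    "src='http://linkOfARandomImage.of/the/feed' width='1' /> the <br> text"
)
cleaned = remove_tags_and_squeeze(text)
print(cleaned)
```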

