Python/BeautifulSoup - How to Remove All Tags from an Element

Remove all the u and a tags from within all div tags of a class using BeautifulSoup or re

BeautifulSoup has a built-in way of getting the text from a tag with the markup stripped out (roughly the text that would be displayed when rendered in a browser): the .text property. Running the following code, I get your expected output:

import re
from bs4 import BeautifulSoup

data = """
<div class="sf-item"> The rabbit got to the halfway point at
<u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle.
</div>
<div class="sf">
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap.
</div>
<div class="sf-item"> Even if the turtle passed him at
<u><a href="https://DummyLocationURL/">Link</a></u>. he would be able to race to the finish line ahead of
<u><a href="https://DummyLocationURL/">place</a></u>, he just kept going.
</div>
"""

soup = BeautifulSoup(data, "html.parser")

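# find_all is the current name of the older findAll alias
rMessage = soup.find_all("div", {"class": "sf-item"})

fResult = []

for result in rMessage:
    # .text drops the tags; the replace removes the newlines left behind
    fResult.append(result.text.replace('\n', ''))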

That will give you the proper output, but with some extra spaces. If you want to reduce them all to single spaces, you can run fResult through this:

fResult = [re.sub(' +', ' ', result) for result in fResult]
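
As a possible alternative, the whitespace can be normalised in one step by splitting the text on any whitespace and rejoining it with single spaces. This is only a sketch: the shortened html string and the texts name are made up for illustration, and it turns newlines into spaces rather than deleting them, so words on either side of a line break do not run together.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample in the same shape as the data above.
html = """
<div class="sf-item"> Even if the turtle passed him at
<u><a href="https://DummyLocationURL/">Link</a></u>. he just kept going.
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# str.split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines), so joining with " " collapses them all.
texts = [" ".join(div.get_text().split())
         for div in soup.find_all("div", class_="sf-item")]

print(texts)
```

Because split() also discards leading and trailing whitespace, no separate strip or regex pass is needed with this approach.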

BeautifulSoup Remove all html tags except for those in whitelist such as img and a tags with python

You could select all of the descendant nodes by accessing the .descendants property.

From there, you could iterate over all of the descendants and filter them based on the name property. If the node doesn't have a name property, then it is likely a text node, which you want to keep. If the name property is a or img, then you keep it as well.

# This should be the wrapper that you are targeting
container = soup.find('div')
keep = []

for node in container.descendants:
    if not node.name or node.name == 'a' or node.name == 'img':
        keep.append(node)

Here is an alternative where all the filtered elements are used to create the list directly:

# This should be the wrapper that you are targeting
container = soup.find('div')

keep = [node for node in container.descendants
        if not node.name or node.name == 'a' or node.name == 'img']

Also, if you don't want strings that are empty to be returned, you can trim the whitespace and check for that as well:

keep = [node for node in container.descendants
        if (not node.name and len(node.strip())) or
           (node.name == 'a' or node.name == 'img')]

Based on the HTML that you provided, the following would be returned:

> ['Hello all ', <a href="xx"></a>, <img rscr="xx"/>]
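
If the goal is to end up with markup rather than a list of nodes, one option is to unwrap every tag that is not on the whitelist. The following is a sketch under assumptions: the sample string, the WHITELIST name and the strip_except helper are invented here, and it relies on Tag.unwrap(), which replaces a tag with its own children.

```python
from bs4 import BeautifulSoup

WHITELIST = {"a", "img"}  # hypothetical set of tag names to keep


def strip_except(html, keep=WHITELIST):
    """Return html with every tag removed except those named in keep."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all(True) returns every tag; the result list is built up front,
    # so unwrapping inside the loop does not disturb the iteration.
    for tag in soup.find_all(True):
        if tag.name not in keep:
            tag.unwrap()  # drop the tag itself, keep its children in place
    return str(soup)


# Hypothetical input, not the HTML from the original question.
sample = '<div><p>Hello <b>all</b> <a href="xx">link</a></p><img src="xx"/></div>'
cleaned = strip_except(sample)
print(cleaned)
```

Unlike the list of descendants, this keeps the remaining nodes in document order inside a single string, and the text inside a kept a tag is not reported a second time as a separate node.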

Remove All html tag except one tag by BeautifulSoup

Consider the following:

import re
from bs4 import BeautifulSoup

def cleanMe(html):
    soup = BeautifulSoup(html, 'html.parser')  # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]):  # remove all javascript and stylesheet code
        script.decompose()
    # get text
    text = soup.get_text()
    for link in soup.find_all('a'):
        if 'href' in link.attrs:
            repl = link.get_text()
            href = link.attrs['href']
            # rebuild the anchor so that it keeps only its href and its text
            link.clear()
            link.attrs = {}
            link.attrs['href'] = href
            link.append(repl)
            # put the anchor back in place of the first occurrence of its text
            # that has not been replaced yet
            text = re.sub(re.escape(repl) + '(?! *?</a>)', str(link), text, count=1)

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each (split on double spaces)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text

What is new here is the loop over the anchors:

for link in soup.find_all('a'):
    text = re.sub(re.escape(link.get_text()) + '(?! *?</a>)', str(link), text, count=1)

For each anchor, the link text in the extracted text is replaced once with the whole anchor tag. Because count=1 is passed, only the first matching occurrence of the link text is replaced.

The pattern re.escape(link.get_text()) + '(?! *?</a>)' makes sure that the link text is replaced only if it has not been replaced already. re.escape keeps any regex metacharacters in the link text from being read as pattern syntax.

(?! *?</a>) is a negative lookahead: it rejects any occurrence of the link text that is immediately followed by optional spaces and </a>, that is, one that already sits inside an anchor. Take care not to write (?!= *?</a>): the = would be matched literally, so the lookahead would almost never reject anything.
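
To see the lookahead idea in isolation, here is a small standard-library sketch. The sample text and anchor strings are made up, and the pattern uses the plain (?!...) negative lookahead form together with re.escape.

```python
import re

text = "Link and Link again"
anchor = '<a href="https://example.com/">Link</a>'  # hypothetical anchor markup

# Match "Link" only when it is NOT immediately followed by optional spaces
# and a closing </a>, i.e. skip occurrences that are already wrapped.
pattern = re.escape("Link") + r"(?! *</a>)"

# Each call replaces at most one occurrence (count=1).
once = re.sub(pattern, anchor, text, count=1)
twice = re.sub(pattern, anchor, once, count=1)

print(once)
print(twice)
```

The second call skips the occurrence that the first call already wrapped and moves on to the next one, which is the behaviour the loop over the anchors depends on.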

But this is not the most foolproof way. The most foolproof way is to go through each tag and get the text out.
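
The tag-by-tag approach mentioned above could look roughly like this. It is a sketch, not the original author's code: the text_with_links helper and the sample markup are hypothetical, text is not re-escaped, and HTML comments are not filtered out.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def text_with_links(node):
    """Walk node's children, keeping text and <a href=...> tags only."""
    parts = []
    for child in node.children:
        if isinstance(child, NavigableString):
            parts.append(str(child))  # plain text is kept as-is
        elif isinstance(child, Tag):
            if child.name in ("script", "style"):
                continue  # skip code and stylesheets entirely
            inner = text_with_links(child)  # recurse into nested tags
            if child.name == "a" and child.has_attr("href"):
                # rebuild the anchor with only its href attribute
                parts.append('<a href="{}">{}</a>'.format(child["href"], inner))
            else:
                parts.append(inner)  # any other tag contributes only its text
    return "".join(parts)


# Hypothetical input for illustration.
sample = ('<div><p>See <a class="x" href="/docs"><b>the docs</b></a> now.</p>'
          '<script>x()</script></div>')
result = text_with_links(BeautifulSoup(sample, "html.parser"))
print(result)
```

Because the output is built while walking the tree, no regex replacement is needed and repeated link texts cannot be confused with each other.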


Can I remove script tags with BeautifulSoup?

from bs4 import BeautifulSoup
soup = BeautifulSoup('<script>a</script>baba<script>b</script>', 'html.parser')
for s in soup.select('script'):
    s.extract()
print(soup)  # prints: baba

Remove a tag using BeautifulSoup but keep its contents

The strategy I used is to replace a tag with its contents if they are of type NavigableString; if they aren't, recurse into them and replace their contents with NavigableString as well. Try this:

from bs4 import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.find_all(True):
        if tag.name in invalid_tags:
            s = ""

            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    # nested tag: strip it recursively before appending
                    c = strip_tags(str(c), invalid_tags)
                s += str(c)

            # swap the tag for the text collected from its contents
            tag.replace_with(s)

    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print(strip_tags(html, invalid_tags))

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.
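
With bs4, a shorter route to the same result is Tag.unwrap(), which replaces a tag with its children. The sketch below assumes bs4 and the html.parser backend; the strip_tags_unwrap name is made up here to avoid clashing with the function above.

```python
from bs4 import BeautifulSoup


def strip_tags_unwrap(html, invalid_tags):
    """Remove the listed tags but keep whatever they contain."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of names and matches any of them
    for tag in soup.find_all(invalid_tags):
        tag.unwrap()  # the tag disappears, its children stay in place
    return str(soup)


html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
stripped = strip_tags_unwrap(html, ["b", "i", "u"])
print(stripped)
```

Since unwrap() keeps child tags as real nodes rather than converting them to strings, tags that are not in invalid_tags survive even when nested inside a removed tag.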


