Remove a Tag Using BeautifulSoup But Keep Its Contents

Remove a tag using BeautifulSoup but keep its contents

The strategy I used is to replace each unwanted tag with its contents. Children that are already NavigableString objects are kept as they are; any other child is a nested tag, so I recurse into it and flatten it the same way. Try this:

# Python 2, BeautifulSoup 3
from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)

    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""

            # Build the replacement text from the tag's children
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    # Nested tag: strip it recursively first
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)

            tag.replaceWith(s)

    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.
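
For readers on Python 3 with the bs4 package, the same result can probably be reached without recursion, because bs4 tags have an unwrap() method that replaces a tag with its own children. The following is only a sketch under that assumption; the function name is reused from the answer above and the "html.parser" backend is my choice, not something the original specifies.

```python
from bs4 import BeautifulSoup


def strip_tags(html, invalid_tags):
    """Remove the named tags from html but keep whatever they contain."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list of matches, so changing the tree
    # inside the loop does not disturb the iteration.
    for tag in soup.find_all(invalid_tags):
        tag.unwrap()  # replace the tag with its own children
    return soup


html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
result = strip_tags(html, ["b", "i", "u"])
print(result)
```

Nested unwanted tags need no special handling here: once the outer tag is unwrapped, its children sit directly in the parent and are unwrapped in turn.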

BeautifulSoup Keep some text, but remove the rest of the tag

from bs4 import BeautifulSoup, CData

txt = '''<description><![CDATA[ <p>This is a test post with a few emotes <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/grin.png?v=9" title=":grin:" class="emoji" alt=":grin:"> <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/heart.png?v=9" title=":heart:" class="emoji" alt=":heart:"></p> ]]></description>'''

# load main soup:
soup = BeautifulSoup(txt, 'html.parser')

# find CDATA inside <description>, make new soup
soup2 = BeautifulSoup(soup.find('description').find(text=lambda t: isinstance(t, CData)), 'html.parser')

# replace <img> with their alt=...
for img in soup2.select('img'):
    img.replace_with(img['alt'])

# print text
print(soup2.p.text)

Prints:

This is a test post with a few emotes :grin: :heart:
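
If some images might lack an alt attribute, img['alt'] would raise a KeyError. A small helper along these lines may be more forgiving; the function name imgs_to_alt and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def imgs_to_alt(html):
    """Return the text of html with every <img> replaced by its alt text."""
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        # .get() falls back to an empty string when alt is missing
        img.replace_with(img.get("alt", ""))
    return soup.get_text()


# Hypothetical sample: the second image has no alt attribute
sample = '<p>Hi <img src="grin.png" alt=":grin:"> there <img src="x.png"></p>'
text = imgs_to_alt(sample).strip()
print(text)
```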

How to remove br tag but keep everything within the same paragraph

To remove the <br>s from the <p> you could use extract() or decompose():

...
necessarytext = soup.find("p")

# find_all() returns a list, so removing nodes inside the loop is safe
for x in necessarytext.find_all('br'):
    x.extract()
    ##or
    ##x.decompose()

abstracttext.append(necessarytext)
...

Note: in case it was not clear, if you do not need the <p> itself, just call abstracttext.append(soup.find("p").text). This gives the plain text of the <p> without the <br/> tags.

Example

from bs4 import BeautifulSoup

abstracttext = []

html='''<p>a <br/> b <br/> c</p>'''
soup = BeautifulSoup(html, "html.parser")
necessarytext = soup.find("p")
for x in necessarytext.find_all('br'):
    x.decompose()

abstracttext.append(necessarytext)

print(abstracttext)

Output

[<p>a  b  c</p>]
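
When the text around a <br> has no spaces of its own, dropping the tag glues the neighbouring words together. One possible variation, sketched here with a made-up sample string, is to swap each <br> for a space and collapse the whitespace afterwards.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with no spaces around the <br> tags
html = "<p>a<br/><br/>b<br/>c</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# Put a space where each <br> used to be instead of removing it outright
for br in p.find_all("br"):
    br.replace_with(" ")

print(p)

# Collapse runs of whitespace to get tidy plain text
plain = " ".join(p.get_text().split())
print(plain)
```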

How do I use BeautifulSoup to replace a tag with its contents?

I've voted to close this as a duplicate, but in case it's of use, reapplying slacy's answer from the top related question gives you this solution:

from BeautifulSoup import BeautifulSoup

html = '''
<div>
<p>dvgbkfbnfd</p>
<div>
<span>dsvdfvd</span>
</div>
<p>fvjdfnvjundf</p>
</div>
'''

soup = BeautifulSoup(html)
for match in soup.findAll('div'):
    # replaceWithChildren() is the BeautifulSoup 3 name; bs4 calls it unwrap()
    match.replaceWithChildren()

print soup

... which produces the output:

<p>dvgbkfbnfd</p>

<span>dsvdfvd</span>

<p>fvjdfnvjundf</p>
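
If only some of the wrappers should go, the same idea can be limited with a CSS selector. This sketch uses the bs4 package, where the method is named unwrap(), and removes only a div nested inside another div; the shortened markup is my own stand-in for the example above.

```python
from bs4 import BeautifulSoup

# Stand-in markup: an outer <div> containing a nested <div>
html = "<div><p>one</p><div><span>two</span></div><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")

# "div div" matches only a <div> that sits inside another <div>
for match in soup.select("div div"):
    match.unwrap()

print(soup)
```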

BeautifulSoup - remove children but keep their contents

You can try this:

from bs4 import BeautifulSoup

payload='''
<html>
<body>
<div >
<code>
<p class="nt">&lt;my-component</p> <p class="na">v-bind:prop1=</p><p class="s">"parentValue"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
<p class="c">&lt;!-- Or more succinctly, --&gt;</p>
<p class="nt">&lt;my-component</p> <p class="na">:prop1=</p><p class="s">"parentValue"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
</code>
</div>
<div>
<code>
<p class="nt">&lt;my-component</p> <p class="na">v-on:myEvent=</p><p class="s">"parentHandler"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
<p class="c">&lt;!-- Or more succinctly, --&gt;</p>
<p class="nt">&lt;my-component</p> <p class="err">@</p><p class="na">myEvent=</p><p class="s">"parentHandler"</p><p class="nt">&gt;&lt;/my-component&gt;</p>

</code>
</div>
</body>
</html>
'''

soup = BeautifulSoup(payload, 'lxml')

for match in soup.find_all('code'):
    # Rebuild <code> so that it holds only the text of the old one
    new_t = soup.new_tag('code')
    new_t.string = match.text
    match.replace_with(new_t)

with open(r'prove.html', "w") as file:
    file.write(str(soup))

Output (prove.html):

<html>
<body>
<div>
<code>
&lt;my-component v-bind:prop1="parentValue"&gt;&lt;/my-component&gt;
&lt;!-- Or more succinctly, --&gt;
&lt;my-component :prop1="parentValue"&gt;&lt;/my-component&gt;
</code>
</div>
<div>
<code>
&lt;my-component v-on:myEvent="parentHandler"&gt;&lt;/my-component&gt;
&lt;!-- Or more succinctly, --&gt;
&lt;my-component @myEvent="parentHandler"&gt;&lt;/my-component&gt;
</code>
</div>
</body>
</html>
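
Building a new tag drops any attributes the original <code> carried. If those matter, one alternative that may work is to assign to the tag's .string, which clears the children and leaves the tag itself in place. The short payload and the class name below are invented for the sketch.

```python
from bs4 import BeautifulSoup

# Invented payload: highlighted markup inside a <code> that has a class
payload = (
    '<div><code class="language-html">'
    '<p class="nt">&lt;a</p> <p class="na">href=</p>'
    '<p class="s">"x"</p><p class="nt">&gt;</p>'
    "</code></div>"
)
soup = BeautifulSoup(payload, "html.parser")

for code in soup.find_all("code"):
    # Assigning .string replaces every child with a single text node
    code.string = code.get_text()

print(soup)
```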

BeautifulSoup remove tag attributes and text contents

As the docs say,

You can’t edit a string in place, but you can replace one string with another, using replace_with()

so I would go for something like this (assume soup is exactly what you posted):

for e in soup.find_all(True):
    e.attrs = {}

    for i in e.contents:
        if i.string:
            i.string.replace_with('')

I think that without looping over each tag's contents you would end up with leftover text whenever a tag has several children, one of them a text node and another a tag that itself contains text (as in your example <p><span style="color:red">My</span> first paragraph.</p>).

When run against your example:

(env) $ python strip.py
<!DOCTYPE html>

<html><body><h1></h1><p><span></span></p><img/></body></html>

(it can be changed a little so it doesn't return newlines or doctype)
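
Another way to get the same effect, sketched here as an assumption rather than as part of the answer above, is to collect every text node with find_all(string=True) and remove the nodes instead of blanking them. The sample HTML is a guess at the kind of input in the question, without the doctype.

```python
from bs4 import BeautifulSoup

# Assumed sample input, loosely modelled on the question
html = (
    '<h1 id="t">Title</h1>'
    '<p><span style="color:red">My</span> first paragraph.</p>'
    '<img src="a.png"/>'
)
soup = BeautifulSoup(html, "html.parser")

# Drop every attribute from every tag
for tag in soup.find_all(True):
    tag.attrs = {}

# Remove every text node, wherever it sits in the tree
for text in soup.find_all(string=True):
    text.extract()

print(soup)
```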


