Remove a tag using BeautifulSoup but keep its contents
The strategy I used is to replace a tag with its contents if they are of type NavigableString; if they aren't, recurse into them and replace their contents with NavigableString, and so on. Try this:
# Written for BeautifulSoup 3 on Python 2 (note the import, unicode() and print statement)
from BeautifulSoup import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html)
    for tag in soup.findAll(True):
        if tag.name in invalid_tags:
            s = ""
            for c in tag.contents:
                # nested tags are stripped recursively before being appended
                if not isinstance(c, NavigableString):
                    c = strip_tags(unicode(c), invalid_tags)
                s += unicode(c)
            tag.replaceWith(s)
    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print strip_tags(html, invalid_tags)
The result is:
<p>Good, bad, and ugly</p>
I gave this same answer on another question. It seems to come up a lot.
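For readers on Python 3 with bs4, the same effect can be sketched with unwrap(), which replaces a tag with its children. This is a minimal sketch rather than the original answer's code: the choice of "html.parser" is an assumption, and strip_tags simply reuses the name from the snippet above.

```python
from bs4 import BeautifulSoup


def strip_tags(html, invalid_tags):
    """Remove every tag named in invalid_tags but keep its contents."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() accepts a list of names and returns a list up front,
    # so unwrapping inside the loop does not disturb the iteration.
    for tag in soup.find_all(invalid_tags):
        tag.unwrap()
    return soup


html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
result = str(strip_tags(html, ["b", "i", "u"]))
print(result)  # <p>Good, bad, and ugly</p>
```

Because unwrap() moves the children into the parent, nested invalid tags are still attached to the tree when the loop reaches them, so no recursion is needed.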
BeautifulSoup Keep some text, but remove the rest of the tag
from bs4 import BeautifulSoup, CData

txt = '''<description><![CDATA[ <p>This is a test post with a few emotes <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/grin.png?v=9" title=":grin:" class="emoji" alt=":grin:"> <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/heart.png?v=9" title=":heart:" class="emoji" alt=":heart:"></p> ]]></description>'''

# load main soup:
soup = BeautifulSoup(txt, 'html.parser')

# find CDATA inside <description>, make new soup
soup2 = BeautifulSoup(soup.find('description').find(string=lambda t: isinstance(t, CData)), 'html.parser')

# replace <img> with their alt=...
for img in soup2.select('img'):
    img.replace_with(img['alt'])

# print text
print(soup2.p.text)
Prints:
This is a test post with a few emotes :grin: :heart:
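One caveat with img['alt'] is that it raises KeyError for an image without an alt attribute. A hedged variant, using a made-up HTML string and falling back to an empty string via Tag.get(), might look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical input: the second <img> has no alt attribute.
html = '<p>Hi <img alt=":grin:" src="a.png"/> there <img src="b.png"/></p>'
soup = BeautifulSoup(html, "html.parser")

for img in soup.find_all("img"):
    # Tag.get() returns the default instead of raising KeyError.
    img.replace_with(img.get("alt", ""))

text = soup.p.get_text()
print(repr(text))
```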
How to remove br tag but keep everything within the same paragraph
To remove the <br>s from the <p>, you could use extract() or decompose():
...
necessarytext = soup.find("p")
# find_all() returns a list, so removing elements inside the loop is safe
for x in necessarytext.find_all('br'):
    x.extract()
    ## or
    ## x.decompose()
abstracttext.append(necessarytext)
...
Note: in case it was not clear, if you do not need the <p> at all, just call abstracttext.append(soup.find("p").text). This gives the plain text of the <p> without the <br/>.
Example
from bs4 import BeautifulSoup

abstracttext = []
html='''<p>a <br/> b <br/> c</p>'''
soup = BeautifulSoup(html, "html.parser")

necessarytext = soup.find("p")
# find_all() returns a list, so removing elements inside the loop is safe
for x in necessarytext.find_all('br'):
    x.decompose()
abstracttext.append(necessarytext)
print(abstracttext)
Output
[<p>a b c</p>]
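The difference between the two calls, and the plain-text shortcut from the note, can be sketched as follows. The input strings are made up for illustration, with two adjacent <br/> tags to show that collecting them with find_all() first handles consecutive breaks.

```python
from bs4 import BeautifulSoup

html = "<p>a<br/><br/>b<br/>c</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# extract() detaches each <br> and returns it, so it can be kept or reused;
# decompose() would instead destroy the element and return None.
removed = [br.extract() for br in p.find_all("br")]
remaining = str(p)

# If only the text matters, get_text() skips the tags altogether.
soup2 = BeautifulSoup("<p>a <br/> b <br/> c</p>", "html.parser")
plain = soup2.p.get_text(" ", strip=True)

print(remaining)
print(plain)
```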
How do I use BeautifulSoup to replace a tag with its contents?
I've voted to close this as a duplicate, but in case it's of use, reapplying slacy's answer from the top related answer gives you this solution:
from BeautifulSoup import BeautifulSoup
html = '''
<div>
<p>dvgbkfbnfd</p>
<div>
<span>dsvdfvd</span>
</div>
<p>fvjdfnvjundf</p>
</div>
'''
soup = BeautifulSoup(html)
for match in soup.findAll('div'):
    match.replaceWithChildren()
print soup
... which produces the output:
<p>dvgbkfbnfd</p>
<span>dsvdfvd</span>
<p>fvjdfnvjundf</p>
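The snippet above targets BeautifulSoup 3. In bs4 the method names changed (findAll became find_all, replaceWithChildren became unwrap), so a Python 3 port could look roughly like this. The "html.parser" argument is an assumption; any installed parser should behave similarly here.

```python
from bs4 import BeautifulSoup

html = """
<div>
<p>dvgbkfbnfd</p>
<div>
<span>dsvdfvd</span>
</div>
<p>fvjdfnvjundf</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for match in soup.find_all("div"):
    match.unwrap()  # bs4 name for BeautifulSoup 3's replaceWithChildren()

result = str(soup)
print(result)
```

The newlines that sat between the tags are text nodes and stay in the output; only the <div> wrappers are removed.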
BeautifulSoup - remove children but keep their contents
You can try this:
from bs4 import BeautifulSoup
payload='''
<html>
<body>
<div >
<code>
<p class="nt">&lt;my-component</p> <p class="na">v-bind:prop1=</p><p class="s">"parentValue"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
<p class="c">&lt;!-- Or more succinctly, --&gt;</p>
<p class="nt">&lt;my-component</p> <p class="na">:prop1=</p><p class="s">"parentValue"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
</code>
</div>
<div>
<code>
<p class="nt">&lt;my-component</p> <p class="na">v-on:myEvent=</p><p class="s">"parentHandler"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
<p class="c">&lt;!-- Or more succinctly, --&gt;</p>
<p class="nt">&lt;my-component</p> <p class="err">@</p><p class="na">myEvent=</p><p class="s">"parentHandler"</p><p class="nt">&gt;&lt;/my-component&gt;</p>
</code>
</div>
</body>
</html>
'''
soup = BeautifulSoup(payload, 'lxml')
for match in soup.find_all('code'):
    # build a fresh <code> whose only child is the flattened text
    new_t = soup.new_tag('code')
    new_t.string = match.text
    match.replace_with(new_t)
with open(r'prove.html', "w") as file:
    file.write(str(soup))
Output (prove.html):
<html>
<body>
<div>
<code>
&lt;my-component v-bind:prop1="parentValue"&gt;&lt;/my-component&gt;
&lt;!-- Or more succinctly, --&gt;
&lt;my-component :prop1="parentValue"&gt;&lt;/my-component&gt;
</code>
</div>
<div>
<code>
&lt;my-component v-on:myEvent="parentHandler"&gt;&lt;/my-component&gt;
&lt;!-- Or more succinctly, --&gt;
&lt;my-component @myEvent="parentHandler"&gt;&lt;/my-component&gt;
</code>
</div>
</body>
</html>
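A slightly shorter variant of the same idea assigns the flattened text back to the existing tag instead of building a new one. This is a sketch on a made-up snippet, not the payload above; it relies on the fact that setting .string replaces all of a tag's children.

```python
from bs4 import BeautifulSoup

# Hypothetical input: highlighted code wrapped in <span> children.
html = '<code class="py"><span class="k">print</span>(<span class="s">"hi"</span>)</code>'
soup = BeautifulSoup(html, "html.parser")

for code in soup.find_all("code"):
    # get_text() is evaluated first, then .string swaps out every child
    # for a single text node holding that text.
    code.string = code.get_text()

result = str(soup)
print(result)
```

Unlike the new_tag() approach, this keeps any attributes on the original <code> tag, which may or may not be what you want.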
BeautifulSoup remove tag attributes and text contents
As the docs say,
You can’t edit a string in place, but you can replace one string with another, using replace_with()
so I would go for something like this (assume soup is exactly what you posted):
for e in soup.find_all(True):
    e.attrs = {}
    for i in e.contents:
        if i.string:
            i.string.replace_with('')
I think that without looping over each tag's contents you'll end up with leftover text whenever a tag has more than one child and one of them is text while another is a tag containing text (as in your example <p><span style="color:red">My</span> first paragraph.</p>).
When run against your example:
(env) $ python strip.py
<!DOCTYPE html>
<html><body><h1></h1><p><span></span></p><img/></body></html>
(it can be changed a little so it doesn't return newlines or doctype)
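A self-contained sketch of the same cleanup is below. The HTML string is a stand-in (the original question's markup is not reproduced here, apart from the <p> example quoted above), and it is a variant of the loop above rather than the same code: it removes the text nodes with extract() after collecting them via find_all(string=True).

```python
from bs4 import BeautifulSoup

# Hypothetical input, loosely modelled on the example mentioned above.
html = (
    '<h1 class="title">My First Heading</h1>'
    '<p><span style="color:red">My</span> first paragraph.</p>'
    '<img src="x.png"/>'
)
soup = BeautifulSoup(html, "html.parser")

# Drop every attribute on every tag.
for tag in soup.find_all(True):
    tag.attrs = {}

# Collect all text nodes first, then detach them one by one.
for text in soup.find_all(string=True):
    text.extract()

result = str(soup)
print(result)  # <h1></h1><p><span></span></p><img/>
```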