Remove a Tag Using Beautifulsoup But Keep Its Contents

Remove a tag using BeautifulSoup but keep its contents

The strategy I used is to replace each unwanted tag with its contents. Contents of type NavigableString are used as they are; anything else is recursed into, so that nested unwanted tags are reduced to strings as well. Try this:

from bs4 import BeautifulSoup, NavigableString

def strip_tags(html, invalid_tags):
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.find_all(True):
        if tag.name in invalid_tags:
            s = ""

            # build the replacement text from the tag's children
            for c in tag.contents:
                if not isinstance(c, NavigableString):
                    # a nested tag: strip it recursively first
                    c = strip_tags(str(c), invalid_tags)
                s += str(c)

            tag.replace_with(s)

    return soup

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
invalid_tags = ['b', 'i', 'u']
print(strip_tags(html, invalid_tags))

The result is:

<p>Good, bad, and ugly</p>

I gave this same answer on another question. It seems to come up a lot.
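
bs4 also ships Tag.unwrap(), which removes a tag and leaves its children in place, so the same result can be had without rebuilding strings. A minimal sketch, assuming bs4 with the built-in html.parser; the helper name unwrap_tags is made up for illustration and is not part of the answer above:

```python
from bs4 import BeautifulSoup

def unwrap_tags(html, invalid_tags):
    """Return html with every tag named in invalid_tags removed but its contents kept."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all() collects the matches up front, so unwrapping inside the loop is safe
    for tag in soup.find_all(invalid_tags):
        tag.unwrap()
    return str(soup)

html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
print(unwrap_tags(html, ["b", "i", "u"]))
```

Because the children are moved rather than converted to text, a nested tag that is not in invalid_tags stays a real tag in the tree.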

BeautifulSoup Keep some text, but remove the rest of the tag

from bs4 import BeautifulSoup, CData

txt = '''<description><![CDATA[ <p>This is a test post with a few emotes <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/grin.png?v=9" title=":grin:" class="emoji" alt=":grin:"> <img src="https://sjc5.discourse-cdn.com/try/images/emoji/twitter/heart.png?v=9" title=":heart:" class="emoji" alt=":heart:"></p> ]]></description>'''

# load main soup:
soup = BeautifulSoup(txt, 'html.parser')

# find CDATA inside <description>, make new soup
soup2 = BeautifulSoup(soup.find('description').find(text=lambda t: isinstance(t, CData)), 'html.parser')

# replace <img> with their alt=...
for img in soup2.select('img'):
    img.replace_with(img['alt'])

# print text
print(soup2.p.text)

Prints:

This is a test post with a few emotes :grin: :heart:
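
The replacement step does not depend on the CDATA handling and works on any HTML fragment. A small sketch, assuming bs4 and html.parser; the function name imgs_to_alt and the sample markup are hypothetical, and img.get() is used here instead of img['alt'] so that an image without an alt attribute does not raise:

```python
from bs4 import BeautifulSoup

def imgs_to_alt(html):
    """Replace each <img> in an HTML fragment with its alt text."""
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        # .get() avoids a KeyError when an image has no alt attribute
        img.replace_with(img.get("alt", ""))
    return str(soup)

print(imgs_to_alt('<p>emotes <img alt=":grin:" src="grin.png"> <img src="heart.png"></p>'))
```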

How to remove br tag but keep everything within the same paragraph

To remove the <br>s from the <p> you could use extract() or decompose():

...
necessarytext = soup.find("p")

# iterate over a copy so that removing a node cannot skip its sibling
for x in list(necessarytext.contents):
    if x.name == 'br':
        x.extract()
        # or:
        # x.decompose()

abstracttext.append(necessarytext)
...

Note, because this was not that clear: if you only need the text, just call abstracttext.append(soup.find("p").text); this gives the plain text of the <p> without the <br> tags.

Example

from bs4 import BeautifulSoup

abstracttext = []

html='''<p>a <br/> b <br/> c</p>'''
soup = BeautifulSoup(html, "html.parser")
necessarytext = soup.find("p")
# iterate over a copy so that removing a node cannot skip its sibling
for x in list(necessarytext.contents):
    if x.name == 'br':
        x.decompose()

abstracttext.append(necessarytext)

print(abstracttext)

Output

[<p>a  b  c</p>]
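
If only the text is needed, the tree does not have to be modified at all. A short sketch, assuming bs4 and the same sample markup as above, comparing .text with get_text() and its separator and strip arguments:

```python
from bs4 import BeautifulSoup

html = "<p>a <br/> b <br/> c</p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# .text drops every tag, including <br>, and keeps the original whitespace
raw_text = p.text
# get_text() can trim each piece of text and join the pieces with one space
clean_text = p.get_text(" ", strip=True)

print(repr(raw_text))
print(repr(clean_text))
```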

How do I use BeautifulSoup to replace a tag with its contents?

I've voted to close as a duplicate, but in case it's of use, reapplying slacy's answer from the top related question gives you this solution:

from bs4 import BeautifulSoup

html = '''
<div>
<p>dvgbkfbnfd</p>
<div>
<span>dsvdfvd</span>
</div>
<p>fvjdfnvjundf</p>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
for match in soup.find_all('div'):
    # unwrap() is the bs4 name for BS3's replaceWithChildren()
    match.unwrap()

print(soup)

... which produces the output:

<p>dvgbkfbnfd</p>

<span>dsvdfvd</span>

<p>fvjdfnvjundf</p>

BeautifulSoup - remove children but keep their contents

You can try this:

from bs4 import BeautifulSoup

payload='''
<html>
<body>
<div >
<code>
    <p class="nt"><my-component</p> <p class="na">v-bind:prop1=</p><p class="s">"parentValue"</p><p class="nt">></my-component></p>
    <p class="c"><!-- Or more succinctly, --></p>
    <p class="nt"><my-component</p> <p class="na">:prop1=</p><p class="s">"parentValue"</p><p class="nt">></my-component></p>
</code>
</div>
<div>
<code>
    <p class="nt"><my-component</p> <p class="na">v-on:myEvent=</p><p class="s">"parentHandler"</p><p class="nt">></my-component></p>
    <p class="c"><!-- Or more succinctly, --></p>
    <p class="nt"><my-component</p> <p class="err">@</p><p class="na">myEvent=</p><p class="s">"parentHandler"</p><p class="nt">></my-component></p>

</code>
</div>
</body>
</html>
'''

soup = BeautifulSoup(payload, 'lxml')

for match in soup.find_all('code'):
    new_t=soup.new_tag('code')
    new_t.string=match.text
    match.replace_with(new_t)

with open(r'prove.html', "w") as file:
    file.write(str(soup))

Output (prove.html):

<html>
<body>
<div>
<code>
<my-component v-bind:prop1="parentValue"></my-component>
<!-- Or more succinctly, -->
<my-component :prop1="parentValue"></my-component>
</code>
</div>
<div>
<code>
<my-component v-on:myEvent="parentHandler"></my-component>
<!-- Or more succinctly, -->
<my-component @myEvent="parentHandler"></my-component>
</code>
</div>
</body>
</html>
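
An alternative that edits the existing <code> element instead of building a new one is to unwrap each child tag. This is only a sketch, assuming bs4 and html.parser, with a hypothetical shortened payload in place of the one above:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened input: highlighted <p> fragments inside <code>
payload = '<code><p class="na">:prop1=</p><p class="s">"parentValue"</p></code>'
soup = BeautifulSoup(payload, "html.parser")

for code in soup.find_all("code"):
    # find_all(True) matches every descendant tag, so nested wrappers are removed too
    for child in code.find_all(True):
        child.unwrap()

result = str(soup)
print(result)
```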

BeautifulSoup remove tag attributes and text contents

As the docs say,

You can’t edit a string in place, but you can replace one string with another, using replace_with()

so I would go for something like this (assume soup is exactly what you posted):

for e in soup.find_all(True):
    e.attrs = {}

    for i in e.contents:
        if i.string:
            i.string.replace_with('')

I think that without looping over each tag's contents you'll end up with leftover text whenever a tag has more than one child, one of them being text and another being a tag that itself contains text (as with "My first paragraph." in your example).

When run against your example:

(env) $ python strip.py
<!DOCTYPE html>

<html><body><h1></h1><p><span></span></p><img/></body></html>

(it can be changed a little so it doesn't return newlines or doctype)
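
One possible way to make that change is to remove every string node outright instead of blanking it, which also drops the doctype and the whitespace between tags. This is a sketch under assumptions, not the code from the answer: it assumes bs4 with html.parser, and the input is a hypothetical stand-in for the page in the question:

```python
from bs4 import BeautifulSoup

# Hypothetical input standing in for the page from the question
html = """<!DOCTYPE html>
<html><body><h1 id="t">My First Heading</h1>
<p class="x">My first paragraph.<span>text</span></p>
<img src="a.png"/></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# string=True matches every string node, including the doctype
# and the whitespace-only strings between tags
for node in soup.find_all(string=True):
    node.extract()

# drop all attributes from every remaining tag
for tag in soup.find_all(True):
    tag.attrs = {}

result = str(soup)
print(result)
```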
