Using BeautifulSoup to Extract Text Without Tags

Using BeautifulSoup to extract text without tags

Just loop through all the <strong> tags and use next_sibling to get what you want. Like this:

for strong_tag in soup.find_all('strong'):
    print(strong_tag.text, strong_tag.next_sibling.strip())

Demo:

from bs4 import BeautifulSoup

html = '''
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
'''

soup = BeautifulSoup(html, 'html.parser')

for strong_tag in soup.find_all('strong'):
    # next_sibling is the bare text node that follows each <strong> tag
    print(strong_tag.text, strong_tag.next_sibling.strip())

This gives you:

YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
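If you want the values as data rather than printed lines, the same loop can fill a dictionary. This is only a sketch: the helper name extract_fields is made up for illustration, and the isinstance check is an added safeguard for labels that are followed by another tag instead of bare text.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
</p>
'''

def extract_fields(markup):
    """Map each <strong> label to the bare text that follows it (hypothetical helper)."""
    soup = BeautifulSoup(markup, 'html.parser')
    fields = {}
    for strong_tag in soup.find_all('strong'):
        sibling = strong_tag.next_sibling
        # Only keep labels that are followed by a text node, not by a tag or nothing
        if isinstance(sibling, NavigableString):
            label = strong_tag.get_text(strip=True).rstrip(':')
            fields[label] = sibling.strip()
    return fields

fields = extract_fields(html)
print(fields)
```

A dictionary makes it easy to look up a single value afterwards, for example fields['RACE'].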

How to extract HTML text which has no tags using BeautifulSoup?

Here are several locator strategies. The text you want is the last piece of content inside a div. You can either use stripped_strings on the div and take the last item, or target the child p of the div and use next_sibling to step across to the text node that follows it.

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.pricesmart.com/site/sv/es/pagina-producto/46369')
soup = bs(r.content, 'lxml')
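# 1. Text node right after the first <p> inside #collapseOne
print(soup.select_one('#collapseOne p').next_sibling.strip().replace(u'\xa0l',''))
# 2. Last stripped string inside the card body
print([i for i in soup.select_one('#collapseOne .card-body').stripped_strings][-1].replace(u'\xa0l',''))
# 3. Same, but the card body is located by its label text
#    (:-soup-contains replaces the deprecated :contains; needs soupsieve 2.1 or later)
print([i for i in soup.select_one('.card-body:has(strong:-soup-contains("Número de item:"))').stripped_strings][-1].replace(u'\xa0l',''))
# 4. Text node after the <p> that wraps #itemNumber2
print(soup.select_one('p:has(#itemNumber2)').next_sibling.strip().replace(u'\xa0l',''))
# 5. Second-to-last line of the card body's text
print(soup.select_one('#collapseOne .card-body').text.split('\n')[-2].strip().replace(u'\xa0l',''))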
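Since the live page may change or be unavailable, here is an offline sketch of the first two strategies. The markup is hypothetical: it only imitates the structure implied by the selectors above (a p label inside a .card-body div, followed by bare text) and is not copied from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the live page
html = '''
<div id="collapseOne">
  <div class="card-body">
    <p><strong id="itemNumber2">Número de item:</strong></p>
    46369
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# Strategy 1: step from the <p> to the text node that follows it
via_sibling = soup.select_one('#collapseOne p').next_sibling.strip()

# Strategy 2: take the last non-empty string inside the container
via_strings = list(soup.select_one('#collapseOne .card-body').stripped_strings)[-1]

print(via_sibling, via_strings)
```

Both variables end up holding the same string here; which approach is more robust depends on whether the label or the position of the text is more stable on the real page.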

You could also run a regex over the response text, since the value is also present inside a script tag, though that would not be my choice. Pick the simplest method, e.g. the first one shown above.

import requests, re

r = requests.get('https://www.pricesmart.com/site/sv/es/pagina-producto/46369')
print(re.search(r',\{\\"value\\":\\"(.*?)\\"', r.text).group(1).replace('\xa0l',''))

Extract text without tags with BeautifulSoup

You can use re together with next_sibling. Try this:

from bs4 import BeautifulSoup
import re
html_doc = """<div class="col-md-10"><span class="title"><a href="/article/00731b64b7ae44bb96e5cd51edaa113d">Medical Device-Related Pressure Injury in health care professionals in times of pandemic</a></span><br><em>Aline Oliveira Ramalho, Paula de Souza Silva Freitas, Paula Cristina Nogueira</em><br><a href="/toc/2595-7007">Estima</a>. 2020;18(1) DOI <a href="https://doi.org/10.30886/estima.v18.867_IN">10.30886/estima.v18.867_IN</a><br><a class="doaj-public-search-abstractaction doaj-public-search-abstractaction-results" href="#" rel="00731b64b7ae44bb96e5cd51edaa113d"><strong>Abstract</strong></a> | <a href="https://www.revistaestima.com.br/index.php/estima/article/view/867/pdf">Full Text</a><div class="doaj-public-search-abstracttext doaj-public-search-abstracttext-results" rel="00731b64b7ae44bb96e5cd51edaa113d" style="display:none">Facing the number of cases of coronavirus infection (COVID-19).</div></div>"""

soup = BeautifulSoup(html_doc, 'lxml')

div = soup.find("div")

result = div.find("a", attrs={"href": re.compile("^/toc/2595-7007.*")}).next_sibling

# Drop the leading period and the surrounding whitespace
result = result.replace('.', '').strip()

print(result)

Output will be:

2020;18(1) DOI

How can I get text without specific tags in BeautifulSoup?

Unfortunately, my hanzi is not what it should be, but this is what I get:

targets = soup.select('span.head')
heads = []
entries = []
for target in targets:
    entry = []
    heads.append(target.text)
    entry.append(target.next_sibling)
    if target.next_sibling.next_sibling.has_attr('style'):
        entry.append(target.next_sibling.next_sibling.text)
    entries.append(''.join(entry).strip().replace('\n\t',''))
print(heads)
print(entries)

Output:

['東', '菄', '鶇']
['春方也〾說文曰動...爲人', '東風菜義見上注俗加艹', '鶇鵍鳥名美形出廣雅亦作?']

Is that correct?

Extracting text without tags of HTML with Beautifulsoup Python

Set HTML_TEXT to the HTML you got from scraping the URL.

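from bs4 import BeautifulSoup

y = BeautifulSoup(HTML_TEXT, 'html.parser')

# Only the text nodes that are direct children of <body>, not text nested in other tags
c = y.find('body').find_all(string=True, recursive=False)

for i in c:
    print(i)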
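To see what recursive=False changes, here is a small self-contained sketch. The HTML_TEXT value is a made-up sample, not the page from the question.

```python
from bs4 import BeautifulSoup

# Made-up sample: two loose text nodes in <body> and one nested inside <p>
HTML_TEXT = '<html><body>loose text<p>inside a paragraph</p>more loose text</body></html>'

soup = BeautifulSoup(HTML_TEXT, 'html.parser')

# recursive=False restricts the search to direct children of <body>
direct_text = [s.strip() for s in soup.body.find_all(string=True, recursive=False) if s.strip()]

# The default (recursive=True) also returns text nested inside child tags
all_text = [s.strip() for s in soup.body.find_all(string=True) if s.strip()]

print(direct_text)
print(all_text)
```

The first list leaves out the paragraph text, which is the "text without tags" the question asks for.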

BeautifulSoup - how to extract text without opening tag and before br tag?

Locate the h4 element and use find_next_siblings():

h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
    # string=True keeps only the text nodes among the following siblings
    for text in h4.find_next_siblings(string=True):
        print(text.strip())
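A runnable sketch of the same idea follows. The markup is an assumption, since the question's HTML is not reproduced here; only the h4 class name comes from the snippet above. Whitespace-only text nodes are filtered out so blank lines are not collected.

```python
from bs4 import BeautifulSoup

# Assumed markup: an <h4> followed by bare text lines separated by <br>
html = '''
<div class="actorbox">
<h4 class="actorboxLink">Some Name</h4>
Actor<br>
Born 1970<br>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

lines = []
for h4 in soup.find_all('h4', class_='actorboxLink'):
    # Collect the text nodes that follow the heading, skipping whitespace-only ones
    lines.extend(t.strip() for t in h4.find_next_siblings(string=True) if t.strip())

print(lines)
```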

Extract text only except the content of script tag from html with BeautifulSoup

Use .find(text=True), which returns the first text node inside the tag.

Example:

from bs4 import BeautifulSoup

html = """<span class="age">
Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>"""

soup = BeautifulSoup(html, "html.parser")
print(soup.find("span", {"class": "age"}).find(text=True).strip())

Output:

Ages 15
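If the wanted text is not guaranteed to be the first text node, another option is to remove the script elements before reading the text. This is a different technique from the .find(text=True) approach above, shown here only as a sketch on the same sample markup.

```python
from bs4 import BeautifulSoup

html = """<span class="age">
Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>"""

soup = BeautifulSoup(html, 'html.parser')

# Remove every <script> element, including its contents, from the tree
for script in soup.find_all('script'):
    script.decompose()

# strip=True trims each text piece and drops the empty ones
text = soup.find('span', class_='age').get_text(strip=True)
print(text)
```

Note that decompose() modifies the parsed tree in place, so parse the document again if you still need the script contents later.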

Getting text without tags using BeautifulSoup?

You can use the previous or next tag as an anchor to find the text. For example, find the <strike> element first, and then get the text node next to it:

from bs4 import BeautifulSoup

html = """<strike style="color: #777777">975</strike> 487 RP<div class="gs-container default-2-col">"""
soup = BeautifulSoup(html, 'html.parser')

# find the <strike> element first, then get the text node next to it
result = soup.find('strike', {'style': 'color: #777777'}).find_next_sibling(string=True)

print(repr(str(result)))
# output: ' 487 RP'
# you can then do simple text manipulation/regex to clean up the result

Note that the code above is only a demo; it does not accomplish your entire task.
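The cleanup step mentioned in the code comments could look like the sketch below. The regex assumes the text always has the form of a number followed by "RP", which is inferred from the single sample string and may not hold for the whole page; the closing div tag is added only to keep the sample well formed.

```python
import re
from bs4 import BeautifulSoup

html = """<strike style="color: #777777">975</strike> 487 RP<div class="gs-container default-2-col"></div>"""
soup = BeautifulSoup(html, 'html.parser')

# Text node that follows the <strike> element
raw = soup.find('strike').find_next_sibling(string=True)

# Assumed format: digits, optional whitespace, then the literal "RP"
match = re.search(r'(\d+)\s*RP', raw)
price = int(match.group(1)) if match else None

print(price)
```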


