Using BeautifulSoup to extract text without tags
Just loop through all the <strong> tags and use next_sibling to get what you want, like this:
for strong_tag in soup.find_all('strong'):
    print(strong_tag.text, strong_tag.next_sibling)
Demo:
from bs4 import BeautifulSoup
html = '''
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
<strong class="offender">HEIGHT:</strong> 5'05''<br />
<strong class="offender">WEIGHT:</strong> 118<br />
<strong class="offender">EYE COLOR:</strong> GREEN<br />
<strong class="offender">HAIR COLOR:</strong> BROWN<br />
</p>
'''
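# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
for strong_tag in soup.find_all('strong'):
    # next_sibling is the bare text node after </strong>; strip its leading space
    print(strong_tag.text, strong_tag.next_sibling.strip())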
This gives you:
YOB: 1987
RACE: WHITE
GENDER: FEMALE
HEIGHT: 5'05''
WEIGHT: 118
EYE COLOR: GREEN
HAIR COLOR: BROWN
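If you would rather collect the pairs than print them, here is a minimal sketch of the same next_sibling idea that builds a dict. The helper name labelled_values is my own, not part of BeautifulSoup, and the isinstance check is an added safeguard for a <strong> that is followed directly by another tag instead of text.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
<p>
<strong class="offender">YOB:</strong> 1987<br />
<strong class="offender">RACE:</strong> WHITE<br />
<strong class="offender">GENDER:</strong> FEMALE<br />
</p>
'''


def labelled_values(markup):
    """Map each <strong> label to the bare text node that follows it.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(markup, "html.parser")
    record = {}
    for strong_tag in soup.find_all("strong"):
        sibling = strong_tag.next_sibling
        # Skip labels whose next sibling is a tag or is missing entirely.
        if isinstance(sibling, NavigableString):
            label = strong_tag.get_text(strip=True).rstrip(":")
            record[label] = sibling.strip()
    return record


record = labelled_values(html)
print(record)
```

The keys drop the trailing colon so they are easier to look up later, e.g. record["RACE"].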
How to extract HTML text which has no tags using Beautifulsoup?
Here are several locator strategies. The text you want is the last text node inside a div. You can use stripped_strings on the div, or target the div's child p and use next_sibling to step across to the text that follows it:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.pricesmart.com/site/sv/es/pagina-producto/46369')
soup = bs(r.content, 'lxml')
print(soup.select_one('#collapseOne p').next_sibling.strip().replace(u'\xa0l',''))
print([i for i in soup.select_one('#collapseOne .card-body').stripped_strings][-1].replace(u'\xa0l',''))
print([i for i in soup.select_one('.card-body:has(strong:contains("Número de item:"))').stripped_strings][-1].replace(u'\xa0l',''))
print(soup.select_one('p:has(#itemNumber2)').next_sibling.strip().replace(u'\xa0l',''))
print(soup.select_one('#collapseOne .card-body').text.split('\n')[-2].strip().replace(u'\xa0l',''))
You could also run a regex over the response text, since the value is also present within a script tag, though that would not be my choice. Pick the simplest method, e.g. the first one shown above.
import requests, re
r = requests.get('https://www.pricesmart.com/site/sv/es/pagina-producto/46369')
# Raw string keeps the same pattern without invalid-escape warnings; group(1) is the captured value
print(re.search(r',\{\\"value\\":\\"(.*?)\\"', r.text).group(1).replace(u'\xa0l',''))
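The snippets above depend on the live page, so here is a small offline sketch of the first two strategies. The markup is a made-up stand-in that only mimics the shape described (a p inside the div, followed by untagged text); it is an assumption, not the real page source.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a <p> label followed by bare text inside the same div.
html = """
<div id="collapseOne">
  <div class="card-body">
    <p><strong id="itemNumber2">Número de item:</strong></p>
    46369
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Strategy 1: step from the <p> to the text node that follows it.
via_sibling = soup.select_one("#collapseOne p").next_sibling.strip()

# Strategy 2: take the last non-empty string anywhere inside the div.
via_strings = list(soup.select_one("#collapseOne .card-body").stripped_strings)[-1]

print(via_sibling, via_strings)
```

Both approaches land on the same text here. next_sibling is more precise, while stripped_strings is more forgiving if extra whitespace nodes or tags appear between the p and the text.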
extract text without tags with beautifulsoup
You can use re and next_sibling. You can try it:
from bs4 import BeautifulSoup
import re
html_doc = """<div class="col-md-10"><span class="title"><a href="/article/00731b64b7ae44bb96e5cd51edaa113d">Medical Device-Related Pressure Injury in health care professionals in times of pandemic</a></span><br><em>Aline Oliveira Ramalho, Paula de Souza Silva Freitas, Paula Cristina Nogueira</em><br><a href="/toc/2595-7007">Estima</a>. 2020;18(1) DOI <a href="https://doi.org/10.30886/estima.v18.867_IN">10.30886/estima.v18.867_IN</a><br><a class="doaj-public-search-abstractaction doaj-public-search-abstractaction-results" href="#" rel="00731b64b7ae44bb96e5cd51edaa113d"><strong>Abstract</strong></a> | <a href="https://www.revistaestima.com.br/index.php/estima/article/view/867/pdf">Full Text</a><div class="doaj-public-search-abstracttext doaj-public-search-abstracttext-results" rel="00731b64b7ae44bb96e5cd51edaa113d" style="display:none">Facing the number of cases of coronavirus infection (COVID-19).</div></div>"""
soup = BeautifulSoup(html_doc, 'lxml')
div = soup.find("div")
result = div.find("a", attrs={"href": re.compile("^/toc/2595-7007.*")}).next_sibling
result = result.replace('.',"")
print(result)
Output will be:
2020;18(1) DOI
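If you need the individual pieces rather than the raw string, one option is to run a second regex over the text node. This is only a sketch: the trimmed markup and the year;volume(issue) pattern are my assumptions based on the single example above, so other entries may not match.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down version of the markup above (assumed to keep the same shape).
html_doc = (
    '<div><a href="/toc/2595-7007">Estima</a>. 2020;18(1) DOI '
    '<a href="https://doi.org/10.30886/estima.v18.867_IN">10.30886/estima.v18.867_IN</a></div>'
)

soup = BeautifulSoup(html_doc, "html.parser")
anchor = soup.find("a", href=re.compile(r"^/toc/"))

# Guard against a missing anchor before touching next_sibling.
raw = anchor.next_sibling if anchor else ""

# Assumed pattern: four-digit year, then volume, then issue in parentheses.
match = re.search(r"(\d{4});(\d+)\((\d+)\)", raw)
citation = match.groups() if match else None
print(citation)
```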
How can I get text without specific tags in BeautifulSoup?
Unfortunately, my hanzi is not what it should be, but this is what I get:
targets = soup.select('span.head')
heads = []
entries = []
for target in targets:
    entry = []
    heads.append(target.text)
    entry.append(target.next_sibling)
    if target.next_sibling.next_sibling.has_attr('style'):
        entry.append(target.next_sibling.next_sibling.text)
    entries.append(''.join(entry).strip().replace('\n\t',''))
print(heads)
print(entries)
Output:
['東', '菄', '鶇']
['春方也〾說文曰動...爲人', '東風菜義見上注俗加艹', '鶇鵍鳥名美形出廣雅亦作?']
Is that correct?
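One caveat: has_attr raises an AttributeError if the second sibling is missing or is itself a text node. Below is a more defensive variation that walks the siblings until the next span.head. It is a sketch on hypothetical markup (I do not have the original page), and it collects everything up to the next head rather than just one text node and one styled span.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup standing in for the original page.
html = (
    '<div><span class="head">A</span> first <span style="color:red">note</span>'
    '<span class="head">B</span> second</div>'
)

soup = BeautifulSoup(html, "html.parser")
heads, entries = [], []


def is_head(node):
    """True for a <span class="head"> tag."""
    return isinstance(node, Tag) and "head" in node.get("class", [])


for target in soup.select("span.head"):
    heads.append(target.text)
    parts = []
    node = target.next_sibling
    # Collect text nodes and tag text until the next head or the end of the parent.
    while node is not None and not is_head(node):
        parts.append(node.get_text() if isinstance(node, Tag) else str(node))
        node = node.next_sibling
    entries.append("".join(parts).strip())

print(heads)
print(entries)
```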
Extracting text without tags of HTML with Beautifulsoup Python
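Set HTML_TEXT to the HTML you got from scraping the URL.
from bs4 import BeautifulSoup
y = BeautifulSoup(HTML_TEXT, 'html.parser')
# recursive=False keeps only the text nodes that are direct children of <body>
c = y.find('body').find_all(text=True, recursive=False)
for i in c:
    print(i)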
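To see what recursive=False does, here is a self-contained sketch with a made-up HTML_TEXT value. It uses string=True, the newer spelling of text=True, and drops whitespace-only nodes.

```python
from bs4 import BeautifulSoup

# Made-up input: two loose text nodes directly under <body>, one nested in <p>.
HTML_TEXT = "<html><body>loose text<p>inside a paragraph</p>more loose text</body></html>"

soup = BeautifulSoup(HTML_TEXT, "html.parser")

# Only direct children of <body> are considered, so the <p> text is skipped.
direct = [
    s.strip()
    for s in soup.find("body").find_all(string=True, recursive=False)
    if s.strip()
]
print(direct)
```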
BeautifulSoup - how to extract text without opening tag and before br tag?
Locate the h4 element and use find_next_siblings():
h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
    for text in h4.find_next_siblings(text=True):
        print(text.strip())
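Here is a runnable sketch of the same call. The markup is hypothetical (an h4 followed by br-separated text), since the question's HTML is not shown here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: untagged text separated by <br/> after the heading.
html = '<div><h4 class="actorboxLink">Name</h4>Role one<br/>Role two<br/></div>'

soup = BeautifulSoup(html, "html.parser")
results = {}
for h4 in soup.find_all("h4", class_="actorboxLink"):
    # string=True matches only text nodes, so the <br/> tags are skipped.
    texts = [t.strip() for t in h4.find_next_siblings(string=True) if t.strip()]
    results[h4.get_text(strip=True)] = texts

print(results)
```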
Extract text only except the content of script tag from html with BeautifulSoup
Use .find(text=True). Example:
from bs4 import BeautifulSoup
html = """<span class="age">
Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.find("span", {"class": "age"}).find(text=True).strip())
Output:
Ages 15
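Note that .find(text=True) returns only the first text node. If the text you want is spread over several nodes, an alternative (a different technique from the one above) is to remove the script tags first and then call get_text(). A minimal sketch:

```python
from bs4 import BeautifulSoup

html = """<span class="age">
 Ages 15
<span class="loc" id="loc_loads1">
</span>
<script>
getCurrentLocationVal("loc_loads1",29.45218856,59.38139268,1);
</script>
</span>"""

soup = BeautifulSoup(html, "html.parser")

# Remove every <script> element (and its contents) from the tree.
for script in soup.find_all("script"):
    script.decompose()

# Join the remaining non-empty strings with a single space.
text = soup.get_text(" ", strip=True)
print(text)
```

decompose() modifies the tree in place, so parse the HTML again if you still need the scripts later.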
Getting text without tags using BeautifulSoup?
You can use the previous or next tag as an anchor to find the text. For example, find the <strike> element first, and then get the text node next to it:
from bs4 import BeautifulSoup
html = """<strike style="color: #777777">975</strike> 487 RP<div class="gs-container default-2-col">"""
soup = BeautifulSoup(html, 'html.parser')
# find the <strike> element first, then get the text node next to it
result = soup.find('strike', {'style': 'color: #777777'}).find_next_sibling(text=True)
print(result)
# output:  487 RP (note the leading space)
# you can then do simple text manipulation/regex to clean up the result
Note that the code above is only a demo; it is not meant to accomplish your entire task.
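As a sketch of the clean-up step, you could pull the number out of the text node with a regex. Treating the first run of digits as the value is my assumption about what you want from ' 487 RP'.

```python
import re
from bs4 import BeautifulSoup

html = '<strike style="color: #777777">975</strike> 487 RP<div class="gs-container default-2-col"></div>'

soup = BeautifulSoup(html, "html.parser")
strike = soup.find("strike")

# Text node right after </strike>; None if the tag or the text is missing.
text_node = strike.find_next_sibling(string=True) if strike else None

# Assumption: the first run of digits in the text is the value of interest.
match = re.search(r"\d+", text_node or "")
price = int(match.group()) if match else None
print(price)
```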