How to find a specific data attribute from an HTML tag in BeautifulSoup4?
You can use the find_all method to get all the tags, then keep only those that have "data-bin" in their attributes. From each of those you can simply read the value of the attribute, like this:
from bs4 import BeautifulSoup
html_doc = """<ul data-bin="Sdafdo39">"""
bs = BeautifulSoup(html_doc, "html.parser")
print([item["data-bin"] for item in bs.find_all() if "data-bin" in item.attrs])
# ['Sdafdo39']
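As a variation on the filter above, Beautiful Soup can do the attribute check itself. The sketch below assumes a slightly larger, made-up document (the `<li>` elements are not part of the original question) and shows two equivalent ways to select only tags that carry `data-bin`.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration: two tags carry data-bin, one does not
html_doc = '<ul data-bin="Sdafdo39"><li data-bin="x1">a</li><li>b</li></ul>'
soup = BeautifulSoup(html_doc, "html.parser")

# attrs={"data-bin": True} matches any tag that has the attribute, whatever its value
via_attrs = [tag["data-bin"] for tag in soup.find_all(attrs={"data-bin": True})]

# The CSS attribute selector [data-bin] selects the same tags
via_css = [tag["data-bin"] for tag in soup.select("[data-bin]")]

print(via_attrs)
print(via_css)
```

Both forms avoid visiting every tag in Python code, which tends to read more clearly on large documents.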
BeautifulSoup4 data extract from HTML5 data-* tag
Why look for the surrounding <span> when you can directly access the ones you want? Also, you can use keyword arguments (though I understand why you wouldn't want to try that with the class attribute, given that class is a Python keyword).
The get_text() method will extract the text from between a matching pair of tags, so you end up with quite a simple program:
# coding=utf-8
data = u"""\
<span class="itm-price mrs ">
<span data-currency-iso="BDT">৳</span>
<span dir="ltr" data-price="24000">24,000.00</span>
</span>
"""
import bs4
soup = bs4.BeautifulSoup(data, "html.parser")
for price in soup.find_all('span', dir="ltr"):
    print(price.get_text())
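The program above reads the displayed text. If the raw values stored in the data-* attributes are wanted instead, something along these lines may work. It is only a sketch: it reuses the markup from the answer, relies on `class_` (the keyword Beautiful Soup accepts in place of `class`) and on the `attrs` dictionary, because names such as `data-price` contain a hyphen and cannot be written as Python keyword arguments.

```python
import bs4

data = """\
<span class="itm-price mrs ">
<span data-currency-iso="BDT">৳</span>
<span dir="ltr" data-price="24000">24,000.00</span>
</span>
"""
soup = bs4.BeautifulSoup(data, "html.parser")

# class is a Python keyword, so Beautiful Soup accepts class_ instead
outer = soup.find("span", class_="itm-price")

# Hyphenated attribute names go through the attrs dictionary
price_tag = outer.find("span", attrs={"data-price": True})
currency_tag = outer.find("span", attrs={"data-currency-iso": True})

raw_price = price_tag["data-price"]
currency = currency_tag["data-currency-iso"]
print(currency, raw_price)
```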
Extracting an attribute value with beautifulsoup
.find_all() returns a list of all found elements, so:
input_tag = soup.find_all(attrs={"name" : "stainfo"})
input_tag is a list (probably containing only one element). Depending on what exactly you want, you should either do:
output = input_tag[0]['value']
or use the .find() method, which returns only the first found element:
input_tag = soup.find(attrs={"name": "stainfo"})
output = input_tag['value']
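One thing to watch with .find() is that it returns None when nothing matches, so indexing the result directly can fail. A defensive version might look like the sketch below. The `<input>` markup and the helper name `attribute_or_default` are assumptions made for illustration, since the original HTML is not shown.

```python
from bs4 import BeautifulSoup

# Assumed markup: the question's actual HTML is not shown
html = '<form><input name="stainfo" value="abc123"></form>'
soup = BeautifulSoup(html, "html.parser")


def attribute_or_default(soup, name, attribute, default=None):
    """Return `attribute` of the first tag whose name attribute equals `name`.

    Hypothetical helper: falls back to `default` when either the tag
    or the attribute is missing.
    """
    tag = soup.find(attrs={"name": name})
    if tag is None:
        return default
    return tag.get(attribute, default)


print(attribute_or_default(soup, "stainfo", "value"))
print(attribute_or_default(soup, "does-not-exist", "value"))
```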
Problem: how to get a list of tag attribute values with beautifulsoup
x.find_all('time') will return a list, so you'll have to get an item from the list before you can get the "datetime" attribute.
x.find_all('time')[0]['datetime'] will probably do it.
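Since the question asks for a list of attribute values rather than a single one, a list comprehension over the result of find_all may be closer to what is needed. The sketch below uses made-up `<time>` elements and passes `datetime=True` so that tags without the attribute are skipped instead of raising a KeyError.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the middle tag has no datetime attribute
html = """
<time datetime="2020-01-01T10:00">New Year</time>
<time>no attribute</time>
<time datetime="2020-02-14T18:30">Evening</time>
"""
x = BeautifulSoup(html, "html.parser")

# datetime=True keeps only <time> tags that actually carry the attribute
datetimes = [t["datetime"] for t in x.find_all("time", datetime=True)]
print(datetimes)
```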
Python Beautifulsoup Getting Attribute Value
You can access a tag's attributes the way you access keys in a dictionary.
Example:
from bs4 import BeautifulSoup
s = """<span class="invisible" data-datenews="2018-05-25 06:02:19" data-idnews="2736625" id="horaCompleta"></span>"""
soup = BeautifulSoup(s, "html.parser")
print(soup.span["data-datenews"])
Output:
2018-05-25 06:02:19
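Because a tag's attributes are exposed as a dictionary through `.attrs`, all data-* values can be collected in one pass. The following is a small sketch using the same `<span>` as above; the attribute name `data-author` is hypothetical and only there to show what `.get()` returns for a missing key.

```python
from bs4 import BeautifulSoup

s = """<span class="invisible" data-datenews="2018-05-25 06:02:19" data-idnews="2736625" id="horaCompleta"></span>"""
soup = BeautifulSoup(s, "html.parser")
span = soup.span

# Keep only the attributes whose name starts with "data-"
data_attrs = {key: value for key, value in span.attrs.items() if key.startswith("data-")}
print(data_attrs)

# span["data-author"] would raise KeyError; .get() returns None instead
missing = span.get("data-author")
print(missing)
```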
How to find tags with only certain attributes - BeautifulSoup
As explained in the BeautifulSoup documentation, you can use this:
soup = BeautifulSoup(html, "html.parser")
results = soup.find_all("td", {"valign": "top"})
EDIT:
To return only the tags whose sole attribute is valign="top", you can check the length of the tag's attrs property:
from bs4 import BeautifulSoup
html = '<td valign="top">.....</td>\
<td width="580" valign="top">.......</td>\
<td>.....</td>'
soup = BeautifulSoup(html, "html.parser")
results = soup.find_all("td", {"valign": "top"})
for result in results:
    # Keep only tags that carry exactly one attribute
    if len(result.attrs) == 1:
        print(result)
That returns:
<td valign="top">.....</td>
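The same check can also be pushed into the search itself by passing a function to find_all, which Beautiful Soup calls once per tag. This is a sketch of that alternative rather than the approach from the answer above, and the function name `only_valign_top` is made up.

```python
from bs4 import BeautifulSoup

html = ('<td valign="top">.....</td>'
        '<td width="580" valign="top">.......</td>'
        '<td>.....</td>')
soup = BeautifulSoup(html, "html.parser")


def only_valign_top(tag):
    """Match <td> tags whose attributes are exactly {valign: top}."""
    return tag.name == "td" and tag.attrs == {"valign": "top"}


results = soup.find_all(only_valign_top)
for result in results:
    print(result)
```

Comparing the whole `attrs` dictionary is a little stricter than checking its length, since it also pins down the attribute's name and value in one expression.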
Using BeautifulSoup to find a attribute called data-stats
You need to use the get method of a BeautifulSoup tag to read an attribute by name.
See: BeautifulSoup Get Attribute
Here is a snippet to get all the data you want from the table:
from bs4 import BeautifulSoup
import requests
url = "https://www.pro-football-reference.com/players/B/BradTo00.htm"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# Get table
table = soup.find(class_="table_outer_container")
# Get head
thead = table.find('thead')
th_head = thead.find_all('th')
for thh in th_head:
    # Get cell text
    print(thh.get_text())
    # Get data-stat value
    print(thh.get('data-stat'))
# Get body
tbody = table.find('tbody')
tr_body = tbody.find_all('tr')
for trb in tr_body:
    # Get id
    print(trb.get('id'))
    # Get th data
    th = trb.find('th')
    print(th.get_text())
    print(th.get('data-stat'))
    for td in trb.find_all('td'):
        # Get cell text
        print(td.get_text())
        # Get data-stat value
        print(td.get('data-stat'))
# Get footer
tfoot = table.find('tfoot')
thf = tfoot.find('th')
# Get cell text
print(thf.get_text())
# Get data-stat value
print(thf.get('data-stat'))
for tdf in tfoot.find_all('td'):
    # Get cell text
    print(tdf.get_text())
    # Get data-stat value
    print(tdf.get('data-stat'))
You can of course save the data to a CSV or JSON file instead of printing it.
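As a rough illustration of the CSV option, the sketch below keys each cell by its data-stat value and writes the rows with the standard csv module. Everything in the sample table (the data-stat names `year` and `games`, and the numbers) is made up so that the example runs without a network request. It writes to an in-memory buffer, which could be swapped for a file opened with `newline=""`.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up table for illustration; the real page has many more columns
html = """
<table>
<thead><tr><th data-stat="year">Year</th><th data-stat="games">G</th></tr></thead>
<tbody>
<tr><th data-stat="year">2019</th><td data-stat="games">16</td></tr>
<tr><th data-stat="year">2020</th><td data-stat="games">15</td></tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Column names come from the data-stat attribute of the header cells
fieldnames = [th.get("data-stat") for th in table.find("thead").find_all("th")]

# One dictionary per body row, keyed by data-stat
rows = []
for tr in table.find("tbody").find_all("tr"):
    cells = tr.find_all(["th", "td"])
    rows.append({cell.get("data-stat"): cell.get_text(strip=True) for cell in cells})

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```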
Parsing data with beautiful soup, targeting data- attribute
The data-row attribute is added dynamically by JavaScript, so the rows need to be targeted differently. For example, to get all rows under the table with id="stats":
import requests
from bs4 import BeautifulSoup
url = 'https://www.pro-football-reference.com/players/H/HopkDe00.htm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for row in soup.select('table#stats tbody tr'):
    tds = [td.get_text(strip=True) for td in row.select('td, th')]
    print(*tds)
Prints:
2020-09-13 1 ARI @ SFO W 24-20 * 16 14 151 10.79 0 87.5% 9.44 0 0 0 0 0 0 0 77 94% 0 0% 0 0%
2020-09-20 2 ARI WAS W 30-15 * 9 8 68 8.50 1 88.9% 7.56 1 0 0 0 0 0 0 75 97% 0 0% 0 0%
2020-09-27 3 ARI DET L 23-26 * 12 10 137 13.70 0 83.3% 11.42 0 0 0 0 0 0 0 61 94% 0 0% 0 0%
2020-10-04 4 ARI @ CAR L 21-31 * 9 7 41 5.86 0 77.8% 4.56 0 0 0 0 0 0 0 54 95% 0 0% 0 0%
...and so on.
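To see why selecting on data-row finds nothing, it can help to try both selectors against static markup. The sketch below uses a cut-down, assumed version of the table (three columns only, with values taken from the output above) that stands in for what the server sends before any JavaScript runs.

```python
from bs4 import BeautifulSoup

# Assumed static markup: no data-row attributes, as in the HTML the server sends
static_html = """
<table id="stats">
<tbody>
<tr><th>2020-09-13</th><td>ARI</td><td>151</td></tr>
<tr><th>2020-09-20</th><td>ARI</td><td>68</td></tr>
</tbody>
</table>
"""
soup = BeautifulSoup(static_html, "html.parser")

# Selecting on the attribute finds nothing, because it is not in the source
by_attribute = soup.select("tr[data-row]")

# Selecting on document structure still works
by_structure = soup.select("table#stats tbody tr")
rows = [[cell.get_text(strip=True) for cell in row.select("td, th")]
        for row in by_structure]

print(len(by_attribute))
print(rows)
```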