How to Find a Specific Data Attribute from an HTML Tag in BeautifulSoup4

How to find a specific data attribute from an HTML tag in BeautifulSoup4?

You can call find_all() to get every tag, then keep only the tags whose attributes include "data-bin". Once you have a matching tag, index it by attribute name to read the value, like this:

from bs4 import BeautifulSoup
html_doc = """<ul data-bin="Sdafdo39">"""
bs = BeautifulSoup(html_doc, "html.parser")
print([item["data-bin"] for item in bs.find_all() if "data-bin" in item.attrs])
# ['Sdafdo39']
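
As an alternative to filtering in a list comprehension, the presence test can be pushed into the search itself. The sketch below assumes a slightly larger sample document (the extra <li> tags are made up for illustration) and shows two equivalent filters: attrs={"data-bin": True} and the CSS attribute selector [data-bin].

```python
from bs4 import BeautifulSoup

# Sample markup; the <li> elements are illustrative additions
html_doc = '<ul data-bin="Sdafdo39"><li data-bin="x1">a</li><li>b</li></ul>'
soup = BeautifulSoup(html_doc, "html.parser")

# True as the filter value matches any tag that has the attribute, whatever its value
by_attrs = [tag["data-bin"] for tag in soup.find_all(attrs={"data-bin": True})]

# The CSS attribute selector [data-bin] expresses the same filter
by_css = [tag["data-bin"] for tag in soup.select("[data-bin]")]

print(by_attrs)
print(by_css)
```

Both lists contain the values in document order, and the <li> without the attribute is skipped.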

Extracting data from HTML5 data-* attributes with BeautifulSoup4

Why look for the surrounding <span> when you can directly access the ones you want? You can also filter with keyword arguments (though I understand why you wouldn't want to try that with the class attribute, given that class is a Python keyword; BeautifulSoup4 accepts class_ instead).

The get_text() method extracts the text between a matching pair of tags, so you end up with quite a simple program:

# coding=utf-8
data = u"""\
<span class="itm-price mrs ">
<span data-currency-iso="BDT">৳</span>
<span dir="ltr" data-price="24000">24,000.00</span>
</span>
"""

import bs4
soup = bs4.BeautifulSoup(data, "html.parser")
for price in soup.find_all('span', dir="ltr"):
    print(price.get_text())
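
If the raw values stored in the data-* attributes are wanted rather than the displayed text, a sketch along these lines may help. It reuses the markup above; because data-* names contain a hyphen they cannot be written as keyword arguments, so they go through the attrs dictionary. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

data = """\
<span class="itm-price mrs ">
<span data-currency-iso="BDT">৳</span>
<span dir="ltr" data-price="24000">24,000.00</span>
</span>
"""
soup = BeautifulSoup(data, "html.parser")

# class_ (trailing underscore) avoids the clash with Python's class keyword
outer = soup.find("span", class_="itm-price")

# Hyphenated attribute names are passed through attrs; True means "attribute is present"
price_tag = outer.find("span", attrs={"data-price": True})
currency_tag = outer.find("span", attrs={"data-currency-iso": True})

raw_price = price_tag["data-price"]
currency = currency_tag["data-currency-iso"]
print(currency, raw_price)
```

The attribute value is the unformatted number, which is usually easier to convert than the displayed "24,000.00".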

Extracting an attribute value with beautifulsoup

.find_all() returns a list of all matching elements, so:

input_tag = soup.find_all(attrs={"name" : "stainfo"})

input_tag is a list (probably containing only one element). Depending on what exactly you want, you should either do:

output = input_tag[0]['value']

or use the .find() method, which returns only the first matching element:

input_tag = soup.find(attrs={"name": "stainfo"})
output = input_tag['value']
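
Note that .find() returns None when nothing matches, and indexing a tag raises KeyError when the attribute is absent. A defensive variant could look like the sketch below; the helper name attr_or_default and the sample form are hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Illustrative form: one input with a value, one without
html = '<form><input name="stainfo" value="abc123"><input name="other"></form>'
soup = BeautifulSoup(html, "html.parser")

def attr_or_default(soup, name, attr, default=None):
    """Return attr of the first tag whose name attribute equals name, else default."""
    tag = soup.find(attrs={"name": name})
    if tag is None:
        # No tag matched at all
        return default
    # Tag.get() returns default instead of raising KeyError when attr is missing
    return tag.get(attr, default)

found = attr_or_default(soup, "stainfo", "value")
missing_tag = attr_or_default(soup, "nope", "value", default="")
missing_attr = attr_or_default(soup, "other", "value", default="")
print(found, repr(missing_tag), repr(missing_attr))
```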

Problem: how to get a list of tag attribute values with beautifulsoup

x.find_all('time')

will return a list, so you have to pick an item from the list before you can read its "datetime" attribute.

x.find_all('time')[0]['datetime']

will probably do it.
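
To collect the attribute from every matching tag rather than only the first, a list comprehension over find_all() is one option. The sketch below uses made-up sample markup and passes datetime=True so that <time> tags lacking the attribute are skipped.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the second <time> deliberately has no datetime attribute
html = """
<time datetime="2020-01-01">New Year</time>
<time>no attribute</time>
<time datetime="2020-12-25">Christmas</time>
"""
x = BeautifulSoup(html, "html.parser")

# datetime=True keeps only <time> tags that carry the attribute
datetimes = [t["datetime"] for t in x.find_all("time", datetime=True)]
print(datetimes)
```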

Python Beautifulsoup Getting Attribute Value

You can read an attribute by treating the tag like a dictionary and indexing it by attribute name.

Example:

from bs4 import BeautifulSoup
s = """<span class="invisible" data-datenews="2018-05-25 06:02:19" data-idnews="2736625" id="horaCompleta"></span>"""
soup = BeautifulSoup(s, "html.parser")
print(soup.span["data-datenews"])

Output:

2018-05-25 06:02:19
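
When several data-* attributes are needed at once, the tag's attrs dictionary can be filtered by prefix. This is a small sketch on the same <span>; the attribute name data-missing is a deliberately non-existent example used to show .get() with a default.

```python
from bs4 import BeautifulSoup

s = ('<span class="invisible" data-datenews="2018-05-25 06:02:19" '
     'data-idnews="2736625" id="horaCompleta"></span>')
soup = BeautifulSoup(s, "html.parser")
span = soup.span

# tag.attrs is a dictionary of all attributes; keep only the data-* ones
data_attrs = {k: v for k, v in span.attrs.items() if k.startswith("data-")}
print(data_attrs)

# .get() returns the default when the attribute does not exist
fallback = span.get("data-missing", "n/a")
print(fallback)
```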

How to find tags with only certain attributes - BeautifulSoup

As explained in the BeautifulSoup documentation, you can use this:

soup = BeautifulSoup(html, "html.parser")
results = soup.find_all("td", {"valign": "top"})

EDIT:

To return only the tags whose sole attribute is valign="top", check the length of each tag's attrs dictionary:

from bs4 import BeautifulSoup

html = '<td valign="top">.....</td>\
<td width="580" valign="top">.......</td>\
<td>.....</td>'

soup = BeautifulSoup(html, "html.parser")
results = soup.find_all("td", {"valign": "top"})

for result in results:
    # Keep only tags that carry no attribute other than valign
    if len(result.attrs) == 1:
        print(result)

That returns:

<td valign="top">.....</td>
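
The same check can also be folded into the search by passing a function to find_all(), which BeautifulSoup4 calls once per tag. This is a sketch of that variant; the function name only_valign_top is made up for the example.

```python
from bs4 import BeautifulSoup

html = ('<td valign="top">.....</td>'
        '<td width="580" valign="top">.......</td>'
        '<td>.....</td>')
soup = BeautifulSoup(html, "html.parser")

def only_valign_top(tag):
    # Match <td> tags whose complete attribute set is exactly valign="top"
    return tag.name == "td" and tag.attrs == {"valign": "top"}

results = soup.find_all(only_valign_top)
for result in results:
    print(result)
```

Comparing the whole attrs dictionary makes the intent explicit: no extra attributes are tolerated.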

Using BeautifulSoup to find a attribute called data-stats

Use the tag's get() method to read an attribute by name (see: BeautifulSoup Get Attribute).

Here is a snippet to get all the data you want from the table:

from bs4 import BeautifulSoup
import requests

url = "https://www.pro-football-reference.com/players/B/BradTo00.htm"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Get table
table = soup.find(class_="table_outer_container")

# Get head
thead = table.find('thead')
th_head = thead.find_all('th')

for thh in th_head:
    # Get cell text
    print(thh.get_text())

    # Get data-stat value
    print(thh.get('data-stat'))

# Get body
tbody = table.find('tbody')
tr_body = tbody.find_all('tr')

for trb in tr_body:
    # Get id
    print(trb.get('id'))

    # Get th data
    th = trb.find('th')
    print(th.get_text())
    print(th.get('data-stat'))

    for td in trb.find_all('td'):
        # Get cell text
        print(td.get_text())
        # Get data-stat value
        print(td.get('data-stat'))

# Get footer
tfoot = table.find('tfoot')
thf = tfoot.find('th')

# Get cell text
print(thf.get_text())
# Get data-stat value
print(thf.get('data-stat'))

for tdf in tfoot.find_all('td'):
    # Get cell text
    print(tdf.get_text())
    # Get data-stat value
    print(tdf.get('data-stat'))

You can, of course, save the data to a CSV or JSON file instead of printing it.
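
As a rough sketch of the CSV route, the data-stat values can serve as column names. The example below runs on a tiny inline table instead of the live page; the data-stat names (year_id, g), the row ids, and the cell values are all made-up placeholders, and the output goes to an in-memory buffer rather than a file.

```python
import csv
import io
from bs4 import BeautifulSoup

# Made-up sample table standing in for the real page
html = """
<table>
<thead><tr><th data-stat="year_id">Year</th><th data-stat="g">G</th></tr></thead>
<tbody>
<tr id="row.2019"><th data-stat="year_id">2019</th><td data-stat="g">16</td></tr>
<tr id="row.2020"><th data-stat="year_id">2020</th><td data-stat="g">16</td></tr>
</tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Column names come from the data-stat attribute of each header cell
fieldnames = [th.get("data-stat") for th in soup.find("thead").find_all("th")]

# One dictionary per body row, keyed by each cell's data-stat value
rows = []
for tr in soup.find("tbody").find_all("tr"):
    cells = tr.find_all(["th", "td"])
    rows.append({c.get("data-stat"): c.get_text(strip=True) for c in cells})

# Write to an in-memory buffer; swap in open("stats.csv", "w", newline="") for a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

For JSON, the same rows list could be passed to json.dumps() instead.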

Parsing data with beautiful soup, targeting data- attribute

The data-row attribute is added dynamically by JavaScript, so it is not present in the HTML that requests downloads and the rows need to be targeted differently. For example, get all rows under the table with id="stats":

import requests
from bs4 import BeautifulSoup

url = 'https://www.pro-football-reference.com/players/H/HopkDe00.htm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for row in soup.select('table#stats tbody tr'):
    tds = [td.get_text(strip=True) for td in row.select('td, th')]
    print(*tds)

Prints:

2020-09-13 1 ARI @ SFO W 24-20 * 16 14 151 10.79 0 87.5% 9.44 0 0 0 0 0 0 0 77 94% 0 0% 0 0%
2020-09-20 2 ARI WAS W 30-15 * 9 8 68 8.50 1 88.9% 7.56 1 0 0 0 0 0 0 75 97% 0 0% 0 0%
2020-09-27 3 ARI DET L 23-26 * 12 10 137 13.70 0 83.3% 11.42 0 0 0 0 0 0 0 61 94% 0 0% 0 0%
2020-10-04 4 ARI @ CAR L 21-31 * 9 7 41 5.86 0 77.8% 4.56 0 0 0 0 0 0 0 54 95% 0 0% 0 0%

...and so on.

