BeautifulSoup, But for CSS

parse embedded css beautifulsoup

You can use a CSS parser such as [cssutils][1]. I don't know whether the package itself has a function for this (can someone comment on that?), so I wrote a custom function that looks up the value of a property for a given class.

from bs4 import BeautifulSoup
import cssutils
html='''
<html>
<head>
<style type="text/css">
* {margin:0; padding:0; text-indent:0; }
.s5 {color: #000; font-family:Verdana, sans-serif;
font-style: normal; font-weight: normal;
text-decoration: none; font-size: 17.5pt;
vertical-align: 10pt;}
</style>
</head>

<body>
<p class="s1" style="padding-left: 7pt; text-indent: 0pt; text-align:left;">
This is a sample sentence. <span class="s5"> 1</span>
</p>
</body>
</html>
'''
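def get_property(class_name, property_name):
    # Look for the rule whose selector is exactly ".<class_name>"
    for rule in sheet:
        # Comments and at-rules have no selectorText, so skip them safely
        if getattr(rule, 'selectorText', None) == '.' + class_name:
            for prop in rule.style:
                if prop.name == property_name:
                    return prop.value
    return None  # no matching rule or property

soup = BeautifulSoup(html, 'html.parser')
sheet = cssutils.parseString(soup.find('style').text)
vl = get_property('s5', 'vertical-align')
print(vl)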

Output

10pt

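This is not perfect (it only matches a rule whose selector is exactly the class, such as .s5, and not grouped or compound selectors), but maybe you can improve upon it.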
[1]: https://pypi.org/project/cssutils/

Is there a way to extract CSS from a webpage using BeautifulSoup?

If your goal is really to parse the CSS:

  • There are various methods here: Prev Question w/ Answers
  • I have also used a nice example from this site: Python Code Article

Beautiful Soup will parse the entire page, and that includes the header, styles, scripts, and the tags that link in external CSS and JS. I have used the method from the Python Code Article before and retested it against the link you provided.

import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

# URL of the web page you want to extract
url = "ENTER YOUR LINK HERE"

# initialize a session & set User-Agent as a regular browser
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"

# get the HTML content
html = session.get(url).content

# parse HTML using beautiful soup
soup = bs(html, "html.parser")
print(soup)

Looking at the soup output (it is very long, so I will not paste it here), you can see that it is a complete page. Just make sure to paste in your specific link.

Now, if you want to parse the result to pick up all the CSS URLs, you can add this (I am still using parts of the code from the well-described Python Code Article linked above):

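# get the CSS files
css_files = []
# keep only <link rel="stylesheet"> tags; a bare "link" search also matches icons, feeds, etc.
for css in soup.find_all("link", rel="stylesheet"):
    if css.attrs.get("href"):
        # if the link tag has the 'href' attribute
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)
print(css_files)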

The output css_files will be a list of the URLs of all the CSS files. You can now fetch those separately and see the styles that are being imported.

NOTE: this particular site has a mix of styles inline with the HTML (i.e. they did not always use CSS files to set the style properties; sometimes the styles are inside the HTML content itself).
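For the inline styles mentioned in the note, here is a minimal sketch of one way to collect them with Beautiful Soup alone. The sample markup and the helper name parse_inline_style are assumptions made for illustration; the parsing is deliberately naive and will not cope with values that contain ';' or ':' themselves (for example url(...) values).

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption for illustration, not the site from the question)
html = '''
<p class="s1" style="padding-left: 7pt; text-indent: 0pt; text-align:left;">
This is a sample sentence. <span style="color: #000">1</span>
</p>
'''

def parse_inline_style(style_value):
    """Naively split a style attribute into a {property: value} dict."""
    declarations = {}
    for declaration in style_value.split(";"):
        if ":" not in declaration:
            continue  # skip the empty piece after a trailing ';'
        name, _, value = declaration.partition(":")
        declarations[name.strip().lower()] = value.strip()
    return declarations

soup = BeautifulSoup(html, "html.parser")

# style=True matches every tag that carries a style attribute
inline_styles = [
    (tag.name, parse_inline_style(tag["style"]))
    for tag in soup.find_all(style=True)
]
print(inline_styles)
```

Each entry pairs a tag name with its declarations, so you can compare them against whatever the external stylesheets define.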

This should get you started.

Can you write a css-selector in BeautifulSoup that uses either the class or style to identify the desired info in a div?

You can use the CSS OR syntax to match either of those patterns. The "," is the OR operator, [] is an attribute selector, and . is a class selector.

data = [i.text for i in soup.select("div.text-one, div[style='display-style']")]
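As a small self-contained sketch of that selector list, assuming made-up markup where one div is identified by its class and another only by its style attribute (the value 'display:none' is just a stand-in for whatever style string your page uses):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one div matched by class, one by style, one by neither
html = """
<div class="text-one">by class</div>
<div style="display:none">by style</div>
<div class="other">neither</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "," joins two selectors; a div matching either one is returned
exact = [div.text for div in soup.select("div.text-one, div[style='display:none']")]

# [style*='...'] matches when the attribute merely contains the substring
partial = [div.text for div in soup.select("div.text-one, div[style*='display']")]

print(exact)
print(partial)
```

Note that [style='...'] compares the whole attribute value, so the substring form is usually the safer choice when the style string carries extra declarations or spacing.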

How to specify css selector in beautiful soup and python?

As mentioned, the information you are looking for is loaded via JavaScript. The page uses a slightly different URL to fetch the JSON data containing all of the card details. If you request that URL instead, you can easily list all of the card names without needing BeautifulSoup. For example:

import requests
import json

axis_url = "https://www.axisbank.com/AjaxService/GetCreditCardsProducts"
data = {"strcategory" : "[]", "strrewardtypes" :"[]"}
r = requests.post(axis_url, data=data)

# r.json()[0] is itself a JSON-encoded string, hence the second decode
for entry in json.loads(r.json()[0]):
    print(entry['Name'])

Would give you the following cards:

Axis Bank ACE Credit Card
Axis Bank AURA Credit Card
Privilege Easy Credit Card
Axis Bank Reserve Credit Card
Axis Bank Freecharge Plus Credit Card
IndianOil Axis Bank Credit Card
Axis Bank Magnus Credit Card
Flipkart Axis Bank Credit Card
Axis Bank Freecharge Credit Card
Axis Bank MY Zone Credit Card
Axis Bank Neo Credit Card
Axis Bank Vistara Credit Card
Axis Bank Vistara Signature Credit Card
Axis Bank Vistara Infinite Credit Card
Axis Bank Privilege Credit Card
Miles and More Axis Bank Credit Card
Axis Bank Select Credit Card
Axis Bank Pride Platinum Credit Card
Axis Bank Pride Signature Credit Card
Axis Bank MY Zone Easy Credit Card
Axis Bank Insta Easy Credit Card
Axis Bank Signature Credit Card with Lifestyle Benefits
Platinum Credit Card
Titanium Smart Traveler Credit Card
Axis Bank My Wings Credit Card

Using BeautifulSoup to scrape specific element within a CSS class

Instead of a CSS selector, try selecting using normal BS methods:

print(soup.find('ul',class_='example-ul-class').find_all('li')[2].text.strip())
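For comparison, here is a rough sketch of the same lookup done both ways on made-up markup (the list items are assumptions; only the class name comes from the line above). The CSS version uses :nth-of-type, which the selector engine bundled with recent Beautiful Soup versions supports.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = """
<ul class="example-ul-class">
  <li> first </li>
  <li> second </li>
  <li> third </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Plain Beautiful Soup methods: index 2 is the third <li>
by_find = soup.find('ul', class_='example-ul-class').find_all('li')[2].text.strip()

# CSS selector equivalent: :nth-of-type counts from 1
by_select = soup.select_one('ul.example-ul-class li:nth-of-type(3)').text.strip()

print(by_find, by_select)
```

The two are not identical in general: find_all('li') also counts nested li tags, while li:nth-of-type(3) counts among siblings only.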

BeautifulSoup select method with CSS selector copied from inspecting element of a page returns nothing

You don't have the correct tags with the class you are looking for. The description is under a <span> tag, that you can get by finding the specific <h2> tag.

import requests
from bs4 import BeautifulSoup

link = 'http://shop.oreilly.com/product/0636920028154.do'
req = requests.get(link)
bs = BeautifulSoup(req.text, 'html.parser')

desc = bs.find('h2', {'class':'t-description-heading'}).find_next('span').text

or with .select

desc = bs.select('h2.t-description-heading')[0].find_next('span').text

Output:

'Get a comprehensive, in-depth introduction to the core Python language with this hands-on book. Based on author Mark Lutz’s popular training course, this updated fifth edition will help you quickly write efficient, high-quality code with Python. It’s an ideal way to begin, whether you’re new to programming or a professional developer versed in other languages.Complete with quizzes, exercises, and helpful illustrations,  this easy-to-follow, self-paced tutorial gets you started with both Python 2.7 and 3.3— the latest releases in the 3.X  and 2.X lines—plus all other releases in common use today. You’ll also learn some advanced language features that recently have become more common in Python code.Explore Python’s major built-in object types such as numbers, lists, and dictionariesCreate and process objects with Python statements, and learn Python’s general syntax modelUse functions to avoid code redundancy and package code for reuseOrganize statements, functions, and other tools into larger components with modulesDive into classes: Python’s object-oriented programming tool for structuring codeWrite large programs with Python’s exception-handling model and development toolsLearn advanced Python tools, including decorators, descriptors, metaclasses, and Unicode processing'
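To see what find_next is doing there, here is a minimal sketch on made-up markup (the structure is an assumption, not the actual product page). find_next walks forward in document order from the heading, so it finds the first span after it even when that span is nested inside another tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the description span sits inside a div after the heading
html = """
<h2 class="t-description-heading">Description</h2>
<div><span>Book description text.</span></div>
<span>Unrelated later span.</span>
"""
soup = BeautifulSoup(html, "html.parser")

heading = soup.select_one('h2.t-description-heading')
# First <span> that appears after the heading in document order
desc = heading.find_next('span').text
print(desc)
```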

How to use CSS selectors to retrieve specific links lying in some class using BeautifulSoup?

The page is not the most friendly in the use of classes and markup, but even so your CSS selector is too specific to be useful here.

If you want Upcoming Events, take just the first <div class="events-horizontal">, then grab the <a href="..."> tags inside each <div class="title">, i.e. the links on the titles:

upcoming_events_div = soup.select_one('div.events-horizontal')
for link in upcoming_events_div.select('div.title a[href]'):
    print(link['href'])
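Here is a self-contained sketch of that two-step selection on made-up markup (the URLs and the second events block are assumptions), showing that select_one picks only the first matching div and that a[href] skips anchors without a link:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two event blocks, only the first is "upcoming"
html = """
<div class="events-horizontal">
  <div class="title"><a href="/events/1">One</a></div>
  <div class="title"><a href="/events/2">Two</a></div>
  <div class="title"><a>No link</a></div>
</div>
<div class="events-horizontal">
  <div class="title"><a href="/past/9">Past</a></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns only the first match in document order
upcoming_events_div = soup.select_one('div.events-horizontal')

# Searching from that tag limits the results to its descendants
links = [a['href'] for a in upcoming_events_div.select('div.title a[href]')]
print(links)
```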

Note that you should not use r.text; use r.content and leave the decoding to Unicode to BeautifulSoup. See Encoding issue of a character in utf-8.
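As a small illustration of handing bytes to BeautifulSoup, here is a sketch that uses a hard-coded byte string as a stand-in for r.content (the markup is an assumption). BeautifulSoup works out the encoding itself and exposes its guess as original_encoding.

```python
from bs4 import BeautifulSoup

# Stand-in for r.content: raw UTF-8 bytes, not an already-decoded str
raw = '<html><head><meta charset="utf-8"></head><body><p>café</p></body></html>'.encode("utf-8")

# Passing bytes lets BeautifulSoup detect the encoding and decode to Unicode
soup = BeautifulSoup(raw, "html.parser")
text = soup.p.text

print(text)
print(soup.original_encoding)  # the encoding BeautifulSoup settled on
```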


