How to Use XPath with BeautifulSoup

Can we use XPath with BeautifulSoup?

Nope, BeautifulSoup, by itself, does not support XPath expressions.

An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup-compatible mode where it'll try to parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe it is faster.
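A minimal sketch of that BeautifulSoup-compatible mode, which lives in lxml.html.soupparser and requires BeautifulSoup to be installed (the broken markup here is just an illustration):

from lxml.html import soupparser

# parse deliberately broken HTML the way BeautifulSoup would
root = soupparser.fromstring("<p>Some<b>broken<p>HTML")
print(root.xpath("//p/text()"))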

Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.

try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    # Python 3
    from urllib.request import urlopen
from lxml import etree

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)

There is also a dedicated lxml.html module with additional functionality.
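For instance, a minimal sketch using lxml.html directly (the markup here is invented for illustration):

import lxml.html

# parse an HTML string and query it with XPath
doc = lxml.html.fromstring("<table><tr><td class='empformbody'>cell</td></tr></table>")
print(doc.xpath("//td[@class='empformbody']/text()"))  # ['cell']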

Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:

import lxml.html
import requests

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)
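As before, you can then run XPath queries against the parsed tree; a quick sanity check (the //title path is just an illustrative query):

# print the page title to confirm the document parsed
print(tree.xpath('//title/text()'))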

Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:

from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    print(elem.text_content())
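If you want to see the XPath that the CSS selector translates to, the CSSSelector object exposes it through its path attribute:

# inspect the XPath expression generated from the CSS selector
print(td_empformbody.path)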

Coming full circle: BeautifulSoup itself does have very complete CSS selector support:

for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
    print(cell.get_text())

How to get XPath using BeautifulSoup after we find a tag

Clicking buttons:

To interact with the buttons you need to automate a browser, e.g. with Selenium.


Targeting a specific button:

You will notice the expand (+) buttons have an onclick attribute which triggers a function call with various arguments. It may be possible to replicate this with requests; however, I am going to focus on the fact that you can target a particular + button by that attribute and its value, based on the unique arguments passed to the function, using a CSS attribute = value selector with the * (contains) operator.

In the browser developer tools you can see the function that is called with the 'Expenses' and 'quarters' arguments, in line with the part of the table to be expanded.


I target that onclick attribute by the argument values, escaping the inner ' characters because the selector string is wrapped in outer 's. I am looking for an onclick containing those values.


Expand all +

To find all + buttons, rather than target a specific one, change the method call to find_elements_by_css_selector and change the CSS selector to look at the function name rather than the arguments, i.e. [onclick*="Company.showSchedule"].

You will need to loop over the matched WebElements collection (a list) and click each element within the loop; see the sketch after the code below.


Py:

from selenium import webdriver

d = webdriver.Chrome()
d.get('https://www.screener.in/company/GRANULES/consolidated/')
d.find_element_by_css_selector('[onclick*="\'Expenses\', \'quarters\'"]').click()
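A minimal sketch of the expand-all variant, using the [onclick*="Company.showSchedule"] selector described above:

# find every expand (+) button by the function name in its onclick attribute
buttons = d.find_elements_by_css_selector('[onclick*="Company.showSchedule"]')
for button in buttons:
    button.click()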

Get href from HTML with Beautiful Soup select or lxml XPath

Try the following scripts to get the content you are interested in, and make sure to test both of them with different movies; both should produce the desired output. I tried to avoid any hardcoded indices to target the content.

Using CSS selector:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.rottentomatoes.com/m/leto')
soup = BeautifulSoup(r.text,'lxml')

directed = soup.select_one(".meta-row:contains('Directed By') > .meta-value > a").text
written = [item.text for item in soup.select(".meta-row:contains('Written By') > .meta-value > a")]
written_links = [item.get("href") for item in soup.select(".meta-row:contains('Written By') > .meta-value > a")]
print(directed,written,written_links)

Using XPath:

import requests
from lxml.html import fromstring

r = requests.get('https://www.rottentomatoes.com/m/leto')
root = fromstring(r.text)

directed = root.xpath("//*[contains(.,'Directed By')]/parent::*/*[@class='meta-value']/a/text()")
written = root.xpath("//*[contains(.,'Written By')]/parent::*/*[@class='meta-value']/a/text()")
written_links = root.xpath(".//*[contains(.,'Written By')]/parent::*/*[@class='meta-value']/a//@href")
print(directed,written,written_links)

For the cast, I used a list comprehension so that I can call .strip() on each element to strip out whitespace. XPath's normalize-space() would be the ideal option for this, though.

cast = [item.strip() for item in root.xpath("//*[contains(@class,'cast-item')]//a/span[@title]/text()")]
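For reference, a sketch of the normalize-space() variant: select the span nodes first, then let XPath collapse the whitespace of each one (same selector as above):

# select the span nodes, then normalize each one's whitespace in XPath
cast = [span.xpath("normalize-space(.)")
        for span in root.xpath("//*[contains(@class,'cast-item')]//a/span[@title]")]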

How do I Iterate between XPath using BeautifulSoup?

You could just construct them based on subtracting from or adding to today's date. However, you can also use nth-child to extract the relevant nodes: specify the first anchor tag (yesterday), then an nth-child range to get from tomorrow onwards, and combine the two with the CSS , (or) syntax. You don't need to specify today, as that is the landing page. Then you can browser.get each extracted link in a loop over the returned list:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.oddsportal.com/matches/soccer/')
other_days = [i.get_attribute('href')
              for i in browser.find_elements_by_css_selector('.next-games-date > a:nth-child(1), .next-games-date > a:nth-child(n+3)')]
print(other_days)
for a_day in other_days:
    browser.get(a_day)
    # do something with the page for that day

Integrating with the code you shared in comments (this meant re-writing some of your existing class):

import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup as bs


class GameData:
    def __init__(self):
        self.country = []


def get_urls(browser, landing_page):
    browser.get(landing_page)
    urls = [i.get_attribute('href') for i in
            browser.find_elements_by_css_selector('.next-games-date > a:nth-child(1), .next-games-date > a:nth-child(n+3)')]
    return urls


def parse_data(html):
    df = pd.read_html(html, header=0)[0]
    soup = bs(html, "lxml")
    cont = soup.find('div', {'id': 'wrap'})
    content = cont.find('div', {'id': 'col-content'})
    content = content.find('table', {'class': 'table-main'}, {'id': 'table-matches'})
    main = content.find('th', {'class': 'first2 tl'})

    if main is None:
        return None

    count = main.findAll('a')
    country = count[0].text
    game_data = GameData()

    for row in df.itertuples():
        if not isinstance(row[1], str):
            continue
        elif ':' not in row[1]:
            country = row[1].split('»')[0]
            continue
        game_data.country.append(country)

    return game_data


if __name__ == '__main__':
    start_url = "https://www.oddsportal.com/matches/soccer/"
    browser = webdriver.Chrome()
    results = None
    urls = get_urls(browser, start_url)
    urls.insert(0, start_url)

    for number, url in enumerate(urls):
        if number > 0:
            # the landing page was already loaded by get_urls, so skip re-fetching it
            browser.get(url)
        html = browser.page_source
        game_data = parse_data(html)

        if game_data is None:
            continue

        result = pd.DataFrame(game_data.__dict__)

        if results is None:
            results = result
        else:
            results = results.append(result, ignore_index=True)

Need to scrape data from a website using XPath and BeautifulSoup

A simple way is to use pandas. Here is how you do it:

import pandas as pd
import requests

r = requests.get('https://tipsters.asianbookie.com/index.cfm?player=Mitya68&ID=297754&sortleague=1#playersopenbets&tz=5.5').text

dfs = pd.read_html(r)

df = dfs[141]  # the table of interest happens to be the 142nd table on this page

df.columns = df.iloc[0]  # promote the first row to the header

df = df.drop(0)

# keep only the text after the last '.', which leaves e.g. '9 hours ago'
df['Bet Placed ≡'] = [value.split('.')[-1] for value in df['Bet Placed ≡']]

print(df)

Output:

0   Bet Placed ≡              Team A  ...   Rate         Pending Status
1 9 hours ago Real Madrid ... 1.975 pending ?-?
2 9 hours ago Red Bull Salzburg ... 1.875 pending ?-?
3 9 hours ago Ajax ... 2.00 pending ?-?
4 9 hours ago Bayern Munich ... 2.00 pending ?-?
5 9 hours ago Bayern Munich ... 1.85 pending ?-?
6 9 hours ago Inter Milan ... 1.875 pending ?-?
7 9 hours ago Manchester City ... 1.95 pending ?-?
8 9 hours ago Midtjylland ... 1.875 pending ?-?
9 9 hours ago Olympiakos Piraeus ... 1.95 pending ?-?
10 9 hours ago Hamburg SV ... 1.925 pending ?-?
11 9 hours ago Vissel Kobe ... 1.925 Lost(-25,000) FT 1-3
12 9 hours ago Shonan Bellmare ... 1.825 Won½(+10,313) FT 0-0
13 9 hours ago Yokohama Marinos ... 2.025 Won½(+12,812) FT 2-1
14 9 hours ago RKC Waalwijk ... 1.875 pending ?-?
15 9 hours ago Espanyol ... 2.075 lose(-25,000) 29' 1-0

[15 rows x 7 columns]

You can also get these values as separate lists by adding these lines to your code:

team_a = list(df['Team A'])
team_b = list(df['Team B'])
rate = list(df['Rate'])
stake = list(df['Stake'])

If you want to print them in the format that you have mentioned, add these lines to your code:

final_lst = zip(team_a, team_b, stake, rate)

for teamA, teamB, stakee, ratee in final_lst:
    print(f"{teamA} vs {teamB} - Stake: {stakee}, Rate: {ratee}")

Output:

Real Madrid vs Shaktar Donetsk - Stake: 25000.00, Rate: 1.975
Red Bull Salzburg vs Lokomotiv Moscow - Stake: 100000.00, Rate: 1.875
Ajax vs Liverpool - Stake: 25000.00, Rate: 2.00
Bayern Munich vs Atl. Madrid - Stake: 25000.00, Rate: 2.00
Bayern Munich vs Atl. Madrid - Stake: 25000.00, Rate: 1.85
Inter Milan vs Monchengladbach - Stake: 25000.00, Rate: 1.875
Manchester City vs Porto - Stake: 25000.00, Rate: 1.95
Midtjylland vs Atalanta - Stake: 100000.00, Rate: 1.875
Olympiakos Piraeus vs Marseille - Stake: 25000.00, Rate: 1.95
Hamburg SV vs Erzgebirge Aue - Stake: 100000.00, Rate: 1.925
Vissel Kobe vs Kashima Antlers - Stake: 25000.00, Rate: 1.925
Shonan Bellmare vs Sagan Tosu - Stake: 25000.00, Rate: 1.825
Yokohama Marinos vs Nagoya - Stake: 25000.00, Rate: 2.025
RKC Waalwijk vs PEC Zwolle - Stake: 25000.00, Rate: 1.875
Espanyol vs Mirandes - Stake: 25000.00, Rate: 2.075

XPath loop pages with BeautifulSoup

You actually don't need Selenium or even BeautifulSoup: all the content is loaded via XHR requests, without the need for JavaScript, as you can see on the network tab of the browser developer tools, i.e.:

import requests

h = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0',
    'X-Requested-With': 'XMLHttpRequest',
}

u = "https://www.systembolaget.se/api/productsearch/search/sok-dryck/?searchquery=land&sortdirection=Ascending&site=all&fullassortment=1"

req = requests.get(u, headers=h).json()

for prod in req['ProductSearchResults']:
    ProductId = prod['ProductId']
    ProductNumber = prod['ProductNumber']
    ProductNameBold = prod['ProductNameBold']
    ProductNameThin = prod['ProductNameThin']
    Volume = prod['Volume']
    Price = prod['Price']
    Category = prod['Category']
    ProductNumberShort = prod['ProductNumberShort']
    TasteAndUsage = prod['TasteAndUsage']
    # ... and many other product details
    print(ProductNumber, ProductNameBold)

9001001 Evening Land Vineyard Seven Springs
7233401 Seven Springs
694701 Tyrrell's Wines
57801 Barceló
53701 Highland Queen
57802 Barceló
...

To retrieve detailed information about one or more products, you can issue the following request:

u = "https://www.systembolaget.se/api/product/GetProductsForAnalytics"
d = {"ProductNumbers":["2048801"]} # use the ProductNumber(s) gattered above
prod_detail = requests.post(u, json=d, headers=h).json()
print(prod_detail)

{'Products': [{'ProductId': '303987', 'ProductNumberShort': '20488', 'Assortment': 'FS', 'CustomerOrderSupplySource': 'depot', 'SupplyCode': None, 'IsNewVintage': False, 'OriginLevel1': 'Skottland', 'OriginLevel2': 'Highlands', 'OriginLevel3': 'Western Highlands', 'OriginLevel4': None, 'OriginLevel5': None, 'BrandOrigin': None, 'BottleCode': 'Helbutelj', 'BottleTypeGroup': 'Helflaska glas', 'IsWebLaunch': False, 'Seal': 'Skruvkapsyl', 'VatCode': 1, 'PriceExclVat': 1199.2, 'PriceInclVatExclRecycleFee': 1499.0, 'PriorPrice': 0.0, 'ComparisonPrice': 2141.43, 'SellStartDate': '2008-04-01T00:00:00', 'SellStartTime': '10:00:00', 'BottleTypes': None, 'IsSellStartDateHighlighted': False, 'SellStartSearchUrl': '/sok-dryck/?sellstartdatefrom=2008-04-01&sellstartdateto=2008-04-01', 'ProducerName': 'Lang Brothers Limited', 'ProducerDescription': 'Familjen Edmonstone erhöll licens att destillera 1833. 1851 såldes destilleriet, under namnet Burnfoot, till MacLelleand som säljer det till Lang Brothers 1876, då under namnet Glen Guin. \n1905 blir det officiella namnet Glengoyne. 1965 blir Langs en del av Robertson & Baxter Group, som senare blir Edrington. 2003 köper företaget Ian Macleod & Co. destilleriet och varumärket Langs. Samtidigt byter man namn till Ian Macleod Distillers.', 'TasteAndUsage': 'Komplex, smakrik whisky med tydlig karaktär av sherryfat, inslag av valnötter, torkad frukt, kakao, lakrits och honung. Serveras rumstempererad som avec.', 'Production': None, 'CultivationArea': 'Glengoyne ligger i Western Highlands, alltså de västra delarna av Skottland. Gränsen för vad som räknas till de norra och västra högländerna brukar gå vid destillerier som ligger i nordvästlig riktning räknat från Staden Inverness.', 'Harvest': None, 'Soil': None, 'SupplierName': 'Cask Sweden AB', 'IsManufacturingCountry': True, 'IsSupplierTemporaryNotAvailable': False, 'IsSupplierNotAvailable': False, 'BackInStockAtSupplier': None, 'IsDiscontinued': False, 'IsCompletelyOutOfStock': False, 'IsTemporaryOutOfStock': False, 'RestrictedParcelQuantity': 0, 'IsRegionalRestricted': False, 'IsNewInAssortment': False, 'IsLimitedEdition': False, 'IsFsAssortment': True, 'IsTseAssortment': False, 'IsTsLsAssortment': False, 'IsHidden': False, 'IsSearchable': True, 'IsInAnyStoreSearchAssortment': True, 'IsStoreOrderApplicable': True, 'IsHomeOrderApplicable': True, 'IsAgentOrderApplicable': True, 'SeasonName': None, 'IsDishMatchable': False, 'IsDKI': False, 'AllergenStatement': 'Inga Allergener', 'IngredientStatement': None, 'AlcoholPercentage': 43.0, 'TasteSymbols': 'Avec/digestif;', 'TasteClockGroup': 'Ingen rökighet', 'TasteClockBitter': None, 'TasteClockFruitacid': None, 'TasteClockBody': None, 'TasteClockRoughness': None, 'TasteClockSweetness': None, 'TasteClockCasque': 8, 'TasteClockSmokiness': 1, 'IsCategoryBeer': False, 'IsCategoryBeerOrWhiskey': False, 'IsNewsIconVisible': False, 'Grapes': None, 'RawMaterial': 'Mältat korn av typen golden promise.', 'SugarContent': '<3', 'Additives': None, 'Storage': None, 'Preservable': '', 'HasInbounddeliveries': False, 'IsGlutenFree': False, 'IsEthical': False, 'EthicalLabel': None, 'IsKosher': False, 'Created': '2014-04-23T10:47:03', 'Modified': '2020-04-24T02:09:06', 'ShowAdditionalBottleTypes': False, 'OriginLevels': ['Skottland', 'Highlands', 'Western Highlands'], 'TasteSymbolsList': ['Avec/digestif'], 'IsTasteAndUsageAlone': True, 'ImageItem': [{'ImageUrl': 'https://static.systembolaget.se/imagelibrary/publishedmedia/rdxofqidiov294hwwqxz/303987.jpg', 'ImageAltAttribute': 
'303987'}], 'WebLaunch': None, 'ProductNutritionHeaders': [], 'HasProductImage': True, 'HasAnyTaste': True, 'HasSymbolsOrRecycleFee': False, 'HasTasteAndRestrictions': False, 'HasRestrictions': False, 'HasSymbols': False, 'HasAnyTasteClocks': True, 'HasAnyTasteSymbols': True, 'ProductNumber': '2048801', 'ProductNameBold': 'Glengoyne', 'ProductNameThin': '21 Years', 'PriceInclVat': 1499.0, 'IsOrganic': False, 'IsLightWeightBottle': False, 'Volume': 700.0, 'Vintage': 0, 'Country': 'Storbritannien', 'Category': 'Sprit', 'SubCategory': 'Whisky', 'Type': 'Maltwhisky', 'Style': None, 'BeverageDescriptionShort': 'Maltwhisky', 'StyleDescription': None, 'RecycleFee': 0.0, 'RecycleFeeIndicator': '', 'BottleTextShort': 'Flaska', 'IsAddableToBasket': True, 'IsFsTsAssortment': True, 'IsBSAssortment': False, 'IsPaAssortment': False, 'ShowAdditionalBsInformation': False, 'Usage': 'Serveras rumstempererad som avec.', 'Color': 'Brungul färg.', 'Aroma': 'Komplex, harmonisk doft med tydlig karaktär av sherryfat, inslag av torkad frukt, valnötter, lakritsfudge, pomerans och mörk choklad.', 'Taste': 'Komplex, smakrik whisky med tydlig karaktär av sherryfat, inslag av valnötter, torkad frukt, kakao, lakrits och honung.', 'AdditionalInformation': "Avec är drycker som konsumeras efter middagen till kaffet. Vanliga exempel på avec är whisky, cognac, mörk rom eller likörer. Ordet är franskt, betyder 'med' och är en kortform av uttrycket 'du café avec le petit verre' som betyder kaffe med det lilla glaset.  "}], 'StockBalances': []}
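To pull a few fields back out of that detail payload, you can iterate over the Products list (the keys below appear in the sample response above):

# print selected fields for each product in the response
for p in prod_detail['Products']:
    print(p['ProductNameBold'], p['ProductNameThin'], p['PriceInclVat'], p['Country'])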

How to get a string using XPath in Python bs4?

In XPath, you just have to use the text() function:

from bs4 import BeautifulSoup
from lxml import etree

html_doc = """
<html>
<head>
</head>
<body>
<div class="container">
<section id="page">
<div class="content">
<div class="box">
<ul>
<li>Name: Peter</li>
<li>Age: 21</li>
<li>Status: Active</li>
</ul>
</div>
</div>
</section>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
dom = etree.HTML(str(soup))
print(dom.xpath('/html/body/div/section/div[1]/div[1]/ul/li[3]/text()'))

Output:

 ['Status: Active']

#OR

for li in dom.xpath('/html/body/div/section/div[1]/div[1]/ul/li[3]/text()'):
    txt = li.split()[1]
    print(txt)

Output:

Active

#OR

print(' '.join(dom.xpath('/html/body/div/section/div[1]/div[1]/ul/li[3]/text()')))

Output:

Status: Active

#OR

print(''.join(dom.xpath('//*[@class="box"]/ul/li[3]/text()')))

Output:

Status: Active

