Problems scraping dynamic content with requests and BeautifulSoup
You do not need BeautifulSoup here, just the correct URL. To get only the written-out number, POST to:
https://www.languagesandnumbers.com/ajax/en/
Because the endpoint replies in the form ack:::dreiundzwanzig, you have to extract the string:
response.text.split(':')[-1]
Example
import requests
with requests.Session() as session:
    response = session.post('https://www.languagesandnumbers.com/ajax/en/', data={
        "numberz": "23",
        "lang": "deu"
    })
    print(response.text.split(':')[-1])
Output
dreiundzwanzig
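The ack status token can also be checked before trusting the value. A minimal sketch with an illustrative parse_ack helper (the name is my own, not part of the site's API):

```python
def parse_ack(raw: str) -> str:
    """Split an 'ack:::value' style reply and return the value.

    Raises ValueError when the status token is not 'ack'.
    """
    status, _, value = raw.partition(':::')
    if status != 'ack':
        raise ValueError(f'unexpected status: {status!r}')
    return value

print(parse_ack('ack:::dreiundzwanzig'))  # -> dreiundzwanzig
```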
Webscraping of dynamic content with Beautiful soup
Since the contents are dynamically loaded, you can parse the number of job results only after a certain element is visible. At that point all elements will have been loaded and you can successfully parse your desired data.
You could also increase the sleep time to let all the data load, but that's a bad solution.
Working code -
import time
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
options = webdriver.ChromeOptions()
# options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1920x1080")
options.add_argument("--disable-extensions")
chrome_driver = webdriver.Chrome(
service=Service(ChromeDriverManager().install()),
options=options
)
def arbeitsagentur_scraper():
    URL = "https://www.arbeitsagentur.de/jobsuche/suche?angebotsart=1&was=Informatiker%2Fin"
    with chrome_driver as driver:
        driver.implicitly_wait(15)
        driver.get(URL)
        wait = WebDriverWait(driver, 10)
        # time.sleep(10)  # increase the load time to fetch all elements; not an advised solution
        # wait until this element is visible
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.liste-container')))
        elem = driver.find_element(By.XPATH,
            '/html/body/jb-root/main/jb-jobsuche/jb-jobsuche-suche/div[1]/div/jb-h1zeile/h2')
        print(elem.text)

arbeitsagentur_scraper()
Output -
12.165 Jobs für Informatiker/in
Not able to scrape dynamic content using Selenium or BeautifulSoup
Open the browser dev tools and look at the XHR requests; you'll see the URL to pull the data from directly. It's returned as JSON, but you can convert that to a table:
Code:
import requests
import pandas as pd

url = 'https://www.prokabaddi.com/sifeeds/kabaddi/static/json/1_0_102_stats.json'
jsonData = requests.get(url).json()
table = pd.json_normalize(jsonData['data'])
Output:
print (table.head(5).to_string())
match_played player_id player_name position_id position_name rank team team_full_name team_id team_name value
0 101 197 Pardeep Narwal 8.0 Raider 1 PAT Patna Pirates 6 PAT 1055
1 116 81 Rahul Chaudhari 8.0 Raider 2 TT Tamil Thalaivas 29 TT 987
2 118 41 Deepak Niwas Hooda 1.0 All Rounder 3 JAI Jaipur Pink Panthers 3 JAI 892
3 115 26 Ajay Thakur 8.0 Raider 4 TT Tamil Thalaivas 29 TT 811
4 88 326 Rohit Kumar 8.0 Raider 5 BEN Bengaluru Bulls 1 BEN 689
And filter to only get name and points:
print (table[['player_name','value']])
player_name value
0 Pardeep Narwal 1055
1 Rahul Chaudhari 987
2 Deepak Niwas Hooda 892
3 Ajay Thakur 811
4 Rohit Kumar 689
5 Maninder Singh 673
6 Rishank Devadiga 619
7 Kashiling Adake 612
8 Anup Kumar 596
9 Pawan Kumar Sehrawat 572
10 Manjeet Chhillar 562
11 Sandeep Narwal 533
12 Monu Goyat 475
13 Jang Kun Lee 462
14 Sachin Tanwar 456
15 Nitin Tomar 445
16 Jasvir Singh 412
17 Rajesh Narwal 397
18 Sukesh Hegde 395
19 Meraj Sheykh 393
20 Naveen Kumar 364
21 Vikash Kandola 358
22 Prashanth Kumar Rai 358
23 K. Prapanjan 357
24 Shrikant Jadhav 342
25 Siddharth Sirish Desai 337
26 Ran Singh 319
27 Ravinder Pahal 317
28 Deepak Narwal 306
29 Wazir Singh 300
.. ... ...
359 Rohit Kumar Prajapat 1
360 Kazuhiro Takano 1
361 Inderpal Bishnoi 1
362 Amit Kumar 1
363 Sunil Subhash Lande 1
364 Atif Waheed 1
365 Nithesh B R 1
366 Mohammad Taghi Paein Mahali 1
367 Yong Joo Ok 1
368 Vishnu Uthaman 1
369 Ajvender Singh 1
370 Sanju 1
371 Ravinandan G.M. 1
372 Navjot Singh 1
373 Parvesh Attri 1
374 Hardeep Duhan 1
375 Parveen Narwal 1
376 Ajay Singh 1
377 Nitin Kumar 1
378 Jishnu 1
379 Naveen Narwal 1
380 M. Abishek 1
381 Vikas Chhillar 1
382 Aman 1
383 Satywan 1
384 Vikram Kandola 1
385 Emad Sedaghatnia 1
386 Aashish Nagar 1
387 Ajinkya Rohidas Kapre 1
388 Munish 1
[389 rows x 2 columns]
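The flattening step itself can be checked offline. A minimal sketch with a hand-made payload shaped like the stats feed (the field names and nesting below are assumptions for illustration, not the real feed schema):

```python
import pandas as pd

# A tiny stand-in for the JSON the stats endpoint returns.
jsonData = {
    'data': [
        {'player_name': 'Pardeep Narwal', 'value': 1055, 'team': {'name': 'PAT'}},
        {'player_name': 'Rahul Chaudhari', 'value': 987, 'team': {'name': 'TT'}},
    ]
}

# json_normalize flattens nested dicts into dotted column names.
table = pd.json_normalize(jsonData['data'])
print(table[['player_name', 'value', 'team.name']])
```

Nested objects such as team become team.name columns, which is why the flat table above can be sliced with plain column selection.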
How to scrape dynamic content with beautifulsoup?
Set headers on your request and store your information in a more structured way.
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}
URLs = ['https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2']
data = []
for url in URLs:
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.text, "html.parser")
    data.append({
        'name': soup.find('span', class_='main-detail__name').get_text(strip=True),
        'brand': soup.find('span', class_='main-detail__marque').get_text(strip=True),
        'ref': soup.find('span', class_='main-detail__ref').get_text(strip=True),
        'price': soup.find('span', {'itemprop': 'price'}).get('content'),
        'url': url
    })
pd.DataFrame(data)
Output
name | brand | ref | price | url |
---|---|---|---|---|
Montre The Longines Legend Diver L3.774.4.30.2 | Longines | Référence : L3.774.4.30.2 | 2240 | https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2 |
Issues with web scraping using Beautiful Soup on dynamic HTML websites
The page loads dynamically through Ajax. Looking at the network inspector, the page loads all its data from a very big JSON file located at https://www.acc.co.nz/for-providers/treatment-recovery/work-type-detail-sheets/getSheets. To load all job data, you can use this script:
url = "https://www.acc.co.nz/for-providers/treatment-recovery/work-type-detail-sheets/getSheets"
import requests
import json
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get(url, headers=headers)
data = json.loads(r.text)
# To print all data in pretty form, uncomment this line:
# print(json.dumps(data, indent=4, sort_keys=True))
for d in data:
print(f'ID:\t{d["ID"]}')
print(f'Job Title:\t{d["JobTitle"]}')
print(f'Created:\t{d["Created"]}')
print('*' * 80)
# Available keys in this JSON:
# ClassName
# LastEdited
# Created
# ANZSCO
# JobTitle
# Description
# WorkTasks
# WorkEnvironment
# PhysicalMentalDemands
# Comments
# EntryRequirements
# Group
# ID
# RecordClassName
This prints:
ID: 2327
Job Title: Watch and Clock Maker and Repairer
Created: 2017-07-11 11:33:52
********************************************************************************
ID: 2328
Job Title: Web Administrator
Created: 2017-07-11 11:33:52
********************************************************************************
ID: 2329
Job Title: Welder
Created: 2017-07-11 11:33:52
...and so on
In the script I listed the available keys you can use to access your specific job data.
Parsing webpage with beautifulsoup to get dynamic content
This is probably handled by some Ajax calls, so it will not be in the page source.
You would need to monitor the network through the developer tools in the browser and look for the requests you are interested in,
e.g. a randomly picked request URL from this page:
http://ws.audioscrobbler.com/2.0/?api_key=73581584905631c5fc15720f03b0b9c8&format=json&callback=jQuery1703329798618797213_1380004055342&method=track.getSimilar&limit=10&artist=roxy%20music&track=while%20my%20heart%20is%20still%20beating&_=1380004055943
To see the response, enter the above URL in the browser and look at the content.
So you need to simulate the request in Python, and after you get the response you have to parse it for the interesting details.
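Responses fetched this way often come wrapped in a JSONP callback (note the callback=jQuery… parameter in the URL above). A minimal sketch for unwrapping such a reply into plain JSON — the sample string below is illustrative, not a real Last.fm response:

```python
import json
import re

def unwrap_jsonp(raw: str) -> dict:
    """Strip a 'callback({...})' JSONP wrapper and parse the JSON inside."""
    match = re.match(r'^[\w$.]+\((.*)\)\s*;?\s*$', raw, re.DOTALL)
    payload = match.group(1) if match else raw  # plain JSON passes through
    return json.loads(payload)

sample = 'jQuery1703329798618797213_1380004055342({"similartracks": {"track": []}})'
data = unwrap_jsonp(sample)
print(data['similartracks'])  # -> {'track': []}
```

Alternatively, dropping the callback parameter from the URL usually makes such endpoints return plain JSON, in which case the helper simply passes the body through.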
Beautiful Soup not finding element by ID
For dynamically created elements, give Selenium a try (the code below uses the Selenium 4 API, where find_element_by_* and executable_path have been removed):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=Service(r'C:\Program Files\ChromeDriver\chromedriver.exe'))
URL = 'https://www.forexfactory.com/news'
driver.get(URL)

Wait a few seconds so the dynamically loaded content is also present:

driver.implicitly_wait(5)  # wait up to 5 seconds

Get your element:

uiOuter = driver.find_element(By.ID, 'ui-outer')

Example for all links (story tiles):

aHref = driver.find_elements(By.CSS_SELECTOR, 'div.flexposts__story-title a')
[x.text for x in aHref]
Output
['EU\'s Barnier says "fundamental divergences" persist in UK trade talks',
'With end of crisis programs, Fed faces tricky post-pandemic transition',
'Markets Look Past Near-Term Challenges',
'Time is short, Divergences remain, but we continue to work hard for a deal',
'EURUSD consolidates around 1.1900; indicators are flat',
'Rush for emerging market company bonds as investors look beyond COVID-19',
'Europe’s Virus Lockdowns Push Economy Into Another Contraction',
'Interactive Brokers enhances Client Portal',
'BoE’s Haldane: Risk That Anxiety Leads To Gloom Loop',
'Sharpest fall in UK private sector output since May. Manufacturing growth offset by renewed...',
'Remote Working Shift Offers Silver Lining for Finance Gender Gap',
'EU Flash PMI signals steep downturn in November amid COVID-19 lockdowns',
'German PMI drops to five-month low in November due to tightening of COVID-19 restrictions, but...',
'Sharp decline in French business activity amid fresh COVID-19 lockdown',
'Rishi Sunak says Spending Review will not spell austerity',
'Remote Working Shift Offers Silver Lining for Finance Gender Gap',
'Japan’s Labor Thanksgiving Day goes viral',
'Ranking Asset Classes by Historical Returns (1985-2020)',
'Time is short, Divergences remain, but we continue to work hard for a deal',
'EURUSD consolidates around 1.1900; indicators are flat',
'US Dollar stuck near support, NZ$ strikes two-year high',
'US Dollar stuck near support, NZ$ strikes two-year high',
'Georgia confirms results in latest setback for Trump bid to overturn Biden win',
'Canada to roll over terms of EU trade deal with UK',
'Time is short, Divergences remain, but we continue to work hard for a deal',
'German PMI drops to five-month low in November due to tightening of COVID-19 restrictions, but...',
"COVID-19: 'No return to austerity', says chancellor as he hints at public sector pay freeze",
'EURUSD consolidates around 1.1900; indicators are flat',
'New Zealand Dollar May Rise as RBNZ Holds Fire on Negative Rates',
'Interactive Brokers enhances Client Portal']