Beautifulsoup Not Grabbing Dynamic Content

Problems scraping dynamic content with requests and BeautifulSoup

You do not need BeautifulSoup, just the correct URL, to get only the written form of the number:

https://www.languagesandnumbers.com/ajax/en/

Because it returns the result in the form ack:::dreiundzwanzig, you have to extract the string:

response.text.split(':')[-1]

Example

import requests

with requests.Session() as session:
    response = session.post('https://www.languagesandnumbers.com/ajax/en/', data={
        "numberz": "23",
        "lang": "deu"
    })
    print(response.text.split(':')[-1])

Output

dreiundzwanzig

Webscraping of dynamic content with Beautiful soup

Since the contents are loaded dynamically, you can parse the number of job results only after a certain element becomes visible; at that point all elements will have been loaded and you can successfully parse the desired data.

You could also increase the sleep time until all data is loaded, but that is a fragile solution.

Working code -

import time

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()

# options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1920x1080")
options.add_argument("--disable-extensions")

chrome_driver = webdriver.Chrome(
service=Service(ChromeDriverManager().install()),
options=options
)

def arbeitsagentur_scraper():
    URL = "https://www.arbeitsagentur.de/jobsuche/suche?angebotsart=1&was=Informatiker%2Fin"
    with chrome_driver as driver:
        driver.implicitly_wait(15)
        driver.get(URL)
        wait = WebDriverWait(driver, 10)

        # time.sleep(10)  # increase the load time to fetch all elements; not an advised solution

        # wait until this element is visible
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.liste-container')))

        elem = driver.find_element(By.XPATH,
            '/html/body/jb-root/main/jb-jobsuche/jb-jobsuche-suche/div[1]/div/jb-h1zeile/h2')
        print(elem.text)

arbeitsagentur_scraper()

Output -

12.165 Jobs für Informatiker/in

Not able to scrape dynamic content using Selenium or BeautifulSoup

Go into the browser's dev tools and look at the XHR requests; you'll see the URL to pull the data from directly. It's returned as JSON, but you can convert that to a table:

Code:

import requests
import pandas as pd

url = 'https://www.prokabaddi.com/sifeeds/kabaddi/static/json/1_0_102_stats.json'
jsonData = requests.get(url).json()

# flatten the list of player records into a DataFrame
table = pd.json_normalize(jsonData['data'])

Output:

print (table.head(5).to_string())
   match_played  player_id         player_name  position_id position_name  rank team        team_full_name  team_id team_name  value
0           101        197      Pardeep Narwal          8.0        Raider     1  PAT         Patna Pirates        6       PAT   1055
1           116         81     Rahul Chaudhari          8.0        Raider     2   TT       Tamil Thalaivas       29        TT    987
2           118         41  Deepak Niwas Hooda          1.0   All Rounder     3  JAI  Jaipur Pink Panthers        3       JAI    892
3           115         26         Ajay Thakur          8.0        Raider     4   TT       Tamil Thalaivas       29        TT    811
4            88        326         Rohit Kumar          8.0        Raider     5  BEN       Bengaluru Bulls        1       BEN    689

And filter to only get name and points:

print (table[['player_name','value']])
player_name value
0 Pardeep Narwal 1055
1 Rahul Chaudhari 987
2 Deepak Niwas Hooda 892
3 Ajay Thakur 811
4 Rohit Kumar 689
5 Maninder Singh 673
6 Rishank Devadiga 619
7 Kashiling Adake 612
8 Anup Kumar 596
9 Pawan Kumar Sehrawat 572
10 Manjeet Chhillar 562
11 Sandeep Narwal 533
12 Monu Goyat 475
13 Jang Kun Lee 462
14 Sachin Tanwar 456
15 Nitin Tomar 445
16 Jasvir Singh 412
17 Rajesh Narwal 397
18 Sukesh Hegde 395
19 Meraj Sheykh 393
20 Naveen Kumar 364
21 Vikash Kandola 358
22 Prashanth Kumar Rai 358
23 K. Prapanjan 357
24 Shrikant Jadhav 342
25 Siddharth Sirish Desai 337
26 Ran Singh 319
27 Ravinder Pahal 317
28 Deepak Narwal 306
29 Wazir Singh 300
.. ... ...
359 Rohit Kumar Prajapat 1
360 Kazuhiro Takano 1
361 Inderpal Bishnoi 1
362 Amit Kumar 1
363 Sunil Subhash Lande 1
364 Atif Waheed 1
365 Nithesh B R 1
366 Mohammad Taghi Paein Mahali 1
367 Yong Joo Ok 1
368 Vishnu Uthaman 1
369 Ajvender Singh 1
370 Sanju 1
371 Ravinandan G.M. 1
372 Navjot Singh 1
373 Parvesh Attri 1
374 Hardeep Duhan 1
375 Parveen Narwal 1
376 Ajay Singh 1
377 Nitin Kumar 1
378 Jishnu 1
379 Naveen Narwal 1
380 M. Abishek 1
381 Vikas Chhillar 1
382 Aman 1
383 Satywan 1
384 Vikram Kandola 1
385 Emad Sedaghatnia 1
386 Aashish Nagar 1
387 Ajinkya Rohidas Kapre 1
388 Munish 1

[389 rows x 2 columns]

How to scrape dynamic content with beautifulsoup?

Set headers on your request and store your information in a more structured way.

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
URLs = ['https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2']

data = []
for url in URLs:
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.text, "html.parser")
    data.append({
        'name': soup.find('span', class_='main-detail__name').get_text(strip=True),
        'brand': soup.find('span', class_='main-detail__marque').get_text(strip=True),
        'ref': soup.find('span', class_='main-detail__ref').get_text(strip=True),
        'price': soup.find('span', {'itemprop': 'price'}).get('content'),
        'url': url
    })

pd.DataFrame(data)

Output

                                             name     brand                        ref price                                                                                        url
0  Montre The Longines Legend Diver L3.774.4.30.2  Longines  Référence : L3.774.4.30.2  2240  https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2
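If you want to keep that structured data around, pandas can also write it straight to a file; the file name here is just an example:

import pandas as pd

# `data` is the list of dicts built in the loop above
df = pd.DataFrame(data)
df.to_csv('watches.csv', index=False)  # example output file name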

Issues with web scraping using Beautiful Soup on dynamic HTML websites

The page loads its content dynamically through Ajax. Looking at the network inspector, you can see that the page pulls all of its data from one very big JSON file located at https://www.acc.co.nz/for-providers/treatment-recovery/work-type-detail-sheets/getSheets. To load all the job data, you can use this script:

import requests
import json

url = "https://www.acc.co.nz/for-providers/treatment-recovery/work-type-detail-sheets/getSheets"

headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get(url, headers=headers)
data = json.loads(r.text)

# To print all data in pretty form, uncomment this line:
# print(json.dumps(data, indent=4, sort_keys=True))

for d in data:
    print(f'ID:\t{d["ID"]}')
    print(f'Job Title:\t{d["JobTitle"]}')
    print(f'Created:\t{d["Created"]}')
    print('*' * 80)

# Available keys in this JSON:
# ClassName
# LastEdited
# Created
# ANZSCO
# JobTitle
# Description
# WorkTasks
# WorkEnvironment
# PhysicalMentalDemands
# Comments
# EntryRequirements
# Group
# ID
# RecordClassName

This prints:

ID: 2327
Job Title: Watch and Clock Maker and Repairer
Created: 2017-07-11 11:33:52
********************************************************************************
ID: 2328
Job Title: Web Administrator
Created: 2017-07-11 11:33:52
********************************************************************************
ID: 2329
Job Title: Welder
Created: 2017-07-11 11:33:52

...and so on

The keys listed at the end of the script are the ones you can use to access your specific job data.
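For instance, here is a minimal follow-up sketch (assuming the data variable from the script above, and that a sheet titled "Welder" exists, as in the sample output) that picks out a single job sheet and reads a couple of the listed keys:

# assumes `data` from the script above; "Welder" appears in the sample output
welder = next(d for d in data if d['JobTitle'] == 'Welder')
print(welder['ANZSCO'])       # occupation code for this sheet
print(welder['Description'])  # the full job description text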

Parsing webpage with beautifulsoup to get dynamic content

This content is probably loaded by some Ajax calls, so it will not be in the page source.

You would need to "monitor network" through the developer tools in the browser and look for the requests you are interested in.

For example, here is a randomly picked request URL from this page:

http://ws.audioscrobbler.com/2.0/?api_key=73581584905631c5fc15720f03b0b9c8&format=json&callback=jQuery1703329798618797213_1380004055342&method=track.getSimilar&limit=10&artist=roxy%20music&track=while%20my%20heart%20is%20still%20beating&_=1380004055943

To see the response, enter the above URL in the browser and look at the content that comes back.

You then need to simulate the request in Python, and once you get the response, parse it for the interesting details.
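As a sketch of what that simulation could look like with requests (this assumes the endpoint returns plain JSON once the jQuery callback parameter is dropped, and that the response follows the usual track.getSimilar layout of similartracks -> track):

import requests

# same query as the captured URL, minus the JSONP `callback` parameter;
# without it the endpoint is assumed to return plain JSON
params = {
    'api_key': '73581584905631c5fc15720f03b0b9c8',
    'format': 'json',
    'method': 'track.getSimilar',
    'limit': 10,
    'artist': 'roxy music',
    'track': 'while my heart is still beating',
}
response = requests.get('http://ws.audioscrobbler.com/2.0/', params=params)
data = response.json()

# parse the response for the interesting details
# (the similartracks -> track -> name/artist layout is assumed here)
for track in data['similartracks']['track']:
    print(track['name'], '-', track['artist']['name'])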

Beautiful Soup not finding element by ID

For dynamically created elements, give Selenium a try:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=Service(r'C:\Program Files\ChromeDriver\chromedriver.exe'))
URL = 'https://www.forexfactory.com/news'

driver.get(URL)

Wait a few seconds so that dynamically loaded content is present as well:

driver.implicitly_wait(5)  # implicitly wait up to 5 seconds when locating elements

Get your element:

uiOuter = driver.find_element(By.ID, 'ui-outer')

Example for all links (story titles):

aHref = driver.find_elements(By.CSS_SELECTOR, 'div.flexposts__story-title a')
[x.text for x in aHref]

Output

    ['EU\'s Barnier says "fundamental divergences" persist in UK trade talks',
'With end of crisis programs, Fed faces tricky post-pandemic transition',
'Markets Look Past Near-Term Challenges',
'Time is short, Divergences remain, but we continue to work hard for a deal',
'EURUSD consolidates around 1.1900; indicators are flat',
'Rush for emerging market company bonds as investors look beyond COVID-19',
'Europe’s Virus Lockdowns Push Economy Into Another Contraction',
'Interactive Brokers enhances Client Portal',
'BoE’s Haldane: Risk That Anxiety Leads To Gloom Loop',
'Sharpest fall in UK private sector output since May. Manufacturing growth offset by renewed...',
'Remote Working Shift Offers Silver Lining for Finance Gender Gap',
'EU Flash PMI signals steep downturn in November amid COVID-19 lockdowns',
'German PMI drops to five-month low in November due to tightening of COVID-19 restrictions, but...',
'Sharp decline in French business activity amid fresh COVID-19 lockdown',
'Rishi Sunak says Spending Review will not spell austerity',
'Remote Working Shift Offers Silver Lining for Finance Gender Gap',
'Japan’s Labor Thanksgiving Day goes viral',
'Ranking Asset Classes by Historical Returns (1985-2020)',
'Time is short, Divergences remain, but we continue to work hard for a deal',
'EURUSD consolidates around 1.1900; indicators are flat',
'US Dollar stuck near support, NZ$ strikes two-year high',
'US Dollar stuck near support, NZ$ strikes two-year high',
'Georgia confirms results in latest setback for Trump bid to overturn Biden win',
'Canada to roll over terms of EU trade deal with UK',
'Time is short, Divergences remain, but we continue to work hard for a deal',
'German PMI drops to five-month low in November due to tightening of COVID-19 restrictions, but...',
"COVID-19: 'No return to austerity', says chancellor as he hints at public sector pay freeze",
'EURUSD consolidates around 1.1900; indicators are flat',
'New Zealand Dollar May Rise as RBNZ Holds Fire on Negative Rates',
'Interactive Brokers enhances Client Portal']

