Problems scraping dynamic content with requests and BeautifulSoup
You do not need BeautifulSoup here, just the correct URL. To get only the written-out number, POST to:
https://www.languagesandnumbers.com/ajax/en/
Because the endpoint replies in the form ack:::dreiundzwanzig, you have to extract the string:
response.text.split(':')[-1]
Example
import requests
with requests.Session() as session:
    response = session.post('https://www.languagesandnumbers.com/ajax/en/', data={
        "numberz": "23",
        "lang": "deu"
    })
    print(response.text.split(':')[-1])
Output
dreiundzwanzig
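The ack status token can also be checked before trusting the value. A minimal sketch with an illustrative parse_ack helper (the name is my own, not part of the site's API):

```python
def parse_ack(raw: str) -> str:
    """Split an 'ack:::value' style reply and return the value.

    Raises ValueError when the status token is not 'ack'.
    """
    status, _, value = raw.partition(':::')
    if status != 'ack':
        raise ValueError(f'unexpected status: {status!r}')
    return value

print(parse_ack('ack:::dreiundzwanzig'))  # -> dreiundzwanzig
```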
Webscraping of dynamic content with Beautiful soup
Since the contents are dynamically loaded, you can parse the number of job results only after a certain element is visible. At that point all elements will have been loaded and you can successfully parse your desired data.
You could also increase the sleep time to let all the data load, but that's a bad solution.
Working code -
import time
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
options = webdriver.ChromeOptions()
# options.add_argument("--headless")
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1920x1080")
options.add_argument("--disable-extensions")
chrome_driver = webdriver.Chrome(
service=Service(ChromeDriverManager().install()),
options=options
)
def arbeitsagentur_scraper():
    URL = "https://www.arbeitsagentur.de/jobsuche/suche?angebotsart=1&was=Informatiker%2Fin"
    with chrome_driver as driver:
        driver.implicitly_wait(15)
        driver.get(URL)
        wait = WebDriverWait(driver, 10)
        # time.sleep(10)  # increase the load time to fetch all elements; not an advised solution
        # wait until this element is visible
        wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.liste-container')))
        elem = driver.find_element(By.XPATH,
            '/html/body/jb-root/main/jb-jobsuche/jb-jobsuche-suche/div[1]/div/jb-h1zeile/h2')
        print(elem.text)

arbeitsagentur_scraper()
Output -
12.165 Jobs für Informatiker/in
Not able to scrape dynamic content using Selenium or BeautifulSoup
Open the browser dev tools and look at the XHR requests; you'll see the URL to pull the data from directly. It's returned as JSON, but you can convert that to a table:
Code:
import requests
import pandas as pd

url = 'https://www.prokabaddi.com/sifeeds/kabaddi/static/json/1_0_102_stats.json'
jsonData = requests.get(url).json()
table = pd.json_normalize(jsonData['data'])
Output:
print (table.head(5).to_string())
match_played player_id player_name position_id position_name rank team team_full_name team_id team_name value
0 101 197 Pardeep Narwal 8.0 Raider 1 PAT Patna Pirates 6 PAT 1055
1 116 81 Rahul Chaudhari 8.0 Raider 2 TT Tamil Thalaivas 29 TT 987
2 118 41 Deepak Niwas Hooda 1.0 All Rounder 3 JAI Jaipur Pink Panthers 3 JAI 892
3 115 26 Ajay Thakur 8.0 Raider 4 TT Tamil Thalaivas 29 TT 811
4 88 326 Rohit Kumar 8.0 Raider 5 BEN Bengaluru Bulls 1 BEN 689
And filter to only get name and points:
print (table[['player_name','value']])
player_name value
0 Pardeep Narwal 1055
1 Rahul Chaudhari 987
2 Deepak Niwas Hooda 892
3 Ajay Thakur 811
4 Rohit Kumar 689
5 Maninder Singh 673
6 Rishank Devadiga 619
7 Kashiling Adake 612
8 Anup Kumar 596
9 Pawan Kumar Sehrawat 572
10 Manjeet Chhillar 562
11 Sandeep Narwal 533
12 Monu Goyat 475
13 Jang Kun Lee 462
14 Sachin Tanwar 456
15 Nitin Tomar 445
16 Jasvir Singh 412
17 Rajesh Narwal 397
18 Sukesh Hegde 395
19 Meraj Sheykh 393
20 Naveen Kumar 364
21 Vikash Kandola 358
22 Prashanth Kumar Rai 358
23 K. Prapanjan 357
24 Shrikant Jadhav 342
25 Siddharth Sirish Desai 337
26 Ran Singh 319
27 Ravinder Pahal 317
28 Deepak Narwal 306
29 Wazir Singh 300
.. ... ...
359 Rohit Kumar Prajapat 1
360 Kazuhiro Takano 1
361 Inderpal Bishnoi 1
362 Amit Kumar 1
363 Sunil Subhash Lande 1
364 Atif Waheed 1
365 Nithesh B R 1
366 Mohammad Taghi Paein Mahali 1
367 Yong Joo Ok 1
368 Vishnu Uthaman 1
369 Ajvender Singh 1
370 Sanju 1
371 Ravinandan G.M. 1
372 Navjot Singh 1
373 Parvesh Attri 1
374 Hardeep Duhan 1
375 Parveen Narwal 1
376 Ajay Singh 1
377 Nitin Kumar 1
378 Jishnu 1
379 Naveen Narwal 1
380 M. Abishek 1
381 Vikas Chhillar 1
382 Aman 1
383 Satywan 1
384 Vikram Kandola 1
385 Emad Sedaghatnia 1
386 Aashish Nagar 1
387 Ajinkya Rohidas Kapre 1
388 Munish 1
[389 rows x 2 columns]
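The flattening step itself can be checked offline. A minimal sketch with a hand-made payload shaped like the stats feed (the field names and nesting below are assumptions for illustration, not the real feed schema):

```python
import pandas as pd

# A tiny stand-in for the JSON the stats endpoint returns.
jsonData = {
    'data': [
        {'player_name': 'Pardeep Narwal', 'value': 1055, 'team': {'name': 'PAT'}},
        {'player_name': 'Rahul Chaudhari', 'value': 987, 'team': {'name': 'TT'}},
    ]
}

# json_normalize flattens nested dicts into dotted column names.
table = pd.json_normalize(jsonData['data'])
print(table[['player_name', 'value', 'team.name']])
```

Nested objects such as team become team.name columns, which is why the flat table above can be sliced with plain column selection.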
How to scrape dynamic content with beautifulsoup?
Set headers on your request and store your information in a more structured way.
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}
URLs = ['https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2']
data = []
for url in URLs:
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.text, "html.parser")
    data.append({
        'name': soup.find('span', class_='main-detail__name').get_text(strip=True),
        'brand': soup.find('span', class_='main-detail__marque').get_text(strip=True),
        'ref': soup.find('span', class_='main-detail__ref').get_text(strip=True),
        'price': soup.find('span', {'itemprop': 'price'}).get('content'),
        'url': url
    })
pd.DataFrame(data)
Output
name | brand | ref | price | url |
---|---|---|---|---|
Montre The Longines Legend Diver L3.774.4.30.2 | Longines | Référence : L3.774.4.30.2 | 2240 | https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2 |
Issues with web scraping using Beautiful Soup on dynamic HTML websites
The page loads dynamically through Ajax. Looking at the network inspector, the page loads all its data from a very big JSON file located at https://www.acc.co.nz/for-providers/treatment-recovery/work-type-detail-sheets/getSheets. To load all job data, you can use this script:
url = "https://www.acc.co.nz/for-providers/treatment-recovery/work-type-detail-sheets/getSheets"
import requests
import json
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get(url, headers=headers)
data = json.loads(r.text)
# To print all data in pretty form, uncomment this line:
# print(json.dumps(data, indent=4, sort_keys=True))
for d in data:
print(f'ID:\t{d["ID"]}')
print(f'Job Title:\t{d["JobTitle"]}')
print(f'Created:\t{d["Created"]}')
print('*' * 80)
# Available keys in this JSON:
# ClassName
# LastEdited
# Created
# ANZSCO
# JobTitle
# Description
# WorkTasks
# WorkEnvironment
# PhysicalMentalDemands
# Comments
# EntryRequirements
# Group
# ID
# RecordClassName
This prints:
ID: 2327
Job Title: Watch and Clock Maker and Repairer
Created: 2017-07-11 11:33:52
********************************************************************************
ID: 2328
Job Title: Web Administrator
Created: 2017-07-11 11:33:52
********************************************************************************
ID: 2329
Job Title: Welder
Created: 2017-07-11 11:33:52
...and so on
In the script I listed the available keys you can use to access your specific job data.
Parsing webpage with beautifulsoup to get dynamic content
This is probably handled by some Ajax calls, so it will not be in the page source.
You would need to monitor the network through the developer tools in the browser and look for the requests you are interested in,
e.g. a randomly picked request URL from this page:
http://ws.audioscrobbler.com/2.0/?api_key=73581584905631c5fc15720f03b0b9c8&format=json&callback=jQuery1703329798618797213_1380004055342&method=track.getSimilar&limit=10&artist=roxy%20music&track=while%20my%20heart%20is%20still%20beating&_=1380004055943
To see the response, enter the above URL in the browser and look at the content.
So you need to simulate the request in Python, and after you get the response you have to parse it for the interesting details.
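Responses fetched this way often come wrapped in a JSONP callback (note the callback=jQuery… parameter in the URL above). A minimal sketch for unwrapping such a reply into plain JSON — the sample string below is illustrative, not a real Last.fm response:

```python
import json
import re

def unwrap_jsonp(raw: str) -> dict:
    """Strip a 'callback({...})' JSONP wrapper and parse the JSON inside."""
    match = re.match(r'^[\w$.]+\((.*)\)\s*;?\s*$', raw, re.DOTALL)
    payload = match.group(1) if match else raw  # plain JSON passes through
    return json.loads(payload)

sample = 'jQuery1703329798618797213_1380004055342({"similartracks": {"track": []}})'
data = unwrap_jsonp(sample)
print(data['similartracks'])  # -> {'track': []}
```

Alternatively, dropping the callback parameter from the URL usually makes such endpoints return plain JSON, in which case the helper simply passes the body through.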
Beautiful Soup not finding element by ID
For dynamically created elements, give Selenium a try (the code below uses the Selenium 4 API, where find_element_by_* and executable_path have been removed):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=Service(r'C:\Program Files\ChromeDriver\chromedriver.exe'))
URL = 'https://www.forexfactory.com/news'
driver.get(URL)

Wait a few seconds so the dynamically loaded content is also present:

driver.implicitly_wait(5)  # wait up to 5 seconds

Get your element:

uiOuter = driver.find_element(By.ID, 'ui-outer')

Example for all links (story tiles):

aHref = driver.find_elements(By.CSS_SELECTOR, 'div.flexposts__story-title a')
[x.text for x in aHref]
Output
['EU\'s Barnier says "fundamental divergences" persist in UK trade talks',
'With end of crisis programs, Fed faces tricky post-pandemic transition',
'Markets Look Past Near-Term Challenges',
'Time is short, Divergences remain, but we continue to work hard for a deal',
'EURUSD consolidates around 1.1900; indicators are flat',
'Rush for emerging market company bonds as investors look beyond COVID-19',
'Europe’s Virus Lockdowns Push Economy Into Another Contraction',
'Interactive Brokers enhances Client Portal',
'BoE’s Haldane: Risk That Anxiety Leads To Gloom Loop',
'Sharpest fall in UK private sector output since May. Manufacturing growth offset by renewed...',
'Remote Working Shift Offers Silver Lining for Finance Gender Gap',
'EU Flash PMI signals steep downturn in November amid COVID-19 lockdowns',
'German PMI drops to five-month low in November due to tightening of COVID-19 restrictions, but...',
'Sharp decline in French business activity amid fresh COVID-19 lockdown',
'Rishi Sunak says Spending Review will not spell austerity',
'Remote Working Shift Offers Silver Lining for Finance Gender Gap',
'Japan’s Labor Thanksgiving Day goes viral',
'Ranking Asset Classes by Historical Returns (1985-2020)',
'Time is short, Divergences remain, but we continue to work hard for a deal',
'EURUSD consolidates around 1.1900; indicators are flat',
'US Dollar stuck near support, NZ$ strikes two-year high',
'US Dollar stuck near support, NZ$ strikes two-year high',
'Georgia confirms results in latest setback for Trump bid to overturn Biden win',
'Canada to roll over terms of EU trade deal with UK',
'Time is short, Divergences remain, but we continue to work hard for a deal',
'German PMI drops to five-month low in November due to tightening of COVID-19 restrictions, but...',
"COVID-19: 'No return to austerity', says chancellor as he hints at public sector pay freeze",
'EURUSD consolidates around 1.1900; indicators are flat',
'New Zealand Dollar May Rise as RBNZ Holds Fire on Negative Rates',
'Interactive Brokers enhances Client Portal']