Web scraping program cannot find element which I can see in the browser
The element you're interested in is generated dynamically after the initial page load: your browser executed JavaScript, made additional network requests, and so on in order to build the page. Requests is just an HTTP library, and as such will not do those things.
You could use a tool like Selenium to drive a real browser, or perhaps even analyze the network traffic for the data you need and make those requests directly.
Cannot web-scrape because the form element cannot be found
The data on this site is loaded dynamically, using JavaScript. If you dig into the XHR requests (using the Network tab of your browser's developer tools), you'll see how this information is loaded into the page. Note that the following assumes you're using Python; if not, you'll have to find an equivalent in another language.
import requests

# this is just a random drug from their list
target = 'ATORVASTATIN AS CALCIUM'

payload = {
    "val": target,
    "prescription": False,
    "healthServices": False,
    "pageIndex": 1,
    "orderBy": 0,
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Origin': 'https://israeldrugs.health.gov.il',
    'Connection': 'keep-alive',
    'Referer': 'https://israeldrugs.health.gov.il/',
}

# json= serializes the payload and sets the Content-Type header for us
response = requests.post(
    'https://israeldrugs.health.gov.il/GovServiceList/IDRServer/SearchByName',
    headers=headers,
    json=payload,
)

# parse the JSON response
meds = response.json()

# a random field from the 8th (random, again) drug in the response
meds['results'][7]['dragHebName']
output:
'טורבה 10'
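If you need more than the first page of results, the same request body carries a pageIndex field you can increment. A minimal sketch of building those payloads (the helper name and the page range are my own; only the field names are taken from the captured request above):

```python
import json

def build_search_payload(val, page_index):
    # field names match the JSON body captured from the site's XHR;
    # this helper function itself is hypothetical
    return {
        "val": val,
        "prescription": False,
        "healthServices": False,
        "pageIndex": page_index,
        "orderBy": 0,
    }

# payloads for the first three pages of results
payloads = [build_search_payload('ATORVASTATIN AS CALCIUM', i) for i in range(1, 4)]
for p in payloads:
    print(json.dumps(p))
```

Each payload would then be POSTed with requests.post(..., json=p) as shown above.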
Hidden element cannot be found by BeautifulSoup
Here's a minimal example which uses Selenium to control a web browser, which loads the page and runs its JavaScript.
Because JavaScript needs some time to add elements, I use time.sleep(10), but you can use special functions to wait for elements instead. See Waits in the Selenium documentation.
Because div.game-result is inside an <iframe>, I first have to find the iframe and switch to it. In the example I check all iframes, but you could use only all_iframes[1] to get the elements.
Selenium has many find_element_by_... and find_elements_by_... functions to search for elements in the HTML, so you could do it without BeautifulSoup.
import selenium.webdriver
from bs4 import BeautifulSoup
import time

driver = selenium.webdriver.Firefox()
driver.get("https://www.bet.co.za/bet-games/")
time.sleep(10)

all_iframes = driver.find_elements_by_tag_name('iframe')
print('len(all_iframes):', len(all_iframes))

for number, iframe in enumerate(all_iframes):
    print('--- iframe', number, '---')
    driver.switch_to.frame(iframe)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    samples = soup.find_all('div', {'class': 'game-result'})
    print('len(samples):', len(samples))
    for item in samples:
        print(item.get_text(separator=','))
    driver.switch_to.default_content()
Result:
len(all_iframes): 4
--- iframe 0 ---
len(samples): 0
--- iframe 1 ---
len(samples): 5
13,15,35,21,4
3,14,4,25,33
25,34,14,4,8
30,18,25,24,10
35,30,5,34,21
--- iframe 2 ---
len(samples): 0
--- iframe 3 ---
len(samples): 0
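The find_all call above doesn't depend on Selenium at all, so you can try the same selector logic on a static HTML snippet first. A small sketch (the HTML below is made up to mirror the page's structure, not taken from the real site):

```python
from bs4 import BeautifulSoup

# made-up HTML mirroring the structure of the page's game results
html = """
<div class="game-result"><span class="ball-item">13</span><span class="ball-item">15</span></div>
<div class="game-result"><span class="ball-item">3</span><span class="ball-item">14</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
samples = soup.find_all('div', {'class': 'game-result'})
for item in samples:
    # joins the text of the child <span> elements with commas
    print(item.get_text(separator=','))
```

This is handy for debugging: if the selector works on a saved copy of driver.page_source but not live, the problem is timing or the iframe, not the selector.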
EDIT: A similar version with only one iframe and without BeautifulSoup:
import selenium.webdriver
import time

driver = selenium.webdriver.Firefox()
driver.get("https://www.bet.co.za/bet-games/")
time.sleep(10)

all_iframes = driver.find_elements_by_tag_name('iframe')
driver.switch_to.frame(all_iframes[1])

all_samples = driver.find_elements_by_css_selector('div.game-result')
print('len(all_samples):', len(all_samples))

for sample in all_samples:
    all_balls = sample.find_elements_by_css_selector('span.ball-item')
    all_text = [ball.text for ball in all_balls]
    print(','.join(all_text))
Result:
len(all_samples): 5
13,1,12,2,10
13,14,33,26,4
21,18,12,9,4
13,15,35,21,4
3,14,4,25,33
BTW: Sometimes the page displays a video instead of these numbers, and then the code may give empty strings. It may need more complex code to wait for the end of the video.
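Instead of a fixed sleep, one way to handle that is to poll until the scrape returns something non-empty or a timeout expires. A generic sketch (the wait_for helper is my own, not a Selenium API; Selenium's WebDriverWait does something similar for element lookups):

```python
import time

def wait_for(fetch, timeout=30.0, interval=0.5):
    """Call fetch() repeatedly until it returns a truthy (non-empty)
    result or the timeout expires; returns the result or None."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch()
        if result:
            return result
        time.sleep(interval)
    return None

# usage with the Selenium code above would look something like:
# samples = wait_for(lambda: driver.find_elements_by_css_selector('div.game-result'))
print(wait_for(lambda: [13, 15, 35], timeout=1.0))
```

If the results stay empty past the timeout (e.g. the video is still playing), wait_for returns None and the caller can retry or give up.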
EDIT: To change the game, you have to find the link to Lucky 7 and click() it:
all_titles = driver.find_elements_by_css_selector('div.game-title')
all_titles[6].click()
Minimal working example
import selenium.webdriver
import time

driver = selenium.webdriver.Firefox()
driver.get("https://www.bet.co.za/bet-games/")
time.sleep(10)

all_iframes = driver.find_elements_by_tag_name('iframe')
driver.switch_to.frame(all_iframes[1])

all_titles = driver.find_elements_by_css_selector('div.game-title')
print('len(all_titles):', len(all_titles))

# click on the link to `Lucky 7`
all_titles[6].click()
time.sleep(1)

all_samples = driver.find_elements_by_css_selector('div.game-result')
print('len(all_samples):', len(all_samples))

for sample in all_samples:
    all_balls = sample.find_elements_by_css_selector('span.ball-item')
    all_text = [ball.text for ball in all_balls]
    print(','.join(all_text))
BTW: Normally you could also use the link text
link = driver.find_element_by_link_text('Lucky 7')
link.click()
but this element is not in an <a> tag, so that doesn't work. This, however, does work:
link = driver.find_element_by_xpath('//*[text()="Lucky 7"]')
link.click()
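The same idea of matching an element by its text content, independent of Selenium, can be tried with the standard library's ElementTree, whose limited XPath dialect uses [.='...'] instead of [text()='...'] (the XML snippet below is made up for illustration):

```python
import xml.etree.ElementTree as ET

# made-up markup: the clickable title is a <div>, not an <a>
root = ET.fromstring(
    '<games>'
    '<div class="game-title">Wheel</div>'
    '<div class="game-title">Lucky 7</div>'
    '</games>'
)

# ElementTree matches an element's complete text with [.='...'] (Python 3.7+)
link = root.find(".//div[.='Lucky 7']")
print(link.text)
```

This is only a way to experiment with text-based matching offline; on the live page you still need Selenium's find_element_by_xpath, since the element only exists after JavaScript runs.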
Scraping problem: Inspect Element different from View Page Source
Web scraping with BeautifulSoup cannot be done correctly if a page builds its DOM with JavaScript, and the page you are trying to scrape renders its data that way.
The difference between View Source and Inspect Element is due to the browser: View Source shows the raw HTML the server sent, while Inspect Element shows the DOM after the browser has executed JavaScript to make the page readable for users.
To sum up, you have to simulate the browser to get the data you are looking for. This can be done with Selenium; you can search for "web scraping with Selenium and Python" for more.
Here is a simple example of using Selenium and Python for web scraping:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

url = 'http://tsetmc.com/Loader.aspx?ParTree=151311&i=63917421733088077#'

# Firefox driver for Selenium, from: https://github.com/mozilla/geckodriver/releases
driver = webdriver.Firefox(executable_path=r'your-path\geckodriver.exe')
driver.get(url)
wait = WebDriverWait(driver, 10)
try:
    # wait for the page to load completely
    element = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "/html/body/div[4]/form/div[3]/div[2]/div[1]/div[2]/div[1]/table/tbody")))
    time.sleep(1)
finally:
    driver.quit()
This code will open Firefox; you have to put your own directory in the 'your-path\geckodriver.exe' section.
Pay attention to the comment about geckodriver: you need it to run Selenium with Firefox. You can search for more information about Selenium.