Web Scraping Program Cannot Find Element Which I Can See in the Browser

The element you're interested in is generated dynamically after the initial page load, which means your browser executed JavaScript, made additional network requests, etc. in order to build the page. Requests is just an HTTP library, and as such will not do any of those things.

You could use a tool like Selenium, or perhaps even analyze the network traffic for the data you need and make the requests directly.
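A quick way to see the difference is to parse the raw HTML an HTTP library would receive and look for the element. The fragment below is a made-up illustration: the script would only create the element once a browser runs it.

```python
from bs4 import BeautifulSoup

# Made-up initial HTML, as an HTTP library would receive it: the
# <div class="result"> only exists after a browser runs the script.
initial_html = """
<html><body>
  <div id="app"></div>
  <script>/* would inject <div class="result">42</div> into #app */</script>
</body></html>
"""

soup = BeautifulSoup(initial_html, "html.parser")

# The element exists in the browser's DOM, but not in the raw HTML,
# so BeautifulSoup cannot find it.
print(soup.find("div", {"class": "result"}))  # None
```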

Cannot web-scrape because the form element cannot be found

The data on this site is loaded dynamically, using JavaScript. If you dig into the XHR requests (using the Network tab of your browser's developer tools), you'll see how this information is loaded into the page. By the way, the following assumes you're using Python; if not, you'll have to find an equivalent in another language.

import requests
import json

target = 'ATORVASTATIN AS CALCIUM'  # this is just a random drug from their list
data = json.dumps({
    'val': target,
    'prescription': False,
    'healthServices': False,
    'pageIndex': 1,
    'orderBy': 0,
})

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
    'Accept': 'application/json, text/plain, */*',
    'Accept-Language': 'en-US,en;q=0.5',
    'Content-Type': 'application/json',
    'Origin': 'https://israeldrugs.health.gov.il',
    'Connection': 'keep-alive',
    'Referer': 'https://israeldrugs.health.gov.il/',
}

response = requests.post('https://israeldrugs.health.gov.il/GovServiceList/IDRServer/SearchByName', headers=headers, data=data)

# load the JSON response
meds = json.loads(response.text)
# the Hebrew name of the 8th (again, a random pick) drug in the response
print(meds['results'][7]['dragHebName'])

Output:

'טורבה 10'
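Once parsed, the response is just nested dicts and lists. A small offline sketch with a canned response (only the field names 'results' and 'dragHebName' are taken from the code above; the values are made up):

```python
import json

# Canned JSON shaped like the server's response; the values are
# placeholders for illustration.
canned = '{"results": [{"dragHebName": "Example 1"}, {"dragHebName": "Example 2"}]}'

parsed = json.loads(canned)
for drug in parsed['results']:
    print(drug['dragHebName'])
```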

Hidden element cannot be found by BeautifulSoup

A minimal example which uses Selenium to control a web browser, which loads the page and runs the JavaScript.

Because JavaScript needs some time to add elements, I use time.sleep(10), but you could use a dedicated function that waits for elements instead. See Waits in the Selenium documentation.

Because div.game-result is inside an <iframe>, I first have to find the iframe and switch to it. In the example I check all iframes, but you could use only all_iframes[1] to get the elements.

Selenium has many find_element_by_... and find_elements_by_... functions to search for elements in the HTML (in Selenium 4 these were replaced by find_element(By..., ...)), so you could do this without BeautifulSoup.

import selenium.webdriver
from bs4 import BeautifulSoup
import time

driver = selenium.webdriver.Firefox()
driver.get("https://www.bet.co.za/bet-games/")

time.sleep(10)

all_iframes = driver.find_elements_by_tag_name('iframe')
print('len(all_iframes):', len(all_iframes))

for number, iframe in enumerate(all_iframes):
    print('--- iframe', number, '---')

    driver.switch_to.frame(iframe)

    soup = BeautifulSoup(driver.page_source, "html.parser")
    samples = soup.find_all('div', {'class': 'game-result'})
    print('len(samples):', len(samples))

    for item in samples:
        print(item.get_text(separator=','))

    driver.switch_to.default_content()

Result:

len(all_iframes): 4
--- iframe 0 ---
len(samples): 0
--- iframe 1 ---
len(samples): 5
13,15,35,21,4
3,14,4,25,33
25,34,14,4,8
30,18,25,24,10
35,30,5,34,21
--- iframe 2 ---
len(samples): 0
--- iframe 3 ---
len(samples): 0
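The div.game-result / span.ball-item parsing above can also be sketched with BeautifulSoup alone, once you have the HTML. The static fragment below is made up to mimic the structure described; the real page only contains it after JavaScript has run.

```python
from bs4 import BeautifulSoup

# Made-up fragment mimicking the page structure described above.
html = """
<div class="game-result">
  <span class="ball-item">13</span><span class="ball-item">15</span>
</div>
<div class="game-result">
  <span class="ball-item">3</span><span class="ball-item">14</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

for item in soup.select("div.game-result"):
    balls = [span.get_text() for span in item.select("span.ball-item")]
    print(",".join(balls))  # prints 13,15 then 3,14
```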

EDIT: A similar version with only one iframe and without BeautifulSoup:

import selenium.webdriver
import time

driver = selenium.webdriver.Firefox()
driver.get("https://www.bet.co.za/bet-games/")

time.sleep(10)

all_iframes = driver.find_elements_by_tag_name('iframe')
driver.switch_to.frame(all_iframes[1])

all_samples = driver.find_elements_by_css_selector('div.game-result')
print('len(all_samples):', len(all_samples))

for sample in all_samples:
    all_balls = sample.find_elements_by_css_selector('span.ball-item')
    all_text = [ball.text for ball in all_balls]
    print(','.join(all_text))

Result:

len(all_samples): 5
13,1,12,2,10
13,14,33,26,4
21,18,12,9,4
13,15,35,21,4
3,14,4,25,33

BTW: Sometimes the page displays a video instead of these numbers, and then the code may give empty strings. It may need more complex code that waits for the end of the video.
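The "wait for something" idea can be sketched as a generic polling helper (a hypothetical function, not part of Selenium; Selenium's own WebDriverWait works on the same principle):

```python
import time

def wait_until(condition, timeout=10, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout elapses.

    Returns the truthy value, or None on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        value = condition()
        if value:
            return value
        time.sleep(interval)
    return None
```

With Selenium you would pass a lambda that reads the page, e.g. `wait_until(lambda: driver.find_elements_by_css_selector('div.game-result'))`, which keeps polling while the list is empty.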


EDIT:

To change the game, you have to find the link to Lucky 7 and click() it:

all_titles = driver.find_elements_by_css_selector('div.game-title')
all_titles[6].click()

Minimal working example

import selenium.webdriver
import time

driver = selenium.webdriver.Firefox()
driver.get("https://www.bet.co.za/bet-games/")

time.sleep(10)

all_iframes = driver.find_elements_by_tag_name('iframe')
driver.switch_to.frame(all_iframes[1])

all_titles = driver.find_elements_by_css_selector('div.game-title')
print('len(all_titles):', len(all_titles))
# click on link to `Lucky 7`
all_titles[6].click()
time.sleep(1)

all_samples = driver.find_elements_by_css_selector('div.game-result')
print('len(all_samples):', len(all_samples))

for sample in all_samples:
    all_balls = sample.find_elements_by_css_selector('span.ball-item')
    all_text = [ball.text for ball in all_balls]
    print(','.join(all_text))

BTW:

Normally you could also search by the link text

link = driver.find_element_by_link_text('Lucky 7')
link.click()

but this element is not inside an <a> tag, so that doesn't work.

But this works:

link = driver.find_element_by_xpath('//*[text()="Lucky 7"]')
link.click()
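The text-matching idea behind that XPath can be tried offline with the standard library's ElementTree, whose limited XPath dialect uses [.='text'] instead of text()= (the fragment below is made up):

```python
import xml.etree.ElementTree as ET

# Made-up fragment: the clickable title is a <div>, not an <a>.
root = ET.fromstring(
    '<root>'
    '<a href="#">Spin and Win</a>'
    '<div class="game-title">Lucky 7</div>'
    '</root>'
)

# [.='Lucky 7'] matches elements by complete text content -- the
# ElementTree equivalent of //*[text()="Lucky 7"] in full XPath.
match = root.find(".//*[.='Lucky 7']")
print(match.tag)  # div
```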

Scraping problem: Inspect Element different from View Page Source

Web scraping with BeautifulSoup cannot be done correctly when a page builds its DOM elements with JavaScript, and the page you are trying to scrape does exactly that.
The difference between View Source and Inspect Element is due to the browser: View Source shows the raw HTML as delivered, while Inspect Element shows the DOM after the browser has run the JavaScript.
To sum up, you have to simulate the browser to get the data you are looking for. This can be done with Selenium; you can search for "web scraping with Selenium and Python" for details.

Here is a simple example of using Selenium and Python for web scraping:

import time

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

url = 'http://tsetmc.com/Loader.aspx?ParTree=151311&i=63917421733088077#'

#firefox driver for selenium from: https://github.com/mozilla/geckodriver/releases

driver = webdriver.Firefox(executable_path=r'your-path\geckodriver.exe')
driver.get(url)

wait = WebDriverWait(driver, 10)

try:
    # wait for the page to load completely
    element = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "/html/body/div[4]/form/div[3]/div[2]/div[1]/div[2]/div[1]/table/tbody")))
    time.sleep(1)
finally:
    driver.quit()

This code will open Firefox; replace 'your-path\geckodriver.exe' with the actual path to geckodriver on your machine.
Pay attention to the comment about geckodriver: you need it to run Selenium with Firefox.

You can search for more information about Selenium.


