Navigating through pagination with Selenium in Python
Before automating any scenario, always write down the manual steps you would perform to execute it. The manual steps for what you want to do (as I understand from the question) are -
1) Go to site - https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList
2) Select first week option
3) Click search
4) Get the data from every page
5) Load the url again
6) Select second week option
7) Click search
8) Get the data from every page
.. and so on.
You have a loop that selects the different weeks, but inside each iteration of that loop you also need a loop that iterates over all the pages. Since you are not doing that, your code returns only the data from the first page.
Another problem is with how you are locating the 'Next' button -
driver.find_element_by_xpath('//*[@id="form1"]/div[3]/a[4]').click()
You are selecting the 4th <a> element, which is of course not robust because on different pages the Next button's index will be different. Instead, use this better locator -
driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()
Logic for creating the loop that will iterate through the pages -
First you will need the number of pages. I got that by locating the <a> element immediately before the "Next" button; in the pagination bar, the text of this element equals the total number of pages -
I did that using the following code -
number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)
Now that you have the number of pages as number_of_pages, you only need to click the "Next" button number_of_pages - 1 times!
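As a sanity check, the click count follows directly from starting on page 1; a tiny helper (the function name is mine, not from the code below) makes the arithmetic explicit:

```python
def next_clicks_needed(number_of_pages):
    # you start on page 1, so reaching the last page takes one fewer click
    return max(number_of_pages - 1, 0)

print(next_clicks_needed(5))  # 4 clicks walk you through pages 1 -> 5
print(next_clicks_needed(1))  # 0: a single page needs no clicking at all
```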
Final code for your main function -
def main():
    all_data = []
    select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
    list_options = select.options
    for item in range(len(list_options)):
        select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
        select.select_by_index(str(item))
        driver.find_element_by_css_selector("input.formbutton#csbtnSearch").click()
        number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)
        for j in range(number_of_pages - 1):
            all_data.extend(getData())
            driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()
            time.sleep(1)
        all_data.extend(getData())  # collect the last page too, which the loop above stops short of
        driver.get(url)
    with open('wiltshire.json', 'w+') as f:
        json.dump(all_data, f)
    driver.quit()
Navigate pagination with Selenium Webdriver
Since the urls are in the format BASE_URL+page=NUM_PAGE, you could simply get the maximum page number (7 in your case).
In that way you can build all the urls with something like:
BASE_URL = "https://dati.comune.milano.it/dataset?groups=heal"
MAX_PAGES = 7  # the maximum page number you found
urls = []
for page_num in range(1, MAX_PAGES + 1):  # + 1 so the last page is included
    urls.append(f"{BASE_URL}&page={page_num}")
In this way you'll have all the pages without having to click anything; you only need the maximum number of pages, which you can easily find as you already did.
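If you'd rather not hard-code the maximum page number, you can read it off the last page-bar link's href with the standard library. A minimal sketch, assuming the site uses a page query parameter as in the URLs above (the helper name is mine):

```python
from urllib.parse import urlparse, parse_qs

def max_page_from_href(href):
    # pull the "page" query parameter out of a pagination link;
    # default to 1 when the parameter is absent
    query = parse_qs(urlparse(href).query)
    return int(query.get("page", ["1"])[0])

# e.g. the href of the last link in the page bar
print(max_page_from_href("https://dati.comune.milano.it/dataset?groups=heal&page=7"))  # 7
```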
SELENIUM SOLUTION
There are probably a thousand ways and more to do this more cleanly, but this one works for me. Simply loop over the list of page numbers until you find the "active" one, as you said, and click the following one.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

BASE_URL = "https://dati.comune.milano.it/dataset?groups=heal&page=1"  # first page

driver = webdriver.Chrome(
    executable_path=ChromeDriverManager().install()
)
driver.get(BASE_URL)

url_list_xpath = "/html/body/div[2]/div/div[3]/div/section[1]/div[2]/ul"  # this is the page bar at the bottom
to_click = False
last_page = driver.find_element_by_xpath("/html/body/div[2]/div/div[3]/div/section[1]/div[2]/ul/li[5]/a") \
    .get_attribute("href")  # find last page
current_page = BASE_URL

# iterate over the urls and click the next url after the active one
while current_page != last_page:
    ul = driver.find_element_by_xpath(url_list_xpath)
    for li in ul.find_elements_by_tag_name("li"):
        if to_click:
            break  # li is now the element right after the active one
        if li.get_attribute("class") == 'active':
            to_click = True
    to_click = False
    current_page = li.find_elements_by_tag_name("a")[0].get_attribute("href")
    driver.get(current_page)
How to navigate through pagination? (Selenium)
You can wrap the code inside a while loop, with a variable pagination_starting_point whose initial value is set to 2; for each iteration we increase the counter.
Code :
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(30)
wait = WebDriverWait(driver, 30)
pagination_starting_point = 2  # page 1 is already loaded
while True:
    try:
        col1 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[1]')
        col2 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[2]')
        col3 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[3]')
        col4 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[4]')
        col5 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[5]')
        col6 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[6]')
        col7 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[7]')
        col8 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[8]')
        col9 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[9]')
        col1_data = [s.text for s in col1]
        col2_data = [s.text for s in col2]
        col3_data = [s.text for s in col3]
        col4_data = [s.text for s in col4]
        col5_data = [s.text for s in col5]
        col6_data = [s.text for s in col6]
        col7_data = [s.text for s in col7]
        col8_data = [s.text for s in col8]
        col9_data = [s.text for s in col9]
        wait.until(EC.element_to_be_clickable((By.XPATH, f"//a[contains(@href,'Page${pagination_starting_point}')]"))).click()
        print("Click on page " + str(pagination_starting_point))
        pagination_starting_point = pagination_starting_point + 1
        if pagination_starting_point == 45:
            break
    except:
        print("Looks like job done !")
        break
Imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Updated to use the xpath //a[contains(@href,'Page${pagination_starting_point}')] instead of the page-number link text.
Issue in navigating ASPX pagination using Selenium
Instead of clicking on every number, use the next button.
from selenium.common.exceptions import NoSuchElementException

driver.implicitly_wait(5)
driver.get("https://www.shoroukbookstores.com/books/publishing-house.aspx?id=429c2704-5fa3-43b0-9ad7-c2ec9062beb3")
while True:
    try:
        navigation_links = driver.find_element_by_xpath("//*[@id='Body_AspNetPager']").find_elements_by_tag_name("a")
        next_page = len(navigation_links) - 1  # the next-page button is second from the end
        driver.find_element_by_xpath(f"//*[@id='Body_AspNetPager']/a[{next_page}]").click()
        print("Next page clicked")
    except NoSuchElementException:
        print("No more pages")
        break
driver.quit()
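The "second from the end" index arithmetic can be sanity-checked without a browser. The link texts below are assumptions about the pager layout, not taken from the real page:

```python
# Assumed pager layout: page links, then a "next" arrow (>), then a "last" arrow (>>)
links = ["1", "2", "3", "10", ">", ">>"]
next_page = len(links) - 1   # XPath a[n] is 1-indexed, so a[next_page] targets ">"
print(links[next_page - 1])  # ">" (subtract 1 to index the 0-based Python list)
```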
How can i navigate through pagination using selenium python?
In such cases, it is always worth taking a closer look at the page to understand how the data is actually updated. I did so by opening the console in Firefox and looking at the XHR network traffic.
... interesting. The page gets its results from an endpoint we can identify. It returns json data, which is great:
{'totalJobs': 2541,
 'jobs': [{'location': [{'jobLocationID': 0,
                         'jobID': 24986,
                         'countryID': 0,
                         'country': 'Pakistan',
                         'cityID': None,
                         'cityText': 'Karachi',
                         'jobShiftID': 0,
                         'name': None}],
           'jobID': 24986,
           'jobIDEncrypted': '26cfb27ee6b2abad',
           'title': 'Marketing Officer - Freelancer',
           'jobDescription': '<p>We are growing, energetic, and highly-reputed Public Relation (PR) and Digital Marketing Agency.<br />\nCurrently, we are looking for ...
Let's use this to write our script:
import requests
import math
import json
import pandas as pd

# The scraping function
def getJobs(pageNumber):
    # Defining the headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
        'X-Requested-With': 'XMLHttpRequest',
        'Content-Type': 'application/json;charset=utf-8',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Referer': 'https://jobee.pk/jobs-in-pakistan',
        'Pragma': 'no-cache'
    }
    # Setting the right params for the request we will make, pageSize is set to 200 (results by page)
    data = {"model":{"titles":[],"cities":[],"shifts":[],"experinces":[],"careerLevels":[],"functionalAreas":[],"genders":[],"industries":[],"degreeLevels":[],"companies":[]},"pageNumber":1,"pageSize":200}
    # Updating the page number
    data['pageNumber'] = pageNumber
    data = json.dumps(data)
    # Collecting the results
    response = requests.post('https://jobee.pk/job/jobsearch', headers=headers, data=data)
    # Just in case an error shows up
    try:
        return json.loads(response.content)
    except:
        return {'jobs': []}

# First, get the total number of pages from page 1
data = getJobs(1)
totalJobs = data['totalJobs']
number_of_pages = math.ceil(totalJobs / 200)

# Initializing our job list
jobs_list = []

# Looping through the pages
for pageNumber in range(1, number_of_pages + 1):
    results = getJobs(pageNumber)
    # If no results, we end the loop
    if len(results['jobs']) == 0:
        break
    else:
        # Append the results under the 'jobs' key to our list
        jobs_list += results['jobs']
        print('Page', pageNumber, '-', len(jobs_list), "jobs collected")

# Let's have a look at the data in a dataframe
df = pd.DataFrame(jobs_list)
print(df)
Output
Page 1 - 200 jobs collected
Page 2 - 400 jobs collected
Page 3 - 600 jobs collected
...
+----+----------------------+--------------------+-------------+----------------------+------------------+----------------------------------------------------+--------+-------------------+----------------------------------------------------+----------------+--------------------+--------------------------+--------------------------+----------------+----------------------------------------------------+--------------------------------------------+----------------------------------------+-----------+
| | appliedByDate | companyName | experience | expiredDate | isSalaryVisible | jobDescription | jobID | jobIDEncrypted | location | logo | numberOfPositions | postDate | publishDate | salaryRange | skills | title | titleWithoutSpecialCharacters | viewCount |
+----+----------------------+--------------------+-------------+----------------------+------------------+----------------------------------------------------+--------+-------------------+----------------------------------------------------+----------------+--------------------+--------------------------+--------------------------+----------------+----------------------------------------------------+--------------------------------------------+----------------------------------------+-----------+
| 0 | 0001-01-01T00:00:00 | Custom House | Fresh | 2019-09-19T00:00:00 | True | <p>We require Mean Stack Developer Interns who... | 27925 | a0962bea0bc174a1 | [{'jobLocationID': 0, 'jobID': 27925, 'country... | 14564Logo.jpg | 3 | 2019-06-21T14:04:01.363 | 2019-06-21T19:26:24.213 | 5000 - 10000 | [AngularJs, Mongo DB, JavaScript, Node Js, Mea... | Mean Stack Developer - Intern | Mean-Stack-Developer-Intern | 10 |
| 1 | 0001-01-01T00:00:00 | Custom House | Fresh | 2019-09-19T00:00:00 | True | <p>We requires SEO, Digital Marketing and Grap... | 27924 | 81e4e7f7d672dffd | [{'jobLocationID': 0, 'jobID': 27924, 'country... | 14564Logo.jpg | 2 | 2019-06-21T14:00:26.45 | 2019-06-21T19:25:04.493 | 5000 - 10000 | [Graphic Design, Search Engine Optimization (S... | SEO Executive / Graphic Designer - Intern | SEO-Executive-Graphic-Designer-Intern | 10 |
| 2 | 0001-01-01T00:00:00 | Printoscan Lahore | 1 Year | 2019-09-19T00:00:00 | True | <p>We require an <strong>Accounts Assistant / ... | 27923 | 137a257e9e5bbb5d | [{'jobLocationID': 0, 'jobID': 27923, 'country... | None | 1 | 2019-06-21T13:59:37.373 | 2019-06-21T19:19:07.36 | 15000 - 20000 | [Accounts Services, Administrative Skills, Acc... | Accounts Assistant / Administrator | Accounts-Assistant-Administrator | 6 |
+----+----------------------+--------------------+-------------+----------------------+------------------+----------------------------------------------------+--------+-------------------+----------------------------------------------------+----------------+--------------------+--------------------------+--------------------------+----------------+----------------------------------------------------+--------------------------------------------+----------------------------------------+-----------+
This is what we wanted.
Trouble Navigating To the Next Page Using Python Selenium
The following calculates the number of pages based on the result count and the known max number of results per page.
It loops through clicking on the appropriate href containing this page number. Where this number is not visible, the raised exception is handled and the initial pagination ellipsis is clicked to reveal the page.
I print the first td of the first tr, on pages greater than 1, to show that the page has been visited. I also swap out hard-coded waits for wait conditions.
I have used ChromeDriver.
This is to give you a framework to use. I tested it and it ran for all region selections and pages.
import time
import math
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException, NoSuchElementException

results_per_page = 50
url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Chrome()  # FireFox()
browser.get(url)
print("Retrieving the site...")

# All regions available
regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada)', 'Middle East and Northern Africa', 'South Asia']

for region in regions:
    # Select all options from drop down menu
    selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))
    print("Now constructing output for: " + region)
    # Select table and wait for data to populate
    selectOption.select_by_visible_text(region)
    WebDriverWait(browser, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#MainContent_gvUtilities tr > td')))
    num_results = int(browser.find_element_by_id('MainContent_lblqResults').text)
    num_pages = math.ceil(num_results / results_per_page)
    print(f'pages to scrape are: {num_pages}')
    for page in range(2, num_pages + 1):
        print(f'visiting page {page}')
        try:
            browser.find_element_by_css_selector(f'.pagination > li > [href*="Page\${page}"]').click()
            WebDriverWait(browser, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#MainContent_gvUtilities tr > td')))
            print(browser.find_element_by_css_selector('#MainContent_gvUtilities tr:nth-child(2) span').text)
        except NoSuchElementException:
            # page number hidden behind the ellipsis: click it to reveal more pages
            browser.find_element_by_css_selector('.pagination > li > a').click()
        except Exception as e:
            print(e)
            continue
how to handle pagination and scrape using selenium
If I understand your question in the right way, the following should do it:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.amazon.in/Skybags-Brat-Black-Casual-Backpack/dp/B08Z1HHHTD/ref=sr_1_2?dchild=1&keywords=skybags&qid=1627786382&sr=8-2')

product_title = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "productTitle"))).text
print(product_title)

WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@data-hook='see-all-reviews-link-foot']"))).click()

while True:
    for item in WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-hook='review']"))):
        reviewer = item.find_element_by_css_selector("span.a-profile-name").text
        review = ' '.join([i.text.strip() for i in item.find_elements_by_xpath(".//span[@data-hook='review-body']")])
        print(reviewer, review)
    try:
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//*[@data-hook='pagination-bar']//a[contains(@href,'/product-reviews/') and contains(text(),'Next page')]"))).click()
        WebDriverWait(driver, 10).until(EC.staleness_of(item))
    except Exception as e:
        break
driver.quit()