Scrape Multiple Pages with BeautifulSoup and Python

Scraping multiple pages using BeautifulSoup

Here is the working solution:

import requests
from bs4 import BeautifulSoup

for page in range(1, 65):
    url = "https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/{page}.html".format(page=page)
    # print(url)
    response = requests.get(url)  # renamed from `page` so the loop variable isn't overwritten
    soup = BeautifulSoup(response.content, "html.parser")
    lists = soup.select("div#simulacion_tabla ul")

    for lis in lists:
        title = lis.find('li', class_="col1").text
        location = lis.find('li', class_="col2").text
        province = lis.find('li', class_="col3").text
        info = [title, location, province]
        print(info)
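
If you want to keep the rows instead of just printing them, here is a minimal sketch that writes them to a CSV file (it uses the same selectors as above; the filename is only an example):

import csv
import requests
from bs4 import BeautifulSoup

# "granjas.csv" is just an example filename
with open("granjas.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "location", "province"])
    for page_num in range(1, 65):
        url = "https://www.expansion.com/empresas-de/ganaderia/granjas-en-general/{}.html".format(page_num)
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        for lis in soup.select("div#simulacion_tabla ul"):
            writer.writerow([
                lis.find("li", class_="col1").text,
                lis.find("li", class_="col2").text,
                lis.find("li", class_="col3").text,
            ])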

Scrape multiple pages with BeautifulSoup and Python

The trick here is to check the requests that are made when you click the link to view the other pages. The way to check this is to use Chrome's inspection tool (press F12) or to install the Firebug extension in Firefox. I will be using Chrome's inspection tool in this answer. See below for my settings.

Sample Image

Now, what we want to see is either a GET request to another page or a POST request that changes the page. While the tool is open, click on a page number. For a really brief moment, there will only be one request that will appear, and it's a POST method. All the other elements will quickly follow and fill the page. See below for what we're looking for.

Sample Image

Click on the above POST method. It should bring up a sub-window of sorts that has tabs. Click on the Headers tab. This page lists the request headers, which are pretty much the identification details that the other side (the site, in this case) needs from you to be able to handle your request (someone else can explain this much better than I can).

Whenever the URL has variables like page numbers, location markers, or categories, more often than not, the site uses query strings. Long story short, a query string is a set of key-value parameters appended to the URL that the site uses to pull the information you need (on the server side they often end up as parameters in a database query). If this is the case, you can check the request headers for the query string parameters. Scroll down a bit and you should find them.

Sample Image

As you can see, the query string parameters match the variables in our URL. A little bit below, you can see Form Data with pageNum: 2 beneath it. This is the key.

POST requests are more commonly known as form requests because they are the kind of requests made when you submit forms, log in to websites, and so on; basically, pretty much anything where you have to submit information. What most people don't notice is that POST requests have a URL that they follow. A good example of this is when you log in to a website and, very briefly, see your address bar morph into some sort of gibberish URL before settling on /index.html or some such.

What the above paragraph basically means is that you can (though not always) append the form data to your URL, and the site will carry out the request as if you had submitted the form. To know the exact string you have to append, click on view source.
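
In code, the same idea looks roughly like this; a minimal sketch using the params keyword of requests, which assembles the query string (including the pageNum from the form data) for you:

import requests

base_url = "http://my.gwu.edu/mod/pws/courses.cfm"
# Equivalent to appending ?campId=1&termId=201501&subjId=ACCY&pageNum=2 by hand
params = {"campId": 1, "termId": "201501", "subjId": "ACCY", "pageNum": 2}
r = requests.get(base_url, params=params)
print(r.url)  # shows the fully assembled URL that was actually requested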

Sample Image

Test if it works by adding it to the URL.

Sample Image

Et voila, it works. Now, the real challenge: getting the last page automatically and scraping all of the pages. Your code is pretty much there. The only things remaining to be done are getting the number of pages, constructing a list of URLs to scrape, and iterating over them.

Modified code is below:

from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

base_url = 'http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY'
r = rq.get(base_url)

soup = bsoup(r.text, "html.parser")
# Use regex to isolate only the links of the page numbers, the ones you click on.
page_count_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
try:  # Make sure there is more than one page; otherwise, set it to 1.
    num_pages = int(page_count_links[-1].get_text())
except IndexError:
    num_pages = 1

# Add 1 because of how Python's range works.
url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]

# Open the text file. Use with to save self from grief.
with open("results.txt", "w") as acct:
    for url_ in url_list:
        print("Processing {}...".format(url_))
        r_new = rq.get(url_)
        soup_new = bsoup(r_new.text, "html.parser")
        for tr in soup_new.find_all('tr', align='center'):
            stack = []
            for td in tr.find_all('td'):
                stack.append(td.text.replace('\n', '').replace('\t', '').strip())
            acct.write(", ".join(stack) + '\n')

We use regular expressions to get the proper links. Then, using a list comprehension, we build a list of URL strings. Finally, we iterate over them.

Results:

Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=1...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=2...
Processing http://my.gwu.edu/mod/pws/courses.cfm?campId=1&termId=201501&subjId=ACCY&pageNum=3...
[Finished in 6.8s]

Sample Image

Hope that helps.

EDIT:

Out of sheer boredom, I think I just created a scraper for the entire class directory. Also, I updated both the code above and the code below so they do not error out when only a single page is available.

from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

spring_2015 = "http://my.gwu.edu/mod/pws/subjects.cfm?campId=1&termId=201501"
r = rq.get(spring_2015)
soup = bsoup(r.text, "html.parser")
classes_url_list = [c["href"] for c in soup.find_all("a", href=re.compile(r".*courses.cfm\?campId=1&termId=201501&subjId=.*"))]
print(classes_url_list)

with open("results.txt", "w") as acct:
    for class_url in classes_url_list:
        base_url = "http://my.gwu.edu/mod/pws/{}".format(class_url)
        r = rq.get(base_url)

        soup = bsoup(r.text, "html.parser")
        # Use regex to isolate only the links of the page numbers, the ones you click on.
        page_count_links = soup.find_all("a", href=re.compile(r".*javascript:goToPage.*"))
        try:
            num_pages = int(page_count_links[-1].get_text())
        except IndexError:
            num_pages = 1

        # Add 1 because of how Python's range works.
        url_list = ["{}&pageNum={}".format(base_url, str(page)) for page in range(1, num_pages + 1)]

        # The results file is already open from the with block above.
        for url_ in url_list:
            print("Processing {}...".format(url_))
            r_new = rq.get(url_)
            soup_new = bsoup(r_new.text, "html.parser")
            for tr in soup_new.find_all('tr', align='center'):
                stack = []
                for td in tr.find_all('td'):
                    stack.append(td.text.replace('\n', '').replace('\t', '').strip())
                acct.write(", ".join(stack) + '\n')

Scraping: scrape multiple pages in a loop (BeautifulSoup)

I see that the url you are using belongs to page 1 only.

https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=1

Are you changing it anywhere in your code? If not, then no matter what you fetch, it will fetch from page 1 only.

You should do something like this:

import os
import time
from bs4 import BeautifulSoup
from selenium import webdriver

for page in range(1, pages_number + 1):
    chromedriver = "./chromedriver"
    os.environ["webdriver.chrome.driver"] = chromedriver
    driver = webdriver.Chrome(chromedriver)

    # initial link
    link = f"https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page={page}"
    driver.get(link)
    time.sleep(15)
    data = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    soup_complete_source = BeautifulSoup(data.encode('utf-8'), "lxml")
    driver.close()

Test Output (not the soup part) - for pages_number = 3 (stored urls in a list, for easy view):

['https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=1', 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=2', 'https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page=3']

Process finished with exit code 0
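
As a side note, the snippet above launches a fresh Chrome instance for every page. A minimal sketch of reusing a single driver for all pages (same URL pattern as above; not tested against the site):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

pages_number = 3  # as in the test output above
driver = webdriver.Chrome()  # assumes chromedriver is on your PATH (or resolved by Selenium Manager)
soups = []
for page in range(1, pages_number + 1):
    driver.get(f"https://www.vivareal.com.br/aluguel/sp/sao-paulo/?__vt=lnv:a&page={page}")
    time.sleep(15)  # same crude wait as the original; an explicit WebDriverWait would be more robust
    soups.append(BeautifulSoup(driver.page_source, "lxml"))
driver.quit()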

BeautifulSoup - Scrape multiple pages

Just two changes were needed to get it to scrape everything.

  1. r = requests.get("https://www.bodia.com/spa-members/page/"+ format(i)) needs to be changed to r = requests.get("https://www.bodia.com/spa-members/page/{}".format(i)). Your use of format was incorrect.

  2. You were not looping over all the code, so the result was that it only printed out one set of names and then had no way to return to the start of the loop. Indenting everything under the for loop fixed that.

import requests
from bs4 import BeautifulSoup

for i in range(1, 4):  # to scrape names of pages 1 to 3
    r = requests.get("https://www.bodia.com/spa-members/page/{}".format(i))
    soup = BeautifulSoup(r.text, "html.parser")
    lights = soup.findAll("span", {"class": "light"})
    lights_list = []
    for l in lights[0:]:
        result = l.text.strip()
        lights_list.append(result)

    print(lights_list)

The above code was spitting out a list of names every 3 seconds for the pages it scraped.

How to scrape multiple pages of a site using paging using BeautifulSoup and requests?

If your code is working for a single page, then with a little change it will work on the next pages as well. Just change the page number in the URL, since ask.com supports it.

import requests
from bs4 import BeautifulSoup as bs

def search(request):
    if request.method == 'POST':
        search = request.POST['search']
        max_pages_to_scrap = 5
        final_result = []
        for page_num in range(1, max_pages_to_scrap + 1):
            url = "https://www.ask.com/web?q=" + search + "&qo=pagination&page=" + str(page_num)
            res = requests.get(url)
            soup = bs(res.text, 'lxml')
            result_listings = soup.find_all('div', {'class': 'PartialSearchResults-item'})

            for result in result_listings:
                result_title = result.find(class_='PartialSearchResults-item-title').text
                result_url = result.find('a').get('href')
                result_desc = result.find(class_='PartialSearchResults-item-abstract').text

                final_result.append((result_title, result_url, result_desc))

        context = {'final_result': final_result}
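        # A Django view would normally end by returning a rendered response here, e.g.
        # return render(request, 'results.html', context)
        # ('results.html' is a hypothetical template name, not something from the question)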

Web-Scraping a list with multiple pages with beautifulsoup

To get all pages, you can use the following example:

import requests
from bs4 import BeautifulSoup

headers = {
    "Wicket-Ajax": "true",
    "Wicket-Ajax-BaseURL": "antrag/ba/baantraguebersicht?0",
}


def get_info(soup):
    rv = []
    for lg in soup.select("li.list-group-item"):
        title = " ".join(lg.select_one(".headline-link").text.split("\r\n"))
        ba = lg.select_one(
            '.keyvalue-key:-soup-contains("Beschlossen am:") + div'
        ).text.strip()
        # get other info here
        # ...
        rv.append((title, ba))
    return rv


with requests.session() as s:
    # the first page:
    page = s.get("https://risi.muenchen.de/risi/antrag/ba/baantraguebersicht?0")
    soup = BeautifulSoup(page.content, "html.parser")

    counter = 1
    while True:

        for title, ba in get_info(soup):
            print(counter, title, ba)
            counter += 1

        # is there a next page?
        tag = soup.select_one('[title="Eine Seite vorwärts gehen"]')

        if not tag:
            # no, we are done here:
            break

        headers["Wicket-FocusedElementId"] = tag["id"]

        page = s.get(
            "https://risi.muenchen.de/risi/antrag/ba/baantraguebersicht?0-1.0-color_container-list-cardheader-nav_top-next",
            headers=headers,
        )
        soup = BeautifulSoup(page.content, "xml")
        soup = BeautifulSoup(
            soup.select_one("ajax-response").text, "html.parser"
        )

Prints:

1 Auskunft über geplante Wohnbebauung westlich der Drygalski-Allee 02.08.2022
2 Bestellung einer städtischen Leistung: Finanzierung von Ferien- und Familienpässen für Einrichtun... 02.08.2022
3 Verzögerungen bei der Verlegung von Glasfaserkabeln im 19. Stadtbezirk 02.08.2022
4 Virtuelle Tagungsmöglichkeiten für Unterausschüsse weiter ermöglichen 02.08.2022
5 Offene Fragen zur Schließung des Maria-Einsiedel-Bades 02.08.2022

...

139 Bestellung einer städtischen Leistung, hier: Topo-Box-Einsatz in der Schröfelhofstraße / Ossinger... 11.07.2022
140 Bestellung einer städtischen Leistung, hier: Topo-Box-Einsatz in der Pfingstrosenstraße 11.07.2022
141 Bestellung einer städtischen Leistung; hier: Topo-Box-Einsatz in der Alpenveilchenstraße 11.07.2022

EDIT: To select a "Wahlperiod"

import requests
from bs4 import BeautifulSoup

headers = {
    "Wicket-Ajax": "true",
    "Wicket-Ajax-BaseURL": "antrag/ba/baantraguebersicht?0",
}


def get_info(soup):
    rv = []
    for lg in soup.select("li.list-group-item"):
        title = " ".join(lg.select_one(".headline-link").text.split("\r\n"))
        ba = lg.select_one(
            '.keyvalue-key:-soup-contains("Beschlossen am:") + div'
        ).text.strip()
        # get other info here
        # ...
        rv.append((title, ba))
    return rv


with requests.session() as s:
    # the first page:
    page = s.get("https://risi.muenchen.de/risi/antrag/ba/baantraguebersicht?0")
    soup = BeautifulSoup(page.content, "html.parser")

    # select the "Wahlperiode 01.05.2008 bis 30.04.2014"
    tag = soup.select_one(
        '.dropdown-item[title="Belegt die Datumsfelder mit dem Datumsbereich von 01.05.2008 bis 30.04.2014"]'
    )

    headers["Wicket-FocusedElementId"] = tag["id"]

    # for a different Wahlperiode, change the `periodenEintrag-2` part
    page = s.get(
        "https://risi.muenchen.de/risi/antrag/ba/baantraguebersicht?0-1.0-form-periodeButton-periodenEintrag-2-periode=&_=1660125170317",
        headers=headers,
    )

    # reload the first page with the new data:
    page = s.get("https://risi.muenchen.de/risi/antrag/ba/baantraguebersicht?0")
    soup = BeautifulSoup(page.content, "html.parser")

    counter = 1
    while True:

        for title, ba in get_info(soup):
            print(counter, title, ba)
            counter += 1

        # is there a next page?
        tag = soup.select_one('[title="Eine Seite vorwärts gehen"]')

        if not tag:
            # no, we are done here:
            break

        headers["Wicket-FocusedElementId"] = tag["id"]

        page = s.get(
            "https://risi.muenchen.de/risi/antrag/ba/baantraguebersicht?0-2.0-color_container-list-cardheader-nav_top-next",
            headers=headers,
        )
        soup = BeautifulSoup(page.content, "xml")
        soup = BeautifulSoup(
            soup.select_one("ajax-response").text, "html.parser"
        )

Prints:


...

1836 Willkommen auf der Wärmeinsel München - begünstigt durch Nachverdichtung und Versiegelung in den ... 24.05.2012
1837 Kontingentplätze in Kitas der freien Träger 24.05.2012
1838 Brachliegende Grundstücke in der Messestadt 24.05.2012
1839 Mut zur nachhaltigen Gestaltung - externe kleinteilige B-Pläne zulassen 24.05.2012
1840 Grundstücksverkauf 4. Bauabschnitt Wohnen in der Messestadt 24.05.2012

...

How to scrape multiple pages of search results with Beautiful Soup

Add a loop around your whole snippet that scrapes one of the tables, and increment the URL's start value by 25. In the snippet below I just made a counter variable that starts at zero and gets incremented by 25 on each pass. The code will break out of the loop when the response to the request is no longer valid, meaning you hit an error or the end of your search results. You could modify that statement to break only on a 404, print the error, etc.

The code below is not tested; it's just a demonstration of the concept.

import requests
from bs4 import BeautifulSoup

blah = []

url = 'https://www.prosportstransactions.com/basketball/Search/SearchResults.php?Player=&Team=Celticss&PlayerMovementChkBx=yes&submit=Search&start='

counter = 0

while True:
    # build the page URL fresh each time instead of appending to `url` repeatedly
    page_url = url + str(counter)
    webpage = requests.get(page_url)

    if webpage.status_code != 200:
        break

    content = webpage.content
    soup = BeautifulSoup(content, "html.parser")

    for item in soup.find_all('tr'):
        for value in item.find_all('td'):
            gm = value.text
            blah.append(gm)

    counter += 25

Scraping Multiple Pages without manually getting the amount of pages

You can use the pagination class to fetch the last a tag, read its data-pagenumber attribute, and then use that number to build all the page links. Follow the code below to get it done.

Code:

import requests
from bs4 import BeautifulSoup

# url = "https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467"
url = "https://www.property24.com/for-sale/woodstock/cape-town/western-cape/10164"
data = requests.get(url)
soup = BeautifulSoup(data.content, "html.parser")
noofpages = soup.find("ul", {"class": "pagination"}).find_all("a")[-1]["data-pagenumber"]
for i in range(1, int(noofpages) + 1):
    print(f"{url}/p{i}")

Output:
Sample Image
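
The loop above only prints the page URLs; a minimal sketch of actually requesting each one, continuing from the variables defined in the snippet above (generic parsing only, since the listing selectors are not part of the question):

for i in range(1, int(noofpages) + 1):
    page_url = f"{url}/p{i}"
    page_soup = BeautifulSoup(requests.get(page_url).content, "html.parser")
    # pull whatever you need from page_soup here; as a placeholder, just count the links on the page
    print(page_url, len(page_soup.find_all("a")))
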
Let me know if you have any questions :)



Related Topics



Leave a reply



Submit