Unable to Scrape Content from a Website

Unable to scrape data from a website with Python

If you look at the Network tab in your browser's developer tools, you will see a request for cbm_reporting_cbricsL.htm, which is the resource you actually need to scrape. You should also send a browser-like User-Agent header for the request to work properly:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Request the endpoint seen in the Network tab; a browser-like
# User-Agent header is required or the site rejects the request
res = requests.get(
    'https://www1.nseindia.com/products/dynaContent/debt/corp_bonds/htms/cbm_reporting_cbricsL.htm',
    headers={"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}
)

soup = BeautifulSoup(res.text, 'lxml')

# Collect the <td> cells of every table row
raw_columns = [row.find_all('td') for row in soup.find_all('tr')]

# The first 3 rows are dummy/header rows, so skip them
df = pd.DataFrame.from_records(raw_columns[3:])

The result would look like:

0   [INE001A07TA7]  [HOUSING DEVELOPMENT FINANCE CORPORATION LTD S...   [ 100.0030] [ 4.7082]   [ 16]   [[ 168000.00]]  [ 100.0000] [ 4.7091]
1 [INE134E07AP6] [POWER FINANCE CORPORATION LTD. TRI SRV CATIII... [ 100.8500] [ 6.6934] [ 1] [ 1000.00 ] [ 100.8500] [ 6.6934]
2 [INE020B08963] [RURAL ELECTRIFICATION CORPORATION LIMITED SR-... [ 107.6835] [ 5.9200] [ 1] [ 1500.00 ] [ 107.6835] [ 5.9200]
3 [INE163N08131] [-] [ 104.2195] [ 6.6200] [ 1] [ 780.00 ] [ 104.2195] [ 6.6200]
4 [INE540P07343] [-] [ 104.3408] [ 9.3603] [ 6] [[ 1110.00]] [ 104.2640] [ 9.3800]
... ... ... ... ... ... ... ... ...
93 [INE377Y07250] [BAJAJ HOUSING FINANCE LIMITED SR 27 5.69 NCD ... [ 100.0300] [ 5.6845] [ 1] [ 9000.00 ] [ 100.0300] [ 5.6845]
94 [INE115A07ML7] [LIC HOUSING FINANCE LIMITED SRTR349OP-1 7.4NC... [ 105.0991] [ 5.5000] [ 1] [ 1000.00 ] [ 105.0991] [ 5.5000]
95 [INE020B07HN3] [RURAL ELECTRIFICATION CORPORATION LIMITED SR-... [ 123.6000] [ 4.4400] [ 1] [ 10.00 ] [ 123.6000] [ 4.4400]
96 [INE101A08070] [MAHINDRA AND MAHINDRA LIMITED 9.55 NCD 04JL63... [ 125.5000] [ 7.5218] [ 1] [ 820.00 ] [ 125.5000] [ 7.5218]
97 [INE062A08215] [STATE BANK OF INDIA SERIES I 8.75 BD PERPETUA... [ 104.5304] [ 7.0000] [ 1] [ 10.00 ] [ 104.5304] [ 7.0000]
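Note that each cell in this DataFrame is still a list of bs4 Tag objects rather than plain text. As a minimal sketch (assuming the same soup as above), you could extract the cell text before building the DataFrame:

# Extract plain text from each <td> instead of keeping Tag objects
rows = [
    [td.get_text(strip=True) for td in tr.find_all('td')]
    for tr in soup.find_all('tr')
]

# Skip the first 3 dummy rows, as before
df = pd.DataFrame.from_records(rows[3:])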

Why can't I scrape some webpages using Python and bs4?

You are missing some headers that the site may require.

I suggest using the requests package instead of urllib, as it's more flexible. See a working example below:

import requests

url = "https://www.idealista.com/areas/alquiler-viviendas/?shape=%28%28wt_%7BF%60m%7Be%40njvAqoaXjzjFhecJ%7BebIfi%7DL%29%29"

querystring = {"shape":"((wt_{F`m{e@njvAqoaXjzjFhecJ{ebIfi}L))"}

payload = ""
headers = {
'authority': "www.idealista.com",
'cache-control': "max-age=0",
'upgrade-insecure-requests': "1",
'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
'sec-fetch-site': "none",
'sec-fetch-mode': "navigate",
'sec-fetch-user': "?1",
'sec-fetch-dest': "document",
'accept-language': "en-US,en;q=0.9"
}

response = requests.request("GET", url, data=payload, headers=headers, params=querystring)

print(response.text)

From there you can parse the body using bs4:

from bs4 import BeautifulSoup

page_soup = BeautifulSoup(response.text, "html.parser")

However, beware that the site you are trying to scrape may show a CAPTCHA, so you'll probably need to rotate your User-Agent header and your IP address; a minimal sketch of the former follows.
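Here is a minimal sketch of user-agent rotation (the user-agent strings are arbitrary examples; rotating IP addresses additionally requires a pool of proxies, which is not shown):

import random

import requests

# Example pool of user-agent strings; in practice you would
# maintain a larger, up-to-date list
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0",
]

def get_with_random_ua(url, **kwargs):
    # Pick a different user agent for each request
    headers = kwargs.pop("headers", {})
    headers["User-Agent"] = random.choice(USER_AGENTS)
    return requests.get(url, headers=headers, **kwargs)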

Why am I unable to web-scrape a URL from a hyperlink on this website?

You are unable to parse it because the data is loaded dynamically. As you can see in the following screenshot, the HTML that ends up on the page doesn't exist in the source code you download; the site's JavaScript later parses the window.__SITE variable and builds the page from that data:

[screenshot: page source showing the URL-encoded window.__SITE variable]

However, we can replicate this in Python. After downloading the page:

import requests

url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)

You can use re (regex) to extract the encoded page source:

import re

encoded_data = re.search(r'window\.__SITE="(.*)"', req.text).group(1)

Afterwards, you can use urllib to URL-decode the text, and json to parse the JSON string data:

from urllib.parse import unquote
from json import loads

json_data = loads(unquote(encoded_data))

You can then parse the JSON tree to get to the HTML source data:

html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]
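That deeply nested path is brittle: if the page layout changes, the indices shift. As a hedged alternative sketch (not part of the original answer, and assuming the first "markdown" value in the tree is the one you want), you could search for the key recursively instead:

def find_key(node, key):
    """Recursively search a JSON tree for the first occurrence of `key`."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        for value in node.values():
            found = find_key(value, key)
            if found is not None:
                return found
    elif isinstance(node, list):
        for item in node:
            found = find_key(item, key)
            if found is not None:
                return found
    return None

html_src = find_key(json_data, "markdown")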

At that point, you can use your own code to parse the HTML:

soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())

links = soup.find_all('a')

for link in links:
    if "href" in link.attrs:
        print(link.attrs['href'] + "\n")

If you put it all together, here's the final script:

import requests
import re
from urllib.parse import unquote
from json import loads
from bs4 import BeautifulSoup

# Download URL
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)

# Get encoded JSON from HTML source
encoded_data = re.search(r'window\.__SITE="(.*)"', req.text).group(1)

# Decode and load as dictionary
json_data = loads(unquote(encoded_data))

# Get the HTML source code for the links
html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]

# Parse it using BeautifulSoup
soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())

# Get links
links = soup.find_all('a')

# Print the href of each link
for link in links:
    if "href" in link.attrs:
        print(link.attrs['href'] + "\n")

Can't scrape all data from website with BeautifulSoup

The 5,95 shown on the page is calculated from a percentage score that is fetched via a separate JSON request. The value is computed as 100 - (100 * score):

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib import parse
import json

# Set up scraper
url = (f"https://aktie.traderfox.com/visualizations/US30303M1027/DI/facebook-inc")
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, "html.parser")

# << Your code here to get other items >>

# Locate the stock ID and request the JSON data for it
stock_id = soup.find('span', attrs={"data-id" : True})['data-id']
data = parse.urlencode({"stock_id" : stock_id}).encode()
req_fa = Request("https://aktie.traderfox.com/ajax/getFactorAnalysis.php", data=data)
json_data = json.loads(urlopen(req_fa).read())

umsatzwachstum_growth = 100 - (100 * json_data["data"]["scores"]["salesgrowth5"]["score"])
eps_growth = 100 - (100 * json_data["data"]["scores"]["epsgrowth5"]["score"])
print(f"{umsatzwachstum_growth:.2f}, {eps_growth:.2f}")

This would give you:

5.95, 3.55

I suggest you print out the json_data to better understand the format of the data that is returned.
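For example, using the json module that is already imported above:

# Pretty-print the full payload to inspect its structure
print(json.dumps(json_data, indent=2))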

I am unable to scrape a domain name from this website. Postman returns JSON, but requests throws an exception when I call response.json()

The data you see on the page is stored inside embedded JSON (the Next.js __NEXT_DATA__ script tag), which is also why response.json() fails: the response body is HTML, not JSON. To parse it, you can use the following example:

import json
import requests
from bs4 import BeautifulSoup

url = "https://cloud28plus.com/en/partner/resecurity--inc-"

soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").contents[0])

# uncomment this to see all data:
# print(json.dumps(data, indent=4))

print(data["props"]["initialProps"]["pageProps"]["element"]["twitter"])

Prints:

https://twitter.com/RESecurity
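The original question asked about the domain name; the same pattern applies to any other field on the element object. For instance (the "website" key here is hypothetical; inspect the dumped JSON to find the actual field name):

# Hypothetical key: adjust to whatever field actually holds the domain
print(data["props"]["initialProps"]["pageProps"]["element"].get("website"))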

