Unable to scrape data from Website with Python
If you look at the Network tab, you will see a request to cbm_reporting_cbricsL.htm, which is the resource you actually need to scrape. You should also add headers for the request to work properly:
import requests
import pandas as pd
from bs4 import BeautifulSoup
res = requests.get(
'https://www1.nseindia.com/products/dynaContent/debt/corp_bonds/htms/cbm_reporting_cbricsL.htm',
headers={"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}
)
soup = BeautifulSoup(res.text, 'lxml')
raw_columns = [row.find_all('td') for row in soup.find_all('tr')]
# the first 3 rows are header rows with no <td> cells, so skip them
df = pd.DataFrame.from_records(raw_columns[3:])
The result would look like:
0 [INE001A07TA7] [HOUSING DEVELOPMENT FINANCE CORPORATION LTD S... [ 100.0030] [ 4.7082] [ 16] [[ 168000.00]] [ 100.0000] [ 4.7091]
1 [INE134E07AP6] [POWER FINANCE CORPORATION LTD. TRI SRV CATIII... [ 100.8500] [ 6.6934] [ 1] [ 1000.00 ] [ 100.8500] [ 6.6934]
2 [INE020B08963] [RURAL ELECTRIFICATION CORPORATION LIMITED SR-... [ 107.6835] [ 5.9200] [ 1] [ 1500.00 ] [ 107.6835] [ 5.9200]
3 [INE163N08131] [-] [ 104.2195] [ 6.6200] [ 1] [ 780.00 ] [ 104.2195] [ 6.6200]
4 [INE540P07343] [-] [ 104.3408] [ 9.3603] [ 6] [[ 1110.00]] [ 104.2640] [ 9.3800]
... ... ... ... ... ... ... ... ...
93 [INE377Y07250] [BAJAJ HOUSING FINANCE LIMITED SR 27 5.69 NCD ... [ 100.0300] [ 5.6845] [ 1] [ 9000.00 ] [ 100.0300] [ 5.6845]
94 [INE115A07ML7] [LIC HOUSING FINANCE LIMITED SRTR349OP-1 7.4NC... [ 105.0991] [ 5.5000] [ 1] [ 1000.00 ] [ 105.0991] [ 5.5000]
95 [INE020B07HN3] [RURAL ELECTRIFICATION CORPORATION LIMITED SR-... [ 123.6000] [ 4.4400] [ 1] [ 10.00 ] [ 123.6000] [ 4.4400]
96 [INE101A08070] [MAHINDRA AND MAHINDRA LIMITED 9.55 NCD 04JL63... [ 125.5000] [ 7.5218] [ 1] [ 820.00 ] [ 125.5000] [ 7.5218]
97 [INE062A08215] [STATE BANK OF INDIA SERIES I 8.75 BD PERPETUA... [ 104.5304] [ 7.0000] [ 1] [ 10.00 ] [ 104.5304] [ 7.0000]
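As a follow-up, the records above still hold raw bs4 Tag objects, which is why every cell prints wrapped in brackets. Here is a sketch of cleaning them into typed columns; the column names are assumptions based on the NSE report layout, not taken from the page, so verify them against the table header:

```python
import pandas as pd

# Hypothetical column names (an assumption; check the live table header)
COLUMNS = ["ISIN", "Description", "LTP", "LTY", "Trades", "Value", "WAP", "WAY"]

def to_dataframe(rows):
    """Build a typed DataFrame from lists of cell text, skipping the
    header rows (which contain no <td> cells and come through empty)."""
    records = [row for row in rows if len(row) == len(COLUMNS)]
    df = pd.DataFrame(records, columns=COLUMNS)
    # Convert the numeric columns; bad values become NaN instead of raising
    for col in ["LTP", "LTY", "Trades", "Value", "WAP", "WAY"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df
```

You would call it as to_dataframe([[td.get_text(strip=True) for td in row] for row in raw_columns]).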
Why can't I scrape some webpages using Python and bs4?
You are missing some headers that the site may require. I suggest using the requests package instead of urllib, as it's more flexible. See a working example below:
import requests
url = "https://www.idealista.com/areas/alquiler-viviendas/?shape=%28%28wt_%7BF%60m%7Be%40njvAqoaXjzjFhecJ%7BebIfi%7DL%29%29"
querystring = {"shape":"((wt_{F`m{e@njvAqoaXjzjFhecJ{ebIfi}L))"}
headers = {
    'authority': "www.idealista.com",
    'cache-control': "max-age=0",
    'upgrade-insecure-requests': "1",
    'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    'sec-fetch-site': "none",
    'sec-fetch-mode': "navigate",
    'sec-fetch-user': "?1",
    'sec-fetch-dest': "document",
    'accept-language': "en-US,en;q=0.9"
}
response = requests.get(url, headers=headers, params=querystring)
print(response.text)
From there you can parse the body using bs4:
from bs4 import BeautifulSoup
pageSoup = BeautifulSoup(response.text, "html.parser")
However, beware that the site you are trying to scrape may show a CAPTCHA, so you'll probably need to rotate your user-agent header and your IP address.
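A minimal sketch of user-agent rotation (the strings below are illustrative assumptions; keep a current list in practice):

```python
import random

# Illustrative User-Agent pool (assumption: replace with up-to-date strings)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36",
]

def random_headers():
    """Pick a random User-Agent for the next request."""
    return {"user-agent": random.choice(USER_AGENTS)}
```

You would then call requests.get(url, headers=random_headers(), params=querystring). Rotating the IP address additionally requires a proxy pool, which is out of scope here.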
Why am I unable to web-scrape URL from a hyperlink in this website?
You are unable to parse it because the data is dynamically loaded. The HTML that ends up rendered on the page doesn't exist in the downloaded source code; JavaScript later parses the window.__SITE variable and extracts the data from there.
However, we can replicate this in Python. After downloading the page:
import requests
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)
You can use re (regex) to extract the encoded page source:
import re
encoded_data = re.search(r'window\.__SITE="(.*)"', req.text).groups()[0]
Afterwards, you can use urllib to URL-decode the text, and json to parse the JSON string data:
from urllib.parse import unquote
from json import loads
json_data = loads(unquote(encoded_data))
You can then parse the JSON tree to get to the HTML source data:
html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]
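That index path is brittle: any layout change on the site shifts sections[1]/rows[0]/cards[0]. As a hedged alternative, you can search the decoded JSON tree for the "markdown" key wherever it lives:

```python
def find_key(obj, key):
    """Depth-first search a nested dict/list structure for the first
    value stored under `key`; returns None if the key is absent."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        for value in obj.values():
            found = find_key(value, key)
            if found is not None:
                return found
    elif isinstance(obj, list):
        for item in obj:
            found = find_key(item, key)
            if found is not None:
                return found
    return None
```

With that helper, html_src = find_key(json_data, "markdown") survives reshuffled sections, at the cost of assuming only one markdown card matters.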
At that point, you can use your own code to parse the HTML:
soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())
links = soup.find_all('a')
for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href']) + "\n")
If you put it all together, here's the final script:
import requests
import re
from urllib.parse import unquote
from json import loads
from bs4 import BeautifulSoup
# Download URL
url = "https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/"
req = requests.get(url)
# Get encoded JSON from HTML source
encoded_data = re.search(r'window\.__SITE="(.*)"', req.text).groups()[0]
# Decode and load as dictionary
json_data = loads(unquote(encoded_data))
# Get the HTML source code for the links
html_src = json_data["site"]["data"]["values"]["layout"]["sections"][1]["rows"][0]["cards"][0]["component"]["settings"]["markdown"]
# Parse it using BeautifulSoup
soup = BeautifulSoup(html_src, 'html.parser')
print(soup.prettify())
# Get links
links = soup.find_all('a')
# For each link...
for link in links:
    if "href" in link.attrs:
        print(str(link.attrs['href']) + "\n")
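One fragile step in the script above: re.search returns None when window.__SITE is missing (an error page, a layout change), and calling .groups() on None raises AttributeError. A small guard, as a sketch:

```python
import re

def extract_site_payload(html):
    """Return the URL-encoded window.__SITE payload, or None if the
    marker is not present in the page source."""
    match = re.search(r'window\.__SITE="(.*)"', html)
    return match.group(1) if match else None
```

Checking for None before decoding lets the script fail with a clear message instead of a traceback deep inside the pipeline.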
Can't scrape all data from website with BeautifulSoup
The 5,95 is calculated from the percentage score, which is obtained via a separate JSON request. The value is calculated as 100 - (100 * score):
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib import parse
import json
# Set up scraper
url = "https://aktie.traderfox.com/visualizations/US30303M1027/DI/facebook-inc"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, "html.parser")
# << Your code here to get other items >>
# Locate the stock ID and request the JSON data for it
stock_id = soup.find('span', attrs={"data-id" : True})['data-id']
data = parse.urlencode({"stock_id" : stock_id}).encode()
req_fa = Request("https://aktie.traderfox.com/ajax/getFactorAnalysis.php", data=data)
json_data = json.loads(urlopen(req_fa).read())
umsatzwachstum_growth = 100 - (100 * json_data["data"]["scores"]["salesgrowth5"]["score"])
eps_growth = 100 - (100 * json_data["data"]["scores"]["epsgrowth5"]["score"])
print(f"{umsatzwachstum_growth:.2f}, {eps_growth:.2f}")
This would give you:
5.95, 3.55
I suggest you print out json_data to better understand the format of the data that is returned.
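For instance, if the payload matches the data["scores"][...]["score"] layout used above (an assumption about the response shape), you can summarize every score at once:

```python
def summarize_scores(json_data):
    """Derive the percentage value for each score entry, assuming the
    data["scores"][name]["score"] layout used in the request above."""
    scores = json_data.get("data", {}).get("scores", {})
    return {name: 100 - 100 * entry["score"] for name, entry in scores.items()}
```

Running summarize_scores(json_data) prints one derived percentage per score key, which makes it easy to spot the 5,95 on the page.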
Unable to scrape domain name from this website? Postman returns JSON, but requests throws an exception when I call response.json()
The data you see on the page is stored inside embedded JSON. To parse it, you can use the following example:
import json
import requests
from bs4 import BeautifulSoup
url = "https://cloud28plus.com/en/partner/resecurity--inc-"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").contents[0])
# uncomment this to see all data:
# print(json.dumps(data, indent=4))
print(data["props"]["initialProps"]["pageProps"]["element"]["twitter"])
Prints:
https://twitter.com/RESecurity