How to Get JSON from Webpage into Python Script

Load a JSON File from a URL request

import requests

url = "example.url"
headers = {"Accept": "application/json"}  # supply any headers the API needs
response = requests.request("GET", url, headers=headers)

data = response.json()
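One caveat worth adding (an assumption about the endpoint, not something from the question): if the server returns an error page instead of data, `response.json()` raises. The guard pattern, shown here against plain strings so it runs without a network call:

```python
import json

def safe_json(text):
    """Return parsed JSON, or None when the body is not valid JSON
    (e.g. an HTML error page came back instead of data)."""
    try:
        return json.loads(text)
    except ValueError:  # requests' response.json() raises a ValueError subclass too
        return None

print(safe_json('{"ok": true}'))      # {'ok': True}
print(safe_json("<html>500</html>"))  # None
```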

Python 3: Read a JSON file from a URL

You were close:

import requests
import json
response = json.loads(requests.get("your_url").text)
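As a side note, `requests` can decode the body for you: `response.json()` is equivalent to `json.loads(response.text)`, so the `json` import is only needed if you plan to re-serialize the data. A quick illustration with a made-up payload standing in for `response.text`:

```python
import json

# A stand-in for response.text (made-up payload, not from the question):
body = '{"id": 1, "tags": ["a", "b"]}'

data = json.loads(body)  # what the answer above does with the response body
print(data["tags"])      # ['a', 'b']
```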

How can I read JSON data from an online file? (Python)

With requests library:

import requests

f = "https://api.npoint.io/7872500d7eef44a03194"
data = requests.get(f).json()

data

Output:

{'sample': 'this is only a sample'}

Python 3 Get and parse JSON API

Version 1: (do a pip install requests before running the script)

import requests
r = requests.get(url='https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty')
print(r.json())

Version 2: (do a pip install wget before running the script)

import wget

fs = wget.download(url='https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty')
with open(fs, 'r') as f:
    content = f.read()
print(content)
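If you would rather avoid third-party packages entirely, the standard library can do the same fetch. A minimal sketch using urllib.request (the helper name `fetch_json` is my own; the commented usage reuses the Hacker News URL from Version 1 and needs network access):

```python
import json
from urllib.request import urlopen

def fetch_json(url):
    """Fetch a URL and decode its body as JSON using only the stdlib."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Usage against the same endpoint as Version 1 (requires network access):
# top_ids = fetch_json("https://hacker-news.firebaseio.com/v0/topstories.json")
# print(top_ids[:5])
```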

Parse JSON from within HTML webpage Using Python

An example of how to parse the JSON data contained within this page:

import json
import requests
from bs4 import BeautifulSoup

url = "https://www.bizbuysell.com/connecticut-businesses-for-sale/?q=bHQ9MzAsNDAsODA%3D"

headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0"
}

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
data = soup.select_one('[data-stype="searchResultsPage"]').contents[0]
data = json.loads(data)

# pretty print the data
print(json.dumps(data, indent=4))

Prints:

{
    "@context": "http://schema.org",
    "@type": "SearchResultsPage",
    "speakable": {
        "@type": "SpeakableSpecification",
        "xpath": [
            "/html/head/title",
            "/html/head/meta[@name='description']/@content"
        ]
    },
    "about": [
        {
            "item": {
                "@type": "Product",
                "name": "Moving Company",
                "alternateName": null,
                "logo": "https://images.bizbuysell.com/shared/listings/179/1791243/ade90fd4-5537-4545-9011-58eb2f257a99-W496.jpg",
                "image": "https://images.bizbuysell.com/shared/listings/179/1791243/ade90fd4-5537-4545-9011-58eb2f257a99-W496.jpg",
                "description": "The company is made up of three department. A licensed Household Goods Relocation and Eviction, an Insurance Agency, and also a Thrift Store. The reason why the company is established this way is the three departments work very well together. Most times someone calls us for services and require special Insurance Coverages. We represent several Insurance Companies and Wholesalers which us a great advantage to obtain the required Insurance Coverage without delays. Most time we relocate clients who are downsizing, children are grown, and moved out, and therefore do not have need for lots of furniture which we either purchase at minimal cost or given to us for free. It's a win win situation for the company. The items are sold very fast because the selling price is extremely low and the profit margin is very high.",
                "url": "/Business-Opportunity/moving-company/1791243",
                "productId": "1791243",
                "offers": {
                    "@type": "Offer",
                    "price": 450000,
                    "priceCurrency": "USD",
                    "availability": "http://schema.org/InStock",
                    "url": "/Business-Opportunity/moving-company/1791243",
                    "image": "https://images.bizbuysell.com/shared/listings/179/1791243/ade90fd4-5537-4545-9011-58eb2f257a99-W496.jpg",
                    "availableAtOrFrom": {
                        "@type": "Place",
                        "address": {
                            "@type": "PostalAddress",
                            "addressLocality": "Hartford County",
                            "addressRegion": " CT"
                        }
                    }
                }
            },
            "@type": "ListItem",
            "position": 0
        },

...
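Once the JSON-LD is parsed, plain dictionary access pulls out listing fields. A sketch, where `data` is a hand-built stand-in trimmed to the same shape as the output above:

```python
# `data` below is a hand-built stand-in with the same nesting as the
# parsed JSON-LD above (trimmed to just the fields we pull out).
data = {
    "about": [
        {
            "item": {
                "name": "Moving Company",
                "offers": {"price": 450000, "priceCurrency": "USD"},
            }
        }
    ]
}

# Collect (name, price) pairs from every listing in the "about" array.
listings = [
    (entry["item"]["name"], entry["item"]["offers"]["price"])
    for entry in data["about"]
]
print(listings)  # [('Moving Company', 450000)]
```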

How can I get the JSON out of a webpage?

First, you want to access the raw file, and not the UI. Just like Kache mentioned, you can get the JSON using:

resp = requests.get('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
obj = json.loads(base64.decodebytes(resp.text.encode()))

Then, you can use the following script to extract only the data you want:

import requests
import json
import base64

def extract_log(log):
    keys = ['description', 'log_id']
    return {key: log[key] for key in keys}

def extract_logs(logs):
    return [extract_log(log) for log in logs]

def extract_operator(operator):
    return {
        'name': operator['name'],
        'logs': extract_logs(operator['logs'])
    }

def extract_certificates(obj):
    return [extract_operator(operator) for operator in obj['operators']]

def scrape_certificates(url):
    resp = requests.get(url)
    obj = json.loads(base64.decodebytes(resp.text.encode()))
    return extract_certificates(obj)

def main():
    out = scrape_certificates('https://chromium.googlesource.com/chromium/src/+/main/components/certificate_transparency/data/log_list.json?format=TEXT')
    print(json.dumps(out, indent=4))

if __name__ == '__main__':
    main()
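The ?format=TEXT trick is worth spelling out: gitiles serves the raw file base64-encoded, which is why the script decodes before json.loads. An offline illustration of that decode step, using a made-up payload in place of the real response:

```python
import base64
import json

# Simulate what gitiles' ?format=TEXT returns: the file's bytes, base64-encoded.
raw = json.dumps({"operators": []}).encode()
transport = base64.encodebytes(raw).decode()  # stands in for resp.text

# The decode step from the script above, applied to the simulated payload:
obj = json.loads(base64.decodebytes(transport.encode()))
print(obj)  # {'operators': []}
```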

Trying to read JSON from URL and parse into CSV format

You are close, here's what you need to change:

  1. You can use pandas dataframes to read json using df = pd.read_json(text, lines=True) - for this make sure to specify lines=True because some of your data contains \n characters
  2. You can use the same dataframe to output to a csv using df.to_csv(file)
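Both steps above can be sketched offline. `lines=True` tells pandas to treat the input as newline-delimited JSON (one object per line); the records below are made up for illustration:

```python
import io
import pandas as pd

# Two made-up records in newline-delimited JSON -- one object per line,
# which is the shape lines=True expects.
ndjson = '{"code": "A1", "charge": 100}\n{"code": "B2", "charge": 250}\n'
df = pd.read_json(io.StringIO(ndjson), lines=True)

buf = io.StringIO()
df.to_csv(buf, index=False)  # step 2: the same dataframe writes straight to CSV
print(buf.getvalue())
```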

All in all, there are some things in your code that could be removed, e.g. you're calling requests.get twice for no real reason, which slows your code down substantially.

import requests
import pandas as pd

all_links = ['https://www.baptisthealthsystem.com/docs/global/standard-charges/474131755_abrazomaranahospital_standardcharges.json?sfvrsn=9a27928_2',
'https://www.baptisthealthsystem.com/docs/global/standard-charges/621861138_abrazocavecreekhospital_standardcharges.json?sfvrsn=674fd6f_2',
'https://www.baptisthealthsystem.com/docs/global/standard-charges/621809851_abrazomesahospital_standardcharges.json?sfvrsn=13953222_2',
'https://www.baptisthealthsystem.com/docs/global/standard-charges/621811285_abrazosurprisehospital_standardcharges.json?sfvrsn=c8113dcf_2']
for item in all_links:
    try:
        first_under = item.find('_') + 1
        last_under = item.rfind('?') - 21
        file_name = item[first_under:last_under]
        r = requests.get(item)
        df = pd.read_json(r.text, lines=True)
        DOWNLOAD_PATH = 'C:\\Users\\ryans\\Desktop\\hospital_data\\' + file_name + '.csv'
        df.to_csv(DOWNLOAD_PATH)  # to_csv accepts a path directly; no need to open the file yourself
    except Exception as e:
        print(e)

