Pandas Read_CSV from Url

Pandas read_csv from url

UPDATE: Since pandas 0.19.2 you can pass the URL directly to read_csv(), although this fails if the URL requires authentication.


For older pandas versions, or if you need authentication, or if you want to handle HTTP failures yourself:

Use pandas.read_csv with a file-like object as the first argument.

  • If you want to read the CSV from a string, you can use io.StringIO.

  • The URL https://github.com/cs109/2014_data/blob/master/countries.csv returns an HTML page, not raw CSV. Use the URL behind the Raw link on the GitHub page instead, which is https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv

Example:

import pandas as pd
import io
import requests
url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))

Notes:

In Python 2.x, the string-buffer object was StringIO.StringIO.
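
The file-like-object approach can be tried without any network access. The following is a minimal sketch; the CSV text is illustrative sample data standing in for a decoded response body, not the real contents of countries.csv.

```python
import io

import pandas as pd

# Illustrative CSV text; in practice this would be the decoded response body.
csv_text = "Country,Region\nAlgeria,AFRICA\nAngola,AFRICA\n"

# io.StringIO wraps a str in a file-like object that read_csv can consume.
df_from_text = pd.read_csv(io.StringIO(csv_text))

# io.BytesIO does the same for raw bytes, letting pandas handle the decoding.
df_from_bytes = pd.read_csv(io.BytesIO(csv_text.encode("utf-8")), encoding="utf-8")

print(df_from_text)
```

Both calls should produce the same two-row DataFrame, so whether to decode first is mostly a matter of convenience.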

Pandas read_csv from web URL

First, let's look at the error you are facing:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

  • The error type is UnicodeDecodeError, and the offending byte is 0xff at position 0.

Why did this error occur and how can it be resolved?

By default pd.read_csv() decodes the data with encoding='utf-8'. 0xff is a byte value written in hexadecimal (base 16): two f digits, each equivalent to 1111 in binary, so all eight bits are set. No valid UTF-8 sequence can start with that byte. A 0xff at the very start of a file usually means the file begins with a UTF-16 byte order mark (0xff 0xfe), that is, the data is UTF-16 encoded.

  • Solution: use encoding='utf-16' when fetching the data.

After that you may hit "Error tokenizing data. C error: Expected 1 fields in line 3, saw 3". This happens because the file is tab-separated and has an extra line above the header and extra lines after the data. Setting the separator and skipping those lines resolves it:

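# Import the required libraries
import pandas as pd

# Fetch the CSV data from the URL and store it in df.
# The file is UTF-16 encoded and tab-separated; skip the line above the header and the 3-line footer.
# on_bad_lines='skip' requires pandas >= 1.3; older versions used error_bad_lines=False instead.
url = 'https://www.hkex.com.hk/eng/dwrc/search/dwFullList.csv'
df = pd.read_csv(url, encoding='utf-16', sep='\t', on_bad_lines='skip', skiprows=1, skipfooter=3, engine='python')

# Show the first few records of df
df.head()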

With these options the file loads into the desired DataFrame.
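
To see the encoding problem without downloading anything, here is a small offline sketch. The sample text, its column names, and the three footer lines are made up for illustration and only mimic the layout described above (one line above the header, tab-separated rows, a three-line footer).

```python
import codecs
import io

import pandas as pd

# Made-up sample: one title line, a header row, two data rows, three footer lines.
text = (
    "Sample full list\n"
    "code\tname\tprice\n"
    "1\talpha\t0.5\n"
    "2\tbeta\t1.5\n"
    "Notes:\n"
    "Illustrative data only\n"
    "End of list\n"
)

# Encode as little-endian UTF-16 with an explicit byte order mark (0xff 0xfe).
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Decoding these bytes as UTF-8 fails on the very first byte.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print(exc)

# Same options as in the answer: UTF-16, tab separator, skip title and footer.
df = pd.read_csv(
    io.BytesIO(raw),
    encoding="utf-16",
    sep="\t",
    skiprows=1,
    skipfooter=3,
    engine="python",  # skipfooter is only supported by the python engine
)
print(df)
```

The point of the sketch is that the leading 0xff comes from the byte order mark, so the fix is to pick the right encoding rather than to strip bytes.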

Pandas Error attempting to read csv from a url

Use the url for the raw data.

pd.read_csv('https://raw.githubusercontent.com/kjhealy/fips-codes/master/state_and_county_fips_master.csv')
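
If you only have the regular GitHub page URL, the raw URL can usually be derived from it. The helper below is a hypothetical convenience function (github_blob_to_raw is not part of pandas or any GitHub library), and it assumes the common https://github.com/<owner>/<repo>/blob/<ref>/<path> layout.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url: str) -> str:
    """Turn a github.com 'blob' page URL into its raw.githubusercontent.com form.

    URLs that do not match the expected layout are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.netloc != "github.com":
        return url

    # Expected path segments: ['', owner, repo, 'blob', ref, file, ...]
    segments = parts.path.split("/")
    if len(segments) < 6 or segments[3] != "blob":
        return url

    # Drop the 'blob' segment; query string and fragment are discarded.
    raw_path = "/".join(segments[:3] + segments[4:])
    return urlunsplit((parts.scheme, "raw.githubusercontent.com", raw_path, "", ""))


page_url = "https://github.com/cs109/2014_data/blob/master/countries.csv"
raw_url = github_blob_to_raw(page_url)
print(raw_url)
```

The result can then be passed to pd.read_csv() in the same way as the raw URL above.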

Pandas read_csv from URL and include request header

As of pandas 1.3.0, you can pass custom HTTP(S) headers using the storage_options argument:

import pandas as pd

url = "https://moz.com:443/top500/domains/csv"

# Request headers sent along with the download
hdr = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}

domains_df = pd.read_csv(url, storage_options=hdr)

Pandas read_csv response codes when using an external url

You can pass a URL to read_csv(), but it gives you no way to inspect the status code. It simply raises an error when the server responds with an error status, and you have to catch that with try/except.

If you need the status code, fetch the data with requests and then use io.StringIO to create a file-like object (a file in memory) that you pass to read_csv().
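
A minimal sketch of the try/except approach follows. It assumes pandas fetches URLs through the standard library's urllib, so failures surface as urllib.error.HTTPError or urllib.error.URLError; read_csv_or_none is a hypothetical helper name used only for illustration.

```python
import urllib.error

import pandas as pd


def read_csv_or_none(source):
    """Return (DataFrame, None) on success or (None, error) on a URL failure."""
    try:
        return pd.read_csv(source), None
    except urllib.error.HTTPError as err:
        # The server answered with an error status; err.code holds it (e.g. 404).
        return None, err
    except urllib.error.URLError as err:
        # No usable response at all (DNS failure, refused connection, bad file URL).
        return None, err


# Example call with the URL used in this answer (requires network access):
# df, err = read_csv_or_none("https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv")
# if err is not None:
#     print("download failed:", err)
```

HTTPError is a subclass of URLError, which is why it is caught first.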

import io
import requests
import pandas as pd

response = requests.get("https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv")

print('status_code:', response.status_code)

# if response.status_code == 200:
if response.ok:
    df = pd.read_csv(io.StringIO(response.text))
else:
    df = None

print(df)

In the same way, you can use io.StringIO when you build a web page that receives a CSV file uploaded through an HTML <form>.


As far as I know, read_csv(url) works in a similar way: it downloads the file data from the server (pandas uses the standard library's urllib for this rather than requests) and then reads it from an in-memory buffer.

Progress in bytes when reading CSV from URL with pandas

Not thoroughly tested, but you can implement a custom class with a read() method that reads the requests response line by line and updates the tqdm bar:

import requests
import pandas as pd
from tqdm import tqdm

url = "https://public.tableau.com/views/PPBOpenDataDownloads/UseOfForce-All.csv?:showVizHome=no"

class TqdmReader:
    def __init__(self, resp):
        # Content-Length may be missing; tqdm then shows progress without a total
        total_size = int(resp.headers.get("Content-Length", 0))

        self.resp = resp
        self.bar = tqdm(
            desc=resp.url,
            total=total_size,
            unit="iB",
            unit_scale=True,
            unit_divisor=1024,
        )

        self.reader = self.read_from_stream()

    def read_from_stream(self):
        # Yield the response line by line, advancing the bar by the bytes read
        for line in self.resp.iter_lines():
            line += b"\n"
            self.bar.update(len(line))
            yield line

    def read(self, n=0):
        # pandas calls read() repeatedly; an empty bytes object signals end of data
        try:
            return next(self.reader)
        except StopIteration:
            return b""

with requests.get(url, params=None, stream=True) as resp:
    df = pd.read_csv(TqdmReader(resp))

print(len(df))

Prints:

https://public.tableau.com/views/PPBOpenDataDownloads/UseOfForce-All.csv?:showVizHome=no: 100%|██████████████████████████████████████████████████████████████████████████████| 2.09M/2.09M [00:00<00:00, 2.64MiB/s]
7975

