Pandas read_csv from url
UPDATE: As of pandas 0.19.2, you can pass the URL directly to read_csv(), although that will fail if the URL requires authentication.
For older pandas versions, or if you need authentication, or for any other HTTP fault-tolerance reason:
Use pandas.read_csv with a file-like object as the first argument.
If you want to read the CSV from a string, you can use io.StringIO.
For the URL https://github.com/cs109/2014_data/blob/master/countries.csv you get an HTML response, not raw CSV. Use the URL behind the Raw link on the GitHub page to get the raw CSV response, which is https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv
Example:
import pandas as pd
import io
import requests
url="https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))
Notes:
In Python 2.x, the string-buffer object was StringIO.StringIO.
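The blob-to-raw substitution described in this answer can be sketched as a small helper. The function name `github_blob_to_raw` is hypothetical, and the rewrite rule only assumes the `github.com/<owner>/<repo>/blob/<ref>/<path>` layout seen in the example URL; any other URL is returned unchanged.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url: str) -> str:
    """Rewrite a GitHub 'blob' page URL to its raw.githubusercontent.com form.

    Hypothetical helper: it assumes the path layout
    /<owner>/<repo>/blob/<ref>/<path> and leaves other URLs untouched.
    """
    parts = urlsplit(url)
    if parts.netloc != "github.com":
        return url
    segments = parts.path.strip("/").split("/")
    if len(segments) < 5 or segments[2] != "blob":
        return url
    # Drop the "blob" segment and switch to the raw-content host.
    raw_path = "/" + "/".join(segments[:2] + segments[3:])
    return urlunsplit(
        (parts.scheme, "raw.githubusercontent.com", raw_path, parts.query, parts.fragment)
    )


print(github_blob_to_raw("https://github.com/cs109/2014_data/blob/master/countries.csv"))
# https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv
```

The returned string can then be passed to pd.read_csv() or requests.get() as in the example.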
Pandas read_csv from web URL
First, let's understand the error. The error you are facing is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
- The error type is UnicodeDecodeError, and the byte that the 'utf-8' codec could not decode is 0xff.
Why did this error occur, and how do you resolve it?
By default pd.read_csv() decodes the data with encoding = 'utf-8', and decoding fails on the very first byte, 0xff. 0xff is a number written in the hexadecimal numeral system (base 16). It is composed of two f digits, and f in hex is equivalent to 1111 in binary, so 0xff is 11111111 (255). That byte can never appear in valid UTF-8, and a file that begins with the bytes 0xff 0xfe typically starts with a UTF-16 little-endian byte-order mark.
- Solution: use encoding = 'utf-16' when fetching the data.
After fixing the encoding, you may hit a second error:
Error tokenizing data. C error: Expected 1 fields in line 3, saw 3
This occurs because the header and footer lines of the file are not separated the same way as the data rows. The solution to both problems is given below:
# Import all the important Libraries
import pandas as pd
# Fetch 'CSV' Data Using 'URL' and store it in 'df'
url = 'https://www.hkex.com.hk/eng/dwrc/search/dwFullList.csv'
# error_bad_lines was removed in pandas 2.0; on_bad_lines = 'skip' (pandas >= 1.3) replaces it
df = pd.read_csv(url, encoding = 'utf-16', sep = '\t', on_bad_lines = 'skip', skiprows = 1, skipfooter = 3, engine = 'python')
# Print a few records of df
df.head()
With these options, the CSV should load into the desired DataFrame. Hope this solution helps you.
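To see both problems without touching the network, here is a minimal offline sketch. The sample text, its title and footer lines, and the variable names are all made up for illustration; the only assumption carried over from the answer is that the file is tab-separated UTF-16 with a byte-order mark.

```python
import codecs
import io

import pandas as pd

# Hypothetical sample: a title line, a tab-separated table, and a footer line,
# stored as UTF-16 little-endian with a byte-order mark (BOM).
text = "Sample report\ncode\tname\tprice\n1\tAlpha\t10\n2\tBeta\t20\nEnd of report"
payload = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The first byte of the BOM is 0xff, i.e. 255 or 0b11111111.
first_byte = payload[0]

# Decoding as UTF-8 fails on that first byte.
try:
    payload.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Reading as UTF-16, with the title and footer lines skipped, works.
df = pd.read_csv(
    io.BytesIO(payload),
    encoding="utf-16",
    sep="\t",
    skiprows=1,      # drop the title line above the real header
    skipfooter=1,    # drop the footer line; needs the python engine
    engine="python",
)
print(hex(first_byte), utf8_error)
print(df)
```

The real file may need different skiprows and skipfooter values; the numbers here only match the made-up sample.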
Pandas Error attempting to read csv from a url
Use the url for the raw data.
pd.read_csv('https://raw.githubusercontent.com/kjhealy/fips-codes/master/state_and_county_fips_master.csv')
Pandas read_csv from URL and include request header
As of pandas 1.3.0, you can pass custom HTTP(S) headers using the storage_options argument:
import pandas as pd
url = "https://moz.com:443/top500/domains/csv"
# Headers sent along with the HTTP(S) request
hdr = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}
domains_df = pd.read_csv(url, storage_options=hdr)
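For pandas versions older than 1.3.0, one possible workaround is to send the headers yourself and hand the downloaded bytes to read_csv. This is only a sketch using the standard library's urllib; the helper name `read_csv_with_headers` is hypothetical, and the demo reads a temporary local file through a file:// URL purely so that it runs offline (headers are ignored for such URLs).

```python
import io
import pathlib
import tempfile
import urllib.request

import pandas as pd


def read_csv_with_headers(url, headers, **read_csv_kwargs):
    """Fetch `url` with custom request headers and parse the body as CSV.

    Hypothetical helper, not part of pandas.
    """
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        payload = response.read()
    # Wrap the downloaded bytes in an in-memory file for read_csv.
    return pd.read_csv(io.BytesIO(payload), **read_csv_kwargs)


# Offline demo: a made-up CSV served from a temporary file instead of a web server.
with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / "domains.csv"
    sample.write_text("rank,domain\n1,example.com\n2,example.org\n", encoding="utf-8")
    demo_df = read_csv_with_headers(sample.as_uri(), {"User-Agent": "Mozilla/5.0"})

print(demo_df)
```

With a real site, you would pass the https URL and the hdr dictionary from the answer instead of the demo values.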
Pandas read_csv response codes when using an external url
You can pass a URL to read_csv(), but it has no way of giving you the status code. It simply raises an error when the response has a non-200 status code, and you have to use try/except to catch it.
But if you have to use requests, you can then use io.StringIO to create a file-like object (a file in memory) and pass it to read_csv().
import io
import requests
import pandas as pd
response = requests.get("https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv")
print('status_code:', response.status_code)
#if response.status_code == 200:
if response.ok:
    df = pd.read_csv(io.StringIO(response.text))
else:
    df = None
print(df)
In the same way, you can use io.StringIO when you build a web page that receives a CSV file through an HTML <form>.
As far as I know, read_csv(url) works in a similar way: it downloads the file data from the server (pandas uses urllib from the standard library for this, not requests) and then reads the data from an in-memory buffer.
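The try/except approach mentioned at the start of this answer might look like the sketch below. It assumes pandas fetches URLs through urllib and therefore raises urllib.error.HTTPError (a subclass of URLError) on a non-200 response; the helper name `read_csv_or_error` is hypothetical. The demo uses file:// URLs as an offline stand-in for a server, so the failure shown is a plain URLError rather than an HTTP status.

```python
import pathlib
import tempfile
import urllib.error

import pandas as pd


def read_csv_or_error(url):
    """Return (DataFrame, None) on success or (None, exception) on a URL failure.

    Hypothetical helper. For an HTTP failure the exception is an HTTPError,
    whose .code attribute holds the status code (for example 404).
    """
    try:
        return pd.read_csv(url), None
    except urllib.error.URLError as exc:  # also catches HTTPError
        return None, exc


# Offline demo with a made-up CSV: one URL that exists and one that does not.
with tempfile.TemporaryDirectory() as tmp:
    sample = pathlib.Path(tmp) / "addresses.csv"
    sample.write_text("name,city\nAda,London\n", encoding="utf-8")
    ok_df, ok_err = read_csv_or_error(sample.as_uri())
    missing_df, missing_err = read_csv_or_error((pathlib.Path(tmp) / "missing.csv").as_uri())

print(ok_df)
print(type(missing_err).__name__)
```

If you need the status code of successful responses as well, the requests-based version above is the simpler route.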
Progress in bytes when reading CSV from URL with pandas
Not thoroughly tested, but you can implement a custom class with a read() method that reads from the requests response line by line and updates the tqdm bar:
import requests
import pandas as pd
from tqdm import tqdm
url = "https://public.tableau.com/views/PPBOpenDataDownloads/UseOfForce-All.csv?:showVizHome=no"
class TqdmReader:
    def __init__(self, resp):
        total_size = int(resp.headers.get("Content-Length", 0))
        self.resp = resp
        self.bar = tqdm(
            desc=resp.url,
            total=total_size,
            unit="iB",
            unit_scale=True,
            unit_divisor=1024,
        )
        self.reader = self.read_from_stream()
    def read_from_stream(self):
        for line in self.resp.iter_lines():
            # iter_lines() strips the line ending, so add it back before counting
            line += b"\n"
            self.bar.update(len(line))
            yield line
    def read(self, n=0):
        # n is ignored: each call returns one whole line
        try:
            return next(self.reader)
        except StopIteration:
            # End of stream: return empty bytes, matching the bytes yielded above
            return b""
with requests.get(url, params=None, stream=True) as resp:
    df = pd.read_csv(TqdmReader(resp))
print(len(df))
Prints:
https://public.tableau.com/views/PPBOpenDataDownloads/UseOfForce-All.csv?:showVizHome=no: 100%|██████████████████████████████████████████████████████████████████████████████| 2.09M/2.09M [00:00<00:00, 2.64MiB/s]
7975
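A variation on the same idea, sketched here without tqdm or the network: wrap any binary stream in a small io.RawIOBase subclass that reports how many bytes each read consumed. The class name `CountingReader`, the callback, and the generated sample data are all hypothetical. With requests you could presumably wrap resp.raw and pass bar.update as the callback, though resp.raw yields the body as sent on the wire (not decompressed), so treat that part as an assumption.

```python
import io

import pandas as pd


class CountingReader(io.RawIOBase):
    """Wrap a binary stream and report the size of every chunk that is read."""

    def __init__(self, raw, on_progress):
        super().__init__()
        self._raw = raw
        self._on_progress = on_progress

    def readable(self):
        return True

    def readinto(self, buffer):
        # Read at most as many bytes as the caller's buffer can hold.
        chunk = self._raw.read(len(buffer))
        n = len(chunk)
        buffer[:n] = chunk
        if n:
            self._on_progress(n)
        return n  # 0 signals end of stream


# Offline demo: made-up CSV bytes standing in for a download.
payload = b"a,b\n" + b"".join(b"%d,%d\n" % (i, i * i) for i in range(1000))
seen = []  # sizes of the chunks that were read
stream = io.BufferedReader(CountingReader(io.BytesIO(payload), seen.append))
df = pd.read_csv(stream)

print(sum(seen), len(payload), len(df))
```

Because the wrapper honours the requested read size, it behaves like an ordinary binary file object, at the cost of reporting progress per chunk rather than per line.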