Pandas: How to read a public CSV file from Google Drive?
Using pandas
import pandas as pd
url='https://drive.google.com/file/d/0B6GhBwm5vaB2ekdlZW5WZnppb28/view?usp=sharing'
file_id=url.split('/')[-2]
dwn_url='https://drive.google.com/uc?id=' + file_id
df = pd.read_csv(dwn_url)
print(df.head())
Using pandas and requests
import pandas as pd
import requests
from io import StringIO
url='https://drive.google.com/file/d/0B6GhBwm5vaB2ekdlZW5WZnppb28/view?usp=sharing'
file_id = url.split('/')[-2]
dwn_url='https://drive.google.com/uc?export=download&id=' + file_id
url2 = requests.get(dwn_url).text
csv_raw = StringIO(url2)
df = pd.read_csv(csv_raw)
print(df.head())
output
sex age state cheq_balance savings_balance credit_score special_offer
0 Female 10.0 FL 7342.26 5482.87 774 True
1 Female 14.0 CA 870.39 11823.74 770 True
2 Male 0.0 TX 3282.34 8564.79 605 True
3 Female 37.0 TX 4645.99 12826.76 608 True
4 Male NaN FL NaN 3493.08 551 False
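The `url.split('/')[-2]` step in the snippets above only works for sharing links of the form `.../file/d/<id>/view`. As a minimal sketch, the helper below also accepts the `uc?id=<id>` form; the function names `drive_file_id` and `direct_download_url` are illustrative and not part of pandas or any Google library.

```python
import re
from urllib.parse import urlparse, parse_qs


def drive_file_id(share_url: str) -> str:
    """Return the file ID embedded in a Google Drive URL (hypothetical helper)."""
    parsed = urlparse(share_url)
    # Form 1: https://drive.google.com/file/d/<id>/view?usp=sharing
    match = re.search(r"/d/([A-Za-z0-9_-]+)", parsed.path)
    if match:
        return match.group(1)
    # Form 2: https://drive.google.com/uc?id=<id>
    ids = parse_qs(parsed.query).get("id")
    if ids:
        return ids[0]
    raise ValueError(f"no file ID found in {share_url!r}")


def direct_download_url(share_url: str) -> str:
    """Build the same download URL used in the requests example above."""
    return "https://drive.google.com/uc?export=download&id=" + drive_file_id(share_url)
```

With this in place, `pd.read_csv(direct_download_url(url))` would replace the manual split, assuming the file is shared publicly.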
Read csv file hosted on Google Drive
You could try it like this:
id <- "0B-wuZ2XMFIBUd09Ob0pKVkRzQTA" # google file ID
read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id))
Read CSV file from Google Drive or any cloud service with Python Pandas
Your data is UTF-16 encoded. You can read it by specifying the encoding:
pd.read_csv(dwn_url, encoding='utf16')
Result:
email first_name last_name
0 NaN NaN NaN
1 uno@gmail.com Luca Rossi
2 due@gmail.com Daniel Bianchi
3 tre@gmail.com Gabriel Domeneghetti
4 qua@gmail.com Christian Bona
5 cin@gmail.com Simone Marsango
(read_csv can read directly from a URL, so there is no need for requests and StringIO.)
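If it is not known in advance that a file is UTF-16, one option is to look at its byte order mark before choosing an encoding. The following is a rough sketch using only the standard library; `sniff_bom_encoding` is a hypothetical helper, and it falls back to a default when no BOM is present, which does not prove the file really is UTF-8.

```python
import codecs


def sniff_bom_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Guess a text encoding from the byte order mark at the start of raw."""
    # UTF-32 must be checked first: its little-endian BOM begins with
    # the same two bytes as the UTF-16 little-endian BOM.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return default
```

The returned name could then be passed as the `encoding` argument of `pd.read_csv`, for example after fetching the first bytes of the file with `requests.get(dwn_url).content`.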
Error reading CSV with pandas from Google Drive URL
Short answer: you can't pass a Google Drive sharing URL to pd.read_csv(). You have to download the CSV file and use the actual path to it.
Basically, the Google Drive URL only tells you that a CSV file exists. In reality, it points to a web page (with HTML content) that shows some information about the CSV file Google is hosting. That's why you see <!DOCTYPE html>... in the output.
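A quick way to confirm this diagnosis is to inspect the start of the downloaded body before handing it to pandas. This is only a heuristic sketch; `looks_like_html` is an illustrative name, and it checks nothing beyond the leading markup.

```python
def looks_like_html(body: str) -> bool:
    """Return True if the text starts like an HTML page rather than CSV data."""
    # Only the first few hundred characters matter; ignore leading whitespace.
    head = body.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")
```

If this returns True for the text fetched from the URL, the link is serving the Drive preview page and not the file itself.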
Locally, this works because you use an actual file system path that pandas can read. If you want to do this with a remote file, you have to fetch the file so that it is available on the local file system. In general, you can use the wget or curl command, but this is not straightforward with Google Drive because you need to be authenticated with your Google account to access the file. There are some ideas on how to do that here and here.
The simplest way to download a file in Python or a Jupyter notebook is to use gdown. You can install it via pip, provide your URL, and it will download the file for you.
# install gdown in terminal
pip install gdown
# download your file
gdown 'https://drive.google.com/uc?id=1iE1nHPJvglklttBEqX92_Mfg6421CtMq'
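Notice the URL that we're providing to gdown: it uses the uc?id= form with the file ID, not the sharing link. Once the file is downloaded, read it from its local path: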
import pandas as pd
pd.read_csv('/path/to/file.csv')
I created an example notebook for you in Deepnote; you can do the same in a local Python REPL, in VS Code, in a Jupyter notebook, or in Google Colab.
There is a special way for you to connect to Drive from Colab by mounting Drive. More on that here.
Get CSV from google drive and then load to pandas
I believe your goal and situation are as follows.
- You want to download the CSV data from a CSV file on Google Drive.
- You can already get values from Google Spreadsheet using googleapis for Python.
Pattern 1:
In this pattern, the CSV data is downloaded with googleapis and saved as a file. The data is retrieved with the "Files: get" method of Drive API v3.
Sample script:
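import io

from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

file_id = "###"  # Please set the file ID of the CSV file.
# creds: authorized Google credentials, obtained beforehand.
service = build('drive', 'v3', credentials=creds)
request = service.files().get_media(fileId=file_id)
fh = io.FileIO("sample.csv", mode='wb')
downloader = MediaIoBaseDownload(fh, request)
done = False
while not done:
    # Each call fetches the next chunk; done becomes True after the last one.
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))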
- In this case, the CSV data can be converted to a dataframe with df = pd.read_csv("sample.csv").
Pattern 2:
In this pattern, as a simpler method, the access token is taken from creds. The downloaded CSV data is not saved as a file. The data is retrieved with the "Files: get" method of Drive API v3.
Sample script:
import io
import requests

file_id = "###"  # Please set the file ID of the CSV file.
access_token = creds.token  # creds: authorized Google credentials, obtained beforehand.
url = "https://www.googleapis.com/drive/v3/files/" + file_id + "?alt=media"
res = requests.get(url, headers={"Authorization": "Bearer " + access_token})
print(res.text)
- In this case, the CSV data can be converted directly to a dataframe with df = pd.read_csv(io.StringIO(res.text)).
Note:
- In these scripts, please include the scope https://www.googleapis.com/auth/drive.readonly and/or https://www.googleapis.com/auth/drive. If you modify the scopes, please reauthorize them so that the modified scopes are included in the access token. Please be careful about this.
Reference:
- Download files
Getting a CSV read into R through a shareable Google Drive link
The answer was indicated in the post you linked. Namely,
id <- "0B5V8AyEFBTmXM1VIYUYxSG5tSjQ"
stuff <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id))