Read dataset from Kaggle
I found a solution based on the answer posted here. Someone posted the link in a comment, but I don't see the comment anymore. Thank you, Good Samaritan!
library(httr)

# Authenticate against the Kaggle API with your username and API key.
# The GET follows the redirect, so dataset$url ends up holding the
# final (signed) download URL.
dataset <- httr::GET(
  "https://www.kaggle.com/api/v1/competitions/data/download/10445/train.csv",
  httr::authenticate(username, authkey, type = "basic")
)

# Download the ZIP archive to a temp file, read train.csv out of it,
# then clean up the temp file.
temp <- tempfile()
download.file(dataset$url, temp)
data <- read.csv(unz(temp, "train.csv"))
unlink(temp)
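The same flow can be sketched in Python with requests. The endpoint URL mirrors the R snippet above; the credential placeholders and helper names are assumptions for illustration:

```python
import io
import zipfile

def extract_member(zip_bytes, name):
    """Pull one file out of an in-memory ZIP archive
    (the Python analogue of unz() + read.csv in the R snippet)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(name)

def download_train_csv(username, authkey):
    """Download competition data with HTTP basic auth,
    as httr::authenticate(type = "basic") does in R."""
    import requests  # third-party; deferred import so the helper above works without it
    r = requests.get(
        "https://www.kaggle.com/api/v1/competitions/data/download/10445/train.csv",
        auth=(username, authkey),  # basic auth, same as the R snippet
    )
    r.raise_for_status()
    return extract_member(r.content, "train.csv")
```

`extract_member` avoids writing a temp file at all; the archive is unpacked straight from the response body in memory.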
How can I import data downloaded from Kaggle to DBFS using Databricks Community Edition?
spark.read...
works with DBFS paths by default, so you have two choices:
1. use
file:/databricks/driver/...
to force reading from the local file system. It will work on Community Edition because it's a single-node cluster; it won't work on a distributed cluster.
2. copy the files to DBFS using the
dbutils.fs.cp
command (docs) and read them from DBFS:
dbutils.fs.cp("file:/databricks/driver/WDataFiles_Stage1/Cities.csv",
              "/FileStore/Cities.csv")
df = spark.read.csv("/FileStore/Cities.csv")
....
How to download a file from Kaggle and work on it in Python
zipfile.BadZipFile: File is not a zip file
Clearly, what you got is not a ZIP file. The Content-Type
response header is useful for determining what you actually received. I did:
import requests
r = requests.get("https://www.kaggle.com/quora/question-pairs-dataset/download")
print(r.headers['Content-Type'])
output
text/html; charset=utf-8
So this is an HTML page. Since my browser does download a ZIP file from that URL, I suspect that access to this resource requires being logged in; otherwise you are redirected to a login page. To make requests-based downloading work, you would need to find out how Kaggle performs this check and conform to it.
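One way to guard against this failure mode is to validate the response before handing it to zipfile. The helper below is a sketch: the Content-Type check follows the diagnosis above, and the `PK\x03\x04` magic-byte check is the standard ZIP local-file-header signature.

```python
def looks_like_zip(headers, body):
    """Return True only if the response plausibly contains a ZIP archive.

    Rejects HTML responses (e.g. a login page served instead of the file)
    and anything that does not start with the ZIP magic bytes PK\\x03\\x04.
    """
    ctype = headers.get("Content-Type", "").split(";")[0].strip().lower()
    if ctype == "text/html":
        return False  # got a web page, not a download
    return body[:4] == b"PK\x03\x04"
```

Calling this before `zipfile.ZipFile(...)` turns the opaque `BadZipFile` traceback into an early, explicit check.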
How to read file from Kaggle in Jupyter Notebook in Microsoft Azure?
Kaggle has already provided extensive documentation for its command-line API here. It is built in Python and the source can be found here, so it is straightforward to use the Kaggle API pythonically.
Assuming you've already exported the username and key as environment variables
import os
os.environ['KAGGLE_USERNAME'] = '<kaggle-user-name>'
os.environ['KAGGLE_KEY'] = '<kaggle-key>'
os.environ['KAGGLE_PROXY'] = '<proxy-address>' ## skip this step if you are not working behind a firewall
or
you've successfully downloaded kaggle.json
from the API section of your Kaggle account page and copied it to ~/.kaggle/,
i.e. the Kaggle configuration directory on your system.
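For reference, kaggle.json is expected to contain just the username and key fields (placeholder values shown here):

```json
{"username": "<kaggle-user-name>", "key": "<kaggle-key>"}
```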
Then, you can use the following code in your Jupyter notebook to load this dataset to a pandas dataframe:
- Import libraries
import kaggle as kg
import pandas as pd
- Download the dataset locally
kg.api.authenticate()
# path is a directory name (here 'gt.zip'); unzip=True extracts the archive into it
kg.api.dataset_download_files(dataset="START-UMD/gtd", path='gt.zip', unzip=True)
- Read the downloaded dataset
df = pd.read_csv('gt.zip/globalterrorismdb_0718dist.csv', encoding='ISO-8859-1')