Read Dataset from Kaggle


I found a solution based on the answer posted here. Someone posted the link in a comment, but I don't see the comment anymore. Thank you, Good Samaritan!

library(httr)

# username and authkey are your Kaggle API credentials
dataset <- httr::GET(
  "https://www.kaggle.com/api/v1/competitions/data/download/10445/train.csv",
  httr::authenticate(username, authkey, type = "basic")
)

temp <- tempfile()
download.file(dataset$url, temp, mode = "wb")  # binary mode so the ZIP isn't corrupted
data <- read.csv(unz(temp, "train.csv"))
unlink(temp)
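The same basic-auth download can be sketched in Python with requests; the function name and credential variables here are illustrative, not part of the original answer:

```python
import requests

def download_kaggle_csv(url, username, key, dest="train.zip"):
    """Fetch a Kaggle competition file with HTTP basic auth (sketch).

    `username` and `key` are your Kaggle API credentials; `url` is the same
    competition-data endpoint used in the R snippet above.
    """
    r = requests.get(url, auth=(username, key), stream=True)
    r.raise_for_status()
    with open(dest, "wb") as f:  # write in binary mode, as with mode = "wb" in R
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
    return dest
```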

How can I import data downloaded from Kaggle to DBFS using Databricks Community Edition?

spark.read... works with DBFS paths by default, so you have two choices:

  • use file:/databricks/driver/... to force reading from the local file system. This works on the Community Edition because it's a single-node cluster; it won't work on a distributed cluster

  • copy files to DBFS using the dbutils.fs.cp command (docs) and read from DBFS:

dbutils.fs.cp("file:/databricks/driver/WDataFiles_Stage1/Cities.csv",
              "/FileStore/Cities.csv")
df = spark.read.csv("/FileStore/Cities.csv")
....
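The two choices above can be sketched side by side; `spark` and `dbutils` are provided by the Databricks notebook runtime, and the file names are taken from the snippet above:

```python
# Paths from the example above; the driver-local path only works on a
# single-node (Community Edition) cluster.
LOCAL_PATH = "file:/databricks/driver/WDataFiles_Stage1/Cities.csv"
DBFS_PATH = "/FileStore/Cities.csv"

def load_cities(spark, dbutils):
    # Option 1: read straight from the driver's local file system
    # (fails on a distributed cluster, where executors lack this file).
    df_local = spark.read.csv(LOCAL_PATH, header=True)

    # Option 2: copy the file to DBFS first, then read from there
    # (works on any cluster).
    dbutils.fs.cp(LOCAL_PATH, DBFS_PATH)
    return spark.read.csv(DBFS_PATH, header=True)
```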

How to download a file from Kaggle and work on it in python

zipfile.BadZipFile: File is not a zip file

Clearly, what you got is not a ZIP file. The Content-Type response header is useful for determining what you actually received, so I did:

import requests
r = requests.get("https://www.kaggle.com/quora/question-pairs-dataset/download")
print(r.headers['Content-Type'])

output

text/html; charset=utf-8

So this is an HTML page. Since my browser does download a ZIP file from this URL, I suspect that access to the resource requires being logged in; otherwise you are redirected to a login page. To make requests-based downloading work, you would need to find out how Kaggle performs this check and conform to it.
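Beyond checking the Content-Type header, you can guard against this failure mode by inspecting the first bytes of the response before handing it to zipfile; `looks_like_zip` is a helper name invented for this sketch:

```python
ZIP_MAGIC = b"PK\x03\x04"  # local-file-header signature of a ZIP archive

def looks_like_zip(payload: bytes) -> bool:
    """Return True if the bytes begin with the ZIP magic number."""
    return payload[:4] == ZIP_MAGIC

# Usage sketch: r = requests.get(url); unzip only if looks_like_zip(r.content),
# otherwise you probably received an HTML login page, as described above.
```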

How to read file from Kaggle in Jupyter Notebook in Microsoft Azure?

Kaggle has already provided extensive documentation for their command-line API here, which is built with Python; the source can be found here, so reverse-engineering it in order to use the Kaggle API pythonically is straightforward.

Assuming you've already exported the username and key as environment variables

import os
os.environ['KAGGLE_USERNAME'] = '<kaggle-user-name>'
os.environ['KAGGLE_KEY'] = '<kaggle-key>'
os.environ['KAGGLE_PROXY'] = '<proxy-address>'  # skip this step if you are not behind a proxy/firewall

or
you've successfully downloaded kaggle.json from the API section of your Kaggle account page and copied it to ~/.kaggle/, i.e. the Kaggle configuration directory on your system.
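For reference, kaggle.json is a small JSON file containing exactly the two credential fields (the values shown are placeholders):

```json
{
  "username": "<kaggle-user-name>",
  "key": "<kaggle-key>"
}
```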

Then, you can use the following code in your Jupyter notebook to load this dataset to a pandas dataframe:

  1. Import libraries
import kaggle as kg
import pandas as pd


  2. Download the dataset locally
kg.api.authenticate()
kg.api.dataset_download_files(dataset="START-UMD/gtd", path='gt.zip', unzip=True)

  3. Read the downloaded dataset
# with unzip=True, 'gt.zip' names the directory the files were extracted into
df = pd.read_csv('gt.zip/globalterrorismdb_0718dist.csv', encoding='ISO-8859-1')

