Hadoop/Hive : Loading data from .csv on a remote machine
The Hive LOAD command is as follows:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
1) If LOCAL is specified - loads from a filepath on the local filesystem.
2) If LOCAL is omitted - loads from an HDFS filepath only, i.e.:
filepath must refer to files within the same filesystem as the table's (or partition's) location
So loading directly from a remote http: path won't work (refer to the Hive DML documentation). The possible way is staging: first fetch the data from the remote http: path to the local FS or HDFS, then load it into the Hive warehouse.
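As a sketch of that staging approach (the URL, path, and table name here are hypothetical - substitute your own): fetch the file onto the local filesystem first, then load it with LOCAL:

```shell
$ curl -o /tmp/remote_data.csv http://example.com/exports/remote_data.csv
$ hive -e "LOAD DATA LOCAL INPATH '/tmp/remote_data.csv' INTO TABLE my_table;"
```

The target table must already exist with a delimiter matching the file's format.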
Loading Data from Remote Machine to Hive Database
Since Hive basically applies a schema to data that resides in HDFS, you'll want to create a location in HDFS, move your data there, and then create a Hive table that points to that location. If you're using a commercial distribution, this may be possible from Hue (the Hadoop User Environment web UI).
Here's an example from the command line.
Create csv file on local machine:
$ vi famous_dictators.csv
... and this is what the file looks like:
$ cat famous_dictators.csv
1,Mao Zedong,63000000
2,Jozef Stalin,23000000
3,Adolf Hitler,17000000
4,Leopold II of Belgium,8000000
5,Hideki Tojo,5000000
6,Ismail Enver Pasha,2500000
7,Pol Pot,1700000
8,Kim Il Sung,1600000
9,Mengistu Haile Mariam,950000
10,Yakubu Gowon,1100000
Then scp the csv file to a cluster node:
$ scp famous_dictators.csv hadoop01:/tmp/
ssh into the node:
$ ssh hadoop01
Create a folder in HDFS:
[awoolford@hadoop01 ~]$ hdfs dfs -mkdir /tmp/famous_dictators/
Copy the csv file from the local filesystem into the HDFS folder:
[awoolford@hadoop01 ~]$ hdfs dfs -copyFromLocal /tmp/famous_dictators.csv /tmp/famous_dictators/
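A quick sanity check that the file landed where the table will expect it (same paths as above):

```shell
[awoolford@hadoop01 ~]$ hdfs dfs -ls /tmp/famous_dictators/
[awoolford@hadoop01 ~]$ hdfs dfs -cat /tmp/famous_dictators/famous_dictators.csv
```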
Then log in to Hive and create the table:
[awoolford@hadoop01 ~]$ hive
hive> CREATE TABLE `famous_dictators`(
> `rank` int,
> `name` string,
> `deaths` int)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> LINES TERMINATED BY '\n'
> LOCATION
> 'hdfs:///tmp/famous_dictators';
You should now be able to query your data in Hive:
hive> select * from famous_dictators;
OK
1 Mao Zedong 63000000
2 Jozef Stalin 23000000
3 Adolf Hitler 17000000
4 Leopold II of Belgium 8000000
5 Hideki Tojo 5000000
6 Ismail Enver Pasha 2500000
7 Pol Pot 1700000
8 Kim Il Sung 1600000
9 Mengistu Haile Mariam 950000
10 Yakubu Gowon 1100000
Time taken: 0.789 seconds, Fetched: 10 row(s)
How to load a file from desktop into Hive
It depends on where this "desktop" is and which web tool exactly you are using (Hue? - I don't think you can from there).
You then have 2 options for loading data into Hive from the file:
(1) Local - from the unix box on which HDFS is located (most likely not your "desktop")
(2) Not local - from HDFS (you can e.g. interact with WebHDFS to dump the file directly there: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html, or do hadoop fs -put from the aforementioned unix box)
Documentation: https://cwiki.apache.org/confluence/display/hive/languagemanual+dml#LanguageManualDML-Loadingfilesintotables
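For the WebHDFS route, a rough sketch (hostnames, ports, and paths are placeholders): creating a file is a two-step PUT - the namenode answers the first request with a 307 redirect to a datanode, and the file body is sent to that redirect URL:

```shell
$ curl -i -X PUT "http://<namenode>:50070/webhdfs/v1/tmp/mydata.csv?op=CREATE&overwrite=true"
  (response: HTTP/1.1 307 TEMPORARY_REDIRECT with a Location: header pointing at a datanode)
$ curl -i -X PUT -T mydata.csv "<the Location URL from the redirect>"
```

Once the file is in HDFS, you can point a table at it or LOAD DATA INPATH it as described above.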
Why csv data loads to only to the first column of hive table?
How do you create the table? You must specify the delimiter; otherwise Hive falls back to its default field delimiter (Ctrl-A, '\001'), so each whole comma-separated line ends up in the first column:
hive> CREATE TABLE dev.k_site(location string, year string, perc_food double, perc_g double) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
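With the delimiter declared, a comma-separated file splits across all four columns. A minimal check (the file path here is hypothetical):

```sql
LOAD DATA LOCAL INPATH '/tmp/k_site.csv' INTO TABLE dev.k_site;
SELECT location, year, perc_food, perc_g FROM dev.k_site LIMIT 5;
```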
Error loading csv data into Hive table
I have developed a tool to generate Hive scripts from a csv file. Below are a few examples of how files are generated.
Tool -- https://sourceforge.net/projects/csvtohive/?source=directory
Select a CSV file using Browse and set the Hadoop root directory, e.g.: /user/bigdataproject/
The tool generates a Hadoop script for all csv files; the following is a sample of the generated Hadoop script to insert the csv files into Hadoop:
#!/bin/bash -v
hadoop fs -put ./AllstarFull.csv /user/bigdataproject/AllstarFull.csv
hive -f ./AllstarFull.hive
hadoop fs -put ./Appearances.csv /user/bigdataproject/Appearances.csv
hive -f ./Appearances.hive
hadoop fs -put ./AwardsManagers.csv /user/bigdataproject/AwardsManagers.csv
hive -f ./AwardsManagers.hive
Sample of the generated Hive scripts:
CREATE DATABASE IF NOT EXISTS lahman;
USE lahman;
CREATE TABLE AllstarFull (playerID string,yearID string,gameNum string,gameID string,teamID string,lgID string,GP string,startingPos string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA INPATH '/user/bigdataproject/AllstarFull.csv' OVERWRITE INTO TABLE AllstarFull;
SELECT * FROM AllstarFull;
LOAD DATA INPATH loads same CSV-base data into two different and external Hive tables
It looks like you just need to specify a different LOCATION for the second external table. When you do the LOAD DATA INPATH, Hive actually moves the HDFS file into the table's path. If both tables have the same LOCATION, they will share the same underlying data files.
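A sketch of the fix, with hypothetical table names and paths - each external table gets its own LOCATION, and each copy of the CSV is loaded into its own directory:

```sql
CREATE EXTERNAL TABLE sales_a (id INT, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 'hdfs:///data/sales_a';

CREATE EXTERNAL TABLE sales_b (id INT, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 'hdfs:///data/sales_b';

-- LOAD DATA INPATH moves the HDFS file into the table's LOCATION,
-- so each table needs its own copy of the source file:
LOAD DATA INPATH '/staging/sales_copy1.csv' INTO TABLE sales_a;
LOAD DATA INPATH '/staging/sales_copy2.csv' INTO TABLE sales_b;
```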