Hadoop/Hive: Loading data from .csv on a local machine

Hadoop/Hive: Loading data from .csv on a remote machine

The Hive LOAD DATA command is as follows:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

1) If LOCAL is specified, the file is loaded from a local filesystem path.

2) If LOCAL is omitted, the file is loaded from an HDFS path only, i.e.:
filepath must refer to files within the same filesystem as the table's (or partition's) location

So loading from a remote http: path won't work (see the Hive DML documentation). The workaround is staging: copy the data from the remote http: path to the local filesystem or HDFS first, then load it into the Hive warehouse.
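As a sketch, the staging approach might look like the following (the URL, paths, and table name are all placeholders, not from the original question):

```shell
# 1. Stage: download the remote file to the local filesystem
#    (http://example.com/data.csv is a placeholder URL).
wget -O /tmp/data.csv http://example.com/data.csv

# 2. Copy the staged file into HDFS.
hdfs dfs -mkdir -p /tmp/staging/
hdfs dfs -put /tmp/data.csv /tmp/staging/

# 3. Load from HDFS into the Hive warehouse (no LOCAL keyword,
#    since the file is already in HDFS).
hive -e "LOAD DATA INPATH '/tmp/staging/data.csv' INTO TABLE mytable;"
```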

Loading Data from Remote Machine to Hive Database

Since Hive basically applies a schema to data that resides in HDFS, you'll want to create a location in HDFS, move your data there, and then create a Hive table that points to that location. If you're using a commercial distribution, this may be possible from Hue (the Hadoop User Environment web UI).

Here's an example from the command line.

Create a csv file on the local machine:

$ vi famous_dictators.csv

... and this is what the file looks like:

$ cat famous_dictators.csv 
1,Mao Zedong,63000000
2,Jozef Stalin,23000000
3,Adolf Hitler,17000000
4,Leopold II of Belgium,8000000
5,Hideki Tojo,5000000
6,Ismail Enver Pasha,2500000
7,Pol Pot,1700000
8,Kim Il Sung,1600000
9,Mengistu Haile Mariam,950000
10,Yakubu Gowon,1100000

Then scp the csv file to a cluster node:

$ scp famous_dictators.csv hadoop01:/tmp/

ssh into the node:

$ ssh hadoop01

Create a folder in HDFS:

[awoolford@hadoop01 ~]$ hdfs dfs -mkdir /tmp/famous_dictators/

Copy the csv file from the local filesystem into the HDFS folder:

[awoolford@hadoop01 ~]$ hdfs dfs -copyFromLocal /tmp/famous_dictators.csv /tmp/famous_dictators/

Then log in to Hive and create the table:

[awoolford@hadoop01 ~]$ hive

hive> CREATE TABLE `famous_dictators`(
> `rank` int,
> `name` string,
> `deaths` int)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> LINES TERMINATED BY '\n'
> LOCATION
> 'hdfs:///tmp/famous_dictators';

You should now be able to query your data in Hive:

hive> select * from famous_dictators;
OK
1 Mao Zedong 63000000
2 Jozef Stalin 23000000
3 Adolf Hitler 17000000
4 Leopold II of Belgium 8000000
5 Hideki Tojo 5000000
6 Ismail Enver Pasha 2500000
7 Pol Pot 1700000
8 Kim Il Sung 1600000
9 Mengistu Haile Mariam 950000
10 Yakubu Gowon 1100000
Time taken: 0.789 seconds, Fetched: 10 row(s)

How to load a file from desktop into Hive

That depends on where this "desktop" is and exactly which web tool you are using (Hue? I don't think you can upload directly from it).

You then have two options for loading data into Hive from a file:

(1) Local - from the unix box on which HDFS is located (most likely not your "desktop")

(2) Not local - from HDFS (you can, e.g., interact with WebHDFS to dump the file directly there: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html, or run hadoop fs -put from the aforementioned unix box)
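For option (2), the WebHDFS REST API uses a two-step PUT: the NameNode answers with a redirect to a DataNode, and the file body is sent there. A sketch (namenode:50070 is a placeholder for your NameNode's HTTP host:port, and the file path is illustrative):

```shell
# Step 1: ask the NameNode where to write; it answers with a
# 307 Temporary Redirect whose Location header points at a DataNode.
DATANODE_URL=$(curl -s -i -X PUT \
  "http://namenode:50070/webhdfs/v1/tmp/famous_dictators.csv?op=CREATE&overwrite=true" \
  | awk '/^Location:/ {print $2}' | tr -d '\r')

# Step 2: PUT the actual file body to the redirect URL.
curl -i -X PUT -T famous_dictators.csv "$DATANODE_URL"
```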

Documentation REF: https://cwiki.apache.org/confluence/display/hive/languagemanual+dml#LanguageManualDML-Loadingfilesintotables

Why does csv data load into only the first column of the Hive table?

How did you create the table? You must specify the delimiter:

hive> CREATE TABLE dev.k_site(Location string, Year string, perc_food double, perc_g double) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
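This symptom is what you get when the table is created without FIELDS TERMINATED BY ',': Hive's default field delimiter is Ctrl-A (octal \001), so a comma-separated row parses as a single field that lands entirely in the first column. A quick local illustration of the difference (the row is made up):

```shell
row='Gondor,3019,0.45,0.12'   # a made-up CSV row

# Splitting on Hive's default delimiter (\001) finds one field...
echo "$row" | awk -F '\001' '{print NF}'   # prints 1

# ...while splitting on a comma finds four.
echo "$row" | awk -F ',' '{print NF}'      # prints 4
```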

Error loading csv data into Hive table

I have developed a tool to generate Hive scripts from a csv file. Below are a few examples of the files it generates.
Tool: https://sourceforge.net/projects/csvtohive/?source=directory

  1. Select a CSV file using Browse and set the Hadoop root directory, e.g.: /user/bigdataproject/

  2. The tool generates a Hadoop script covering all the csv files; the following is a sample of a
    generated Hadoop script that inserts the csv files into Hadoop

    #!/bin/bash -v

    hadoop fs -put ./AllstarFull.csv /user/bigdataproject/AllstarFull.csv
    hive -f ./AllstarFull.hive

    hadoop fs -put ./Appearances.csv /user/bigdataproject/Appearances.csv
    hive -f ./Appearances.hive

    hadoop fs -put ./AwardsManagers.csv /user/bigdataproject/AwardsManagers.csv
    hive -f ./AwardsManagers.hive

  3. Sample of the generated Hive scripts

    CREATE DATABASE IF NOT EXISTS lahman;

    USE lahman;

    CREATE TABLE AllstarFull (playerID string,yearID string,gameNum string,gameID string,teamID string,lgID string,GP string,startingPos string) row format delimited fields terminated by ',' stored as textfile;

    LOAD DATA INPATH '/user/bigdataproject/AllstarFull.csv' OVERWRITE INTO TABLE AllstarFull;

    SELECT * FROM AllstarFull;

Thanks
Vijay

LOAD DATA INPATH loads the same CSV-based data into two different external Hive tables

It looks like you just need to specify a different LOCATION for the second table. When you run LOAD DATA, Hive actually moves the data into that path (a LOCAL load copies it). If both tables have the same LOCATION, they will share the same data.
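A sketch of the fix (table names, columns, and paths are hypothetical): give each external table its own directory, and load a separate copy of the file into each, since LOAD DATA INPATH moves the source file out of its original HDFS location.

```shell
hive -e "
CREATE EXTERNAL TABLE sales_a (id INT, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 'hdfs:///data/sales_a';

-- Same schema, but a different LOCATION, so the tables
-- do not share the underlying files.
CREATE EXTERNAL TABLE sales_b (id INT, amount DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 'hdfs:///data/sales_b';

-- LOAD DATA INPATH moves each file, so load a distinct copy
-- into each table.
LOAD DATA INPATH '/tmp/sales_copy1.csv' INTO TABLE sales_a;
LOAD DATA INPATH '/tmp/sales_copy2.csv' INTO TABLE sales_b;
"
```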


