How to generate all n-grams in Hive
This may not be the most optimal solution, but it works. Split each sentence by a delimiter (in my example, one or more spaces or commas), then explode the words and self-join adjacent positions to form the n-grams, and finally assemble the array of n-grams with collect_set
(if you need unique n-grams) or collect_list
:
with src as
(
select source_data.sentence, words.pos, words.word
from
(--Replace this subquery (source_data) with your table
select stack (2,
'This is my sentence',
'This is another sentence'
) as sentence
) source_data
--split and explode words
lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)
select s1.sentence, collect_set(concat_ws(' ',s1.word, s2.word)) as ngrams
from src s1
inner join src s2 on s1.sentence=s2.sentence and s1.pos+1=s2.pos
group by s1.sentence;
Result:
OK
This is another sentence ["This is","is another","another sentence"]
This is my sentence ["This is","is my","my sentence"]
Time taken: 67.832 seconds, Fetched: 2 row(s)
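The query's self-join on s1.pos+1=s2.pos pairs each word with its immediate neighbor. A minimal Python sketch of the same logic (the bigrams helper is illustrative, not part of the Hive solution), mirroring the collect_set/collect_list choice with a unique flag:

```python
import re

def bigrams(sentence, unique=True):
    """Split on spaces/commas and pair each word with the next one, like the pos+1 join."""
    words = re.split(r'[ ,]+', sentence)
    grams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    # unique=True mimics collect_set; unique=False mimics collect_list
    return list(dict.fromkeys(grams)) if unique else grams

print(bigrams("This is my sentence"))  # ['This is', 'is my', 'my sentence']
```

For trigrams you would join a third copy of src on s1.pos+2, or extend the zip with words[2:].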
N-gram combinations of words in Hive
with cte as
(
select t.ID
,t.COUNTRY
,pe.pos
,pe.val
from mytable t
lateral view posexplode (split(VAR1,'\\s+')) pe
)
select t1.COUNTRY
,concat_ws(' ',t1.val,t2.val) as combination
,count (*) as cnt
from cte t1
join cte t2
on t2.id = t1.id
where t1.pos < t2.pos
group by t1.COUNTRY
,t1.val
,t2.val
;
+----------+--------------+------+
| country | combination | cnt |
+----------+--------------+------+
| CANADA | dad child | 1 |
| CANADA | dad dog | 1 |
| CANADA | dad mum | 2 |
| CANADA | mum child | 1 |
| CANADA | mum dog | 1 |
| USA | bird car | 1 |
| USA | bird lion | 1 |
| USA | lion car | 1 |
+----------+--------------+------+
How do I get ngrams array string and estfrequency as separate elements in a Hive table using HiveQL?
To change your code from 1-grams to 2-grams, change array(null)
to array(null, null)
(this sentinel array is the second argument of Hive's context_ngrams UDF; its length sets the n-gram size).
Import data to Hive (finding trending topics using Google Books n-grams data)
I'm not clear on your problem, but the table you create is english_1grams
while your HQL is select * from ngrams
— the table names don't match.
Here are my steps, after which I can see the data.
- I downloaded the data from the Google Books site and decompressed it into my home folder.
- In Hive, copy the data from the local filesystem to HDFS:
dfs -copyFromLocal ${env:HOME}/googlebooks-eng-all-1gram-20120701-2 /user/hive/warehouse/googlebook
- Create the external table:
create external table english_gram(
gram string,
year int,
page int,
book int)
row format delimited
fields terminated by '\t'
stored as textfile
location 'hdfs://ubuntu:9005/user/hive/warehouse/googlebook';
Then run:
select * from english_gram;
and you will see the data.
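The table declares fields terminated by '\t' because each line of the downloaded 1-gram file is tab-delimited with four fields (to the best of my knowledge, the 20120701 format is ngram, year, match_count, volume_count, which map onto the table's gram/year/page/book columns). A small Python sketch of parsing one such line; the sample line is illustrative, not copied from the real file:

```python
def parse_1gram_line(line):
    """Parse one tab-delimited 1-gram line into the english_gram columns."""
    gram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    return {"gram": gram, "year": int(year),
            "page": int(match_count), "book": int(volume_count)}

sample = "circumvallate\t1978\t335\t91"  # illustrative line
row = parse_1gram_line(sample)
print(row)  # {'gram': 'circumvallate', 'year': 1978, 'page': 335, 'book': 91}
```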