How to Generate All N-Grams in Hive

This is perhaps not the most efficient solution, but it works. Split the sentence by a delimiter (in my example, one or more spaces or commas), explode the words together with their positions, self-join to pair each word with its successor, then assemble the array of n-grams using collect_set (if you need unique n-grams) or collect_list:

with src as
(
select source_data.sentence, words.pos, words.word
  from
      ( -- replace this subquery (source_data) with your table
        select stack(2,
                     'This is my sentence',
                     'This is another sentence'
                    ) as sentence
      ) source_data
      -- split the sentence and explode words with their positions
      lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)
select s1.sentence, collect_set(concat_ws(' ', s1.word, s2.word)) as ngrams
  from src s1
       inner join src s2 on s1.sentence = s2.sentence and s1.pos + 1 = s2.pos
 group by s1.sentence;

Result:

OK
This is another sentence ["This is","is another","another sentence"]
This is my sentence ["This is","is my","my sentence"]
Time taken: 67.832 seconds, Fetched: 2 row(s)
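The same split / posexplode / self-join pipeline can be sketched outside Hive; here is a minimal Python equivalent (the helper name bigrams is my own, not from the query):

```python
import re

def bigrams(sentence):
    # split(sentence, '[ ,]+') -> words; their list indices play the role of pos
    words = re.split(r'[ ,]+', sentence)
    # the self-join on s1.pos + 1 = s2.pos pairs each word with its successor
    pairs = [' '.join(p) for p in zip(words, words[1:])]
    # collect_set keeps only unique n-grams; preserve first-seen order here
    seen, unique = set(), []
    for p in pairs:
        if p not in seen:
            seen.add(p)
            unique.append(p)
    return unique

print(bigrams('This is my sentence'))   # ['This is', 'is my', 'my sentence']
```

For trigrams, the Hive query needs a second join on s1.pos + 2 = s3.pos; in the sketch above that corresponds to zipping three shifted copies of the word list.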

N-gram combinations of words in Hive

with cte as
(
select t.ID
      ,t.COUNTRY
      ,pe.pos
      ,pe.val
  from mytable t
       lateral view posexplode(split(VAR1, '\\s+')) pe as pos, val
)
select t1.COUNTRY
      ,concat_ws(' ', t1.val, t2.val) as combination
      ,count(*) as cnt
  from cte t1
       join cte t2
         on t2.ID = t1.ID
 where t1.pos < t2.pos
 group by t1.COUNTRY
         ,t1.val
         ,t2.val
;

+---------+--------------+-----+
| country | combination  | cnt |
+---------+--------------+-----+
| CANADA  | dad child    | 1   |
| CANADA  | dad dog      | 1   |
| CANADA  | dad mum      | 2   |
| CANADA  | mum child    | 1   |
| CANADA  | mum dog      | 1   |
| USA     | bird car     | 1   |
| USA     | bird lion    | 1   |
| USA     | lion car     | 1   |
+---------+--------------+-----+
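Note that the t1.pos < t2.pos condition produces all ordered word pairs per row, not just adjacent ones, and count(*) aggregates the pairs across rows. The same counting can be sketched in Python (the sample rows are hypothetical, since the question's table contents are not shown):

```python
import re
from collections import Counter
from itertools import combinations

# hypothetical rows: (ID, COUNTRY, VAR1)
rows = [
    (1, 'CANADA', 'dad mum child'),
    (2, 'CANADA', 'dad mum dog'),
]

counts = Counter()
for _id, country, var1 in rows:
    words = re.split(r'\s+', var1)
    # t1.pos < t2.pos: every pair whose left word appears first in the row
    for w1, w2 in combinations(words, 2):
        counts[(country, f'{w1} {w2}')] += 1

for (country, combo), cnt in sorted(counts.items()):
    print(country, combo, cnt)   # e.g. CANADA dad mum 2
```

With these two sample rows, "dad mum" occurs in both and gets count 2, mirroring the dad mum / 2 line in the result above.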

How do I get the ngram array string and estfrequency as separate elements in a Hive table using HiveQL?

To change your code from 1-grams to 2-grams, change array(null) to array(null, null) (the context argument of context_ngrams()).

Import data to Hive (finding trending topics using Google Books n-grams data)

Your problem isn't entirely clear, but the table you create is english_1grams while your HQL runs select * from ngrams, so the table names do not match.

Here are my steps; following them, I can see the data.

  1. Download the data from the Google Books site and decompress it to my home folder.
  2. Copy the data from the local filesystem to HDFS: hadoop dfs -copyFromLocal ${env:HOME}/googlebooks-eng-all-1gram-20120701-2 /user/hive/warehouse/googlebook
  3. Create the table:

    create external table english_gram(
    gram string,
    year int,
    page int,
    book int)
    row format delimited
    fields terminated by '\t'
    stored as textfile
    location 'hdfs://ubuntu:9005/user/hive/warehouse/googlebook';
  4. Run select * from english_gram; and you will see the data.
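Each line of the downloaded file is tab-separated and must match the four columns of the DDL; a quick Python check of that format (the sample line is illustrative, and the column names come from the table definition above):

```python
def parse_1gram(line):
    # fields terminated by '\t', matching: gram string, year int, page int, book int
    gram, year, page, book = line.rstrip('\n').split('\t')
    return gram, int(year), int(page), int(book)

print(parse_1gram('circumvallate\t1978\t335\t91'))
```

If a line does not split into exactly four fields, Hive would surface it as NULL columns, so a parser like this is a cheap sanity check before loading.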


