How to Generate All N-Grams in Hive

This is perhaps not the most efficient solution, but it works. Split the sentence by a delimiter (in my example, one or more spaces or commas), explode the words together with their positions, self-join to pair each word with its successor, then assemble the array of n-grams using collect_set (if you need unique n-grams) or collect_list:

with src as
(
select source_data.sentence, words.pos, words.word
  from
      ( -- replace this subquery (source_data) with your table
        select stack(2,
                     'This is my sentence',
                     'This is another sentence'
                    ) as sentence
      ) source_data
      -- split the sentence and explode words with their positions
      lateral view posexplode(split(sentence, '[ ,]+')) words as pos, word
)
select s1.sentence, collect_set(concat_ws(' ', s1.word, s2.word)) as ngrams
  from src s1
       inner join src s2 on s1.sentence = s2.sentence and s1.pos + 1 = s2.pos
 group by s1.sentence;

Result:

OK
This is another sentence ["This is","is another","another sentence"]
This is my sentence ["This is","is my","my sentence"]
Time taken: 67.832 seconds, Fetched: 2 row(s)
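The same split / posexplode / self-join pipeline can be sketched outside Hive; here is a minimal Python equivalent (the helper name bigrams is my own, not from the query):

```python
import re

def bigrams(sentence):
    # split(sentence, '[ ,]+') -> words; their list indices play the role of pos
    words = re.split(r'[ ,]+', sentence)
    # the self-join on s1.pos + 1 = s2.pos pairs each word with its successor
    pairs = [' '.join(p) for p in zip(words, words[1:])]
    # collect_set keeps only unique n-grams; preserve first-seen order here
    seen, unique = set(), []
    for p in pairs:
        if p not in seen:
            seen.add(p)
            unique.append(p)
    return unique

print(bigrams('This is my sentence'))   # ['This is', 'is my', 'my sentence']
```

For trigrams, the Hive query needs a second join on s1.pos + 2 = s3.pos; in the sketch above that corresponds to zipping three shifted copies of the word list.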

N-gram combinations of words in Hive

with cte as
(
select t.ID
      ,t.COUNTRY
      ,pe.pos
      ,pe.val
  from mytable t
       lateral view posexplode(split(VAR1, '\\s+')) pe as pos, val
)
select t1.COUNTRY
      ,concat_ws(' ', t1.val, t2.val) as combination
      ,count(*) as cnt
  from cte t1
       join cte t2
         on t2.ID = t1.ID
 where t1.pos < t2.pos
 group by t1.COUNTRY
         ,t1.val
         ,t2.val
;

+---------+--------------+-----+
| country | combination  | cnt |
+---------+--------------+-----+
| CANADA  | dad child    | 1   |
| CANADA  | dad dog      | 1   |
| CANADA  | dad mum      | 2   |
| CANADA  | mum child    | 1   |
| CANADA  | mum dog      | 1   |
| USA     | bird car     | 1   |
| USA     | bird lion    | 1   |
| USA     | lion car     | 1   |
+---------+--------------+-----+
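Note that the t1.pos < t2.pos condition produces all ordered word pairs per row, not just adjacent ones, and count(*) aggregates the pairs across rows. The same counting can be sketched in Python (the sample rows are hypothetical, since the question's table contents are not shown):

```python
import re
from collections import Counter
from itertools import combinations

# hypothetical rows: (ID, COUNTRY, VAR1)
rows = [
    (1, 'CANADA', 'dad mum child'),
    (2, 'CANADA', 'dad mum dog'),
]

counts = Counter()
for _id, country, var1 in rows:
    words = re.split(r'\s+', var1)
    # t1.pos < t2.pos: every pair whose left word appears first in the row
    for w1, w2 in combinations(words, 2):
        counts[(country, f'{w1} {w2}')] += 1

for (country, combo), cnt in sorted(counts.items()):
    print(country, combo, cnt)   # e.g. CANADA dad mum 2
```

With these two sample rows, "dad mum" occurs in both and gets count 2, mirroring the dad mum / 2 line in the result above.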

How do I get the ngram array string and estfrequency as separate elements in a Hive table using HiveQL?

To change your code from 1-grams to 2-grams, change array(null) to array(null, null) (the context argument of context_ngrams()).

Import data to Hive (finding trending topics using Google Books n-grams data)

Your problem isn't entirely clear, but the table you create is english_1grams while your HQL runs select * from ngrams, so the table names do not match.

Here are my steps; following them, I can see the data.

  1. Download the data from the Google Books site and decompress it to my home folder.
  2. Copy the data from the local filesystem to HDFS: hadoop dfs -copyFromLocal ${env:HOME}/googlebooks-eng-all-1gram-20120701-2 /user/hive/warehouse/googlebook
  3. Create the table:

    create external table english_gram(
    gram string,
    year int,
    page int,
    book int)
    row format delimited
    fields terminated by '\t'
    stored as textfile
    location 'hdfs://ubuntu:9005/user/hive/warehouse/googlebook';
  4. Run select * from english_gram; and you will see the data.
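Each line of the downloaded file is tab-separated and must match the four columns of the DDL; a quick Python check of that format (the sample line is illustrative, and the column names come from the table definition above):

```python
def parse_1gram(line):
    # fields terminated by '\t', matching: gram string, year int, page int, book int
    gram, year, page, book = line.rstrip('\n').split('\t')
    return gram, int(year), int(page), int(book)

print(parse_1gram('circumvallate\t1978\t335\t91'))
```

If a line does not split into exactly four fields, Hive would surface it as NULL columns, so a parser like this is a cheap sanity check before loading.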


