How to Extract Part of a String in Hive

how to extract a part of a string in hive

Use Hive's regexp_extract(string subject, string pattern, int index) function:

SELECT regexp_extract(desc, '.*? (\\d+) .*$', 1) AS Revenue
FROM table1
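
To check the pattern without touching a table, it can be run against a string literal. This is a minimal sketch; the sample text is made up for illustration and is not from the original question:

```sql
-- Hypothetical sample value. '\\d' in a Hive string literal reaches the regex engine as \d.
SELECT regexp_extract('Total revenue 12345 USD', '.*? (\\d+) .*$', 1) AS Revenue;
-- Group 1 is the first run of digits with a space on both sides: 12345
```

Note that the pattern requires a space on each side of the number, so a number at the very start or end of the string will not match.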

See other examples in:

  • "Hive QL selecting numeric substring of string"
  • "extracting a substring from a text column in hive"

Extract substring with a specific pattern in Hive SQL

Try using:

SELECT REGEXP_EXTRACT(colname, ".*(M6[^_]*).*", 1) FROM tableName

Regex used:

.*(M6[^_]*).*

Explanation:

  • .* - matches 0+ occurrences of any character that is not a newline character
  • (M6[^_]*) - matches M6 followed by 0+ occurrences of any character that is not a _, so the match runs from M6 up to (but not including) the next _. The parentheses store this sub-match in Group 1
  • .* - matches 0+ occurrences of any character that is not a newline character
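
The explanation above can be tried on a literal. This is a sketch with an invented sample value; tableName and colname are the same placeholders used in the answer:

```sql
-- Hypothetical sample value standing in for colname.
SELECT regexp_extract('AB_M6XYZ_123', '.*(M6[^_]*).*', 1) AS m6_part;
-- Group 1: M6XYZ

-- To keep only rows that actually contain such a token, filter with RLIKE.
SELECT regexp_extract(colname, '.*(M6[^_]*).*', 1) AS m6_part
FROM tableName
WHERE colname RLIKE 'M6';
```

Because the leading .* is greedy, a value containing several M6 tokens yields the last one, not the first.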

Get a substring in hive

There are several ways to extract the hour from a timestamp value.

1. Using the substring function:

select substring(string("2017-06-05 09:06:32.0"),12,2);
+------+--+
| _c0 |
+------+--+
| 09 |
+------+--+

2. Using regexp_extract:

select regexp_Extract(string("2017-06-05 09:06:32.0"),"\\s(\\d\\d)",1);
+------+--+
| _c0 |
+------+--+
| 09 |
+------+--+

3. Using hour:

select hour(timestamp("2017-06-05 09:06:32.0"));
+------+--+
| _c0 |
+------+--+
| 9 |
+------+--+

4. Using from_unixtime:

select from_unixtime(unix_timestamp('2017-06-05 09:06:32.0'),'HH');
+------+--+
| _c0 |
+------+--+
| 09 |
+------+--+

5. Using date_format:

select date_format(string('2017-06-05 09:06:32.0'),'HH');
+------+--+
| _c0 |
+------+--+
| 09 |
+------+--+

6. Using split:

select split(split(string('2017-06-05 09:06:32.0'),' ')[1],':')[0];
+------+--+
| _c0 |
+------+--+
| 09 |
+------+--+
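
For date_format and from_unixtime, the pattern letters follow Java's date patterns: HH is the 24-hour clock and hh is the 12-hour clock. The two agree for a morning value such as 09:06:32, so a quick check with an afternoon timestamp (chosen here purely for illustration) shows the difference:

```sql
-- Afternoon timestamp, picked so that the 12-hour and 24-hour results differ.
SELECT date_format('2017-06-05 15:06:32', 'hh') AS hour_12,   -- 03
       date_format('2017-06-05 15:06:32', 'HH') AS hour_24;   -- 15
```

Also note that hour() returns an integer, which is why its result above is 9 rather than 09.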

extracting a substring from a text column in hive

Use the regexp_extract function with a matching regex to capture only the displayName value from your title field.

Ex:

hive> with tb as(select string('"id":"S-1-98-13474422323-33566802",
"name":"uid=Xzdpr0,ou=people,dc=vm,dc=com","shortName":"XZDPR0",
"displayName":"Jund Lee","emailAddress":"jund.lee@bm.com",
"title":"Leading Product Investor"')title)
select regexp_extract(title,'"displayName":"(.*?)"',1) title from tb;

+-----------+--+
| title |
+-----------+--+
| Jund Lee |
+-----------+--+
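
The same approach works for any other quoted field by changing the field name in the pattern. A minimal sketch on a shortened copy of the sample row, using [^"]* instead of .*? so the capture cannot run past the closing quote:

```sql
-- Shortened version of the sample row above; only the field name in the pattern changes.
WITH tb AS (
  SELECT '"displayName":"Jund Lee","emailAddress":"jund.lee@bm.com"' AS title
)
SELECT regexp_extract(title, '"emailAddress":"([^"]*)"', 1) AS email
FROM tb;
-- email: jund.lee@bm.com
```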

Single hive query to extract a piece of string

You could start from the first slash and take everything until the next space:

regexp_extract(testdata, '(/[^\\s]+)', 0)
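
As a quick illustration on a literal (the sample value is an assumption, not the data from the original question):

```sql
-- Hypothetical value standing in for testdata.
SELECT regexp_extract('GET /index.html HTTP/1.1', '(/[^\\s]+)', 0) AS path;
-- path: /index.html
```

Index 0 returns the whole match; since the entire pattern is wrapped in one group, index 1 would return the same text here.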

Impala/Hive function to get the substring of a string

Try:

REGEXP_EXTRACT('your string', ':abd: ([^:]+)', 1)

The regexp :abd: ([^:]+) means: match ':abd: ' followed by one or more characters that are not ':'.

This regexp assumes that ':' does not appear within the "value" strings. As such, it would fail on this input, capturing only 5768:

:abd: 5768:92034 :erg: 94856023MXCI :oute: A RF WERS YUT :oowpo: 649217349GBT GB
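
If values may contain ':', one possible workaround is to stop at the next key instead of at the next colon. The sketch below assumes that every key has the form space, colon, lowercase letters, colon (as in the sample line above); that format is an assumption, not something the original answer states:

```sql
-- Lazy capture that ends just before the next ' :key:' marker or at the end of the string.
SELECT regexp_extract(':abd: 5768:92034 :erg: 94856023MXCI',
                      ':abd: (.*?)(?: :[a-z]+:|$)', 1) AS abd_value;
-- abd_value: 5768:92034
```

With the original pattern, the capture also includes the space before the next key, so wrapping the call in trim() may be worthwhile.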

