Hive Query Performance for High Cardinality Field

Hive Partitions using Column Keys with High Cardinality

Yes, I would say it will be bad for HDFS. It will not literally blow up your HDFS, but it will slow down the table by creating 133,225 individual partition directories, each with its own files, out of the same data.

I would suggest choosing something coarser and more realistic, like month + year. This gives you better control and a more even distribution. With a per-date partition, weekends and holidays can have zero data.

So analyze which column combination gives the most evenly distributed counts, and choose that one.
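As a sketch of the month + year approach (table and column names here are hypothetical, not from the question):

```sql
-- Hypothetical example: partition by year and month instead of by raw date.
-- With ~10 years of data this yields ~120 partitions instead of thousands.
CREATE TABLE sales_by_month (
    order_id   BIGINT,
    amount     DECIMAL(10,2),
    order_date DATE
)
PARTITIONED BY (sale_year INT, sale_month INT)
STORED AS ORC;

-- Load via a dynamic-partition insert from a staging table:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO TABLE sales_by_month PARTITION (sale_year, sale_month)
SELECT order_id, amount, order_date,
       year(order_date), month(order_date)
FROM sales_staging;
```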

Hive multiple distinct on query running slow?

An alternative approach, if your counts are not too big (very large arrays will cause OOM errors): size(collect_set()) gives you the distinct count.

select count(*), size(collect_set(a)), size(collect_set(b)) from test.tablename;
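For contrast, this replaces a multi-distinct query of the following form (same assumed columns a and b), which tends to run slowly because Hive funnels the distinct aggregations through a single reducer stage:

```sql
-- The slow multi-distinct form that the collect_set() trick replaces:
SELECT count(*), count(DISTINCT a), count(DISTINCT b)
FROM test.tablename;
```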

Ad hoc queries against high cardinality columns

I assume you're stuck with the design, so here are a few things I'd probably look at:

1) use partitions, if partitioning is an option

2) use some triggers to denormalise (or normalise in this case) a query table which is more optimised for the query usage

3) make some snapshots

4) look at having a current table or set of tables which holds the day's records (or some suitable subset), and roll them over into a big table that stores history.
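Point 4 could be sketched like this (all table and column names are made up for illustration):

```sql
-- Hypothetical nightly rollover: ad hoc queries hit a small "current"
-- table, while older rows are appended to a large partitioned history table.
INSERT INTO TABLE events_history PARTITION (event_date)
SELECT user_id, payload, event_date
FROM events_current
WHERE event_date < current_date();

-- Then keep only today's rows in the current table:
INSERT OVERWRITE TABLE events_current
SELECT user_id, payload, event_date
FROM events_current
WHERE event_date >= current_date();
```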

It depends on usage patterns and all the other constraints the system has. This may get you started; if you can share more details, a better solution is probably out there.

Hive - Bucketing and Partitioning

Bucketing and partitioning are not exclusive, you can use both.

My short answer from my fairly long hive experience is "you should ALWAYS use partitioning, and sometimes you may want to bucket too".

If you have a big table, partitioning helps reduce the amount of data you query. A partition is usually represented as a directory on HDFS. A common usage is to partition by year/month/day, since most people query by date.
The only drawback is that you should not partition on columns with a high cardinality.
Cardinality is a fundamental concept in big data: it's the number of distinct values a column may have. 'US state', for instance, has a low cardinality (around 50), while 'ip_number' has a large cardinality (2^32 possible values).
If you partition on a field with a high cardinality, Hive will create a very large number of directories in HDFS, which is not good (it puts extra memory load on the namenode).

Bucketing can be useful, but you also have to be disciplined when inserting data into a table: Hive won't check that the data you're inserting is bucketed the way it's supposed to be.
Inserting into a bucketed table requires a CLUSTER BY, which may add an extra step in your processing.
But if you do lots of joins, they can be greatly sped up if both tables are bucketed the same way (on the same field and into the same number of buckets). Also, once you decide the number of buckets, you can't easily change it.
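A sketch of two tables bucketed the same way so their join can use a bucket map join (names and bucket count are hypothetical; the SET options are standard Hive settings):

```sql
-- Both tables bucketed and sorted on the join key, same number of buckets:
CREATE TABLE users (user_id BIGINT, name STRING)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

CREATE TABLE orders (order_id BIGINT, user_id BIGINT, amount DOUBLE)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- Enforce bucketing on insert (the default in Hive 2.x):
SET hive.enforce.bucketing=true;

-- Allow the optimizer to use a (sort-merge) bucket map join:
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

SELECT u.name, SUM(o.amount)
FROM users u JOIN orders o ON u.user_id = o.user_id
GROUP BY u.name;
```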


