Is the IN Relation in Cassandra Bad for Queries

Is the IN relation in Cassandra bad for queries?

I remember seeing someone answer this question on the Cassandra user mailing list a short while back, but I cannot find the exact message right now. Ironically, Cassandra Evangelist Rebecca Mills just posted an article that addresses this issue (Things you should be doing when using Cassandra drivers...points #13 and #22). But the answer is "yes": in some cases, multiple parallel queries will be faster than using IN. The underlying reason can be found in the DataStax SELECT documentation.

When not to use IN

...Using IN can degrade performance because
usually many nodes must be queried. For example, in a single, local
data center cluster with 30 nodes, a replication factor of 3, and a
consistency level of LOCAL_QUORUM, a single key query goes out to two
nodes, but if the query uses the IN condition, the number of nodes
being queried are most likely even higher, up to 20 nodes depending on
where the keys fall in the token range.

So based on that, it would seem that this becomes more of a problem as your cluster gets larger.
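For illustration (the table and column names here are hypothetical), a multi-key query such as:

    SELECT * FROM tickets WHERE ticket_id IN (1, 2, 3, 4);

    -- can instead be issued as separate single-key queries, run in
    -- parallel from the application, each hitting only its own replicas:
    SELECT * FROM tickets WHERE ticket_id = 1;
    SELECT * FROM tickets WHERE ticket_id = 2;
    SELECT * FROM tickets WHERE ticket_id = 3;
    SELECT * FROM tickets WHERE ticket_id = 4;

Most drivers let you fire these asynchronously, so the coordinator-fanout cost of the IN is avoided without serializing the round trips.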

Therefore, the best way to solve this problem (and not have to use IN at all) would be to rethink your data model for this query. Without knowing too much about your schema, perhaps there are attributes (column values) shared by ticket IDs 1, 2, 3, and 4. Something like a level or group (if tickets are for a particular venue), or even an event ID, might work as a partition key instead.

Basically, while using a unique, high-cardinality identifier to partition your data sounds like a good idea, it actually makes it harder to query your data (in Cassandra) later on. If you could come up with a different column to partition your data on, that would certainly help you in this case. Regardless, creating a new, specific column family (table) to handle queries for those rows is going to be a better approach than using IN or multiple queries.
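As a sketch of that query-specific table (this assumes an event ID exists in your data; all names here are hypothetical):

    CREATE TABLE tickets_by_event (
        event_id uuid,
        ticket_id int,
        holder text,
        PRIMARY KEY (event_id, ticket_id)
    );

    -- one single-partition query replaces the multi-key IN:
    SELECT * FROM tickets_by_event WHERE event_id = ?;

The partition key (event_id) groups the related tickets together, so the query touches only the replicas for that one partition.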

Is Cassandra suitable for Aggregate Queries?

It's a common misconception that Cassandra is a columnar database. I think it comes from the old "column family" terminology for tables. Data is stored in rows containing columns of key-value pairs, which is why tables used to be called column families.

A major difference compared to traditional relational databases is that Cassandra tables can be 2-dimensional (each partition contains exactly one row) or multi-dimensional (each partition can contain ONE OR MORE rows).

Columnar databases, on the other hand, flip a 2-dimensional table so that data is stored in columns instead of rows, specifically optimised for analytics-type queries such as aggregations. This is NOT Cassandra.

Going back to your question, counting the rows within a single partition is OK for most data models. The key is to restrict the query to just one partition, like:

    SELECT COUNT(some_column) FROM table_name
    WHERE pk = ?

It's also OK to count the rows in a range query as long as they're restricted to one partition like:

    SELECT COUNT(some_column) FROM table_name
    WHERE pk = ?
    AND clustering_col >= ?
    AND clustering_col <= ?

If you don't restrict the query to a single partition, it might work for (a) very small datasets and (b) clusters with a very low number of nodes but it doesn't scale as (c) the dataset grows, and (d) the number of nodes increases. I've explained why performing aggregates such as COUNT() is bad in Cassandra in this post -- https://community.datastax.com/questions/6897/.

This is not to say that Cassandra isn't a good fit. Cassandra is a good choice if your primary use case is for storing real-time data for OLTP workloads. For analytics queries, you just need to use other software like Apache Spark since the spark-cassandra-connector will optimise the queries to Cassandra. Cheers!

What is the impact of LIMIT in Cassandra CQL

The LIMIT option puts an upper-bound on the maximum number of rows returned by a query but it doesn't prevent the query from performing a full table scan.

Cassandra has internal mechanisms such as request timeouts which prevent bad queries from causing the cluster to crash so queries are more likely to timeout rather than overloading the cluster with scans on all nodes/replicas.
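To make the distinction concrete (table and column names are illustrative):

    -- LIMIT caps the rows returned, but this still scans partitions
    -- across all nodes until the limit is satisfied:
    SELECT * FROM table_name LIMIT 100;

    -- restricting the query to a single partition avoids the scan;
    -- LIMIT then just caps the rows read from that one partition:
    SELECT * FROM table_name WHERE pk = ? LIMIT 100;

So LIMIT is a safety net on result size, not a substitute for a partition-restricted query.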

As a side note, the LIMIT option is irrelevant when used with SELECT COUNT() since the count function returns just 1 row (by design). COUNT() needs to do a full table scan regardless of the limit set. I've explained it in a bit more detail in this post -- https://community.datastax.com/questions/6897/. Cheers!

Manage many to many relationship in Cassandra

Using collections for this is probably impractical because of the size limit on collections: chances are high that the set of users in a group will be too large (although that shouldn't be a concern for a system with just a few users).

It's also worth noting that your solution based on the user_group table alone won't support querying by group. You would need to maintain another table for that query (and always write both records):

    -- "group" and "user" are quoted since group (and possibly user)
    -- are keywords in newer CQL versions
    CREATE TABLE group_user (
        "user" UUID,
        "group" UUID,
        PRIMARY KEY ("group", "user")
    );

This will allow querying by group.
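For example, listing the members of a group becomes a single-partition lookup (the identifiers are quoted because group, at least, is a keyword in newer CQL versions):

    SELECT "user" FROM group_user WHERE "group" = ?;

The symmetric lookup (groups for a user) is served by the original user_group table.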


Additional options:

Add a secondary index to user_group

Another approach is to keep just the user_group table: with a secondary index on the group column, you'll be able to perform lookups both ways:

    CREATE INDEX ON user_group ("group");

Use a materialized view

You can also use a materialized view instead of a group_user table. The data between user_group and this view will be kept in sync by Cassandra (eventually):

    CREATE MATERIALIZED VIEW group_user AS
        SELECT "group", "user"
        FROM user_group
        WHERE "user" IS NOT NULL AND "group" IS NOT NULL
    PRIMARY KEY ("group", "user");

With this, you'll have to add a record to user_group only and the view will take care of searches by group.

As you noted, each has pros and cons that can't be detailed here. Please check the docs on limitations of each option.

Would you use Cassandra for aggregate queries?

Quick answer: No :)

Cassandra doesn't support rich SQL queries. Technically Cassandra has some aggregations but this functionality is very limited.

There are several ways to do aggregation if your data is too large for RDBMS.

  1. NoSQL storage + query engine. You can store data in Cassandra, HBase, or even in files on S3, and use software such as Hive, Spark SQL or Apache Drill to execute complex SQL queries over the NoSQL storage.

  2. Elasticsearch now has a rich functionality for making aggregations.

  3. If you are on AWS the relatively simple and cheap solution is to put your data on S3 in Parquet format and use Athena to do aggregations.

Performing range queries on a Cassandra table

The first key in the primary key (a composite in your case) is responsible for scattering the data across different partitions. Each partition is held in its entirety on a single node (and its replicas). Even if the query you request were possible (it would be if you had only the date as the partition key and used ByteOrderedPartitioner; the default is Murmur3Partitioner), it would effectively do a full scan across your cluster. Think of this as similar to a full table scan in an RDBMS on a column without an index, only now the full table scan spans multiple machines.

The goal of the composite partition key here is to ensure no partition gets unmanageably big. It also takes away your ability to do the range query across dates. If you think the data for an asset can fit in a single partition, you can make the date the first clustering key. That would enable the query. However, all rows for an asset would then be in a single partition. This may be an issue (it's typically good to target around 100MB as a max partition size, though there are exceptions), and hotspots may arise in your cluster (nodes holding partitions for very busy assets will be busy, while other nodes less so).

Another way around this is to maintain manual buckets: add a bucket_id int as part of the partition key [i.e. (asset_id, bucket_id)], have the date as the first clustering key, and maintain the bucketing from application code. This distributes the data for an asset across multiple partitions that you control. It needs a bit of calculation, and you have to query each bucket yourself, but it prevents hotspots and allows your date range queries. You'd obviously only do this if the data for a particular asset is beyond single-partition size, but manageable via buckets.
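A sketch of the bucketed layout (all names are hypothetical, and the bucket scheme here is a simple year-month value computed in application code):

    CREATE TABLE readings_by_asset_bucket (
        asset_id uuid,
        bucket_id int,          -- e.g. 202401 for January 2024, computed in app code
        reading_date timestamp,
        value double,
        PRIMARY KEY ((asset_id, bucket_id), reading_date)
    );

    -- the application works out which buckets overlap the requested range,
    -- then issues one single-partition range query per bucket:
    SELECT reading_date, value
    FROM readings_by_asset_bucket
    WHERE asset_id = ?
      AND bucket_id = 202401
      AND reading_date >= '2024-01-10'
      AND reading_date < '2024-01-20';

The composite partition key keeps each bucket's data together while still bounding partition size, at the cost of the application tracking which buckets exist for a range.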

If you absolutely must partition based on date, consider something like Apache Spark (Shark has since been folded into Spark SQL) to do efficient post-aggregation.

Hope that helps.


