How to Read Data from Cassandra with R

Paging the right way to import data into R from Cassandra

As I suspected, you are using a clustering column: in your primary key, user_id is the partition key and order_id is the clustering column. And you created an index on your table so that you could query on the clustering column.

In most cases, indexes are not a good idea!

Basically, NoSQL databases are not well designed to manage indexes over large tables, so try to avoid them. A good alternative is to maintain a sort of index table by hand: each time you insert into adv, don't forget to insert into adv_idx as well, in two different queries (see the sketch after the schema). The tables would look like this:

CREATE TABLE adv (
    user_id text,
    order_id text,
    advertiser_id text,
    PRIMARY KEY (user_id, order_id)
) WITH CLUSTERING ORDER BY (order_id ASC);

CREATE TABLE adv_idx (
    order_id text,
    user_id text,
    PRIMARY KEY (order_id, user_id)
);
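
For example, the dual write could look like this (a minimal sketch; the literal values are placeholders):

-- Write the same fact to both tables; Cassandra has no multi-table
-- transactions, so the client issues two inserts (a logged batch is
-- one option if you need atomicity, at some cost in performance)
INSERT INTO adv (user_id, order_id, advertiser_id)
VALUES ('user42', 'order1', 'adv7');

INSERT INTO adv_idx (order_id, user_id)
VALUES ('order1', 'user42');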

To retrieve the information related to an order_id, you first query adv_idx, and then, for each returned user_id, you query adv. Cassandra no longer has a performance issue. Nevertheless, the client issues more queries, so the lookup takes longer to process.
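
The read path then becomes a two-step lookup, along these lines (again with placeholder values):

-- Step 1: find every user associated with this order
SELECT user_id FROM adv_idx WHERE order_id = 'order1';

-- Step 2: for each user_id returned above, fetch the full row;
-- both primary key columns are bound, so each read is cheap
SELECT user_id, order_id, advertiser_id
FROM adv
WHERE user_id = 'user42' AND order_id = 'order1';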

Another solution is to add some redundancy:

CREATE TABLE adv (
    user_id text,
    order_id text,
    advertiser_id text,
    PRIMARY KEY (user_id, order_id)
) WITH CLUSTERING ORDER BY (order_id ASC);

CREATE TABLE adv_by_order (
    order_id text,
    user_id text,
    advertiser_id text,
    PRIMARY KEY (order_id, user_id)
);
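
With this denormalized design, the client still writes twice, but a lookup by order becomes a single query (sketch with placeholder values):

-- Every write goes to both tables
INSERT INTO adv (user_id, order_id, advertiser_id)
VALUES ('user42', 'order1', 'adv7');

INSERT INTO adv_by_order (order_id, user_id, advertiser_id)
VALUES ('order1', 'user42', 'adv7');

-- A read by order now hits a single partition, with no second round trip
SELECT user_id, advertiser_id FROM adv_by_order WHERE order_id = 'order1';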

OK, your database is now twice as big. But your read performance is much better!

I tend to say that redundancy is fine!

Spark-Cassandra: how to get data based on a query

Yes, the Spark Cassandra Connector will perform so-called "predicate pushdown" if you query on the partition key, and will load only the data matched by the query (the .load function just loads the metadata; the actual data load happens the first time you really need the data to perform an action). There are well-documented rules on when predicate pushdown happens in the Spark Cassandra Connector. You can also check this by running table_df.explain() and looking at the PushedFilters part for filters marked with an asterisk *.
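
In sparklyr terms this could look as follows (a sketch: it assumes an existing connection sc with the connector on the classpath, and a hypothetical table test.kv whose partition key is id):

library(sparklyr)

session <- spark_session(sc)

kv_df <- session %>%
  invoke("read") %>%
  invoke("format", "org.apache.spark.sql.cassandra") %>%
  invoke("options", as.environment(list(keyspace = "test", table = "kv"))) %>%
  invoke("load") %>%
  # Filtering on the partition key lets the connector push the
  # predicate down to Cassandra instead of scanning the whole table
  invoke("filter", "id = 'some-id'")

# Prints the physical plan; pushed filters show up under PushedFilters,
# marked with an asterisk (*)
invoke(kv_df, "explain")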

If you need to look up multiple IDs, you can use the .isin filter, but it's really not recommended with Cassandra. It's better to create a dataframe with the IDs and perform a so-called direct join against the Cassandra dataframe (available since SCC 2.5 for dataframes, or earlier for RDDs). I have a lengthy blog post on joining with data in Cassandra.
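
A sketch of that lookup-by-IDs pattern from sparklyr follows. Assumptions: the direct join requires SCC 2.5+ with spark.sql.extensions set to com.datastax.spark.connector.CassandraSparkExtensions, and the table test.kv with key column id is a placeholder:

library(sparklyr)

session <- spark_session(sc)

# Register the Cassandra table as a temporary view
session %>%
  invoke("read") %>%
  invoke("format", "org.apache.spark.sql.cassandra") %>%
  invoke("options", as.environment(list(keyspace = "test", table = "kv"))) %>%
  invoke("load") %>%
  invoke("createOrReplaceTempView", "kv")

# Small local dataframe holding the IDs to look up
copy_to(sc, data.frame(id = c("id1", "id2", "id3")), "ids")

# With the extensions enabled, SCC can turn this into a direct join:
# one targeted Cassandra read per ID instead of a full table scan
looked_up <- sdf_sql(sc, "SELECT kv.* FROM ids JOIN kv ON ids.id = kv.id")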

Importing a Cassandra table into Spark via sparklyr: possible to select only some columns?

You can skip the eager cache and select only the columns of interest:

session <- spark_session(sc)

# Some columns to select
cols <- list("x", "y", "z")

cass_df <- session %>%
  invoke("read") %>%
  invoke("format", "org.apache.spark.sql.cassandra") %>%
  # Both keyspace and table are required to load a table
  # ("mytable" is a placeholder)
  invoke("options", as.environment(list(keyspace = "test", table = "mytable"))) %>%
  invoke("load") %>%
  # We use select(col: String, cols: String*) so the first column
  # has to be passed separately. If you want only one column, the third
  # argument has to be an empty list
  invoke("select", cols[[1]], cols[2:length(cols)]) %>%
  # Standard lazy cache if you need one
  invoke("cache")

If you use predicates that can significantly reduce the amount of fetched data, set the pushdown option to "true" (the default) and use filter before caching.
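
For example (same placeholder keyspace and table as above):

cass_df <- session %>%
  invoke("read") %>%
  invoke("format", "org.apache.spark.sql.cassandra") %>%
  invoke("options", as.environment(list(
    keyspace = "test", table = "mytable", pushdown = "true"))) %>%
  invoke("load") %>%
  # Filter before caching so the predicate can be pushed to Cassandra
  invoke("filter", "x = 'foo'") %>%
  invoke("cache")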

If you want to pass a more complex query, register a temporary view and use the sql method:

session %>%
  invoke("read") %>%
  ...
  invoke("load") %>%
  invoke("createOrReplaceTempView", "some_name")

cass_df <- session %>%
  invoke("sql", "SELECT id FROM some_name WHERE foo = 'bar'") %>%
  invoke("cache")

