MySQL and Nosql: Help Me to Choose the Right One

MySQL and NoSQL: Help me to choose the right one

You should read the following and learn a little bit about the advantages of a well designed innodb table and how best to use clustered indexes - only available with innodb !

http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/

then design your system something along the lines of the following simplified example:

Example schema (simplified)

The important features are that the tables use the innodb engine and the primary key for the threads table is no longer a single auto_incrementing key but a composite clustered key based on a combination of forum_id and thread_id. e.g.

threads - primary key (forum_id, thread_id)

forum_id    thread_id
========    =========
1                   1
1                   2
1                   3
1                 ...
1             2058300  
2                   1
2                   2
2                   3
2                  ...
2              2352141
...

Each forum row includes a counter called next_thread_id (unsigned int) which is maintained by a trigger and increments every time a thread is added to a given forum. This also means we can store 4 billion threads per forum rather than 4 billion threads in total if using a single auto_increment primary key for thread_id.

forum_id    title   next_thread_id
========    =====   ==============
1          forum 1        2058300
2          forum 2        2352141
3          forum 3        2482805
4          forum 4        3740957
...
64        forum 64       3243097
65        forum 65      15000000 -- ooh a big one
66        forum 66       5038900
67        forum 67       4449764
...
247      forum 247            0 -- still loading data for half the forums !
248      forum 248            0
249      forum 249            0
250      forum 250            0

The disadvantage of using a composite key is that you can no longer just select a thread by a single key value as follows:

select * from threads where thread_id = y;

you have to do:

select * from threads where forum_id = x and thread_id = y;

However, your application code should be aware of which forum a user is browsing so it's not exactly difficult to implement - store the currently viewed forum_id in a session variable or hidden form field etc...

Here's the simplified schema:

drop table if exists forums;
create table forums
(
forum_id smallint unsigned not null auto_increment primary key,
title varchar(255) unique not null,
next_thread_id int unsigned not null default 0 -- count of threads in each forum
)engine=innodb;


drop table if exists threads;
create table threads
(
forum_id smallint unsigned not null,
thread_id int unsigned not null default 0,
reply_count int unsigned not null default 0,
hash char(32) not null,
created_date datetime not null,
primary key (forum_id, thread_id, reply_count) -- composite clustered index
)engine=innodb;

delimiter #

create trigger threads_before_ins_trig before insert on threads
for each row
begin
declare v_id int unsigned default 0;

  select next_thread_id + 1 into v_id from forums where forum_id = new.forum_id;
  set new.thread_id = v_id;
  update forums set next_thread_id = v_id where forum_id = new.forum_id;
end#

delimiter ;

You may have noticed I've included reply_count as part of the primary key which is a bit strange as (forum_id, thread_id) composite is unique in itself. This is just an index optimisation which saves some I/O when queries that use reply_count are executed. Please refer to the 2 links above for further info on this.

Example queries

I'm still loading data into my example tables and so far I have a loaded approx. 500 million rows (half as many as your system). When the load process is complete I should expect to have approx:

250 forums * 5 million threads = 1250 000 000 (1.2 billion rows)

I've deliberately made some of the forums contain more than 5 million threads for example, forum 65 has 15 million threads:

forum_id    title   next_thread_id
========    =====   ==============
65        forum 65      15000000 -- ooh a big one

Query runtimes

select sum(next_thread_id) from forums;

sum(next_thread_id)
===================
539,155,433 (500 million threads so far and still growing...)

under innodb summing the next_thread_ids to give a total thread count is much faster than the usual:

select count(*) from threads;

How many threads does forum 65 have:

select next_thread_id from forums where forum_id = 65

next_thread_id
==============
15,000,000 (15 million)

again this is faster than the usual:

select count(*) from threads where forum_id = 65

Ok now we know we have about 500 million threads so far and forum 65 has 15 million threads - let's see how the schema performs :)

select forum_id, thread_id from threads where forum_id = 65 and reply_count > 64 order by thread_id desc limit 32;

runtime = 0.022 secs

select forum_id, thread_id from threads where forum_id = 65 and reply_count > 1 order by thread_id desc limit 10000, 100;

runtime = 0.027 secs

Looks pretty performant to me - so that's a single table with 500+ million rows (and growing) with a query that covers 15 million rows in 0.02 seconds (while under load !)

Further optimisations

These would include:

partitioning by range
sharding
throwing money and hardware at it

etc...

hope you find this answer helpful :)

mysql query performance index

What is the best Index to improve the performance for my query?

a clustered primary key

http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/

MySQL and NoSQL: Help me to choose the right one

60 million entries, select entries from a certain month. How to optimize database?

How to avoid "Using temporary" in many-to-many queries?

however your current clustered PK sid wont be much help so try something along the lines of:

create table forwarding
(
f_date date not null,
f_shop int unsigned not null,
sid int unsigned not null, -- added for uniqueness
...
primary key (f_date, f_shop, sid) -- clustered primary key
)
engine=innodb;

hope this helps :)

What is the fundmental difference between MongoDB / NoSQL which allows faster aggregation (MapReduce) compared to MySQL

There are essentially 4 types of "NoSQL", but three of the four are actually similar enough that an SQL syntax could be written on top of it (including MongoDB and it's crazy query syntax [and I say that even though Javascript is one of my favorite languages]).

Key-Value Storage

These are simple NoSQL systems like Redis, that are basically a really fancy hash table. You have a value you want to get later, so you assign it a key and stuff it into the database, you can only query a single object at a time and only by a single key.

You definitely don't want this.

Document Storage

This is one step up above Key-Value Storage and is what most people talk about when they say NoSQL (such as MongoDB).

Basically, these are objects with a hierarchical structure (like XML files, JSON files, and any other sort of tree structure in computer science), but the values of different nodes on the tree can be indexed. They have a higher "speed" relative to traditional row-based SQL databases on lookup because they sacrifice performance on joining.

If you're looking up data in your MySQL database from a single table with tons of columns (assuming it's not a view/virtual table), and assuming you have it indexed properly for your query (that may be you real problem, here), Document Databases like MongoDB won't give you any Big-O benefit over MySQL, so you probably don't want to migrate over for just this reason.

Columnar Storage

These are the most like SQL databases. In fact, some (like Sybase) implement an SQL syntax while others (Cassandra) do not. They store the data in columns rather than rows, so adding and updating are expensive, but most queries are cheap because each column is essentially implicitly indexed.

But, if your query can't use an index, you're in no better shape with a Columnar Store than a regular SQL database.

Graph Storage

Graph Databases expand beyond SQL. Anything that can be represented by Graph theory, including Key-Value, Document Database, and SQL database can be represented by a Graph Database, like neo4j.

Graph Databases make joins as cheap as possible (as opposed to Document Databases) to do this, but they have to, because even a simple "row" query would require many joins to retrieve.

A table-scan type query would probably be slower than a standard SQL database because of all of the extra joins to retrieve the data (which is stored in a disjointed fashion).

So what's the solution?

You've probably noticed that I haven't answered your question, exactly. I'm not saying "you're finished," but the real problem is how the query is being performed.

Are you absolutely sure you can't better index your data? There are things such as Multiple Column Keys that could improve the performance of your particular query. Microsoft's SQL Server has a full text key type that would be applicable to the example you provided, and PostgreSQL can emulate it.
The real advantage most NoSQL databases have over SQL databases is Map-Reduce -- specifically, the integration of a full Turing-complete language that runs at high speed that query constraints can be written in. The querying function can be written to quickly "fail out" of non-matching queries or quickly return with a success on records that meet "priority" requirements, while doing the same in SQL is a bit more cumbersome.

Finally, however, the exact problem you're trying to solve: text search with optional filtering parameters, is more generally known as a search engine, and there are very specialized engines to handle this particular problem. I'd recommend Apache Solr to perform these queries.

Basically, dump the text field, the "filter" fields, and the primary key of the table into Solr, let it index the text field, run the queries through it, and if you need the full record after that, query your SQL database for the specific index you got from Solr. It uses some more memory and requires a second process, but will probably best suite your needs, here.

Why all of this text to get to this answer?

Because the title of your question doesn't really have anything to do with the content of your question, so I answered both. :)

What should be indexed to improve performance?

In general, the selection filters can use indexes on user_id or activity_type_id or both (in either order).

The ordering operation might be able to use a filter on created_at.

It is likely that for this query, a composite index on (user_id, activity_type_id) would give the best result, assuming that MySQL can actually make use of it. Failing that, it is likely to be better to index user_id than activity_type_id because it is likely to provide better selectivity. One reason for thinking that is that there would be 4 subsections of the index to scan if it uses an index on activity_type_id, compared with just one subsection to scan if it uses an index on user_id alone.

Trying to rely on an index for the sort order is likely to mean a full table scan, so it is less likely to be beneficial. I would not create an index on created_at to support this query; there might be other queries where it would be beneficial.

Mysql improve SELECT speed

take time to read my answer here: (has similar volumes to yours)

500 millions rows, 15 million row range scan in 0.02 seconds.

MySQL and NoSQL: Help me to choose the right one

then amend your table engine to innodb as follows:

create table tag_date_value
(
tag_id smallint unsigned not null, -- i prefer ints to chars
tag_date datetime not null, -- can we make this date vs datetime ?
value int unsigned not null default 0, -- or whatever datatype you require
primary key (tag_id, tag_date) -- clustered composite PK
)
engine=innodb;

you might consider the following as the primary key instead:

primary key (tag_id, tag_date, value) -- added value save some I/O

but only if value isnt some LARGE varchar type !

query as before:

select
 tag_date, 
 value
from
 tag_date_value
where
 tag_id = 1 and
 tag_date between 'x' and 'y'
order by
 tag_date;

hope this helps :)

EDIT

oh forgot to mention - dont use alter table to change engine type from mysiam to innodb but rather dump the data out into csv files and re-import into a newly created and empty innodb table.

note i'm ordering the data during the export process - clustered indexes are the KEY !

Export

select * into outfile 'tag_dat_value_001.dat' 
fields terminated by '|' optionally enclosed by '"'
lines terminated by '\r\n'
from
 tag_date_value
where
 tag_id between 1 and 50
order by
 tag_id, tag_date;

select * into outfile 'tag_dat_value_002.dat' 
fields terminated by '|' optionally enclosed by '"'
lines terminated by '\r\n'
from
 tag_date_value
where
 tag_id between 51 and 100
order by
 tag_id, tag_date;

-- etc...

Import

import back into the table in correct order !

start transaction;

load data infile 'tag_dat_value_001.dat' 
into table tag_date_value
fields terminated by '|' optionally enclosed by '"'
lines terminated by '\r\n'
(
tag_id,
tag_date,
value
);

commit;

-- etc...

MySQL and Nosql: Help Me to Choose the Right One