Resources for Database Sharding and Partitioning

I agree with the other answers that you should look at your schema and indexes before resorting to sharding. 10 million rows is well within the capabilities of any of the major database engines.

However if you want some resources for learning about the subject of sharding then try these:

  • Scalability Best Practices: Lessons from eBay
  • Randy Shoup on eBay's Architectural Principles - Video and Presentation
  • High Scalability Site
  • Mr. Moore gets to punt on sharding (when not to do it)

Database sharding vs partitioning

Partitioning is more a generic term for dividing data across tables or databases. Sharding is one specific type of partitioning, part of what is called horizontal partitioning.

Here you replicate the schema across (typically) multiple instances or servers, using some kind of logic or identifier to know which instance or server to look for the data. An identifier of this kind is often called a "Shard Key".

A common, key-less logic is to divide the data alphabetically: A-D goes to instance 1, E-G to instance 2, and so on. Customer data is well suited to this, but shard sizes will be skewed if the partitioning does not take into account that some letters are more common than others.
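The alphabet-range scheme above can be sketched in a few lines. This is a minimal illustration; the instance names and letter boundaries are made up for the example:

```python
# Hypothetical alphabet-range sharding: route a customer record to an
# instance based on the first letter of the customer's last name.
# The boundaries below are illustrative, not from any real deployment.
ALPHABET_RANGES = [
    ("A", "D", "instance-1"),
    ("E", "G", "instance-2"),
    ("H", "M", "instance-3"),
    ("N", "Z", "instance-4"),
]

def shard_for(last_name):
    """Return the instance that holds customers with this last name."""
    first = last_name[:1].upper()
    for low, high, instance in ALPHABET_RANGES:
        if low <= first <= high:
            return instance
    raise ValueError("No shard covers initial %r" % first)

print(shard_for("Garcia"))  # instance-2
print(shard_for("smith"))   # instance-4
```

Note how the skew problem shows up immediately: far more surnames start with "S" than with "Q", so instance 4 will grow faster than the others unless the ranges are chosen from real name-frequency data.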

Another common technique is to use a key-synchronization system or logic that ensures unique keys across the instances.

A well-known example you can study is how Instagram solved their partitioning in the early days (see link below). They started out on very few servers, using Postgres schemas to divide the data from the get-go: several thousand logical shards mapped onto those few physical servers. Read their excellent writeup from 2012 here: Instagram Engineering - Sharding & IDs
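A hedged sketch of the ID scheme described in that writeup: a 64-bit ID built from a millisecond timestamp, a logical shard ID, and a per-shard sequence number, so the shard can be recovered from the ID itself with no lookup table. The exact bit widths and the epoch constant below are my reading of the post and may differ from what Instagram actually ran:

```python
import time

# Assumed layout: 41 bits of milliseconds since a custom epoch,
# 13 bits of logical shard ID, 10 bits of per-shard sequence.
EPOCH_MS = 1314220021721  # example custom epoch (an assumption)

def make_id(shard_id, seq, now_ms=None):
    """Build a 64-bit, time-sortable ID that embeds the logical shard."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    millis = now_ms - EPOCH_MS
    return (millis << 23) | ((shard_id % 8192) << 10) | (seq % 1024)

def shard_of(record_id):
    """Recover the logical shard from an ID -- no lookup table needed."""
    return (record_id >> 10) & 0x1FFF
```

The useful property is that IDs sort roughly by creation time while still telling you where the row lives, which is why the scheme survives adding physical servers: only the mapping from logical shard to physical server changes.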

See here as well: http://www.quora.com/Whats-the-difference-between-sharding-and-partition

What is sharding and why is it important?

Sharding is just another name for "horizontal partitioning" of a database. You might want to search for that term to make things clearer.

From Wikipedia:

Horizontal partitioning is a design principle whereby rows of a database table are held separately, rather than splitting by columns (as for normalization). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location. The advantage is the number of rows in each table is reduced (this reduces index size, thus improves search performance). If the sharding is based on some real-world aspect of the data (e.g. European customers vs. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.

Some more information about sharding:

Firstly, each database server is identical, having the same table structure. Secondly, the data records are logically split up across the sharded databases. Unlike a partitioned database, each complete data record exists in exactly one shard (unless it is mirrored for backup/redundancy), with all CRUD operations performed in just that database. You may not like the terminology, but it does represent a different way of organizing a logical database into smaller parts.

Update: You won't break MVC. The work of determining the correct shard in which to store the data would be done transparently by your data access layer. There you determine the correct shard based on the criteria by which you sharded your database. (You have to shard the database manually into different shards based on some concrete aspects of your application.) You then have to take care to use the correct shard when loading and storing data.
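A minimal sketch of that idea, with hypothetical names: the repository owns the shard decision, so controllers and models above it never see which physical database a row lives on. The `FakeShard` class stands in for a real per-shard connection:

```python
class FakeShard:
    """In-memory stand-in for one shard's database connection."""
    def __init__(self):
        self.rows = {}

    def get(self, key):
        return self.rows.get(key)

    def put(self, key, value):
        self.rows[key] = value

class ShardedUserRepository:
    """Data access layer that picks the shard; callers just pass a user_id."""
    def __init__(self, shards):
        self.shards = shards  # one connection-like object per shard

    def _shard_for(self, user_id):
        # Modulo routing on the shard key -- the simplest possible scheme.
        return self.shards[user_id % len(self.shards)]

    def load(self, user_id):
        return self._shard_for(user_id).get(user_id)

    def save(self, user_id, row):
        self._shard_for(user_id).put(user_id, row)

repo = ShardedUserRepository([FakeShard() for _ in range(4)])
repo.save(42, {"name": "Ada"})
print(repo.load(42))  # {'name': 'Ada'}
```

The MVC layers only ever call `load` and `save`; if you later change the routing logic (or the number of shards, with a data migration), nothing above the repository has to change.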

Maybe this example with Java code (from the Hibernate Shards project) makes it somewhat clearer how this would work in a real-world scenario.

To address the "why" of sharding: it is mainly for very large-scale applications with lots of data. First, it helps minimize response times for database queries. Second, you can host your data on cheaper, "lower-end" machines instead of one big server, which might no longer suffice.

Sharding based on timestamp

Notice that your link called it an "anti-pattern". I have similar thoughts...

That seems like an odd way to shard. It implies that writes will be pounding on one server for a while (a day, or whatever the shard range is). This makes the "recent" data hard to SELECT from because of all the writes going on. Meanwhile, the "old" data is sitting idle??

Usually, sharding is based on "user_id" or "company_id". This spreads the load, both reads and writes, across the shards. After all, spreading the load is what sharding is for.
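To make the contrast with timestamp sharding concrete, here is a sketch of hashing a key such as user_id so traffic spreads across all shards at once, rather than every current write landing on the one "today" shard. The shard count and key format are illustrative:

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # illustrative

def shard_for(key):
    """Map a shard key (e.g. a user_id) to a shard number via hashing."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every individual user's reads and writes always hit the same shard,
# but different users are spread roughly evenly across all shards.
counts = Counter(shard_for("user-%d" % i) for i in range(10_000))
print(sorted(counts.items()))
```

With a timestamp key, that counter would show all of today's writes piled onto a single shard; with a hashed entity key, each shard takes a similar slice of both hot and cold data.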

But, sharding should not be done until you have so much activity that the traffic cannot be handled in a single machine. Sharding is complex because of having to redirect traffic to the appropriate machine, and because of the really messy code needed if a single action needs to look in multiple shards.
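The "really messy code" point deserves a concrete picture. This is a hypothetical scatter-gather: a lookup that lacks the shard key must be fanned out to every shard and the results merged in application code, work a single database would otherwise do for you:

```python
class FakeShard:
    """In-memory stand-in for one shard's connection."""
    def __init__(self, orders):
        self.orders = orders

    def query_orders(self, email):
        return [o for o in self.orders if o["email"] == email]

def find_orders_by_email(shards, email):
    # Orders are sharded by user_id, so an email lookup hits every shard...
    results = []
    for shard in shards:
        results.extend(shard.query_orders(email))
    # ...and the merge/sort the database would normally do happens here.
    return sorted(results, key=lambda o: o["created_at"])

shards = [
    FakeShard([{"email": "a@x.com", "created_at": 2}]),
    FakeShard([{"email": "a@x.com", "created_at": 1}]),
]
print(find_orders_by_email(shards, "a@x.com"))
```

And this is the easy case: pagination, joins, and transactions across shards get much worse, which is exactly why the advice is to exhaust single-server options first.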

If you do have a lot of traffic, I will be happy to advise further. But I would start by seeing if the traffic can be made efficient enough to live on a single server.

Another thing to note: there is essentially no intra-query parallelism in MySQL; a single query runs in a single thread.
