Nosql VS. SQL When Scalability Is Irrelevant

NoSQL vs. SQL when scalability is irrelevant

Yes, sometimes RDBMS are not the best solution, although there are ways to accomodate user defined fields (see XML Datatype, EAV design pattern, or just have spare generic columns) sometimes a schema free database is a good choice.

However, you need to nail down your requirements before choosing to go with a document database, as you will loose a lot of the power you may be used to with the relational model

eg...

If you would otherwise have multiple tables in your RDBMS database, you will need to research the features MongoDB affords you to accomodate these needs.

If you will need to query the data in specific ways, again you need to research what MongoDB offers you.

I wouldnt think of NoSQL as replacement for RDBMS, rather a slightly different tool that brings its own sets of advantages and disadvantages making it more suitable for some projects than others.

(Both databases may be used in some circumstances. Also if you decide to go down the route of possibly using MongoDB, once you have researched the websites out there and have more specific questions, you can visit Freenode IRC #mongodb channel)

Would relational databases scale as well (or better) than their NoSQL counterparts if we drop the relationships?

I don't find that inter-table relationships are the main limiter for scalability. I use queries with joins regularly and get good scalability if indexes are defined well.

The greater limiter for scalability is the cost of synchronous I/O. The requirements of consistency and durability -- that the DBMS actually and reliably saves data when it tells you it saved data -- is expensive.

Several NoSQL products that are currently in vogue achieve great performance by weakening their consistency and durability guarantees in their default configuration. There are many reports of CouchDB or MongoDB losing data.

There are ways you can configure those NoSQL products to be more strict about durability, but then you sacrifice their impressive performance numbers.

Likewise, you can make an SQL database achieve high performance like the NoSQL products, by disabling the default features that ensure data safety. See RunningWithScissorsDB.

PS: If you think document-oriented databases are "cutting edge", I invite you to read about MUMPS. Everything old is new again. :-)

What scalability problems have you encountered using a NoSQL data store?

I've switched a small subproject from MySQL to CouchDB, to be able to handle the load. The result was amazing.

About 2 years ago, we've released a self written software on http://www.ubuntuusers.de/ (which is probably the biggest German Linux community website). The site is written in Python and we've added a WSGI middleware which was able to catch all exceptions and send them to another small MySQL powered website. This small website used a hash to determine different bugs and stored the number of occurrences and the last occurrence as well.

Unfortunately, shortly after the release, the traceback-logger website wasn't responding anymore. We had some locking issues with the production db of our main site which was throwing exceptions nearly every request, as well as several other bugs, which we haven't explored during the testing stage. The server cluster of our main site, called the traceback-logger submit page several k times per second. And that was a way too much for the small server which hosted the traceback logger (it was already an old server, which was only used for development purposes).

At this time CouchDB was rather popular, and so I decided to try it out and write a small traceback-logger with it. The new logger only consisted of a single python file, which provided a bug list with sorting and filter options and a submit page. And in the background I've started a CouchDB process. The new software responded extremely quickly to all requests and we were able to view the massive amount of automatic bug reports.

One interesting thing is, that the solution before, was running on an old dedicated server, where the new CouchDB based site on the other hand was only running on a shared xen instance with very limited resources. And I haven't even used the strength of key-values stores to scale horizontally. The ability of CouchDB / Erlang OTP to handle concurrent requests without locking anything was already enough to serve the needs.

Now, the quickly written CouchDB-traceback logger is still running and is a helpful way to explore bugs on the main website. Anyway, about once a month the database becomes too big and the CouchDB process gets killed. But then, the compact-db command of CouchDB reduces the size from several GBs to some KBs again and the database is up and running again (maybe i should consider adding a cronjob there... 0o).

In a summary, CouchDB was surely the best choice (or at least a better choice than MySQL) for this subproject and it does its job well.

NoSQL (Cassandra/Mongo) vs RDBMS

read that a lot of big companies use shard rmdb

In my last job, my org supported manually-sharded Postgres solution. It was a source of tremendous pain for us, as it was difficult to manage and maintain due to its size. Remember that RDBMSs weren't really designed to work that way.

only advantage of nosql ... is because it's an off-the shelf distributed solution

The decision is really all about tradeoffs. When your data workloads cannot be handled by a single DB instance or you require uptime without a single point of failure, NoSQL can help you. Databases that sacrifice consistency for partition tolerance and availability ("AP" databases) are often found to be able to process large workloads with little latency because of their ability to spread the data (and thus, the queries) to multiple server instances.

Also, if your data needs to be geographically or data center aware, you'll want a database which supports that. Trying to make database products work in ways that their original design did not account for, is a recipe for pain.

cheap to maintain

Ask anyone who runs an enterprise Database organization, and they will tell you that NoSQL is not cheap (or easy) to maintain. Sure, you might be getting an open source product that you don't have to "buy," but you're going to need (often highly-paid) database engineers to maintain it.

scalability comes along with nosql by design, but it also seems that sql can also achieve the same and the primary issue is only the maintenance/configuration when scaling it up.

It depends. How big are you planning to scale to, and how many data replicas would need to be supported? Large enterprises like Apple have thousands of servers running Apache Cassandra. They do that because iCloud needs to scale to support the needs of 900 million iPhone users. They can easily add (scale up) or remove (scale down) based on their needs for compute resources.

Achieving that level of scalability with a relational database requires a LOT more work than it does with Cassandra (NoSQL). And when you find out that you need to scale up even more, you're basically looking at a data reload scenario (because shard sizes essentially change) to get the data to the new instances. A database team will reach the point (very quickly) where the amount of work it takes to scale out a RDBMS is impractical.

The other aspect, is that how does the client application know which of the relational database servers to send the query? For most relational databases, you'll end up having to build out or augment that logic layer somehow. And when the number of database instances changes, you'll need the application to know about that, too. NoSQL databases account for node discovery, and (most) abstract that so the client application doesn't need to worry about it.

Also remember that not all NoSQL databases are created equally. On some products, only certain nodes will accept writes. On some products, any node can handle a read or a write. Relational databases don't have any concept of that, so you would have to account for that as well.

tl;dr;

It's much more complicated than only maintenance due to scaling. If it wasn't, every major relational database would have a simple way to handle that, and NoSQL DBs would be irrelevant. But that hasn't happened.

NoSQL or Relational or Both

edit
I see that others propose starting with SQL. I'd like to change my proposition and say - experiment with small scale project like "small twitter clone" or "store with video tapes". Keep database on many nodes and write scripts wich will flood you with data. Do it with Riak/Cassandra and then with some SQL solution. You'll find yourself what is easier and quicker.
/edit

I would go with NoSQL (this is what I'm doing now. Previously I used MySQL in large scale projects). Why? It is much simpler to use so you can pay more attention to other important things (NoSQL takes care of most data storage problems):

You don't have to define schema which also means you don't have to upgrade it. In MySQL I had long downtimes due to system upgrade. Adding single column/index took a lot of time. Tables had only few millions of rows.
You get running, distributed environment in few minutes. In MySQL you have to manually split data between few machines (unless you keep everything on one which is not a good idea).
You get much better performance. With MySQL performance is really bad. It just does not work without memcached. Memcached is a distributed key-value store (simple NoSQL database). Obviously using memcached costs you additional time spent on optimizing queries
You don't have to think about normalization / denormalization
Queries are simple (at least in key-value stores). You just don't care about something like: should I use "where UserId = 12345" or "where UserId = '12345'" (in MySQL one of them will not use indexes!).
If one machine with NoSQL fails you don't care about that in your application. The query will be executed on another replica (you don't have to implement this!)

There are also downsides to using NoSQL

You don't get ACID. In most cases you just don't need that!
Also there is more developers familiar with SQL solutions. On the other hand NoSQL solutions are much simpler (at least in my experience) so you don't need certified database administrator (a magician who solves your db problems and only he knows why it works)
You can't do certain queries - for example joins are not there, but if you don't normalize the data then joins are useless (and you save time as you don't have to think about normalization).

Great article:
http://labs.mudynamics.com/2010/04/01/why-nosql-is-bad-for-startups/

My advice would be to start with NoSQL and stick with it. You should look at dynamo based databases like Riak and Cassandra. Also try CouchDB (CoachBase). This is for most of the data. For friends relationships graph database is good option.

With the recent prevelance of NoSQL databases why would I use a SQL database?

My key question was where would a SQL database really outshine a document database and from all the responses there really doesn't seem to be much.

Given that NoSQL databases come in just as many variations of types of databases as relational that both match all or some parts of ACID depending on which database you use that at this point they are basically the equitable for solving problems.

After this the key differences would be tooling and maturity which SQL databases have a much larger grasp in for being the established player but this is how it is for all new technology.

NoSQL or SQL database for thousands of rows / items in array every day?

Use SQL database like postgres. Postgres is faster for processing because of sql support and indices

730k is nothing btw

Nosql VS. SQL When Scalability Is Irrelevant