Need Advice: Is This a Good Use Case for a 'Nosql' Database? If So, Which One

Need Advice: Is this a good use case for a 'NoSQL' Database? If so, which one?

The simple answer is that there is no simple answer to these sort of problems, the only way to find out what works for your scenario is to invest R&D time into it.

The question is hard to answer because the performance requirements aren't spelled out by the OP. It appears to be 75M/year records over a number of customers with a write rate of num_customers*1minute (which is low), but I don't have figures for the required read / query performance.

Effectively you have already a sharded database using horizontal partitioning because you're storing each customer in a seperate table. This is good and will increase performance. However you haven't yet established that you have a performance problem, so this needs to be measured and the problem size assessed before you can fix it.

A NoSQL database is indeed a good way of fixing performance problems with traditional RDBMS, but it will not provide automatic scalabity and is not a general solution. You need to find your performance problem fix and then design the (nosqL) data model to provide the solution.

Depending on what you're trying to achieve I'd look at MongoDB, Apache Cassandra, Apache HBase or Hibari.

Remember that NoSQL is a vague term typically encompassing

Applications that are either performance intensive in read or write. Often sacrificing read or write performance at the expense of the other.
Distribution and scalability
Different methods of persistency (RAM/Disk)
A more structured/defined access pattern making ad-hoc queries harder.

So, in the first instance I'd see if a traditional RDBMS can achieve the required performance, using all available techniques, get a copy of High Performance MySQL and read MySQL Performance Blog.

Rev1:

In light of your comments I think it is fair to say that you could achieve what you want with one of the above NOSQL engines.

My primary recommendation would be to get your data model designed and implemented, what you're using at the moment isn't really right.

So look at Entity-attribute-value model as I think it is exactly right for what you need.

You need to get your data model right before you can consider which technology to use, being honest modifying schemas dynamically isn't a datamodel.

I'd use a traditional SQL database to validate and test the new datamodel as the management tools are better and it's generally easier to work with the schemas as you refine the datamodel.

NoSQL Use Case Scenarios or WHEN to use NoSQL

It really is an "it depends" kinda question. Some general points:

NoSQL is typically good for unstructured/"schemaless" data - usually, you don't need to explicitly define your schema up front and can just include new fields without any ceremony
NoSQL typically favours a denormalised schema due to no support for JOINs per the RDBMS world. So you would usually have a flattened, denormalized representation of your data.
Using NoSQL doesn't mean you could lose data. Different DBs have different strategies. e.g. MongoDB - you can essentially choose what level to trade off performance vs potential for data loss - best performance = greater scope for data loss.
It's often very easy to scale out NoSQL solutions. Adding more nodes to replicate data to is one way to a) offer more scalability and b) offer more protection against data loss if one node goes down. But again, depends on the NoSQL DB/configuration. NoSQL does not necessarily mean "data loss" like you infer.
IMHO, complex/dynamic queries/reporting are best served from an RDBMS. Often the query functionality for a NoSQL DB is limited.
It doesn't have to be a 1 or the other choice. My experience has been using RDBMS in conjunction with NoSQL for certain use cases.
NoSQL DBs often lack the ability to perform atomic operations across multiple "tables".

You really need to look at and understand what the various types of NoSQL stores are, and how they go about providing scalability/data security etc. It's difficult to give an across-the-board answer as they really are all different and tackle things differently.

For MongoDb as an example, check out their Use Cases to see what they suggest as being "well suited" and "less well suited" uses of MongoDb.

Use cases for NoSQL

Some great use-cases - for MongoDB anyway - are mentioned on the MongoDB site. The examples given are real-time analytics, Logging and Full Text search. These articles are all well worth a read http://www.mongodb.com/use-cases

There's also a great write-up on which NoSQL database is best suited to which type of project: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis

When should I use a NoSQL database instead of a relational database? Is it okay to use both on the same site?

Relational databases enforces ACID. So, you will have schema based transaction oriented data stores. It's proven and suitable for 99% of the real world applications. You can practically do anything with relational databases.

But, there are limitations on speed and scaling when it comes to massive high availability data stores. For example, Google and Amazon have terabytes of data stored in big data centers. Querying and inserting is not performant in these scenarios because of the blocking/schema/transaction nature of the RDBMs. That's the reason they have implemented their own databases (actually, key-value stores) for massive performance gain and scalability.

NoSQL databases have been around for a long time - just the term is new. Some examples are graph, object, column, XML and document databases.

For your 2nd question: Is it okay to use both on the same site?

Why not? Both serves different purposes right?

Trouble picking a DB for my use case. Any help/insight would be appreaciated?

If this is one of your first full-stack apps then maybe focus on learning first and world domination scale-up/out later :) (I totally agree with what @Hans said).

NoSQL and relational database approaches are different and suit different use cases, so if you want the best fit I suggest researching those differences first, and then apply what you learn to your solution and what you think is best.

I also suggest you put some kind of dependency inversion (DI) in between the database and your logic, so that you can swap out different databases without affecting the rest of your application too much. One of the biggest learnings you'll get with a project like this is not so much the initial design and build - but how to deal with the changes you'll want to make later. Making non-trivial system changes is a great way to test your architecture :) Having DI in place will help you massively here.

Personally I find SQLite a great option for hobby apps - it's free, there's good documentation and a strong user community, and you will be able to implement relational and NoSQL databases using it.

I need advise choosing a NoSQL database for a project with a lot of minute related information

Obviously a graph database fits 100% your problem. My advice here is to go for some geo spatial module over neo4j or orientdb, althought you have some others free and open source implementation.

I think the best one right now, with all the geo spatial thing implemented is neo4j-spatial package. But as far as I know, you can also reproduce most of the geo spatial thing on your own if necessary.

BTW talking about splitting, if the amount of data/queries will be high, I strongly recommend you to share the load and think the model in this terms. Sure you can do something.

Practical example for each type of database (real cases)

I found two impressive articles about this subject. All credits to highscalability.com. The information in this answer is transcribed from these articles:

35+ Use Cases For Choosing Your Next NoSQL Database

What The Heck Are You Actually Using NoSQL For?

If Your Application Needs...

• complex transactions because you can't afford to lose data or if you would like a simple transaction programming model then look at a Relational or Grid database.

• Example: an inventory system that might want full ACID. I was very unhappy when I bought a product and they said later they were out of stock. I did not want a compensated transaction. I wanted my item!

• to scale then NoSQL or SQL can work. Look for systems that support scale-out, partitioning, live addition and removal of machines, load balancing, automatic sharding and rebalancing, and fault tolerance.

• to always be able to write to a database because you need high availability then look at Bigtable Clones which feature eventual consistency.

• to handle lots of small continuous reads and writes, that may be volatile, then look at Document or Key-value or databases offering fast in-memory access. Also, consider SSD.

• to implement social network operations then you first may want a Graph database or second, a database like Riak that supports relationships. An in-memory relational database with simple SQL joins might suffice for small data sets. Redis' set and list operations could work too.

• to operate over a wide variety of access patterns and data types then look at a Document database, they generally are flexible and perform well.

• powerful offline reporting with large datasets then look at Hadoop first and second, products that support MapReduce. Supporting MapReduce isn't the same as being good at it.

• to span multiple data-centers then look at Bigtable Clones and other products that offer a distributed option that can handle the long latencies and are partition tolerant.

• to build CRUD apps then look at a Document database, they make it easy to access complex data without joins.

• built-in search then look at Riak.

• to operate on data structures like lists, sets, queues, publish-subscribe then look at Redis. Useful for distributed locking, capped logs, and a lot more.

• programmer friendliness in the form of programmer-friendly data types like JSON, HTTP, REST, Javascript then first look at Document databases and then Key-value Databases.

• transactions combined with materialized views for real-time data feeds then look at VoltDB. Great for data-rollups and time windowing.

• enterprise-level support and SLAs then look for a product that makes a point of catering to that market. Membase is an example.

• to log continuous streams of data that may have no consistency guarantees necessary at all then look at Bigtable Clones because they generally work on distributed file systems that can handle a lot of writes.

• to be as simple as possible to operate then look for a hosted or PaaS solution because they will do all the work for you.

• to be sold to enterprise customers then consider a Relational Database because they are used to relational technology.

• to dynamically build relationships between objects that have dynamic properties then consider a Graph Database because often they will not require a schema and models can be built incrementally through programming.

• to support large media then look storage services like S3. NoSQL systems tend not to handle large BLOBS, though MongoDB has a file service.

• to bulk upload lots of data quickly and efficiently then look for a product that supports that scenario. Most will not because they don't support bulk operations.

• an easier upgrade path then use a fluid schema system like a Document Database or a Key-value Database because it supports optional fields, adding fields, and field deletions without the need to build an entire schema migration framework.

• to implement integrity constraints then pick a database that supports SQL DDL, implement them in stored procedures, or implement them in application code.

• a very deep join depth then use a Graph Database because they support blisteringly fast navigation between entities.

• to move behavior close to the data so the data doesn't have to be moved over the network then look at stored procedures of one kind or another. These can be found in Relational, Grid, Document, and even Key-value databases.

• to cache or store BLOB data then look at a Key-value store. Caching can for bits of web pages, or to save complex objects that were expensive to join in a relational database, to reduce latency, and so on.

• a proven track record like not corrupting data and just generally working then pick an established product and when you hit scaling (or other issues) use one of the common workarounds (scale-up, tuning, memcached, sharding, denormalization, etc).

• fluid data types because your data isn't tabular in nature, or requires a flexible number of columns, or has a complex structure, or varies by user (or whatever), then look at Document, Key-value, and Bigtable Clone databases. Each has a lot of flexibility in their data types.

• other business units to run quick relational queries so you don't have to reimplement everything then use a database that supports SQL.

• to operate in the cloud and automatically take full advantage of cloud features then we may not be there yet.

• support for secondary indexes so you can look up data by different keys then look at relational databases and Cassandra's new secondary index support.

• create an ever-growing set of data (really BigData) that rarely gets accessed then look at Bigtable Clone which will spread the data over a distributed file system.

• to integrate with other services then check if the database provides some sort of write-behind syncing feature so you can capture database changes and feed them into other systems to ensure consistency.

• fault tolerance check how durable writes are in the face power failures, partitions, and other failure scenarios.

• to push the technological envelope in a direction nobody seems to be going then build it yourself because that's what it takes to be great sometimes.

• to work on a mobile platform then look at CouchDB/Mobile couchbase.

General Use Cases (NoSQL)

• Bigness. NoSQL is seen as a key part of a new data stack supporting: big data, big numbers of users, big numbers of computers, big supply chains, big science, and so on. When something becomes so massive that it must become massively distributed, NoSQL is there, though not all NoSQL systems are targeting big. Bigness can be across many different dimensions, not just using a lot of disk space.

• Massive write performance. This is probably the canonical usage based on Google's influence. High volume. Facebook needs to store 135 billion messages a month ^{(in 2010)}. Twitter, for example, has the problem of storing 7 TB/data per day ^{(in 2010)} with the prospect of this requirement doubling multiple times per year. This is the data is too big to fit on one node problem. At 80 MB/s it takes a day to store 7TB so writes need to be distributed over a cluster, which implies key-value access, MapReduce, replication, fault tolerance, consistency issues, and all the rest. For faster writes in-memory systems can be used.

• Fast key-value access. This is probably the second most cited virtue of NoSQL in the general mind set. When latency is important it's hard to beat hashing on a key and reading the value directly from memory or in as little as one disk seek. Not every NoSQL product is about fast access, some are more about reliability, for example. but what people have wanted for a long time was a better memcached and many NoSQL systems offer that.

• Flexible schema and flexible datatypes. NoSQL products support a whole range of new data types, and this is a major area of innovation in NoSQL. We have: column-oriented, graph, advanced data structures, document-oriented, and key-value. Complex objects can be easily stored without a lot of mapping. Developers love avoiding complex schemas and ORM frameworks. Lack of structure allows for much more flexibility. We also have program- and programmer-friendly compatible datatypes like JSON.

• Schema migration. Schemalessness makes it easier to deal with schema migrations without so much worrying. Schemas are in a sense dynamic because they are imposed by the application at run-time, so different parts of an application can have a different view of the schema.

• Write availability. Do your writes need to succeed no matter what? Then we can get into partitioning, CAP, eventual consistency and all that jazz.

• Easier maintainability, administration and operations. This is very product specific, but many NoSQL vendors are trying to gain adoption by making it easy for developers to adopt them. They are spending a lot of effort on ease of use, minimal administration, and automated operations. This can lead to lower operations costs as special code doesn't have to be written to scale a system that was never intended to be used that way.

• No single point of failure. Not every product is delivering on this, but we are seeing a definite convergence on relatively easy to configure and manage high availability with automatic load balancing and cluster sizing. A perfect cloud partner.

• Generally available parallel computing. We are seeing MapReduce baked into products, which makes parallel computing something that will be a normal part of development in the future.

• Programmer ease of use. Accessing your data should be easy. While the relational model is intuitive for end users, like accountants, it's not very intuitive for developers. Programmers grok keys, values, JSON, Javascript stored procedures, HTTP, and so on. NoSQL is for programmers. This is a developer-led coup. The response to a database problem can't always be to hire a really knowledgeable DBA, get your schema right, denormalize a little, etc., programmers would prefer a system that they can make work for themselves. It shouldn't be so hard to make a product perform. Money is part of the issue. If it costs a lot to scale a product then won't you go with the cheaper product, that you control, that's easier to use, and that's easier to scale?

• Use the right data model for the right problem. Different data models are used to solve different problems. Much effort has been put into, for example, wedging graph operations into a relational model, but it doesn't work. Isn't it better to solve a graph problem in a graph database? We are now seeing a general strategy of trying to find the best fit between a problem and solution.

• Avoid hitting the wall. Many projects hit some type of wall in their project. They've exhausted all options to make their system scale or perform properly and are wondering what next? It's comforting to select a product and an approach that can jump over the wall by linearly scaling using incrementally added resources. At one time this wasn't possible. It took custom built everything, but that's changed. We are now seeing usable out-of-the-box products that a project can readily adopt.

• Distributed systems support. Not everyone is worried about scale or performance over and above that which can be achieved by non-NoSQL systems. What they need is a distributed system that can span datacenters while handling failure scenarios without a hiccup. NoSQL systems, because they have focussed on scale, tend to exploit partitions, tend not use heavy strict consistency protocols, and so are well positioned to operate in distributed scenarios.

• Tunable CAP tradeoffs. NoSQL systems are generally the only products with a "slider" for choosing where they want to land on the CAP spectrum. Relational databases pick strong consistency which means they can't tolerate a partition failure. In the end, this is a business decision and should be decided on a case by case basis. Does your app even care about consistency? Are a few drops OK? Does your app need strong or weak consistency? Is availability more important or is consistency? Will being down be more costly than being wrong? It's nice to have products that give you a choice.

• More Specific Use Cases

• Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, clickstreams, etc.

• Syncing online and offline data. This is a niche CouchDB has targeted.

• Fast response times under all loads.

• Avoiding heavy joins for when the query load for complex joins become too large for an RDBMS.

• Soft real-time systems where low latency is critical. Games are one example.

• Applications where a wide variety of different write, read, query, and consistency patterns need to be supported. There are systems optimized for 50% reads 50% writes, 95% writes, or 95% reads. Read-only applications needing extreme speed and resiliency, simple queries, and can tolerate slightly stale data. Applications requiring moderate performance, read/write access, simple queries, completely authoritative data. A read-only application which complex query requirements.

• Load balance to accommodate data and usage concentrations and to help keep microprocessors busy.

• Real-time inserts, updates, and queries.

• Hierarchical data like threaded discussions and parts explosion.

• Dynamic table creation.

• Two-tier applications where low latency data is made available through a fast NoSQL interface, but the data itself can be calculated and updated by high latency Hadoop apps or other low priority apps.

• Sequential data reading. The right underlying data storage model needs to be selected. A B-tree may not be the best model for sequential reads.

• Slicing off part of service that may need better performance/scalability onto its own system. For example, user logins may need to be high performance and this feature could use a dedicated service to meet those goals.

• Caching. A high performance caching tier for websites and other applications. Example is a cache for the Data Aggregation System used by the Large Hadron Collider.
Voting.

• Real-time page view counters.

• User registration, profile, and session data.

• Document, catalog management and content management systems. These are facilitated by the ability to store complex documents has a whole rather than organized as relational tables. Similar logic applies to inventory, shopping carts, and other structured data types.

• Archiving. Storing a large continual stream of data that is still accessible on-line. Document-oriented databases with a flexible schema that can handle schema changes over time.

• Analytics. Use MapReduce, Hive, or Pig to perform analytical queries and scale-out systems that support high write loads.

• Working with heterogeneous types of data, for example, different media types at a generic level.

• Embedded systems. They don’t want the overhead of SQL and servers, so they use something simpler for storage.

• A "market" game, where you own buildings in a town. You want the building list of someone to pop up quickly, so you partition on the owner column of the building table, so that the select is single-partitioned. But when someone buys the building of someone else you update the owner column along with price.

• JPL is using SimpleDB to store rover plan attributes. References are kept to a full plan blob in S3. ^(source)

• Federal law enforcement agencies tracking Americans in real-time using credit cards, loyalty cards and travel reservations.

• Fraud detection by comparing transactions to known patterns in real-time.

• Helping diagnose the typology of tumors by integrating the history of every patient.

• In-memory database for high update situations, like a website that displays everyone's "last active" time (for chat maybe). If users are performing some activity once every 30 sec, then you will be pretty much be at your limit with about 5000 simultaneous users.

• Handling lower-frequency multi-partition queries using materialized views while continuing to process high-frequency streaming data.

• Priority queues.

• Running calculations on cached data, using a program friendly interface, without having to go through an ORM.

• Uniq a large dataset using simple key-value columns.

• To keep querying fast, values can be rolled-up into different time slices.

• Computing the intersection of two massive sets, where a join would be too slow.

• A timeline ala Twitter.

Redis use cases, VoltDB use cases and more find here.

Why choose an relational DB over a NoSQL one?

Here are the reasons, you can consider Relational databases over NoSQL

Transactions (for eg: database for bank accounts, most of the operations result in updating multiple tables as unit)
Data integrity (check constraints, referential integrity constraints etc)
Zero data loss recovery
Security and compliance

You should focus on both. If your project requires transactions, security, zero data loss etc then better to go with traditional RDBMS. For projects like recommendation engines, endorsement engines where scalability is the major challenge you can go with NoSQL databases.

In short, RDBMS databases are useful for running applications that are critical to the business, NoSQL databases are useful for running applications that complement your business.

Need Advice: Is This a Good Use Case for a 'Nosql' Database? If So, Which One