The Next-Gen Databases

I would say next-gen database, not next-gen SQL.

SQL is a language for querying and manipulating relational databases. SQL is dictated by an international standard. While the standard is revised, it seems to always work within the relational database paradigm.

Here are a few new data storage technologies that are getting attention currently:

  • CouchDB is a non-relational database. They call it a document-oriented database.
  • Amazon SimpleDB is also a non-relational database accessed in a distributed manner through a web service. Amazon also has a distributed key-value store called Dynamo, which powers some of its S3 services.
  • Dynomite and Kai are open source solutions inspired by Amazon Dynamo.
  • BigTable is a proprietary data storage solution used by Google, and implemented using their Google File System technology. Google's MapReduce framework uses BigTable.
  • Hadoop is an open-source technology inspired by Google's MapReduce, and serving a similar need, to distribute the work of very large scale data stores.
  • Scalaris is a distributed transactional key/value store. Also not relational, and does not use SQL. It's a research project from the Zuse Institute in Berlin, Germany.
  • RDF is a standard for storing semantic data, in which data and metadata are interchangeable. It has its own query language SPARQL, which resembles SQL superficially, but is actually totally different.
  • Vertica is a highly scalable column-oriented analytic database designed for distributed (grid) architecture. It does claim to be relational and SQL-compliant. It can be used through Amazon's Elastic Compute Cloud.
  • Greenplum is a high-scale data warehousing DBMS, which implements both MapReduce and SQL.
  • XML isn't a DBMS at all, it's an interchange format. But some DBMS products work with data in XML format.
  • ODBMS, or Object Databases, are for managing complex data. There don't seem to be any dominant ODBMS products in the mainstream, perhaps because of lack of standardization. Standard SQL is gradually gaining some OO features (e.g. extensible data types and tables).
  • Drizzle is a relational database, drawing a lot of its code from MySQL. It includes various architectural changes designed to manage data in a scalable "cloud computing" system architecture. Presumably it will continue to use standard SQL with some MySQL enhancements.
  • Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store, developed at Facebook by one of the authors of Amazon Dynamo, and contributed to the Apache project.
  • Project Voldemort is a non-relational, distributed, key-value storage system. It is used at LinkedIn.com.
  • Berkeley DB deserves some mention too. It's not "next-gen" because it dates back to the early 1990s. It's a popular key-value store that is easy to embed in a variety of applications. The technology is currently owned by Oracle Corp.

Also see this nice article by Richard Jones: "Anti-RDBMS: A list of distributed key-value stores." He goes into more detail describing some of these technologies.

Relational databases have weaknesses, to be sure. People have been arguing that they don't handle all data modeling requirements since the day they were first introduced.

Year after year, researchers come up with new ways of managing data to satisfy special requirements: either requirements to handle data relationships that don't fit into the relational model, or else requirements of high-scale volume or speed that demand data processing be done on distributed collections of servers, instead of central database servers.

Even though these advanced technologies do great things to solve the specialized problem they were designed for, relational databases are still a good general-purpose solution for most business needs. SQL isn't going away.


I've written an article in php|Architect magazine about the innovation of non-relational databases, and data modeling in relational vs. non-relational databases. http://www.phparch.com/magazine/2010-2/september/

What is the production ready NonSQL database?

I think you need to start out from your project requirements to see what kind of database you really need. There are many non-relational DBMSs out there, and they differ a lot in what kinds of problems they are good at solving. I think the article Should you go Beyond Relational Databases? by Martin Kleppmann is a good starting point for finding out what you need. There are also a lot of Stack Overflow threads on similar topics; these are my favorites:

  • The Next-gen Databases
  • Non-Relational Database Design
  • When shouldn’t you use a relational database?
  • Good reasons NOT to use a relational database?

When you have narrowed down what you actually need, you can take a deeper look into the alternatives to see which DBMSs are production ready for your use case. Production readiness isn't a yes/no thing: people may successfully deploy a solution that, for example, lacks tool support, while in another project this could be a no-go.

As for version numbers, different projects have a different take on this, so you can't just compare the version numbers. I'm involved in the graph database project Neo4j, and even though it has been in production use for 5+ years by now, we still haven't released a version 1.0 final yet.

What does 'relational' in 'relational database' mean for us?

If you want to learn about what relational means, I recommend the book "SQL and Relational Theory" by C. J. Date.

Relational in this context doesn't refer to relationships. It refers to relations, which are basically what tables are called in the mathematical theories that led to the relational model.
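To make the "relation" idea a bit more concrete, here is a rough sketch in Python that treats a relation as a set of tuples and applies a few classic relational operations to it. The Employee and Department relations and the helper function names are made up for illustration; this is only a toy, not how any real DBMS is implemented.

```python
from typing import NamedTuple

# Hypothetical relations: each one is just a SET of tuples over named attributes.
class Employee(NamedTuple):
    name: str
    dept: str

class Department(NamedTuple):
    dept: str
    city: str

employees = {
    Employee("Ann", "R&D"),
    Employee("Bob", "Sales"),
    Employee("Ann", "R&D"),  # duplicate tuple: a set keeps only one copy
}
departments = {
    Department("R&D", "Berlin"),
    Department("Sales", "Oslo"),
}

def select(relation, predicate):
    """Restriction: keep only the tuples that satisfy the predicate."""
    return {t for t in relation if predicate(t)}

def project(relation, *attrs):
    """Projection: keep only the named attributes (result is again a set)."""
    return {tuple(getattr(t, a) for a in attrs) for t in relation}

def join_on_dept(left, right):
    """Join two relations on their shared 'dept' attribute."""
    return {(l.name, l.dept, r.city)
            for l in left for r in right if l.dept == r.dept}

if __name__ == "__main__":
    print(select(employees, lambda e: e.dept == "Sales"))
    print(project(employees, "dept"))
    print(join_on_dept(employees, departments))
```

Note that every operation takes relations and returns a relation, which is the property that lets SQL queries be composed freely.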

The reason that relational databases have become ubiquitous is that they are the most general-purpose solution for organizing data with minimum redundancy.

There are valid reasons to use non-relational solutions. They often solve specific data-management tasks extremely well, but are weak in other areas. SQL and relational databases, by contrast, strike a compromise, solving a larger set of problems adequately, with fewer areas of weakness.

Other technologies currently available that are not based on the relational model are listed in "The Next-Gen Databases."

Non-Relational Database Design

I think you have to consider that the non-relational DBMS differ a lot regarding their data model and therefore the conceptual data design will also differ a lot. In the thread Data Design in Non-Relational Databases of the NOSQL Google group the different paradigms are categorized like this:

  1. Bigtable-like systems (HBase, Hypertable, etc)
  2. Key-value stores (Tokyo, Voldemort, etc)
  3. Document databases (CouchDB, MongoDB, etc)
  4. Graph databases (AllegroGraph, Neo4j, Sesame, etc)
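To give a feel for how much the data design differs between these four paradigms, here is a rough sketch that holds the same small fact (an actor playing a role in a movie) in four different shapes, using plain Python structures. The key formats, field names, and the ACTED_IN relationship name are my own assumptions for illustration and do not reflect the API of any of the products listed.

```python
import json

# 1. Bigtable-like: row key -> "family:qualifier" -> value (sparse columns).
wide_column = {
    "movie:matrix": {
        "info:title": "The Matrix",
        "info:year": "1999",
        "cast:keanu_reeves": "Neo",
    },
}

# 2. Key-value: the store sees only an opaque blob; the application decodes it.
key_value = {
    "movie:matrix": json.dumps(
        {"title": "The Matrix", "year": 1999, "cast": {"Keanu Reeves": "Neo"}}
    ),
}

# 3. Document: a nested, self-describing structure the store can look inside.
document = {
    "_id": "matrix",
    "title": "The Matrix",
    "year": 1999,
    "cast": [{"actor": "Keanu Reeves", "role": "Neo"}],
}

# 4. Graph: nodes plus typed relationships that carry their own properties.
nodes = {
    1: {"kind": "Movie", "title": "The Matrix", "year": 1999},
    2: {"kind": "Actor", "name": "Keanu Reeves"},
}
edges = [(2, "ACTED_IN", 1, {"role": "Neo"})]

def role_wide_column():
    return wide_column["movie:matrix"]["cast:keanu_reeves"]

def role_key_value():
    return json.loads(key_value["movie:matrix"])["cast"]["Keanu Reeves"]

def role_document():
    return next(c["role"] for c in document["cast"]
                if c["actor"] == "Keanu Reeves")

def role_graph():
    return next(props["role"] for src, rel, dst, props in edges
                if rel == "ACTED_IN" and nodes[src].get("name") == "Keanu Reeves")

if __name__ == "__main__":
    print(role_wide_column(), role_key_value(), role_document(), role_graph())
```

The same question ("which role did this actor play?") is answered by a column lookup, a decode-then-lookup, a search inside a nested document, and a relationship traversal, respectively.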

I'm mostly into graph databases, and the elegance of data design using this paradigm was what brought me there, tired of the shortcomings of RDBMS. I have put a few examples of data design using a graph database on this wiki page and there's an example of how to model the basic IMDB movie/actor/role data too.

The presentation slides (slideshare) Graph Databases and the Future of Large-Scale Knowledge Management by Marko Rodriguez contain a very nice introduction to data design using a graph database as well.

Answering the specific questions from a graphdb point of view:

Alternate design: adding relationships between many different kinds of entities without any worries or a need to predefine which entities can get connected.

Bridging the gap: I tend to do this differently for every case, based on the domain itself, as I don't want a "table-oriented graph" and the like. However, here's some information on automatic translation from RDBMS to graphdb.

Explicit data models: I do these all the time (whiteboard style), and then use the model as it is in the DB as well.

Miss from RDBMS world: easy ways to create reports. Update: maybe it's not that hard to create reports from a graph database, see Creating a Report for a Neo4J Sample Database.

What does distributed mean in the NoSQL definition?

NoSQL systems employ a distributed architecture, with the data held in a redundant manner on several servers. In this way, the system can easily scale out by adding more servers, and failure of a server can be tolerated.

Horizontal scalability is the ability to increase the speed or availability of the server by adding more servers, typically using clustering and load balancing.

So yes, the two terms are closely related: a distributed architecture is what makes horizontal scaling possible, because capacity grows by adding servers to the cluster.

Data replication means that the same data is stored on multiple storage devices.
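As a rough illustration of "data held in a redundant manner on several servers", here is a toy in-memory sketch. The TinyCluster class, its method names, and the modulo-based placement are assumptions made for this example only; real systems typically use more careful schemes (for example consistent hashing) so that adding a server does not move most of the keys.

```python
import hashlib

class TinyCluster:
    """Toy model: each key is stored on `replicas` different servers."""

    def __init__(self, server_names, replicas=2):
        self.names = list(server_names)
        self.servers = {name: {} for name in self.names}  # name -> local store
        self.replicas = min(replicas, len(self.names))
        self.down = set()  # names of servers that are currently unavailable

    def _owners(self, key):
        # Hash the key to pick a starting server, then take the next ones too.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.names)
        return [self.names[(start + i) % len(self.names)]
                for i in range(self.replicas)]

    def fail(self, name):
        self.down.add(name)

    def put(self, key, value):
        for name in self._owners(key):
            if name not in self.down:
                self.servers[name][key] = value

    def get(self, key):
        # Any live replica can answer, so one failed server is tolerated.
        for name in self._owners(key):
            if name not in self.down and key in self.servers[name]:
                return self.servers[name][key]
        raise KeyError(key)

if __name__ == "__main__":
    cluster = TinyCluster(["a", "b", "c"], replicas=2)
    cluster.put("user:42", "Ann")
    cluster.fail(cluster._owners("user:42")[0])  # lose the first replica
    print(cluster.get("user:42"))
```

Because every key lives on two servers here, losing one of them still leaves a copy that can serve reads.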

Schema-free: In Oracle you would define your table structure first and then insert or delete your data, but that's not the case with a schema-free database. There are no schema migrations: your code defines your schema, so there are no more ALTER TABLE statements.
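A small sketch of what "your code defines your schema" can look like in practice, using a plain Python list as a stand-in for a document collection. The collection, insert, and emails names are hypothetical and not taken from any particular product.

```python
# Stand-in for a schema-free collection: documents are just dictionaries.
collection = []

def insert(doc):
    """Store a copy of the document; no table definition is consulted."""
    collection.append(dict(doc))

insert({"name": "Ann"})
# A later version of the application starts storing an extra field.
# Nothing like ALTER TABLE is needed; old and new documents coexist.
insert({"name": "Bob", "email": "bob@example.com"})

def emails():
    """The reading code has to cope with documents that lack the field."""
    return [doc.get("email") for doc in collection]

if __name__ == "__main__":
    print(emails())
```

The flip side is visible in emails(): the work of handling old documents moves from a migration script into the application code.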

Ref: http://en.wikipedia.org/wiki/NoSQL

http://searchcio.techtarget.com/definition/horizontal-scalability

Timeseries NoSQL databases

So to make it short, you can find both pure time series databases and engines that run on top of a more generic engine like Redis, HBase, Cassandra, Elasticsearch ...

Time series databases (TSDBs) are data engines that focus on saving and retrieving time-based information very efficiently.

Since these databases very often capture "events" (systems, devices/IoT, application ticks), they have to support highly concurrent writes, and they usually do a lot more writes than reads.

TSDBs store data points within a time series, and the timestamp is usually the main index/key, which allows very efficient time range queries (give me the data points from this time to this time).
Data points can be multidimensional and can carry tags/labels.

TSDBs provide mathematical operations on data points (SUM, DIV, AVG, ...) to combine data over time.
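To illustrate why using the timestamp as the main key makes range queries cheap, here is a minimal in-memory sketch. The TinySeries class and its method names are invented for this example, tags are left out for brevity, and real TSDBs add compression, persistence, and concurrency control on top of the basic idea.

```python
from bisect import bisect_left, bisect_right

class TinySeries:
    """Toy time series: points are kept sorted by timestamp."""

    def __init__(self):
        self._timestamps = []  # sorted list of timestamps
        self._values = []      # values, kept in the same order

    def add(self, timestamp, value):
        # Find the insertion point by binary search so the series stays sorted,
        # even when points arrive out of order.
        i = bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(i, timestamp)
        self._values.insert(i, value)

    def range(self, start, end):
        """Return (timestamp, value) pairs with start <= timestamp <= end."""
        lo = bisect_left(self._timestamps, start)
        hi = bisect_right(self._timestamps, end)
        return list(zip(self._timestamps[lo:hi], self._values[lo:hi]))

    def total(self, start, end):
        return sum(v for _, v in self.range(start, end))

    def average(self, start, end):
        points = self.range(start, end)
        return sum(v for _, v in points) / len(points) if points else None

if __name__ == "__main__":
    cpu = TinySeries()
    for ts, v in [(10, 1.0), (20, 2.0), (30, 3.0), (15, 5.0)]:
        cpu.add(ts, v)
    print(cpu.range(10, 20), cpu.total(10, 20), cpu.average(10, 20))
```

The range query only needs two binary searches to locate the slice, no matter how many points fall outside the requested window.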

So based on these characteristics you can find databases that offer this. As you mention, you can use specialized solutions like InfluxDB, Druid, or Prometheus, or you can find more generic database engines that provide native time series support or an extension. Let me list some of them:

  • Redis TimeSeries
  • Elasticsearch
  • OpenTSDB, which runs on top of Apache HBase
  • Warp10, which runs on top of Apache HBase

With the recent prevalence of NoSQL databases why would I use a SQL database?

My key question was where a SQL database would really outshine a document database, and from all the responses there really doesn't seem to be much.

NoSQL databases come in just as many variations as relational databases, and they satisfy all or some parts of ACID depending on which database you use, so at this point the two are basically equivalent for solving problems.

After this, the key differences would be tooling and maturity, where SQL databases have a much stronger position as the established player, but this is how it is for all new technology.

Database alternative to MySQL made for millions of TABLES

You could always look towards a NoSQL DB.
From: http://nosql-database.org/

"NoSQL DEFINITION: Next Generation Databases mostly addressing some of
the points: being non-relational, distributed, open-source and
horizontally scalable."

Edit: Scalable is what I was shooting for..

Suggestion:
http://www.mongodb.org/

Edit: Interesting idea about data versioning:
Ways to implement data versioning in MongoDB

When shouldn't you use a relational database?

In my experience, you shouldn't use a relational database when any one of these criteria is true:

  • your data is structured as a hierarchy or a graph (network) of arbitrary depth,
  • the typical access pattern emphasizes reading over writing, or
  • there’s no requirement for ad-hoc queries.

Deep hierarchies and graphs do not translate well to relational tables. Even with the assistance of proprietary extensions like Oracle's CONNECT BY, chasing down trees is a mighty pain using SQL.
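For reference, here is a sketch of what chasing down a tree looks like when the hierarchy is stored as a parent-pointer (adjacency list) table. It uses a recursive common table expression through Python's sqlite3 module instead of Oracle's CONNECT BY; the category table and its contents are hypothetical. Whether this is painful is a matter of taste, but it does show the extra machinery a simple "give me this subtree" question needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES category(id),"
    " name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?)",
    [
        (1, None, "root"),
        (2, 1, "hardware"),
        (3, 2, "widgets"),
        (4, 3, "blue widgets"),
        (5, 1, "software"),
    ],
)

def subtree(conn, root_id):
    """Return (name, depth) for the given node and all of its descendants."""
    rows = conn.execute(
        """
        WITH RECURSIVE sub(id, name, depth) AS (
            SELECT id, name, 0 FROM category WHERE id = ?
            UNION ALL
            SELECT c.id, c.name, s.depth + 1
            FROM category AS c JOIN sub AS s ON c.parent_id = s.id
        )
        SELECT name, depth FROM sub ORDER BY depth, name
        """,
        (root_id,),
    )
    return rows.fetchall()

if __name__ == "__main__":
    print(subtree(conn, 2))
```

The query has to re-join the table to itself once per level, which is the part that does not map naturally onto flat tables.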

Relational databases add a lot of overhead for simple read access. Transactional and referential integrity are powerful, but overkill for some applications. So for read-mostly applications, a file metaphor is good enough.

Finally, you simply don’t need a relational database with its full-blown query language if there are no unexpected queries anticipated. If there are no suits asking questions like "how many 5%-discounted blue widgets did we sell on the east coast grouped by salesperson?", and there never will be, then you, sir, can live free of DB.
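To show the kind of ad-hoc question meant here, this is a sketch of the widget query written against a made-up sale table, run through Python's sqlite3 module. The table layout and the sample rows are assumptions purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale ("
    " salesperson TEXT, product TEXT, color TEXT,"
    " region TEXT, discount_pct INTEGER, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("Ann", "widget", "blue", "east", 5, 10),
        ("Ann", "widget", "blue", "east", 0, 4),   # wrong discount
        ("Bob", "widget", "blue", "east", 5, 3),
        ("Bob", "widget", "red", "east", 5, 7),    # wrong color
        ("Ann", "widget", "blue", "west", 5, 2),   # wrong region
        ("Bob", "widget", "blue", "east", 5, 1),
    ],
)

# The unplanned question is expressed declaratively; nobody had to design
# the storage layout around it in advance.
ADHOC_QUERY = """
    SELECT salesperson, SUM(quantity)
    FROM sale
    WHERE product = 'widget' AND color = 'blue'
      AND region = 'east' AND discount_pct = 5
    GROUP BY salesperson
    ORDER BY salesperson
"""

def blue_widget_report(conn):
    return conn.execute(ADHOC_QUERY).fetchall()

if __name__ == "__main__":
    print(blue_widget_report(conn))
```

With a file or key-value layout, answering the same question would usually mean writing a custom scan over all records.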


