Why Google's Bigtable Referred as a Nosql Database

Why Google's BigTable referred as a NoSQL database?

NoSQL is an umbrella term for all the databases that are different from 'the standard' SQL databases, such as MySQL, Microsoft SQL Server and PostgreSQL.

These 'standard' SQL databases are all relational databases, feature the SQL query language and adhere to the ACID properties. These properties basically boil down to consistency.

A NoSQL database is different because it doesn't support one or more of these key features of the so-called 'SQL databases':

  • Consistency
  • Relational data
  • SQL language

Most of these features go hand in hand.

Consistency

Consistency is where most NoSQL databases differ from SQL databases. You can pull the plug from a SQL database and it will make sure your data is still consistent and uncorrupted. NoSQL databases tend to sacrifice this consistency for better scalability. Google's Bigtable also does this.

Relational data

SQL databases revolve around normalized, relational data. The database ensures that these relations stay valid and consistent, no matter what you throw at it. NoSQL databases usually don't support relations, because they don't support the consistency to enforce these relations. Also, relational data is bad for performance when the data is distributed across several servers.

An exception are graph databases. These are considered NoSQL databases, but do feature relational data. In fact, that's what they're all about!

SQL language

The SQL language was designed especially for relational databases, the so-called 'SQL databases'. Since most NoSQL databases are very different from relational databases, they don't have the need for SQL. Also, some NoSQL databases have features that simply cannot be expressed in SQL, thus requiring a different query language.

Last, but not least, NoSQL is simply a buzzword. It basically means 'anything but the old and trusty MySQL server in the attic', which includes a lot of alternative storage mechanisms. Even a simple text file can be considered a NoSQL solution :)

What database does Google use?

Bigtable

A Distributed Storage System for Structured Data

Bigtable is a distributed storage
system (built by Google) for managing structured data
that is designed to scale to a very
large size: petabytes of data across
thousands of commodity servers.

Many projects at Google store data in
Bigtable, including web indexing,
Google Earth, and Google Finance.
These applications place very
different demands on Bigtable, both in
terms of data size (from URLs to web
pages to satellite imagery) and
latency requirements (from backend
bulk processing to real-time data
serving).

Despite these varied
demands, Bigtable has successfully
provided a flexible, high-performance
solution for all of these Google
products.

Some features

  • fast and extremely large-scale DBMS
  • a sparse, distributed multi-dimensional sorted map, sharing characteristics of both row-oriented and column-oriented databases.
  • designed to scale into the petabyte range
  • it works across hundreds or thousands of machines
  • it is easy to add more machines to the system and automatically start taking advantage of those resources without any reconfiguration
  • each table has multiple dimensions (one of which is a field for time, allowing versioning)
  • tables are optimized for GFS (Google File System) by being split into multiple tablets - segments of the table as split along a row chosen such that the tablet will be ~200 megabytes in size.

Architecture

BigTable is not a relational database. It does not support joins nor does it support rich SQL-like queries. Each table is a multidimensional sparse map. Tables consist of rows and columns, and each cell has a time stamp. There can be multiple versions of a cell with different time stamps. The time stamp allows for operations such as "select 'n' versions of this Web page" or "delete cells that are older than a specific date/time."

In order to manage the huge tables, Bigtable splits tables at row boundaries and saves them as tablets. A tablet is around 200 MB, and each machine saves about 100 tablets. This setup allows tablets from a single table to be spread among many servers. It also allows for fine-grained load balancing. If one table is receiving many queries, it can shed other tablets or move the busy table to another machine that is not so busy. Also, if a machine goes down, a tablet may be spread across many other servers so that the performance impact on any given machine is minimal.

Tables are stored as immutable SSTables and a tail of logs (one log per machine). When a machine runs out of system memory, it compresses some tablets using Google proprietary compression techniques (BMDiff and Zippy). Minor compactions involve only a few tablets, while major compactions involve the whole table system and recover hard-disk space.

The locations of Bigtable tablets are stored in cells. The lookup of any particular tablet is handled by a three-tiered system. The clients get a point to a META0 table, of which there is only one. The META0 table keeps track of many META1 tablets that contain the locations of the tablets being looked up. Both META0 and META1 make heavy use of pre-fetching and caching to minimize bottlenecks in the system.

Implementation

BigTable is built on Google File System (GFS), which is used as a backing store for log and data files. GFS provides reliable storage for SSTables, a Google-proprietary file format used to persist table data.

Another service that BigTable makes heavy use of is Chubby, a highly-available, reliable distributed lock service. Chubby allows clients to take a lock, possibly associating it with some metadata, which it can renew by sending keep alive messages back to Chubby. The locks are stored in a filesystem-like hierarchical naming structure.

There are three primary server types of interest in the Bigtable system:

  1. Master servers: assign tablets to tablet servers, keeps track of where tablets are located and redistributes tasks as needed.
  2. Tablet servers: handle read/write requests for tablets and split tablets when they exceed size limits (usually 100MB - 200MB). If a tablet server fails, then a 100 tablet servers each pickup 1 new tablet and the system recovers.
  3. Lock servers: instances of the Chubby distributed lock service. Lots of actions within BigTable require acquisition of locks including opening tablets for writing, ensuring that there is no more than one active Master at a time, and access control checking.

Example from Google's research paper:

alt text

A slice of an example table that
stores Web pages. The row name is a
reversed URL. The contents column
family contains the page contents, and
the anchor column family contains the
text of any anchors that reference the
page. CNN's home page is referenced by
both the Sports Illustrated and the
MY-look home pages, so the row
contains columns named
anchor:cnnsi.com and
anchor:my.look.ca. Each anchor cell
has one version; the contents column
has three versions, at timestamps
t3, t5, and t6.

API

Typical operations to BigTable are creation and deletion of tables and column families, writing data and deleting columns from a row. BigTable provides this functions to application developers in an API. Transactions are supported at the row level, but not across several row keys.


Here is the link to the PDF of the research paper.

And here you can find a video showing Google's Jeff Dean in a lecture at the University of Washington, discussing the Bigtable content storage system used in Google's backend.

Google's Bigtable vs. A Relational Database

Bigtable is Google's invention to deal with the massive amounts of information that the company regularly deals in. A Bigtable dataset can grow to immense size (many petabytes) with storage distributed across a large number of servers. The systems using Bigtable include projects like Google's web index and Google Earth.

According to Google whitepaper on the subject:

A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

The internal mechanics of Bigtable versus, say, MySQL are so dissimilar as to make comparison difficult, and the intended goals don't overlap much either. But you can think of Bigtable a bit like a single-table database. Imagine, for example, the difficulties you would run into if you tried to implement Google's entire web search system with a MySQL database -- Bigtable was built around solving those problems.

Bigtable datasets can be queried from services like AppEngine using a language called GQL ("gee-kwal") which is a based on a subset of SQL. Conspicuously missing from GQL is any sort of JOIN command. Because of the distributed nature of a Bigtable database, performing a join between two tables would be terribly inefficient. Instead, the programmer has to implement such logic in his application, or design his application so as to not need it.

Price aside, why ever choose Google Cloud Bigtable over Google Cloud Datastore?

Bigtable and Datastore are optimized for slightly different use-cases, and offer different tradeoffs. The main ones are:

Data model:

  • Bigtable is a wide-column database -- think HBase and Cassandra
  • Datastore is a document database -- think MongoDB
  • Note that both of these can be used for key-value use cases

Cost model:

  • Bigtable charges per provisioned nodes
  • Datastore is serverless and charges per operation

In general, Bigtable is a good choice if you need:

  • Fast point-reads and range scans (especially at scale). Bigtable will offer lower latency for key-value lookups, as well as fast scans of contiguous rows - a powerful tool since rows are stored in lexicographic order. If you have simple, predictable query patterns and design your schema well, reading from Bigtable can be incredibly efficient.
  • High throughput writes (again, especially at scale). This is possible in part because Bigtable is eventually consistent - in exchange you can see big wins in price/performance.

Example use-cases that are great for Bigtable include time series data (for IoT, monitoring, and more - think extremely write heavy workloads and massive amounts of data generated over x units of time), analytics (think fraud detection, personalization, recommendations), and ad-serving (every microsecond counts).

Datastore (or Firestore) is a good choice if you need:

  • Query flexibility: Datastore offers document support and secondary indexes.
  • Strong consistency and/or transactions: Bigtable has eventually consistent replication and does not support multi-row transactions.
  • Mobile SDKs: Datastore and Firestore are incredibly well-integrated with firebase ecosystem.

Example use-cases include mobile and web applications, game state, user profiles, and product catalogs.

To answer a few of your questions explicitly:

  • Why is Bigtable used for analytics? It's mostly about performance: analytics use-cases are more likely to have large datasets and require high write throughput. It's a lot easier to run into the limits of a database if you're storing clickstream data, as opposed to something like user account information. Fast scans are also important for analytics use-cases: Bigtable allows you to retrieve all of the information you need about a user or a device extremely quickly, which you can process in a batch job or use to create recommendations and analysis on the fly.
  • Is Bigtable strictly worse than Datastore? Datastore definitely provides more built-in functionality like secondary indexes and document support, and if you need those features, Datastore is a fantastic choice. But that functionality comes with tradeoffs. Bigtable provides perhaps lower-level, but incredibly performant APIs that allow users to make those tradeoffs for themselves: If a user values, say, write performance over secondary indexes, Bigtable is an excellent option. You can think of it as an extremely versatile and powerful infrastructural building block. I actually like the wheel/car analogy: sometimes you don't want the car -- if what you really need is a dirt bike, a set of solid wheels is much more useful :)

Is there any relation between Google BigQuery and Google Bigtable

BigQuery is a Data Warehouse solution in Google Cloud Platform. In BigQuery you can have two kinds of persistent table:

  1. Native/Internal: In this kind of table, you load data from some source (can be some file in GCS, some file that you upload in the load job or you can even create an empty table). After created, this table will store data on a specific BigQuery's storage system.
  2. External: This kind of table is basically a pointer to some external storage like GCS, Google Drive and BigTable

Furthermore, you can also create temporary tables that exists only during the query execution.

Both BigQuery and BigTable can be used as storage for data.
The point is that BigQuery is more like an engine to run queries and make aggregations on huge amounts of data that can be on external sources or even inside BigQuery's own storage system. That makes it good for data analysis.

BigTable is more like a NoSQL database that can deal with petabytes of data and gives you mechanisms to perform data analysis as well. As you can see in the image below, BigTable should be chosen intead of BigQuery if you need low-latency.

Sample Image

What is NoSQL, how does it work, and what benefits does it provide?

  1. What exactly is it?

    On one hand, a specific system, but it has also become a generic word for a variety of new data storage backends that do not follow the relational DB model.

  2. How does it work?

    Each of the systems labelled with the generic name works differently, but the basic idea is to offer better scalability and performance by using DB models that don't support all the functionality of a generic RDBMS, but still enough functionality to be useful. In a way it's like MySQL, which at one time lacked support for transactions but, exactly because of that, managed to outperform other DB systems. If you could write your app in a way that didn't require transactions, it was great.

  3. Why would it be better than using a SQL Database? And how much better is it?

    It would be better when your site needs to scale so massively that the best RDBMS running on the best hardware you can afford and optimized as much as possible simply can't keep up with the load. How much better it is depends on the specific use case (lots of update activity combined with lots of joins is very hard on "traditional" RDBMSs) - could well be a factor of 1000 in extreme cases.

  4. Is the technology too new to start implementing yet or is it worth taking a look into?

    Depends mainly on what you're trying to achieve. It's certainly mature enough to use. But few applications really need to scale that massively. For most, a traditional RDBMS is sufficient. However, with internet usage becoming more ubiquitous all the time, it's quite likely that applications that do will become more common (though probably not dominant).

NoSQL vs. Relational Databases vs. Possible Hybrid

MongoDB has the ability to have documents which include arrays of other documents. This solves many cases where you would have relations in reational databases.

When an invoice has multiple positions, you wouldn't put these positions into a separate collection. You would embed them as an array.

It makes me wonder about real life examples of when to use NoSQL versus when not to?

There are many different NoSQL databases, each one designed with different use-cases in mind. But you tagged this question as MongoDB, so I assume that you mean MongoDB in particular.

MongoDB has two main advantages over relational databases.

First, it scales well.

When the database is too slow or too big, you can easily add more servers by creating a cluster or replica-set of multiple shards. This doesn't work nearly as well with most relational databases.

Second, it allows heterogeneous data.

Imagine, for example, the product database of a computer hardware store. What properties do products have? All products have a price and a vendor. But CPUs have a clock rate, hard drives and RAM chips have a capacity (and these capacities aren't comparable), monitors have a resolution and so on. How would you design this in a relational database? You would either create a very long productID-property-value table or you would create a very wide and sparse product table with every property you can imagine, but most of them being NULL for most products. Both solutions aren't really elegant. But MongoDB can solve this much better because it allows each document in a collection to have a different set of properties.

What can't it do?

As a rather new technology, there isn't that much literature about it. The software ecosystem around it isn't that well either. The tools you can get for relational databases are often much more shiny.

There are also some use-cases MongoDB isn't well-suited for.

  • MongoDB doesn't do JOINs. When your data is very relational and denormalizing it would be counter-productive, it might be a poor choice for your product. But you might want to take a look at graph databases like Neo4j, which focus even more on relations than relational databases. Update 2016: MongoDB 3.2 now has rudimentary JOIN support with the $lookup aggregation stage, but it's still very limited in functionality compared to relational and graph databases.
  • MongoDB doesn't do transactions. At least not complex transactions. Certain actions which only affect a single document are guaranteed to be atomic, but as soon as you affect more than one document, you can't guarantee that no other query will happen in-between and find an inconsistent state.
  • MongoDB is bad for ad-hoc reporting. Its options for data-mining are severely limited. The rather new aggregation functions help and MapReduce can also solve some surprisingly complex problems when you learn to use it smart, but SQL has usually the better tools for things like that.

By denormalizing the data, you should be able to solve all of the same problems that relational databases do... But there are rules on how to normalize data with relational databases. Are there rules that one can use to help them denormalize the data to use a NoSQL solution?

Relational databases are around for about 40 years. Their theory is a well-researched topic in computer science. There are whole libraries of books written about the theory behind them. There is a by-the-book solution for every imaginable corner-case by now.

But NoSQL databases, on the other hand, are a rather new technology. We are still figuring out the best practices. The most frequent advise is: "Use your own head. Think about what queries are performed most often, and optimize your data schema for them."

Any examples on when you might want to consider using both a NoSQL solution in parallel with a relational database?

When possible I would advise against using two different database technologies in the same product:

  • Anyone who maintains and supports the product must be familiar with both technologies
  • Troubleshooting gets a lot harder
  • The sysadmins need to keep an additional database running and updated
  • You have an additional point of failure which can lead to downtime

I would only recommend to mix database technologies when fulfilling your requirements without it doesn't just become hard but physically impossible. Otherwise, make your pick and stay with it.

Bigtable database design theory

... are there books or academic research
papers on designing databases for
bigtable and similar database
paradigms?

Well Bigtable is essentially a database itself, so I take it that your question is more on how to model and to some extent design your schema in these Bigtable like databases. More specifically you would like to know how to do this on Google's App Engine.

With GAE you will be using the Datastore API, which adds a significant layer of abstraction to Bigtable, so to some extent you don't have to worry about low level details as you would if you were using something like HBase. There are a few posts on SO (here's a great answer by a Google Engineer who I think is part of GAE team) that will guide you and offer hints on how to approach this new type of Database system.

Helpful Info:

  1. HBase was inspired by Google's Bigtable (Alternate Link) paper
  2. Hypertable was also inspired by Bigtable paper
  3. Cassandra's Data Model was inspired by Bigtable paper
  4. Hadoop was inspired by Google's GFS and MapReduce papers

Google Bigtable vs BigQuery for storing large number of events

Bigtable is great for large (>= 1TB) mutable data sets. It has low latency under load and is managed by Google. In your case, I think you're on the right track with BigQuery.



Related Topics



Leave a reply



Submit