Why Are Joins Bad When Considering Scalability

Why are joins bad when considering scalability?

Scalability is all about pre-computing (caching), spreading out, or paring down the repeated work to the bare essentials, in order to minimize resource use per work unit. To scale well, you avoid doing anything you don't need to do in volume, and the things you actually do, you make sure are done as efficiently as possible.

In that context, of course joining two separate data sources is relatively slow, at least compared to not joining them, because it's work you need to do live at the point where the user requests it.

But remember the alternative is no longer having two separate pieces of data at all; you have to put the two disparate data points in the same record. You can't combine two different pieces of data without a consequence somewhere, so make sure you understand the trade-off.

The good news is modern relational databases are good at joins. You shouldn't really think of joins as slow with a good database used well. There are a number of scalability-friendly ways to take raw joins and make them much faster:

  • Join on a surrogate key (autonumber/identity column) rather than a natural key. This means smaller (and therefore faster) comparisons during the join operation.
  • Indexes
  • Materialized/indexed views (think of this as a pre-computed join or managed de-normalization; see the sketch after this list)
  • Computed columns. You can use this to hash or otherwise pre-compute the key columns of a join, such that what would be a complicated comparison for a join is now much smaller and potentially pre-indexed.
  • Table partitions (helps with large data sets by spreading the load out to multiple disks, or limiting what might have been a table scan down to a partition scan)
  • OLAP (pre-computes results of certain kinds of queries/joins. It's not quite true, but you can think of this as generic denormalization)
  • Replication, Availability Groups, Log shipping, or other mechanisms to let multiple servers answer read queries for the same database, and thus scale your workload out among several servers.
  • Use of a caching layer like Redis to avoid re-running queries which need complex joins.
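
As a rough sketch of the materialized/indexed view idea, in SQL Server syntax (the tables here are invented for the illustration; other engines offer materialized views with the same intent):

    -- Hypothetical normalized tables
    CREATE TABLE dbo.Customers (
        CustomerId INT IDENTITY PRIMARY KEY,   -- surrogate key
        Name       NVARCHAR(100) NOT NULL
    );
    CREATE TABLE dbo.Orders (
        OrderId    INT IDENTITY PRIMARY KEY,
        CustomerId INT NOT NULL REFERENCES dbo.Customers (CustomerId),
        Total      DECIMAL(10,2) NOT NULL
    );
    GO

    -- A view that expresses the join
    CREATE VIEW dbo.OrdersWithCustomer
    WITH SCHEMABINDING
    AS
    SELECT o.OrderId, o.Total, c.CustomerId, c.Name
    FROM dbo.Orders AS o
    JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId;
    GO

    -- The unique clustered index is what materializes the view: the join result
    -- is now stored and maintained by the engine, at the cost of slightly slower
    -- writes to the base tables.
    CREATE UNIQUE CLUSTERED INDEX IX_OrdersWithCustomer
        ON dbo.OrdersWithCustomer (OrderId);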

I would go as far as saying the main reason relational databases exist at all is to allow you to do joins efficiently*. It's certainly not just to store structured data (you could do that with flat-file constructs like CSV or XML). A few of the options I listed will even let you completely build your join in advance, so the results are already done before you issue the query — just as if you had denormalized the data (admittedly at the cost of slower write operations).

If you have a slow join, you're probably not using your database correctly.

De-normalization should be done only after these other techniques have failed. And the only way you can truly judge "failure" is to set meaningful performance goals and measure against those goals. If you haven't measured, it's too soon to even think about de-normalization.


* That is, exist as entities distinct from mere collections of tables. An additional reason for a real RDBMS is safe concurrent access.

Why do joins reduce scalability in large-scale distributed database systems?

As a general consideration, there is significant overhead (i.e. non-user computation) in any distributed system that presents a 'coherent' and 'unified' facade.

Simply consider these factors:

  • distinct nodes (e.g. servers) are distinct machines. This means the probability of having n nodes participating in a distributed action -- e.g. a join -- all being in an optimal state (e.g. having just the right tables in cache, or holding the appropriate locks) is low. So part of the overhead is getting each node into the appropriate state.

  • naturally they need to communicate to coordinate. So there is network chatter between nodes and those latencies are not insignificant.

  • the above overheads, in turn, increase the average time to service a request, and thus reduce availability (in terms of system capacity).

Scalability becomes an issue as none of the above are O(1). At the very best you can expect O(log n) and it could be as bad as O(n^2). That does wonders for killing scalability (which by definition means the ability of the system to scale to a larger number of nodes).

The above are part of the motivation for NoSQL systems: if one does not require coordination across nodes to service queries, then performance is substantially better. (As you can see, it is not magic -- we're merely sacrificing systemic correctness for performance.)
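
To make the coordination cost concrete, here is an illustration against a hypothetical cluster where orders and order_items are co-sharded on customer_id (the schema and the sharding scheme are assumptions for the example, not any particular product's behaviour):

    -- Every row this join needs lives on the shard that owns customer_id = 42,
    -- so one node can answer it with no cross-node traffic:
    SELECT o.order_id, i.sku
    FROM orders o
    JOIN order_items i ON i.order_id = o.order_id
    WHERE o.customer_id = 42;

    -- products is distributed on a different key (or lives elsewhere entirely),
    -- so this join has to ship rows between nodes and pay the coordination
    -- overhead described above:
    SELECT o.order_id, p.name
    FROM orders o
    JOIN products p ON p.product_id = o.product_id
    WHERE p.category = 'books';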

When and why are database joins expensive?

Denormalising to improve performance? It sounds convincing, but it doesn't hold water.

Chris Date, who in company with Dr Ted Codd was the original proponent of the relational data model, ran out of patience with misinformed arguments against normalisation and systematically demolished them using scientific method: he got large databases and tested these assertions.

I think he wrote it up in Relational Database Writings 1988-1991 but this book was later rolled into edition six of Introduction to Database Systems, which is the definitive text on database theory and design, in its eighth edition as I write and likely to remain in print for decades to come. Chris Date was an expert in this field when most of us were still running around barefoot.

He found that:

  • Some of them hold for special cases
  • All of them fail to pay off for general use
  • All of them are significantly worse for other special cases

It all comes back to mitigating the size of the working set. Joins involving properly selected keys with correctly set up indexes are cheap, not expensive, because they allow significant pruning of the result before the rows are materialised.

Materialising the result involves bulk disk reads which are the most expensive aspect of the exercise by an order of magnitude. Performing a join, by contrast, logically requires retrieval of only the keys. In practice, not even the key values are fetched: the key hash values are used for join comparisons, mitigating the cost of multi-column joins and radically reducing the cost of joins involving string comparisons. Not only will vastly more fit in cache, there's a lot less disk reading to do.

Moreover, a good optimiser will choose the most restrictive condition and apply it before it performs a join, very effectively leveraging the high selectivity of joins on indexes with high cardinality.
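
As a small sketch of that behaviour (hypothetical tables and indexes; the exact plan depends on the engine and its statistics):

    -- Hypothetical schema: customers(customer_id PK, country, ...)
    --                      orders(order_id PK, customer_id, order_date, ...)
    CREATE INDEX ix_customers_country ON customers (country, customer_id);
    CREATE INDEX ix_orders_customer   ON orders (customer_id, order_date);

    -- A good optimiser can use ix_customers_country to find the few matching
    -- customer_ids first, probe ix_orders_customer with them, and only fetch
    -- the full rows for those that survive both filters.
    SELECT o.order_id, o.order_date
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    WHERE c.country = 'NZ'
      AND o.order_date >= '2024-01-01';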

Admittedly this type of optimisation can also be applied to denormalised databases, but the sort of people who want to denormalise a schema typically don't think about cardinality when (if) they set up indexes.

It is important to understand that table scans (examination of every row in a table in the course of producing a join) are rare in practice. A query optimiser will choose a table scan only when one or more of the following holds.

  • There are fewer than 200 rows in the relation (in this case a scan will be cheaper)
  • There are no suitable indexes on the join columns (if it's meaningful to join on these columns then why aren't they indexed? fix it)
  • A type coercion is required before the columns can be compared (WTF?! fix it or go home; see the ADO.NET note at the end)
  • One of the arguments of the comparison is an expression, so no index can be used (see the example after this list)
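
To illustrate the last two items (the tables and columns are invented for the example):

    -- Type coercion: customer_code is VARCHAR in one table and INT in the other,
    -- so one side must be converted before the comparison, and the index on the
    -- converted column generally cannot be used for a seek.
    SELECT o.order_id
    FROM orders o
    JOIN customers c ON c.customer_code = o.customer_code;  -- mismatched types: fix the schema

    -- An expression on the join column: no index exists on UPPER(c.customer_code),
    -- so this too falls back to scanning.
    SELECT o.order_id
    FROM orders o
    JOIN customers c ON UPPER(c.customer_code) = o.customer_code;

    -- Bare columns of the same type, indexed on both sides: an index-driven join is possible.
    SELECT o.order_id
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id;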

Performing an operation is more expensive than not performing it. However, performing the wrong operation, being forced into pointless disk I/O and then discarding the dross prior to performing the join you really need, is much more expensive. Even when the "wrong" operation is precomputed and indexes have been sensibly applied, there remains significant penalty. Denormalising to precompute a join - notwithstanding the update anomalies entailed - is a commitment to a particular join. If you need a different join, that commitment is going to cost you big.

If anyone wants to remind me that it's a changing world, I think you'll find that bigger datasets on gruntier hardware just exaggerate the spread of Date's findings.

For all of you who work on billing systems or junk mail generators (shame on you) and are indignantly setting hand to keyboard to tell me that you know for a fact that denormalisation is faster, sorry but you're living in one of the special cases - specifically, the case where you process all of the data, in-order. It's not a general case, and you are justified in your strategy.

You are not justified in falsely generalising it. See the end of the notes section for more information on appropriate use of denormalisation in data warehousing scenarios.

I'd also like to respond to

Joins are just cartesian products with some lipgloss

What a load of bollocks. Restrictions are applied as early as possible, most restrictive first. You've read the theory, but you haven't understood it. Joins are treated as "cartesian products to which predicates apply" only by the query optimiser. This is a symbolic representation (a normalisation, in fact) to facilitate symbolic decomposition so the optimiser can produce all the equivalent transformations and rank them by cost and selectivity so that it can select the best query plan.

The only way you will ever get the optimiser to produce a cartesian product is to fail to supply a predicate: SELECT * FROM A,B
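
Spelled out with the same hypothetical tables A and B (the join column names are invented):

    -- A cartesian product: every row of A paired with every row of B.
    SELECT * FROM A, B;

    -- A join with a predicate: the optimiser never has to build the full product;
    -- with an index on B.a_id it can restrict as it goes.
    SELECT * FROM A JOIN B ON B.a_id = A.id;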


Notes


David Aldridge provides some important additional information.

There is indeed a variety of other strategies besides indexes and table scans, and a modern optimiser will cost them all before producing an execution plan.

A practical piece of advice: if it can be used as a foreign key then index it, so that an index strategy is available to the optimiser.
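
For example (invented tables; note that many engines do not create this index for you automatically, MySQL's InnoDB being a notable exception):

    CREATE TABLE orders (
        order_id    INT PRIMARY KEY,
        customer_id INT NOT NULL REFERENCES customers (customer_id),  -- usable as a foreign key...
        order_date  DATE NOT NULL
    );

    -- ...so index it, and the optimiser always has an index strategy for joins on it.
    CREATE INDEX ix_orders_customer_id ON orders (customer_id);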

I used to be smarter than the MSSQL optimiser. That changed two versions ago. Now it generally teaches me. It is, in a very real sense, an expert system, codifying all the wisdom of many very clever people in a domain sufficiently closed that a rule-based system is effective.


"Bollocks" may have been tactless. I am asked to be less haughty and reminded that math doesn't lie. This is true, but not all of the implications of mathematical models should necessarily be taken literally. Square roots of negative numbers are very handy if you carefully avoid examining their absurdity (pun there) and make damn sure you cancel them all out before you try to interpret your equation.

The reason that I responded so savagely was that the statement as worded says that

Joins are cartesian products...

This may not be what was meant but it is what was written, and it's categorically untrue. A cartesian product is a relation. A join is a function. More specifically, a join is a relation-valued function. With an empty predicate it will produce a cartesian product, and checking that it does so is one correctness check for a database query engine, but nobody writes unconstrained joins in practice because they have no practical value outside a classroom.

I called this out because I don't want readers falling into the ancient trap of confusing the model with the thing modelled. A model is an approximation, deliberately simplified for convenient manipulation.


The cut-off for selection of a table-scan join strategy may vary between database engines. It is affected by a number of implementation decisions such as tree-node fill-factor, key-value size and subtleties of algorithm, but broadly speaking high-performance indexing has an execution time of k log n + c. The c term is a fixed overhead mostly made up of setup time, and the shape of the curve means you don't get a payoff (compared to a linear search) until n is in the hundreds.


Sometimes denormalisation is a good idea

Denormalisation is a commitment to a particular join strategy. As mentioned earlier, this interferes with other join strategies. But if you have buckets of disk space, predictable patterns of access, and a tendency to process much or all of it, then precomputing a join can be very worthwhile.

You can also figure out the access paths your operation typically uses and precompute all the joins for those access paths. This is the premise behind data warehouses, or at least it is when they're built by people who know why they're doing what they're doing, and not just for the sake of buzzword compliance.

A properly designed data warehouse is produced periodically by a bulk transformation out of a normalised transaction processing system. This separation of the operations and reporting databases has the very desirable effect of eliminating the clash between OLTP and OLAP (online transaction processing ie data entry, and online analytical processing ie reporting).

An important point here is that apart from the periodic updates, the data warehouse is read only. This renders moot the question of update anomalies.

Don't make the mistake of denormalising your OLTP database (the database on which data entry happens). It might be faster for billing runs but if you do that you will get update anomalies. Ever tried to get Reader's Digest to stop sending you stuff?

Disk space is cheap these days, so knock yourself out. But denormalising is only part of the story for data warehouses. Much bigger performance gains are derived from precomputed rolled-up values: monthly totals, that sort of thing. It's always about reducing the working set.
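
A sketch of the rollup idea (invented fact table; some engines spell the create-table-from-query statement differently, e.g. SELECT ... INTO on SQL Server):

    -- Rebuilt once per warehouse load: reports read a handful of summary rows
    -- instead of re-aggregating millions of fact rows.
    CREATE TABLE monthly_sales_summary AS
    SELECT customer_id,
           EXTRACT(YEAR  FROM order_date) AS order_year,
           EXTRACT(MONTH FROM order_date) AS order_month,
           SUM(total_amount) AS month_total,
           COUNT(*)          AS order_count
    FROM sales_facts
    GROUP BY customer_id,
             EXTRACT(YEAR  FROM order_date),
             EXTRACT(MONTH FROM order_date);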


ADO.NET problem with type mismatches

Suppose you have a SQL Server table containing an indexed column of type varchar, and you use AddWithValue to pass a parameter constraining a query on this column. C# strings are Unicode, so the inferred parameter type will be NVARCHAR, which doesn't match VARCHAR.

VARCHAR to NVARCHAR is a widening conversion so it happens implicitly - but say goodbye to indexing, and good luck working out why.
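
A minimal T-SQL sketch of the effect (the table and values are invented); on the application side the usual fix is to set the parameter type explicitly (e.g. SqlDbType.VarChar) rather than relying on AddWithValue's inference:

    -- The VARCHAR column is indexed (here via the primary key).
    CREATE TABLE Customers (
        CustomerCode VARCHAR(20)  NOT NULL PRIMARY KEY,
        Name         VARCHAR(100) NOT NULL
    );

    -- What AddWithValue effectively sends: the parameter is inferred as NVARCHAR,
    -- so CustomerCode is implicitly widened before the comparison and, depending
    -- on collation and version, the index can no longer be seeked.
    DECLARE @code NVARCHAR(20) = N'ABC123';
    SELECT Name FROM Customers WHERE CustomerCode = @code;

    -- Declaring the parameter with the column's actual type keeps the seek.
    DECLARE @code2 VARCHAR(20) = 'ABC123';
    SELECT Name FROM Customers WHERE CustomerCode = @code2;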


"Count the disk hits" (Rick James)

If everything is cached in RAM, JOINs are rather cheap. That is, normalization does not have much performance penalty.

If a "normalized" schema causes JOINs to hit the disk a lot, but the equivalent "denormalized" schema would not have to hit the disk, then denormalization wins a performance competition.

Comment from original author: Modern database engines are very good at organising access sequencing to minimise cache misses during join operations. The above, while true, might be misconstrued as implying that joins are necessarily problematically expensive on large data, which could lead to poor decision-making on the part of inexperienced developers.

Isn't using an unnormalized design better when there are multiple JOINs?

The second approach uses two JOINs. I guess it will be slower than using REGEXP in a huge dataset.

Your intuition is simply wrong. Databases are designed to do JOINs. They can take advantage of indexing and partitioning to speed queries. More advanced databases (than MySQL) use statistics on tables to choose optimal algorithms for executing the query.

Your first query always requires a full table scan of posts. Your second query can be optimized in various ways.

Further, maintaining the consistency of the data in the database is much more difficult with the first approach. You probably need to implement triggers to handle updates and inserts on all the tables. That slows things down.

There are some cases where it is worth the effort to do this -- think about summary counts or totals of dollars or time. Putting tags into a delimited string is much less likely to be beneficial, because parsing the string in SQL is itself costly and unlikely to be a big win relative to the other costs.
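
Sketched against a hypothetical tagging schema (the question's actual tables aren't shown here):

    -- Approach 1: tags crammed into a delimited string column on posts.
    -- The REGEXP can't use an index, so every row of posts is scanned and parsed.
    SELECT p.*
    FROM posts p
    WHERE p.tags REGEXP '(^|,)mysql(,|$)';

    -- Approach 2: normalized, with indexes on tags(name) and post_tags(tag_id, post_id).
    -- The optimizer finds the tag row, then the matching post_ids, from indexes alone.
    SELECT p.*
    FROM tags t
    JOIN post_tags pt ON pt.tag_id = t.tag_id
    JOIN posts p ON p.post_id = pt.post_id
    WHERE t.name = 'mysql';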

[MySQL] Joins are evil - Cal Henderson

I'm somewhat exaggerating when I say they're evil.

For very large data sets, even when they fit within a single database, joining is an expensive operation (lots of non-sequential IO). With a typical web-app load (90/10 read/write), your reads need to be as cheap as possible, while you can spend more time on writes (and lazily replicate writes in many cases).

In a typical high-performance web-app, you're going to want to perform all database IO within a couple of hundred milliseconds, so that's your first limit. Secondly, you want to be able to do plenty of concurrent requests. This tends to point to being able to collect records straight from index for large tables. Someone already mentioned that you don't need to send a ton of data to the browser, so performing the join across the whole dataset isn't needed, but consider ordering: if you can't get the records in the correct order straight from index, you're going to need to perform the entire join before ordering the results.

For multi-machine partitioned data, the same problems apply but on a larger scale. The usual solution is materialized views (data flattening) to enable join-like queries by performing multiple writes at insert/update/delete time (or lazily afterward) and using very simple indexed selects.
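
A hedged sketch of the flattening idea with invented tables: the normalized rows stay the source of truth, and a read-optimized copy is written alongside them so reads become simple indexed selects.

    -- Normalized write (source of truth)
    INSERT INTO comments (comment_id, post_id, author_id, body, created_at)
    VALUES (1001, 42, 7, 'Nice post', NOW());

    -- Flattened copy written at the same time (or lazily by a background job):
    -- everything a page render needs, keyed for a single indexed lookup.
    INSERT INTO post_comments_flat
        (post_id, comment_id, author_name, body, created_at)
    SELECT c.post_id, c.comment_id, u.name, c.body, c.created_at
    FROM comments c
    JOIN users u ON u.user_id = c.author_id
    WHERE c.comment_id = 1001;

    -- Reads never join:
    SELECT author_name, body, created_at
    FROM post_comments_flat
    WHERE post_id = 42
    ORDER BY created_at
    LIMIT 20;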

It's obviously the case that joins are useful and are perfectly good most of the time. But for large datasets, in a database that doesn't natively support materialized views, this falls down at high concurrency.

And the specific complaint about Django is that because of the inflexibility in changing models on existing data, people are encouraged to create 1-to-1 mapped tables which are only ever joined against, rather than adding columns to existing tables.

Need to understand use cases for JOINS - why join when you are interested in only part of the join?

Often the JOIN is being used merely to determine the rows in one table that you're interested in. For instance, "tell me the names of all the people who live in green houses". If you have a table of people and a table of houses, you'll join them to determine who lives in each house, and use a WHERE clause that selects the green houses. But you'll only return names from the People table.
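
That example as a query (column names invented for the illustration):

    -- Only the People table contributes to the result; Houses is joined purely
    -- to decide which people qualify.
    SELECT p.name
    FROM people p
    JOIN houses h ON h.house_id = p.house_id
    WHERE h.colour = 'green';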

Why NoSQL is better at scaling out than RDBMS?

An RDBMS has ACID properties ( http://en.wikipedia.org/wiki/ACID ) and supports transactions. Scaling "out" with an RDBMS is harder to implement because of these guarantees.

NoSQL solutions usually offer record-level atomicity, but cannot guarantee a series of operations will succeed (transaction).

It comes down to: to keep data integrity and support transactions, a multi-server RDBMS would need to have a fast backend communication channel to synchronize all possible transactions and writes, while preventing/handling deadlock.

This is why you usually only see 1 master (writer) and multiple slaves (readers).


