Which Database Design Gives Better Performance

Which database design gives better performance?

This question is very similar to the one you made yestarday:

Create many tables or just one

The answer is also similar - that depends on what you want to achieve. Both solutions could work and both have pros and cons and one should do a little trade-off analysis in the light of the specific situation. Out of this context is not possible to answer your question.

The only difference I see in both version is foreign key id_country in Person table:

Person(id, name, profession, ****id_country****, id_city)

cities (id, city, id_country)

countries (id, country)

The question is "do we need it?"

So, the pros and cons of both solutions:

1. Solution: With id_contry:

pros: easier retrival of a Person based on land (simpler query) and better performance of this query
cons: more complex underlaying DB and more redundancy, more chance of inreoducing inconsiscenties in the DB, harder updates

2. Solution: Without id_country:

pros: simpler and cleaner model, no redundancy, easier maintenance
cons: slower performance and more complex query for retrival of a Person based on land (simpler query)

So, the 1st solution effectivelly gives you easier query structure and better performance for retrieving Persons by Country (what you wanted), but it has its cost (see pros and cons). On the other side, pragmatic thinking says that country-city data are quite stabile and not often changed and this fact goes in favor of the 1st solution.

If this denormalization and slight chance of inconsistencies something you can live with, you can take the 1st solution.

Which Database Design is more effective in this scenario?

That's called vertical partitioning or "row splitting" and is no silver bullet (nothing is). You are not getting "better performance" you are just getting "different performance". Whether one set of performance characteristics is better than the other is a matter of engineering tradeoff and varies from one case to another.

In your case, 1 million rows will fit comfortably into DBMS cache on today's hardware, producing excellent performance. So unless some of the other reasons apply, keep it simple, in a single table.

And if its 1 billion rows (or 1 trillion or whatever number is too large for the memory standards of the day), keep in mind that if you have indexed your data correctly, the performance will remain excellent long after it became bigger than the cache.

Only in the most extreme of cases will you need to vertically partition the table for performance reasons - in which case you'll have to measure in your own environment with your own access patterns, and determine if it brings any performance benefit at all; and is it large enough to make up for the increased JOINing.

Which is better database design?

Definitely to use COUNT. Storing the number of comments is a classic de-normalization that produces headaches. It's slightly more efficient for retrieval but makes inserts much more expensive: each new comment requires not only an insert into the comments table, but a write lock on the row containing the comment count.

What's the better database design: more tables or more columns?

I have a few fairly simple rules of thumb I follow when designing databases, which I think can be used to help make decisions like this....

Favor normalization. Denormalization is a form of optimization, with all the requisite tradeoffs, and as such it should be approached with a YAGNI attitude.
Make sure that client code referencing the database is decoupled enough from the schema that reworking it doesn't necessitate a major redesign of the client(s).
Don't be afraid to denormalize when it provides a clear benefit to performance or query complexity.
Use views or downstream tables to implement denormalization rather than denormalizing the core of the schema, when data volume and usage scenarios allow for it.

The usual result of these rules is that the initial design will favor tables over columns, with a focus on eliminating redundancy. As the project progresses and denormalization points are identified, the overall structure will evolve toward a balance that compromises with limited redundancy and column proliferation in exchange for other valuable benefits.

Database performance: filtering on column vs. separate table

Or should I create another table, e.g. 'Orders_Archive', so that the Orders table would only contain open orders that I use for the calculations?

Yes. They call that data warehousing. Folks do this because it speeds up the transaction system to eliminate the hardly-used history. First, tables are physically smaller and process faster. Second, a long-running history report doesn't interfere with transactional processing.

Is there any (clear) performance difference in these approaches?

Yes. Bonus. You can restructure your history so that it's no longer in 3NF (for updating) but in a Star Schema (for reporting). The advantages are huge.

Buy Kimball's The Data Warehouse Toolkit book to learn more about star schema design and migrating history out of active tables into warehouse tables.

Database modeling (best design possible for performance of this specific case)

You'll want to use the database for storage only, and write a caching layer over it that manages the ordered lists in memory. It can read from the database when it starts up, write changes to the database after applying them in memory, and manage the ordering in memory.

When you need to present a leaderboard, it comes from memory immediately and is very fast.

Performance in database design

As you've explicitly mentioned OO and that you're using EntityFramework, perhaps its worth approaching the problem instead from how the framework is intended to work - rather than just building a database structure and then trying to model it?

Entity Framework Code First Inheritance : Table Per Hierarchy and Table Per Type is a nice introduction to the various strategies that you could pick from.

As for the note on adding unnecessary overhead to the database - I wouldn't worry about it just yet. EF is generally about getting a product built more rapidly and as it has to cope with a more general case, doesn't always produce the most efficient SQL. If the performance is a problem after your application is built, working and correct you can revisit and fix up the most inefficient stuff then.

better performance on database design

Do you really need to focus on performance to evaluate design here? A single instance of MySQL can probably run more than 10k select queries a second on a commodity hw with right indexes whichever design you choose. I think you should consider other non-functional attributes like maintainability, extensibility, expressiveness.

So, I would go with number 1. student_id, exam_type, exam_score. This makes it easier to add another type of exam later on. Also if you want to do aggregation (min, max, avg), select queries will be much easier.

High availabity and Database Design

Facebook and other large companies that have huge databases employ database partitioning.

Partitioning is the distribution of a table over multiple subtables that may reside on different databases or servers in order to improve read/write performance. SQL Server partitioning is typically done at the table level, and a database is considered partitioned when groups of related tables have been distributed. Tables are normally partitioned horizontally or vertically.

Horizontal Partitioning (also known as sharding) improves overall read/write performance
Horizontal partitioning involves putting different rows into different tables. Perhaps customers with ZIP codes less than 50000 are stored in CustomersEast, while customers with ZIP codes greater than or equal to 50000 are stored in CustomersWest. The two partition tables are then CustomersEast and CustomersWest, while a view with a union might be created over both of them to provide a complete view of all customers.
Horizontal partitioning is a database design principle whereby rows of a database table are held separately, rather than splitting by columns (as for normalization). Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.
There are numerous advantages to this partitioning approach. The total number of rows in each table is reduced. This reduces index size, which generally improves search performance. A database shard can be placed on separate hardware, and multiple shards can be placed on multiple machines. This enables a distribution of the database over a large number of machines, which means that the database performance can be spread out over multiple machines, greatly improving performance. In addition, if the database shard is based on some real-world segmentation of the data (e.g. European customers vs. American customers) then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.
Sharding is in practice far more difficult than this. Although it has been done for a long time by hand-coding (especially where rows have an obvious grouping, as per the example above), this is often inflexible. There is a desire to support sharding automatically, both in terms of adding code support for it, and for identifying candidates to be sharded separately.
Where distributed computing is used to separate load between multiple servers (either for performance or reliability reasons) a shard approach may also be useful.
Shards compared to horizontal partitioning
Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index, e.g. the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found.
Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.
Splitting shards across multiple isolated instances requires more than simple horizontal partitioning. The hoped-for gains in efficiency would be lost, if querying the database required both instances to be queried, just to retrieve a simple dimension table. Beyond partitioning, sharding thus splits large partitionable tables across the servers, whilst smaller tables are replicated into them en masse.
This is also why sharding is related to a shared nothing architecture - once sharded, each shard can live in a totally separate logical schema instance / physical database server / data center / continent. There is no ongoing need to retain shared access (from between shards) to the other unpartitioned tables in other shards.
This makes replication across multiple servers easy (simple horizontal partitioning can't). It is also useful for worldwide distribution of applications, where communications links between data centers would otherwise be a bottleneck.
Obviously there is also a need for some notification and replication mechanism between schema instances, so that the unpartitioned tables remain as closely synchronized as the application demands. This is a complex choice in the architecture of sharded systems: approaches range from making these effectively read-only (updates are rare and batched), to dynamically replicated tables (at the cost of reducing some of the distribution benefits of sharding) and many options in between.
Vertical Partitioning improves access to data
In a vertically partitioned table, columns are removed from the main table and placed in child tables through a process called denormalization. This type of partitioning allows you to fit more rows on a database page, making tables narrower to improve data-access performance. Therefore, a single I/O operation will return more rows. By vertically partitioning your data, you may have to resort to joins to return the denormalized columns.

In addition to partitioning, of course, there is replication, making multiple copies of the data available.

Effect on Relational Database Schemas

Sharding does destroy your relational database – which is a good thing. The idea behind sharding is to distribute data to several databases based on certain criterias. This could for example be the primary key. All entities that keys begin with 1 go to one database, with 2 to another and so on (often modulo functions on the key are used, or groups based on business data like customer location, or function). Several reasons exists for sharding, the main two being better performance and lower impact of crashed databases – only persons with a name that starts with S will be affected by a database crash.

Relational databases were the tool of choice for several decades when it comes to data storage. But they do more than store data. Even reading operations can be split into several functions. There are at least three kinds of database read queries:

Data graph building queries: With these you get your data out of the database, customers together with adresses etc.
Aggregation queries: How many orders have been stored in the August, aggregated by product category
Search queries: Give me all customers who live in New York

Sharding now does away with the second and third query and reduces databases to data storage. Because the shards are different databases on different systems you can’t aggregate queries (compared to a cluster) without custom code across systems and you cannot search with one query (only several ones – one to each database). Databases have lead to the notion that search and retrieval are linked together and should be dealt together. Most people think as retrieval and search as the same thing. This has blocked development on technologies. Sharding, S3, Dynamo, Memcached have changed this preception recently. Rickard from Qi4j fame said this:

Entities are really cool. We have
decided to split the storage from the
indexing/querying, sort of like how
the internet works with websites vs
Google, which makes it possible to
implement really simple storages. Not
having to deal with queries makes
things a whole lot easier.

Thus, storage and search are two different things and any sizable web related company handles them differently.

People talked about splitting storage and search for some time now. Search engines like Lucene have driven searching out of databases. But mainly the notion of store & search is prevalent. Sharding as a mechanism for more perfomance and lower risk will move into many web companies and reduce databases to storage mechanism and drop the aggreation (data warehouse and reporting) and search parts. Those can be better filled with real data warehouse servers like Mondrian and search services based on Lucene or semantic enginse like Sesame. And storage might move from relational databases to simple storages like Amazon Simple DB or JDBM or NoSQL.

Which Database Design Gives Better Performance