Why Is Select * Considered Harmful

Why is SELECT * considered harmful?

There are really three major reasons:

  • Inefficiency in moving data to the consumer. When you SELECT *, you're often retrieving more columns from the database than your application really needs to function. This causes more data to move from the database server to the client, slowing access and increasing load on your machines, as well as taking more time to travel across the network. This is especially true when someone adds new columns to underlying tables that didn't exist and weren't needed when the original consumers coded their data access.

  • Indexing issues. Consider a scenario where you want to tune a query to a high level of performance. If you were to use *, and it returned more columns than you actually needed, the server would often have to perform more expensive methods to retrieve your data than it otherwise might. For example, you wouldn't be able to create an index which simply covered the columns in your SELECT list, and even if you did (including all columns [shudder]), the next guy who came around and added a column to the underlying table would cause the optimizer to ignore your optimized covering index, and you'd likely find that the performance of your query would drop substantially for no readily apparent reason.

  • Binding Problems. When you SELECT *, it's possible to retrieve two columns of the same name from two different tables. This can often crash your data consumer. Imagine a query that joins two tables, both of which contain a column called "ID". How would a consumer know which was which? SELECT * can also confuse views (at least in some versions SQL Server) when underlying table structures change -- the view is not rebuilt, and the data which comes back can be nonsense. And the worst part of it is that you can take care to name your columns whatever you want, but the next guy who comes along might have no way of knowing that he has to worry about adding a column which will collide with your already-developed names.

But it's not all bad for SELECT *. I use it liberally for these use cases:

  • Ad-hoc queries. When trying to debug something, especially off a narrow table I might not be familiar with, SELECT * is often my best friend. It helps me just see what's going on without having to do a boatload of research as to what the underlying column names are. This gets to be a bigger "plus" the longer the column names get.

  • When * means "a row". In the following use cases, SELECT * is just fine, and rumors that it's a performance killer are just urban legends which may have had some validity many years ago, but don't now:

    SELECT COUNT(*) FROM table;

    in this case, * means "count the rows". If you were to use a column name instead of * , it would count the rows where that column's value was not null. COUNT(*), to me, really drives home the concept that you're counting rows, and you avoid strange edge-cases caused by NULLs being eliminated from your aggregates.

    Same goes with this type of query:

    SELECT a.ID FROM TableA a
    WHERE EXISTS (
    SELECT *
    FROM TableB b
    WHERE b.ID = a.B_ID);

    in any database worth its salt, * just means "a row". It doesn't matter what you put in the subquery. Some people use b's ID in the SELECT list, or they'll use the number 1, but IMO those conventions are pretty much nonsensical. What you mean is "count the row", and that's what * signifies. Most query optimizers out there are smart enough to know this. (Though to be honest, I only know this to be true with SQL Server and Oracle.)

Is SELECT * a good practice if I do not need any data?

If you are writing application code, then select * is a poor practice. You want the application to be specific about the columns it is using. And you don't want to return unnecessary data to the application.

If you are writing code that you expect to run over time -- say as a scheduled job -- then select * is also not recommended for similar reasons.

MySQL tends to materialize subqueries (although it is getting better at avoiding this). So in MySQL, using select * in a subquery in the from clause (a "derived table") is not recommended for performance reasons. This is not true in other databases, which have smarter compilers.

But for daily query writing, particularly in the outermost select, select * or select t.* is a great convenience.

For EXISTS and NOT EXISTS, it makes no difference at all. I usually write:

EXISTS (SELECT 1 FROM `table` WHERE ~~STATEMENT~~)

EXISTS and NOT EXISTS are only looking for the existence of rows, not at the values in columns. So, what is selected should make no difference on performance. I use SELECT 1 because it is easy to type and I think pretty clearly conveys what it needs to.

What is the reason not to use select *?

The essence of the quote of not prematurely optimizing is to go for simple and straightforward code and then use a profiler to point out the hot spots, which you can then optimize to be efficient.

When you use select * you're make it impossible to profile, therefore you're not writing clear & straightforward code and you are going against the spirit of the quote. select * is an anti-pattern.


So selecting columns is not a premature optimization. A few things off the top of my head ....

  1. If you specify columns in a SQL statement, the SQL execution engine will error if that column is removed from the table and the query is executed.
  2. You can more easily scan code where that column is being used.
  3. You should always write queries to bring back the least amount of information.
  4. As others mention if you use ordinal column access you should never use select *
  5. If your SQL statement joins tables, select * gives you all columns from all tables in the join

The corollary is that using select * ...

  1. The columns used by the application is opaque
  2. DBA's and their query profilers are unable to help your application's poor performance
  3. The code is more brittle when changes occur
  4. Your database and network are suffering because they are bringing back too much data (I/O)
  5. Database engine optimizations are minimal as you're bringing back all data regardless (logical).

Writing correct SQL is just as easy as writing Select *. So the real lazy person writes proper SQL because they don't want to revisit the code and try to remember what they were doing when they did it. They don't want to explain to the DBA's about every bit of code. They don't want to explain to their clients why the application runs like a dog.

Can select * usage ever be justified?

I'm quite happy using * in audit triggers.

In that case it can actually prove a benefit because it will ensure that if additional columns are added to the base table it will raise an error so it cannot be forgotten to deal with this in the audit trigger and/or audit table structure.

(Like dotjoe) I am also happy using it in derived tables and column table expressions. Though I habitually do it the other way round.

WITH t
AS (SELECT *,
ROW_NUMBER() OVER (ORDER BY a) AS RN
FROM foo)
SELECT a,
b,
c,
RN
FROM t;

I'm mostly familiar with SQL Server and there at least the optimiser has no problem recognising that only columns a,b,c will be required and the use of * in the inner table expression does not cause any unnecessary overhead retrieving and discarding unneeded columns.

In principle SELECT * ought to be fine in a view as well as it is the final SELECT from the view where it ought to be avoided however in SQL Server this can cause problems as it stores column metadata for views which is not automatically updated when the underlying tables change and the use of * can lead to confusing and incorrect results unless sp_refreshview is run to update this metadata.

Is using select * for a cursor in PL/SQL considered bad programming?

Using select * in your code is what I would call lazy programming, with several nasty side effects. How much you experience those side effects, will differ, but it's never positive.

I'll use some of the points already mentioned in other answers, but feel free to edit my answer and add some more negative points about using select *.

  1. You are shipping more data from the SQL engine to your code than necessary, which has a negative effect on performance.

  2. The information you get back needs to be placed in variables (a record variable for example). This will take more PGA memory than necessary.

  3. By using select * you will never use an index alone to retrieve the wanted information, you'll always have to visit the table as well (provided no index exists which holds all columns of the table). Again, with a negative effect on performance.

  4. Less clear for people maintaining your code what your intention is. They need to delve into the code to spot all occurrences of your record variable to know what is being retrieved.

  5. You will not use SQL functions to perform calculations, but always rely on PL/SQL or Java calculations. You are possibly missing out on some great SQL improvements like analytic functions, model clause, recursive subquery factoring and the like.

  6. From Oracle11 onwards, dependencies are being tracked on column level, meaning that when you use select *, your code is being marked in the data dictionary as "dependent on all columns" of that table. Your procedure will be invalidated when something happens to one of those columns. So using select * means your code will be invalidated more often than necessary.

Again, feel free to add your own points.

are performance/code-maintainability concerns surrounding SELECT * on MS SQL still relevant today, with modern ORMs?

This question is out some time now, and noone seems to be able to find, what Ben is looking for...

I think this is, because the answer is "it depends".

There just NOT IS THE ONE answer to this.

Examples

  • As i pointed out before, if a database is not yours, and it may be altered often, you cannot guarantee performance, because with select * the amount of data per row may explode
  • If you write an application using ITS OWN database, noone alters your DB (hopefully) and you need your columns, so whats wrong with select *
  • If you build some kind of lazy loading with "main properties" beeing loaded instantly and others beeing loaded later (of same entity), you cannot go with select * because you get all
  • If you use select * other developers will every time think about "did he think about select *" as they will try to optimize. So you should add enough comments...
  • If you build 3-Tier-Application building large caches in the middle-Tier and performance is a theme beeing done by cache, you may use select *
  • Expanding 3Tier: If you have many many concurrent users and/or really big data, you should consider every single byte, because you have to scale up your middle-Tier with every byte beeing wasted (as someone pointed out in the comments before)
  • If you build a small app for 3 users and some thousands of records, the budget may not give time to optimize speed/db-layout/something
  • Speak to your dba... HE will advice you WHICH statement has to be changed/optimized/stripped down/...

I could go on. There just is not ONE answer. It just depends on to many factors.



Related Topics



Leave a reply



Submit