What Is the Big-O for SQL Select

What is the Big-O for SQL select?

Since you don't control which algorithm is selected, there is no way to know directly. However, without indexes a SELECT should be O(n): a table scan has to inspect every record, which means it scales with the size of the table.

With an index, a SELECT is probably O(log(n)), although whether that holds for any real table depends on the indexing algorithm and the properties of the data itself. To know the behaviour for a particular table or query, you have to resort to profiling real-world data.
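For example, a minimal profiling sketch, assuming MySQL-style EXPLAIN and a hypothetical users table (plan details vary by engine and by the data):

-- Hypothetical table; the EXPLAIN column mentioned below is what MySQL typically reports.
CREATE TABLE users (
    id    INT PRIMARY KEY,
    email VARCHAR(255),
    name  VARCHAR(255)
);

EXPLAIN SELECT name FROM users WHERE email = 'a@example.com';
-- Without an index on email the plan shows type = ALL: a full table scan, roughly O(n).

CREATE INDEX ix_users_email ON users (email);

EXPLAIN SELECT name FROM users WHERE email = 'a@example.com';
-- With the index the plan shows type = ref: an index lookup, roughly O(log n).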

INSERT without indexes should be very quick (close to O(1)), while UPDATE needs to find the records first and so will be slightly slower than the SELECT that gets you there.

INSERT with indexes will probably again be in the ballpark of O(log(n)^2) when the index tree needs to be rebalanced, closer to O(log(n)) otherwise. The same slowdown will occur with an UPDATE if it affects indexed rows, on top of the SELECT costs.

Edit: O(log(n^2)) = O(2 log(n)) = O(log(n)). Did you mean O(log(n)^2)?

All bets are off once you are talking about JOIN in the mix: you will have to profile and use your databases query estimation tools to get a read on it. Also note that if this query is performance critical you should reprofile from time to time as the algorithms used by your query optimizer will change as the data load changes.
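For instance, assuming PostgreSQL and hypothetical orders/customers tables, EXPLAIN ANALYZE runs the query and reports the plan the optimizer actually chose along with real timings:

-- Hypothetical tables; EXPLAIN ANALYZE executes the query and prints actual row counts and times.
EXPLAIN ANALYZE
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.placed_at >= '2024-01-01';
-- The join node in the plan (hash join, merge join, nested loop) can change as the data
-- distribution changes, which is why re-profiling over time is worthwhile.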

Another thing to keep in mind: big-O doesn't tell you about the fixed costs of each transaction. For smaller tables these are probably higher than the actual work costs. As an example, the setup, teardown, and communication costs of a cross-network query for a single row will surely be more than the lookup of an indexed record in a small table.

Because of this I found that being able to bundle a group of related queries in one batch can have vastly more impact on performance than any optimization I did to the database proper.
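A hypothetical illustration of that batching, reusing the users table from the sketch above:

-- Three separate round trips over the network...
SELECT name FROM users WHERE id = 1;
SELECT name FROM users WHERE id = 2;
SELECT name FROM users WHERE id = 3;

-- ...versus one round trip that does the same work and pays the fixed costs only once.
SELECT id, name FROM users WHERE id IN (1, 2, 3);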

Time complexity of SELECT in SQL

The primary key has an index on it, which is typically a B-tree. The time complexity would be O(log(n)), where "n" is the size of the table. There is an additional fetch for the data from the page. In practice, the data fetch could be much more expensive than the index lookup.

But, performance in databases is much more complicated than this. You have to deal with multiple levels of memory hierarchy, different implementations of algorithms, and issues related to grid computing.

Is there any general rule on SQL query complexity Vs performance?

This depends on the query plan used.

Even without indexes, modern servers can use HASH JOIN and MERGE JOIN, which are faster than O(N * M).

More specifically, the complexity of a HASH JOIN is O(N + M), where N is the size of the hashed table and M is the size of the lookup table. Hashing and hash lookups have constant complexity.

The complexity of a MERGE JOIN is O(N*log(N) + M*log(M)): the sum of the times to sort both tables plus the time to scan them.

SELECT  T1.name, T2.date
FROM T1, T2
WHERE T1.id=T2.id
AND T1.color='red'
AND T2.type='CAR'

If there are no indexes defined, the engine will select either a HASH JOIN or a MERGE JOIN.

The HASH JOIN works as follows:

  1. The hashed table is chosen (usually it's the table with fewer records). Say it's t1.

  2. All records from t1 are scanned. If a record holds color='red', it goes into the hash table with id as the key and name as the value.

  3. All records from t2 are scanned. If a record holds type='CAR', its id is searched in the hash table, and the values of name from all hash hits are returned along with the current value of date.
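A rough SQL-level sketch of that build/probe order (the engine uses an in-memory hash table rather than a real temporary table; table and column names follow the example query above):

-- Steps 1-2: build side - collect the filtered rows of the smaller table, keyed by id.
CREATE TEMPORARY TABLE t1_build AS
SELECT id, name FROM t1 WHERE color = 'red';

-- Step 3: probe side - for each qualifying t2 row, look up its id among the build rows.
SELECT b.name, t2.date
FROM t2
JOIN t1_build b ON b.id = t2.id
WHERE t2.type = 'CAR';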

The MERGE JOIN works as follows:

  1. A copy of t1 (id, name) is created, sorted on id

  2. A copy of t2 (id, date) is created, sorted on id

  3. The pointers are set to the minimal values in both tables:

    >1  2<
    2 3
    2 4
    3 5
  4. The pointers are compared in a loop, and if they match, the records are returned. If they don't match, the pointer with the minimal value is advanced:

    >1  2<  - no match, left pointer is less. Advance left pointer
    2 3
    2 4
    3 5

    1 2< - match, return records and advance both pointers
    >2 3
    2 4
    3 5

    1 2 - no match, left pointer is less. Advance left pointer
    2 3<
    >2 4
    3 5

    1 2 - match, return records and advance both pointers
    2 3<
    2 4
    >3 5

    1 2 - the left pointer is out of range, the query is over.
    2 3
    2 4<
    3 5
    >
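To see which of the two the engine actually picked for the example query, ask it for the plan (a sketch assuming MySQL- or PostgreSQL-style EXPLAIN; output format differs per engine):

EXPLAIN
SELECT T1.name, T2.date
FROM T1, T2
WHERE T1.id = T2.id
  AND T1.color = 'red'
  AND T2.type = 'CAR';
-- The plan will contain a hash join or merge join operator (or a nested loop),
-- together with the optimizer's row estimates.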

In such a case, making the query more complex could make it faster because fewer rows are subjected to the join-level tests?

Sure.

Your query without the WHERE clause:

SELECT  T1.name, T2.date
FROM T1, T2

is simpler but returns more results and runs longer.

Database indexes and their Big-O notation

Most relational databases structure indices as B-trees.

If a table has a clustering index, the data pages are stored as the leaf nodes of the B-tree. Essentially, the clustering index becomes the table.

For tables without a clustering index, the data pages of the table are stored in a heap. Any non-clustered indices are B-trees whose leaf nodes identify a particular page in the heap.
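A minimal sketch, assuming SQL Server syntax and hypothetical names:

CREATE TABLE orders (
    id          INT,
    customer_id INT,
    placed_at   DATETIME
);

-- Clustered index: the table's data pages become the leaf level of this B-tree.
CREATE CLUSTERED INDEX ix_orders_id ON orders (id);

-- Non-clustered index: a separate B-tree whose leaf entries point back at the data rows.
CREATE NONCLUSTERED INDEX ix_orders_customer ON orders (customer_id);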

The worst-case height of a B-tree is O(log n), and since a search is dependent on height, B-tree lookups run (on average) in something like

O(log_t n)

where t is the minimization factor: each node must have at least t-1 keys and at most 2t-1 keys (i.e., 2t children).

That's the way I understand it.

And different database systems, of course, may well use different data structures under the hood.

And if the query does not use an index, of course, then the search is an iteration over the heap or B-tree containing the data pages.

Searches are a little cheaper if the index used can satisfy the query; otherwise, a lookaside to fetch the corresponding data page into memory is required.
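For example, a composite index that covers the query (a hypothetical sketch; the users table and column names are assumed):

-- Both the filter column and the selected column live in the index,
-- so the query below can be answered from the index alone, with no extra data-page lookup.
CREATE INDEX ix_users_email_name ON users (email, name);

SELECT name FROM users WHERE email = 'a@example.com';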

What is the Big-O complexity of a MySQL SELECT statement that uses BETWEEN clause?

type = ALL means that this query performs a full scan on the table. key = NULL means that no index is used. In this case it is O(n), where n is the number of rows.

As for BETWEEN, it is the same as performing two comparison operations (>= and <=). If those are executed on indexes (which are B-trees in MySQL), it is O(log n) in both the average and worst cases. Because a B-tree stores values in sorted order, range searches like this are very fast.

As far as I know, for secondary indexes InnoDB stores the primary key together with the secondary index values, so if you do SELECT id FROM MyTable WHERE lat ... AND lon ... (selecting only id), it won't even need to look at the actual rows.

Find out more here:

  • http://dev.mysql.com/doc/refman/5.1/en/explain-output.html
  • http://dev.mysql.com/doc/refman/5.1/en/mysql-indexes.html

For your case, I recommend adding indexes on the lat and lon fields (separately) and experimenting to see which works best for your data. It may even be worth adding extra fields containing rounded lat and lon values (as SMALLINTs) to speed up the index; in that case you can add a multicolumn index on those fields.
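A hypothetical sketch of that setup (index names and the BETWEEN bounds are made up; MyTable follows the question):

-- Separate single-column indexes; experiment to see which one the optimizer prefers.
CREATE INDEX ix_mytable_lat ON MyTable (lat);
CREATE INDEX ix_mytable_lon ON MyTable (lon);

EXPLAIN SELECT id
FROM MyTable
WHERE lat BETWEEN 50.0 AND 51.0
  AND lon BETWEEN 10.0 AND 11.0;
-- A range search on an indexed column is a B-tree range scan, roughly O(log n) to find the start.
-- In InnoDB the secondary index already stores the primary key, so selecting only id
-- can be answered from the index without touching the rows.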

What is the big-O of Oracle's MAX function?

If you have a B-tree index on the column, then finding the maximum value is O(log(n)), because the answer will be the last (or first) row of the index. Values are stored in the leaf (deepest) nodes of the B-tree, which has height O(log(n)).

Without an index it is O(n) because all rows must be read to determine the maximum value.
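A small hypothetical example, assuming Oracle syntax and made-up names:

CREATE TABLE payments (id NUMBER PRIMARY KEY, amount NUMBER);

-- Without an index on amount: every row must be read, O(n).
SELECT MAX(amount) FROM payments;

-- With a B-tree index on amount, the maximum is at one end of the index, O(log(n)).
CREATE INDEX ix_payments_amount ON payments (amount);
SELECT MAX(amount) FROM payments;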


Note: Big-O notation ignores constants, but in the real world these constants cannot be ignored. The difference between reading from disk and reading from memory is several orders of magnitude. Accessing the first value of an index is likely to be performed mostly in RAM, whereas a full table scan of a huge table will need to read mostly from disk.

What is the complexity of an SQL query that randomly selects a subset of rows from a database?

SQL does not provide complexity guarantees. The best we can do is talk about the lower bound of what's theoretically possible, and keep in mind that other factors may dominate.

the complexity of (SELECT max(id) FROM MY_TABLE) is O(N).

or O(log N), depending on your index, and whether or not it's used. Or possibly O(1), if max(id) is treated specially.

The complexity of distinct is likewise opaque. It implies a sort, which we can take to be O(n log n). But it's only O(N) if the data are already sorted, and cheaper still if they're known not to contain duplicates.

Looking at your query, I would approach your question this way:

  • a binary search along an index on id, if extant
  • a binary search along an index (putative) for output tf_idf
  • N times, where N is a function of the cardinality of id and tf_idf

For example, suppose there is only 1 id and L is 2. If the cardinality of id to tf_idf is 1:1 -- with or without an index on id -- the system will have to read all the rows in MY_TABLE. If every id is unique, but they all map to the same tf_idf, an index would probably only add to the cost versus a linear scan. If the cardinality is 1:1 and id is unique, then N ~ L: as the number of distinct pairs grows, the probability of randomly selecting a duplicate declines.

What is the Big-O for WHERE NOT IN statement?

If n is the size of some_table and m is the maximum size of the sub-result, then the naive algorithm of checking each element in n against each element in m is O(mn).

In reality, as jpmc26 mentioned, the underlying implementation would decide this. If, for example, the id in m is indexed, it could be accessed in O(lg m) time, so n could be checked against m in O(n lg m) time. Since you have to at least check each element of n, any implementation would be lower-bounded at Ω(n).
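A hypothetical sketch of the indexed case (some_table comes from the question; other_table and the index name are assumed):

-- With an index on other_table.id, each probe of the NOT IN check is roughly O(lg m),
-- giving O(n lg m) overall instead of the naive O(n * m).
CREATE INDEX ix_other_id ON other_table (id);

SELECT s.*
FROM some_table s
WHERE s.id NOT IN (SELECT o.id FROM other_table o);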


