How to Think in SQL

How to think in SQL?

A few examples of what should come to mind first if you're a real SQL geek:

  • A Bible concordance is a FULLTEXT index on the Bible

  • Luca Pacioli's Summa de arithmetica, which describes double-entry bookkeeping, is in fact a normalized database schema

  • When Xerxes I counted his army by walling an area that 10,000 of his men occupied and then marching the other men through this enclosure, he used the HASH AGGREGATE method.

  • The House That Jack Built should be rewritten using a self-join.

  • The Twelve Days of Christmas should be rewritten using a self-join and a ROWNUM

  • There Was An Old Woman Who Swallowed a Fly should be rewritten using CTEs

  • If the European Union were called European Union All, we would see 27 spellings for the word euro on a Euro banknote, instead of 2.

And finally you can read a lame article in my blog on how I stopped worrying and learned to love SQL (I almost forgot I wrote it):

  • Click

And one more article just on the subject:

  • Double-thinking in SQL

How do you think while formulating SQL queries? Is it an experience or a concept?

I have a fairly methodical approach to constructing queries in general, and it is something I use elsewhere with any problem solving I need to do.

The first step is ALWAYS listing out any bits of information I have in a request. Information is essentially anything that tells me something about something.

A table contains a single column having duplicate records. I need to remove the duplicates.

  1. I have a table (I'll call it table1)
  2. I have a column on table table1 (I'll call it col1)
  3. I have duplicates in col1 on table table1
  4. I need to remove duplicates.

The next step of my query construction is identifying the action I'll take from the information I have.
I'll look for certain keywords (e.g. remove, create, edit, show, etc...) along with the standard insert, update, delete to determine the action.
In the example this would be DELETE because of remove.

The next step is isolation.

Answer the question "the action determined above should only be valid for ______?" This part is almost always the most difficult part of constructing any query because it's usually abstract.
In the above example you're listing "duplicate records" as a piece of information, but that's really an abstract concept of something (anything where a specific value is not unique in usage).
Isolation is also where I test my action using a SELECT statement.
Every new query I run gets thrown through a select first!

The next step is execution, or essentially the "how do I get this done" part of a request.

A lot of times you'll figure the how out during the isolation step, but in some instances (yours included) how you isolate something and how you fix it are not the same thing.
Showing duplicated values is different than removing a specific duplicate.
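
To make that concrete, here is a minimal sketch of the isolation and execution steps, assuming SQL Server syntax and the hypothetical names table1/col1 from the list above:

-- Isolation: run it through a SELECT first and look at the duplicates
SELECT  col1, COUNT(*) AS cnt
FROM    table1
GROUP BY col1
HAVING  COUNT(*) > 1

-- Execution: number the copies and delete everything beyond the first
-- (an updatable CTE; it references only table1, so the DELETE removes its rows)
WITH numbered AS
(
    SELECT  col1,
            ROW_NUMBER() OVER (PARTITION BY col1 ORDER BY col1) AS rn
    FROM    table1
)
DELETE FROM numbered
WHERE   rn > 1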

The last step is implementation. This is just where I take everything and make the query...

Summing it all up: for me to construct a query, I'll pick out all the information I have in the request. Using that information I'll figure out what I need to do (the action) and what I need to do it on (isolation). Once I know what I need to do and what I need to do it on, I figure out the execution.

Every single time I'm starting a new "query" I'll run it through these general steps to get an idea for what I'm going to do at an abstract level.
For specific implementations of an actual request you'll have to have some knowledge (or access to google) to go further than this.

Kris

BigQuery - how to think about query optimisation when coming from a SQL Server background

It is absolutely a paradigm shift in how you think. You're right: you have hardly any control over execution. And you'll eventually come to appreciate that. You do have control over architecture, and that's where a lot of your wins will be. (As others mentioned in comments, the documentation is definitely helpful too.)

I've personally found that premature optimization is one of the biggest issues in BigQuery: often the things you do while trying to make a query faster actually have a negative impact, because things like table scans are already well optimized, and there are internals you can inadvertently affect (like restructuring a query in a way that seems more optimal but forces additional shuffles to disk for parallelization).

Some of the biggest areas where our team HAS seen greatly improved performance are as follows:

  1. Use semi-normalized (nested/repeated) schema when possible. By using nested STRUCT/ARRAY types in your schema, you ensure that the data is colocated with the parent record. You can basically think of these as tables within tables. The use of CROSS JOIN UNNEST() takes a little getting used to, but eliminating those joins makes a big difference (especially on large results); see the sketch after this list.

  2. Use partitioning/clustering on large datasets when possible. I know you mention this; just make sure that you're pruning what you can using _PARTITIONTIME when possible, and also using clustering keys that make sense for your data. Keep in mind that clustering basically sorts the storage order of the data, meaning that the optimizer knows it doesn't have to continue scanning if the criteria have been satisfied (so it doesn't help as much on low-cardinality values).

  3. Use analytic window functions when possible. They're very well optimized, and you'll find that BigQuery's implementation is very mature. Often you can eliminate grouping this way, or filter out more of your data earlier in the process. Keep in mind that sometimes filtering data in derived tables or Common Table Expressions (CTEs/named WITH queries) earlier in the process can make a more deeply nested query perform better than trying to do everything in one flat layer.

  4. Keep in mind that results for Views and Common Table Expressions (CTEs/named WITH queries) aren't materialized during execution. If you use the CTE multiple times, it will be executed multiple times. If you join the same View multiple times, it will be executed multiple times. This was hard for members of our team who came from the world of materialized views (although it looks like something's in the works for that in the BQ world, since there's an unused materializedView property showing in the API).

  5. Know how the query cache works. Unlike some platforms, the cache only stores the output of the outermost query, not its component parts. Because of this, only an identical query against unmodified tables/views will use the cache—and it will typically only persist for 24 hours. Note that if you use non-deterministic functions like NOW() and a host of other things, the results are non-cacheable. See details under the Limitations and Exceptions sections of the docs.

  6. Materialize your own copies of expensive tables. We do this a lot, and use scheduled queries and scripts (API and CLI) to normalize and save a native table copy of our data. This allows very efficient processing and fast responses from our client dashboards as well as our own reporting queries. It's a pain, but it works well.
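
As a rough illustration of points 1 and 2 (a sketch only: the project, dataset, table, and the nested items column are hypothetical, and the table is assumed to be ingestion-time partitioned):

SELECT  o.order_id,
        item.sku,
        item.quantity
FROM    `my_project.my_dataset.orders` AS o
CROSS JOIN
        UNNEST(o.items) AS item                        -- child rows are colocated with the parent record, no join to a second table
WHERE   o._PARTITIONTIME >= TIMESTAMP('2021-01-01')    -- prunes partitions before the scan starts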

Hopefully that will give you some ideas, but also feel free to post queries on SO in the future that you're having a hard time optimizing. Folks around here are pretty helpful when you let them know what your data looks like and what you've already tried.

Good luck!

How to understand a database that is already developed?

I'm actually hoping you don't find much for answers here as I've made my career based on coming in on these large undocumented data models and trying to figure them out. But for what it's worth:

  • I don't like automated/electronic data modellers, though this might be personal opinion. My preference is to find a whiteboard (or paper) and draw out the data model by hand. If you are a kinaesthetic learner (you learn by hands-on participation), I've found this to be the best way of familiarizing yourself with a new database... as nice as an automated tool is for reading the database, you won't learn as much as you will by drawing it by hand.

  • There is a limited number of data modelling techniques; however, they can be combined in many ways. My guess is that with a larger database like this one, multiple programmers created it, which means you'll likely see multiple techniques used in the same database. In the past I found a system that had its circuit information stored in a single table that joined onto itself repeatedly to hold the information for a data circuit, while the customer information section was a very straightforward star design... two very separate programming styles, likely two separate developers. I later ventured into the phone circuit section of the app, which I recognized immediately as the same style (likely the same programmer) as the data circuit section. Usually developers will be assigned to a logical division that correlates to a section of your business... watch for patterns in similar sections.

  • The physical database structure is only one piece to understand... upstream (prior to the database) is how the data is generated and loaded into your database (data warehouse?). Understanding what your data is and how it is created is the first step in knowing what you are looking for in the database after it's loaded.

  • On the opposite side, after the data is in the database... understanding how the data is consumed (used by your users) will help you understand what they have been getting out of it and what they need from it. Extra points if you can get your hands on the scripts used to generate existing reports, as their FROM clauses will show you how existing tables are used.

  • Never forget to interview your users... especially if you can locate one who was around for the initial deployment of the system. If it was designed in-house, odds are these people provided some of the initial requirements for the system, and talking to them will give you an idea of what the designers first heard when they went requirement gathering. The logical division of your company (customer care vs operations vs billing vs etc...) is usually the same division your data model will follow.

  • And lastly...Play! If a dev or QA environment is available, start writing queries and see what comes back...alter your statement and try again.

I think the biggest folly you will want to avoid is focusing solely on how the tables are arranged. Understand the data that's in them, how it is generated, and how it is consumed. Understand your company, how it's arranged and how it functions. The manner in which the data is stored (the data modelling) is secondary to this understanding.

Is it possible in SQL to locate where the actual data sits?

There are quite a few ways you could follow; here is a quick and easy one I can suggest.

Run SQL Profiler, perform the UI operations, and see which SQL queries or stored procedures are used; stop the Profiler, pick out those queries, and you can make sense of it.

This is better than building a giant query over sys.tables or sys.columns and running a heavy operation on non-indexed columns.
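
For comparison, the metadata route might start with something like the sketch below (SQL Server system views; the name pattern is just an example). Note that this only finds candidate columns by name; actually scanning every candidate table for a value is the heavy part:

SELECT  s.name AS schema_name, t.name AS table_name, c.name AS column_name
FROM    sys.columns c
JOIN    sys.tables t
ON      t.object_id = c.object_id
JOIN    sys.schemas s
ON      s.schema_id = t.schema_id
WHERE   c.name LIKE '%status%'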

And remember, sometimes the UI shows transformed data: in the database the status will be stored as "R", but in the UI it will be transformed to "Ready"; or the value may be derived on the fly altogether, based on logical operations over other columns.
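
A hypothetical illustration of that kind of transformation (the table and the status codes are made up):

SELECT  CASE order_status
            WHEN 'R' THEN 'Ready'
            WHEN 'C' THEN 'Cancelled'
            ELSE 'Unknown'
        END AS status_label
FROM    orders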

So, give SQL Profiler a try.

If you don't have access to SQL Profiler, then you will have to explore other heavy operational approaches.

General rules for simplifying SQL statements

To state it differently: given a (complex) query with JOINs, SUBSELECTs, and UNIONs, is it possible (or not) to reduce it to a simpler, equivalent SQL statement that produces the same result by applying some transformation rules?


This answer was written in 2009. Some of the query optimization tricks described here are obsolete by now, others can be made more efficient, yet others still apply. The statements about feature support by different database systems apply to versions that existed at the time of this writing.


That's exactly what optimizers do for a living (not that I'm saying they always do this well).

Since SQL is a set-based language, there is usually more than one way to transform one query into another.

Like this query:

SELECT  *
FROM mytable
WHERE col1 > @value1 OR col2 < @value2

can be transformed into this one (provided that mytable has a primary key):

SELECT  *
FROM mytable
WHERE col1 > @value1
UNION
SELECT *
FROM mytable
WHERE col2 < @value2

or this one:

SELECT  mo.*
FROM    (
        SELECT  id
        FROM    mytable
        WHERE   col1 > @value1
        UNION
        SELECT  id
        FROM    mytable
        WHERE   col2 < @value2
        ) mi
JOIN    mytable mo
ON      mo.id = mi.id

, which looks uglier but can yield better execution plans.

One of the most common things to do is replacing this query:

SELECT  *
FROM mytable
WHERE col IN
(
SELECT othercol
FROM othertable
)

with this one:

SELECT  *
FROM mytable mo
WHERE EXISTS
(
SELECT NULL
FROM othertable o
WHERE o.othercol = mo.col
)

In some RDBMSs (like PostgreSQL 8.4), DISTINCT and GROUP BY use different execution plans, so sometimes it's better to replace the one with the other:

SELECT  mo.grouper,
        (
        SELECT  SUM(col)
        FROM    mytable mi
        WHERE   mi.grouper = mo.grouper
        )
FROM    (
        SELECT  DISTINCT grouper
        FROM    mytable
        ) mo

vs.

SELECT  mo.grouper, SUM(col)
FROM    mytable mo
GROUP BY
        mo.grouper

In PostgreSQL, DISTINCT sorts and GROUP BY hashes.

MySQL 5.6 lacks FULL OUTER JOIN, so it can be rewritten as follows:

SELECT  t1.col1, t2.col2
FROM table1 t1
FULL OUTER JOIN
table2 t2
ON t1.id = t2.id

vs.

SELECT  t1.col1, t2.col2
FROM table1 t1
LEFT JOIN
table2 t2
ON t1.id = t2.id
UNION ALL
SELECT NULL, t2.col2
FROM table1 t1
RIGHT JOIN
table2 t2
ON t1.id = t2.id
WHERE t1.id IS NULL

, but see this article in my blog on how to do this more efficiently in MySQL:

  • Emulating FULL OUTER JOIN in MySQL

This hierarchical query in Oracle 11g:

SELECT  DISTINCT(animal_id) AS animal_id
FROM animal
START WITH
animal_id = :id
CONNECT BY
PRIOR animal_id IN (father, mother)
ORDER BY
animal_id

can be transformed to this:

SELECT  DISTINCT(animal_id) AS animal_id
FROM    (
        SELECT  0 AS gender, animal_id, father AS parent
        FROM    animal
        UNION ALL
        SELECT  1, animal_id, mother
        FROM    animal
        )
START WITH
        animal_id = :id
CONNECT BY
        parent = PRIOR animal_id
ORDER BY
        animal_id

, the latter one being more efficient.

See this article in my blog for the execution plan details:

  • Genealogy query on both parents

To find all ranges that overlap the given range, you can use the following query:

SELECT  *
FROM ranges
WHERE end_date >= @start
AND start_date <= @end

, but in SQL Server this more complex query yields the same results faster:

SELECT  *
FROM ranges
WHERE (start_date > @start AND start_date <= @end)
OR (@start BETWEEN start_date AND end_date)

, and believe it or not, I have an article in my blog on this too:

  • Overlapping ranges: SQL Server

SQL Server 2008 also lacks an efficient way to do cumulative aggregates, so this query:

SELECT  mi.id, SUM(mo.value) AS running_sum
FROM mytable mi
JOIN mytable mo
ON mo.id <= mi.id
GROUP BY
mi.id

can be more efficiently rewritten using, Lord help me, cursors (you heard me right: "cursors", "more efficiently" and "SQL Server" in one sentence).

See this article in my blog on how to do it:

  • Flattening timespans: SQL Server
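
(As the disclaimer above notes, this particular trick is obsolete: on SQL Server 2012 and later, a windowed aggregate computes the running sum directly. A sketch, using the same mytable columns as above:)

SELECT  id,
        SUM(value) OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum
FROM    mytable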

There is a certain kind of query, commonly met in financial applications, that pulls the effective exchange rate for a currency, like this one in Oracle 11g:

SELECT  TO_CHAR(SUM(xac_amount * rte_rate), 'FM999G999G999G999G999G999D999999')
FROM t_transaction x
JOIN t_rate r
ON (rte_currency, rte_date) IN
(
SELECT xac_currency, MAX(rte_date)
FROM t_rate
WHERE rte_currency = xac_currency
AND rte_date <= xac_date
)

This query can be heavily rewritten to use an equality condition which allows a HASH JOIN instead of NESTED LOOPS:

WITH    v_rate AS
        (
        SELECT  cur_id AS eff_currency, dte_date AS eff_date, rte_rate AS eff_rate
        FROM    (
                SELECT  cur_id, dte_date,
                        (
                        SELECT  MAX(rte_date)
                        FROM    t_rate ri
                        WHERE   rte_currency = cur_id
                                AND rte_date <= dte_date
                        ) AS rte_effdate
                FROM    (
                        SELECT  (
                                SELECT  MAX(rte_date)
                                FROM    t_rate
                                ) - level + 1 AS dte_date
                        FROM    dual
                        CONNECT BY
                                level <=
                                (
                                SELECT  MAX(rte_date) - MIN(rte_date)
                                FROM    t_rate
                                )
                        ) v_date,
                        (
                        SELECT  1 AS cur_id
                        FROM    dual
                        UNION ALL
                        SELECT  2 AS cur_id
                        FROM    dual
                        ) v_currency
                ) v_eff
        LEFT JOIN
                t_rate
        ON      rte_currency = cur_id
                AND rte_date = rte_effdate
        )
SELECT  TO_CHAR(SUM(xac_amount * eff_rate), 'FM999G999G999G999G999G999D999999')
FROM    (
        SELECT  xac_currency, TRUNC(xac_date) AS xac_date, SUM(xac_amount) AS xac_amount, COUNT(*) AS cnt
        FROM    t_transaction x
        GROUP BY
                xac_currency, TRUNC(xac_date)
        )
JOIN    v_rate
ON      eff_currency = xac_currency
        AND eff_date = xac_date

Despite being bulky as hell, the latter query is six times as fast.

The main idea here is replacing <= with =, which requires building an in-memory calendar table to join with.

  • Converting currencies

