How to force evaluation of subquery before joining / pushing down to foreign server
Foreign data wrapper
Typically, joins or any derived tables from subqueries or CTEs are not available on the foreign server and have to be executed locally. I.e., all rows remaining after the simple WHERE
clause in your example have to be retrieved and processed locally like you observed.
If all else fails you can execute the subquery SELECT id FROM lookup_table WHERE x = 5
and concatenate results into the query string.
More conveniently, you can automate this with dynamic SQL and EXECUTE
in a PL/pgSQL function. Like:
CREATE OR REPLACE FUNCTION my_func(_c1 int, _l_id int)
RETURNS TABLE(id int, c1 int, c2 int, c3 int) AS
$func$
BEGIN
RETURN QUERY EXECUTE
'SELECT id,c1,c2,c3 FROM big_table
WHERE c1 = $1
AND id = ANY ($2)'
USING _c1
, ARRAY(SELECT l.id FROM lookup_table l WHERE l.x = _l_id);
END
$func$ LANGUAGE plpgsql;
Related:
- Table name as a PostgreSQL function parameter
Or try this search on SO.
Or you might use the meta-command \gexec
in psql. See:
- Filter column names from existing table for SQL DDL statement
Or this might work: (Feedback says does not work.)
SELECT id,c1,c2,c3
FROM big_table
WHERE c1 = 2
AND id = ANY (ARRAY(SELECT id FROM lookup_table WHERE x = 5));
Testing locally, I get a query plan like this:
Index Scan using big_table_idx on big_table (cost= ...)
Index Cond: (id = ANY ($0))
Filter: (c1 = 2)
InitPlan 1 (returns $0)
-> Seq Scan on lookup_table (cost= ...)
Filter: (x = 5)
Bold emphasis mine.
The parameter $0
in the plan inspires hope. The generated array might be something Postgres can pass on to be used remotely. I don't see a similar plan with any of your other attempts or some more I tried myself. Can you test with your fdw?
Related question concerning postgres_fdw
:
- postgres_fdw: possible to push data to foreign server for join?
General technique in SQL
That's a different story. Just use a CTE. But I don't expect that to help with the FDW.
WITH cte AS (SELECT id FROM lookup_table WHERE x = 5)
SELECT id,c1,c2,c3
FROM big_table b
JOIN cte USING (id)
WHERE b.c1 = 2;
PostgreSQL 12 changed (improved) behavior, so that CTEs can be inlined like subqueries, given some preconditions. But, quoting the manual:
You can override that decision by specifying
MATERIALIZED
to force separate calculation of the WITH query
So:
WITH cte AS MATERIALIZED (SELECT id FROM lookup_table WHERE x = 5)
...
Typically, none of this should be necessary if your DB server is configured properly and column statistics are up to date. But there are corner cases with uneven data distribution ...
How can I force a subquery to perform as well as a #temp table?
There are a few possible explanations as to why you see this behavior. Some common ones are
- The subquery or CTE may be being repeatedly re-evaluated.
- Materialising partial results into a
#temp
table may force a more optimum join order for that part of the plan by removing some possible options from the equation. - Materialising partial results into a
#temp
table may improve the rest of the plan by correcting poor cardinality estimates.
The most reliable method is simply to use a #temp
table and materialize it yourself.
Failing that regarding point 1 see Provide a hint to force intermediate materialization of CTEs or derived tables. The use of TOP(large_number) ... ORDER BY
can often encourage the result to be spooled rather than repeatedly re evaluated.
Even if that works however there are no statistics on the spool.
For points 2 and 3 you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics might get a better plan. Failing that you could try using query hints to get the desired plan.
Get apps with the highest review count since a dynamic series of days
I think this is what you are looking for:
Postgres 13 or newer
WITH cte AS ( -- MATERIALIZED
SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
FROM reviews
GROUP BY 1
)
SELECT *
FROM (
SELECT generate_series(min(review_date)
, max(review_date)
, '1 day')::date
FROM reviews
) d(review_window_start)
LEFT JOIN LATERAL (
SELECT total_ct, array_agg(app_id) AS apps
FROM (
SELECT app_id, total_ct
FROM cte c
WHERE c.earliest_review >= d.review_window_start
ORDER BY total_ct DESC
FETCH FIRST 1 ROWS WITH TIES -- new & hot
) sub
GROUP BY 1
) a ON true;
WITH TIES
makes it a bit cheaper. Added in Postgres 13 (currently beta). See:
- Get top row(s) with highest value, with ties
Postgres 12 or older
WITH cte AS ( -- MATERIALIZED
SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
FROM reviews
GROUP BY 1
)
SELECT *
FROM (
SELECT generate_series(min(review_date)
, max(review_date)
, '1 day')::date
FROM reviews
) d(review_window_start)
LEFT JOIN LATERAL (
SELECT total_ct, array_agg(app_id) AS apps
FROM (
SELECT total_ct, app_id
, rank() OVER (ORDER BY total_ct DESC) AS rnk
FROM cte c
WHERE c.earliest_review >= d.review_window_start
) sub
WHERE rnk = 1
GROUP BY 1
) a ON true;
db<>fiddle here
Same as above, but without WITH TIES
.
We don't need to involve the table apps
at all. The table reviews
has all information we need.
The CTE cte
computes earliest review & current total count per app. The CTE avoids repeated computation. Should help quite a bit.
It is always materialized before Postgres 12, and should be materialized automatically in Postgres 12 since it is used many times in the main query. Else you could add the keyword MATERIALIZED
in Postgres 12 or later to force it. See:
- How to force evaluation of subquery before joining / pushing down to foreign server
The optimized generate_series()
call produces the series of days from earliest to latest review. See:
- Generating time series between two dates in PostgreSQL
- Join a count query on generate_series() and retrieve Null values as '0'
Finally, the LEFT JOIN LATERAL
you already discovered. But since multiple apps can tie for the most reviews, retrieve all winners, which can be 0 - n apps. The query aggregates all daily winners into an array, so we get a single result row per review_window_start
. Alternatively, define tiebreaker(s) to get at most one winner. See:
- What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
WHERE clause better execute before IN and JOIN or after
It doesn't matter
Logical processing order is always honoured: regardless of actual processing order
INNER JOINs and WHERE conditions are effectively associative and commutative (hence the ANSI-89 "join in the where" syntax) so actual order doesn't matter
Logical order becomes important with outer joins and more complex queries: applying WHERE on an OUTER table changes the logic completely.
Again, it doesn't matter how the optimiser does it internally so long as the query semantics are maintained by following logical processing order.
And the key word here is "optimiser": it does exactly what it says
Function to return dynamic set of columns for given table
Solution for the simple case
As explained in the referenced answers below, you can use registered (row) types, and thus implicitly declare the return type of a polymorphic function:
CREATE OR REPLACE FUNCTION public.get_table(_tbl_type anyelement)
RETURNS SETOF anyelement AS
$func$
BEGIN
RETURN QUERY EXECUTE format('TABLE %s', pg_typeof(_tbl_type));
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM public.get_table(NULL::public.users); -- note the syntax!
Returns the complete table (with all user columns).
Wait! How?
Detailed explanation in this related answer, chapter
"Various complete table types":
- Refactor a PL/pgSQL function to return the output of various SELECT queries
TABLE foo
is just short for SELECT * FROM foo
:
- Is there a shortcut for SELECT * FROM?
2 steps for completely dynamic return type
But what you are trying to do is strictly impossible in a single SQL command.
I want to pass
schema_name
andtable_name
as parameters to function and get record list, according tocolumn_visible
field inpublic.fields
table.
There is no direct way to return an arbitrary selection of columns (return type not known at call time) from a function - or any SQL command. SQL demands to know number, names and types of resulting columns at call time. More in the 2nd chapter of this related answer:
- How do I generate a pivoted CROSS JOIN where the resulting table definition is unknown?
There are various workarounds. You could wrap the result in one of the standard document types (json
, jsonb
, hstore
, xml
).
Or you generate the query with one function call and execute the result with the next:
CREATE OR REPLACE FUNCTION public.generate_get_table(_schema_name text, _table_name text)
RETURNS text AS
$func$
SELECT format('SELECT %s FROM %I.%I'
, string_agg(quote_ident(column_name), ', ')
, schema_name
, table_name)
FROM fields
WHERE column_visible
AND schema_name = _schema_name
AND table_name = _table_name
GROUP BY schema_name, table_name
ORDER BY schema_name, table_name;
$func$ LANGUAGE sql;
Call:
SELECT public.generate_get_table('public', 'users');
This create a query of the form:
SELECT usr_id, usr FROM public.users;
Execute it in the 2nd step. (You might want to add column numbers and order columns.)
Or append \gexec
in psql to execute the return value immediately. See:
How to force evaluation of subquery before joining / pushing down to foreign server
Be sure to defend against SQL injection:
- INSERT with dynamic table name in trigger function
- Define table and column names as arguments in a plpgsql function?
varchar(100)
does not make much sense for identifiers, which are limited to 63 characters in standard Postgres:
- Maximum characters in labels (table names, columns etc)
If you understand how the object identifier type regclass
works, you might replace schema and table name with a singe regclass
column.
Postgres_FDW not pushing down WHERE criteria
Any function that is not IMMUTABLE
will not be pushed down.
See function is_foreign_expr
in contrib/postgres_fdw/deparse.c
:
/*
* Returns true if given expr is safe to evaluate on the foreign server.
*/
bool
is_foreign_expr(PlannerInfo *root,
RelOptInfo *baserel,
Expr *expr)
{
...
/*
* An expression which includes any mutable functions can't be sent over
* because its result is not stable. For example, sending now() remote
* side could cause confusion from clock offsets. Future versions might
* be able to make this choice with more granularity. (We check this last
* because it requires a lot of expensive catalog lookups.)
*/
if (contain_mutable_functions((Node *) expr))
return false;
/* OK to evaluate on the remote server */
return true;
}
Related Topics
Use Plink to Execute Command (Oracle SQL Query) on Remote Server Over Ssh
I Have a Delete-Insert Cte That Fails in a Strange Manner
Sum of Digits of a Number in SQL Server Without Using Traditional Loops Like While
Tsql Aggregate String for Group By
Merge Duplicate Temporal Records in Database
SQL Select for All Records That May Holds Specific Value
How to Select Records That Don't Exist in SQL Server
Comma-Separated String into Table's Column in SQL Server
Oracle as Keyword and Subqueries
Time Based Priority in Active Record Query
Introducing Foreign Key Constraint May Cause Cycles or Multiple Cascade Paths
SQL Server:Find Duplicates in a Table Based on Values in a Single Column
Oracle (11.2.0.1):How to Identify the Row Which Is Currently Updated by the Update Statement
Spool Command Doesnt Save Result in Query
SQL Server - Give a Login Permission for Read Access to All Existing and Future Databases