How to Force Evaluation of Subquery Before Joining/Pushing Down to Foreign Server

How to force evaluation of subquery before joining / pushing down to foreign server

Foreign data wrapper

Typically, joins or any derived tables from subqueries or CTEs are not available on the foreign server and have to be executed locally. I.e., all rows remaining after the simple WHERE clause in your example have to be retrieved and processed locally like you observed.

If all else fails you can execute the subquery SELECT id FROM lookup_table WHERE x = 5 and concatenate results into the query string.

More conveniently, you can automate this with dynamic SQL and EXECUTE in a PL/pgSQL function. Like:

CREATE OR REPLACE FUNCTION my_func(_c1 int, _l_id int)
RETURNS TABLE(id int, c1 int, c2 int, c3 int) AS
$func$
BEGIN
RETURN QUERY EXECUTE
'SELECT id,c1,c2,c3 FROM big_table
WHERE c1 = $1
AND id = ANY ($2)'
USING _c1
, ARRAY(SELECT l.id FROM lookup_table l WHERE l.x = _l_id);
END
$func$ LANGUAGE plpgsql;

Related:

  • Table name as a PostgreSQL function parameter

Or try this search on SO.

Or you might use the meta-command \gexec in psql. See:

  • Filter column names from existing table for SQL DDL statement

Or this might work: (Feedback says does not work.)

SELECT id,c1,c2,c3
FROM big_table
WHERE c1 = 2
AND id = ANY (ARRAY(SELECT id FROM lookup_table WHERE x = 5));

Testing locally, I get a query plan like this:

Index Scan using big_table_idx on big_table (cost= ...)
Index Cond: (id = ANY ($0))
Filter: (c1 = 2)
InitPlan 1 (returns $0)
-> Seq Scan on lookup_table (cost= ...)
Filter: (x = 5)

Bold emphasis mine.

The parameter $0 in the plan inspires hope. The generated array might be something Postgres can pass on to be used remotely. I don't see a similar plan with any of your other attempts or some more I tried myself. Can you test with your fdw?

Related question concerning postgres_fdw:

  • postgres_fdw: possible to push data to foreign server for join?

General technique in SQL

That's a different story. Just use a CTE. But I don't expect that to help with the FDW.

WITH cte AS (SELECT id FROM lookup_table WHERE x = 5)
SELECT id,c1,c2,c3
FROM big_table b
JOIN cte USING (id)
WHERE b.c1 = 2;

PostgreSQL 12 changed (improved) behavior, so that CTEs can be inlined like subqueries, given some preconditions. But, quoting the manual:

You can override that decision by specifying MATERIALIZED to force separate calculation of the WITH query

So:

WITH cte AS MATERIALIZED (SELECT id FROM lookup_table WHERE x = 5)
...

Typically, none of this should be necessary if your DB server is configured properly and column statistics are up to date. But there are corner cases with uneven data distribution ...

How can I force a subquery to perform as well as a #temp table?

There are a few possible explanations as to why you see this behavior. Some common ones are

  1. The subquery or CTE may be being repeatedly re-evaluated.
  2. Materialising partial results into a #temp table may force a more optimum join order for that part of the plan by removing some possible options from the equation.
  3. Materialising partial results into a #temp table may improve the rest of the plan by correcting poor cardinality estimates.

The most reliable method is simply to use a #temp table and materialize it yourself.

Failing that regarding point 1 see Provide a hint to force intermediate materialization of CTEs or derived tables. The use of TOP(large_number) ... ORDER BY can often encourage the result to be spooled rather than repeatedly re evaluated.

Even if that works however there are no statistics on the spool.

For points 2 and 3 you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics might get a better plan. Failing that you could try using query hints to get the desired plan.

Get apps with the highest review count since a dynamic series of days

I think this is what you are looking for:

Postgres 13 or newer

WITH cte AS (  -- MATERIALIZED
SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
FROM reviews
GROUP BY 1
)
SELECT *
FROM (
SELECT generate_series(min(review_date)
, max(review_date)
, '1 day')::date
FROM reviews
) d(review_window_start)
LEFT JOIN LATERAL (
SELECT total_ct, array_agg(app_id) AS apps
FROM (
SELECT app_id, total_ct
FROM cte c
WHERE c.earliest_review >= d.review_window_start
ORDER BY total_ct DESC
FETCH FIRST 1 ROWS WITH TIES -- new & hot
) sub
GROUP BY 1
) a ON true;

WITH TIES makes it a bit cheaper. Added in Postgres 13 (currently beta). See:

  • Get top row(s) with highest value, with ties

Postgres 12 or older

WITH cte AS (  -- MATERIALIZED
SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
FROM reviews
GROUP BY 1
)
SELECT *
FROM (
SELECT generate_series(min(review_date)
, max(review_date)
, '1 day')::date
FROM reviews
) d(review_window_start)
LEFT JOIN LATERAL (
SELECT total_ct, array_agg(app_id) AS apps
FROM (
SELECT total_ct, app_id
, rank() OVER (ORDER BY total_ct DESC) AS rnk
FROM cte c
WHERE c.earliest_review >= d.review_window_start
) sub
WHERE rnk = 1
GROUP BY 1
) a ON true;

db<>fiddle here

Same as above, but without WITH TIES.

We don't need to involve the table apps at all. The table reviews has all information we need.

The CTE cte computes earliest review & current total count per app. The CTE avoids repeated computation. Should help quite a bit.

It is always materialized before Postgres 12, and should be materialized automatically in Postgres 12 since it is used many times in the main query. Else you could add the keyword MATERIALIZED in Postgres 12 or later to force it. See:

  • How to force evaluation of subquery before joining / pushing down to foreign server

The optimized generate_series() call produces the series of days from earliest to latest review. See:

  • Generating time series between two dates in PostgreSQL
  • Join a count query on generate_series() and retrieve Null values as '0'

Finally, the LEFT JOIN LATERAL you already discovered. But since multiple apps can tie for the most reviews, retrieve all winners, which can be 0 - n apps. The query aggregates all daily winners into an array, so we get a single result row per review_window_start. Alternatively, define tiebreaker(s) to get at most one winner. See:

  • What is the difference between LATERAL JOIN and a subquery in PostgreSQL?

WHERE clause better execute before IN and JOIN or after

It doesn't matter

Logical processing order is always honoured: regardless of actual processing order

INNER JOINs and WHERE conditions are effectively associative and commutative (hence the ANSI-89 "join in the where" syntax) so actual order doesn't matter

Logical order becomes important with outer joins and more complex queries: applying WHERE on an OUTER table changes the logic completely.

Again, it doesn't matter how the optimiser does it internally so long as the query semantics are maintained by following logical processing order.

And the key word here is "optimiser": it does exactly what it says

Function to return dynamic set of columns for given table

Solution for the simple case

As explained in the referenced answers below, you can use registered (row) types, and thus implicitly declare the return type of a polymorphic function:

CREATE OR REPLACE FUNCTION public.get_table(_tbl_type anyelement)
RETURNS SETOF anyelement AS
$func$
BEGIN
RETURN QUERY EXECUTE format('TABLE %s', pg_typeof(_tbl_type));
END
$func$ LANGUAGE plpgsql;

Call:

SELECT * FROM public.get_table(NULL::public.users);  -- note the syntax!

Returns the complete table (with all user columns).

Wait! How?

Detailed explanation in this related answer, chapter
"Various complete table types":

  • Refactor a PL/pgSQL function to return the output of various SELECT queries

TABLE foo is just short for SELECT * FROM foo:

  • Is there a shortcut for SELECT * FROM?

2 steps for completely dynamic return type

But what you are trying to do is strictly impossible in a single SQL command.

I want to pass schema_name and table_name as parameters to function and get record list, according to column_visible field in
public.fields table.

There is no direct way to return an arbitrary selection of columns (return type not known at call time) from a function - or any SQL command. SQL demands to know number, names and types of resulting columns at call time. More in the 2nd chapter of this related answer:

  • How do I generate a pivoted CROSS JOIN where the resulting table definition is unknown?

There are various workarounds. You could wrap the result in one of the standard document types (json, jsonb, hstore, xml).

Or you generate the query with one function call and execute the result with the next:

CREATE OR REPLACE FUNCTION public.generate_get_table(_schema_name text, _table_name text)
RETURNS text AS
$func$
SELECT format('SELECT %s FROM %I.%I'
, string_agg(quote_ident(column_name), ', ')
, schema_name
, table_name)
FROM fields
WHERE column_visible
AND schema_name = _schema_name
AND table_name = _table_name
GROUP BY schema_name, table_name
ORDER BY schema_name, table_name;
$func$ LANGUAGE sql;

Call:

SELECT public.generate_get_table('public', 'users');

This create a query of the form:

SELECT usr_id, usr FROM public.users;

Execute it in the 2nd step. (You might want to add column numbers and order columns.)

Or append \gexec in psql to execute the return value immediately. See:

How to force evaluation of subquery before joining / pushing down to foreign server

Be sure to defend against SQL injection:

  • INSERT with dynamic table name in trigger function
  • Define table and column names as arguments in a plpgsql function?
Asides

varchar(100) does not make much sense for identifiers, which are limited to 63 characters in standard Postgres:

  • Maximum characters in labels (table names, columns etc)

If you understand how the object identifier type regclass works, you might replace schema and table name with a singe regclass column.

Postgres_FDW not pushing down WHERE criteria

Any function that is not IMMUTABLE will not be pushed down.

See function is_foreign_expr in contrib/postgres_fdw/deparse.c:

/*
* Returns true if given expr is safe to evaluate on the foreign server.
*/
bool
is_foreign_expr(PlannerInfo *root,
RelOptInfo *baserel,
Expr *expr)
{
...
/*
* An expression which includes any mutable functions can't be sent over
* because its result is not stable. For example, sending now() remote
* side could cause confusion from clock offsets. Future versions might
* be able to make this choice with more granularity. (We check this last
* because it requires a lot of expensive catalog lookups.)
*/
if (contain_mutable_functions((Node *) expr))
return false;

/* OK to evaluate on the remote server */
return true;
}


Related Topics



Leave a reply



Submit