Simple Graph Search Algorithm in SQL (Postgresql)

How to traverse nodes efficiently in postgresql?

It is not that simple. Within a single query, all recursive paths are completely independent. So, each path does not know about what's going on on the sibling path. It does not know that a certain node was already visited by a sibling.

Because SQL queries don't support some kind of global variables, it is impossible to share such information between the recursion paths that way.

I'd recommend to write a function where you can use plsql syntax which solves the problem in a more "common programmatical" way.

Limiting CTE search based on FK for a Directed Acyclic Graph in Postgresql

It seems like it should be a straightforward as adding a WHERE clause to the recursive part of the CTE and to the first select clause, assuming you want to limit both searches:

WITH RECURSIVE graph(id, depth) AS (
SELECT first.parent_id, 1
FROM edges_table AS first
LEFT OUTER JOIN edges_table AS second
ON first.parent_id = second.child_id
AND second.group_id = 15 --limit result for second node
WHERE first.child_id = 10 -- the node id we start from
UNION
SELECT DISTINCT parent_id, graph.depth + 1
FROM graph
INNER JOIN edges_table
ON edges_table.child_id = graph.id
WHERE edges_table.group_id = 15 --and all subsequent nodes
)
SELECT id FROM graph
GROUP BY id
ORDER BY MAX(depth) DESC, id ASC

Note that in the first expression we use the filter in the LEFT JOIN part rather than the WHERE clause so that it will return a row even if the first edge doesn't connect to a second edge within the designed edge group.

search recursively for dead-ends in topological network table

OK. Here's how I figured it out:

I decided that there was not a way, or least not an easy way to implement this in SQL alone. I ended up implementing Tarjan's Algorithm in PHP and SQL, creating a temporary nodes table which linked each node to a strongly connected subcomponent of the graph. Once that was done, I updated any segment that was touching a node which did not belong to the largest subcomponent, as 'dangling'. All edges therefor that started and ended at nodes belonging to the largest subcomponent belong to the main street network (not dangling).

Here's the code. Note that it can take a very long time to run on a large graph. It's also pretty hard on the working memory, but it worked for my purposes.

<?php
$username = '';
$password = '';
$database = '';

$edge_table = 'cincy_segments';
$v1 = 'target';
$v2 = 'source';

$dangling_boolean_field = 'dangling';
$edge_id_field = 'edge_id';

//global variables declared
$index = 0;
$component_index = 0;
$nodes = array();
$stack = array();

pg_connect("host=localhost dbname=$database user=$username password=$password");

// get vertices
echo "getting data from database\n";
$neighbors_query = pg_query("
WITH nodes AS (
SELECT DISTINCT $v1 AS node FROM $edge_table
UNION
SELECT DISTINCT $v2 AS node FROM $edge_table
),
edges AS (
SELECT
node,
$edge_id_field AS edge
FROM nodes JOIN $edge_table
ON node = $v1 OR node = $v2
)
SELECT
node,
array_agg(CASE WHEN node = $v2 THEN $v1
WHEN node = $v1 THEN $v2
ELSE NULL
END) AS neighbor
FROM edges JOIN $edge_table ON
(node = $v2 AND edge = $edge_id_field) OR
(node = $v1 AND edge = $edge_id_field)
GROUP BY node");

// now make the results into php results
echo "putting the results in an array\n";
while($r = pg_fetch_object($neighbors_query)){ // for each node record
$nodes[$r->node]['id'] = $r->node;
$nodes[$r->node]['neighbors'] = explode(',',trim($r->neighbor,'{}'));
}

// create a temporary table to store results
pg_query("
DROP TABLE IF EXISTS temp_nodes;
CREATE TABLE temp_nodes (node integer, component integer);
");

// the big traversal
echo "traversing graph (this part takes a while)\n";
foreach($nodes as $id => $values){
if(!isset($values['index'])){
tarjan($id, 'no parent');
}
}

// identify dangling edges
echo "identifying dangling edges\n";
pg_query("
UPDATE $edge_table SET $dangling_boolean_field = FALSE;
WITH dcn AS ( -- DisConnected Nodes
-- get nodes that are NOT in the primary component
SELECT node FROM temp_nodes WHERE component != (
-- select the number of the largest component
SELECT component
FROM temp_nodes
GROUP BY component
ORDER BY count(*) DESC
LIMIT 1)
),
edges AS (
SELECT DISTINCT e.$edge_id_field AS disconnected_edge_id
FROM
dcn JOIN $edge_table AS e ON dcn.node = e.$v1 OR dcn.node = e.$v2
)
UPDATE $edge_table SET $dangling_boolean_field = TRUE
FROM edges WHERE $edge_id_field = disconnected_edge_id;
");

// clean up after ourselves
echo "cleaning up\n";
pg_query("DROP TABLE IF EXISTS temp_nodes;");
pg_query("VACUUM ANALYZE;");

// the recursive function definition
//
function tarjan($id, $parent)
{
global $nodes;
global $index;
global $component_index;
global $stack;

// mark and push
$nodes[$id]['index'] = $index;
$nodes[$id]['lowlink'] = $index;
$index++;
array_push($stack, $id);

// go through neighbors
foreach ($nodes[$id]['neighbors'] as $child_id) {
if ( !isset($nodes[$child_id]['index']) ) { // if neighbor not yet visited
// recurse
tarjan($child_id, $id);
// find lowpoint
$nodes[$id]['lowlink'] = min(
$nodes[$id]['lowlink'],
$nodes[$child_id]['lowlink']
);
} else if ($child_id != $parent) { // if already visited and not parent
// assess lowpoint
$nodes[$id]['lowlink'] = min(
$nodes[$id]['lowlink'],
$nodes[$child_id]['index']
);
}
}
// was this a root node?
if ($nodes[$id]['lowlink'] == $nodes[$id]['index']) {
do {
$w = array_pop($stack);
$scc[] = $w;
} while($id != $w);
// record results in table
pg_query("
INSERT INTO temp_nodes (node, component)
VALUES (".implode(','.$component_index.'),(',$scc).",$component_index)
");
$component_index++;
}
return NULL;
}

?>

Visiting a directed graph as if it were an undirected one, using a recursive query

Could work like this:

WITH RECURSIVE graph AS (
SELECT parent
,child
,',' || parent::text || ',' || child::text || ',' AS path
,0 AS depth
FROM ownership
WHERE parent = 1

UNION ALL
SELECT o.parent
,o.child
,g.path || o.child || ','
,g.depth + 1
FROM graph g
JOIN ownership o ON o.parent = g.child
WHERE g.path !~~ ('%,' || o.parent::text || ',' || o.child::text || ',%')
)
SELECT *
FROM graph

You mentioned performance, so I optimized in that direction.

Major points:

  • Traverse the graph only in the defined direction.

  • No need for a column cycle, make it an exclusion condition instead. One less step to go. That is also the direct answer to:

How can I do to stop cycles one step before the node that closes the
cycle?

  • Use a string to record the path. Smaller and faster than an array of rows. Still contains all necessary information. Might change with very big bigint numbers, though.

  • Check for cycles with the LIKE operator (~~), should be much faster.

  • If you don't expect more that 2147483647 rows over the course of time, use plain integer columns instead of bigint. Smaller and faster.

  • Be sure to have an index on parent. Index on child is irrelevant for my query. (Other than in your original where you traverse edges in both directions.)

  • For huge graphs I would switch to a plpgsql procedure, where you can maintain the path as temp table with one row per step and a matching index. A bit of an overhead, that will pay off with huge graphs, though.


Problems in your original query:

WHERE (g.parent = o.child or g.child = o.parent) 

There is only one endpoint of your traversal at any point in the process. As you wlak the directed graph in both directions, the endpoint can be either parent or child - but not both of them. You have to save the endpoint of every step, and then:

WHERE g.child IN (o.parent, o.child) 

The violation of the direction also makes your starting condition questionable:

WHERE parent = 1

Would have to be

WHERE 1 IN (parent, child)

And the two rows (1,2) and (2,1) are effectively duplicates this way ...


Additional solution after comment

  • Ignore direction
  • Still walk any edge only once per path.
  • Use ARRAY for path
  • Save original direction in path, not actual direction.

Note, that this way (2,1) and (1,2) are effective duplicates, but both can be used in the same path.

I introduce the column leaf which saves the actual endpoint of every step.

WITH RECURSIVE graph AS (
SELECT CASE WHEN parent = 1 THEN child ELSE parent END AS leaf
,ARRAY[ROW(parent, child)] AS path
,0 AS depth
FROM ownership
WHERE 1 in (child, parent)

UNION ALL
SELECT CASE WHEN o.parent = g.leaf THEN o.child ELSE o.parent END -- AS leaf
,path || ROW(o.parent, o.child) -- AS path
,depth + 1 -- AS depth
FROM graph g
JOIN ownership o ON g.leaf in (o.parent, o.child)
AND ROW(o.parent, o.child) <> ALL(path)
)
SELECT *
FROM graph

Counting distinct undirected edges in a directed graph in SQL

select count(*) from (
select to_there from edges where from_here = 1
union
select from_here from edges where to_there = 1
) as whatever

SQL: find the longest path

You can to this recursive CTEs, like:

with recursive params as (
select 1 fromnode,
7 tonode
),
paths as (
select ARRAY[fromnode] pathnodes,
fromnode lastnode,
0 sumdistance
from params
union all
select pathnodes || e.tonode,
e.tonode,
sumdistance + e.distance
from paths
join edges e on e.fromnode = lastnode
cross join params p
where e.fromnode <> p.tonode
and e.tonode <> all(pathnodes)
)
select pathnodes, sumdistance
from paths, params
where lastnode = tonode
order by sumdistance desc
limit 1

(Casts may be required here and there, depending on the types of your columns.)

http://rextester.com/VRFHK43986

However, this will always calculate every possible paths (which starts at params.fromnode; without circles) & will choose the longest of them, after that.

Edit: the solution above assumes that the graph is directed. If it's undirected, you can modify it, to make use of edges from tonode to fromnode too:

with recursive params as (
select 7 fromnode,
1 tonode
),
paths as (
select ARRAY[fromnode] pathnodes,
fromnode lastnode,
0 sumdistance
from params
union all
select pathnodes || e.nodeb,
e.nodeb,
sumdistance + e.distance
from paths
cross join lateral (select fromnode nodea,
tonode nodeb,
distance
from edges
where fromnode = lastnode
union
select tonode nodea,
fromnode nodeb,
distance
from edges
where tonode = lastnode) e
cross join params p
where e.nodea <> p.tonode
and e.nodeb <> all(pathnodes)
)
select pathnodes, sumdistance
from paths, params
where lastnode = tonode
order by sumdistance desc
limit 1;

However, a unique index may be required on (LEAST(fromnode, tonode), GREATEST(fromnode, tonode)) to avoid duplication, where row1.fromnode = row2.tonode and row1.tonode = row2.fromnode. (The query does not handle such kind of duplication & will calculate with 2 separate paths in this case).

http://rextester.com/XEGAX60117



Related Topics



Leave a reply



Submit