Simple Graph Search Algorithm in SQL (Postgresql)

How to traverse nodes efficiently in postgresql?

It is not that simple. Within a single query, all recursive paths are completely independent. So, each path does not know about what's going on on the sibling path. It does not know that a certain node was already visited by a sibling.

Because SQL queries don't support some kind of global variables, it is impossible to share such information between the recursion paths that way.

I'd recommend to write a function where you can use plsql syntax which solves the problem in a more "common programmatical" way.

Limiting CTE search based on FK for a Directed Acyclic Graph in Postgresql

It seems like it should be a straightforward as adding a WHERE clause to the recursive part of the CTE and to the first select clause, assuming you want to limit both searches:

WITH RECURSIVE graph(id, depth) AS (
    SELECT first.parent_id, 1
        FROM edges_table AS first
        LEFT OUTER JOIN edges_table AS second
        ON first.parent_id = second.child_id
            AND second.group_id = 15 --limit result for second node
    WHERE first.child_id = 10 -- the node id we start from
UNION
    SELECT DISTINCT parent_id, graph.depth + 1
        FROM graph
        INNER JOIN edges_table
        ON edges_table.child_id = graph.id
        WHERE edges_table.group_id = 15 --and all subsequent nodes
)
SELECT id FROM graph
GROUP BY id
ORDER BY MAX(depth) DESC, id ASC

Note that in the first expression we use the filter in the LEFT JOIN part rather than the WHERE clause so that it will return a row even if the first edge doesn't connect to a second edge within the designed edge group.

search recursively for dead-ends in topological network table

OK. Here's how I figured it out:

I decided that there was not a way, or least not an easy way to implement this in SQL alone. I ended up implementing Tarjan's Algorithm in PHP and SQL, creating a temporary nodes table which linked each node to a strongly connected subcomponent of the graph. Once that was done, I updated any segment that was touching a node which did not belong to the largest subcomponent, as 'dangling'. All edges therefor that started and ended at nodes belonging to the largest subcomponent belong to the main street network (not dangling).

Here's the code. Note that it can take a very long time to run on a large graph. It's also pretty hard on the working memory, but it worked for my purposes.

<?php
$username = '';
$password = '';
$database = '';

$edge_table = 'cincy_segments';
$v1 = 'target';
$v2 = 'source';

$dangling_boolean_field = 'dangling';
$edge_id_field = 'edge_id';

//global variables declared
$index = 0;
$component_index = 0;
$nodes = array();
$stack = array();

pg_connect("host=localhost dbname=$database user=$username password=$password");

// get vertices
echo "getting data from database\n";
$neighbors_query = pg_query("
WITH nodes AS (
    SELECT DISTINCT $v1 AS node FROM $edge_table
    UNION
    SELECT DISTINCT $v2 AS node FROM $edge_table
), 
edges AS (
SELECT 
    node,
    $edge_id_field AS edge
FROM nodes JOIN $edge_table
    ON node = $v1 OR node = $v2
)
SELECT
    node,
    array_agg(CASE WHEN node = $v2 THEN $v1 
    WHEN node = $v1 THEN $v2
    ELSE NULL
    END) AS neighbor    
FROM edges JOIN $edge_table ON 
    (node = $v2 AND edge = $edge_id_field) OR 
    (node = $v1 AND edge = $edge_id_field)
GROUP BY node");

// now make the results into php results
echo "putting the results in an array\n";
while($r = pg_fetch_object($neighbors_query)){ // for each node record
    $nodes[$r->node]['id'] = $r->node;
    $nodes[$r->node]['neighbors'] = explode(',',trim($r->neighbor,'{}'));
}

// create a temporary table to store results
pg_query("
    DROP TABLE IF EXISTS temp_nodes;
    CREATE TABLE temp_nodes (node integer, component integer);
");

// the big traversal
echo "traversing graph (this part takes a while)\n";
foreach($nodes as $id => $values){
    if(!isset($values['index'])){
        tarjan($id, 'no parent');
    }
}

// identify dangling edges
echo "identifying dangling edges\n";
pg_query("
    UPDATE $edge_table SET $dangling_boolean_field = FALSE; 
    WITH dcn AS ( -- DisConnected Nodes
        -- get nodes that are NOT in the primary component
        SELECT node FROM temp_nodes WHERE component != (
            -- select the number of the largest component
            SELECT component
            FROM temp_nodes 
            GROUP BY component 
            ORDER BY count(*) DESC
            LIMIT 1)
    ),
    edges AS (
        SELECT DISTINCT e.$edge_id_field AS disconnected_edge_id
        FROM 
            dcn JOIN $edge_table AS e ON dcn.node = e.$v1 OR dcn.node = e.$v2
    )
    UPDATE $edge_table SET $dangling_boolean_field = TRUE
    FROM edges WHERE $edge_id_field = disconnected_edge_id;
");

// clean up after ourselves
echo "cleaning up\n";
pg_query("DROP TABLE IF EXISTS temp_nodes;");
pg_query("VACUUM ANALYZE;");

 // the recursive function definition
//
function tarjan($id, $parent)
{
    global $nodes;
    global $index;
    global $component_index;
    global $stack;

    // mark and push
    $nodes[$id]['index'] = $index;
    $nodes[$id]['lowlink'] = $index;
    $index++;
    array_push($stack, $id);

    // go through neighbors
    foreach ($nodes[$id]['neighbors'] as $child_id) {
        if ( !isset($nodes[$child_id]['index']) ) { // if neighbor not yet visited
            // recurse
            tarjan($child_id, $id);
            // find lowpoint
            $nodes[$id]['lowlink'] = min(
                $nodes[$id]['lowlink'],
                $nodes[$child_id]['lowlink']
            );
        } else if ($child_id != $parent) { // if already visited and not parent
            // assess lowpoint
            $nodes[$id]['lowlink'] = min(
                $nodes[$id]['lowlink'],
                $nodes[$child_id]['index']
            );
        }
    }
    // was this a root node?
    if ($nodes[$id]['lowlink'] == $nodes[$id]['index']) {
        do {
            $w = array_pop($stack);
            $scc[] = $w;
        } while($id != $w);
        // record results in table
        pg_query("
            INSERT INTO temp_nodes (node, component)
            VALUES (".implode(','.$component_index.'),(',$scc).",$component_index) 
        ");
        $component_index++;
    }
    return NULL;
}

?>

Visiting a directed graph as if it were an undirected one, using a recursive query

Could work like this:

WITH RECURSIVE graph AS (
    SELECT parent
          ,child
          ,',' || parent::text || ',' || child::text || ',' AS path
          ,0 AS depth
    FROM   ownership
    WHERE  parent = 1

    UNION ALL
    SELECT o.parent
          ,o.child
          ,g.path || o.child || ','
          ,g.depth + 1
    FROM   graph g
    JOIN   ownership o ON o.parent = g.child
    WHERE  g.path !~~ ('%,' || o.parent::text || ',' || o.child::text || ',%')
    )
SELECT  *
FROM    graph

You mentioned performance, so I optimized in that direction.

Major points:

Traverse the graph only in the defined direction.
No need for a column cycle, make it an exclusion condition instead. One less step to go. That is also the direct answer to:

How can I do to stop cycles one step before the node that closes the
cycle?

Use a string to record the path. Smaller and faster than an array of rows. Still contains all necessary information. Might change with very big bigint numbers, though.
Check for cycles with the LIKE operator (~~), should be much faster.
If you don't expect more that 2147483647 rows over the course of time, use plain integer columns instead of bigint. Smaller and faster.
Be sure to have an index on parent. Index on child is irrelevant for my query. (Other than in your original where you traverse edges in both directions.)
For huge graphs I would switch to a plpgsql procedure, where you can maintain the path as temp table with one row per step and a matching index. A bit of an overhead, that will pay off with huge graphs, though.

Problems in your original query:

WHERE (g.parent = o.child or g.child = o.parent)

There is only one endpoint of your traversal at any point in the process. As you wlak the directed graph in both directions, the endpoint can be either parent or child - but not both of them. You have to save the endpoint of every step, and then:

WHERE g.child IN (o.parent, o.child)

The violation of the direction also makes your starting condition questionable:

WHERE parent = 1

Would have to be

WHERE 1 IN (parent, child)

And the two rows (1,2) and (2,1) are effectively duplicates this way ...

Additional solution after comment

Ignore direction
Still walk any edge only once per path.
Use ARRAY for path
Save original direction in path, not actual direction.

Note, that this way (2,1) and (1,2) are effective duplicates, but both can be used in the same path.

I introduce the column leaf which saves the actual endpoint of every step.

WITH RECURSIVE graph AS (
    SELECT CASE WHEN parent = 1 THEN child ELSE parent END AS leaf
          ,ARRAY[ROW(parent, child)] AS path
          ,0 AS depth
    FROM   ownership
    WHERE  1 in (child, parent)

    UNION ALL
    SELECT CASE WHEN o.parent = g.leaf THEN o.child ELSE o.parent END -- AS leaf
          ,path || ROW(o.parent, o.child) -- AS path
          ,depth + 1 -- AS depth
    FROM   graph g
    JOIN   ownership o ON g.leaf in (o.parent, o.child) 
    AND    ROW(o.parent, o.child) <> ALL(path)
    )
SELECT *
FROM   graph

Counting distinct undirected edges in a directed graph in SQL

select count(*) from (
  select to_there from edges where from_here = 1
  union
  select from_here from edges where to_there = 1
) as whatever

SQL: find the longest path

You can to this recursive CTEs, like:

with recursive params as (
  select 1 fromnode,
         7 tonode
),
paths as (
    select ARRAY[fromnode] pathnodes,
           fromnode lastnode,
           0 sumdistance
    from   params
  union all
    select     pathnodes || e.tonode,
               e.tonode,
               sumdistance + e.distance
    from       paths
    join       edges e on e.fromnode = lastnode
    cross join params p
    where      e.fromnode <> p.tonode
    and        e.tonode <> all(pathnodes)
)
select   pathnodes, sumdistance
from     paths, params
where    lastnode = tonode
order by sumdistance desc
limit    1

(Casts may be required here and there, depending on the types of your columns.)

http://rextester.com/VRFHK43986

However, this will always calculate every possible paths (which starts at params.fromnode; without circles) & will choose the longest of them, after that.

Edit: the solution above assumes that the graph is directed. If it's undirected, you can modify it, to make use of edges from tonode to fromnode too:

with recursive params as (
  select 7 fromnode,
         1 tonode
),
paths as (
    select ARRAY[fromnode] pathnodes,
           fromnode lastnode,
           0 sumdistance
    from   params
  union all
    select     pathnodes || e.nodeb,
               e.nodeb,
               sumdistance + e.distance
    from       paths
    cross join lateral (select fromnode nodea,
                               tonode   nodeb,
                               distance
                        from   edges
                        where  fromnode = lastnode
                        union
                        select tonode   nodea,
                               fromnode nodeb,
                               distance
                        from   edges
                        where  tonode = lastnode) e
    cross join params p
    where      e.nodea <> p.tonode
    and        e.nodeb <> all(pathnodes)
)
select   pathnodes, sumdistance
from     paths, params
where    lastnode = tonode
order by sumdistance desc
limit    1;

However, a unique index may be required on (LEAST(fromnode, tonode), GREATEST(fromnode, tonode)) to avoid duplication, where row1.fromnode = row2.tonode and row1.tonode = row2.fromnode. (The query does not handle such kind of duplication & will calculate with 2 separate paths in this case).

http://rextester.com/XEGAX60117