Connect to SQL Server: connecting nodes and calculating paths

I have an interesting problem where I need to calculate paths between
connecting nodes.
For example:-
Node 1 is connected to Node 2
Node 1 is connected to Node 4
Node 2 is connected to Node 3
Node 3 is connected to Node 4
Want I need to return if I search connections between these two nodes are:-
1-4
1-2-3-4
I don't however need to see results if the connection is more than 6 nodes
deep.
Can anyone help me achieve this?
or point me to resources (online or books) where I can learn how to do this.
I have good knowledge in using hierarchical data but cannot for the life of
me figure this one out.
http://www.rippo.co.uk/nodes.gif (to visually see what I am trying to
achieve)
Thanks
rippoA few ways you could do it...
You could just make a single query, joining on your table 6 times.
You could use a temporary table, inserting the next level of nodes each time
you run through a loop - this is handy for doing a breadth-first search
through a tree.
You could make a table with 6 fields, populating each as you go through a
loop.|||In principle it isn't difficult to do a full enumeration of 6-edged paths,
assuming you model the network like this:
CREATE TABLE Paths (from_node INTEGER NOT NULL, to_node INTEGER NOT NULL,
PRIMARY KEY (from_node,to_node))
INSERT INTO Paths (from_node, to_node)
SELECT 1,2 UNION ALL
SELECT 2,1 UNION ALL
SELECT 2,3 UNION ALL
SELECT 2,9 UNION ALL
SELECT 2,6 UNION ALL
SELECT 3,2 UNION ALL
SELECT 3,4 UNION ALL
SELECT 3,5 UNION ALL
SELECT 4,3 UNION ALL
SELECT 4,5 UNION ALL
SELECT 5,3 UNION ALL
SELECT 5,4 UNION ALL
SELECT 5,6 UNION ALL
SELECT 6,2 UNION ALL
SELECT 6,5 UNION ALL
SELECT 6,7 UNION ALL
SELECT 7,6 UNION ALL
SELECT 7,8 UNION ALL
SELECT 8,7 UNION ALL
SELECT 8,9 UNION ALL
SELECT 9,8 UNION ALL
SELECT 9,2
DECLARE @.from_node INTEGER, @.to_node INTEGER
SET @.from_node = 1
SET @.to_node = 6
SELECT
P1.from_node, P1.to_node, P2.to_node,
P3.to_node, P4.to_node, P5.to_node
FROM Paths AS P1
LEFT JOIN Paths AS P2
ON P1.to_node = P2.from_node
AND P1.to_node <> @.to_node
LEFT JOIN Paths AS P3
ON P2.to_node = P3.from_node
AND P2.to_node <> @.to_node
LEFT JOIN Paths AS P4
ON P3.to_node = P4.from_node
AND P3.to_node <> @.to_node
LEFT JOIN Paths AS P5
ON P4.to_node = P5.from_node
AND P4.to_node <> @.to_node
WHERE P1.from_node = @.from_node
AND COALESCE(P5.to_node, P4.to_node, P3.to_node,
P2.to_node, P1.to_node) = @.to_node
I'll leave it as an execise for you to figure out how to limit this to a
single visit per node, if that's what you are actually s

ing.
David Portas
SQL Server MVP
--|||David,
This assumes that he can't move on once he gets to the target.
He might need to say:
@.to_node in (P5.to_node, P4.to_node, ...
Using something similar to this in the join conditions to stop searching
down a path that has already visited @.to_node. (In a similar way to the
solution to your exercise for avoiding hitting the same node twice)
RobF|||Thank you for your help so far. I do wish to limit this to a single
visit to a node and also wish to order the results by shortest route
first!
I will have a crack at this myself, however any hints would be useful
:)
rippo|||Assuming you define "length" in terms of the number of edges and assuming yo
u
don't have any nodes of negative numbers. Try this:
SELECT node0, node1, node2, node3, node4, node5, edges
FROM
(SELECT
P1.from_node, P1.to_node, P2.to_node,
P3.to_node, P4.to_node, P5.to_node,
SIGN(COALESCE(P1.to_node,0))+SIGN(COALESCE(P2.to_node,0))+
SIGN(COALESCE(P3.to_node,0))+SIGN(COALESCE(P4.to_node,0))+
SIGN(COALESCE(P5.to_node,0)) AS edges
FROM Paths AS P1
LEFT JOIN Paths AS P2
ON P1.to_node = P2.from_node
AND P1.to_node <> @.to_node
LEFT JOIN Paths AS P3
ON P2.to_node = P3.from_node
AND P2.to_node <> @.to_node
LEFT JOIN Paths AS P4
ON P3.to_node = P4.from_node
AND P3.to_node <> @.to_node
LEFT JOIN Paths AS P5
ON P4.to_node = P5.from_node
AND P4.to_node <> @.to_node
WHERE P1.from_node = @.from_node
AND COALESCE(P5.to_node, P4.to_node, P3.to_node,
P2.to_node, P1.to_node) = @.to_node
AND NOT EXISTS
(SELECT NULL
FROM
(SELECT P1.from_node AS n UNION ALL
SELECT P1.to_node UNION ALL
SELECT P2.to_node UNION ALL
SELECT P3.to_node UNION ALL
SELECT P4.to_node UNION ALL
SELECT P5.to_node) AS T
GROUP BY n
HAVING COUNT(n)>1)
) AS X(node0, node1, node2, node3, node4, node5, edges)
ORDER BY edges
David Portas
SQL Server MVP
--|||>From the next edition of SQL FOR SMARTIES:
30.02.04. Listing the Paths
I got data for this table from the book Introduction to Algorithms by
Cormen, Leiserson and Rivest (ISBN 0-262-03141-8), page 518. This book
was very popular in college courses in the United States. I made one
decision that will be important later; I added self-traversal edges
(i.e., the node is both the out_node and the in_node of an edge) with
weights of zero.
INSERT INTO Edges VALUES ('s', 's', 0);
INSERT INTO Edges VALUES ('s', 'u', 3);
INSERT INTO Edges VALUES ('s', 'x', 5);
INSERT INTO Edges VALUES ('u', 'u', 0);
INSERT INTO Edges VALUES ('u', 'v', 6);
INSERT INTO Edges VALUES ('u', 'x', 2);
INSERT INTO Edges VALUES ('v', 'v', 0);
INSERT INTO Edges VALUES ('v', 'y', 2);
INSERT INTO Edges VALUES ('x', 'u', 1);
INSERT INTO Edges VALUES ('x', 'v', 4);
INSERT INTO Edges VALUES ('x', 'x', 0);
INSERT INTO Edges VALUES ('x', 'y', 6);
INSERT INTO Edges VALUES ('y', 's', 3);
INSERT INTO Edges VALUES ('y', 'v', 7);
INSERT INTO Edges VALUES ('y', 'y', 0);
I am not happy about this approach, because I have to decide the
maximum number of edges in path before I start looking for an answer.
But this will work and I know that a path will have no more than the
total number of nodes in the graph. Let's create a table to hold the
paths:
CREATE TABLE Paths
(step1 CHAR(2) NOT NULL,
step2 CHAR(2) NOT NULL,
step3 CHAR(2) NOT NULL,
step4 CHAR(2) NOT NULL,
step5 CHAR(2) NOT NULL,
total_cost INTEGER NOT NULL,
path_length INTEGER NOT NULL,
PRIMARY KEY (step1, step2, step3, step4, step5));
The "step1" node is where I begin the path. The other columns are the
second step, third step, fourth step, and so forth. The last step
column is the end of the journey. The "total_cost" column is the total
cost, based on the sum of the weights of the edges, on this path. The
path length column is harder to explain, but for now, let's just say
that it is a count of the nodes visited in the path.
To keep things easier, let's look at all the paths from 's' to 'y' in
the graph. The INSERT INTO statement for constructing that set looks
likes this:
INSERT INTO Paths
SELECT G1.out_node, -- it is 's' in this example
G2.out_node,
G3.out_node, G4.out_node,
G4.in_node, -- it is 'y' in this example
(G1.cost + G2.cost + G3.cost + G4.cost),
(CASE WHEN G1.out_node NOT IN (G2.out_node, G3.out_node,
G4.out_node) THEN 1 ELSE 0 END
+ CASE WHEN G2.out_node NOT IN (G1.out_node, G3.out_node,
G4.out_node) THEN 1 ELSE 0 END
+ CASE WHEN G3.out_node NOT IN (G1.out_node, G2.out_node,
G4.out_node) THEN 1 ELSE 0 END
+ CASE WHEN G4.out_node NOT IN (G1.out_node, G2.out_node,
G3.out_node) THEN 1 ELSE 0 END)
FROM Edges AS G1,
Edges AS G2,
Edges AS G3,
Edges AS G4
WHERE G1.out_node = 's'
AND G1.in_node = G2.out_node
AND G2.in_node = G3.out_node
AND G3.in_node = G4.out_node
AND G4.in_node = 'y';
I put in 's' and 'y' as the out_node and in_node of the path, and made
sure that the in_node of each step in the path was the out_node of the
next step in the path. This is a combinatorial explosion, but it is
easy to read and understand.
The sum of the weights is the cost of the path, which is easy to
understand. The path_length calculation is a bit harder. This sum of
CASE expressions looks at each node in the path. If it is unique
within the row, it is assigned a value of one, if it is not unique
within the row, it is assigned a value of zero.
All paths will have five steps in them because that is the way to table
is declared. But what if a path exists between the two nodes which is
shorter than five steps? That is where the self-traversal rows are
used! Consecutive pairs of steps in the same row can be repetitions of
the same node.
Here is what the rows of the Paths table look like after this INSERT
INTO statement, ordered by descending path_length, and then by
ascending cost.
Paths step1 step2 step3 step4 step5 total_cost path_length
========================================
==============
s s x x y 11 0
s s s x y 11 1
s x x x y 11 1
s x u x y 14 2
s s u v y 11 2
s s u x y 11 2
s s x v y 11 2
s s x y y 11 2
s u u v y 11 2
s u u x y 11 2
s u v v y 11 2
s u x x y 11 2
s x v v y 11 2
s x x v y 11 2
s x x y y 11 2
s x y y y 11 2
s x y v y 20 4
s x u v y 14 4
s u v y y 11 4
s u x v y 11 4
s u x y y 11 4
s x v y y 11 4
Clearly, all pairs of nodes could be picked from the original Edges
table and the same INSERT INTO run on them with a minor change in the
WHERE clause. However, this example is big enough for a short magazine
article. And it is too big for most applications. It is safe to
assume that people really want the cheapest path. In this example, the
total_cost column defines the cost of an path, so we can eliminate some
of the paths from the Paths table with this statement.
DELETE FROM Paths
WHERE total_cost
> (SELECT MIN(total_cost)
FROM Paths);
Again, if you had all the paths for all possible pairs of nodes, the
subquery expression would have a WHERE clause to correlate it to the
subset of paths for each possible pair.
In this example, it got rid of 3 out of 22 possible paths. It is
helpful and in some situations we might like having all the options.
But these are not distinct options.
As one of many examples, the paths
(s, x, v, v, y, 11, 2)
and
(s, x, x, v, y, 11, 2)
are both really the same path, (s, x, v, y). Before we decide to write
a statement to handle these equivalent rows, let's consider another
cost factor. People do not like to change airplanes or trains. If
they can go from Amsterdam to New York City on one plane without
changing planes for the same cost, they are happy. This is where that
path_length column comes in. It is a quick way to remove the paths
that have more edges than they need to get the job done.
DELETE FROM Paths
WHERE path_length
> (SELECT MIN(path_length)
FROM Paths);
In this case, that last DELETE FROM statement will reduce the table to
one row: (s, s, x, x, y, 11, 0) which reduces to (s, x, y). This
single remaining row is very convenient for my article, but if you look
at the table, you will see that there was also a subset of equivalent
rows that had higher path_length numbers.
(s, s, s, x, y, 11, 1)
(s, x, x, x, y, 11, 1)
(s, x, x, y, y, 11, 2)
(s, x, y, y, y, 11, 2)
Your task is to write code to handle equivalent rows. Hint: the
duplicate nodes will always be contiguous across the row.|||Joe,
When is the next edition going to come out?
jp
--CELKO-- wrote:
> 30.02.04. Listing the Paths
> I got data for this table from the book Introduction to Algorithms by
> Cormen, Leiserson and Rivest (ISBN 0-262-03141-8), page 518. This book
> was very popular in college courses in the United States. I made one
> decision that will be important later; I added self-traversal edges
> (i.e., the node is both the out_node and the in_node of an edge) with
> weights of zero.
> INSERT INTO Edges VALUES ('s', 's', 0);
> INSERT INTO Edges VALUES ('s', 'u', 3);
> INSERT INTO Edges VALUES ('s', 'x', 5);
> INSERT INTO Edges VALUES ('u', 'u', 0);
> INSERT INTO Edges VALUES ('u', 'v', 6);
> INSERT INTO Edges VALUES ('u', 'x', 2);
> INSERT INTO Edges VALUES ('v', 'v', 0);
> INSERT INTO Edges VALUES ('v', 'y', 2);
> INSERT INTO Edges VALUES ('x', 'u', 1);
> INSERT INTO Edges VALUES ('x', 'v', 4);
> INSERT INTO Edges VALUES ('x', 'x', 0);
> INSERT INTO Edges VALUES ('x', 'y', 6);
> INSERT INTO Edges VALUES ('y', 's', 3);
> INSERT INTO Edges VALUES ('y', 'v', 7);
> INSERT INTO Edges VALUES ('y', 'y', 0);
> I am not happy about this approach, because I have to decide the
> maximum number of edges in path before I start looking for an answer.
> But this will work and I know that a path will have no more than the
> total number of nodes in the graph. Let's create a table to hold the
> paths:
> CREATE TABLE Paths
> (step1 CHAR(2) NOT NULL,
> step2 CHAR(2) NOT NULL,
> step3 CHAR(2) NOT NULL,
> step4 CHAR(2) NOT NULL,
> step5 CHAR(2) NOT NULL,
> total_cost INTEGER NOT NULL,
> path_length INTEGER NOT NULL,
> PRIMARY KEY (step1, step2, step3, step4, step5));
> The "step1" node is where I begin the path. The other columns are the
> second step, third step, fourth step, and so forth. The last step
> column is the end of the journey. The "total_cost" column is the total
> cost, based on the sum of the weights of the edges, on this path. The
> path length column is harder to explain, but for now, let's just say
> that it is a count of the nodes visited in the path.
> To keep things easier, let's look at all the paths from 's' to 'y' in
> the graph. The INSERT INTO statement for constructing that set looks
> likes this:
> INSERT INTO Paths
> SELECT G1.out_node, -- it is 's' in this example
> G2.out_node,
> G3.out_node, G4.out_node,
> G4.in_node, -- it is 'y' in this example
> (G1.cost + G2.cost + G3.cost + G4.cost),
> (CASE WHEN G1.out_node NOT IN (G2.out_node, G3.out_node,
> G4.out_node) THEN 1 ELSE 0 END
> + CASE WHEN G2.out_node NOT IN (G1.out_node, G3.out_node,
> G4.out_node) THEN 1 ELSE 0 END
> + CASE WHEN G3.out_node NOT IN (G1.out_node, G2.out_node,
> G4.out_node) THEN 1 ELSE 0 END
> + CASE WHEN G4.out_node NOT IN (G1.out_node, G2.out_node,
> G3.out_node) THEN 1 ELSE 0 END)
> FROM Edges AS G1,
> Edges AS G2,
> Edges AS G3,
> Edges AS G4
> WHERE G1.out_node = 's'
> AND G1.in_node = G2.out_node
> AND G2.in_node = G3.out_node
> AND G3.in_node = G4.out_node
> AND G4.in_node = 'y';
> I put in 's' and 'y' as the out_node and in_node of the path, and made
> sure that the in_node of each step in the path was the out_node of the
> next step in the path. This is a combinatorial explosion, but it is
> easy to read and understand.
> The sum of the weights is the cost of the path, which is easy to
> understand. The path_length calculation is a bit harder. This sum of
> CASE expressions looks at each node in the path. If it is unique
> within the row, it is assigned a value of one, if it is not unique
> within the row, it is assigned a value of zero.
> All paths will have five steps in them because that is the way to table
> is declared. But what if a path exists between the two nodes which is
> shorter than five steps? That is where the self-traversal rows are
> used! Consecutive pairs of steps in the same row can be repetitions of
> the same node.
> Here is what the rows of the Paths table look like after this INSERT
> INTO statement, ordered by descending path_length, and then by
> ascending cost.
> Paths step1 step2 step3 step4 step5 total_cost path_length
> ========================================
==============
> s s x x y 11 0
> s s s x y 11 1
> s x x x y 11 1
> s x u x y 14 2
> s s u v y 11 2
> s s u x y 11 2
> s s x v y 11 2
> s s x y y 11 2
> s u u v y 11 2
> s u u x y 11 2
> s u v v y 11 2
> s u x x y 11 2
> s x v v y 11 2
> s x x v y 11 2
> s x x y y 11 2
> s x y y y 11 2
> s x y v y 20 4
> s x u v y 14 4
> s u v y y 11 4
> s u x v y 11 4
> s u x y y 11 4
> s x v y y 11 4
> Clearly, all pairs of nodes could be picked from the original Edges
> table and the same INSERT INTO run on them with a minor change in the
> WHERE clause. However, this example is big enough for a short magazine
> article. And it is too big for most applications. It is safe to
> assume that people really want the cheapest path. In this example, the
> total_cost column defines the cost of an path, so we can eliminate some
> of the paths from the Paths table with this statement.
> DELETE FROM Paths
> WHERE total_cost
> FROM Paths);
> Again, if you had all the paths for all possible pairs of nodes, the
> subquery expression would have a WHERE clause to correlate it to the
> subset of paths for each possible pair.
> In this example, it got rid of 3 out of 22 possible paths. It is
> helpful and in some situations we might like having all the options.
> But these are not distinct options.
> As one of many examples, the paths
> (s, x, v, v, y, 11, 2)
> and
> (s, x, x, v, y, 11, 2)
> are both really the same path, (s, x, v, y). Before we decide to write
> a statement to handle these equivalent rows, let's consider another
> cost factor. People do not like to change airplanes or trains. If
> they can go from Amsterdam to New York City on one plane without
> changing planes for the same cost, they are happy. This is where that
> path_length column comes in. It is a quick way to remove the paths
> that have more edges than they need to get the job done.
> DELETE FROM Paths
> WHERE path_length
> FROM Paths);
> In this case, that last DELETE FROM statement will reduce the table to
> one row: (s, s, x, x, y, 11, 0) which reduces to (s, x, y). This
> single remaining row is very convenient for my article, but if you look
> at the table, you will see that there was also a subset of equivalent
> rows that had higher path_length numbers.
> (s, s, s, x, y, 11, 1)
> (s, x, x, x, y, 11, 1)
> (s, x, x, y, y, 11, 2)
> (s, x, y, y, y, 11, 2)
> Your task is to write code to handle equivalent rows. Hint: the
> duplicate nodes will always be contiguous across the row.
>|||Thanks Guys.This has been very helpful.
I will implement both solutions and come up with the best appraoch for
my problem.
I should probably have mentioned that the nodes are not closed and will
be added to constantly. I estimate that there will be getting close to
100,000 nodes with each node being linked up to around 5 nodes. Some
nodes however will be linked to serveral hundred.
Anyway thanks again
Rippo

Friday, February 24, 2012

connecting nodes and calculating paths

No comments:

Post a Comment

Connect to SQL Server

Blog Archive

About Me