Longest Path Algorithm in a Graph

 

 

Longest Path – where individual nodes do not re-occur in the longest path

The aim of finding the longest possible path in a graph is to find a path that can cover the longest possible travelling distance in a graph, from a known starting node to an unknown ending node. Inside the longest path, individual nodes do not re-occur.

In my Longest Path coding, I am using coding similar to shortest path algorithm. But longest path programming will have additional coding to check do targeted destination node re-occur inside current node’s longest path. The targeted destination node will not again add inside current node’s longest path if found re-occurred.

Like shortest path, the sub-path of the longest path itself supposing are also the longest path.

When we are traveling from nodes to nodes, we can be possibly circulating around closed loops. The loops can be between nodes next to each other; the loops could be numerous numbers of nodes that linked together in big circles. Shortest path doesn’t have path travelling in closed loop problems; longest path needs to cater coding to prevent the path from keep-on circulating in closed loop, causing the total travelling distance to be infinity.

Shortest path algorithm coding I wrote in year 2004, my personal efforts without referring to any other people code.

Shortest path is very useful algorithm, used in various area like traffic navigation, network routing. Longest path algorithm on other hands maybe can be used in game engine. Where the character urged to acquire as much loots (nodes) as possible.

However, there are some scenarios when coding cannot derive the correct Longest Path from a graph. When the coding acquiring the longest path, it will add into the nearby nodes’ ‘sum of distance’. Making nearby nodes’ ‘sum of distance’ gradually bigger when farther away, and the coding will always choose the largest ‘sum of distance’. The problems are not that easily able to be resolved. In contrast to Shortest Path, while farther nodes’ ‘sum of distance’ become bigger and bigger, algorithm will always choose the smallest possible ‘sum of distance’.

The Longest Path coding is implementing near-to-far method to derive the longest path. It will start deriving the nearby nodes (from first node) ‘sum of longest distance’, and the derivations will be spreading out to the farther away nodes. When the graph has many interlinking, two-way routes among all the nodes, the coding might fail to derive longest path appropriately. Also, when first node and the longest path’s last node linkages are found to be nearby each other, coding might not able to derive the correct longest path.

 


Code

 

1.     Create a queue ‘q’ to store nodes that we intend to search for longest path. We don’t need to save all the nodes at once into the queue at the start. Only add the first node into the queue to kick start the searching process. Create a variable ‘Ptr’ to store currently program is at which node.

Queue<int> q = new Queue<int>();

q.Enqueue(firstNode);

 

int Ptr;

2.     Run the search while queue ‘q’ is not empty. Dequeue ‘q’ and record targeted node into ‘Ptr’.

while (q.Count != 0)

{

 

    Ptr = q.Dequeue();

3.     Sort the current node’s destinations by weight in descending order (‘myMap[Ptr].dict_Destinations_Weight.OrderByDescending(key => key.Value)’). Run the check from current node to all its’ destination nodes from the destination with the biggest weight. This step sometimes is reductant.

Calculate the sum of “current node’s total travelling distance + from current node to destination node travelling distance”. There should not be any path pointing back to ‘firstNode’ as this will cause the program to keep-on looping and freeze.

foreach (KeyValuePair<string, int> destinations in myMap[Ptr].dict_Destinations_Weight.OrderByDescending(key => key.Value))

{

   if (Int32.Parse(destinations.Key) != firstNode) //When this targeted destination node is not the first node

   {

      int sum = myMap[Ptr].longestDistance + destinations.Value;

4.     Check if the targeted destination node is being visited before by checking if ‘longestDistance’ equal to ‘0’. If targeted destination node is not been visited, set the longest path into targeted destination node (‘prePath’ = current node). Push the target destination node into queue ‘q’ for the program to run on the target destination node.

if (myMap[Int32.Parse(destinations.Key)].longestDistance == 0)

{

   //When this targeted destination node is not been visited before

   myMap[Int32.Parse(destinations.Key)].longestDistance = sum;

   myMap[Int32.Parse(destinations.Key)].prevPath = Ptr;

 

   if (!q.Contains(Int32.Parse(destinations.Key)))

      q.Enqueue(Int32.Parse(destinations.Key));

}

5.     If the target destination node was visited previously (‘longestDistance’ having some value other than default ‘0’), check if calculated ‘sum’ is bigger than the destination node’s recorded total travelling distance. If calculated ‘sum’ is bigger, enter the next process; else if smaller, do nothing. Subsequently check if target destination node exists inside current node’s longest path. If the target destination node does not exist inside current node’s longest path, set the current node’s longest path into targeted destination node (‘prePath’ = current node). Push the target destination node into queue ‘q’ for the program to run on the target destination node.

else

{

    //Check if the new path's total distance longer than the one recorded inside destination node

    if (myMap[Int32.Parse(destinations.Key)].longestDistance < sum)

    {

        //Check if this targeted destination node is re-occur inside the current node's longest path

        int temp_Ptr = Ptr;

        bool targeted_destination_node_reoccur = false;

        while (myMap[temp_Ptr].prevPath != -1 && !targeted_destination_node_reoccur)

        {

            if (myMap[temp_Ptr].prevPath == Int32.Parse(destinations.Key))

                targeted_destination_node_reoccur = true;

            else

                temp_Ptr = myMap[temp_Ptr].prevPath;

        }

 

        //If this targeted destination  node is not re-occur inside the current node longest path

        if (!targeted_destination_node_reoccur)

        {

            myMap[Int32.Parse(destinations.Key)].longestDistance = sum;

            myMap[Int32.Parse(destinations.Key)].prevPath = Ptr;

            if (!q.Contains(Int32.Parse(destinations.Key)))

                q.Enqueue(Int32.Parse(destinations.Key));

        }

    }

}

 

6.     After queue is empty, find which node having the maximum total travelling distance, ‘longestDistance’. This node would be the longest path from starting node.

int longestPath_Sum = 0, farthestPath_Location = firstNode;

for (int i = 0; i < total_nodecount; i++)

{

   if (longestPath_Sum < myMap[i].longestDistance)

   {

      longestPath_Sum = myMap[i].longestDistance;

      farthestPath_Location = i;

   }

}

 

 


C# Console App Programming

 

Example graph #1 of finding longest path

Example 1 (‘1.txt’): Red color paths are the longest path

Example graph #2 of finding longest path

Example 2 (‘2.txt’): Red color path are the longest path

 

Read from ‘2.txt’

Descriptions

From Node ‘0’ / row_1

1 1 2 2 3 3 4 4

To node ‘1’, weight=1

To node ‘2’, weight=2

To node ‘3’, weight=3

To node ‘4’, weight=4

From Node ‘1’ / row_2

5 2 6 2

To node ‘5’, weight=2

To node ‘6’, weight=2

From Node ‘2’ / row_3

1 1 6 1

To node ‘1’, weight=1

To node ‘6’, weight=1

From Node ‘3’ / row_4

1 2

To node ‘1’, weight=2

From Node ‘4’ / row_5

1 2 3 1

To node ‘1’, weight=2

To node ‘3’, weight=1

From Node ‘5’ / row_6

1 3 6 4

To node ‘1’, weight=3

To node ‘6’, weight=4

From Node ‘6’ / row_7

3 3

To node ‘3’, weight=3

Example 2 (‘2.txt’):

 

C# Console App Code

Version ‘1’ in c# Console App:   LongestPath_v1.cs   1.txt   2.txt

Version ‘2’ in c# Console App:   LongestPath_v2.cs   1.txt   2.txt

 

 

 


Longest Path – where individual nodes do re-occur in the longest path

The Longest Path Algorithm, where some of the nodes occur multiple times inside the longest path could be used in shotgun sequencing to sequence an unknown genome and derive the ‘Whole Genome Sequence’ (WGS).

When the same nodes reoccur twice in the path, it forms a closed loop path. Program can get caught inside the closed loop, keep-on circulating and getting infinity travelling distance, creating program logic error. Therefore, we can put these closed loop paths into nodes’ separate queues ‘q_prevPath_closedloop’ to avoid program freeze.

When same nodes occur three times in the path, we have 2 closed loops. In such case, we need to manually identify which closed loops should situated in first position; which closed loops at the subsequent position; alternately we can remove both closed loops out of the longest path; or only keeping one out of two.

Looping paths can reoccur multiple times along the route; they can be confusingly intertwined with each other’s; they can be jumping back and forth inside the longest path. However, these are my speculations. I need more times and information to verify them.

These looping nodes can be either included or excluded from the actual longest path we are looking for. Moreover, the last node in the longest path might be situated inside one of these closed loop paths.

Hence, we need to manually verify all these closed loop paths. To determine whether we ought to keep them inside the longest path.

The Longest Path Algorithm with re-occurring nodes are derived from Longest Path Algorithm without re-occurring nodes. All nodes will have an additional new queue ‘q_prevPath_closedloop’ to record any occurrence of closed loop paths to this node.

 

Graph_example_3.jpg for Example 3 (‘3.txt’): Red color path are the longest path

Example 3 (‘3.txt’): Red color path are the longest path. Orange color paths are the closed loops.

 

 

C# Console App Code

Version ‘1’ in c# Console App:   CShape_LongestPath_w_duplicateNodes.cs   3.txt

Version ‘2’ in c# Console App:   CShape_LongestPath_w_duplicateNodes_v2.cs   3.txt

 

 


Shotgun Sequencing method to sequence unknown genome using Longest Path algorithm (where individual nodes re-occur in the longest path)

In shotgun sequencing, DNA is broken up randomly into numerous small segments, which are sequenced using the chain termination method to obtain reads. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs (I think is the longest path algorithm) then use the overlapping ends of different reads to assemble them into a continuous sequence. Reference from Wikipedia.

 

Shotgun sequencing to assembling an genome

Assembly - Screenshot from lecture note written by Dr. Ben Langmead and Jacob Pritt from Johns Hopkins University

 

Image Courtesy of National Human Genome Research Institute

Image Courtesy of National Human Genome Research Institute

 

The appropriate computer programs to be used in shotgun sequencing I think will be the Longest Path Algorithm, where some nodes can occur multiple times in the longest path. I not Genome Sequence expert, whatever I comments here are from my personal knowledge and might not be accurate.

Some of the information here are reference from online learning course, ‘Algorithms for DNA Sequencing’ provided by Johns Hopkins University via Coursera website. In their lectures, they talked about shotgun sequencing, overlap graph produced and eventually deriving full genome sequence from the overlap graph. The overlap graph problem I think is the Longest Path algorithm.

 

A close up of a map

Short genome’s Overlap Graph by Dr. Ben Langmead and Jacob Pritt from Johns Hopkins University

Nodes: all 6-mers from “GTACGTACGAT

Edges: only select overlaps of length ≥4

 

Lecture notes by Dr. Ben Langmead and Jacob Pritt from Johns Hopkins University on Coursera website

1.     asm__overlaps.pdf

2.     asm__overlap_graph.pdf

 

To produce sequence overlap graph, first we need to find out the connections between all the nodes. The way of doing so is to set fix length of sequence fragments as the nodes, we call it the ‘k-mer’. ‘k-mer’s refer to all the possible subsequences (of length k) from a read obtained through DNA Sequencing.

When sequencer machine fragments a large DNA molecule and acquire reads from these DNA sequence fragments, the machine will save the reads result into ‘FASTQ’ format file. FASTQ format is a text-based format for storing both a biological sequence (usually nucleotide sequence) and its corresponding quality scores. Each reads in the FASTQ file are 100 characters in length, four bases: G (guanine), C (cytosine), A (adenine) and T (thymine).

We can set ‘k-mer’ length for each node as 50 characters. Wherever there are occurrences of a k-mer substring (node) next to another k-mer substring (node), we increase the respective edge weight by one. Hence we created our 50-mer overlap graph, then we use our algorithm to find the longest path / possible sequence.

By making ‘k-mer’ nodes’ substring length longer characters, we can reduce the occurrences of closed loops, reduce total number of nodes, making the graph smaller and lesser computing efforts. But on other hands, if there are sequencer machine errors on individual reads, the longest path result we are getting might not be accurate.

Biological genome data have duplicate repeated sequences / segments. Therefore these duplicate sequences create problem for people trying to do whole genome sequencing. Furthermore, there can be some sequencing errors from the machine, natural variation in each individuals’ genome (reference from Dr. Ben: approx__intro.pdf).

Hypothetically the more sequences data we collected and add into nodes’ weight, the more likely we can derive our correct whole genome sequence, mitigating the impact of sequencer machine errors.

Will my longest path algorithm able to successfully produce the correct genome sequence result from the un-sequenced ‘fastq’ files? This is an answer I would like to find out too. On the internet, there are only handfuls of ‘fastq’ (un-sequenced reads) files I can find; these ‘fastq’ files usually don’t have the corresponding ‘fasta’ (sequenced genome) file, which would allow me to verify the correctness of my algorithm. Furthermore these ‘fastq’ files inside the internet have very big file size. I don’t know whether my computer is fast enough to handle these big files. Currently I only able to find 1 set of ‘phi-X174 complete genome’ ‘fasta’ file and it’s partial ‘fastq’ file.

 

 

C# Console App Code

Source files

·       The ‘ERR266411_1.for_asm.fastq’ file attached here is a partial un-sequenced phi-X174 genome reads. It only contains first 748 characters from the ‘phi-X174 complete genome’.

·       phi-X174 complete genome.txt’ file

 

When starting node is known

This is an unlikely scenario but if we know the starting node, it will save us a lot of computing efforts in getting the output sequenced genome.

From our below example ‘fastq’ source file, we will only able to get partial phi-X174 genome. Because our ‘fastq’ source file does not contain all the reads from the phi-X174 genome. Therefore, the output sequenced genome we are getting here is only the first 748 characters from ‘phi-X174 complete genome.txt’. But it allows us to run on any machines and short program execution times because the ‘fastq’ file size is small.

The known 50-mer starting node for the phi-X174 genome is: ‘GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTG’

Version ‘1’ in c# Console App:   LongestPath_ShotgunOverlapGraph_knownStartingNode.cs   ERR266411_1.for_asm.fastq

Version ‘2’ in c# Console App:   LongestPath_ShotgunOverlapGraph_knownStartingNode_v2.cs   ERR266411_1.for_asm.fastq

 

When starting node is unknown

This create hinders to our job. Because in our longest path algorithm, we need to know our starting node.

We will have to tries-and-errors by putting every first 50 characters in the fastq file text line as the starting nodes. All the possible sequence results will be saved into text file, ‘possible_full_sequence_results.txt’. We will have to decide manually which is the correct sequence from all the possible sequence results.

Version ‘1’ in c# Console App:   LongestPath_ShotgunOverlapGraph_unknownStartingNode.cs   ERR266411_1.for_asm.fastq

 Version ‘2’ in c# Console App:   LongestPath_ShotgunOverlapGraph_unknownStartingNode_v2.cs   ERR266411_1.for_asm.fastq

 Version ‘3’

Below coding utilize multithreading in searching the Longest Path from different Starting Node. Computation time are reduced by number of threads available. When all the Longest Paths with different starting node are derived, save the results into text file “possible_full_sequence_results”.

While the coding deriving Longest Path, it will also save Contig records. Dead end path & disconnected bits of graph due to genome sequencing errors will be ignored.

These smaller or same weight Contigs path are divergence paths from the longest path. Contigs path will eventually join back to the longest path. Contig (contiguous sequences) can be:

·       polyploidy - inherit two copies, two versions of that chromosome, one from Paternal, one from Maternal. Hence these divergence path’s weight should be similar to the longest path

·       edges that are transitively inferable to longest path edges, implying each other

·       Bits of the genome in between the repetitive bits.

·       Sequencing machine error – number of such occurrence (path weight) should be noticeably smaller

Contigs are used to verify assembly genome and can contains useful information such as polyploidy. Longest Path might not be the actual assembly genome due to over-collapsing repeat or contig/longest path edges inferable to each other.

c# Console App:   LP_OverlapGraph_unknownStartingNode_multithreads_Contig_v3.cs   ERR266411_1.for_asm.fastq

Version ‘4’ in c# Console App:   LP_OverlapGraph_unknownStartingNode_multithreads_Contig_v4.cs   ERR266411_1.for_asm.fastq

Version ‘5’

Human genome base pairs can up to ‘3,088,286,401’, bigger than C# default integer range -2,147,483,648 to 2,147,483,647. That’s mean C# List, Dictionary, Array, Queue are having size limit constrained by integer size.

Human genome has 23 paired chromosomes, sum up together can up to ‘3,088,286,401’ base pairs. During genome sequencing, only single chromosome should be sequenced each time. Hence will not hit C# List, Dictionary, Array, Queue size limit.

c# Console App:   LP_OverlapGraph_unknownStartingNode_multithreads_Contig_v5.cs   ERR266411_1.for_asm.fastq

 

 

 

 

 

 

Edit date: 2 June 19