Longest Path –
where individual nodes do not re-occur in the longest path
The aim of
finding the longest possible path in a graph is to find a path that can
cover the longest possible travelling distance in a graph, from a known
starting node to an unknown ending node. Inside the longest path,
individual nodes do not re-occur.
In my Longest Path coding,
I am using coding similar to shortest path algorithm. But longest path
programming will have additional coding to check do targeted destination
node re-occur inside current node’s longest path. The targeted destination
node will not again add inside current node’s longest path if found
re-occurred.
Like shortest path, the
sub-path of the longest path itself supposing are also the longest path.
When we are traveling
from nodes to nodes, we can be possibly circulating around closed loops.
The loops can be between nodes next to each other; the loops could be
numerous numbers of nodes that linked together in big circles. Shortest
path doesn’t have path travelling in closed loop problems; longest path
needs to cater coding to prevent the path from keep-on circulating in
closed loop, causing the total travelling distance to be infinity.
Shortest path algorithm
coding I wrote in year 2004, my personal efforts without referring to any
other people code.
Shortest path is very useful algorithm, used in various
area like traffic navigation, network routing. Longest path algorithm on
other hands maybe can be used in game engine. Where the character urged to
acquire as much loots (nodes) as possible.
However, there are some scenarios when coding cannot
derive the correct Longest Path from a graph. When the coding acquiring the
longest path, it will add into the nearby nodes’ ‘sum of distance’. Making
nearby nodes’ ‘sum of distance’ gradually bigger when farther away, and the
coding will always choose the largest ‘sum of distance’. The problems are
not that easily able to be resolved. In contrast to Shortest Path, while
farther nodes’ ‘sum of distance’ become bigger and bigger, algorithm will
always choose the smallest possible ‘sum of distance’.
The Longest Path coding is implementing near-to-far
method to derive the longest path. It will start deriving the nearby nodes
(from first node) ‘sum of longest distance’, and the derivations will be
spreading out to the farther away nodes. When the graph has many
interlinking, two-way routes among all the nodes, the coding might fail to
derive longest path appropriately. Also, when first node and the longest
path’s last node linkages are found to be nearby each other, coding might
not able to derive the correct longest path.
Code
1.
Create a queue ‘q’ to store nodes that we intend to
search for longest path. We don’t need to save all the nodes at once into
the queue at the start. Only add the first node into the queue to kick
start the searching process. Create a variable ‘Ptr’ to store currently program is at which node.
Queue<int> q = new Queue<int>();
q.Enqueue(firstNode);
int Ptr;
|
2.
Run the search while queue ‘q’ is not empty. Dequeue ‘q’ and record targeted node into ‘Ptr’.
while (q.Count != 0)
{
Ptr = q.Dequeue();
|
3.
Sort the current node’s destinations
by weight in descending order (‘myMap[Ptr].dict_Destinations_Weight.OrderByDescending(key
=> key.Value)’). Run the check from current
node to all its’ destination nodes from the destination with the biggest
weight. This step sometimes is reductant.
Calculate the sum of “current
node’s total travelling distance + from current node to destination node
travelling distance”. There should not be any path pointing back to
‘firstNode’ as this will cause the program to keep-on looping and freeze.
foreach (KeyValuePair<string, int> destinations in
myMap[Ptr].dict_Destinations_Weight.OrderByDescending(key =>
key.Value))
{
if (Int32.Parse(destinations.Key) !=
firstNode) //When this targeted destination node is not the first node
{
int sum = myMap[Ptr].longestDistance +
destinations.Value;
|
4.
Check if the targeted
destination node is being visited before by checking if ‘longestDistance’ equal to
‘0’. If targeted destination node is not been visited, set the longest path
into targeted destination node (‘prePath’
= current node). Push the target destination node into queue ‘q’ for the
program to run on the target destination node.
if
(myMap[Int32.Parse(destinations.Key)].longestDistance == 0)
{
//When this targeted destination node
is not been visited before
myMap[Int32.Parse(destinations.Key)].longestDistance = sum;
myMap[Int32.Parse(destinations.Key)].prevPath = Ptr;
if
(!q.Contains(Int32.Parse(destinations.Key)))
q.Enqueue(Int32.Parse(destinations.Key));
}
|
5.
If the target destination node
was visited previously (‘longestDistance’
having some value other than default ‘0’), check if calculated ‘sum’ is bigger than the destination
node’s recorded total travelling distance. If calculated ‘sum’ is bigger, enter the next
process; else if smaller, do nothing. Subsequently check if target
destination node exists inside current node’s longest path. If the target
destination node does not exist inside current node’s longest path, set the
current node’s longest path into targeted destination node (‘prePath’ = current node). Push the
target destination node into queue ‘q’ for the program to run on the target
destination node.
else
{
//Check if the new path's
total distance longer than the one recorded inside destination node
if
(myMap[Int32.Parse(destinations.Key)].longestDistance < sum)
{
//Check if this targeted destination node is re-occur inside
the current node's longest path
int
temp_Ptr = Ptr;
bool
targeted_destination_node_reoccur = false;
while
(myMap[temp_Ptr].prevPath != -1 &&
!targeted_destination_node_reoccur)
{
if
(myMap[temp_Ptr].prevPath == Int32.Parse(destinations.Key))
targeted_destination_node_reoccur = true;
else
temp_Ptr =
myMap[temp_Ptr].prevPath;
}
//If this targeted destination node is not re-occur inside the current
node longest path
if
(!targeted_destination_node_reoccur)
{
myMap[Int32.Parse(destinations.Key)].longestDistance = sum;
myMap[Int32.Parse(destinations.Key)].prevPath = Ptr;
if
(!q.Contains(Int32.Parse(destinations.Key)))
q.Enqueue(Int32.Parse(destinations.Key));
}
}
}
|
6.
After queue is empty, find which
node having the maximum total travelling distance, ‘longestDistance’. This node would be the longest path from
starting node.
int longestPath_Sum = 0,
farthestPath_Location = firstNode;
for (int i = 0; i < total_nodecount; i++)
{
if (longestPath_Sum <
myMap[i].longestDistance)
{
longestPath_Sum = myMap[i].longestDistance;
farthestPath_Location = i;
}
} |
C# Console App
Programming

Example
1 (‘1.txt’): Red color paths are the longest path

Example
2 (‘2.txt’): Red color path are the longest path
|
Read from ‘2.txt’
|
Descriptions
|
From Node ‘0’
/ row_1
|
1 1 2 2 3 3 4
4
|
To node ‘1’, weight=1
To node ‘2’, weight=2
To node ‘3’, weight=3
To node ‘4’, weight=4
|
From Node ‘1’
/ row_2
|
5 2 6 2
|
To node ‘5’, weight=2
To node ‘6’, weight=2
|
From Node ‘2’
/ row_3
|
1 1 6 1
|
To node ‘1’, weight=1
To node ‘6’, weight=1
|
From Node ‘3’
/ row_4
|
1 2
|
To node ‘1’, weight=2
|
From Node ‘4’
/ row_5
|
1 2 3 1
|
To node ‘1’, weight=2
To node ‘3’, weight=1
|
From Node ‘5’
/ row_6
|
1 3 6 4
|
To node ‘1’, weight=3
To node ‘6’, weight=4
|
From Node ‘6’
/ row_7
|
3 3
|
To node ‘3’, weight=3
|
Example
2 (‘2.txt’):
C# Console App Code
Version ‘1’ in c# Console App: LongestPath_v1.cs 1.txt 2.txt
Version ‘2’ in c# Console App: LongestPath_v2.cs 1.txt 2.txt
Longest Path –
where individual nodes do re-occur in the longest path
The Longest Path Algorithm, where some of the nodes
occur multiple times inside the longest path could be used in shotgun sequencing
to sequence an unknown genome and derive the ‘Whole Genome
Sequence’ (WGS).
When
the same nodes reoccur twice in the path, it forms a closed loop path.
Program can get caught inside the closed loop, keep-on circulating and
getting infinity travelling distance, creating program logic error.
Therefore, we can put these closed loop paths into nodes’ separate queues
‘q_prevPath_closedloop’ to avoid program freeze.
When
same nodes occur three times in the path, we have 2 closed loops. In such
case, we need to manually identify which closed loops should situated in
first position; which closed loops at the subsequent position; alternately
we can remove both closed loops out of the longest path; or only keeping
one out of two.
Looping
paths can reoccur multiple times along the route; they can be confusingly
intertwined with each other’s; they can be jumping back and forth inside
the longest path. However, these are my speculations. I need more times and
information to verify them.
These
looping nodes can be either included or excluded from the actual longest
path we are looking for. Moreover, the last node in the longest path might
be situated inside one of these closed loop paths.
Hence,
we need to manually verify all these closed loop paths. To determine
whether we ought to keep them inside the longest path.
The
Longest Path Algorithm with re-occurring nodes are derived from Longest
Path Algorithm without re-occurring nodes. All nodes will have an
additional new queue ‘q_prevPath_closedloop’ to record any occurrence of
closed loop paths to this node.

Example
3 (‘3.txt’): Red color path are the longest path. Orange color paths are
the closed loops.
C# Console App Code
Version ‘1’ in c# Console App: CShape_LongestPath_w_duplicateNodes.cs 3.txt
Version ‘2’ in c# Console App: CShape_LongestPath_w_duplicateNodes_v2.cs 3.txt
Shotgun Sequencing method to sequence unknown genome using Longest Path
algorithm (where individual nodes re-occur in the longest path)
In shotgun
sequencing, DNA is broken up randomly into numerous small segments,
which are sequenced using the chain termination method to obtain reads.
Multiple overlapping reads for the target DNA are obtained by performing
several rounds of this fragmentation and sequencing. Computer programs (I
think is the longest path algorithm) then use the overlapping ends of
different reads to assemble them into a continuous sequence. Reference from Wikipedia.

Assembly -
Screenshot from lecture note written by Dr. Ben Langmead and Jacob Pritt
from Johns Hopkins University

Image Courtesy of
National Human Genome Research Institute
The appropriate computer programs to be used in shotgun sequencing
I think will be the Longest Path Algorithm, where some nodes can occur
multiple times in the longest path. I not Genome Sequence expert, whatever
I comments here are from my personal knowledge and might not be accurate.
Some of the information here are reference from online learning course,
‘Algorithms
for DNA Sequencing’ provided by Johns Hopkins University via Coursera
website. In their lectures, they talked about shotgun sequencing,
overlap graph produced and eventually deriving full genome sequence from
the overlap graph. The overlap graph problem I think is the Longest Path
algorithm.

Short genome’s Overlap Graph by Dr. Ben Langmead and Jacob Pritt
from Johns Hopkins University
Nodes:
all 6-mers from “GTACGTACGAT”
Edges: only select
overlaps of length ≥4
Lecture notes by Dr. Ben Langmead and Jacob Pritt from Johns Hopkins
University on Coursera website
1.
asm__overlaps.pdf
2.
asm__overlap_graph.pdf
To produce sequence overlap graph, first we need to find out the
connections between all the nodes. The way of doing so is to set fix length
of sequence fragments as the nodes, we call it the ‘k-mer’. ‘k-mer’s refer to all the
possible subsequences (of length k) from a read obtained through DNA
Sequencing.
When sequencer machine fragments a large DNA molecule and acquire reads
from these DNA sequence fragments, the machine will save the reads result
into ‘FASTQ’ format
file. FASTQ format is a text-based format for storing both a biological
sequence (usually nucleotide sequence) and its corresponding quality
scores. Each reads in the FASTQ file are 100 characters in length, four
bases: G (guanine),
C (cytosine),
A (adenine)
and T (thymine).
We can set ‘k-mer’ length for
each node as 50 characters. Wherever there are occurrences of a k-mer
substring (node) next to another k-mer substring (node), we increase the
respective edge weight by one. Hence we created our 50-mer overlap graph,
then we use our algorithm to find the longest path / possible sequence.
By making ‘k-mer’ nodes’
substring length longer characters, we can reduce the occurrences of closed
loops, reduce total number of nodes, making the graph smaller and lesser
computing efforts. But on other hands, if there are sequencer machine
errors on individual reads, the longest path result we are getting might
not be accurate.
Biological genome data have duplicate repeated sequences / segments.
Therefore these duplicate sequences create problem for people trying to do whole genome
sequencing. Furthermore, there can be some sequencing errors from the
machine, natural variation in each individuals’ genome (reference from Dr.
Ben: approx__intro.pdf).
Hypothetically the more sequences data we collected and add into nodes’
weight, the more likely we can derive our correct whole genome sequence,
mitigating the impact of sequencer machine errors.
Will my longest path algorithm able to successfully produce the correct
genome sequence result from the un-sequenced ‘fastq’ files? This is an
answer I would like to find out too. On the internet, there are only
handfuls of ‘fastq’ (un-sequenced reads) files I can find; these ‘fastq’
files usually don’t have the corresponding ‘fasta’ (sequenced
genome) file, which would allow me to verify the correctness of my
algorithm. Furthermore these ‘fastq’ files inside the internet have very
big file size. I don’t know whether my computer is fast enough to handle
these big files. Currently I only able to find 1 set of ‘phi-X174
complete genome’ ‘fasta’ file and it’s partial ‘fastq’ file.
C# Console App Code
Source files
·
The ‘ERR266411_1.for_asm.fastq’
file attached here is a partial un-sequenced phi-X174 genome reads.
It only contains first 748 characters from the ‘phi-X174 complete
genome’.
·
‘phi-X174
complete genome.txt’ file
When starting node is known
This is an unlikely scenario but
if we know the starting node, it will save us a lot of computing efforts in
getting the output sequenced genome.
From our below example ‘fastq’
source file, we will only able to get partial phi-X174 genome. Because our
‘fastq’ source file does not contain all the reads from the phi-X174
genome. Therefore, the output sequenced genome we are getting here is only
the first 748 characters from ‘phi-X174 complete genome.txt’. But it allows
us to run on any machines and short program execution times because the
‘fastq’ file size is small.
The known 50-mer starting node
for the phi-X174 genome is:
‘GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTG’
Version ‘1’ in c# Console App: LongestPath_ShotgunOverlapGraph_knownStartingNode.cs ERR266411_1.for_asm.fastq
Version ‘2’ in c# Console App: LongestPath_ShotgunOverlapGraph_knownStartingNode_v2.cs ERR266411_1.for_asm.fastq
When starting node is unknown
This create hinders to our job.
Because in our longest path algorithm, we need to know our starting node.
We will have to tries-and-errors
by putting every first 50 characters in the fastq file text line as the
starting nodes. All the possible sequence results will be saved into text
file, ‘possible_full_sequence_results.txt’. We will have to decide manually
which is the correct sequence from all the possible sequence results.
Version ‘1’ in c# Console App: LongestPath_ShotgunOverlapGraph_unknownStartingNode.cs ERR266411_1.for_asm.fastq
Version ‘2’
in c# Console App: LongestPath_ShotgunOverlapGraph_unknownStartingNode_v2.cs ERR266411_1.for_asm.fastq
Version
‘3’
Below coding utilize
multithreading in searching the Longest Path from different Starting Node.
Computation time are reduced by number of threads available. When all the Longest
Paths with different starting node are derived, save the results into text
file “possible_full_sequence_results”.
While the coding deriving
Longest Path, it will also save Contig records. Dead end path &
disconnected bits of graph due to genome sequencing errors will be ignored.
These smaller or same weight
Contigs path are divergence paths from the longest path. Contigs path will
eventually join back to the longest path. Contig (contiguous sequences) can
be:
·
polyploidy - inherit two copies,
two versions of that chromosome, one from Paternal, one from Maternal.
Hence these divergence path’s weight should be similar to the longest path
·
edges that are transitively
inferable to longest path edges, implying each other
·
Bits of the genome in between
the repetitive bits.
·
Sequencing machine error –
number of such occurrence (path weight) should be noticeably smaller
Contigs are used to verify
assembly genome and can contains useful information such as polyploidy.
Longest Path might not be the actual assembly genome due to over-collapsing
repeat or contig/longest path edges inferable to each other.
c# Console App: LP_OverlapGraph_unknownStartingNode_multithreads_Contig_v3.cs ERR266411_1.for_asm.fastq
Version ‘4’ in c# Console App: LP_OverlapGraph_unknownStartingNode_multithreads_Contig_v4.cs ERR266411_1.for_asm.fastq
Version ‘5’
Human genome base pairs
can up to ‘3,088,286,401’, bigger than C# default integer range -2,147,483,648
to 2,147,483,647. That’s mean C# List, Dictionary, Array, Queue are having
size limit constrained by integer size.
Human
genome has 23 paired chromosomes, sum up together can up to ‘3,088,286,401’
base pairs. During genome sequencing, only single chromosome should be
sequenced each time. Hence will not hit C# List, Dictionary, Array, Queue
size limit.
c# Console App: LP_OverlapGraph_unknownStartingNode_multithreads_Contig_v5.cs ERR266411_1.for_asm.fastq
|