Introduction
The
shortest superstring is given a numerous number of short strings, find the shortest
superstring that has all the short strings as substrings / subsequences.
The Shortest
Common Superstring (SCS) and Longest
Common Subsequence (LCS) technique only applicable to only two
short strings, overlapping both short strings as much as possible to derive
the shortest superstring. When there are 3 or more short strings, Shortest
Common Superstring and Longest
Common Subsequence (LCS) is not applicable.
Shortest Superstring like a riddle waiting
for answers.
Trial &
Error Shortest Superstring method
‘Trial and Error’ / ‘generate and test’
approach will obtain the correct superstring result.
Repeated, comprehensive attempts with all
the possible combinations to check and find the Shortest Superstring. Leave
no stones unturned, hence computing intensive. Take long times to generate
the result.
·
With N
numbers of short strings, create N + 1 number of queues
·
Enqueue
all the short strings into queue[0]
·
Dequeue the
first short string ‘SS1’ from queue[0]
·
Enqueue
the remaining short strings (other than first short string ‘SS1’) into
queue[1]
·
Dequeue
first queue[1] (short string ‘SS2’), combine short string ‘SS1’ & short
string ‘SS2’ with Shortest Common Superstring.
·
Again,
enqueue all the remaining short strings (other than short string ‘SS1’,
‘SS2’) into queue[2]. Repeat the process.
·
When no
more short strings left in queue[N], go one level up to queue[N-1], dequeue
the first queue[N-1] and repeat the process again
·
When a
complete Superstring is derived, check is this the shortest superstring? If
yes, save this superstring result into a list.
·
Repeat
until no more short strings left in queue[0]~queue[N].
Trial &
Error Shortest Superstring Method source files
ShortestSuperstring_1.cs test1.txt
Greedy Shortest
Common Superstring Method
To make the process faster some people are
proposing alternative methods, one of them is Greedy Shortest Common
Superstring (Greedy SCS) method.
Greedy SCS execute faster than Trial &
Error method, but does not produce correct Shortest Superstring results.
From the short strings List, Greedy SCS
program will find a pair of short strings inside the List that has the
Shortest Common Superstring (SCS) and Longest Common Subsequence (LCS).
Remove this pair of short strings from the list, combine them together into
one SCS and then add this SCS back to the list. Repeat the process of
finding the next SCS pair in the List. Until eventually only one string
(supposing the Shortest Superstring) left inside the List.
Greedy Shortest
Common Superstring Method source code
Single thread programming: Greedy
SCS_code_3.cs Greedy_Example_1.txt
Multi-threading programming: Greedy
SCS_code_4_multithread.txt Greedy_Example_1.txt
First to Match
Shortest Superstring method
This is an approach thought by me. Each of
the short string will take turns to become the first in the long string,
find one short string in the queue that has the Shortest Common Superstring
(SCS) with this first short string (SS). Append these pair together to
become the new long string. Repeat the process of find another short string
in the queue that has the Shortest Common Superstring with the current long
string, append them.
When the queue is empty, record the
result. Take another short string to be the first string in long string.
Repeat the process to get a complete long string. Compare which long string
has the shortest length, these would be the Shortest Superstring.
Similar to Greedy SCS method, this
approach does not always produce correct Shortest Superstring result.
First to Match Shortest
Superstring Method source files
When all the short strings are exact
length: ShortestCommonSuperstring_SameLength_9.cs
ShortString_1.txt
When each short strings’ length is
different, additional computation efforts are needed to identify them: ShortestCommonSuperstring_11_variableLength.cs 3.txt
Note: Please use ‘.NET Framework 4.6.2’ or
latest to compile as some commands may only available in the latest .NET
Framework.
Example:
Input: str1
= “cdef”, str2 = “fghi”,
str3 = “defg”, str4 = “abcd”
Output: “abcdefghi”
Input: str1
= “BABBAA”, str2 = “ABBAAA”, str3 = “BBAAB”
Shortest Common Superstring method result: “BABBAAABBAAB” 12 characters
Trial and Error method result:
“BABBAABBAAA” or “BBAABABBAAA”
11 characters
|
C# Console App
Code
1.
All the short
strings will be placed at the first position so that all combinations are
checked to find the Shortest Superstring.
foreach (string str_rawReadFile
in rawReadFile)
ShortStrings_TryoutList[0].Enqueue(str_rawReadFile);
|
Implementation of Shortest Superstring in Shotgun Sequencing method to
sequence unknown genome
DNA is
duplicated & fragmented randomly into multiple small overlapping reads
segment in shotgun
sequencing. Computer programs use the overlapping ends of different
reads to assemble them into a continuous sequence. Reference from Wikipedia.
We
may use Shortest Superstring method to assemble these overlapping ends of
different reads into a continuous sequence. But because Shortest
Superstring (SS) does not consider number of occurrence (path weight) of
individual short strings or k-mer, therefore ‘Shortest Superstring’ cannot
mitigate sequencer machine errors, natural variation in each individuals’
genome. Hence, I think it is unlikely ‘Shortest Superstring’ can derive
decently accurate continuous sequence.
When
some reads are one/two characters different from adjacent overlapping reads
due to sequencer machine error, number of such errors occurrence should be
much smaller than the correct reads. Hence deriving a continuous sequence
using number of occurrences should be able to mitigate the sequencer
machine error.
We can see from ‘Figure 1’, some reads are
one/two characters different from adjacent overlapping reads (maybe due to
sequencer machine errors). Hence ‘Shortest Superstring’ might not able to
derive the correct continuous sequence. One way to counter this is to use approximate
string matching to allow some errors tolerance during the assembly. But
the output sequences will still have many errors.

Figure 1:
Assembly - Screenshot from lecture note written by Dr. Ben Langmead and
Jacob Pritt from Johns Hopkins University
Trial &
Error Shortest Superstring & Approximate Matching methods in Sequencing
1.
Version 1: This method slow. Implementing
‘Approximate Contains’ & ‘Approximate Equals’.
During the SCS (Shortest Common
Superstring):
·
if both strings length > 40, allowable
Edit Distance = 3
·
if both strings length > 30, allowable
Edit Distance = 2
·
if both strings length > 20, allowable
Edit Distance = 1
·
Else if strings shorter than 20, use Exact
String Matching method.
ShortestSuperstring_Genome_TryError.cs ERR266411_1.for_asm.fastq
2.
Version 2
In Genome, different mismatches will have
different penalties.
·
‘Transition’ mismatches (A<->G,
C<->T) are more likely to occur. Hence penalty ‘2’
·
‘Transversions’ mismatches penalty ‘4’
·
‘Gap’ (or deletion) mismatches are least
likely to occur, therefore penalty of ‘8’.
The ‘Penalty Matrix’ table is call ‘Global
Alignment’.
ShortestSuperstring_Genome_TryError_globalAlign.cs ERR266411_1.for_asm.fastq

Figure 2: The ‘Penalty Matrix’ / ‘Global
Alignment’ table. Lecture note screenshot by Dr. Ben Langmead and Jacob
Pritt from Johns Hopkins University

Figure 3: Human transition to transversion
ratio (AKA ti/tv) is ~2.1. Lecture note screenshot
by Dr. Ben Langmead and Jacob Pritt from Johns Hopkins University
Greedy Shortest
Common Superstring & Approximate Matching methods in Sequencing
1.
Single thread programming
ShortestSuperstring_Genome_GreedySCS_4.txt ERR266411_1.for_asm.fastq
2.
Multithreading programming
ShortestSuperstring_Genome_GreedySCS_5_multithread.cs ERR266411_1.for_asm.fastq
3.
Implementing ‘Penalty Matrix’ / ‘Global
Alignment’ table
ShortestSuperstring_Genome_GreedySCS_7_multithread_globalAlign.txt ERR266411_1.for_asm.fastq
|