Shortest Superstring

 

 

Introduction

The shortest superstring is given a numerous number of short strings, find the shortest superstring that has all the short strings as substrings / subsequences.

The Shortest Common Superstring (SCS) and Longest Common Subsequence (LCS) technique only applicable to only two short strings, overlapping both short strings as much as possible to derive the shortest superstring. When there are 3 or more short strings, Shortest Common Superstring and Longest Common Subsequence (LCS) is not applicable.

Shortest Superstring like a riddle waiting for answers.

 

 

Trial & Error Shortest Superstring method

‘Trial and Error’ / ‘generate and test’ approach will obtain the correct superstring result.

Repeated, comprehensive attempts with all the possible combinations to check and find the Shortest Superstring. Leave no stones unturned, hence computing intensive. Take long times to generate the result.

·       With N numbers of short strings, create N + 1 number of queues

·       Enqueue all the short strings into queue[0]

·       Dequeue the first short string ‘SS1’ from queue[0]

·       Enqueue the remaining short strings (other than first short string ‘SS1’) into queue[1]

·       Dequeue first queue[1] (short string ‘SS2’), combine short string ‘SS1’ & short string ‘SS2’ with Shortest Common Superstring.

·       Again, enqueue all the remaining short strings (other than short string ‘SS1’, ‘SS2’) into queue[2]. Repeat the process.

·       When no more short strings left in queue[N], go one level up to queue[N-1], dequeue the first queue[N-1] and repeat the process again

·       When a complete Superstring is derived, check is this the shortest superstring? If yes, save this superstring result into a list.

·       Repeat until no more short strings left in queue[0]~queue[N].

 

 

Trial & Error Shortest Superstring Method source files

ShortestSuperstring_1.cs     test1.txt

 

 

Greedy Shortest Common Superstring Method

To make the process faster some people are proposing alternative methods, one of them is Greedy Shortest Common Superstring (Greedy SCS) method.

Greedy SCS execute faster than Trial & Error method, but does not produce correct Shortest Superstring results.

From the short strings List, Greedy SCS program will find a pair of short strings inside the List that has the Shortest Common Superstring (SCS) and Longest Common Subsequence (LCS). Remove this pair of short strings from the list, combine them together into one SCS and then add this SCS back to the list. Repeat the process of finding the next SCS pair in the List. Until eventually only one string (supposing the Shortest Superstring) left inside the List.

Greedy Shortest Common Superstring Method source code

Single thread programming:   Greedy SCS_code_3.cs       Greedy_Example_1.txt

Multi-threading programming:  Greedy SCS_code_4_multithread.txt       Greedy_Example_1.txt

 

 

First to Match Shortest Superstring method

This is an approach thought by me. Each of the short string will take turns to become the first in the long string, find one short string in the queue that has the Shortest Common Superstring (SCS) with this first short string (SS). Append these pair together to become the new long string. Repeat the process of find another short string in the queue that has the Shortest Common Superstring with the current long string, append them.

When the queue is empty, record the result. Take another short string to be the first string in long string. Repeat the process to get a complete long string. Compare which long string has the shortest length, these would be the Shortest Superstring.

Similar to Greedy SCS method, this approach does not always produce correct Shortest Superstring result.

First to Match Shortest Superstring Method source files

When all the short strings are exact length: ShortestCommonSuperstring_SameLength_9.cs    ShortString_1.txt

When each short strings’ length is different, additional computation efforts are needed to identify them:  ShortestCommonSuperstring_11_variableLength.cs     3.txt

 

 

Note: Please use ‘.NET Framework 4.6.2’ or latest to compile as some commands may only available in the latest .NET Framework.

 

 

Example:

Input:      str1 = “cdef”, str2 = “fghi”, str3 = “defg”, str4 = “abcd

Output:   abcdefghi

 

Input:      str1 = “BABBAA”, str2 = “ABBAAA”, str3 = “BBAAB”

Shortest Common Superstring method result:    “BABBAAABBAAB”   12 characters

Trial and Error method result:                                 “BABBAABBAAA” or “BBAABABBAAA”    11 characters

 

 


C# Console App Code

 

1.     All the short strings will be placed at the first position so that all combinations are checked to find the Shortest Superstring.

foreach (string str_rawReadFile in rawReadFile)

   ShortStrings_TryoutList[0].Enqueue(str_rawReadFile);

 

 

 

 

 


Implementation of Shortest Superstring in Shotgun Sequencing method to sequence unknown genome

 

DNA is duplicated & fragmented randomly into multiple small overlapping reads segment in shotgun sequencing. Computer programs use the overlapping ends of different reads to assemble them into a continuous sequence. Reference from Wikipedia.

We may use Shortest Superstring method to assemble these overlapping ends of different reads into a continuous sequence. But because Shortest Superstring (SS) does not consider number of occurrence (path weight) of individual short strings or k-mer, therefore ‘Shortest Superstring’ cannot mitigate sequencer machine errors, natural variation in each individuals’ genome. Hence, I think it is unlikely ‘Shortest Superstring’ can derive decently accurate continuous sequence.

When some reads are one/two characters different from adjacent overlapping reads due to sequencer machine error, number of such errors occurrence should be much smaller than the correct reads. Hence deriving a continuous sequence using number of occurrences should be able to mitigate the sequencer machine error.

We can see from ‘Figure 1’, some reads are one/two characters different from adjacent overlapping reads (maybe due to sequencer machine errors). Hence ‘Shortest Superstring’ might not able to derive the correct continuous sequence. One way to counter this is to use approximate string matching to allow some errors tolerance during the assembly. But the output sequences will still have many errors.

 

 

 

 

 

 

 

 

Shotgun sequencing to assembling an genome

Figure 1: Assembly - Screenshot from lecture note written by Dr. Ben Langmead and Jacob Pritt from Johns Hopkins University

 

 

 

Trial & Error Shortest Superstring & Approximate Matching methods in Sequencing

1.     Version 1: This method slow. Implementing ‘Approximate Contains’ & ‘Approximate Equals’.

During the SCS (Shortest Common Superstring):

·       if both strings length > 40, allowable Edit Distance = 3

·       if both strings length > 30, allowable Edit Distance = 2

·       if both strings length > 20, allowable Edit Distance = 1

·       Else if strings shorter than 20, use Exact String Matching method.

ShortestSuperstring_Genome_TryError.cs     ERR266411_1.for_asm.fastq

2.     Version 2

In Genome, different mismatches will have different penalties.

·       ‘Transition’ mismatches (A<->G, C<->T) are more likely to occur. Hence penalty ‘2’

·       ‘Transversions’ mismatches penalty ‘4’

·       ‘Gap’ (or deletion) mismatches are least likely to occur, therefore penalty of ‘8’.

The ‘Penalty Matrix’ table is call ‘Global Alignment’.

ShortestSuperstring_Genome_TryError_globalAlign.cs     ERR266411_1.for_asm.fastq

 

A screen shot of a social media post

Description automatically generated

Figure 2: The ‘Penalty Matrix’ / ‘Global Alignment’ table. Lecture note screenshot by Dr. Ben Langmead and Jacob Pritt from Johns Hopkins University

 

A close up of a map

Description automatically generated

Figure 3: Human transition to transversion ratio (AKA ti/tv) is ~2.1. Lecture note screenshot by Dr. Ben Langmead and Jacob Pritt from Johns Hopkins University

 

 

 

Greedy Shortest Common Superstring & Approximate Matching methods in Sequencing

1.     Single thread programming

ShortestSuperstring_Genome_GreedySCS_4.txt      ERR266411_1.for_asm.fastq

2.     Multithreading programming

ShortestSuperstring_Genome_GreedySCS_5_multithread.cs      ERR266411_1.for_asm.fastq

3.     Implementing ‘Penalty Matrix’ / ‘Global Alignment’ table

ShortestSuperstring_Genome_GreedySCS_7_multithread_globalAlign.txt      ERR266411_1.for_asm.fastq

 

 

 

 

 

Edit date: 06 June 19