String Search Algorithm

Introduction

To search a substring within a long string in online mode, we can use Naive Exact Matching algorithm, Boyer Moore search algorithm and custom search algorithm.

In Naive Exact Matching search algorithm, every character in the string will be compared to find the matches from left most character in the string to the right. When found miss-match character, shift the target search string 1 character to the right of source string and continue the comparison.

Boyer Moore search algorithm is good tool to search string within a long string. It theoretically faster than Naive Exact Matching method. Boyer Moore search algorithm bypass a number of characters in a long string without comparison depend on the conditions.

When the source database ('long string') has below characteristics, we can implement Boyer Moore search algorithm/custom search algorithm:

Each nodes/data within the source database are not unique. They are repeatedly occurred inside the source database ('long string').

There are only limited number of distinct nodes/elements (for example 26 letters English alphabets). These nodes/elements repeatedly occur inside the source database.

These nodes/data cannot be quantified. Therefore, we cannot compare which nodes/data are bigger or smaller.

The source database is overall unique because each elements/nodes are aligned in one unique sequence/position.

It is an online, without preprocessing database

Examples

English essay, paragraph - 26 letters English alphabet

English alphabets string - 26 letters English alphabet

DNA sequence - composed from 4 elements /nitrogen-containing nucleobases: 'A', 'G', 'C', 'T'

Web search engine without preprocessing database

Searching words in a document

https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm

In this article, we will have another string search algorithm, which pre-construct string P shifting table before start the search of string P against string T . This will further speed up the string search efficiency.

String search are important algorithms in computing world.

Figure: String search algorithm Example

Loops and Jumpers

Loops and jumpers are seen in my string search algorithm.

Codes are executing back and forth depend on conditions. When conditions do not meet, program will loop back to the start of the comparing process and the process continually repeated. Until certain conditions are reach, the algorithm will jump out of the loop to the next comparing process.

String searching conditions often occur repeatedly after every target search string shift to the right. Therefore, when the condition meet the algorithm will jump back and forth between the comparing processes.

To truly optimize the string search algorithm, this algorithm should be made into transistors circuit and added into CPU as part of the CPU basic instruction.

My version of Boyer Moore search algorithm

In Boyer Moore search algorithm, there are four searching processes:

Good Suffix rule

Bad Character rule

Only last character match shift

Found Exact Match

The search algorithm will be running from either one of these 4 main processes in each cycle.

The search start from string P last character toward first character, right to left direction. This would ensure the shift of string P in string T would be minimum and all possible occurrence of string P in string T are verified.

Figure: String search algorithm without Preprocessing string P Flow Chart

Declare variables

List<int> local_P_in_T_results = new List<int>();

string matched_string = "";

int start_P_in_T = 0;

int end_P_in_T = P.Length - 1;

//Check string 'P' & 'T' are not empty and T string is longer than P string, or else definitely will not be any matches and no reason to proceed the checking

if (!String.IsNullOrEmpty(P) && !String.IsNullOrEmpty(T) && T.Length >= P.Length)

{

int T_ptr;

Special case: when string P only consist of 1 character

String P will compare every character in string T from left to right until found a match, or else return -1 .

if (P.Length == 1) //Special case: if string 'P' only consist of 1 character

{

T_ptr = 0;

while (T_ptr < T.Length)

{

if (P[0] == T[T_ptr])

local_P_in_T_results.Add(T_ptr);

T_ptr++;

}

return local_P_in_T_results;

}

Compare string P against string T from last character toward first character

At the start, check if string 'P' currently still located within string 'T' boundary. If string 'P' last character located outside of string 'T' boundary, no Good Suffix rule & Bad Character rule comparison will be made. Process will goto 'end_string_compare' instead. It also means there are totally no match of P string in source string T.

int P_ptr = P.Length - 1; //Pointer showing position of target search string 'P' last character now

T_ptr = P.Length - 1; //Pointer showing position of target search string 'P' last character in string 'T' now

int shift_P_in_T = 0; //How much string 'P' needed to shift in string 'T' after Good Suffix rule, Bad Character rule or last_character_not_match_shift

string_compare_P_against_T:

if (T_ptr < T.Length)

{

In Boyer Moore search algorithm, we start string comparison from the end of target search string 'P' to to start of 'P' against source string 'T' depend on where 'P' located in "T'. This comparison process will continue to loop until it reaches start of string 'P' or 'P' & 'T' character comparison is found to be not the same.

while (T_ptr >= start_P_in_T && P[P_ptr] == T[T_ptr])

{ //compare string 'P' against string 'T' from last character toward first character, stop whenever there is a mismatch

matched_string = P[P_ptr].ToString() + matched_string;

T_ptr--;

P_ptr--;

}

Bad Character Rule

If we found no matches ('matched_string' is empty), string P last character does not match string T character ( T[T_Ptr] ), we implement Bad Character Rule. Search the occurrence of character T[T_Ptr] in remaining of string P from right to left direction, stop the search when found character T[T_Ptr] in string P .

If character T[T_Ptr] not exist in string P , shift whole string P away from character T[T_Ptr] in string T by P.Length .

If character T[T_Ptr] is found in string P , shift string P so that character in string P matching character T[T_Ptr] in string T .

Afterward restart the string comparison again from string P last character toward first character against string T .

if (matched_string.Length == 0) //Found string 'P' last character not matching string 'T'

{

//Bad Character Rule

while (P_ptr >= 0 && P[P_ptr] != T[T_ptr])

{

P_ptr--;

shift_P_in_T++;

}

Console.Write("Bad Character shift, ");

}

Only Last Character Matched

If we found only string 'P' last character matched ('matched_string.Count() == 1') against string 'T'. Search the occurrence of string 'P' last character or character T[T_Ptr] in remaining of string P from right to left direction, stop the comparison when found string 'P' last character in remaining of string P . or else shift whole string P away from character T[T_Ptr] in string T by P.Length .

'shift_P_in_T' increase by 1

Shift 'T_ptr' pointer location shifted 1 character back to the right, location where P last character located.

Compare remaining of string P against string P last character from right to left when 'P_ptr' pointer is still within string P boundary.

When character not matching, shift 'P_ptr' pointer 1 character to the left; increase 'shift_P_in_T' by 1.

else if (matched_string.Count() == 1)

{

//only string 'P' last character matching against string 'T'

shift_P_in_T = shift_P_in_T + 1;

T_ptr++;

while (P_ptr >= 0 && P[P_ptr] != T[T_ptr])

{

P_ptr--;

shift_P_in_T++;

}

Console.Write("only last character match shift, ");

}

Good suffix rule

If we found string P suffix of two or more characters are matching when comparing end of 'P' against 'T' but 'P' string is not wholly matching in string 'T', Good Suffix shift will be implemented. Compare the partially matched 'P' suffix 'matched_string' with the remaining characters other than last character of 'P' string ( P_ptr = P.Length - 2 ) to see whether if there are a match.

else

{

//P against T matched string is 2 characters or more but not exactly matched, only partially matched at suffix

int matched_string_ptr = matched_string.Count() - 1;

P_ptr = P.Length - 2; //set the P_ptr to be second last characters of P string

//Partial of the matched string has chance to be within the matched string when matched string has repeated characters inside

Create matched string pointer 'matched_string_ptr' = 'matched_string.Count() - 1' to record currently pointer position in 'matched_string'.

Enter 'finding_good_suffix_shift:' process

First check if P pointer 'P_ptr' is still within P string boundary ('P_ptr >= 0')

Check if 'matched_string' current character match with P string current character ('matched_string[matched_string_ptr] == P[P_ptr]')

o If they are match, check if pointer 'P_ptr' or pointer 'matched_string_ptr' reached the left most end character in either P or 'matched_string' string If either pointers reach the left most end of P or 'matched_string' strings and there is at least a match character, process jump to 'found_good_suffix_shift'.

o Else move pointer 'matched_string_ptr' 1 character to the left

//Good suffix rule checking

finding_good_suffix_shift:

if (P_ptr >= 0)

{

finding_good_suffix_shift_2:

if (matched_string[matched_string_ptr] == P[P_ptr])

{

if (P_ptr == 0 || matched_string_ptr == 0)

goto found_good_suffix_shift;

matched_string_ptr--;

}

If 'matched_string' previously suffix partially match with remaining P string but current character does not match with remaining P string current character ('matched_string[matched_string_ptr] != P[P_ptr]'), reset the pointer 'matched_string_ptr' to 'matched_string' last character ( matched_string_ptr = matched_string.Count() - 1 ). Loop back to finding_good_suffix_shift_2 and restart 'matched_string' with remaining P string comparison. No change in 'P_ptr'.

else

{

if (matched_string_ptr != matched_string.Count() - 1)

{

//Reset the matched_string pointer, compare matched string from last character again

matched_string_ptr = matched_string.Count() - 1;

goto finding_good_suffix_shift_2;

}

If found no match between 'matched_string' current character& remaining P string current character, shift pointer 'P_ptr' to the left 1 character.

Loop back to finding_good_suffix_shift, continue the character comparison.

P_ptr--;

goto finding_good_suffix_shift;

When pointer 'P_ptr' out of P string boundary (P_ptr == -1), goto found_good_suffix_shift process

In found_good_suffix_shift process, check P_ptr value

o If P_ptr > 0, that means matched string are found in the mid of remaining P string. shift_P_in_T = P.Length - matched_string.Length - P_ptr.

o If P_ptr == 0, that means matched string is found at the prefix of remaining P string. 'shift_P_in_T = P.Length - matched_string.Length - matched_string_ptr'

o When P_ptr == -1, no matched string is found in the remaining of P string. Therefore bypass this whole string segment, 'shift_P_in_T = P.Length'

found_good_suffix_shift:

if (P_ptr > 0) //Match string are found in the mid of remaining P string

shift_P_in_T = P.Length - matched_string.Length - P_ptr;

else if (P_ptr == 0) //Match string are found in the start of remaining P string

shift_P_in_T = P.Length - matched_string.Length + matched_string_ptr;

else //When P_ptr < 0, no match found in remaining P string

shift_P_in_T = P.Length;

Console.Write("Good Suffix shift, ");

Found Exact Match

If we found 'P' string exactly matching in string 'T', we record the current starting position of string P in string T by adding it into the local_P_in_T_results List.

Shift 1 character to the left ( shift_P_in_T = 1 ) and start the next string P against string T comparison, check whether string P occurred in remaining of string T .

else if (matched_string.Length == P.Length) //found the exact match

{

local_P_in_T_results.Add(start_P_in_T); //record the position of string 'P' in string 'T'

shift_P_in_T = 1;

}

Next cycle of string P against string T comparison

After the above processes, we shift string P in string T by shift_P_in_T derived from the one of the processes. Reset the variables & pointers and start the next cycle of character comparison ( goto string_compare_P_against_T ).

matched_string = "";

end_P_in_T += shift_P_in_T;

start_P_in_T += shift_P_in_T;

Console.WriteLine("start_P_in_T = " + start_P_in_T);

T_ptr = end_P_in_T;

P_ptr = P.Length - 1;

shift_P_in_T = 0;

goto string_compare_P_against_T;

When pointer T_ptr out of string T boundary, we completed the string P against string T comparison. Return the local_P_in_T_results List.

Source code

My version of Boyer Moore search algorithm without pre-constructed string P fast shifting table Source code: Boyer_Moore_search_algorithim.txt

String Search algorithm with pre-constructed string P fast shifting table

Figure: String search algorithm with pre-constructed string P fast shifting table flow chart

When source string T is much longer compare to target search string P , there will be a lot of similar repetitive string comparison conditions. Thus, we can cut short our comparing time and computing power by pre-constructing string P number of shift in string T table before start comparing string P against string T . The number of shift can be found just within string P itself without need to know string T contents.

When string P is short and string T is comparatively long. Construct string P number of shift in string T table become a viable and preferable option.

Compare string P against string T character by character from string P last character to first character. Whenever there is one character mismatch, stop the string comparison immediately.

Bad Character Rule: Construct the P_Bad_Character_shift_table

We will try to search T[T_Ptr] character in string P from string P second last character toward first character. There are many repeated characters within string P , we only want one from each repeated character to be added into List P_Bad_Character_shift_table , the one at the right most of string P . But excluding P.last() character from the List, because P.last() character already found to be mismatching character T[T_Ptr] . Hence, we reduce additional searching steps from these repeated characters.

string P = "TAGATAAGATA";
	P_Char	Bad_Character_Shift
P_Bad_Character_shift_table[0] =	T	1
P_Bad_Character_shift_table[1] =	G	3

As we can see from the table above, List P_Bad_Character_shift_table only has two nodes. We don t need to add string P last character A into the List and other characters in string P are either character T or G .

Right most character of T is 1 character away from string P last character; right most character of G is 3 characters away from string P last character.

//Third case: when string 'P' last character does not match string 'T'

//P_Bad_Character_Rule

P_ptr = P.Length - 2;

List<P_Bad_Character_struct> P_Bad_Character_shift_table = new List<P_Bad_Character_struct>();

while (P_ptr > 0 && P[P_ptr] == P.Last())

P_ptr--;

//P_ptr == -1, special case when string 'P' only consist of only one duplicated character

if (P_ptr != -1) //string 'P' has character different from P.Last()

{

P_Bad_Character_shift_table.Add(new P_Bad_Character_struct(P[P_ptr], P.Length - 1 - P_ptr)); //Add P second last character into the table

P_ptr--;

while (P_ptr >= 0) //when string P.length >= 3

{

int i = 0;

while (i < P_Bad_Character_shift_table.Count() && P_Bad_Character_shift_table[i].P_Char != P[P_ptr])

i++;

if (i == P_Bad_Character_shift_table.Count() && P[P_ptr] != P.Last()) //P current character not occur inside P_Bad_Character_shift_table, add this new character & it position

P_Bad_Character_shift_table.Add(new P_Bad_Character_struct(P[P_ptr], P.Length - 1 - P_ptr));

P_ptr--;

}

for (int i = 0; i < P_Bad_Character_shift_table.Count(); i++)

Console.WriteLine("P_Bad_Character_shift_table.P_Char[" + i + "] = " + P_Bad_Character_shift_table[i].P_Char

+ ", Bad_Character_Shift = " + P_Bad_Character_shift_table[i].Bad_Character_Shift);

}

Figure: When only string P last character matching string T flow chart

Only Last Character Matched find the only_last_character_of_P_matching_shift

Shift string P in string T by only_last_character_of_P_matching_shift . Find the next P.last() character in string P from right to left direction.

string P = "TAGATAAGATA";
	Character A
only_last_character_of_P_matching_shift =	2

The next character A that is occur in string P from right to left direction is 2 characters away from P last character.

//First case: when only last character of string 'P' matching string 'T', how many characters shift string 'P' should be in string 'T'

P_ptr = P.Length - 2;

int only_last_character_of_P_matching_shift;

while (P_ptr >= 0 && P[P_ptr] != P.Last())

P_ptr--;

only_last_character_of_P_matching_shift = P.Length - 1 - P_ptr;

Good Suffix Rule Construct the P_Good_Suffix_shift_table

String P suffix can be between last character to second last character till last character to second first character, total count of P.length 2 suffixes. Construct P_Good_Suffix_shift_table[P.length 2] fast shifting table to save number of shift for each different string P suffix length.

string P = "TAGATAAGATA";
	String P suffix	Good_Suffix_Shift
P_Good_Suffix_shift_table[0] =	TA	5
P_Good_Suffix_shift_table[1] =	ATA	5
P_Good_Suffix_shift_table[2] =	GATA	5
P_Good_Suffix_shift_table[3] =	AGATA	5
P_Good_Suffix_shift_table[4] =	AAGATA	9
P_Good_Suffix_shift_table[5] =	TAAGATA	9
P_Good_Suffix_shift_table[6] =	ATAAGATA	9
P_Good_Suffix_shift_table[7] =	GATAAGATA	9
P_Good_Suffix_shift_table[8] =	AGATAAGATA	9

As we can see from the table above, in P_Good_Suffix_shift_table[] , the Good_Suffix_Shift is always bigger in later nodes.

Longer suffixes always occur at shorter suffix position or left side of shorter suffix. Therefore, our starting string comparison P_Ptr pointer can start from position where previous shorter suffix was found. P_ptr = P.Length - P_Good_Suffix_shift_table[matched_string.Length - 2];

If shorter suffix is not found in string P , the later longer suffix definitely not exist in string P . therefore we can shift the whole string P away by P.Length in remaining of longer suffixes.

//Second case: when string 'P' suffix partially matching string 'T', 2 characters & more. How many characters shift string 'P' should be in string 'T'

matched_string = P.Substring(P.Length - 2, 2);

int matched_string_ptr = 1;

P_ptr = P.Length - 2;

//P_ptr pointer continue from only_last_character_of_P_matching case

int[] P_Good_Suffix_shift_table = new int[P.Length - 2];

P_Good_Suffix_Rule:

if (matched_string.Length < P.Length)

{

finding_good_suffix_shift:

if (P_ptr >= 0)

{

finding_good_suffix_shift_2:

if (matched_string[matched_string_ptr] == P[P_ptr])

{

if (P_ptr == 0 || matched_string_ptr == 0)

goto found_good_suffix_shift;

matched_string_ptr--;

}

else

{

if (matched_string_ptr != matched_string.Count() - 1)

{

matched_string_ptr = matched_string.Count() - 1;

goto finding_good_suffix_shift_2;

}

P_ptr--;

goto finding_good_suffix_shift;

}

found_good_suffix_shift:

if (P_ptr > 0) //Match string are found in the mid of remaining P string

{

P_Good_Suffix_shift_table[matched_string.Length - 2] = P.Length - matched_string.Length - P_ptr;

P_ptr = P.Length 1 - P_Good_Suffix_shift_table[matched_string.Length - 2];

}

else if (P_ptr == 0) //Match string are found in the start of remaining P string

{

int P_Good_Suffix_shift = P.Length - matched_string.Count() + matched_string_ptr;

for (int i = matched_string.Length - 2; i < P.Length - 2; i++)

P_Good_Suffix_shift_table[i] = P_Good_Suffix_shift;

goto P_Bad_Character_Rule;

}

else //When P_ptr < 0, no match found in remaining P string

{

for (int i = matched_string.Length - 2; i < P.Length - 2; i++)

P_Good_Suffix_shift_table[i] = P.Length;

goto P_Bad_Character_Rule;

}

matched_string_ptr = matched_string.Length; //matched_string increase length by 1

matched_string = P.Substring(P.Length - matched_string_ptr - 1, matched_string_ptr + 1); //matched_string increase length by 1

goto P_Good_Suffix_Rule;

}

Figure: String search algorithm with pre-constructed string P fast shifting Good Suffix Rule table flow chart

Other than the above method to construct P_Good_Suffix_shift_table[] , we have improved version of deriving P_Good_Suffix_shift_table[] .

When we found match string at mid of remaining P string (P_ptr > 0), the next longer suffix potential occurrence location will be at current suffix position or to the left side. Therefore, we can save computing effort by only comparing longer suffix first character against string P , P_ptr-- & matched_string_ptr = 0 .

When current suffix is found at the start of string P , that would mean that all the following longer suffix will be located at the same position. Hence P_Good_Suffix_shift_table[i] value will be the same for all remaining longer suffix.

When current suffix is not found in the remaining of string P , it would also indicate all the remaining longer suffix will not be found inside remaining of string P . Therefore we can shift the whole string P to new location in string T by P_Good_Suffix_shift_table[i] = P.Length for current suffix and the following longer suffix.

Notice that there are no checkpoint of if (matched_string.Length < P.Length) to check if suffix will be longer or equal to string P in this improved version code, where every cycle suffix length will be increase length by 1 when current suffix occur in the mid of string P . We already know the total number of possible suffix occurrence, total count of P.length 2 suffixes. Thus, the code will stop at P.length 2 count.

When suffixes are getting longer and longer, eventually these following longer suffixes will either occur at the start of string P or not exist in remaining of string P . Thus, we can reduce our computing efforts.

//Second case: when string 'P' suffix partially matching string 'T', 2 characters & more. How many characters shift string 'P' should be in string 'T'

matched_string = P.Substring(P.Length - 2, 2);

int matched_string_ptr = 1;

//P_ptr pointer continue from only_last_character_of_P_matching case

int[] P_Good_Suffix_shift_table = new int[P.Length - 2];

//P_Good_Suffix_Rule

finding_good_suffix_shift:

if (P_ptr >= 0)

{

finding_good_suffix_shift_2:

if (matched_string[matched_string_ptr] == P[P_ptr])

{

if (P_ptr == 0 || matched_string_ptr == 0)

goto found_good_suffix_shift;

matched_string_ptr--;

}

else

{

if (matched_string_ptr != matched_string.Count() - 1)

{

matched_string_ptr = matched_string.Count() - 1;

goto finding_good_suffix_shift_2;

}

P_ptr--;

goto finding_good_suffix_shift;

}

found_good_suffix_shift:

if (P_ptr > 0) //Match string are found in the mid of remaining P string

{

P_Good_Suffix_shift_table[matched_string.Length - 2] = P.Length - matched_string.Length - P_ptr;

P_ptr--;

//matched_string increase length by 1

matched_string = P.Substring(P.Length - matched_string.Length - 1, matched_string.Length + 1);

matched_string_ptr = 0; //search suffix from previous shorter suffix position, 1 character to the left

goto finding_good_suffix_shift; //should goto finding_good_suffix_shift_2

}

else if (P_ptr == 0) //Match string are found in the start of remaining P string

{

int P_Good_Suffix_shift = P.Length - matched_string.Count() + matched_string_ptr;

for (int i = matched_string.Length - 2; i < P.Length - 2; i++)

P_Good_Suffix_shift_table[i] = P_Good_Suffix_shift;

//goto P_Bad_Character_Rule;

}

else //When P_ptr < 0, no match found in remaining P string

{

for (int i = matched_string.Length - 2; i < P.Length - 2; i++)

P_Good_Suffix_shift_table[i] = P.Length;

//goto P_Bad_Character_Rule;

}

Search string P against string T

After constructed string P shifting table, we can start string P against string T comparison. After each shifting string P in string T , repeat the characters comparison from string P last character till first character against string T .

Console.WriteLine("\nStart string 'P' against string 'T' comparison");

//Start string 'P' against string 'T' comparison

int matched_string_length = 0;

P_ptr = P.Length - 1;

//string_compare_P_against_T:

while (T_ptr < T.Length)

{

//String compare from string 'P' last character toward first character against string 'T'. Stop whenever there is a mismatch.

while (P_ptr >= 0 && P[P_ptr] == T[T_ptr])

{

matched_string_length++;

T_ptr--;

P_ptr--;

}

if (matched_string_length == 0) //Bad Character Rule

{

if (P_Bad_Character_shift_table.Count() == 0)

{

//special case when string 'P' only consist of only one duplicated character

start_P_in_T += P.Length;

end_P_in_T += P.Length;

}

else

{

int i = 0;

while (i < P_Bad_Character_shift_table.Count() && P_Bad_Character_shift_table[i].P_Char != T[end_P_in_T])

i++;

if (i == P_Bad_Character_shift_table.Count())

{

//string 'T' character correspond to string 'P' last character position cannot be found in string 'P'

start_P_in_T += P.Length;

end_P_in_T += P.Length;

}

else

{

start_P_in_T += P_Bad_Character_shift_table[i].Bad_Character_Shift;

end_P_in_T += P_Bad_Character_shift_table[i].Bad_Character_Shift;

}

Console.Write("Bad Character Rule, ");

}

else if (matched_string_length == 1)

{

//Only P last character match

start_P_in_T += only_last_character_of_P_matching_shift;

end_P_in_T += only_last_character_of_P_matching_shift;

Console.Write("Only P last character match, ");

}

else if (matched_string_length == P.Length) //found string 'P' matching string 'T' in this position

{

local_P_in_T_results.Add(start_P_in_T); //record the position of string 'P' in string 'T'

start_P_in_T += 1;

end_P_in_T += 1;

}

else //P against T matched string is 2 characters or more but not exactly matched, only partially match

{

//Good Suffix Rule

start_P_in_T += P_Good_Suffix_shift_table[matched_string_length - 2];

end_P_in_T += P_Good_Suffix_shift_table[matched_string_length - 2];

Console.Write("Good Suffix Rule, ");

}

T_ptr = end_P_in_T;

P_ptr = P.Length - 1;

matched_string_length = 0;

Console.WriteLine("start_P_in_T = " + start_P_in_T);

//goto string_compare_P_against_T;

}

return local_P_in_T_results;

When T_ptr beyond string T boundary, that s mean we finished searching the whole string T . Stop the string search and return the result List local_P_in_T_results .

c# Console App Source code

String search algorithm with pre-constructed string P fast shifting table Source code: string_search_with_pre-constructed_P_fast_shift_table_version1.txt

Improved version: String search algorithm with pre-constructed string P fast shifting table Source code: string_search_with_pre-constructed_P_fast_shift_table_version2.txt

Statistic: Probability occurrences of Bad Character Rule, Good Suffix Rule, Only_Last_Character_matached Rule and Exact Match

By using simple statistic, we can compute the probability occurrences of Bad Character Rule, Good Suffix Rule, Only_Last_Character_matached Rule and Exact Match.

Below we take a DNA sequence for example. DNA sequence consist of 4 elements. Hereby assuming elements T , A , G , C have equal probability of occurrence.

string P = "TAGA";
	Probability Occurrences (%)
matched_string.length == 0 Bad Character Rule String P last character not matching string T	= 75%
matched_string.length == 1 Only Last Character Matched Only string P last character matched string T	* = 18.75%
matched_string.length == 2 Good Suffix Rule string P suffix matched string T	* * = 4.6875%
matched_string.length == 3 Good Suffix Rule string P suffix matched string T	* * * = 1.171875%
matched_string.length == 4 Found exact match string P is matching string T	* * * = 0.390625%
Total	100%

Assuming all elements in string P & string T have the same probability of occurrences. In the above example 75% chance the algorithm will be running Bad Character Rule ; 18.75% probability running Only Last Character Matched ; 5.859375% running Good Suffix Rule ; 0.390625% found the exact match.

Linear Execution

From the above flowcharts we can see that the string search program is executing linearly. To utilize multi-threads feature of CPU, we can divide the string T into multiple segments, search string P simultaneously in each segments of string T.

Multi-tasking String Search by implementing multi-threading programming

Nowadays CPUs are all multi-cores and multi-threads. Therefore, we can implement multi-threads programming to run our string search algorithm efficiently and faster. Our string search algorithm is linear execution. Execute the program by utilizing only one single core and letting other cores / threads remain idle are waste of computer resources.

Hence to utilize the CPU multi-threads function, we can split the string T into equal segments. Perform searching of string P in each segments of string T concurrently.

First, we identify how many cores / threads in current computer. We would reserve 1~2 threads to run operating system processes & other miscellaneous processes.

using System.Threading;

using System.Runtime.InteropServices;

namespace String_Search

{

class Program

{

[DllImport("kernel32.dll")]

static extern int GetCurrentProcessorNumber();

static void Main(string[] args)

{

Console.WriteLine("The number of processors on this computer is {0}.", Environment.ProcessorCount);

int number_of_program_threads = Environment.ProcessorCount - 1; //One thread is reserved for system & other miscellaneous processes

Preprocessing string P to create fast shift tables beforehand to speed up the search process, my_string_search_class.preprocessing_string_P(P); .

Divide the workloads evenly to the remaining of cores / threads, number_of_program_threads . Create number_of_program_threads number of lists to store the string search results. Create another list P_in_T_results_total to combine all the search results together.

Define the start index & length of each segments of string T for the string search program to start with. The start index of each segments is derived from dividing T.Length by number_of_program_threads .

List<int>[] P_in_T_results = new List<int>[number_of_program_threads];

for (int i = 0; i < number_of_program_threads; i++)

P_in_T_results[i] = new List<Int32>();

List<int> P_in_T_results_total = new List<int>();

int[] start_index = new int[number_of_program_threads];

int[] end_index = new int[number_of_program_threads];

Create number_of_program_threads to run the string search program parallelly.

Pre-processed string P fast shift tables are the same for all the threads program, therefore I make all the threads retrieve the number of shifts from single common tables without creating fast shift tables for each individual threads.

Start all the threads by Start command and run all threads until done by Join command.

Thread[] threadsArray = new Thread[number_of_program_threads];

threadsArray[0] = new Thread(() => P_in_T_results[0] = my_string_search_class.string_search(P, T.Substring(0, end_index[0])));

threadsArray[1] = new Thread(() => P_in_T_results[1] = my_string_search_class.string_search(P, T.Substring(start_index[1], end_index[1] - start_index[1])));

threadsArray[2] = new Thread(() => P_in_T_results[2] = my_string_search_class.string_search(P, T.Substring(start_index[2], T.Length - start_index[2])));

try

{

foreach (Thread t in threadsArray)

t.Start();

foreach (Thread t in threadsArray)

t.Join();

// Join both threads with no timeout

// Run all until done.

// threads have finished at this point.

}

catch (ThreadStateException e)

{

Console.WriteLine(e); // Display text of exception

}

catch (ThreadInterruptedException e)

{

Console.WriteLine(e); // This exception means that the thread

// was interrupted during a Wait

}

Since in each segments of string T , index always start from 0. To compromise this, we add all search results by start_index[x] inside P_in_T_results[x] list so that the search results are reflecting the corresponding position of string P inside string T .

Combine all the individual P_in_T_results[x] lists into one total list P_in_T_results_total afterward.

for (int x = 0; x < number_of_program_threads; x++)

for (int y = 0; y < P_in_T_results[x].Count(); y++)

{

P_in_T_results[x][y] += start_index[x];

}

for (int i = 0; i < number_of_program_threads; i++)

P_in_T_results_total.AddRange(P_in_T_results[i]);

if (P_in_T_results_total.Any())

{

Console.Write("String P is found inside string T, between: ");

foreach (int i in P_in_T_results_total)

Console.Write(i + "~" + (i + P.Length - 1) + ", ");

Console.WriteLine("");

}

else

Console.WriteLine("P string not exist in string T");

c# Console App Source code

String search algorithm with pre-constructed string P fast shifting table and with multithreading Source code: String_search_multi-threads.txt

Multi-threading String Search with Wild Card

In below code, user can use Wild Card ? in string P to find Exact String Match of string P from string T .

c# Console App Source code

String search algorithm with pre-constructed string P fast shifting table and with multithreading Source code: String_search_multi-threads_WildCard.txt