Introduction
To search a
substring within a long string in online mode, we can use Naive Exact Matching
algorithm, Boyer Moore search algorithm and custom search algorithm.
In Naive Exact
Matching search algorithm, every character in the string will be compared
to find the matches from left most character in the string to the
right. When found miss-match character, shift the target search string
1 character to the right of source string and continue the comparison.
Boyer Moore search
algorithm is good tool to search string within a long string. It
theoretically faster than Naive Exact Matching method. Boyer Moore search
algorithm bypass a number of characters in a long string without
comparison depend on the conditions.
When the
source database ('long string') has below characteristics, we
can implement Boyer Moore search algorithm/custom search algorithm:
Each nodes/data within the source
database are not unique. They are repeatedly occurred inside the source
database ('long string').
There are only limited number of
distinct nodes/elements (for example 26 letters English alphabets).
These nodes/elements repeatedly occur inside the source database.
These nodes/data cannot be
quantified. Therefore, we cannot compare which nodes/data are bigger or
smaller.
The source database is overall unique
because each elements/nodes are aligned in one unique sequence/position.
It is an online, without
preprocessing database
Examples
English essay, paragraph - 26 letters
English alphabet
English alphabets string - 26 letters
English alphabet
DNA sequence - composed from 4
elements /nitrogen-containing nucleobases:
'A', 'G', 'C', 'T'
Web search engine without
preprocessing database
Searching words in a document
https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
In
this article, we will have another string search algorithm, which
pre-construct string P shifting table before start the search of string P
against string T . This will further speed up the string search efficiency.
String search are
important algorithms in computing world.

Figure: String search algorithm Example
Loops
and Jumpers
Loops and
jumpers are seen in my string search algorithm.
Codes are
executing back and forth depend on conditions. When conditions do not
meet, program will loop back to the start of the comparing process and the
process continually repeated. Until certain conditions are reach, the
algorithm will jump out of the loop to the next comparing process.
String
searching conditions often occur repeatedly after every target search
string shift to the right. Therefore, when the condition meet the algorithm
will jump back and forth between the comparing processes.
To truly
optimize the string search algorithm, this algorithm should be made into
transistors circuit and added into CPU as part of the CPU basic
instruction.
My version of
Boyer Moore search algorithm
In Boyer
Moore search algorithm, there are four searching processes:
Good Suffix rule
Bad Character rule
Only last character match shift
Found Exact Match
The
search algorithm will be running from either one of these 4 main
processes in each cycle.
The
search start from string P last character toward first character, right to
left direction. This would ensure the shift of string P in string T would
be minimum and all possible occurrence of string P in string T are
verified.

Figure: String search algorithm without
Preprocessing string P Flow Chart
Declare variables
List<int>
local_P_in_T_results = new List<int>();
string
matched_string = "";
int
start_P_in_T = 0;
int end_P_in_T
= P.Length - 1;
//Check string 'P' & 'T'
are not empty and T string is longer than P string, or else definitely
will not be any matches and no reason to proceed the checking
if (!String.IsNullOrEmpty(P)
&& !String.IsNullOrEmpty(T)
&& T.Length >= P.Length)
{
int T_ptr;
|
Special case: when string P only consist of 1 character
String P
will compare every character in string T from left to right until found a
match, or else return -1 .
if (P.Length == 1) //Special case: if string 'P'
only consist of 1 character
{
T_ptr = 0;
while (T_ptr < T.Length)
{
if (P[0] == T[T_ptr])
local_P_in_T_results.Add(T_ptr);
T_ptr++;
}
return local_P_in_T_results;
}
|
Compare string P
against string T from last character toward first character
At the
start, check if string 'P' currently still located within string 'T'
boundary. If string 'P' last character located outside of string 'T'
boundary, no Good Suffix rule & Bad Character rule comparison will be
made. Process will goto 'end_string_compare' instead. It also means there
are totally no match of P string in source string T.
int P_ptr = P.Length - 1; //Pointer showing position of
target search string 'P' last character now
T_ptr = P.Length - 1; //Pointer showing position of
target search string 'P' last character in string 'T' now
int shift_P_in_T = 0; //How much string 'P' needed
to shift in string 'T' after Good Suffix rule, Bad Character rule or
last_character_not_match_shift
string_compare_P_against_T:
if
(T_ptr < T.Length)
{
|
In Boyer Moore
search algorithm, we start string comparison from the end of target search
string 'P' to to start of 'P' against source string 'T' depend on where 'P'
located in "T'. This comparison process will continue
to loop until it reaches start of string 'P' or 'P' & 'T'
character comparison is found to be not the same.
while (T_ptr >= start_P_in_T
&& P[P_ptr] == T[T_ptr])
{ //compare string 'P' against
string 'T' from last character toward first character, stop whenever
there is a mismatch
matched_string = P[P_ptr].ToString() +
matched_string;
T_ptr--;
P_ptr--;
}
|
Bad Character Rule
If we found no matches ('matched_string' is empty), string
P last character does not match string T character ( T[T_Ptr] ), we implement Bad Character Rule. Search the occurrence of
character T[T_Ptr] in remaining of string P from
right to left direction, stop the search when found character T[T_Ptr] in string P .
If character T[T_Ptr] not exist in string P , shift
whole string P away from character T[T_Ptr]
in string T by P.Length .
If character T[T_Ptr] is found in string P , shift
string P so that character in string P matching character T[T_Ptr] in string T .
Afterward restart the
string comparison again from string P last character toward first character
against string T .
if (matched_string.Length == 0) //Found string 'P' last
character not matching string 'T'
{
//Bad Character Rule
while (P_ptr >= 0 &&
P[P_ptr] != T[T_ptr])
{
P_ptr--;
shift_P_in_T++;
}
Console.Write("Bad Character shift,
");
}
|
Only Last Character Matched
If we found only string 'P' last
character matched ('matched_string.Count() == 1') against string
'T'. Search the
occurrence of string 'P' last character or character T[T_Ptr] in remaining of string P from
right to left direction, stop the comparison when found string 'P' last character in remaining of string P . or else shift whole
string P away from character T[T_Ptr]
in string T by P.Length .
'shift_P_in_T' increase by 1
Shift 'T_ptr' pointer location
shifted 1 character back to the right, location where P last character
located.
Compare remaining of string P against
string P last character from right to left when 'P_ptr' pointer is still within string P boundary.
When character not matching, shift
'P_ptr' pointer 1 character to the left;
increase 'shift_P_in_T' by 1.
else if (matched_string.Count() == 1)
{
//only string 'P' last character
matching against string 'T'
shift_P_in_T = shift_P_in_T + 1;
T_ptr++;
while (P_ptr >= 0 &&
P[P_ptr] != T[T_ptr])
{
P_ptr--;
shift_P_in_T++;
}
Console.Write("only last character
match shift, ");
}
|
Good suffix rule
If we found string P suffix of two or more
characters are matching when comparing end of 'P' against 'T'
but 'P' string is not wholly matching in string 'T', Good
Suffix shift will be implemented. Compare the partially matched 'P'
suffix 'matched_string' with the remaining characters other than last character of
'P' string ( P_ptr = P.Length - 2 ) to see whether if there are a match.
else
{
//P against T matched string is 2
characters or more but not exactly matched, only partially matched at
suffix
int matched_string_ptr =
matched_string.Count() - 1;
P_ptr = P.Length - 2; //set the P_ptr to be second
last characters of P string
//Partial of the matched string has
chance to be within the matched string when matched string has repeated
characters inside
|
Create matched string pointer 'matched_string_ptr' = 'matched_string.Count() - 1' to record currently pointer position in 'matched_string'.
Enter 'finding_good_suffix_shift:' process
First check if P pointer 'P_ptr' is still within P string boundary ('P_ptr >= 0')
Check if 'matched_string' current character match with P string current character ('matched_string[matched_string_ptr]
== P[P_ptr]')
o
If they are match, check
if pointer 'P_ptr' or pointer 'matched_string_ptr' reached the left most end character in either P or
'matched_string' string If either pointers reach the left most end of P or 'matched_string' strings and there is at least a match character, process jump
to 'found_good_suffix_shift'.
o
Else move pointer
'matched_string_ptr' 1 character to the left
//Good suffix rule checking
finding_good_suffix_shift:
if (P_ptr >= 0)
{
finding_good_suffix_shift_2:
if
(matched_string[matched_string_ptr] == P[P_ptr])
{
if (P_ptr == 0 ||
matched_string_ptr == 0)
goto found_good_suffix_shift;
matched_string_ptr--;
}
|
If 'matched_string' previously suffix partially match with remaining P string but current character does not match with remaining P string
current character ('matched_string[matched_string_ptr] !=
P[P_ptr]'), reset the pointer 'matched_string_ptr' to 'matched_string' last character ( matched_string_ptr
= matched_string.Count() - 1 ). Loop back to finding_good_suffix_shift_2 and restart 'matched_string' with remaining P string comparison. No change in 'P_ptr'.
else
{
if (matched_string_ptr !=
matched_string.Count() - 1)
{
//Reset the matched_string pointer,
compare matched string from last character again
matched_string_ptr =
matched_string.Count() - 1;
goto finding_good_suffix_shift_2;
}
}
|
If found no match between 'matched_string' current character& remaining P string current character,
shift pointer 'P_ptr' to the left
1 character.
Loop back to finding_good_suffix_shift, continue the character comparison.
P_ptr--;
goto finding_good_suffix_shift;
|
When pointer 'P_ptr' out of P string boundary (P_ptr == -1), goto found_good_suffix_shift
process
In found_good_suffix_shift
process, check P_ptr value
o
If P_ptr > 0, that means matched string are found in the mid of remaining P
string. shift_P_in_T
= P.Length - matched_string.Length - P_ptr.
o
If P_ptr == 0, that means matched string is found at the prefix of remaining
P string. 'shift_P_in_T
= P.Length - matched_string.Length - matched_string_ptr'
o
When P_ptr == -1, no matched string is found in the remaining of P string.
Therefore bypass this whole string segment, 'shift_P_in_T = P.Length'
found_good_suffix_shift:
if (P_ptr > 0) //Match string are found in
the mid of remaining P string
shift_P_in_T = P.Length -
matched_string.Length - P_ptr;
else if (P_ptr == 0) //Match string are found in
the start of remaining P string
shift_P_in_T = P.Length -
matched_string.Length + matched_string_ptr;
else //When P_ptr < 0, no match
found in remaining P string
shift_P_in_T = P.Length;
Console.Write("Good
Suffix shift, ");
|
Found Exact Match
If we found 'P' string exactly matching in
string 'T', we record the current starting position of string P in string T
by adding it into the local_P_in_T_results List.
Shift 1 character to the left ( shift_P_in_T = 1 ) and start the next string P against string T comparison,
check whether string P occurred in remaining of string T .
else if (matched_string.Length ==
P.Length) //found the exact match
{
local_P_in_T_results.Add(start_P_in_T);
//record the
position of string 'P' in string 'T'
shift_P_in_T = 1;
}
|
Next cycle of string P against string T comparison
After the above processes, we shift string P in
string T by shift_P_in_T derived from the
one of the processes. Reset the variables & pointers and start the next
cycle of character comparison ( goto string_compare_P_against_T ).
matched_string = "";
end_P_in_T += shift_P_in_T;
start_P_in_T += shift_P_in_T;
Console.WriteLine("start_P_in_T = " + start_P_in_T);
T_ptr = end_P_in_T;
P_ptr = P.Length - 1;
shift_P_in_T = 0;
goto string_compare_P_against_T;
|
When pointer T_ptr out of string T boundary, we completed the string P against
string T comparison. Return the local_P_in_T_results List.
Source
code
My version
of Boyer Moore search algorithm without pre-constructed string P fast
shifting table Source code: Boyer_Moore_search_algorithim.txt
String Search
algorithm with pre-constructed string P fast shifting table

Figure: String search algorithm with
pre-constructed string P fast shifting table flow chart
When source string T
is much longer compare to target search string P , there will be a lot of
similar repetitive string comparison conditions. Thus, we can cut short our
comparing time and computing power by pre-constructing string P number of
shift in string T table before start comparing string P against string T .
The number of shift can be found just within string P itself without need
to know string T contents.
When string P is
short and string T is comparatively long. Construct string P number of
shift in string T table become a viable and preferable option.
Compare string P
against string T character by character from string P last character to
first character. Whenever there is one character mismatch, stop the string
comparison immediately.
Bad Character Rule: Construct the P_Bad_Character_shift_table
We will try
to search T[T_Ptr] character
in string P from string P second last character toward first character.
There are many repeated characters within string P , we only want one from
each repeated character to be added into List P_Bad_Character_shift_table , the one at the right most of
string P . But excluding P.last() character from the List,
because P.last() character already found to be
mismatching character T[T_Ptr] . Hence, we reduce additional
searching steps from these repeated characters.
string P =
"TAGATAAGATA";
|
|
P_Char
|
Bad_Character_Shift
|
P_Bad_Character_shift_table[0]
=
|
T
|
1
|
P_Bad_Character_shift_table[1]
=
|
G
|
3
|
As we can see from
the table above, List P_Bad_Character_shift_table only has two nodes. We don t
need to add string P last character A into the List and other characters in
string P are either character T or G .
Right most character
of T is 1 character away from string P last character; right most character
of G is 3 characters away from string P last character.
//Third case: when string 'P'
last character does not match string 'T'
//P_Bad_Character_Rule
P_ptr = P.Length - 2;
List<P_Bad_Character_struct>
P_Bad_Character_shift_table = new List<P_Bad_Character_struct>();
while (P_ptr >
0 && P[P_ptr] == P.Last())
P_ptr--;
//P_ptr == -1, special case
when string 'P' only consist of only one duplicated character
if (P_ptr != -1) //string 'P' has character
different from P.Last()
{
P_Bad_Character_shift_table.Add(new P_Bad_Character_struct(P[P_ptr], P.Length - 1 -
P_ptr)); //Add P second last character into the table
P_ptr--;
while (P_ptr >= 0) //when string P.length >= 3
{
int i = 0;
while (i <
P_Bad_Character_shift_table.Count() &&
P_Bad_Character_shift_table[i].P_Char != P[P_ptr])
i++;
if (i ==
P_Bad_Character_shift_table.Count() && P[P_ptr] != P.Last()) //P current character not
occur inside P_Bad_Character_shift_table, add this new character & it
position
P_Bad_Character_shift_table.Add(new P_Bad_Character_struct(P[P_ptr], P.Length - 1 -
P_ptr));
P_ptr--;
}
for (int i = 0; i <
P_Bad_Character_shift_table.Count(); i++)
Console.WriteLine("P_Bad_Character_shift_table.P_Char[" + i + "] = " +
P_Bad_Character_shift_table[i].P_Char
+ ", Bad_Character_Shift
= " + P_Bad_Character_shift_table[i].Bad_Character_Shift);
}
|

Figure: When
only string P last character matching string T flow chart
Only Last Character Matched find the
only_last_character_of_P_matching_shift
Shift string P in
string T by only_last_character_of_P_matching_shift . Find the next P.last()
character in string P from right to left direction.
string
P = "TAGATAAGATA";
|
|
Character A
|
only_last_character_of_P_matching_shift =
|
2
|
The next character A
that is occur in string P from right to left direction is 2 characters away
from P last character.
//First case: when only last
character of string 'P' matching string 'T', how many characters shift
string 'P' should be in string 'T'
P_ptr = P.Length - 2;
int
only_last_character_of_P_matching_shift;
while (P_ptr
>= 0 && P[P_ptr] != P.Last())
P_ptr--;
only_last_character_of_P_matching_shift
= P.Length - 1 - P_ptr;
|
Good Suffix Rule Construct the P_Good_Suffix_shift_table
String P suffix can
be between last character to second last character till last character to
second first character, total count of P.length 2
suffixes. Construct P_Good_Suffix_shift_table[P.length 2] fast shifting table to save
number of shift for each different string P suffix length.
string P =
"TAGATAAGATA";
|
|
String P suffix
|
Good_Suffix_Shift
|
P_Good_Suffix_shift_table[0]
=
|
TA
|
5
|
P_Good_Suffix_shift_table[1]
=
|
ATA
|
5
|
P_Good_Suffix_shift_table[2]
=
|
GATA
|
5
|
P_Good_Suffix_shift_table[3]
=
|
AGATA
|
5
|
P_Good_Suffix_shift_table[4]
=
|
AAGATA
|
9
|
P_Good_Suffix_shift_table[5]
=
|
TAAGATA
|
9
|
P_Good_Suffix_shift_table[6]
=
|
ATAAGATA
|
9
|
P_Good_Suffix_shift_table[7]
=
|
GATAAGATA
|
9
|
P_Good_Suffix_shift_table[8]
=
|
AGATAAGATA
|
9
|
As we can see from
the table above, in P_Good_Suffix_shift_table[] , the Good_Suffix_Shift is always bigger in later
nodes.
Longer suffixes
always occur at shorter suffix position or left side of shorter suffix.
Therefore, our starting string comparison P_Ptr pointer can start from position where previous shorter suffix
was found. P_ptr = P.Length
- P_Good_Suffix_shift_table[matched_string.Length - 2];
If shorter suffix is
not found in string P , the later longer suffix definitely not exist in
string P . therefore we can shift the whole string P away by P.Length in remaining of longer
suffixes.
//Second case: when string 'P'
suffix partially matching string 'T', 2 characters & more. How many
characters shift string 'P' should be in string 'T'
matched_string =
P.Substring(P.Length - 2, 2);
int
matched_string_ptr = 1;
P_ptr = P.Length - 2;
//P_ptr pointer continue from
only_last_character_of_P_matching case
int[]
P_Good_Suffix_shift_table = new int[P.Length -
2];
P_Good_Suffix_Rule:
if
(matched_string.Length < P.Length)
{
finding_good_suffix_shift:
if (P_ptr >= 0)
{
finding_good_suffix_shift_2:
if
(matched_string[matched_string_ptr] == P[P_ptr])
{
if (P_ptr == 0 ||
matched_string_ptr == 0)
goto
found_good_suffix_shift;
matched_string_ptr--;
}
else
{
if (matched_string_ptr !=
matched_string.Count() - 1)
{
matched_string_ptr =
matched_string.Count() - 1;
goto
finding_good_suffix_shift_2;
}
}
P_ptr--;
goto finding_good_suffix_shift;
}
found_good_suffix_shift:
if (P_ptr > 0) //Match
string are found in the mid of remaining P string
{
P_Good_Suffix_shift_table[matched_string.Length
- 2] = P.Length - matched_string.Length - P_ptr;
P_ptr = P.Length 1 -
P_Good_Suffix_shift_table[matched_string.Length - 2];
}
else if (P_ptr ==
0) //Match string are found in the start of remaining P string
{
int P_Good_Suffix_shift =
P.Length - matched_string.Count() + matched_string_ptr;
for (int i =
matched_string.Length - 2; i < P.Length - 2; i++)
P_Good_Suffix_shift_table[i] =
P_Good_Suffix_shift;
goto P_Bad_Character_Rule;
}
else //When P_ptr
< 0, no match found in remaining P string
{
for (int i =
matched_string.Length - 2; i < P.Length - 2; i++)
P_Good_Suffix_shift_table[i] =
P.Length;
goto P_Bad_Character_Rule;
}
matched_string_ptr =
matched_string.Length; //matched_string increase
length by 1
matched_string = P.Substring(P.Length
- matched_string_ptr - 1, matched_string_ptr + 1); //matched_string
increase length by 1
goto P_Good_Suffix_Rule;
}
|

Figure: String search algorithm with
pre-constructed string P fast shifting Good Suffix Rule table flow chart
Other than the above method to construct P_Good_Suffix_shift_table[] , we have improved version of deriving P_Good_Suffix_shift_table[] .
When
we found match string at mid of remaining P string (P_ptr > 0), the next longer suffix potential occurrence location will be
at current suffix position or to the left side. Therefore, we can save
computing effort by only comparing longer suffix first character against
string P , P_ptr-- & matched_string_ptr = 0 .
When
current suffix is found at the start of string P , that would mean that all
the following longer suffix will be located at the same position. Hence P_Good_Suffix_shift_table[i] value will be the same for all remaining longer
suffix.
When
current suffix is not found in the remaining of string P , it would also
indicate all the remaining longer suffix will not be found inside remaining
of string P . Therefore we can shift the whole string P to new location in
string T by P_Good_Suffix_shift_table[i] = P.Length for current suffix and the
following longer suffix.
Notice
that there are no checkpoint of if
(matched_string.Length < P.Length) to check if suffix will be longer or equal to
string P in this improved version code, where every cycle suffix length
will be increase length by 1 when current suffix occur in the mid of string
P . We already know the total number of possible suffix occurrence, total
count of P.length 2 suffixes. Thus, the code will
stop at P.length 2 count.
When
suffixes are getting longer and longer, eventually these following longer
suffixes will either occur at the start of string P or not exist in
remaining of string P . Thus, we can reduce our computing efforts.
//Second case: when string 'P'
suffix partially matching string 'T', 2 characters & more. How many
characters shift string 'P' should be in string 'T'
matched_string =
P.Substring(P.Length - 2, 2);
int matched_string_ptr = 1;
//P_ptr pointer continue from
only_last_character_of_P_matching case
int[] P_Good_Suffix_shift_table =
new int[P.Length - 2];
//P_Good_Suffix_Rule
finding_good_suffix_shift:
if (P_ptr >= 0)
{
finding_good_suffix_shift_2:
if
(matched_string[matched_string_ptr] == P[P_ptr])
{
if (P_ptr == 0 ||
matched_string_ptr == 0)
goto found_good_suffix_shift;
matched_string_ptr--;
}
else
{
if (matched_string_ptr !=
matched_string.Count() - 1)
{
matched_string_ptr =
matched_string.Count() - 1;
goto finding_good_suffix_shift_2;
}
}
P_ptr--;
goto finding_good_suffix_shift;
}
found_good_suffix_shift:
if (P_ptr > 0) //Match string are found in
the mid of remaining P string
{
P_Good_Suffix_shift_table[matched_string.Length
- 2] = P.Length - matched_string.Length - P_ptr;
P_ptr--;
//matched_string increase length by 1
matched_string = P.Substring(P.Length
- matched_string.Length - 1, matched_string.Length + 1);
matched_string_ptr = 0; //search suffix from previous
shorter suffix position, 1 character to the left
goto finding_good_suffix_shift; //should goto
finding_good_suffix_shift_2
}
else if (P_ptr == 0) //Match string are found in
the start of remaining P string
{
int P_Good_Suffix_shift =
P.Length - matched_string.Count() + matched_string_ptr;
for (int i = matched_string.Length -
2; i < P.Length - 2; i++)
P_Good_Suffix_shift_table[i] =
P_Good_Suffix_shift;
//goto P_Bad_Character_Rule;
}
else //When P_ptr < 0, no match
found in remaining P string
{
for (int i = matched_string.Length -
2; i < P.Length - 2; i++)
P_Good_Suffix_shift_table[i] =
P.Length;
//goto P_Bad_Character_Rule;
}
|
Search string P against string T
After constructed
string P shifting table, we can start string P against string T comparison.
After each shifting string P in string T , repeat the characters comparison
from string P last character till first character against string T .
Console.WriteLine("\nStart string 'P'
against string 'T' comparison");
//Start string 'P' against
string 'T' comparison
int matched_string_length = 0;
P_ptr = P.Length - 1;
//string_compare_P_against_T:
while (T_ptr < T.Length)
{
//String compare from string 'P' last
character toward first character against string 'T'. Stop whenever there
is a mismatch.
while (P_ptr >= 0 &&
P[P_ptr] == T[T_ptr])
{
matched_string_length++;
T_ptr--;
P_ptr--;
}
if (matched_string_length == 0) //Bad Character Rule
{
if
(P_Bad_Character_shift_table.Count() == 0)
{
//special case when string 'P'
only consist of only one duplicated character
start_P_in_T += P.Length;
end_P_in_T += P.Length;
}
else
{
int i = 0;
while (i <
P_Bad_Character_shift_table.Count() &&
P_Bad_Character_shift_table[i].P_Char != T[end_P_in_T])
i++;
if (i ==
P_Bad_Character_shift_table.Count())
{
//string 'T' character
correspond to string 'P' last character position cannot be found in
string 'P'
start_P_in_T += P.Length;
end_P_in_T += P.Length;
}
else
{
start_P_in_T +=
P_Bad_Character_shift_table[i].Bad_Character_Shift;
end_P_in_T +=
P_Bad_Character_shift_table[i].Bad_Character_Shift;
}
}
Console.Write("Bad Character Rule,
");
}
else if (matched_string_length == 1)
{
//Only P last character match
start_P_in_T +=
only_last_character_of_P_matching_shift;
end_P_in_T += only_last_character_of_P_matching_shift;
Console.Write("Only P last character
match, ");
}
else if (matched_string_length ==
P.Length) //found string 'P' matching string 'T' in this position
{
local_P_in_T_results.Add(start_P_in_T); //record the position of
string 'P' in string 'T'
start_P_in_T += 1;
end_P_in_T += 1;
}
else //P against T matched string
is 2 characters or more but not exactly matched, only partially match
{
//Good Suffix Rule
start_P_in_T +=
P_Good_Suffix_shift_table[matched_string_length - 2];
end_P_in_T +=
P_Good_Suffix_shift_table[matched_string_length - 2];
Console.Write("Good Suffix Rule,
");
}
T_ptr = end_P_in_T;
P_ptr = P.Length - 1;
matched_string_length = 0;
Console.WriteLine("start_P_in_T = " + start_P_in_T);
//goto string_compare_P_against_T;
}
return
local_P_in_T_results;
|
When T_ptr
beyond string T boundary, that s mean we finished searching the whole
string T . Stop the string search and return the result List local_P_in_T_results .
c# Console App Source
code
String
search algorithm with pre-constructed string P fast shifting table Source
code: string_search_with_pre-constructed_P_fast_shift_table_version1.txt
Improved
version: String search algorithm with pre-constructed string P fast
shifting table Source code: string_search_with_pre-constructed_P_fast_shift_table_version2.txt
Statistic: Probability occurrences of Bad Character Rule, Good Suffix
Rule, Only_Last_Character_matached Rule and Exact Match
By using simple
statistic, we can compute the probability occurrences of Bad Character
Rule, Good Suffix Rule, Only_Last_Character_matached Rule and Exact Match.
Below we take a DNA
sequence for example. DNA sequence consist of 4 elements. Hereby assuming
elements T , A , G , C have equal probability of occurrence.
string
P = "TAGA";
|
|
Probability Occurrences (%)
|
matched_string.length
== 0
Bad
Character Rule String P last character not matching string T
|
= 75%
|
matched_string.length
== 1
Only
Last Character Matched Only string P last character matched string T
|
* =
18.75%
|
matched_string.length
== 2
Good
Suffix Rule string P suffix matched string T
|
* * =
4.6875%
|
matched_string.length
== 3
Good
Suffix Rule string P suffix matched string T
|
* * * =
1.171875%
|
matched_string.length
== 4
Found
exact match string P is matching string T
|
* * * =
0.390625%
|
Total
|
100%
|
Assuming all elements
in string P & string T have the same probability of occurrences. In the
above example 75% chance the algorithm will be running Bad Character Rule ;
18.75% probability running Only Last Character Matched ; 5.859375% running
Good Suffix Rule ; 0.390625% found the exact match.
Linear Execution
From the above
flowcharts we can see that the string search program is executing linearly.
To utilize multi-threads feature of CPU, we can divide the string T into
multiple segments, search string P simultaneously in each segments of
string T.
Multi-tasking String Search by implementing multi-threading programming
Nowadays CPUs are all
multi-cores and multi-threads. Therefore, we can implement multi-threads
programming to run our string search algorithm efficiently and faster. Our
string search algorithm is linear execution. Execute the program by
utilizing only one single core and letting other cores / threads remain
idle are waste of computer resources.
Hence to utilize the
CPU multi-threads function, we can split the string T into equal segments.
Perform searching of string P in each segments of string T concurrently.
First, we identify
how many cores / threads in current computer. We would reserve 1~2 threads
to run operating system processes & other miscellaneous processes.
using System.Threading;
using System.Runtime.InteropServices;
namespace String_Search
{
class Program
{
[DllImport("kernel32.dll")]
static extern int GetCurrentProcessorNumber();
static void Main(string[] args)
{
Console.WriteLine("The number of
processors on this computer is {0}.", Environment.ProcessorCount);
int number_of_program_threads = Environment.ProcessorCount - 1; //One thread is reserved for
system & other miscellaneous processes
|
Preprocessing string
P to create fast shift tables beforehand to speed up the search process, my_string_search_class.preprocessing_string_P(P); .
Divide the workloads
evenly to the remaining of cores / threads, number_of_program_threads . Create number_of_program_threads number of lists to store the string search
results. Create another list P_in_T_results_total to combine all the search
results together.
Define the start
index & length of each segments of string T for the string search
program to start with. The start index of each segments is derived from
dividing T.Length by number_of_program_threads .
List<int>[] P_in_T_results = new List<int>[number_of_program_threads];
for (int i = 0; i <
number_of_program_threads; i++)
P_in_T_results[i] = new List<Int32>();
List<int> P_in_T_results_total = new List<int>();
int[] start_index = new int[number_of_program_threads];
int[] end_index = new int[number_of_program_threads];
|
Create number_of_program_threads to run the string search program parallelly.
Pre-processed string
P fast shift tables are the same for all the threads program, therefore I
make all the threads retrieve the number of shifts from single common
tables without creating fast shift tables for each individual threads.
Start all the threads
by Start command and run all threads until done by Join command.
Thread[] threadsArray = new Thread[number_of_program_threads];
threadsArray[0] = new Thread(() => P_in_T_results[0] =
my_string_search_class.string_search(P, T.Substring(0, end_index[0])));
threadsArray[1] = new Thread(() => P_in_T_results[1] =
my_string_search_class.string_search(P, T.Substring(start_index[1],
end_index[1] - start_index[1])));
threadsArray[2] = new Thread(() => P_in_T_results[2] = my_string_search_class.string_search(P,
T.Substring(start_index[2], T.Length - start_index[2])));
try
{
foreach (Thread t in threadsArray)
t.Start();
foreach (Thread t in threadsArray)
t.Join();
// Join both threads with no
timeout
// Run all until done.
// threads have finished at this
point.
}
catch (ThreadStateException e)
{
Console.WriteLine(e); // Display text of exception
}
catch (ThreadInterruptedException e)
{
Console.WriteLine(e); // This exception means that
the thread
// was interrupted during a Wait
}
|
Since in each
segments of string T , index always start from 0. To compromise this, we
add all search results by start_index[x] inside P_in_T_results[x] list so that the search results are reflecting
the corresponding position of string P inside string T .
Combine all the
individual P_in_T_results[x] lists into one total list P_in_T_results_total afterward.
for (int x = 0; x <
number_of_program_threads; x++)
for (int y = 0; y <
P_in_T_results[x].Count(); y++)
{
P_in_T_results[x][y] +=
start_index[x];
}
for (int i = 0; i <
number_of_program_threads; i++)
P_in_T_results_total.AddRange(P_in_T_results[i]);
if (P_in_T_results_total.Any())
{
Console.Write("String P is found
inside string T, between: ");
foreach (int i in P_in_T_results_total)
Console.Write(i + "~" + (i + P.Length - 1) + ", ");
Console.WriteLine("");
}
else
Console.WriteLine("P string not exist in
string T");
|
c# Console App Source
code
String search
algorithm with pre-constructed string P fast shifting table and with
multithreading Source code: String_search_multi-threads.txt
Multi-threading String Search with Wild Card
In below code, user
can use Wild Card ? in string P to find Exact String Match of string P
from string T .
c# Console App Source
code
String search
algorithm with pre-constructed string P fast shifting table and with
multithreading Source code: String_search_multi-threads_WildCard.txt
|