Introduction
Approximate
String Matching or Fuzzy
String Matching is a method of finding whether a string is matching
approximately / fuzzily with another string.
Certain tolerable number of mismatches are permitted to identify
both strings are approximately / fuzzily matching. The number of mismatches
are call “Edit Distance” (integer), we can decide
what is the maximum “Edit Distance” value we can accept to determine
whether both strings are approximately matching.
The mismatches include insertions, deletions, substitutions of
individual characters inside the target or source strings.
·
insertion:
cot → coat
Edit Distance = 1
·
deletion:
coat → cot
Edit Distance = 1
·
substitution:
coat → cost
Edit Distance = 1
·
cost → cots
deletion + insertion = 2 Edit Distance
|
Reference
from Wikipedia. (https://en.wikipedia.org/wiki/Approximate_string_matching)
Example
applications of Approximate String Matching are mentioned in Wikipedia.
Approximate String
Matching should not imply to ‘Fast Exact String-Matching’ method.
Recursive
Finding of ‘Edit Distance’
Take
long time to derive the outcome.
C#
code in local function
private static int edDistRecursive(string a, string b)
{
if (a.Length == 0)
return b.Length;
if (b.Length == 0)
return a.Length;
int delt;
if (a.LastOrDefault() != b.LastOrDefault())
delt = 1;
else
delt = 0;
return Math.Min(Math.Min(
edDistRecursive(a.Substring(0, a.Length - 1), b.Substring(0,
b.Length - 1)) + delt,
edDistRecursive(a.Substring(0, a.Length - 1), b) + 1),
edDistRecursive(a, b.Substring(0, b.Length - 1)) + 1);
}
|
Calculating
‘Edit Distance’ using Dynamic Programming
This is a faster method
to derive ‘Edit Distance’ / number of mismatches, it calls Dynamic
Programming. We create an Edit Distance table to
derive our targeted Edit Distance faster.
When two strings
are approximately same, we use ‘Edit
Distance by Approximate Equals’ method. Similar command in C# will be ‘str1.equals(str2)’.
When one string is
much longer than another string, we would say string#1 approximately
contains string#2 within. Standard C# command alike will be ‘str1.contains(str2)’.
Both ‘Edit Distance by Approximate Equals’
and ‘Edit Distance by Approximate
Contains’ coding are similar but with few
differences.
Calculate ‘Edit
Distance’ by Approximate Equals
Below is an example
of comparing string y “shekespr_*” vs string x “shakspeare_”
1.
First
we create string x (x.length
+ 1) time string y (y.length + 1) 2-dimentional
array Edit Distance table.
2.
Then
we fill-up first row and first column with incremental numbers, start from
1 till x.length & y.length. The incremental numbers mean that if both
strings are found offset by one (to the left or right), ‘Edit Distance’
will increment by 1 for subsequent characters.
|
|
s
|
h
|
e
|
k
|
e
|
s
|
p
|
r
|
_
|
*
|
|
0
|
1
|
2
|
3
|
4
|
5
|
6
|
7
|
8
|
9
|
10
|
s
|
1
|
0
|
|
|
|
|
|
|
|
|
|
h
|
2
|
|
|
|
|
|
|
|
|
|
|
a
|
3
|
|
|
|
|
|
|
|
|
|
|
k
|
4
|
|
|
|
|
|
|
|
|
|
|
s
|
5
|
|
|
|
|
|
|
|
|
|
|
p
|
6
|
|
|
|
|
|
|
|
|
|
|
e
|
7
|
|
|
|
|
|
|
|
|
|
|
a
|
8
|
|
|
|
|
|
|
|
|
|
|
r
|
9
|
|
|
|
|
|
|
|
|
|
|
e
|
10
|
|
|
|
|
|
|
|
|
|
|
_
|
11
|
|
|
|
|
|
|
|
|
|
|
Table 1: Approximate Equals ‘Edit
Distance’ table
3.
Then
we can fill up the cells inside the table with Edit Distance value.
a.
Compare
the respective cells (i & j) string x character and string y
character.
b.
If
they are the same, ‘distDiag = D[i
- 1, j - 1]’. Copy cells ‘D[i - 1, j - 1]’ value.
c.
If
they are different characters, ‘distDiag = D[i - 1, j - 1] + 1’.
Copy cells ‘D[i - 1, j - 1]’ value but add one.
d.
‘distHor = D[i, j-1] + 1’
e.
‘distVer = D[i-1,
j] + 1’
f.
Compare
‘distDiag’,
‘distHor’
& ‘distVer’,
see which one has the smallest value.
g.
Set
current cell’s Edit Distance to this value
4.
When
table is filled up, local function will return the last row last column
cell’s Edit Distance value. This will be the two strings ‘Approximate
Equals Edit Distance’ result. In this example, Edit Distance is 6. String y
“shekespr_*”
need 3 insertions, 2 deletions and 1 substitution to match string x “shakspeare_”.
|
|
s
|
h
|
e
|
k
|
e
|
s
|
p
|
r
|
_
|
*
|
|
0
|
1
|
2
|
3
|
4
|
5
|
6
|
7
|
8
|
9
|
10
|
s
|
1
|
0
|
1
|
2
|
3
|
4
|
5
|
6
|
7
|
8
|
9
|
h
|
2
|
1
|
0
|
1
|
2
|
3
|
4
|
5
|
6
|
7
|
8
|
a
|
3
|
2
|
1
|
1
|
2
|
3
|
4
|
5
|
6
|
7
|
8
|
k
|
4
|
3
|
2
|
2
|
1
|
2
|
3
|
4
|
5
|
6
|
7
|
s
|
5
|
4
|
3
|
3
|
2
|
2
|
2
|
3
|
4
|
5
|
6
|
p
|
6
|
5
|
4
|
4
|
3
|
3
|
3
|
2
|
3
|
4
|
5
|
e
|
7
|
6
|
5
|
4
|
4
|
3
|
4
|
3
|
3
|
4
|
5
|
a
|
8
|
7
|
6
|
5
|
5
|
4
|
4
|
4
|
4
|
4
|
5
|
r
|
9
|
8
|
7
|
6
|
6
|
5
|
5
|
5
|
4
|
5
|
5
|
e
|
10
|
9
|
8
|
7
|
7
|
6
|
6
|
6
|
5
|
5
|
6
|
_
|
11
|
10
|
9
|
8
|
8
|
7
|
7
|
7
|
6
|
5
|
6
|
Table 2: Approximate Equals ‘Edit
Distance’ table

Table 3: Approximate Equals ‘Edit
Distance’ table
‘Edit
Distance by Approximate Equals’ Code
C#
code in local function
private static int editDistance(string x, string y)
{
int x_length
= x.Length, y_length = y.Length;
int[,] D = new int[x_length
+ 1, y_length + 1];
for (int n = 0; n < x_length + 1; n++)
D[n, 0] =
n;
for (int n = 0; n < y_length + 1; n++)
D[0, n] =
n;
int distHor,
distVer, distDiag;
for (int i =
1; i < x_length +
1; i++)
for (int j = 1; j < y_length + 1; j++)
{
distHor
= D[i, j-1] + 1;
distVer
= D[i-1, j] + 1;
if (x[i - 1] == y[j - 1])
distDiag
= D[i - 1, j - 1];
else
distDiag
= D[i - 1, j - 1] + 1;
D[i, j] = Math.Min(Math.Min(distHor, distVer), distDiag);
}
return D[x_length, y_length];
}
|
Calculate ‘Edit
Distance’ by Approximate Contains
When string x is longer
than string y, we will check does string x approximately contains string y
within. Below is an example of comparing string x “GCGTATGC” vs string y “TATTGGCTATACGGTT”.
|
|
T
|
A
|
T
|
T
|
G
|
G
|
C
|
T
|
A
|
T
|
A
|
C
|
G
|
G
|
T
|
T
|
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
0
|
G
|
1
|
1
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
C
|
2
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
G
|
3
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
T
|
4
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
A
|
5
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
T
|
6
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
G
|
7
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
C
|
8
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table
4: Approximate Contains ‘Edit Distance’ table
First row Edit Distance
will be 0, because we will ignore the offset position of string ‘x’ in
string ‘y’.

Table
5: Approximate Contains ‘Edit Distance’ table

Table
6: Approximate Contains ‘Edit Distance’ table
The above example
is from the lecture note (0440_approx__editdist3.pdf)
written by Dr. Ben Langmead and Jacob Pritt from Johns Hopkins University.
Working principle can be found from their lecture video too.
C# code in local function
static void Main(string[] args)
{
string c = "TATTGGCTATACGGTT";
string d = "GCGTATGC";
var myTuple
= editDistance_ApproximateContains(c, d);
Console.WriteLine("\nString
c = " + c + "\nString d = " + d +
"\nEdit
Distance = " + myTuple.Item1 +
", Location of last
character of strD in strC
= " + myTuple.Item2);
Console.Read();
}
private static Tuple<int, int> editDistance_ApproximateContains(string _a, string _b)
{
string x, y;
//longer string will become
'y'
if (_a.Length > _b.Length)
{
y = _a;
x = _b;
}
else
{
y = _b;
x = _a;
}
int[,] D = new int[x.Length
+ 1, y.Length + 1];
for (int n = 0; n < x.Length + 1; n++)
D[n, 0] = n;
//ignore any offset positions
of x in y, because str.Contains
for (int n = 0; n < y.Length + 1; n++)
D[0, n] = 0;
int distHor,
distVer, distDiag;
for (int i =
1; i < x.Length
+ 1; i++)
for (int j = 1; j < y.Length + 1; j++)
{
distHor =
D[i, j - 1] + 1;
distVer =
D[i - 1, j] + 1;
if (x[i - 1] == y[j - 1])
distDiag
= D[i - 1, j - 1];
else
distDiag
= D[i - 1, j - 1] + 1;
D[i, j] = Math.Min(Math.Min(distHor, distVer), distDiag);
}
int smallestEditDistance
= D[x.Length, 0];
int _StrX_lastChar_location_in_StrY
= 0;
for (int j = 1; j < y.Length + 1; j++)
{
if(smallestEditDistance > D[x.Length,
j])
{
smallestEditDistance
= D[x.Length, j];
_StrX_lastChar_location_in_StrY
= j;
}
}
//print out Edit Distance
dynamic Table, optional
for (int i =
0; i < x.Length
+ 1; i++)
{
for (int j = 0; j < y.Length + 1; j++)
{
Console.Write(D[i, j] + " ");
}
Console.WriteLine();
}
return new Tuple<int, int>(smallestEditDistance, _StrX_lastChar_location_in_StrY
- 1);
}
|
The C# codes here
are direct conversion from lectures (https://www.coursera.org/learn/dna-sequencing/home/welcome)
given by Dr. Ben Langmead and Jacob Pritt from Johns Hopkins University in
Python codes.
Display
detailed differences between two strings using ‘Edit Distance’ table
The ‘Edit Distance’ coding
illustrated above only tell us two strings’ Edit Distance / number of
differences, without telling us the details about the differences.
To know what
characters mismatches between two strings are, computer can trace it using
the blue line on the ‘Edit Distance’ table.
For Approximate
Equals comparison, blue line starts from the last row last column cell; for
Approximate Contains comparison, blue line starts from the last row and the
column cell with the smallest Edit Distance.
The following
previous cell the blue line will draft is the cell that has the smallest
value.
·
When
blue line goes up: means string x need insertion
·
When
blue line goes left: means string x need deletion
·
When
blue line goes diagonal and the cells value no change: both characters
matching
·
When
blue line goes diagonal and the cells value decremental by one: means
string x character is substituted by string y character
The comparison
result will be displayed as follow format:
T:
|
G
|
C
|
-
|
T
|
A
|
T
|
A
|
C
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
P:
|
G
|
C
|
G
|
T
|
A
|
T
|
G
|
C
|

Table 3: Approximate Equals ‘Edit
Distance’ table

Table
6: Approximate Contains ‘Edit Distance’ table
C# code
Single threaded version of
‘Approximate Equals’ and ‘Approximate Contains’, print the detailed
differences string ‘P’ against string ‘T’: ApproxContains_ApproxEquals_DetailedDiffs.cs
Multi-threading version for
‘Approximate Contains’
For ‘Approximate Contains’ function, user
set an allowable Edit Distance and search all the possible locations of
string ‘P’ in string ‘T’. Then print-out all the approximately matched
strings and the differences against string ‘P’.

When string ‘T’ is relatively long
compare to string ‘P’:
1.
Check
number processing threads available on the current computer.
2.
Split
string ‘T’ into smaller segments, number equals to number of system
threads.
3.
Each
thread runs ‘Approximate Contains’ search function on its given string ‘T’
segments
ApproxContains_ApproxEquals_DetailedDiffs_multithread.cs
|