Approximate String Matching or

Fuzzy String Matching

 

 

Introduction

Approximate String Matching or Fuzzy String Matching is a method of finding whether a string is matching approximately / fuzzily with another string.

Certain tolerable number of mismatches are permitted to identify both strings are approximately / fuzzily matching. The number of mismatches are call “Edit Distance” (integer), we can decide what is the maximum “Edit Distance” value we can accept to determine whether both strings are approximately matching.

The mismatches include insertions, deletions, substitutions of individual characters inside the target or source strings.

 

·        insertion: cotcoat

Edit Distance = 1

·        deletion: coatcot

Edit Distance = 1

·        substitution: coatcost

Edit Distance = 1

·        costcots

deletion + insertion = 2 Edit Distance

Reference from Wikipedia. (https://en.wikipedia.org/wiki/Approximate_string_matching)

 

Example applications of Approximate String Matching are mentioned in Wikipedia.

Approximate String Matching should not imply to ‘Fast Exact String-Matching’ method.

 

Recursive Finding of ‘Edit Distance’

Take long time to derive the outcome.

 

C# code in local function

private static int edDistRecursive(string a, string b)

{

   if (a.Length == 0)

      return b.Length;

 

   if (b.Length == 0)

      return a.Length;

 

   int delt;

   if (a.LastOrDefault() != b.LastOrDefault())

      delt = 1;

   else

      delt = 0;

 

   return Math.Min(Math.Min(

                              edDistRecursive(a.Substring(0, a.Length - 1), b.Substring(0, b.Length - 1)) + delt,

                              edDistRecursive(a.Substring(0, a.Length - 1), b) + 1),

                              edDistRecursive(a, b.Substring(0, b.Length - 1)) + 1);

           

}

 

 

 

Calculating ‘Edit Distance’ using Dynamic Programming

This is a faster method to derive ‘Edit Distance’ / number of mismatches, it calls Dynamic Programming. We create an Edit Distance table to derive our targeted Edit Distance faster.

When two strings are approximately same, we use ‘Edit Distance by Approximate Equals’ method. Similar command in C# will be ‘str1.equals(str2)’.

When one string is much longer than another string, we would say string#1 approximately contains string#2 within. Standard C# command alike will be ‘str1.contains(str2)’.

Both ‘Edit Distance by Approximate Equals’ and ‘Edit Distance by Approximate Contains’ coding are similar but with few differences.

 

Calculate ‘Edit Distance’ by Approximate Equals

Below is an example of comparing string y “shekespr_*” vs string x “shakspeare_

1.     First we create string x (x.length + 1) time string y (y.length + 1) 2-dimentional array Edit Distance table.

2.     Then we fill-up first row and first column with incremental numbers, start from 1 till x.length & y.length. The incremental numbers mean that if both strings are found offset by one (to the left or right), ‘Edit Distance’ will increment by 1 for subsequent characters.

 

 

s

h

e

k

e

s

p

r

_

*

 

0

1

2

3

4

5

6

7

8

9

10

s

1

0

 

 

 

 

 

 

 

 

 

h

2

 

 

 

 

 

 

 

 

 

 

a

3

 

 

 

 

 

 

 

 

 

 

k

4

 

 

 

 

 

 

 

 

 

 

s

5

 

 

 

 

 

 

 

 

 

 

p

6

 

 

 

 

 

 

 

 

 

 

e

7

 

 

 

 

 

 

 

 

 

 

a

8

 

 

 

 

 

 

 

 

 

 

r

9

 

 

 

 

 

 

 

 

 

 

e

10

 

 

 

 

 

 

 

 

 

 

_

11

 

 

 

 

 

 

 

 

 

 

Table 1: Approximate Equals ‘Edit Distance’ table

3.     Then we can fill up the cells inside the table with Edit Distance value.

a.     Compare the respective cells (i & j) string x character and string y character.

b.     If they are the same, ‘distDiag = D[i - 1, j - 1]’. Copy cells ‘D[i - 1, j - 1]’ value.

c.      If they are different characters, ‘distDiag = D[i - 1, j - 1] + 1’. Copy cells ‘D[i - 1, j - 1]’ value but add one.

d.     distHor = D[i, j-1] + 1 

e.     distVer = D[i-1, j] + 1

f.       Compare ‘distDiag’, ‘distHor’ & ‘distVer’, see which one has the smallest value.

g.     Set current cell’s Edit Distance to this value

4.     When table is filled up, local function will return the last row last column cell’s Edit Distance value. This will be the two strings ‘Approximate Equals Edit Distance’ result. In this example, Edit Distance is 6. String y “shekespr_*” need 3 insertions, 2 deletions and 1 substitution to match string x “shakspeare_”.

 

 

 

s

h

e

k

e

s

p

r

_

*

 

0

1

2

3

4

5

6

7

8

9

10

s

1

0

1

2

3

4

5

6

7

8

9

h

2

1

0

1

2

3

4

5

6

7

8

a

3

2

1

1

2

3

4

5

6

7

8

k

4

3

2

2

1

2

3

4

5

6

7

s

5

4

3

3

2

2

2

3

4

5

6

p

6

5

4

4

3

3

3

2

3

4

5

e

7

6

5

4

4

3

4

3

3

4

5

a

8

7

6

5

5

4

4

4

4

4

5

r

9

8

7

6

6

5

5

5

4

5

5

e

10

9

8

7

7

6

6

6

5

5

6

_

11

10

9

8

8

7

7

7

6

5

6

Table 2: Approximate Equals ‘Edit Distance’ table

Table 3: Approximate Equals ‘Edit Distance’ table

Table 3: Approximate Equals ‘Edit Distance’ table

 

 

‘Edit Distance by Approximate Equals’ Code

C# code in local function

private static int editDistance(string x, string y)

{

   int x_length = x.Length, y_length = y.Length;

   int[,] D = new int[x_length + 1, y_length + 1];

 

   for (int n = 0; n < x_length + 1; n++)

      D[n, 0] = n;

 

   for (int n = 0; n < y_length + 1; n++)

      D[0, n] = n;

 

   int distHor, distVer, distDiag;

 

   for (int i = 1; i < x_length + 1; i++)

      for (int j = 1; j < y_length + 1; j++)

      {

         distHor = D[i, j-1] + 1;

         distVer = D[i-1, j] + 1;

 

         if (x[i - 1] == y[j - 1])

            distDiag = D[i - 1, j - 1];

         else

            distDiag = D[i - 1, j - 1] + 1;

 

         D[i, j] = Math.Min(Math.Min(distHor, distVer), distDiag);

      }

 

   return D[x_length, y_length];

}

 

Calculate ‘Edit Distance’ by Approximate Contains

When string x is longer than string y, we will check does string x approximately contains string y within. Below is an example of comparing string x “GCGTATGC” vs string y “TATTGGCTATACGGTT”.

 

 

T

A

T

T

G

G

C

T

A

T

A

C

G

G

T

T

 

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

G

1

1

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

C

2

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

G

3

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

T

4

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

A

5

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

T

6

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

G

7

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

C

8

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Table 4: Approximate Contains ‘Edit Distance’ table

 

First row Edit Distance will be 0, because we will ignore the offset position of string ‘x’ in string ‘y’.

Table 5: Approximate Contains ‘Edit Distance’ table

Table 5: Approximate Contains ‘Edit Distance’ table

Table 6: Approximate Contains ‘Edit Distance’ table

Table 6: Approximate Contains ‘Edit Distance’ table

 

The above example is from the lecture note (0440_approx__editdist3.pdf) written by Dr. Ben Langmead and Jacob Pritt from Johns Hopkins University. Working principle can be found from their lecture video too.

 

C# code in local function

 

static void Main(string[] args)

{

   string c = "TATTGGCTATACGGTT";

   string d = "GCGTATGC";

 

   var myTuple = editDistance_ApproximateContains(c, d);

   Console.WriteLine("\nString c = " + c + "\nString d = " + d +

                "\nEdit Distance = " + myTuple.Item1 +

                ", Location of last character of strD in strC = " + myTuple.Item2);

   Console.Read();

 

}

 

private static Tuple<int, int> editDistance_ApproximateContains(string _a, string _b)

{

   string x, y;

 

   //longer string will become 'y'

   if (_a.Length > _b.Length)

   {

      y = _a;

      x = _b;

   }

   else

   {

      y = _b;

      x = _a;

   }

 

int[,] D = new int[x.Length + 1, y.Length + 1];

 

for (int n = 0; n < x.Length + 1; n++)

   D[n, 0] = n;

 

//ignore any offset positions of x in y, because str.Contains

for (int n = 0; n < y.Length + 1; n++)

   D[0, n] = 0;

 

int distHor, distVer, distDiag;

 

for (int i = 1; i < x.Length + 1; i++)

   for (int j = 1; j < y.Length + 1; j++)

   {

      distHor = D[i, j - 1] + 1;

      distVer = D[i - 1, j] + 1;

 

      if (x[i - 1] == y[j - 1])

         distDiag = D[i - 1, j - 1];

      else

         distDiag = D[i - 1, j - 1] + 1;

 

      D[i, j] = Math.Min(Math.Min(distHor, distVer), distDiag);

   }

 

 

   int smallestEditDistance = D[x.Length, 0];

   int _StrX_lastChar_location_in_StrY = 0;

   for (int j = 1; j < y.Length + 1; j++)

   {

      if(smallestEditDistance > D[x.Length, j])

      {

         smallestEditDistance = D[x.Length, j];

         _StrX_lastChar_location_in_StrY = j;

      }

   }

 

   //print out Edit Distance dynamic Table, optional

   for (int i = 0; i < x.Length + 1; i++)

   {

      for (int j = 0; j < y.Length + 1; j++)

      {

         Console.Write(D[i, j] + " ");

      }

      Console.WriteLine();

   }

 

   return new Tuple<int, int>(smallestEditDistance, _StrX_lastChar_location_in_StrY - 1);

}

 

The C# codes here are direct conversion from lectures (https://www.coursera.org/learn/dna-sequencing/home/welcome) given by Dr. Ben Langmead and Jacob Pritt from Johns Hopkins University in Python codes.

 

 

 


Display detailed differences between two strings using ‘Edit Distance’ table

 

The ‘Edit Distance’ coding illustrated above only tell us two strings’ Edit Distance / number of differences, without telling us the details about the differences.

To know what characters mismatches between two strings are, computer can trace it using the blue line on the ‘Edit Distance’ table.

For Approximate Equals comparison, blue line starts from the last row last column cell; for Approximate Contains comparison, blue line starts from the last row and the column cell with the smallest Edit Distance.

The following previous cell the blue line will draft is the cell that has the smallest value.

·       When blue line goes up: means string x need insertion

·       When blue line goes left: means string x need deletion

·       When blue line goes diagonal and the cells value no change: both characters matching

·       When blue line goes diagonal and the cells value decremental by one: means string x character is substituted by string y character

 

The comparison result will be displayed as follow format:

T:

G

C

-

T

A

T

A

C

 

|

|

 

|

|

|

 

|

P:

G

C

G

T

A

T

G

C

 

Table 3: Approximate Equals ‘Edit Distance’ table

Table 3: Approximate Equals ‘Edit Distance’ table

 

Table 6: Approximate Contains ‘Edit Distance’ table

Table 6: Approximate Contains ‘Edit Distance’ table

 

C# code

Single threaded version of ‘Approximate Equals’ and ‘Approximate Contains’, print the detailed differences string ‘P’ against string ‘T’: ApproxContains_ApproxEquals_DetailedDiffs.cs

 

Multi-threading version for ‘Approximate Contains’

For ‘Approximate Contains’ function, user set an allowable Edit Distance and search all the possible locations of string ‘P’ in string ‘T’. Then print-out all the approximately matched strings and the differences against string ‘P’.

'Approximate Contains' multithreading explanation

 

When string ‘T’ is relatively long compare to string ‘P’:

1.     Check number processing threads available on the current computer.

2.     Split string ‘T’ into smaller segments, number equals to number of system threads.

3.     Each thread runs ‘Approximate Contains’ search function on its given string ‘T’ segments

ApproxContains_ApproxEquals_DetailedDiffs_multithread.cs

 

 

 

 

 

 

 

Edit date: 23 May 2020