Levenshtein distance

Levenshtein distance

Thursday - December 27, 2018

The levenshtein formula calculates the similarity in two strings.

It will return a value corresponding with the amount of letter changes required to become the other string.

'0' means the two strings are exactly the same.
'1' means a single letter is different between the two strings
'2' means two letters are different between the two strings.
...and so on.

Examples:

Word 1 Word 2 Levenshtein Distance Explanation
Cat Bat 1 ‘C’ and ‘B’ are different.
Plain Plane 2 The last two letters are different.
Apple Apple 0 No difference

The formula

Formula

Here are what the letters in the formula represent:

a = string #1
b = string #2
i = number of letters in string #1
j = number of letters in string #2
ai = Refers to a specific letter in string #1, using the i variable
bj = Refers to a specific letter in string #2, using the j variable

Source: Wikipedia

A recursive formula for levenshtein distance. A “recursive formula” means that you are running the same formula inside the formula - you can see that there are three levenshtein formulas (labeled as “lev”) inside the levenshtein formula.

The formula in plain english:

The first thing you do is check whether i or j is equal to zero. If so, that particular levenshtein value will be the bigger number (max) of i or j.

If neither i or j is equal to zero, you must run the levenshtein formula in three separate variations. For the first two variations, you add 1 to the result. In the last variation you will add 1 only if the letters ai and bj are not equal.

Whichever variation produces the smallest number (min) is your levenshtein distance.

The three variations mentioned:

Implementing this algorithm in code

In Common Lisp:

;;; The main function
;; a = string #1
;; b = string #2
;; i = current index in string #1
;; j = current index in string #2
(defun lev (a b i j)
  (if (= (min i j) 0)
      (max i j)
      ;;; otherwise.
      (min (+ (lev a b (- i 1) j) 1)
           (+ (lev a b i (- j 1)) 1)
           (+ (lev a b (- i 1) (- j 1))
              (if (eq (char a (- i 1))
                      (char b (- j 1)))
                  0
                  1)))))

;;; Convenience function
(defun levenshtein (a b &key (case-insensitive nil))
  (when case-insensitive
    (setq a (string-downcase a))
    (setq b (string-downcase b)))
  (lev a b (length a) (length b)))

;;; Example usage
(levenshtein "CaT" "cat")
                                        ; 2

(levenshtein "CaT" "cat" :case-insensitive t)
                                        ; 0

(levenshtein "apple" "orange")
                                        ; 5

In Python

def lev(a, b, i, j):
    if min(i, j) == 0:
        return max(i, j)
    return min(
        lev(a, b, i - 1, j) + 1,
        lev(a, b, i, j - 1) + 1,
        lev(a, b, i - 1, j - 1) + (0 if a[i-1] == b[j-1] else 1))


def levenshtein(a, b, caseInsensitive=False):
    if caseInsensitive:
        a = a.lower()
        b = b.lower()
    return lev(a, b, len(a), len(b))


levenshtein("CaT", "cat")
# 2

levenshtein("CaT", "cat", caseInsensitive=True)
# 0

levenshtein("apple", "orange")
# 5

Leave a comment