Levenshtein distance

From Wikipedia, the free encyclopedia
Jump to navigation Jump to search

Levenshtein distance is a way to calculate how different two words (or longer chains of symbols, like sentences or paragraphs) are from one another.

The simple way this works is by counting how many times you need to change one word to turn it into another word.

The three atomic 'changes' considered in this measure are: inserting a single symbol (usually a character, like letter, digit etc.), deleting a single character, and replacing (substitution) a single character with another one. Moving a character to another position in the text, swapping two characters, as well as adding, deleting or replacing longer blocks of characters (like words in a sentence) are not counted as single changes in this measure.

For example, the Levenshtein distance between "kitten" and "sitting" is 3, since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

  1. kitten → sitten (substitution of 's' for 'k')
  2. sitten → sittin (substitution of 'i' for 'e')
  3. sittin → sitting (insertion of 'g' at the end).

This can be used by various websites when you change your password to make sure you're not using a similar password that can decrease the security of the website. The given minimum distance between two passwords is recommended to be 5, so that means that 5 changes need to take place between your old password and your new password for certain websites to accept your new password.

This can also be used to estimate if one text is a plagiarism from the other one.