Introduction to String similarity and soundex | Algorithms comparison | Java Apache commons similarity and codec soundex examples

In this article we will get familiar with different ways to check if two strings are similar. ‘Being Similar’ is different than ‘Being equal’ or ‘Being equals ignore case’ as inputs might have slight differences yet similar in some ways.

Similarity features

Here are few features in two inputs strings that might tell us how much similar they are.

  • Matching – How many words or characters match between two inputs. Ex: “abc” and “abz” have 2 characters matching but 1 different.
  • Sequence – Are matching characters in same order or sequence? Ex: “abc” and “abz” have matching characters ‘ab’ in same sequence. “abc” and “zba” have matching ‘ab’ characters but not in same sequence.
  • Sound – Some spellings might be different, but when we pronounce them they might sound same. That might be a criteria to categorize them as similar.

Algorithms

Below is very quick comparison of existing algorithms to find out String similarity or String distance (As per Apache implementation). Algorithms are compared based on –

  • Comparison by –  Tells if algorithm compares input strings by words, or by characters or by phonetics.
  • Case sensitive – Some algorithms treat different case of same character as different characters so they are case sensitive. Others might treat them same so case insensitive.
  • Sequence matters? – In addition to matching characters, some algorithms will also verify if sequence of matching characters in one string is same in other string. This gives additional perspective about how much similar strings are.
  • Result – Some algorithms give number between 0 to 1 indicating percentage of similarity but others might give some points indicating similarity.
Algorithm Comparison
by
Case
Sensitive
Sequence
Matters?
Result
Cosine Similarity / Distance word Number
(0 to 1)
Fuzzy Score character Score/
points
Hamming Distance
(Same length input strings only)
character Count of substitutions
Jaccard Similarity / Distance character Number
(0 to 1)
Jaro Winkler Similarity / distance character Number
(0 to 1)
Levenshtein Distance character Count of edits
Longest Common Subsequence character Common String
Soundex phonetics Soundex codes or their diff



How similarity algorithms work?

This video gives detailed understanding of how all above similarity algorithms works. Continue after video to get into actual code examples.

Apache implementation examples

Apache provides out of the box implementations of above algorithms. You will need below library dependencies in order to execute examples.

Library Dependency

Cosine similarity/distance

  • Algorithm
    • Similarity is checked by words in both inputs. Word counts are put in the cosine similarity formula as shown in above video to get similarity.
    • Algorithm is case sensitive so if words are in different case, they count as different.
    • Sequence of words doesn’t matter. Two sentences with same words in different order count as exact same.
  • Result
    • Algorithm gives value between 0 to 1. You can multiply it by 100 to get in percentage format.
    • Similarity = 1 – Distance (one minus distance)
  • Usage
    • Cosine similarity algorithms is used to find out similarity between documents regardless of their size. Formula for cosine doesn’t depend on magnitude of inputs. Below code has example of similarity between definitions of Gravity from Nasa & Cambridge.



Fuzzy score

  • Algorithm
    • Similarity is checked by matching characters in query String against term String.
    • This algorithms is case insensitive. Apache implementation converts inputs to lower case before comparison.
    • Sequence of characters in query, if matches against term then algorithm gives bonus points. So higher point also indicate that characters are in same sequence.
    • Note that method arguments in Apache implementation are position sensitive. First argument is term & second argument is query. Locale is needed to normalize strings to lower case.
    • Position of query in term doesn’t matter. Score of query ‘abc’ against term ‘abcxyz’ or ‘xyzabc’ or ‘xabcyz’ gives same result since they are in same sequence i.e. fuzzy score of query against term is 7.0 points, so matches query anywhere in term.
  • Result
    • Algorithm gives score in terms of points. Higher points means more similarity.
  • Usage
    • As noted in Apache documentation, editors like Sublime Text use similar algorithms to show suggestions in order of score. Here is screenshot of Sublime Text with same test data from below example. Search results are ordered by score.



Hamming distance

  • Algorithm
    • Distance / dis-similarity is checked by matching characters at same indexes. It is mainly a distance algorithm so it does not give ‘similarity’.
    • Algorithm is case sensitive, hence different case means different character.
    • If same characters in different sequence in two words, they count as different so sequence matters.
    • Algorithms has constraint that Both inputs must be same length. Apache implementation gives IllegalArgumentException if different length.
  • Result
    • Number of substitutions needed to change one input into other. Its called edit distance.
  • Usage
    • Hamming distance is used to determine transmission errors of bits. For example: if receiver is supposed to receive code as 0101 but they receive 0100, then number of error bits = hamming distance of 0101 & 0100 i.e. 1. So it has 1 bit error. For more info Read This.



Jaccard distance

  • Algorithm
    • Similarity is checked by characters using intersection of characters over union of characters. i.e. in case of exact match intersection = union. If no match at all, Intersection is zero.
    • Algorithms is case sensitive.
    • Sequence of characters or count of each character in given input doesn’t matter. Two words with same characters & different counts of same characters count as same i.e. ‘singing’ & ‘sign’ are considered 100% same.
  • Result
    • Algorithm gives value between 0 to 1. You can multiply it by 100 to get in percentage format.
    • Similarity = 1 – Distance (one minus distance)



Jaro Winkler distance

  • Algorithm
    • Similarity is checked by matching characters in specific way & then performing a Jaro formula & Winkler adjustment formula. on it to get final similarity/distance.
    • Algorithm is case sensitive.
    • Algorithm doesn’t really look for sequence. But sequence plays some role in sense that, if same character is beyond matching range then its not considered as match. Above video showcase matching range in action.
  • Result
    • Gives value between 0 to 1 kind of percentage.
    • Gives dis-similarity, do 1-x for similarity.



Levenshtein Distance

  • Algorithm
    • Similarity is checked by matching characters & finding out edits needed. Edits can be Replace character, Insert Character, Delete Character.
    • Algorithm is case sensitive.
    • Sequence of characters matters. Characters might be present, but if in wrong order then it might not be counted as match.
  • Result
    • Number of edits needed to change one input into other.



Longest Common Subsequence

  • Algorithm
    • Similarity is checked by matching aligned characters.
    • Algorithm is case sensitive.
    • Sequence matters in this algorithm. If character out of sequence, its not considered match.
  • Result
    • This algorithm doesn’t really give similarity or distance indicating value.
    • This gives String of common matching sequence of characters



Soundex

  • Algorithm
    • Similarity is checked by converting input string into soundex code. Then diff of soundex code tells if String are similar in phonetic way. Above video shows algorithm in action.
    • Algorithm is case in-sensitive. Case doesn’t matter because it is intended to find similarity based on phonetics or pronunciation.
  • Result
    • This algorithm can either give soundex code for inputs individually & you can compare by yourselves or you can use inbuilt diff which gives value betwee 0 to 4. 0 is no similarity, 4 is strong similarity.



Leave a Reply

Your email address will not be published.