fuzzywuzzy: string matching in Python

When working with text analytics, we often need to compare strings. There are several algorithms and approaches for this. Let's have a look at the fuzzywuzzy library.

fuzzywuzzy
Installation
pip install fuzzywuzzy  
pip install python-Levenshtein  

fuzzywuzzy works even if you don't install python-Levenshtein, but installing it improves performance.
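
If you are not sure whether the faster backend is present, a quick check like the one below can tell you. This is only a sketch: the helper name has_levenshtein is made up for illustration, and it assumes that python-Levenshtein exposes a module named Levenshtein.

```python
import importlib.util


def has_levenshtein() -> bool:
    """Hypothetical helper: True if the `Levenshtein` module can be imported."""
    # Assumption: python-Levenshtein installs a top-level module called `Levenshtein`.
    return importlib.util.find_spec("Levenshtein") is not None


print("fast backend available:", has_levenshtein())
```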

Using fuzz.ratio

This is the basic comparison. The score ranges from 0 to 100, and the output is as below:

>>> from fuzzywuzzy import fuzz
>>> from fuzzywuzzy import process
>>> fuzz.ratio("ABCD", "ABCD")
100  
>>> fuzz.ratio("ABCD", "ABCDE")
89  
>>> fuzz.ratio("ABCD", "ABCDEF")
80  
>>> fuzz.ratio("ABCD", "ABCDEFG")
73  
>>> fuzz.ratio("ABCD", "ABCDEFGH")
67  
>>> fuzz.ratio("ABCD", "ABCDEFGHI")
62  
>>> fuzz.ratio("ABCD", "ABCDEFGHIJ")
57  
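
The scores above follow a simple pattern: twice the number of matching characters, divided by the combined length of both strings, scaled to 100. For "ABCD" and "ABCDE" that is 2 * 4 / 9, or about 89. A minimal sketch with the standard library's difflib reproduces these numbers. The helper name simple_ratio is made up for illustration, and fuzzywuzzy's own implementation may differ in details.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Hypothetical helper: 2 * matches / (len(a) + len(b)), scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# The score drops as the second string gains characters the first one lacks.
for other in ["ABCD", "ABCDE", "ABCDEF", "ABCDEFGHIJ"]:
    print(other, simple_ratio("ABCD", other))
```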
Using partial_ratio

ratio is a very simple comparison. You can use partial_ratio to do substring matching.

>>> fuzz.partial_ratio("ABCD", "ABCDEFGHIJ")
100  

But even partial_ratio fails when the words are scrambled.

>>> fuzz.partial_ratio("India Vs Aus","Aus Vs India")
42  
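
One way to picture partial matching is as a sliding window: the shorter string is compared against every equally long slice of the longer one, and the best score wins. The sketch below is a brute-force version of that idea built on difflib. The name partial_ratio_sketch is hypothetical, and this is not necessarily how fuzzywuzzy chooses its candidate windows.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(a: str, b: str) -> int:
    """Best score of the shorter string against any equal-length window of the longer."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    width = len(shorter)
    best = 0.0
    # Slide a window of the shorter string's length across the longer string.
    for start in range(len(longer) - width + 1):
        window = longer[start:start + width]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


print(partial_ratio_sketch("ABCD", "ABCDEFGHIJ"))
print(partial_ratio_sketch("India Vs Aus", "Aus Vs India"))
```

When both strings have the same length there is only one window, so a partial match cannot do better than the plain ratio. That is why reordered words still score low.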
Using token_sort_ratio

Basically, “India Vs Aus” and “Aus Vs India” mean the same thing; people may write it either way and both are valid. When the word order may differ, you can use token_sort_ratio.

>>> fuzz.token_sort_ratio("India Vs Aus","Aus Vs India")
100  
>>> fuzz.token_sort_ratio("India cricket team Vs Aus cricket team","Aus Vs India")
48  

Now let's add a further complication: if I add ‘cricket team’ to one of the strings, the match no longer works, as the second example above shows.
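
The idea behind token sorting can be sketched in a few lines: lowercase both strings, split them into words, sort the words, and compare the results. The version below uses difflib and a made-up name, token_sort_sketch. It skips the extra cleanup fuzzywuzzy may apply (punctuation handling, for instance), so treat it as an illustration only.

```python
from difflib import SequenceMatcher


def token_sort_sketch(a: str, b: str) -> int:
    """Lowercase, split on whitespace, sort the tokens, then compare."""
    def normalise(text: str) -> str:
        return " ".join(sorted(text.lower().split()))

    return round(100 * SequenceMatcher(None, normalise(a), normalise(b)).ratio())


# Word order no longer matters: both strings normalise to "aus india vs".
print(token_sort_sketch("India Vs Aus", "Aus Vs India"))
```

Sorting fixes the order, but extra words such as ‘cricket team’ remain in the normalised string, so they still pull the score down.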

Using token_set_ratio

In such cases, token_set_ratio might help you.

>>> fuzz.token_set_ratio("India cricket team Vs Aus cricket team","Aus Vs India")
100  
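
Roughly speaking, the set-based approach splits both strings into sets of words, separates the shared words from the leftovers, and takes the best score among a few combinations. If one string's words are entirely contained in the other, the shared part matches perfectly. Here is a simplified sketch using difflib. The names _ratio and token_set_sketch are hypothetical, and fuzzywuzzy's real preprocessing may differ.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_sketch(a: str, b: str) -> int:
    """Compare the shared tokens against each string's shared-plus-leftover tokens."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = f"{common} {only_a}".strip()
    combined_b = f"{common} {only_b}".strip()
    # Take the best of the three pairwise comparisons.
    return max(
        _ratio(common, combined_a),
        _ratio(common, combined_b),
        _ratio(combined_a, combined_b),
    )


print(token_set_sketch("India cricket team Vs Aus cricket team", "Aus Vs India"))
```

In this example the second string has no leftover words, so the shared part "aus india vs" is compared with itself and the score is 100.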
Which matching algorithm to use?

Well, just because fuzz.token_set_ratio works even with extra words does not mean it is suitable for your application. Ultimately, it boils down to what you want to compare and how you use the data.
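
A practical way to decide is to run a handful of representative pairs from your own data through every scorer and look at the numbers side by side. The helper below is a small sketch for that. The names compare_scorers and difflib_ratio are made up here, and the difflib-based scorer is only a stand-in so the snippet runs without fuzzywuzzy installed.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict


def compare_scorers(
    a: str, b: str, scorers: Dict[str, Callable[[str, str], int]]
) -> Dict[str, int]:
    """Run every scorer on the same pair and collect the scores by name."""
    return {name: scorer(a, b) for name, scorer in scorers.items()}


def difflib_ratio(a: str, b: str) -> int:
    """Stand-in scorer (an assumption, not part of fuzzywuzzy)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


print(compare_scorers("India Vs Aus", "Aus Vs India", {"difflib_ratio": difflib_ratio}))
```

With fuzzywuzzy installed, you could pass the real functions instead, for example {"ratio": fuzz.ratio, "partial_ratio": fuzz.partial_ratio, "token_sort_ratio": fuzz.token_sort_ratio, "token_set_ratio": fuzz.token_set_ratio}.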

You can take a look at other libraries such as jellyfish if you are looking for alternative approaches.
