The image displayed above is a visualization of the graph-structure of one of the groups of strings found by string_grouper.  Each circle (node) represents a string, and each connecting arc (edge) represents a match between a pair of strings with a similarity score above a given threshold score (here 0.8).
The centroid of the group, as determined by string_grouper (see tutorials/group_representatives.md for an explanation), is the largest node and also has the most edges originating from it.  A thick line in the image denotes strong similarity between the nodes at its ends, while a faint, thin line denotes weak similarity.
The power of string_grouper is discernible from this image: in large datasets, string_grouper is often able to resolve indirect associations between strings even when, due to memory constraints, direct matches between those strings cannot be computed using conventional methods with a lower similarity threshold.
This image was designed using the graph-visualization software Gephi 0.9.2 with data generated by string_grouper operating on the sec__edgar_company_info.csv sample data file.
string_grouper is a library that makes finding groups of similar strings within a single, or multiple, lists of
strings easy — and fast. string_grouper uses tf-idf to calculate cosine similarities
within a single list or between two lists of strings. The full process is described in the blog post Super Fast String Matching in Python.
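The tf-idf plus cosine-similarity idea can be sketched with scikit-learn. This is an illustration only: the vectorizer settings below are not string_grouper's exact internals (the library does its own n-gram preprocessing, so absolute scores differ; string_grouper itself reports 0.87 for the first pair below, as shown in the table further down).

```python
# Minimal sketch of tf-idf + cosine similarity over character n-grams.
# Illustrative settings only, not string_grouper's exact internals.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

strings = ["0210, LLC", "90210 LLC", "ADVISORS DISCIPLINED TRUST"]

# Each string becomes a tf-idf weighted bag of character 3-grams
tfidf = TfidfVectorizer(analyzer="char", ngram_range=(3, 3)).fit_transform(strings)

# Cosine similarity between the tf-idf vectors: near-duplicates score
# noticeably higher than unrelated strings, which score near zero
sim = cosine_similarity(tfidf)
```

Because the comparison works on character n-grams rather than whole tokens, small edits (a dropped digit, a stray comma) only perturb a few n-grams and the pair still scores well above unrelated strings.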
```shell
pip install string-grouper
```
string_grouper leverages the blazingly fast sparse_dot_topn library to calculate cosine similarities.
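The scalability trick behind this can be sketched with plain scipy: multiply sparse L2-normalized tf-idf matrices and keep only entries above a threshold, so the full dense similarity matrix is never materialized (sparse_dot_topn fuses this filtering into the multiplication itself, in optimized C++; the toy vectors below are made up for illustration).

```python
# Sketch of thresholded sparse cosine similarity via a sparse matrix product.
import numpy as np
from scipy.sparse import csr_matrix

# Toy tf-idf rows for 3 strings over 4 features (hypothetical values)
rows = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
rows /= np.linalg.norm(rows, axis=1, keepdims=True)  # L2-normalize each row
A = csr_matrix(rows)

# For L2-normalized vectors, cosine similarity is just the product A @ A.T
sim = (A @ A.T).toarray()

# Keep only pairs above the similarity threshold
matched_pairs = np.argwhere(sim > 0.8)
```

Strings 0 and 1 are near-duplicates and survive the threshold; string 2 shares no features with them and is discarded without ever being stored.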
```python
import datetime

s = datetime.datetime.now()
matches = match_strings(names['Company Name'], number_of_processes=4)
e = datetime.datetime.now()
diff = (e - s)
str(diff)
```

Results in:

```
00:05:34.65
```

on an Intel i7-6500U CPU @ 2.50GHz, where len(names) = 663 000.

In other words, the library is able to perform fuzzy matching of 663 000 names in five and a half minutes on a 2015 consumer CPU using 4 cores.
```python
import pandas as pd
from string_grouper import match_strings

company_names = 'sec__edgar_company_info.csv'
companies = pd.read_csv(company_names)

# Create all matches:
matches = match_strings(companies['Company Name'])

# Look at only the non-exact matches:
matches[matches['left_Company Name'] != matches['right_Company Name']].head()
```

| | left_index | left_Company Name | similarity | right_Company Name | right_index |
|---|---|---|---|---|---|
| 15 | 14 | 0210, LLC | 0.870291 | 90210 LLC | 4211 | 
| 167 | 165 | 1 800 MUTUALS ADVISOR SERIES | 0.931615 | 1 800 MUTUALS ADVISORS SERIES | 166 | 
| 168 | 166 | 1 800 MUTUALS ADVISORS SERIES | 0.931615 | 1 800 MUTUALS ADVISOR SERIES | 165 | 
| 172 | 168 | 1 800 RADIATOR FRANCHISE INC | 1 | 1-800-RADIATOR FRANCHISE INC. | 201 | 
| 178 | 173 | 1 FINANCIAL MARKETPLACE SECURITIES LLC /BD | 0.949364 | 1 FINANCIAL MARKETPLACE SECURITIES, LLC | 174 | 
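The shape of this output can be reproduced on toy data with pandas and scikit-learn alone. This is a hypothetical sketch mirroring the column names above, not string_grouper's implementation; its n-gram preprocessing differs, so the threshold here is lowered from the library's default of 0.8.

```python
# Sketch of match_strings-style long-format output on toy data.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = pd.Series(["1 800 MUTUALS ADVISOR SERIES",
                   "1 800 MUTUALS ADVISORS SERIES",
                   "ACME CORP"], name="Company Name")

tfidf = TfidfVectorizer(analyzer="char", ngram_range=(3, 3)).fit_transform(names)
sim = cosine_similarity(tfidf)

# string_grouper's default threshold is 0.8; this toy vectorizer scores a
# little lower, so use 0.7 for illustration
threshold = 0.7
pairs = [(i, names[i], sim[i, j], names[j], j)
         for i in range(len(names)) for j in range(len(names))
         if sim[i, j] >= threshold]
matches = pd.DataFrame(pairs, columns=["left_index", "left_Company Name",
                                       "similarity", "right_Company Name",
                                       "right_index"])

# Keep only the non-exact matches, as in the snippet above
non_exact = matches[matches["left_Company Name"] != matches["right_Company Name"]]
```

Each above-threshold pair appears twice (left-to-right and right-to-left), just as the ADVISOR/ADVISORS pair does in the table above.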
```python
from string_grouper import group_similar_strings

# Assign each company a group id and a deduplicated (representative) name:
companies[["group-id", "name_deduped"]] = group_similar_strings(companies['Company Name'])

# The ten largest groups:
companies.groupby('name_deduped')['Line Number'].count().sort_values(ascending=False).head(10)
```

| name_deduped | Line Number |
|---|---|
| ADVISORS DISCIPLINED TRUST | 1747 | 
| NUVEEN TAX EXEMPT UNIT TRUST SERIES 1 | 916 | 
| GUGGENHEIM DEFINED PORTFOLIOS, SERIES 1200 | 652 | 
| U S TECHNOLOGIES INC | 632 | 
| CAPITAL MANAGEMENT LLC | 628 | 
| CLAYMORE SECURITIES DEFINED PORTFOLIOS, SERIES 200 | 611 | 
| E ACQUISITION CORP | 561 | 
| CAPITAL PARTNERS LP | 561 | 
| FIRST TRUST COMBINED SERIES 1 | 560 | 
| PRINCIPAL LIFE INCOME FUNDINGS TRUST 20 | 544 | 
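Conceptually, the grouping step treats every above-threshold match as an edge in a graph (like the one visualized above) and puts each connected component into one group, which is why indirect associations emerge: two strings can land in the same group without ever matching each other directly. A toy sketch with scipy, using a made-up edge list:

```python
# Sketch of grouping via connected components over the match graph.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

names = ["ACME CORP", "ACME CORP.", "ACME CORPORATION", "WIDGETS LLC"]

# Hypothetical above-threshold matches: 0-1 and 1-2 matched directly;
# 0 and 2 never matched each other, yet end up in the same group
edges = [(0, 1), (1, 2)]
row, col = zip(*edges)
adj = csr_matrix((np.ones(len(edges)), (row, col)),
                 shape=(len(names), len(names)))

n_groups, labels = connected_components(adj, directed=False)
```

Here `labels` assigns strings 0, 1, and 2 to one group and "WIDGETS LLC" to another, even though strings 0 and 2 were only linked through string 1.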
The documentation can be found here