python - Filter list to remove similar, but not identical, entries -


I have a long list containing several thousand names that are all unique strings, and I would like to filter them to produce a shorter list, so that if there are similar names only one is retained. For example, the original list could contain:

mickey mouse

mickey m mouse

mickey m. mouse

The new list would contain just one of them; it doesn't matter which at this moment in time. It's possible to get a similarity score using the code below (where a and b are the text being compared), so providing I pick an appropriate ratio I have a way of making an include/exclude decision.

difflib.SequenceMatcher(None, a, b).ratio()

What I'm struggling to work out is how to populate the second list from the first one. I'm sure it's a trivial matter, but it's baffling my newbie brain.

I'd have thought something along the lines of this would have worked, but nothing ends up being populated in the second list.

for p in ppl1:
    for pp in ppl2:
        if difflib.SequenceMatcher(None, p, pp).ratio() <= 0.9:
            ppl2.append(p)

In fact, even if it did populate the list, it'd still be wrong. I guess it'd need to compare each name in the first list to all the names in the second list, keep track of the highest ratio scored, and then add it only if the highest ratio is less than the cutoff criteria.
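A minimal sketch of the idea described above, assuming a hypothetical helper named filter_similar and a default cutoff of 0.9 (both are illustrative choices, not part of difflib):

```python
import difflib


def filter_similar(names, cutoff=0.9):
    """Keep a name only if it is not too similar to any name already kept.

    filter_similar and cutoff are illustrative names, not a library API.
    """
    kept = []
    for name in names:
        # Highest similarity between this name and anything kept so far;
        # default=0.0 covers the first name, when kept is still empty.
        best = max(
            (difflib.SequenceMatcher(None, name, k).ratio() for k in kept),
            default=0.0,
        )
        if best < cutoff:
            kept.append(name)
    return kept


ppl1 = ["mickey mouse", "mickey m mouse", "mickey m. mouse", "donald duck"]
ppl2 = filter_similar(ppl1)
```

Note that the result of a greedy pass like this depends on both the order of the input and the chosen cutoff, so the cutoff probably needs tuning against real data.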

Any guidance gratefully received!

I'm going to risk never getting an accept because this may be too advanced for you, but here's the optimal solution.

What you're trying to do is a variant of agglomerative clustering. The union-find algorithm can be used to solve this efficiently. All pairs of distinct strings a and b can be generated using

def pairs(l):
    # Yield every unordered pair of distinct items exactly once.
    for i, a in enumerate(l):
        for j in range(i + 1, len(l)):
            yield (a, l[j])

You then filter these pairs to keep those with a similarity ratio >= .9:

similar = ((a, b) for a, b in pairs(l)
           if difflib.SequenceMatcher(None, a, b).ratio() >= .9)

Then union those pairs in a disjoint-set forest. After that, loop over the sets to get their representatives.
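The union and representative steps could be sketched roughly as follows. This assumes the names are unique strings, as in the question; cluster_representatives and cutoff are hypothetical names chosen for illustration, and the forest is a plain dict with path halving rather than any library data structure.

```python
import difflib


def pairs(l):
    # Yield every unordered pair of distinct items exactly once.
    for i, a in enumerate(l):
        for j in range(i + 1, len(l)):
            yield (a, l[j])


def cluster_representatives(names, cutoff=0.9):
    """Return one name per cluster of similar names (illustrative helper)."""
    # Disjoint-set forest: every name starts as the root of its own set.
    parent = {name: name for name in names}

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union the sets of every pair that is similar enough.
    for a, b in pairs(names):
        if difflib.SequenceMatcher(None, a, b).ratio() >= cutoff:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Keep the first name encountered from each set, in input order.
    seen_roots = set()
    representatives = []
    for name in names:
        root = find(name)
        if root not in seen_roots:
            seen_roots.add(root)
            representatives.append(name)
    return representatives


names = ["mickey mouse", "mickey m mouse", "mickey m. mouse", "donald duck"]
shortlist = cluster_representatives(names)
```

Because sets are merged transitively, two names can end up in the same cluster through an intermediate name even when their direct ratio falls below the cutoff. Comparing every pair is quadratic in the number of names, so for several thousand names this may take a while.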

