python - Filter list to remove similar, but not identical, entries
I have a long list containing several thousand names, all unique strings, and I would like to filter them to produce a shorter list, so that if there are similar names only one is retained. For example, the original list could contain:
mickey mouse
mickey m mouse
mickey m. mouse
The new list should contain just one of them - it doesn't matter which at this moment in time. It's possible to get a similarity score using the code below (where a and b are the text being compared), so providing I pick an appropriate ratio I have a way of making an include/exclude decision.
    difflib.SequenceMatcher(None, a, b).ratio()
What I'm struggling to work out is how to populate the second list from the first one. I'm sure it's a trivial matter, but it's baffling my newbie brain.
I'd have thought something along the lines of this would have worked, but nothing ends up being populated in the second list.
    for p in ppl1:
        for pp in ppl2:
            if difflib.SequenceMatcher(None, p, pp).ratio() <= 0.9:
                ppl2.append(p)
In fact, if it did populate the list, it'd still be wrong. I guess it'd need to compare a name from the first list to all the names in the second list, keep track of the highest ratio scored, and then add it only if the highest ratio is less than the cutoff criteria.
Any guidance gratefully received!
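The approach guessed at in the question (compare each name with everything already kept, track the highest ratio, add the name only if that ratio is below the cutoff) can be sketched roughly as follows. This is a minimal sketch, not a tuned solution: the function name filter_similar, the cutoff parameter and the sample names are assumptions made for illustration. The original loop adds nothing because ppl2 starts empty, so the inner loop body never runs; here the first name is always kept because the highest ratio defaults to 0.0.

```python
import difflib

def filter_similar(names, cutoff=0.9):
    """Keep a name only if it is not too similar to any name already kept.

    `filter_similar` and `cutoff` are hypothetical names for this sketch.
    """
    kept = []
    for name in names:
        # Highest similarity between this name and anything kept so far.
        # With nothing kept yet the default of 0.0 applies, so the first
        # name is always added.
        best = max(
            (difflib.SequenceMatcher(None, name, k).ratio() for k in kept),
            default=0.0,
        )
        if best < cutoff:
            kept.append(name)
    return kept

ppl1 = ["mickey mouse", "mickey m mouse", "mickey m. mouse", "donald duck"]
ppl2 = filter_similar(ppl1)
```

Note that the result depends on the order of the input and on the cutoff, and that every name is compared against every kept name, which can get slow for several thousand entries.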
I'm going to risk never getting an accept because this may be too advanced for you, but here's the optimal solution.
What you're trying to do is a variant of agglomerative clustering. The union-find algorithm can be used to solve this efficiently. All pairs of distinct strings a and b can be generated using
    def pairs(l):
        for i, a in enumerate(l):
            for j in range(i + 1, len(l)):
                yield (a, l[j])
Then you filter these pairs, keeping only those whose similarity ratio is at least .9 (that is, the pairs that count as similar):
    similar = ((a, b) for a, b in pairs(ppl1)
               if difflib.SequenceMatcher(None, a, b).ratio() >= .9)
Then union these pairs in a disjoint-set forest. After that, loop over the sets to get their representatives.
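To make the last two steps concrete, here is a minimal sketch of the union and representative steps. The function name cluster_representatives and the cutoff parameter are assumptions for illustration, and itertools.combinations over indices stands in for the pairs helper above. The forest is a plain parent list with path halving and no union by rank, so it is a simple variant rather than the fastest possible one.

```python
import difflib
from itertools import combinations

def cluster_representatives(names, cutoff=0.9):
    """Return one name per cluster of similar names (hypothetical helper)."""
    # parent[i] is the parent of names[i] in the disjoint-set forest;
    # every name starts out as the root of its own one-element set.
    parent = list(range(len(names)))

    def find(i):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Attach the later root to the earlier one, so the root of a
            # set is always its first name in input order.
            parent[max(ri, rj)] = min(ri, rj)

    # Union every pair whose similarity ratio reaches the cutoff.
    for i, j in combinations(range(len(names)), 2):
        if difflib.SequenceMatcher(None, names[i], names[j]).ratio() >= cutoff:
            union(i, j)

    # The roots are the representatives, one per set.
    return [name for i, name in enumerate(names) if find(i) == i]

ppl1 = ["mickey mouse", "mickey m mouse", "mickey m. mouse", "donald duck"]
ppl2 = cluster_representatives(ppl1)
```

Because sets are merged transitively, two names end up in the same cluster whenever a chain of similar pairs connects them, even if the two names themselves score below the cutoff. Comparing all pairs is still quadratic in the number of names; the union-find part only makes the merging cheap.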