Parsing City/States From User Input With Python NGram

0 comments |

A friend just asked how to do city/state lookup on input strings. I've used metaphones and Levenshtein distance in the past but that seems like over kill. Using a n-gram is a nice and easy solution

  1. easy_install ngram

  2. build file with all the city and state names one per line, place in citystate.data Redwood City, CA Redwood, VA etc

  3. Experiment ( the .2 threshold is a little lax )

import string
import ngram
cityStateParser = ngram.NGram(
  items = (line.strip() for line in open('citystate.data')) ,
  N=3, iconv=string.lower, qconv=string.lower,  threshold=.2
)

Example:

cityStateParser.search('redwood')
[('Redwood VA', 0.5),
('Redwood NY', 0.5),
('Redwood MS', 0.5),
('Redwood City CA', 0.36842105263157893),
...
]

Notes: Because these are NGrams you might get overmatch when the state is part of a ngram in the city i.e. search for "washington" would yield Washington IN with a bette score than "Washington OK"

You might also want read Using Superimposed Coding Of N-Gram Lists For Efficient Inexact Matching (PDF Download)

If this works for you, consider giving me a vote on StackOverflow.com