Parsing City/States From User Input With Python NGram


A friend just asked how to do city/state lookup on input strings. I've used metaphones and Levenshtein distance in the past, but that seems like overkill. Using n-grams is a nice and easy solution:

  1. easy_install ngram

  2. Build a file with all the city and state names, one per line (e.g. "Redwood City CA", "Redwood VA", etc.)

  3. Experiment (the .2 threshold is a little lax)

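import string
import ngram

# Index every "City ST" line from the file built in step 2.
# iconv/qconv lowercase the indexed items and the queries, so matching
# is case-insensitive; string.lower is the Python 2 spelling.
cityStateParser = ngram.NGram(
  items = (line.strip() for line in open('')),
  N=3, iconv=string.lower, qconv=string.lower, threshold=.2
)
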
[('Redwood VA', 0.5),
('Redwood NY', 0.5),
('Redwood MS', 0.5),
('Redwood City CA', 0.36842105263157893),
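
To see where those numbers come from, here is a minimal pure-Python sketch of trigram scoring. It is my own reconstruction, not the ngram package's code: it assumes each string is lowercased, padded with two "$" characters on each side, and scored as shared trigrams divided by all trigrams in either string. The function names are made up for illustration. Under those assumptions, a query of "redwood" gives 7/14 for "Redwood VA" and 7/19 for "Redwood City CA", which matches the scores listed above.

```python
from collections import Counter


def padded_ngrams(text, n=3, pad_char="$"):
    """Count the character n-grams of text, lowercased and padded on both ends."""
    padded = pad_char * (n - 1) + text.lower() + pad_char * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def similarity(query, item, n=3):
    """Shared n-gram count divided by the count of n-grams in either string."""
    q, it = padded_ngrams(query, n), padded_ngrams(item, n)
    shared = sum((q & it).values())
    total = sum(q.values()) + sum(it.values()) - shared
    return shared / total


for candidate in ("Redwood VA", "Redwood City CA"):
    print(candidate, similarity("redwood", candidate))
```

Longer candidates contribute more unshared trigrams to the denominator, which is why "Redwood City CA" ranks below the shorter "Redwood VA" for the same query.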

Notes: Because these are n-grams, you might get overmatches when the state shares an n-gram with the city, e.g. a search for "washington" would yield "Washington IN" with a better score than "Washington OK".
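
The overmatch can be reproduced with a small stand-alone sketch. It assumes trigram scoring with two "$" padding characters on each side (an assumption that happens to reproduce the scores listed earlier in this post); the helper names are hypothetical. With that padding, "washington" and "Washington IN" both end in the trigram "n$$", so the IN entry picks up one extra shared trigram that the OK entry does not.

```python
from collections import Counter


def trigram_counts(text, pad="$$"):
    """Count the trigrams of text after lowercasing and padding both ends."""
    padded = pad + text.lower() + pad
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def score(query, item):
    """Shared trigram count divided by the count of trigrams in either string."""
    q, it = trigram_counts(query), trigram_counts(item)
    shared = sum((q & it).values())
    return shared / (sum(q.values()) + sum(it.values()) - shared)


def shared_trigrams(query, item):
    """Return the set of trigrams that appear in both strings."""
    return set(trigram_counts(query) & trigram_counts(item))


for item in ("Washington IN", "Washington OK"):
    print(item, round(score("washington", item), 3))

# Trigrams shared with the IN entry but not with the OK entry.
extra = shared_trigrams("washington", "Washington IN") - shared_trigrams(
    "washington", "Washington OK"
)
print(extra)
```

Under these assumptions the IN entry scores 11/16 and the OK entry 10/17, so the state abbreviation alone decides the ranking even though the query never mentioned a state.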

You might also want to read "Using Superimposed Coding Of N-Gram Lists For Efficient Inexact Matching" (PDF download).
