A friend just asked how to do city/state lookup on input strings. I've used metaphones and Levenshtein distance in the past, but that seems like overkill. Using n-grams is a nice and easy solution.
easy_install ngram
Build a file with all the city and state names, one per line, and save it as citystate.data:
Redwood City, CA
Redwood, VA
etc.
Experiment (the .2 threshold is a little lax):
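import string
import ngram

# Index every line of the data file as trigrams (N=3).
# iconv lowercases the indexed items, qconv lowercases the queries,
# so matching is case-insensitive.
cityStateParser = ngram.NGram(
    items=(line.strip() for line in open('citystate.data')),
    N=3, iconv=string.lower, qconv=string.lower, threshold=.2
)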
Example:
cityStateParser.search('redwood')
[('Redwood VA', 0.5),
('Redwood NY', 0.5),
('Redwood MS', 0.5),
('Redwood City CA', 0.36842105263157893),
...
]
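The scores in that output can be reproduced by hand, which helps when deciding on a threshold. The following is a minimal pure-Python sketch, not the ngram package's actual code. It assumes each string is lowercased, padded with two "$" characters on each side, split into trigrams, and scored as shared trigrams divided by the total trigrams in both strings minus the shared ones. The helper names ngrams and similarity are made up for illustration.

```python
from collections import Counter

def ngrams(text, n=3, pad="$"):
    """Count the n-grams of the lowercased text, padded with n-1 pad chars per side."""
    padded = pad * (n - 1) + text.lower() + pad * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def similarity(a, b, n=3):
    """Shared n-grams divided by (n-grams in a + n-grams in b - shared)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    total = sum(grams_a.values()) + sum(grams_b.values()) - shared
    return shared / total

scores = {name: similarity("redwood", name)
          for name in ("Redwood VA", "Redwood City CA")}
print(scores)
```

Under these assumptions "Redwood VA" shares 7 of 14 trigrams with the query (0.5) and "Redwood City CA" shares 7 of 19 (about 0.368), which agrees with the search results shown above. Longer entries are penalized simply because they contribute more unmatched trigrams.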
Notes: Because these are n-grams, you might get overmatching when the state shares an n-gram with the city name. For example, a search for "washington" would yield "Washington IN" with a better score than "Washington OK".
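To see the effect in numbers, and one possible way around it, here is a small sketch. It is an illustration under assumptions rather than anything the ngram package does: the scoring uses "$" padding and the shared-over-total trigram ratio, and the helper city_only_similarity is hypothetical. It assumes each entry ends with a space followed by the state code and scores the query against the city part only.

```python
from collections import Counter

def trigrams(text, n=3, pad="$"):
    # Lowercase, pad both ends, and count every run of n characters.
    padded = pad * (n - 1) + text.lower() + pad * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def similarity(a, b):
    grams_a, grams_b = trigrams(a), trigrams(b)
    shared = sum((grams_a & grams_b).values())
    return shared / (sum(grams_a.values()) + sum(grams_b.values()) - shared)

def city_only_similarity(query, entry):
    # Hypothetical helper: drop the trailing state code before scoring.
    city, _, _state = entry.rpartition(" ")
    return similarity(query, city)

full_in = similarity("washington", "Washington IN")  # 11 shared of 16
full_ok = similarity("washington", "Washington OK")  # 10 shared of 17
city_in = city_only_similarity("washington", "Washington IN")
city_ok = city_only_similarity("washington", "Washington OK")
print(full_in, full_ok, city_in, city_ok)
```

In this sketch "Washington IN" wins on the full string because both it and the query end in "n", so they share the padded trigram "n$$". Scoring the city part alone makes the two entries tie, and the state can then be compared as a separate field.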
You might also want to read "Using Superimposed Coding Of N-Gram Lists For Efficient Inexact Matching" (PDF download).
If this works for you, consider giving me a vote on StackOverflow.com