Recently in python Category

A friend just asked how to do city/state lookup on input strings. I've used metaphones and Levenshtein distance in the past, but that seems like overkill. Using n-grams is a nice and easy solution:

  1. easy_install ngram

  2. Build a file with all the city and state names, one per line, and save it as citystate.data (e.g. Redwood City, CA; Redwood, VA; etc.)

  3. Experiment (the .2 threshold is a little lax):

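    import string
    import ngram

    # Index every line of citystate.data as character trigrams (N=3).
    # iconv/qconv are given string.lower (Python 2 only) so that items
    # and queries are compared case-insensitively.
    cityStateParser = ngram.NGram(
        items=(line.strip() for line in open('citystate.data')),
        N=3, iconv=string.lower, qconv=string.lower, threshold=.2
    )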

Example:

    cityStateParser.search('redwood')
    [('Redwood VA', 0.5),
     ('Redwood NY', 0.5),
     ('Redwood MS', 0.5),
     ('Redwood City CA', 0.36842105263157893),
     ...
    ]

Note: because these are n-grams, you might get an overmatch when the state shares an n-gram with the city, e.g. a search for "washington" would yield "Washington IN" with a better score than "Washington OK".
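To make the scores and the overmatch a bit more concrete, here is a minimal Python 3 sketch of trigram similarity. It assumes each string is lowercased, padded with "$" on both ends, and scored as shared trigrams divided by all distinct trigrams. The names trigrams, similarity and city_of are made up for illustration and are not part of the ngram package. Under those assumptions it reproduces the 0.5 and 0.368... scores from the example, though the library may well do things differently internally.

```python
def trigrams(text, n=3, pad="$"):
    """Return the set of lowercased character n-grams of text, padded at both ends."""
    padded = pad * (n - 1) + text.lower() + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b):
    """Shared n-grams divided by all distinct n-grams (Jaccard similarity)."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def city_of(entry):
    """Drop the trailing state token; assumes entries look like 'City Name ST'."""
    return entry.rpartition(" ")[0].rstrip(",")


# Whole-line scoring for the 'redwood' query from the example.
redwood_va = similarity("redwood", "Redwood VA")
redwood_city = similarity("redwood", "Redwood City CA")

# "Washington IN" ends in "n" just like "washington", so both contain the
# padded trigram "n$$" and the IN entry outscores the OK entry.
wash_in = similarity("washington", "Washington IN")
wash_ok = similarity("washington", "Washington OK")

# Scoring only the city part removes that bias: both entries now tie.
city_in = similarity("washington", city_of("Washington IN"))
city_ok = similarity("washington", city_of("Washington OK"))

print(redwood_va, redwood_city)
print(wash_in, wash_ok)
print(city_in, city_ok)
```

One way around the overmatch, then, is to index city names on their own and compare the state separately, at the cost of keeping a second lookup from city back to its states.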

You might also want to read Using Superimposed Coding Of N-Gram Lists For Efficient Inexact Matching (PDF).

If this works for you, consider giving me a vote on StackOverflow.com

Cascading and Coroutines

Cascading looks quite interesting. Here is a Python program that does something similar to the example in its Technical Overview; see main in the program.

    #!/usr/bin/env python
    # encoding: utf-8
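    import sys

    def coroutine(func):
        """
        decorator that starts a coroutine automatically on call
        """
        def start(*args, **kwargs):
            cr = func(*args, **kwargs)
            next(cr)
            return cr
        return start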

    def input(theFile, pipe):
        """
        pushes a file a line at a time to a coroutine pipe
        """
        for line in theFile:
            pipe.send(line)
        pipe.close()

    @coroutine
    def extract(expression, pipe, group = 0):
        """
        extract the group from a regex
        """
        import re
        r = re.compile(expression)
        while True:
            line = (yield)
            match = r.search(line)
            if match:
                pipe.send(match.group(group))

    @coroutine
    def sort(pipe):
        """
        sort the input on a pipe
        """
        import heapq
        heap = []
        try:
            while True:
                line = (yield)
                heapq.heappush(heap, line)
        except GeneratorExit:
            while heap:
                pipe.send(heapq.heappop(heap))

    @coroutine
    def group(groupPipe, pipe):
        """
        sends consecutive matching lines from pipe to groupPipe
        """
        cur = None
        g = None
        while True:
            line = (yield)
            if cur is None:
                g = groupPipe(pipe)
            elif cur != line:
                g.close()
                g = groupPipe(pipe)

            g.send(line)
            cur = line

    @coroutine
    def uniq(pipe):
        """
        implements uniq -c
        """
        lines = 0
        try:
            while True:
                line = (yield)
                lines += 1
        except GeneratorExit:
            pipe.send('%s\t%s' % (lines, line))

    @coroutine
    def output(theFile):
        while True:
            line = (yield)
            theFile.write(line + '\n')

    def main():
        input(sys.stdin,
            extract( r'^([^ ]+)',
                sort(
                    group( uniq,
                        output(sys.stdout)
                    )
                )
            )
        )

    if __name__ == '__main__':
        main()

You can achieve the same results with the Unix command line:

    cat access.log | cut -d ' ' -f 1 | sort | uniq -c
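For anyone trying this on Python 3, here is a condensed sketch of the same push pipeline. It is not the program above: it folds group and uniq into a single uniq_count stage, collects results in a list instead of writing to stdout, and has every stage close its downstream stage explicitly rather than relying on garbage collection to flush the counts. The names uniq_count, collect and run are made up for this sketch.

```python
import heapq
import re


def coroutine(func):
    """Advance a new generator to its first yield so it can accept send()."""
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start


@coroutine
def extract(expression, pipe):
    """Forward the regex match of each line; close downstream when done."""
    pattern = re.compile(expression)
    try:
        while True:
            match = pattern.search((yield))
            if match:
                pipe.send(match.group(0))
    finally:
        pipe.close()


@coroutine
def sort(pipe):
    """Buffer every line, then emit them in sorted order once upstream closes."""
    heap = []
    try:
        while True:
            heapq.heappush(heap, (yield))
    except GeneratorExit:
        while heap:
            pipe.send(heapq.heappop(heap))
        pipe.close()


@coroutine
def uniq_count(pipe):
    """Count runs of equal consecutive lines, like `uniq -c`."""
    current, count = None, 0
    try:
        while True:
            line = (yield)
            if count and line != current:
                pipe.send((count, current))
                count = 0
            current = line
            count += 1
    except GeneratorExit:
        if count:
            pipe.send((count, current))
        pipe.close()


@coroutine
def collect(results):
    """Sink that appends everything it receives to a list."""
    while True:
        results.append((yield))


def run(lines):
    """Push lines through extract | sort | uniq_count and return the counts."""
    results = []
    pipe = extract(r'^[^ ]+', sort(uniq_count(collect(results))))
    for line in lines:
        pipe.send(line)
    pipe.close()
    return results


log = ["10.0.0.2 GET /a", "10.0.0.1 GET /b", "10.0.0.2 GET /c"]
print(run(log))
```

Closing from the source end is what triggers the flush: sort only knows it has seen everything when it receives GeneratorExit, so each stage has to pass the close along.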

I was reading http://www.dabeaz.com/coroutines/ and thought coroutines were a natural fit for a Twitter client. Here is a pretty simple version that just prints the public timeline every 60 seconds. Next up: removing the time.sleep and scheduling the followStatus function as a task so I can follow more than one stream at a time.

    #!/usr/bin/env python
    # encoding: utf-8
    import time
    import twitter

    def coroutine(func):
        """
        A decorator function that takes care of starting a coroutine
        automatically on call.

        see: http://www.dabeaz.com/coroutines/
        """
        def start(*args,**kwargs):
            cr = func(*args,**kwargs)
            cr = cr if cr is not None else cr
            next(cr)
            return cr
        return start

    @coroutine
    def statusPrinter():
        """
        Just prints twitter status messages to the screen
        """
        while True:
            status = (yield)
            print('%s %s %s' % (status.id, status.user.name, status.text))

    def followStatus(twitterGetter, target, timeout = 60):
        """
        Follows a twitter status message that takes a since_id
        """
        since_id = None
        while True:
            statuses = twitterGetter(since_id=since_id)
            if statuses:
                # pretty sure these are always in order
                since_id = statuses[0].id
                for status in statuses:
                    target.send(status)
            # twitter caches for 60 seconds anyway
            time.sleep(timeout)

    def main():
        api = twitter.Api()
        followStatus(api.GetPublicTimeline, statusPrinter())

    if __name__ == '__main__':
        main()
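As a rough idea of what that next step could look like, here is a Python 3 sketch in which the follower becomes a generator that yields after each poll, and a small round-robin loop steps several followers in turn. It does not talk to Twitter: make_getter is a hypothetical stand-in for an API call that accepts since_id, statuses are plain dicts rather than python-twitter objects, and follow_status, run_round_robin and status_collector are names invented for the sketch.

```python
from collections import deque


def coroutine(func):
    """Advance a new generator to its first yield so it can accept send()."""
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        next(cr)
        return cr
    return start


@coroutine
def status_collector(seen):
    """Sink standing in for statusPrinter; records every status it receives."""
    while True:
        seen.append((yield))


def follow_status(getter, target):
    """Poll getter once per step, then hand control back to the scheduler."""
    since_id = None
    while True:
        statuses = getter(since_id=since_id)
        if statuses:
            since_id = statuses[0]["id"]  # assumes newest status comes first
            for status in statuses:
                target.send(status)
        yield  # give the other tasks a turn instead of calling time.sleep()


def run_round_robin(tasks, rounds):
    """Step each task in turn for a fixed number of rounds."""
    queue = deque(tasks)
    for _ in range(rounds * len(tasks)):
        task = queue.popleft()
        next(task)
        queue.append(task)


def make_getter(feed):
    """Hypothetical stand-in for a twitter call; feed is newest-first."""
    def getter(since_id=None):
        return [s for s in feed if since_id is None or s["id"] > since_id]
    return getter


seen = []
sink = status_collector(seen)
tasks = [
    follow_status(make_getter([{"id": 2, "text": "alice: b"},
                               {"id": 1, "text": "alice: a"}]), sink),
    follow_status(make_getter([{"id": 7, "text": "bob: x"}]), sink),
]
run_round_robin(tasks, rounds=2)
print([s["id"] for s in seen])
```

A real scheduler would still have to respect the 60 second polling interval, for example by having each task yield the time at which it next wants to run.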
 
