Recently in python Category

When you build a website it often fills up with objects that use serial column types, usually auto-incrementing integers. Often you want to obscure these numbers since they may convey some business value, e.g. number of sales, user reviews, etc. A better idea is to use an actual natural key, which exposes the actual domain name of the object instead of some numeric identifier. It's not always possible to produce a natural key for every object, though, and when you can't, consider obscuring the serial id.

This doesn't secure your numbers that convey business value; it only conceals them from the casual observer. Here is an alternative that uses the bit-mixing properties of exclusive-or (XOR), some compression by conversion into "base36" (via reddit), and some bit shuffling that moves the least significant bits around to minimize the serial appearance. You should be able to adapt this code to other bit sizes and shuffling patterns with some small changes. Just note that I am using signed integers, and it is important to keep the high bit 0 to avoid negative numbers, which cannot be converted via the "base36" algorithm.

Twiddling bits in python isn't fun, so I used the excellent bitstring module:

    from bitstring import Bits, BitArray

    # set the mask to whatever you want, just keep the high bit 0 (or use bitstring's uint)
    XOR_MASK = Bits(int=0x71234567, length=32)

    # base36 the reddit way
    # happens to be easy to convert back to an int using int('foo', 36)
    # int with base conversion is case insensitive
    def to_base(q, alphabet):
        if q < 0: raise ValueError("must supply a positive integer")
        l = len(alphabet)
        converted = []
        while q != 0:
            q, r = divmod(q, l)
            converted.insert(0, alphabet[r])
        return "".join(converted) or '0'

    def to36(q):
        return to_base(q, '0123456789abcdefghijklmnopqrstuvwxyz')

    def shuffle(ba, start=(1,16,8), end=(16,32,16), reverse=False):
        """flip some bits around
        '0x10101010' -> '0x04200808'
        """
        b = BitArray(ba)
        if reverse:
            map(b.reverse, reversed(start), reversed(end))
        else:
            map(b.reverse, start, end)
        return b

    def encode(num):
        """Encodes numbers to strings

        >>> encode(1)
        've4d47'
        >>> encode(2)
        've3b6v'
        """
        return to36((shuffle(BitArray(int=num, length=32)) ^ XOR_MASK).int)

    def decode(q):
        """decodes strings to numbers (case insensitive)

        >>> decode('ve3b6v')
        2
        >>> decode('Ve3b6V')
        2
        """
        return (shuffle(BitArray(int=int(q, 36), length=32) ^ XOR_MASK, reverse=True)).int
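
A quick round-trip check, my addition rather than part of the original snippet, to convince yourself the shuffle really is reversible:

    if __name__ == '__main__':
        # decode(encode(n)) should hand back the original number
        # for any 0 <= n < 2**31 (keep that high bit 0)
        for n in (1, 2, 42, 123456, 2**30):
            assert decode(encode(n)) == n
            print n, '->', encode(n)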

Removing A Django Application Completely with South

Let's pretend that the application you want to remove contains the following model in myapp/models.py:

    class SomeModel(models.Model):
        data = models.TextField()

Create the initial migration and apply it to the database:

    ./manage.py schemamigration --initial myapp
    ./manage.py migrate

To remove the models, edit myapp/models.py and remove all the model definitions.

Create the deleting migration:

    ./manage.py schemamigration --auto myapp

Edit the newly created migration in myapp/migrations/ so that it also removes the related content types:

    from django.contrib.contenttypes.models import ContentType

    def forwards(self, orm):
        # Deleting model 'SomeModel' (South's generated delete_table call goes here)
        # then clean out the related content types:
        for content_type in ContentType.objects.filter(app_label='myapp'):
            content_type.delete()
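
For reference, the whole edited migration then looks roughly like this. The table name and the backwards() body are my sketch of South's usual generated form, not copied from a real migration:

    from south.db import db
    from south.v2 import SchemaMigration
    from django.contrib.contenttypes.models import ContentType

    class Migration(SchemaMigration):
        # (the frozen models dict is omitted for brevity)

        def forwards(self, orm):
            # Deleting model 'SomeModel'
            db.delete_table('myapp_somemodel')
            # clean out the now stale content types
            for content_type in ContentType.objects.filter(app_label='myapp'):
                content_type.delete()

        def backwards(self, orm):
            # Re-creating model 'SomeModel'
            db.create_table('myapp_somemodel', (
                ('id', self.gf('django.db.models.fields.AutoField')(primary_key=True)),
                ('data', self.gf('django.db.models.fields.TextField')()),
            ))
            db.send_create_signal('myapp', ['SomeModel'])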

Migrate the app to remove the table, then fake a migration back to zero to clean out the South migration records:

    ./manage.py migrate myapp
    ./manage.py migrate myapp zero --fake

Remove the app from INSTALLED_APPS in your settings.py and it should now be fully gone.
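
That settings change is just deleting (or commenting out) the app's entry, something like:

    INSTALLED_APPS = (
        'django.contrib.contenttypes',
        # ... the rest of your apps ...
        # 'myapp',   # <- remove this line
    )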

Django has a built-in sitemap generation framework that uses views to build a sitemap on the fly. Sometimes your dataset is too large for this to work in a web application. Here is a management command that will generate a static sitemap and index for your models. You can extend it to handle multiple models.

import os.path
from django.core.management.base import BaseCommand, CommandError
from django.contrib.sitemaps import GenericSitemap
from django.contrib.sites.models import Site
from django.template import loader 
from django.utils.encoding import smart_str

from myproject.models import MyModel

class Command(BaseCommand):
    help = """Generates the sitemaps for the site, pass in a output directory

    def handle(self, *args, **options):
        if len(args) != 1:
            raise CommandError('You need to specify an output directory')
        directory = args[0]
        if not os.path.isdir(directory):
            raise CommandError('directory %s does not exist' % directory)
        #modify to meet your needs
        sitemap = GenericSitemap({'queryset': MyModel.objects.order_by('id'), 'date_field':'modified' })
        current_site = Site.objects.get_current()

        index_files = []
        paginator = sitemap.paginator
        for page_num in range(1, paginator.num_pages+1):
            filename = 'sitemap_%s.xml' % page_num
            file_path = os.path.join(directory,filename)
            index_files.append("http://%s/%s" % (current_site.domain, filename))
            print "Generating sitemap %s" % file_path
            with open(file_path, 'w') as site_mapfile:
                site_mapfile.write(smart_str(loader.render_to_string('sitemap.xml', {'urlset': sitemap.get_urls(page_num)})))
        sitemap_index = os.path.join(directory,'sitemap_index.xml')
        with open(sitemap_index, 'w') as site_index:
            print "Generating sitemap_index.xml %s" % sitemap_index
            site_index.write(loader.render_to_string('sitemap_index.xml', {'sitemaps': index_files}))
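
To wire this up, save it as management/commands/generate_sitemaps.py inside an installed app (the module name is my choice here; whatever you call it becomes the command name) and point it at a writable directory:

    ./manage.py generate_sitemaps /path/to/output

A cron job running this periodically, plus a web server alias for the generated files, stands in nicely for the on-the-fly sitemap views.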

A friend just asked how to do city/state lookup on input strings. I've used metaphones and Levenshtein distance in the past, but that seems like overkill here. Using n-grams is a nice and easy solution:

  1. easy_install ngram

  2. Build a file with all the city and state names, one per line ("Redwood City CA", "Redwood VA", etc.); the code below assumes it's called cities.txt

  3. Experiment (the .2 threshold is a little lax)

import string
import ngram

cityStateParser = ngram.NGram(
    items=(line.strip() for line in open('cities.txt')),  # the file from step 2
    N=3, iconv=string.lower, qconv=string.lower, threshold=.2)

cityStateParser.search('Redwood')

[('Redwood VA', 0.5),
 ('Redwood NY', 0.5),
 ('Redwood MS', 0.5),
 ('Redwood City CA', 0.36842105263157893),
 ...]

Notes: Because these are n-grams you might get over-matching when the state is part of an n-gram in the city name, e.g. a search for "washington" would yield "Washington IN" with a better score than "Washington OK".

You might also want to read Using Superimposed Coding Of N-Gram Lists For Efficient Inexact Matching (PDF download).

If this works for you, consider giving me a vote.

Cascading and Coroutines


Cascading looks quite interesting. Here is a python program, built out of coroutines, that does something similar to the example in Cascading's Technical Overview; see main() in the program below.

    #!/usr/bin/env python
    # encoding: utf-8
    import sys

    def coroutine(func):
        """Starts a coroutine automatically on call (the same decorator as in
        the twitter client below); without priming, the first send() would fail."""
        def start(*args, **kwargs):
            cr = func(*args, **kwargs)
            cr.next()
            return cr
        return start

    def input(theFile, pipe):
        """pushes a file a line at a time to a coroutine pipe"""
        for line in theFile:
            pipe.send(line)
        pipe.close()  # signals the downstream stages to flush

    @coroutine
    def extract(expression, pipe, group=0):
        """extract the group from a regex"""
        import re
        r = re.compile(expression)
        try:
            while True:
                line = (yield)
                match = r.search(line)
                if match:
                    pipe.send(match.group(group))
        except GeneratorExit:
            pipe.close()

    @coroutine
    def sort(pipe):
        """sort the input on a pipe"""
        import heapq
        heap = []
        try:
            while True:
                line = (yield)
                heapq.heappush(heap, line)
        except GeneratorExit:
            # input is exhausted, drain the heap in sorted order
            while heap:
                pipe.send(heapq.heappop(heap))
            pipe.close()

    @coroutine
    def group(groupPipe, pipe):
        """sends consecutive matching lines from pipe to groupPipe"""
        cur = None
        g = None
        try:
            while True:
                line = (yield)
                if cur is None:
                    g = groupPipe(pipe)
                elif cur != line:
                    g.close()  # flush the finished group
                    g = groupPipe(pipe)
                g.send(line)
                cur = line
        except GeneratorExit:
            if g is not None:
                g.close()
            pipe.close()

    @coroutine
    def uniq(pipe):
        """implements uniq -c"""
        lines = 0
        line = None
        try:
            while True:
                line = (yield)
                lines += 1
        except GeneratorExit:
            pipe.send('%s\t%s' % (lines, line))

    @coroutine
    def output(theFile):
        while True:
            line = (yield)
            theFile.write(line + '\n')

    def main():
        # equivalent of: cut -d ' ' -f 1 | sort | uniq -c
        input(sys.stdin,
              extract(r'^([^ ]+)',
                      sort(
                          group(uniq,
                                output(sys.stdout)))))

    if __name__ == '__main__':
        main()

You can achieve the same results with the unix command line:

    cat access.log | cut -d ' ' -f 1 | sort | uniq -c

Reading about coroutines, I thought they were a natural fit for a twitter client. Here is a pretty simple version that just prints the public timeline every 60 seconds. Next up: removing the time.sleep and scheduling the followStatus function as a task so I can follow more than one stream at a time.

    #!/usr/bin/env python
    # encoding: utf-8
    import time
    import twitter

    def coroutine(func):
        """A decorator function that takes care of starting a coroutine
        automatically on call.
        """
        def start(*args, **kwargs):
            cr = func(*args, **kwargs)
            cr.next()  # prime the coroutine
            return cr
        return start

    @coroutine
    def statusPrinter():
        """Just prints twitter status messages to the screen"""
        while True:
            status = (yield)
            print status.user.name, status.created_at, status.text

    def followStatus(twitterGetter, target, timeout=60):
        """Follows a twitter status method that takes a since_id"""
        since_id = None
        while True:
            statuses = twitterGetter(since_id=since_id)
            if statuses:
                # pretty sure these are always in order, newest first
                since_id = statuses[0].id
                for status in statuses:
                    target.send(status)
            # twitter caches for 60 seconds anyway
            time.sleep(timeout)

    def main():
        api = twitter.Api()
        followStatus(api.GetPublicTimeline, statusPrinter())

    if __name__ == '__main__':
        main()
