Current results of "dictionary word count" programs...

Shane Hathaway shane at hathawaymix.org
Tue Mar 14 02:02:17 MST 2006


Jonathan Ellis wrote:
> #!/usr/bin/python2.4
> 
> import sys
> 
> WORDS_FNAME = '/usr/share/dict/words'
> words = dict((line.rstrip(), 0) for line in file(WORDS_FNAME))
> 
> source = len(sys.argv) > 1 and file(sys.argv[1]) or sys.stdin
> for line in source:
>     for word in line.strip().split():
>         if not word:
>             continue
>         try:
>             words[word] += 1
>         except KeyError:
>             pass
> 
> for word, count in words.iteritems():
>     if count > 0:
>         print word, count

Here's a version of that code that runs in about 40% less time. 
Optimizations:

- Locals are faster than globals, so I moved the code into a function.

- Since we don't report words not used in the document, and a given 
document is likely to use only a small fraction of the dictionary, it's 
better to build a new dictionary of word frequency rather than filter 
the larger dictionary later.

- There are several possible ways to copy words in a file to the keys of 
a dictionary; this way ran faster than the other ways.

- There's no need to strip() a line if all you're going to do with it is 
split() it.

- Exception handling is more expensive than conditions, so switched to 
"if word in words".

I though of another tiny optimization that shaves around .5%, but 
reduces readability: pre-fetch the .get attribute of the 'freq' 
dictionary.  I doubt it's worth the penalty.

Shane


#!/usr/bin/python2.4

import sys

def main():
     words_fname = '/usr/share/dict/words'
     if len(sys.argv) > 1:
         source = open(sys.argv[1])
     else:
         source = sys.stdin

     words = {}
     for line in open(words_fname):
         words[line.rstrip()] = 0

     freq = {}
     for line in source:
         for word in line.split():
             if word in words:
                 freq[word] = freq.get(word, 0) + 1

     for word, count in freq.iteritems():
         print word, count

main()



More information about the PLUG mailing list