Current results of "dictionary word count" programs...

Jonathan Ellis jonathan at carnageblender.com
Mon Mar 13 13:06:57 MST 2006


On Mon, 13 Mar 2006 12:33:48 -0700, "Bryan Sant" <bryan.sant at gmail.com>
said:
> Python 2.4.2 (bad algorithm?)
> ------
> LOC: 6
> Best Time: 31.724
> Worst Time: 32.417
> Avg. Time:  31.98
> 
> I'm still trying to get the lisp version to work (I have a load
> error).  I'd like a good PHP and Perl version as well as a better
> Python version (the python version isn't producing accurate output and
> is WAY slower than is reasonable).

Ouch.  Yeah, Tyler's Python code is pretty screwed up.  (Ab)using list
comprehensions like that means you materialize the whole data set in
memory, and using a list for lookup instead of a dict makes every
membership test a linear scan rather than a hash lookup.  Here's my
quick-and-dirty version.

Notes:
 - Accepts input on stdin if no file is specified on the command line,
 which makes testing via pipes easier.
 - For consistency, the empty string is not considered a word (even
 though it's in my dictionary).  Otherwise, there is a question in my
 mind as to whether "  foo" should be one word or three or even two.
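On that last note, here is a small sketch of why the count is ambiguous.  It is written for Python 3 rather than 2.4 so it runs on a current interpreter, and the sample string is just an illustration, not part of the script below.  It compares splitting on runs of whitespace with splitting on a single space:

```python
# Python 3 sketch; the sample string is made up for illustration.
sample = "  foo"

# With no argument, split() treats runs of whitespace as one separator
# and drops leading/trailing whitespace, so no empty strings appear.
whitespace_split = sample.split()

# With an explicit separator, every single space is a delimiter, so the
# two leading spaces produce two empty strings ahead of "foo".
single_space_split = sample.split(" ")

print(whitespace_split)    # ['foo']
print(single_space_split)  # ['', '', 'foo']
```

So "one word" corresponds to whitespace splitting, and "three" to splitting on each space and counting the empty pieces.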

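#!/usr/bin/python2.4

import sys

WORDS_FNAME = '/usr/share/dict/words'

# Map every dictionary word to a count of 0; dict lookups are hash-based.
words = dict((line.rstrip(), 0) for line in file(WORDS_FNAME))

# Read the file named on the command line if there is one, else stdin.
source = len(sys.argv) > 1 and file(sys.argv[1]) or sys.stdin
for line in source:
    # split() with no arguments splits on runs of whitespace and never
    # yields empty strings, so the empty string is never counted.
    for word in line.split():
        try:
            words[word] += 1
        except KeyError:
            # Not a dictionary word; ignore it.
            pass

# Report only the words that actually appeared in the input.
for word, count in words.iteritems():
    if count > 0:
        print word, count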
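
To put a rough number on the list-versus-dict point above, here is a minimal timing sketch.  It is Python 3, not 2.4, and the vocabulary is a made-up stand-in for a real dictionary file, so treat the figures it prints as indicative only; they will vary by machine.

```python
import timeit

# Hypothetical stand-in for a dictionary file: 20,000 made-up "words".
vocabulary = ["word%d" % i for i in range(20000)]

as_list = vocabulary                     # "in" scans element by element
as_dict = dict.fromkeys(vocabulary, 0)   # "in" is a hash lookup

probe = "word19999"  # last entry, the worst case for the list scan

# Time 200 membership tests against each container.
list_time = timeit.timeit(lambda: probe in as_list, number=200)
dict_time = timeit.timeit(lambda: probe in as_dict, number=200)

print("list: %.6fs  dict: %.6fs" % (list_time, dict_time))
```

The list cost grows with the size of the vocabulary, and it is paid once per input word, which is where a 30-second run time can come from.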
-- 
C++ is history repeated as tragedy. Java is history repeated as farce.  --Scott McKay



