Current results of "dictionary word count" programs...
Shane Hathaway
shane at hathawaymix.org
Tue Mar 14 02:02:17 MST 2006
Jonathan Ellis wrote:
> #!/usr/bin/python2.4
>
> import sys
>
> WORDS_FNAME = '/usr/share/dict/words'
> words = dict((line.rstrip(), 0) for line in file(WORDS_FNAME))
>
> source = len(sys.argv) > 1 and file(sys.argv[1]) or sys.stdin
> for line in source:
> for word in line.strip().split():
> if not word:
> continue
> try:
> words[word] += 1
> except KeyError:
> pass
>
> for word, count in words.iteritems():
> if count > 0:
> print word, count
Here's a version of that code that runs in about 40% less time.
Optimizations:
- Locals are faster than globals, so I moved the code into a function.
- Since we don't report words not used in the document, and a given
document is likely to use only a small fraction of the dictionary, it's
better to build a new dictionary of word frequency rather than filter
the larger dictionary later.
- There are several possible ways to copy words in a file to the keys of
a dictionary; this way ran faster than the other ways.
- There's no need to strip() a line if all you're going to do with it is
split() it.
- Exception handling is more expensive than conditions, so switched to
"if word in words".
I though of another tiny optimization that shaves around .5%, but
reduces readability: pre-fetch the .get attribute of the 'freq'
dictionary. I doubt it's worth the penalty.
Shane
#!/usr/bin/python2.4
import sys
def main():
words_fname = '/usr/share/dict/words'
if len(sys.argv) > 1:
source = open(sys.argv[1])
else:
source = sys.stdin
words = {}
for line in open(words_fname):
words[line.rstrip()] = 0
freq = {}
for line in source:
for word in line.split():
if word in words:
freq[word] = freq.get(word, 0) + 1
for word, count in freq.iteritems():
print word, count
main()
More information about the PLUG
mailing list