Current results of "dictionary word count" programs...

Jason Holt jason at
Mon Mar 13 17:19:38 MST 2006

On Mon, 13 Mar 2006, Bryan Sant wrote:

> Here are the current results for the "count the dictionary words in a
> file" programs submitted thus far.

Would you run mine, please?  To be fair, we should probably standardize on a 
word list and test document.  (And for my algorithm to perform well, we should 
definitely use a larger input text file.)

My entry is here, including source code, executable (since compiling it may be 
problematic for many people), word list and kjv10, the KJV bible from project 
Gutenberg (823,156 words):

[jason at erg] ~$ ll dwords.c
-rw-r--r--  1 jason jason 63372299 Mar 13 17:05 dwords.c

I think I generated words.i with sort -f | uniq -i /usr/share/dict/words.  It 
includes only one copy of each word (ignoring case).  My entry is also case 

The base .c file was generated with gperf, a really neat utility for 
generating perfect (and often minimal) hash functions:

[jason at erg] ~$ time nice cat words.i |gperf -s 10 --output-file dwords.c \
--ignore-case -r

real    1010m33.269s
user    987m55.696s
sys     6m25.587s

(1010 minutes is 16.8 hours).

Nobody volunteered to compile with gcc, but I found that tcc compiles it using 
around 2GB RAM.  I'd still be interested to see gcc -O4 results.

Here's the runtime on my Athlon64 3000+:

[jason at erg] ~$ time ./a.out changelog >/dev/null

real    0m0.326s
user    0m0.270s
sys     0m0.056s

[jason at erg] ~$ time ./a.out kjv10  >/dev/null

real    0m0.829s
user    0m0.747s
sys     0m0.081s

To get an idea of how much of that is overhead for the 64MB executable, I made 
a file 10x larger:

[jason at erg] ~$ cat kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 kjv10 \
kjv10 >kjv10x10

[jason at erg] ~$ time ./a.out kjv10x10  >/dev/null

real    0m3.536s
user    0m3.375s
sys     0m0.119s

And then 100x larger:

[jason at erg] ~$ cat kjv10x10 kjv10x10 kjv10x10 kjv10x10 kjv10x10 kjv10x10 
kjv10x10 kjv10x10 kjv10x10 kjv10x10 >kjv10x100

[jason at erg] ~$ ll kjv10x100
-rw-r--r--  1 jason jason 444526000 Mar 13 17:25 kjv10x100

[jason at erg] ~$ time ./a.out kjv10x100  >/dev/null

real    0m30.041s
user    0m29.617s
sys     0m0.339s

So, the algorithm still wasn't hitting its stride with the 44MB file, but it 
starts to get pretty close to linear runtime at the 440MB mark:

[jason at erg] ~$ cat kjv10x100 kjv10x100 >kjv10x200
[jason at erg] ~$ time ./a.out kjv10x200 >/dev/null

real    0m58.551s
user    0m57.645s
sys     0m0.592s

Actually, I'm not quite sure where the extra overhead is coming from, since 
runtime for small files is only about .3 seconds.  I would have expected more 
or less linear increases in runtime for any job taking more than a few 
seconds; any ideas why it seems to get faster even for very large files?


More information about the PLUG mailing list