24 Million entries and I need to what?

Levi Pearson levipearson at gmail.com
Sat Dec 28 04:18:54 MST 2013


On Sat, Dec 28, 2013 at 1:59 AM, S. Dale Morrey <sdalemorrey at gmail.com> wrote:
> Yep the experiment itself is a bad idea.
> Still if you had 25 million distinct entries each 32 bytes long and you
> needed the fastest lookup and comparison time possible, what would you use?

Well, that would depend on how much memory I had to spare.  If you
load them into an array, they take up ~763MB, which is not terrible.
You can sort them in place and perform binary searches, which would
take ~25 comparisons (each composed of 1-4 64-bit compares) per
search, which is also not terrible. After sorting, any collisions
would be adjacent, so you could identify them with a single walk
through the array. Is this the fastest possible? Possibly not, but
it's sure simple to implement in C and if it's fast enough, it's fast
enough. For extra speed, you could spread the lookups across multiple
threads; independent searches against a read-only sorted array are
trivially parallelizable.
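
For concreteness, here's a minimal C sketch of that approach. The key
count, the zero-filled load step, and the stand-in query value are
placeholders, not anything from the original problem:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define KEY_LEN 32                 /* each entry is a 32-byte hash */

    /* memcmp over 32 bytes; the compiler typically reduces this to a
     * handful of word-sized compares. */
    static int cmp_key(const void *a, const void *b)
    {
        return memcmp(a, b, KEY_LEN);
    }

    int main(void)
    {
        size_t nkeys = 25000000;                   /* ~25 million entries */
        uint8_t *keys = calloc(nkeys, KEY_LEN);    /* ~763 MB */
        if (!keys) { perror("calloc"); return 1; }

        /* ... fill keys[] from the real data source (placeholder;
         * left zero-filled here) ... */

        /* Sort once; after this every lookup costs ~25 comparisons. */
        qsort(keys, nkeys, KEY_LEN, cmp_key);

        /* Duplicates are now adjacent, so one linear walk finds them all. */
        size_t dups = 0;
        for (size_t i = 1; i < nkeys; i++)
            if (cmp_key(keys + (i - 1) * KEY_LEN, keys + i * KEY_LEN) == 0)
                dups++;
        printf("%zu adjacent duplicates\n", dups);

        /* Membership test for an arbitrary 32-byte value. */
        uint8_t needle[KEY_LEN] = { 0 };           /* stand-in query value */
        if (bsearch(needle, keys, nkeys, KEY_LEN, cmp_key))
            puts("needle is present");

        free(keys);
        return 0;
    }

The sort is the only expensive step; every lookup after it is the ~25
comparisons mentioned above, and the collision check is a single pass.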

The danger of using more advanced data structures is that the
bookkeeping starts to take a significant amount of memory, and then
(depending on your available memory) you could lose the significant
advantage of having all your data in RAM. 25 million elements is not
exactly a large number for a modern processor that operates at the
scale of billions of operations per second per core, and the hit you
take from paging could be significant under memory pressure.

A typical hash table in a scripting language is not going to be
especially space-efficient, nor as computationally efficient as it
could be. It might
be interesting to try a cuckoo hash, where one hash function for a
table of 2^n buckets simply takes the high-order n bits of the value
(it's already a hash, remember) and the other takes the low-order n
bits.  Cuckoo hashes can be more space-efficient, as they can store
values directly in the buckets and sustain a load factor >80%. With
n=25, you'd have a load factor of 74.5% with 1G of memory used and two
extremely fast hash functions (assuming they work adequately for this
task, which I haven't studied in depth).
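
Here's a rough single-table sketch of that idea in C, mostly to make
the two bit-selection hash functions concrete. The bucket layout, the
"used" flag (which costs an extra byte per bucket on top of the 1G
figure), and the kick limit are my assumptions, not anything settled
in this thread:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define KEY_LEN   32
    #define N_BITS    25                        /* 2^25 buckets */
    #define N_BUCKETS ((size_t)1 << N_BITS)
    #define MAX_KICKS 500                       /* give up / rehash after this */

    typedef struct { uint8_t key[KEY_LEN]; uint8_t used; } bucket;

    /* The values are already hashes, so "hashing" is just bit selection:
     * h1 takes the high-order N_BITS, h2 the low-order N_BITS. */
    static size_t h1(const uint8_t *k)
    {
        uint64_t hi;
        memcpy(&hi, k, sizeof hi);              /* first 8 bytes */
        return (size_t)(hi >> (64 - N_BITS));
    }

    static size_t h2(const uint8_t *k)
    {
        uint64_t lo;
        memcpy(&lo, k + KEY_LEN - sizeof lo, sizeof lo);  /* last 8 bytes */
        return (size_t)(lo & (N_BUCKETS - 1));
    }

    /* Standard cuckoo eviction loop: if the candidate bucket is full,
     * kick out its occupant and move it to its alternate bucket.
     * Returns 1 on success, 0 if it gives up (a real table would rehash). */
    static int cuckoo_insert(bucket *t, const uint8_t *key)
    {
        uint8_t cur[KEY_LEN], tmp[KEY_LEN];
        memcpy(cur, key, KEY_LEN);
        size_t idx = h1(cur);

        for (int kicks = 0; kicks < MAX_KICKS; kicks++) {
            bucket *b = &t[idx];
            if (!b->used) {
                memcpy(b->key, cur, KEY_LEN);
                b->used = 1;
                return 1;
            }
            /* Evict the occupant and carry it to its other possible home. */
            memcpy(tmp, b->key, KEY_LEN);
            memcpy(b->key, cur, KEY_LEN);
            memcpy(cur, tmp, KEY_LEN);
            idx = (idx == h1(cur)) ? h2(cur) : h1(cur);
        }
        return 0;
    }

    /* A lookup only ever probes two buckets. */
    static int cuckoo_lookup(const bucket *t, const uint8_t *key)
    {
        const bucket *a = &t[h1(key)], *b = &t[h2(key)];
        return (a->used && memcmp(a->key, key, KEY_LEN) == 0) ||
               (b->used && memcmp(b->key, key, KEY_LEN) == 0);
    }

    int main(void)
    {
        bucket *t = calloc(N_BUCKETS, sizeof *t);   /* ~1 GB, zeroed = empty */
        if (!t) { perror("calloc"); return 1; }

        uint8_t key[KEY_LEN] = { 0xde, 0xad, 0xbe, 0xef };  /* stand-in value */
        if (cuckoo_insert(t, key) && cuckoo_lookup(t, key))
            puts("inserted and found");

        free(t);
        return 0;
    }

Lookups touch at most two buckets, which is where cuckoo hashing gets
its speed; the cost is the eviction loop on insert and the chance of
having to rehash if that loop cycles.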

I really think the sorted array and binary search would be sufficient
for most purposes, though, and far less work to get running and
debugged.

> FYI putting 24 million distinct entries into a single directory even /tmp
> completely corrupted my file system to a point it wouldn't even boot.  I'm
> trying to figure out how that's possible.  My understanding was that /tmp
> isn't supposed to persist after a reboot, but everything was there anyways
> (near as I could tell).  Took an hour to rm -rf that directory from a
> recovery disk.

That's a bit scary, though it's certainly not a typical use case for a
file system. I would have recommended against that one, and I hope a
bit more thinking before trying it out would have led you to avoid it
as well.

        --Levi

