24 Million entries and I need to what?

Adam Stevenson adamstevenson at gmail.com
Fri Dec 27 14:01:55 MST 2013


Selecting the correct data structure is key.  The fastest is going to be a
hash table with O(1).  With 24 million 256 bit strings, your space
requirements are about 800-900 megs of memory.

Try using a
stl::hash_map<sha_bits, bool> // c++
HashMap< sha_bits, boolean> // java

If you need to compress the space down you can use a trie
http://en.wikipedia.org/wiki/Trie


If you want quick and dirty you could just use memcached and still probably
get the performance you want.  I would avoid using a database if at all
possible.

Adam






On Fri, Dec 27, 2013 at 10:55 AM, S. Dale Morrey <sdalemorrey at gmail.com>wrote:

> I would love for you to tell me that, but still I'm trying to verify a
> particular implementation of the algorithm,
>
> Your CDB file idea is a good one.  I'm going to investigate it further.
>
>
>
> On Fri, Dec 27, 2013 at 10:50 AM, Steve Meyers <steve at plug.org> wrote:
>
> > On 12/27/13 10:43 AM, S. Dale Morrey wrote:
> >
> >> Yes, that is exactly what I'm doing.  I'm checking the propensity for a
> >> random string of characters to have a hash collision with an existing
> >> known
> >> set of words given an unsalted hashing algorithm.
> >>
> >
> > Can I save you the trouble by telling you how unlikely it is to happen?
> >
> > Since your 24 million existing hashes are static, I'd look into how big a
> > CDB file would be, and then use that to check for collisions.
> >
> > Steve
> >
> >
> >
> > /*
> > PLUG: http://plug.org, #utah on irc.freenode.net
> > Unsubscribe: http://plug.org/mailman/options/plug
> > Don't fear the penguin.
> > */
> >
>
> /*
> PLUG: http://plug.org, #utah on irc.freenode.net
> Unsubscribe: http://plug.org/mailman/options/plug
> Don't fear the penguin.
> */
>


More information about the PLUG mailing list