24 Million entries and I need to what?

Nicholas Stewart nicholas4 at gmail.com
Fri Dec 27 12:43:24 MST 2013


Using a DB (mysql, postgres) with the hash column indexed would be a good solution.
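
Something along these lines (a rough sketch; the database name, table
name, and the 40-char digest length are assumptions, so adjust for
your setup and your hash):

  # load the 24M hashes once; the PRIMARY KEY doubles as the index
  mysql --local-infile=1 mydb -e \
    "CREATE TABLE hashes (hash CHAR(40) NOT NULL PRIMARY KEY);"
  mysql --local-infile=1 mydb -e \
    "LOAD DATA LOCAL INFILE 'hashes.txt' INTO TABLE hashes;"

  # each lookup is then a single indexed point query
  mysql mydb -e "SELECT 1 FROM hashes WHERE hash = 'abc123...';"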

If you want to use command-line tools like grep, then you could break
the hashes into 4096 files (16x16x16).  All hashes beginning with aaa
would be stored in aaa.txt, hashes beginning with aab stored in
aab.txt, and so on.  Each grep then only scans about 1/4096 of the
data, so it would be (on average) roughly 4,000 times faster.
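
A sketch of the split and a lookup, assuming one lowercase hex hash
per line in hashes.txt (gawk juggles the 4096 open output files for
you; with other awks you may need to close() each file as you go):

  # split into one file per 3-hex-digit prefix
  gawk '{ print > (substr($0, 1, 3) ".txt") }' hashes.txt

  # look up a single hash in the matching bucket
  h=5f4dcc3b5aa765d61d8327deb882cf99   # e.g. md5 of "password"
  grep -c "^$h" "${h:0:3}.txt"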

If you have 1,000+ files in a directory, you may want to break things
into subdirectories.  So for the problem at hand, all hashes
beginning with aaa would be stored in a/aa.txt, hashes beginning with
aab stored in a/ab.txt, and so on.
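
Same idea with one directory level (again just a sketch, with the
prefix lengths picked to match the example above):

  # 16 directories, up to 256 bucket files in each
  mkdir -p {0..9} {a..f}
  gawk '{ print > (substr($0,1,1) "/" substr($0,2,2) ".txt") }' hashes.txt

  # the first hex digit picks the directory, the next two the file
  grep -c "^$h" "${h:0:1}/${h:1:2}.txt"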

I have used similar setups to store 100,000,000+ keys & values.  A
database might seem more elegant/traditional, but this quick'n'dirty
approach worked really well for me.  It allowed me to use fast
command-line tools like grep/sort/uniq/wc.

/2 cents


Thank you,
Nicholas Stewart


On Fri, Dec 27, 2013 at 11:32 AM, Joshua Marsh <joshua at themarshians.com> wrote:
> I've done similar things in the past. Here is how I did it:
>
> Generate and sort the first file.
> Generate and sort the second file.
> join first second
>
> The join command will only print lines that are in both the first and
> second file. This would give you a list of possible collisions. You'd
> just need to verify that the collision is valid (i.e. the random string
> was not the actual string from the first file). If you just wanted a
> count you could:
>
> join first second | wc -l
>
>
> On Fri, Dec 27, 2013 at 10:55 AM, S. Dale Morrey <sdalemorrey at gmail.com> wrote:
>
>> I would love for you to tell me that, but I'm still trying to verify a
>> particular implementation of the algorithm.
>>
>> Your CDB file idea is a good one.  I'm going to investigate it further.
>>
>>
>>
>> On Fri, Dec 27, 2013 at 10:50 AM, Steve Meyers <steve at plug.org> wrote:
>>
>> > On 12/27/13 10:43 AM, S. Dale Morrey wrote:
>> >
>> >> Yes, that is exactly what I'm doing.  I'm checking the propensity for a
>> >> random string of characters to have a hash collision with an existing
>> >> known
>> >> set of words given an unsalted hashing algorithm.
>> >>
>> >
>> > Can I save you the trouble by telling you how unlikely it is to happen?
>> >
>> > Since your 24 million existing hashes are static, I'd look into how big a
>> > CDB file would be, and then use that to check for collisions.
>> >
>> > Steve
>> >

