24 Million entries and I need to what?

Joshua Marsh joshua at themarshians.com
Fri Dec 27 11:32:40 MST 2013


I've done similar things in the past. Here is how I did it:

Generate and sort the first file.
Generate and sort the second file.
join first second

The join command will only print lines that are in both the first and
second file. This would give you a list possible collisions. You'd just
need to verify that the collision is valid (i.e. the random string was not
the actual string from the first). If you just wanted a count you could:

join first second | wc -l


On Fri, Dec 27, 2013 at 10:55 AM, S. Dale Morrey <sdalemorrey at gmail.com>wrote:

> I would love for you to tell me that, but still I'm trying to verify a
> particular implementation of the algorithm,
>
> Your CDB file idea is a good one.  I'm going to investigate it further.
>
>
>
> On Fri, Dec 27, 2013 at 10:50 AM, Steve Meyers <steve at plug.org> wrote:
>
> > On 12/27/13 10:43 AM, S. Dale Morrey wrote:
> >
> >> Yes, that is exactly what I'm doing.  I'm checking the propensity for a
> >> random string of characters to have a hash collision with an existing
> >> known
> >> set of words given an unsalted hashing algorithm.
> >>
> >
> > Can I save you the trouble by telling you how unlikely it is to happen?
> >
> > Since your 24 million existing hashes are static, I'd look into how big a
> > CDB file would be, and then use that to check for collisions.
> >
> > Steve
> >
> >
> >
> > /*
> > PLUG: http://plug.org, #utah on irc.freenode.net
> > Unsubscribe: http://plug.org/mailman/options/plug
> > Don't fear the penguin.
> > */
> >
>
> /*
> PLUG: http://plug.org, #utah on irc.freenode.net
> Unsubscribe: http://plug.org/mailman/options/plug
> Don't fear the penguin.
> */
>


More information about the PLUG mailing list