24 Million entries and I need to what?

S. Dale Morrey sdalemorrey at gmail.com
Fri Dec 27 01:59:04 MST 2013


So here's the problem...

I'm exploring the strength of the SHA256 algorithm.
Specifically I'm looking for the possibility of a hash collision.

To that end I took a dictionary of common words and phrases and ran them
through the algorithm.
Now I've got a list of 24 million strings stored one per line in a flat
text file.
The file is just shy of 1GB.  Not too bad considering the dictionary I
borrowed was about 700MB.
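
Roughly speaking, a file like that can be produced with nothing fancier
than sha256sum; this is just a sketch with placeholder filenames, not
necessarily the exact loop I ran:

    # rough sketch of the hashing step; dictionary.txt and hashes.txt
    # are placeholder names
    while IFS= read -r word; do
        printf '%s' "$word" | sha256sum | cut -d' ' -f1
    done < dictionary.txt > hashes.txt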

Now I want to check for collisions in random space.  I have another process
generating other seemingly random strings, and I want to check the hashes of
those random strings against this file in the shortest time per lookup
possible.

I already used sort and now the hashes are in alphabetical order.
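
Since the file is sorted, one option might be look(1), which does a binary
search on a sorted file.  It isn't strictly POSIX, but it ships with most
systems.  Something like this, with placeholder filenames and
$candidate_hash standing in for one hash from the generator:

    # hashes.sorted.txt is the already-sorted list; look(1) binary-searches
    # a sorted file, so each probe only touches a few pages of it
    if look "$candidate_hash" hashes.sorted.txt > /dev/null; then
        printf '%s\n' "$candidate_hash" >> matches.txt
    fi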

So now I need to find a way to do the comparison as quickly as possible.
If a hash matches, I need to store the new string and its
initialization vector.

I'm thinking grep would be good for this, but it seems to take a couple of
seconds to come back when searching for a single item, and I don't see an
obvious way to have it read a whole list of search strings from stdin.
I'd like to do this with POSIX tools, but I may end up writing my own app to
slurp the file into an in-memory table of some sort.  A database is a
possibility I guess, but the query latency seems like it would be higher than
some sort of in-memory cache.  A couple of rough sketches of what I have in
mind follow below.
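
Sketched with made-up filenames (candidates.txt standing in for a batch of
hashes from the generator, matches.txt for the output):

    # Option 1: batch lookup with grep.  -F treats the candidates as fixed
    # strings, -x requires a whole-line match, -f reads the list from a file.
    grep -F -x -f candidates.txt hashes.sorted.txt > matches.txt

    # Option 2: an in-memory table with awk.  The candidate hashes go into an
    # associative array, then the big file is streamed once and any line
    # already in the array is printed.
    awk 'NR==FNR { want[$0]; next } $0 in want' candidates.txt hashes.sorted.txt > matches.txt

Either way this only tells me which hashes collided; I'd still have to map a
matching hash back to the generated string and its initialization vector on
my side.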

Just wondering, what would be the fastest way to do this?

