24 Million entries and I need to what?
nicholas4 at gmail.com
Fri Dec 27 09:10:43 MST 2013
Good question. You could load the lot into a Ruby Hash. Hash lookups
are very fast, but you may run into limits on how large a Hash can
grow at that scale. Or, since the hashes are already sorted, you could
load them into an array and do a binary search.
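A minimal sketch of the binary-search idea using Ruby's built-in Array#bsearch (the tiny array below is a stand-in; in practice you'd load it with something like File.readlines on the sorted hash file, whose name is an assumption here):

```ruby
# Stand-in for the 24M-line sorted digest list; real code would do
# something like: sorted_hashes = File.readlines("hashes.txt", chomp: true)
sorted_hashes = %w[0a1b 3c4d 9e8f beef cafe f00d]

def collision?(sorted, candidate)
  # Array#bsearch in find-any mode: the block returns <=> results, so
  # lookups take O(log n) comparisons -- about 25 for 24 million entries.
  !sorted.bsearch { |h| candidate <=> h }.nil?
end

puts collision?(sorted_hashes, "cafe")   # true
puts collision?(sorted_hashes, "dead")   # false
```

The array must stay sorted for bsearch to be correct, which the `sort` step already guarantees.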
You could also create 16 different files based on the first character
of each hash: hash a123beef... would be stored in file a, hash
b23beef4... would be stored in file b, etc. Since each file would hold
roughly 1/16 of the entries, this would speed up your grep searches by
about a factor of 16. You could go one step further and break the
files down by the first two characters, giving 256 files: aa.txt,
ab.txt, etc.
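The two-character bucketing could be sketched like this in Ruby (the input filename and bucket naming are assumptions; the demo uses a temp directory and a toy list in place of the real 1GB file):

```ruby
require "tmpdir"

# Fan the hash list out into files named after the first two hex chars
# (aa.txt .. ff.txt), so each later search only scans ~1/256 of the data.
def shard_hashes(lines, dir)
  # Lazily open one append-mode handle per two-char prefix.
  handles = Hash.new { |h, p| h[p] = File.open(File.join(dir, "#{p}.txt"), "a") }
  lines.each { |hash| handles[hash[0, 2]].puts(hash) }
  handles.each_value(&:close)
end

# Toy demonstration; the real input would be read line by line with
# File.foreach("hashes.txt") to avoid loading 1GB into memory.
Dir.mktmpdir do |dir|
  shard_hashes(%w[a123beef b23beef4 a1ffffff], dir)
  puts File.read(File.join(dir, "a1.txt"))  # a123beef and a1ffffff
end
```

Because the input is already sorted, each bucket file comes out sorted too, so the binary-search trick still works within a bucket.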
This also seems like a good candidate for a key-value store such as
Redis or Riak.
On Fri, Dec 27, 2013 at 1:59 AM, S. Dale Morrey <sdalemorrey at gmail.com> wrote:
> So here's the problem...
> I'm exploring the strength of the SHA256 algorithm.
> Specifically I'm looking for the possibility of a hash collision.
> To that end I took a dictionary of common words and phrases and ran them
> through the algorithm.
> Now I've got a list with 24 million strings stored 1 to a line in a flat
> text file.
> The file is just shy of 1GB. Not too bad considering the dictionary I
> borrowed was about 700MB.
> Now I want to check for collisions in random space. I have another process
> generating other seemingly random strings and I want to check the hashes of
> those random strings against this file in the shortest amount of time per
> unit possible.
> I already used sort and now the hashes are in alphabetical order.
> So now I need to find a way to do the comparison as quickly as possible.
> If the string is a match I need to store the new string and its
> initialization vector.
> I'm thinking grep would be good for this, but it seems to take a couple of
> seconds to come back when searching a single item. I don't see any way to
> have it read stdin and look for a list.
> I'd like to do this with posix tools, but I'm thinking I may have to write
> my own app to slurp it up into a table of some sort. A database is a
> possibility I guess, but the latency seems like it might be higher than
> some sort of in memory caching.
> Just wondering, what would be the fastest way to do this?
> PLUG: http://plug.org, #utah on irc.freenode.net
> Unsubscribe: http://plug.org/mailman/options/plug
> Don't fear the penguin.