24 Million entries and I need to what?

S. Dale Morrey sdalemorrey at gmail.com
Fri Dec 27 10:06:20 MST 2013


For some reason, god help me, this is starting to feel like a job for
perl, unless I can find something more sensible.
I tried to write a Java app, and the only solution that didn't run out of
memory was to search the file with a Scanner, going line by line.
Scanner.findWithinHorizon just puked after a few seconds.

Search time for a single string near the end of the list was 65035 ms.

Compare that to grep -F for the same string, which seems to come in at 0.5 s
and appears to get faster the more often I grep the file (no idea why that
would be; I'm using btrfs for my filesystem, though).

Still, that's too dang slow.  At 0.5 s per lookup, just 1000 hashes would
take over 8 minutes to check, and I'm expecting to generate 1000 hashes per
second.
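
For comparison, here is a minimal Python sketch of the "store it all in memory" option from the quoted reply below: load every entry into a set once, then each check is a hash lookup instead of a full scan of the file. The one-entry-per-line layout and the names load_entries and check_hashes are assumptions for illustration, and the whole set has to fit in RAM, which depends on how long the entries are.

```python
import os
import tempfile


def load_entries(path):
    """Read one entry per line into a set (assumed file layout)."""
    with open(path, "r", encoding="utf-8") as fh:
        return {line.rstrip("\n") for line in fh}


def check_hashes(entries, candidates):
    """Return the candidates that are already present in the set."""
    return [c for c in candidates if c in entries]


if __name__ == "__main__":
    # Tiny stand-in for the real 24-million-line file.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("aaa111\nbbb222\nccc333\n")
    entries = load_entries(tmp.name)
    os.unlink(tmp.name)
    print(check_hashes(entries, ["bbb222", "zzz999"]))
```

The load cost is paid once at startup; after that, checking a batch of freshly generated hashes does not touch the disk at all.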

I do wonder what would happen if I just "touched" a file for each entry and
used the filesystem tools to see if a file by that name exists.
What exactly happens if you have 24 million 0 byte files in a single
directory on a btrfs filesystem?
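
An alternative to one file per entry is to let an on-disk index answer the "does this name exist" question. Below is a rough sketch using SQLite, one of the indexed options listed in the quoted reply below, through Python's standard sqlite3 module. The table name entries, the column name value, and the helper names are hypothetical; a real run would connect to a database file rather than ":memory:".

```python
import sqlite3


def build_index(conn, entries):
    """Create the table and bulk-insert entries.

    The PRIMARY KEY gives SQLite a unique index on the column,
    so lookups do not have to scan every row.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS entries (value TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO entries (value) VALUES (?)",
        ((e,) for e in entries),
    )
    conn.commit()


def contains(conn, value):
    """Return True if value is in the table (indexed point lookup)."""
    row = conn.execute(
        "SELECT 1 FROM entries WHERE value = ? LIMIT 1", (value,)
    ).fetchone()
    return row is not None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path for real data
    build_index(conn, ["aaa111", "bbb222", "ccc333"])
    print(contains(conn, "bbb222"), contains(conn, "zzz999"))
    conn.close()
```

Unlike the in-memory set, this keeps the data on disk and only pulls in the index pages a lookup needs, at the cost of a one-time import of the list.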




On Fri, Dec 27, 2013 at 9:22 AM, Ed Felt <edfelt at gmail.com> wrote:

> MySQL in memory table with full indexes is probably about as fast as you
> can get.
> On Dec 27, 2013 9:16 AM, "Lonnie Olson" <lists at kittypee.com> wrote:
>
> > On Fri, Dec 27, 2013 at 1:59 AM, S. Dale Morrey <sdalemorrey at gmail.com>
> > wrote:
> > > Just wondering, what would be the fastest way to do this?
> >
> > grep will have to scan the entire file every time.  Not a good idea.
> > You either need to store it all in memory, or use some kind of index.
> >
> > Memory options: Memcached, Redis, Custom data structure (PHP array,
> > Ruby Hash, Python dictionary, etc)
> > Indexed options: Postgres, MySQL, SQLite, BDB, etc.
> >
> > /*
> > PLUG: http://plug.org, #utah on irc.freenode.net
> > Unsubscribe: http://plug.org/mailman/options/plug
> > Don't fear the penguin.
> > */
> >
>

