24 Million entries and I need to what?

John Shaver bobjohnbob at gmail.com
Fri Dec 27 10:21:44 MST 2013


What was the point of sorting it if you're not going to use something like
a binary search?
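
A minimal sketch of that binary search, done directly on the sorted file so nothing has to be loaded into memory. It assumes one hash per line, every hash the same length, every line (including the last) ending in a single newline, and the file sorted in plain byte order (for example with LC_ALL=C sort). The function name and parameters are made up for illustration.

```python
import os

def contains_hash(path, target, width):
    """Binary search a sorted file of fixed-width lines for target.

    Assumes each record is exactly `width` ASCII characters plus "\n"
    and that the file is sorted in byte order.
    """
    record = width + 1                  # hash plus trailing newline
    key = target.encode("ascii")
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // record
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * record)        # jump straight to record number mid
            line = f.read(width)
            if line == key:
                return True
            if line < key:
                lo = mid + 1            # target sorts after this record
            else:
                hi = mid                # target sorts before this record
    return False
```

With 24 million entries that is at most about 25 seeks per lookup, instead of a scan of the whole file.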

If you're going to use files, split them by the first character of the hash,
then by the first two, then by the first three, and save them into a
corresponding tree of directories.
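
Roughly like this, as one possible reading of the layout: a hash starting with "abc" lands in root/a/ab/abc.txt. The function names, the .txt suffix and the depth of 3 are assumptions for the sake of the sketch. With hex hashes that gives 4096 bucket files, so 24 million entries come to roughly 6000 lines per file.

```python
from pathlib import Path

def bucket_path(root, digest, depth=3):
    """Map a hash to its bucket file, e.g. 'abcdef...' -> root/a/ab/abc.txt."""
    prefixes = [digest[:i] for i in range(1, depth)]   # 'a', 'ab'
    return Path(root).joinpath(*prefixes, digest[:depth] + ".txt")

def build(root, digests):
    """Append each hash to its bucket file, creating directories as needed."""
    for d in digests:
        p = bucket_path(root, d)
        p.parent.mkdir(parents=True, exist_ok=True)
        with p.open("a") as f:
            f.write(d + "\n")

def contains(root, digest):
    """Check only the one small bucket file that could hold this hash."""
    p = bucket_path(root, digest)
    if not p.exists():
        return False
    with p.open() as f:
        return any(line.rstrip("\n") == digest for line in f)
```

Reopening a file for every hash in build() is slow for a one-time load of 24 million lines; grouping the input by prefix first would cut that down, but the lookup side stays the same.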

-John


On Fri, Dec 27, 2013 at 10:06 AM, S. Dale Morrey <sdalemorrey at gmail.com> wrote:

> For some reason, god help me, but this is starting to feel like a job for
> perl unless I can find something more sensical.
> I tried to write a Java app and the only solution that didn't run out of
> memory was to search it using a scanner and go line by line.
> scanner.findWithinHorizon just puked after a few seconds.
>
> Search time for a single string near the end of the list was 65035 ms.
>
> Compare that to grep -F for the same string which seems to come in at 0.5s
> and appears to get faster the more often I grep the file (no idea why that
> would be, I'm using btrfs for my filesystem though).
>
> Still that's too dang slow.  Just 1000 hashes would take 16 minutes to
> check.  I'm expecting to generate 1000 hashes per second.
>
> I do wonder what would happen if I just "touched" a file for each entry and
> used the filesystem tools to see if a file by that name exists.
> What exactly happens if you have 24 million 0 byte files in a single
> directory on a btrfs filesystem?
>
>
>
>
> On Fri, Dec 27, 2013 at 9:22 AM, Ed Felt <edfelt at gmail.com> wrote:
>
> > MySQL in memory table with full indexes is probably about as fast as you
> > can get.
> > On Dec 27, 2013 9:16 AM, "Lonnie Olson" <lists at kittypee.com> wrote:
> >
> > > On Fri, Dec 27, 2013 at 1:59 AM, S. Dale Morrey <sdalemorrey at gmail.com> wrote:
> > > > Just wondering, what would be the fastest way to do this?
> > >
> > > grep will have to scan the entire file every time.  Not a good idea.
> > > You either need to store it all in memory, or use some kind of index.
> > >
> > > Memory options: Memcached, Redis, Custom data structure (PHP array,
> > > Ruby Hash, Python dictionary, etc)
> > > Indexed options: Postgres, MySQL, SQLite, BDB, etc.
> > >
> > > /*
> > > PLUG: http://plug.org, #utah on irc.freenode.net
> > > Unsubscribe: http://plug.org/mailman/options/plug
> > > Don't fear the penguin.
> > > */
> > >

