24 Million entries and I need to what?

Todd Millecam tyggna at gmail.com
Fri Dec 27 10:17:45 MST 2013


Most disks (7200 RPM) will start to perform slowly after about 4,000 files in a
single directory.  That's not a filesystem issue; it's a drive-speed issue.

Most C-written programs will outperform even the best-written Java
programs, so the POSIX standard utilities (sed, awk, grep) are going to be
helpful and largely up to this task if you're comfortable enough to script
with them.

I'm gonna have to go with the suggestions above: stick this in a proper
database system, or use Perl/Ruby/Python and stick it in a hash table or
Python dictionary object.

Converting a 1 GB file into a Python dictionary will likely cost about 5-10
minutes upfront, plus a fair bit of time (I'm thinking 5-10 seconds) to
execute the import statement in the interpreter.  After that, anything that
iterates over the whole thing should run in the vicinity of one second, and
there are also built-in lookup functions for Python dictionaries that are
pretty quick (they're written in C).
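
For illustration, here's a rough sketch of what I mean (this assumes one hash
per line in a file I'm calling hashes.txt; the filename, the format, and the
load_hashes helper are just placeholders, not anything from Dale's setup).  A
set gives you the same constant-time membership test as a dictionary without
needing values:

# Load one hash per line into a set; each membership test is then a single
# hash-table probe instead of a full file scan.
def load_hashes(path):
    with open(path) as f:
        return {line.strip() for line in f}

if __name__ == "__main__":
    known = load_hashes("hashes.txt")  # the upfront cost: read + hash ~24M lines

    candidate = "deadbeef..."          # placeholder for a freshly generated hash
    if candidate in known:
        print("already seen")
    else:
        print("new hash")

The one catch is RAM: all 24 million strings have to fit in memory at once.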

For most machines (we're talking less than three years old), running a search
on a hash table of 24 million entries will be pretty much RAM-bandwidth
bound, and DDR3 RAM clocked at 1800 MHz has an effective throughput of
7-9 GB/s.
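
If you want to sanity-check that on your own hardware, a quick timing loop is
enough.  This is just a sketch against a synthetic table (the sizes and key
names are arbitrary numbers I picked, not Dale's data):

import time

def time_lookups(table, probes):
    # Time raw membership tests against an in-memory set or dict.
    start = time.time()
    hits = sum(1 for p in probes if p in table)
    return hits, time.time() - start

if __name__ == "__main__":
    known = {"hash%08d" % i for i in range(1000000)}         # stand-in table
    probes = ["hash%08d" % i for i in range(0, 2000000, 2)]  # half hit, half miss

    hits, secs = time_lookups(known, probes)
    print("%d lookups in %.3fs (%.0f/sec), %d hits"
          % (len(probes), secs, len(probes) / secs, hits))

Even in pure Python I'd expect that to come in at millions of lookups per
second, which is the whole point of paying the load cost once.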




On Fri, Dec 27, 2013 at 10:06 AM, S. Dale Morrey <sdalemorrey at gmail.com> wrote:

> For some reason, god help me, but this is starting to feel like a job for
> perl unless I can find something more sensical.
> I tried to write a Java app and the only solution that didn't run out of
> memory was to search it using a scanner and go line by line.
> scanner.findWithinHorizon just puked after a few seconds.
>
> Search time for a single string near the end of the list was 65035 ms.
>
> Compare that to grep -F for the same string which seems to come in at 0.5s
> and appears to get faster the more often I grep the file (no idea why that
> would be, I'm using btrfs for my filesystem though).
>
> Still that's too dang slow.  Just 1000 hashes would take 16 minutes to
> check.  I'm expecting to generate 1000 hashes per second.
>
> I do wonder what would happen if I just "touched" a file for each entry and
> used the filesystem tools to see if a file by that name exists.
> What exactly happens if you have 24 million 0 byte files in a single
> directory on a btrfs filesystem?
>
>
>
>
> On Fri, Dec 27, 2013 at 9:22 AM, Ed Felt <edfelt at gmail.com> wrote:
>
> > MySQL in memory table with full indexes is probably about as fast as you
> > can get.
> > On Dec 27, 2013 9:16 AM, "Lonnie Olson" <lists at kittypee.com> wrote:
> >
> > > On Fri, Dec 27, 2013 at 1:59 AM, S. Dale Morrey <sdalemorrey at gmail.com> wrote:
> > > > Just wondering, what would be the fastest way to do this?
> > >
> > > grep will have to scan the entire file every time.  Not a good idea.
> > > You either need to store it all in memory, or use some kind of index.
> > >
> > > Memory options: Memcached, Redis, Custom data structure (PHP array,
> > > Ruby Hash, Python dictionary, etc)
> > > Indexed options: Postgres, MySQL, SQLite, BDB, etc.
> > >
> >
>



-- 
Todd Millecam

