24 Million entries and I need to what?

S. Dale Morrey sdalemorrey at gmail.com
Sat Dec 28 07:45:55 MST 2013


I wrote an import script to get all 24 million entries into an H2
database as well as a MySQL DB, both local, both in server mode.
Checking 1000 randomly generated hashes takes 129ms on H2.  (I couldn't
get a figure for a single lookup; it returned too quickly.)
By comparison, MySQL takes ~50ms per hash.

Can't say I'm really sure what the difference is here; I used the same SQL
statement to create both tables and the same script to insert the entries
(run time was just over an hour for insertion into each).  They're both
local and running in server mode.  I could probably make H2 faster by
running it in-memory, but having it in server mode has some serious
advantages from a scaling perspective.
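
For what it's worth, the lookup side of the test is about as simple as it
gets.  Roughly, it looks like the sketch below (the database, table, and
column names are placeholders rather than my actual schema, and it assumes
the H2 JDBC driver is on the classpath and the hash column carries a
unique index):

import java.security.SecureRandom;
import java.sql.*;

public class LookupBench {
    public static void main(String[] args) throws Exception {
        // H2 running in server mode on localhost; the database name, table
        // name, and column name are placeholders, not my actual schema.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:h2:tcp://localhost/~/hashes", "sa", "");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT 1 FROM hashes WHERE hash = ?")) {

            SecureRandom rnd = new SecureRandom();
            long start = System.nanoTime();
            for (int i = 0; i < 1000; i++) {
                // Probe with a random 32-byte value rendered as hex.
                byte[] b = new byte[32];
                rnd.nextBytes(b);
                StringBuilder sb = new StringBuilder();
                for (byte x : b) sb.append(String.format("%02x", x & 0xff));
                ps.setString(1, sb.toString());
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();  // true only if the hash is in the table
                }
            }
            System.out.printf("1000 lookups in %d ms%n",
                              (System.nanoTime() - start) / 1_000_000);
        }
    }
}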

Either way, H2 seems to be up to the task.  Any advice or caveats for me?
I've never used H2 before.


On Sat, Dec 28, 2013 at 5:46 AM, Levi Pearson <levipearson at gmail.com> wrote:

> On Sat, Dec 28, 2013 at 4:52 AM, S. Dale Morrey <sdalemorrey at gmail.com>
> wrote:
> > This won't make sense unless you have some background with distributed &
> > redundant monitoring setups, but...
> >
> > Years ago I was working for a company and I wrote an alerting system
> > that did a URL-shortening trick so alerts could be sent over SMS.
> > The server needed to be simplistic and resilient, with as few
> > entry/attack points as possible, because this was for a major
> > infrastructure system monitoring several thousand machines for dozens
> > of businesses.  Obviously the whole system was a prime target for
> > attack.  It would have been a disaster to have a SQL injection or some
> > other nonsense go down, so we ruled out a typical LAMP stack right away.
> >
> > The solution was to have Nagios write the alert acknowledgment URL
> > (shortened) to /tmp as a meta-redirect to the actual acknowledgment
> > page (which could reside at any one of a dozen or so servers).  We then
> > used a purpose-built webserver (it only responded to GET, would only
> > allow a length of x characters, etc.) to serve the page as static HTML,
> > letting the browser handle the redirect to the actual acknowledgment
> > page.  Since all alerts expire after a set time, and because this was
> > just the acknowledgment URL, not the alert itself, having them
> > disappear after a reboot was sort of a non-issue.
> >
> > We had tens of thousands of shortened URLs stored in /tmp with no
> > problems.  Obviously it wasn't millions, and they had actual data
> > instead of being the result of touch, but the concept is the same.
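> >
> > Purely to illustrate the trick (the path, file naming, and token scheme
> > here are made up, not what we actually shipped), the whole "shortener"
> > boiled down to writing a tiny static page like this:
> >
> > import java.io.IOException;
> > import java.nio.file.*;
> >
> > public class ShortRedirect {
> >     // Illustrative only: write a shortened token to /tmp as a static
> >     // meta-redirect page pointing at the real acknowledgment URL.
> >     static Path write(String token, String ackUrl) throws IOException {
> >         String html = "<html><head><meta http-equiv=\"refresh\" "
> >                     + "content=\"0; url=" + ackUrl + "\"></head></html>";
> >         return Files.write(Paths.get("/tmp", token + ".html"),
> >                            html.getBytes());
> >     }
> > }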
>
> Sounds like a wise and reasonable solution to the problem. I am not a
> huge fan of the LAMP stack, especially when security is paramount or
> resources are tight and flexibility isn't a concern.
>
> I'd definitely consider tens of thousands to be within the design scope
> of the filesystem, though pushing the fringes of it.  But consider that
> 20 million is three orders of magnitude larger.
>
> > I've mentioned before that I have major issues with the box I'm on,
> > particularly in the filesystem department.  I'm seriously of a mind
> > that the HD may be failing.  I didn't realize that there was any
> > mirroring of /tmp to the HD, and frankly it worries me that this
> > occurred.  I'm tempted to spin up a few different EC2 instances with
> > various filesystems and repeat the experiment just to see if they all
> > have the same issue.  Obviously it's not what a filesystem is intended
> > for, but one would tend to think that 24 million files would be a snap
> > for any modern filesystem to handle.
>
> I wouldn't assume that.  Based on the way that modern applications
> like web browsers store their cache files in a shallow hierarchy, I
> would guess their designs are a reaction to performance bottlenecks
> in modern filesystems.
>
> While it's possible that you tickled a hardware fault, it is also a
> real possibility that you tickled a software scaling bug by pushing
> things well beyond normal limits.
>
> > I guess I look at a filesystem differently.  I view any filesystem as
> > nothing more than a database of connected blocks which are joined to
> > represent files.  I started looking at them this way when I realized
> > that rm (or, in the case of DOS, del) doesn't remove the file itself;
> > it merely removes the entry from the file allocation
> > table/tree/whatever.  Therefore it made sense in my mind to take
> > advantage of what ought to be one of the fastest DBs around, i.e. the
> > filesystem, to solve my problems.
>
> Well, filesystems are definitely a sort of database, and modern
> filesystems are pretty fast at some things, but since millions of files
> in a single directory is not a common use case for a filesystem, I
> seriously doubt much time has been spent specifically optimizing for it
> in most filesystems.  I understand ReiserFS was supposed to do
> particularly well with many small files in comparison with ext3, but I
> haven't really kept track of what ext4 and other filesystems have done
> in that area.
>
> Just because something can be thought of as a database doesn't mean
> it's going to perform well at some arbitrary data management task,
> especially if that task is well outside of the normal operating
> parameters of the system.
>
> > Looks like I was wrong and I'm exploring the options.  Thanks for the
> > info about the Cuckoo hashes.
>
> You almost certainly don't need to implement one, but they're an
> interesting and relatively new idea.  They're somewhat related to
> Bloom filters, which have been around for a while and give a time- and
> space-efficient method of determining (with a small probability of
> false positives) set membership.  If your normal membership test is slow
> (say it requires disk access and some computation) and you expect that
> most tests will be negative, you can use a Bloom filter as a cheap
> first-stage screen, then only run your expensive test on positive
> results from the Bloom filter.  The Bloom filter lookups can be
> parallelized as well.
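>
> As a rough sketch of the screening idea (hand-rolled and untuned; a real
> implementation would size the bit array and the number of hash functions
> from the expected element count and target false-positive rate), assuming
> the keys are hash strings:
>
> import java.util.BitSet;
>
> class TinyBloom {
>     private final BitSet bits;
>     private final int sizeBits;
>     private final int hashCount;
>
>     TinyBloom(int sizeBits, int hashCount) {
>         this.bits = new BitSet(sizeBits);
>         this.sizeBits = sizeBits;
>         this.hashCount = hashCount;
>     }
>
>     // Derive the i-th probe index from two base hashes of the key.
>     private int index(String key, int i) {
>         int h1 = key.hashCode();
>         int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
>         return ((h1 + i * h2) & 0x7fffffff) % sizeBits;
>     }
>
>     void add(String key) {
>         for (int i = 0; i < hashCount; i++) bits.set(index(key, i));
>     }
>
>     // false means "definitely absent"; true means "maybe present",
>     // which is the only case that needs the expensive lookup.
>     boolean mightContain(String key) {
>         for (int i = 0; i < hashCount; i++)
>             if (!bits.get(index(key, i))) return false;
>         return true;
>     }
> }
>
> The point is just that a negative answer is definitive, so the expensive
> test (the disk or database lookup) only ever runs on the candidates that
> might actually be present.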
>
>         --Levi
>
> /*
> PLUG: http://plug.org, #utah on irc.freenode.net
> Unsubscribe: http://plug.org/mailman/options/plug
> Don't fear the penguin.
> */
>

