Hosting

S. Dale Morrey sdalemorrey at gmail.com
Thu Dec 12 03:53:04 MST 2013


I'm not sure what news you've been reading but....

The London Airport was shut down due to a system failure and a backup system
that utterly failed to do its job
http://www.bbc.co.uk/news/uk-25281675

BART was shut down due to a computer failure
http://blogs.kqed.org/newsfix/2013/11/22/no-bart-service-this-morning-due-to-computer-glitch/

RBS left its customers high and dry, unable to access their accounts
http://www.channel4.com/news/rbs-already-under-investigation-over-computer-failures

Another airport was shut down because of a failure
http://www.keysnet.com/2013/12/04/492971/southwest.html

A computer failure allowed bad meat to ship resulting in a recall
http://www.kltv.com/story/23974187/nationwide-computer-failures-cause-millions-of-pound-of-meat-to-go-uninspected-weekly

These are all examples of notable failures in the last month. They were big
enough that they made the news.  None of them were "cloud" services.  All
of them had very significant impact.

At least when I host my systems with a provider, I don't have to worry about
mean time between failures or about replacing systems that go bad.
If you have monitoring set up correctly to watch the important metrics, then
when something goes wrong on your system you just spin the old one down and
spin up a new one.
All of my deployments do this automatically; I just get an email when the
failure is detected and another when the replacement is done.
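
To make that concrete, here's a rough sketch of that detect/replace/notify
loop.  I'm writing it against boto3 purely for illustration; the instance ID,
AMI, instance type and email address are all placeholders, not my real setup,
and your monitoring stack may already handle most of this for you.

import smtplib
from email.mime.text import MIMEText

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

WATCHED_INSTANCE = "i-0123456789abcdef0"    # placeholder
REPLACEMENT_AMI  = "ami-0123456789abcdef0"  # placeholder
ADMIN_EMAIL      = "ops@example.com"        # placeholder


def notify(subject, body):
    """Send a plain-text status email (assumes a local SMTP relay)."""
    msg = MIMEText(body)
    msg["Subject"] = subject
    msg["From"] = ADMIN_EMAIL
    msg["To"] = ADMIN_EMAIL
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


def instance_is_healthy(instance_id):
    """Treat anything other than 'ok' status checks as a failure."""
    resp = ec2.describe_instance_status(InstanceIds=[instance_id],
                                        IncludeAllInstances=True)
    statuses = resp["InstanceStatuses"]
    if not statuses:
        return False
    s = statuses[0]
    return (s["InstanceStatus"]["Status"] == "ok"
            and s["SystemStatus"]["Status"] == "ok")


def replace_instance(instance_id):
    """Spin the bad instance down and spin a fresh one up from the AMI."""
    notify("Instance failure detected", f"{instance_id} failed status checks")
    ec2.terminate_instances(InstanceIds=[instance_id])
    new = ec2.run_instances(ImageId=REPLACEMENT_AMI, InstanceType="t1.micro",
                            MinCount=1, MaxCount=1)
    new_id = new["Instances"][0]["InstanceId"]
    notify("Replacement instance running", f"{new_id} replaces {instance_id}")
    return new_id


if __name__ == "__main__":
    if not instance_is_healthy(WATCHED_INSTANCE):
        replace_instance(WATCHED_INSTANCE)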

It's not a panacea, but what I spend in hosting costs for cloud services is
easily dwarfed by what colo for my own boxes would cost, plus the time &
effort spent monitoring them and replacing something when it goes awry.

Your point about MySQL is valid.  MySQL is not very good in a situation
where the storage is remote, like an NFS or s3fs mount.  If the link goes
down, MySQL will never recover without a reboot.

You're generally much better off using different technologies and
rethinking your application.  In general, if I come to a point in my design
where I'm looking at an RDS as the solution, I tend to wonder where I've
failed in my design.  For long-term storage, or anything that doesn't need
high availability but does need the structure an RDS provides, sometimes
there is no alternative.  Usually, though, there is.  In most cases it's just
a matter of thinking differently about your data.  If I must go with an RDS,
I make sure that it has local storage, as in Amazon SimpleRDS.

One final thing to note: you should not assume that you are running on
anything approaching modern hardware.  An Amazon t1.micro instance has
about the same specs as my 3-year-old cellphone.  Something approaching
modern specs is going to cost you about $0.35/hr vs. the $0.004/hr of a
micro instance.

On the whole, it's better to think of these instances as dedicated task
processors rather than modern hardware that can run umpteen million
services.  You spin one up to run a specific task in a complicated
workflow.  If you do it correctly, you load balance that workflow across
multiple parallel instances and spin them down when the job is complete.

For example, I have a customer who is a professional
photographer/videographer for high-end clients (models, celebs and other
people with more dollars than sense).
He needs to index and process an absolutely huge number of photos and
videos; there is no way he could do this by hand.

I built a dedicated facial recognition & tagging system running on AWS.  I
based my design for this service on something similar I did for a missing
kids service in China.

In all, it consists of 1 web server, 1 database server, 1 facial
recognition engine, 1 image reprocessing engine and a whole bunch of S3
storage (with automated backup to Glacier).  When he uploads new photos to
the site, the upload is sent directly to S3.  The website sends a message to
the facial recognition engine, which then begins to process the images &
videos and look for who's in them.  The actual engines are kept offline
until needed, and the control service spins up n mod 10 instances for images
and n for videos.  In other words, the number of instances running is
entirely dependent upon the number of images & videos that need to be
processed.  The control server will spin them back down when their workload
is complete.
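
Roughly, the scale-up step of that control service looks something like the
sketch below.  This is an illustration only: the queue URL, AMI, tags and the
one-worker-per-ten-items sizing rule are placeholders I made up for the
example, not the system's actual configuration.

import math

import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

IMAGE_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/image-jobs"
WORKER_AMI = "ami-0123456789abcdef0"  # pre-baked recognition worker (placeholder)
ITEMS_PER_WORKER = 10                 # illustrative sizing rule
MAX_WORKERS = 20


def pending_items(queue_url):
    """Approximate number of images/videos waiting to be processed."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"])
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])


def running_workers():
    """Count worker instances already up (tagged role=recognition-worker)."""
    resp = ec2.describe_instances(Filters=[
        {"Name": "tag:role", "Values": ["recognition-worker"]},
        {"Name": "instance-state-name", "Values": ["pending", "running"]}])
    return sum(len(r["Instances"]) for r in resp["Reservations"])


def scale_up(queue_url):
    """Spin up enough workers for the backlog; they exit when it drains."""
    wanted = min(MAX_WORKERS,
                 math.ceil(pending_items(queue_url) / ITEMS_PER_WORKER))
    missing = wanted - running_workers()
    if missing > 0:
        ec2.run_instances(
            ImageId=WORKER_AMI, InstanceType="c1.medium",
            MinCount=missing, MaxCount=missing,
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "role", "Value": "recognition-worker"}]}])


if __name__ == "__main__":
    scale_up(IMAGE_QUEUE_URL)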

Once indexing is complete, a reference to the file, along with the tags
created, is batched up and sent to the image reprocessing engine.  This
service embeds the tags as metadata directly into the file.  The tags and a
file reference are then stored in the database for later queries by the
website.  The decision to embed the tags directly in the files is actually a
failsafe in case the DB becomes unrecoverable or the backup is too stale.
If that happens, you can just re-index the images without rerunning the
whole workflow.
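
The embed/recover round trip is simple enough to sketch.  This assumes the
exiftool CLI is available; the real reprocessing engine may use different
tooling, and the paths and names below are made up.

import json
import subprocess


def embed_tags(path, names):
    """Append each recognized name to the file's Keywords metadata."""
    args = ["exiftool", "-overwrite_original"]
    args += [f"-Keywords+={name}" for name in names]
    args.append(path)
    subprocess.run(args, check=True)


def read_tags(path):
    """Recover the embedded names, e.g. when rebuilding the database."""
    out = subprocess.run(["exiftool", "-json", "-Keywords", path],
                         check=True, capture_output=True, text=True)
    data = json.loads(out.stdout)[0]
    keywords = data.get("Keywords", [])
    # exiftool returns a bare string when there is only one keyword
    return keywords if isinstance(keywords, list) else [keywords]


if __name__ == "__main__":
    embed_tags("shoot-042/IMG_1234.jpg", ["Jane Doe", "John Smith"])
    print(read_tags("shoot-042/IMG_1234.jpg"))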

In the time since I built this, Amazon has released a "cloud-based" workflow
service targeted at exactly this sort of job.  If he ever decides to make a
2.0 version, I'll be leveraging that instead.

Back to the original point...

In the year and a half that this has been operational, there has been no
downtime and no "lost" data or images.  The workload is medium to high but
very spiky.  Averaged out, he's putting about 10GB of data into the system
daily.  Time is money; he doesn't want to deal with ANY downtime.

On the other hand, not a single one of these instances has had uptime in
excess of a week.  It seems like something is always going wrong, but when I
built it, I built it to self-heal.  When I went into this I was well aware of
the uptime & availability issues with cloud providers, especially AWS.  But I
try to never design a system where too many eggs are in a single basket.  So
I built this in a distributed-workload fashion with proper monitoring,
alerting & repair scripts.  The DB is non-responsive?  Spin it down, spin up
a new one.  The webserver is down?  Deploy a new one and repoint the DNS to
it.  File missing from S3?  Call Glacier and tell it to bring it back.
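
That last repair step, for objects the bucket's lifecycle rule has pushed out
to Glacier, looks roughly like this.  Again, the bucket and key names are
placeholders, and I'm assuming boto3 just for the example.

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", region_name="us-east-1")

BUCKET = "photo-archive-example"  # placeholder


def ensure_available(key, days=7):
    """If the object was archived to Glacier, request a temporary restore."""
    head = s3.head_object(Bucket=BUCKET, Key=key)
    if head.get("StorageClass") != "GLACIER":
        return "already available"
    if head.get("Restore", "").startswith('ongoing-request="true"'):
        return "restore already in progress"
    try:
        s3.restore_object(
            Bucket=BUCKET, Key=key,
            RestoreRequest={"Days": days,
                            "GlacierJobParameters": {"Tier": "Standard"}})
        return "restore requested"
    except ClientError as err:
        if err.response["Error"]["Code"] == "RestoreAlreadyInProgress":
            return "restore already in progress"
        raise


if __name__ == "__main__":
    print(ensure_available("shoot-042/IMG_1234.jpg"))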

You can't just move an existing application to the "cloud" and have it do
anything other than just "sorta work".  If you're going to be using these
systems you need to know what they are, how they work and, most importantly,
what their failure modes are.  Every design has to be undertaken with the
question "what happens WHEN this part falls down?" in mind.  When you build
for this, you need to treat the failure of each component as an
inevitability.  It's a foregone conclusion that something will fail at the
absolute worst moment.  You need to always have something watching for signs
of failure and ready to replace the component when it fails.  Then you also
need to have something watching that :)


On Wed, Dec 11, 2013 at 7:03 PM, Sasha Pachev <sasha at asksasha.com> wrote:

> >Not picking on you but do you honestly think someone hosting their own
> >server will have better uptime than using one of the current top tier
> cloud
> >providers?
>
> I do not work much with clouds, but I have had some experiences that
> make me wonder about the stability of the current cloud solutions:
>
> * I have seen MySQL stuck due to failed I/O several times on Amazon
> cloud. Never quite like that on a dedicated machine - not so
> spectacularly where every read() syscall would just sit there
> indefinitely instead of coming back with some kind of an error.
> * Netflix outage due to cloud failure made the news recently. I do not
> recall a major news item that had to do with a regular dedicated
> server failure. In fact, it was quite exciting - does not happen
> often.
> * I ran the Big Cottonwood Canyon Half-Marathon this year. When I got
> home I went to their website to check the results and got an error
> several times. Retried several times after giving it some time to
> auto-heal or have the admin take care of it. Then after some time  the
> site started loading, but was extremely slow. I saw the domain of the
> backend scripts was rhcloud.com. I realize that a poorly written PHP
> script combined with a poorly written MySQL query can produce some
> wonderful results, but you can only botch it so much while fetching
> only 5K records on modern hardware. I have seen horrendously
> inefficient code perform just fine even under load on a normal
> dedicated server.
>
> Now the idea of clouds is great. However, I fear that our ability to
> get excited about them exceeds our ability to implement them properly
> which is not an easy task.
>
> --
> Sasha Pachev
>
> Fast Running Blog.
> http://fastrunningblog.com
> Run. Blog. Improve. Repeat.
>
> /*
> PLUG: http://plug.org, #utah on irc.freenode.net
> Unsubscribe: http://plug.org/mailman/options/plug
> Don't fear the penguin.
> */
>

