Justin Hileman justin at
Thu Dec 12 05:45:40 MST 2013

That sounds a lot like a twelve-factor app :)  

-- justin

I'm not sure what news you've been reading but....

The London Airport was shut down due to a system failure and a backup system

that utterly failed to do its job

BART was shut down due to a computer failure

RBS left its customers high and dry and unable to access their accounts

Another airport shutdown because of a failure

A computer failure allowed bad meat to ship resulting in a recall

These are all examples of notable failures in the last month. They were big

enough that they made the news.  None of them were "cloud" services.  All

of them had very significant impact.

At least when I host my systems with a provider I don't have to worry about

mean time between failures and replacing systems that go bad.

In fact if you have monitoring set up correctly to watch for important

metrics, then when something goes wrong on your system you just spin the

old one down and spin up a new one.

In fact all of my deployments do this automatically and I just get an email

when it's detected and again when it's done.
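
As a rough illustration of that detect, replace, and notify cycle, here is a
minimal Python sketch.  The function and callback names (heal, is_healthy,
spin_down, spin_up, notify) are hypothetical placeholders, not any provider's
API; a real deployment would wire them to its own monitoring and provisioning
calls.

```python
from typing import Callable


def heal(instance_id: str,
         is_healthy: Callable[[str], bool],
         spin_down: Callable[[str], None],
         spin_up: Callable[[], str],
         notify: Callable[[str], None]) -> str:
    """Return the id of the instance that should be serving after one check."""
    if is_healthy(instance_id):
        return instance_id  # nothing to do

    # First email: the failure was detected.
    notify(f"failure detected on {instance_id}; replacing it")
    spin_down(instance_id)
    new_id = spin_up()
    # Second email: the replacement is done.
    notify(f"replacement complete: {instance_id} -> {new_id}")
    return new_id
```

Run on a timer, the caller just keeps whatever id heal() returns and passes
it back in on the next check.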

It's not a panacea, but what I spend in hosting costs for cloud services

would easily be dwarfed by the costs of colo for my own boxes and the time &

effort spent to monitor them and replace something when it goes awry.

Your point about MySQL is valid.  MySQL is not very good in a situation

where the storage is remote like on an NFS or s3fs mount.  If the link goes

down, MySQL will never recover without a reboot.

You're generally much better off using different technologies and

rethinking your application.  In general if I come to a point in my design

where I'm looking at an RDS as the solution, I tend to wonder where I've

failed in my design.  For long term storage or anything that doesn't need

high availability, but needs the structure an RDS provides, sometimes there

is no alternative.  Usually though, there is.  In most cases it's just a

matter of thinking differently about your data.  If I must go with an RDS,

I do make sure that it has local storage as in Amazon SimpleRDS.

One final thing to note.  You should not assume that you are running on

anything approaching modern hardware.  An Amazon t1.micro instance has

about the same specs as my 3-year-old cellphone.  Something approaching

modern specs is going to cost you about $0.35/hr vs the $0.004/hr of a micro.


On the whole, it's better to think of these instance things as dedicated

task processors rather than modern hardware that can run umpteen million

services.  You spin one up to run a specific task in a complicated

workflow.  If you do it correctly, you load balance that workflow across

multiple parallel instances and spin them down when the job is complete.

For example I have a customer who is a professional

photographer/videographer for high end clients (models, celebs and other

people with more dollars than sense).

He needs to index and process an absolutely huge number of photos and

videos; there is no way he could do this by hand.

I built a dedicated facial recognition & tagging system running on AWS.  I

based my design for this service on something similar I did for a missing

kids service in China.

In all it's comprised of 1 web server, 1 database server, 1 facial

recognition engine, 1 image reprocessing engine and a whole bunch of S3

storage (with automated backup to glacier). When he uploads new photos to

the site, the upload is sent direct to S3.  The website sends a message to

the facial recognition engine which then begins to process the images &

videos and look for who's in them.  The actual engines are kept offline

until needed and the control service spins up n mod 10 instances for images

and n for videos.  In other words, the number of instances running is

entirely dependent upon the number of images & videos that need to be

processed.  The control server will spin them back down when their workload

is complete.
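
The scaling rule could be sketched roughly like this in Python.  The exact
formula is an assumption: this version reads the rule as one image engine
per batch of ten pending images (rounded up) and one video engine per
pending video.  The names are made up for illustration.

```python
import math
from typing import Tuple

IMAGES_PER_INSTANCE = 10  # assumed batch size per image engine


def instances_needed(pending_images: int, pending_videos: int) -> Tuple[int, int]:
    """Return (image_engines, video_engines) for the current backlog."""
    # Assumption: one engine per ten images, rounded up so no image waits.
    image_engines = math.ceil(pending_images / IMAGES_PER_INSTANCE)
    # Assumption: one engine per pending video.
    video_engines = pending_videos
    return image_engines, video_engines
```

With an empty backlog both counts are zero, which matches keeping the
engines offline until they are needed.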

Once indexing is complete, a reference to the file along with the tags

created is batched up and sent to the image reprocessing engine.  This

service will embed the tags as metadata directly into the file.  The

tags and a file reference are then stored in the database for later queries

by the website.  The decision to embed the tags directly in the files is

actually a failsafe in case the DB becomes unrecoverable or the backup is

too stale.  If that happens, you can just re-index the images without

rerunning the workflow.
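
To illustrate the failsafe, here is a minimal Python sketch of rebuilding
the tag table from tags recovered from the files themselves.  It assumes the
embedded metadata has already been read into (file reference, tags) pairs by
some other step.  The table layout, the function name rebuild_index, and the
use of SQLite are illustrative assumptions, not the actual system.

```python
import sqlite3
from typing import Iterable, List, Tuple


def rebuild_index(db: sqlite3.Connection,
                  files: Iterable[Tuple[str, List[str]]]) -> int:
    """Recreate the tag index from (file_ref, tags) pairs; return rows written."""
    db.execute("CREATE TABLE IF NOT EXISTS tags (file_ref TEXT, tag TEXT)")
    db.execute("DELETE FROM tags")  # start clean; the files are the source of truth
    rows = [(ref, tag) for ref, tags in files for tag in tags]
    db.executemany("INSERT INTO tags (file_ref, tag) VALUES (?, ?)", rows)
    db.commit()
    return len(rows)
```

No facial recognition is involved here; the expensive step already ran
once, and its output travels with each file.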

Since I built this, Amazon has released a "cloud based" workflow service

targeted at exactly this sort of workflow. If he ever

decides to make a 2.0 version I'll be leveraging that instead.

Back to the original point...

In the year and a half that this has been operational, there has been no

downtime and no "lost" data or images.  The workload is medium to high but

very spiky. Averaged out he's putting about 10GB of data into the system

daily.  Time is money; he doesn't want to have to deal with ANY downtime.

On the other hand not a single one of these instances has had uptime in excess

of a week.  It seems like something is always going wrong, but when I built

it, I built it to self heal.  When I went into this I was well aware of

uptime & availability issues from cloud providers, especially AWS.  But I

try to never design any system where too many eggs are in a single

basket.  So I built this in a distributed workload fashion with proper

monitoring, alerting & repair scripts.  The DB is non-responsive?  Spin it

down, spin up a new one.  The webserver is down? Deploy a new one and

repoint the DNS to it.  File missing from S3?  Call glacier and tell it to

bring it back.
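
One way to picture those repair scripts is a simple lookup from detected
failure to ordered repair steps.  This Python sketch is only an
illustration: the failure names, the step descriptions, and the fallback of
alerting a human for unknown failures are assumptions, not the actual
scripts.

```python
from typing import Dict, List

# Hypothetical failure names mapped to the repair steps described above.
REPAIRS: Dict[str, List[str]] = {
    "db_unresponsive": ["spin down database", "spin up new database"],
    "web_down": ["deploy new web server", "repoint DNS"],
    "s3_file_missing": ["restore file from Glacier"],
}


def plan_repair(failure: str) -> List[str]:
    """Return the ordered repair steps for a failure (a copy, safe to modify)."""
    # Assumption: anything without a scripted repair goes to a person.
    return list(REPAIRS.get(failure, ["alert a human"]))
```

Keeping the plan as data makes it easy to see at a glance which failure
modes have an automated answer and which do not.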

You can't just move an existing application to the "cloud" and have it do

anything other than just "sorta work".  If you're going to be using these

systems you need to know what they are, how they work and most importantly

what their failure modes are.  Every design has to be undertaken with the

question "what happens WHEN this part falls down?" in mind.  When you

build for this you need to take failure of each component as an

inevitability.  It's a foregone

conclusion that something will fail at the absolute worst moment. You need

to always have something watching for signs of failure and ready to replace

it when it does.  Then you also need to have something watching that :)
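
The "something watching that" part can be as simple as a heartbeat: the
monitor records a timestamp every time it completes a check, and a second
process raises the alarm if that timestamp gets too old.  A minimal Python
sketch, with hypothetical names and an arbitrary threshold:

```python
import time
from typing import Optional


class Heartbeat:
    """Tracks when the monitor last reported in."""

    def __init__(self, max_silence: float) -> None:
        self.max_silence = max_silence  # seconds of silence tolerated
        self.last_beat = time.monotonic()

    def beat(self, now: Optional[float] = None) -> None:
        """Called by the monitor after each completed check."""
        self.last_beat = time.monotonic() if now is None else now

    def is_stale(self, now: Optional[float] = None) -> bool:
        """Called by the watcher; True means the monitor itself looks dead."""
        current = time.monotonic() if now is None else now
        return current - self.last_beat > self.max_silence
```

The optional now argument is only there so the logic can be exercised
without waiting on a real clock.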

On Wed, Dec 11, 2013 at 7:03 PM, Sasha Pachev <sasha at> wrote:

> >Not picking on you but do you honestly think someone hosting their own

> >server will have better uptime than using one of the current top tier

> cloud

> >providers?


> I do not work much with clouds, but I have had some experiences that

> makes me wonder about the stability of the current cloud solutions:


> * I have seen MySQL stuck due to failed I/O several times on Amazon

> cloud. Never quite like that on a dedicated machine - not so

> spectacularly where every read() syscall would just sit there

> indefinitely instead of coming back with some kind of an error.

> * Netflix outage due to cloud failure made the news recently. I do not

> recall a major news item that had to do with a regular dedicated

> server failure. In fact, it was quite exciting - does not happen

> often.

> * I ran the Big Cottonwood Canyon Half-Marathon this year. When I got

> home I went to their website to check the results and got an error

> several times. Retried several times after giving it some time to

> auto-heal or have the admin take care of it. Then after some time  the

> site started loading, but was extremely slow. I saw the domain of the

> backend scripts was I realize that a poorly written PHP

> script combined with a poorly written MySQL query can produce some

> wonderful results, but you can only botch it so much while fetching

> only 5K records on modern hardware. I have seen horrendously

> inefficient code perform just fine even under load on a normal

> dedicated server.


> Now the idea of clouds is great. However, I fear that our ability to

> get excited about them exceeds our ability to implement them properly

> which is not an easy task.


> --

> Sasha Pachev


> Fast Running Blog.


> Run. Blog. Improve. Repeat.


> /*

> PLUG:, #utah on

> Unsubscribe:

> Don't fear the penguin.

> */


