S. Dale Morrey sdalemorrey at
Thu Dec 12 08:03:01 MST 2013

In case any of you are considering migrating/developing new services on a
PaaS like this, I feel like I should also mention something about costs and
scalability here.

The cost to build out this system was $50,000.  That included my normal
charges for time for the architecture/design stage.
It also more than covered my costs for wrangling up and paying a handful of
moderately competent coders for the project, because frankly there was no
way I could whip this out on my own in a reasonable timeframe.

To protect the project from scope creep and ensure that it stays
operational, maintenance and change requests (mostly wishlist features) are
done on a time & materials basis.

Monthly costs are a little less than $300 for infrastructure including the
spot instances that spin up or down as needed.
I typically put in about 10 hours a month to maintain it with new system
images, backups, updates, putting out fires, etc., so that's $1,000 in labor.
Glacier costs are utterly negligible, but the S3 bill is something along
the lines of $400 (too much I/O & transfer in this app).

Nevertheless this same system could scale well into the petabyte range if
needed and likely still maintain the same availability metrics.
The website is accompanied by a response time monitor that will spin up new
instances if needed, so the only limit to scale (should the customer ever
decide to lease out the system or make it otherwise public) is funds.

Grand total: $50,000 upfront, with ongoing costs of $1,700 per month
including labor.  Can you even hire a competent sys-admin for that price?
The flip side is that this is completely custom; as with all custom
software, there is a HUGE learning curve ahead for anyone I might need to
transfer responsibility to.
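
For anyone who wants to plug in their own numbers, here is a minimal sketch
of the arithmetic behind those totals. The dollar figures are the rounded
ones quoted in this message (Glacier is left out as negligible); the names
MONTHLY_COSTS, UPFRONT, monthly_total and cumulative_cost are made up for
illustration.

```python
# Monthly figures as quoted in the message (USD), rounded.
MONTHLY_COSTS = {
    "infrastructure": 300,  # "a little less than $300", rounded up
    "labor": 1000,          # about 10 hours a month of maintenance
    "s3": 400,              # "something along the lines of $400"
}
UPFRONT = 50_000  # one-time build cost


def monthly_total(costs):
    """Sum the recurring monthly line items."""
    return sum(costs.values())


def cumulative_cost(months, upfront=UPFRONT, costs=MONTHLY_COSTS):
    """Total spend after `months` months, including the upfront build."""
    return upfront + months * monthly_total(costs)


if __name__ == "__main__":
    print(monthly_total(MONTHLY_COSTS))  # 1700
    print(cumulative_cost(12))           # 70400
```

On these figures the first year comes to $70,400 all in.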

For the most part, you can't just migrate existing systems to "the
cloud(tm)".  You really do need to think of it as a re-implementation task
and expect your costs to follow accordingly.

On Thu, Dec 12, 2013 at 6:24 AM, Nicholas Stewart <nicholas4 at> wrote:

> Thank you for providing such a detailed response!
> Thank you,
> Nicholas Stewart
> On Thu, Dec 12, 2013 at 3:53 AM, S. Dale Morrey <sdalemorrey at>
> wrote:
> > I'm not sure what news you've been reading but....
> >
> > The London Airport was shut down due to a system failure and a backup
> > system that utterly failed to do its job
> >
> >
> > BART was shut down due to a computer failure
> >
> >
> > RBS left its customers high and dry and unable to access their accounts
> >
> >
> > Another airport shut down because of a failure
> >
> >
> > A computer failure allowed bad meat to ship resulting in a recall
> >
> >
> > These are all examples of notable failures in the last month. They were
> > big enough that they made the news.  None of them were "cloud" services.
> > All of them had very significant impact.
> >
> > At least when I host my systems with a provider I don't have to worry
> > about mean time between failures and replacing systems that go bad.
> > In fact, if you have monitoring set up correctly to watch for important
> > metrics, then when something goes wrong on your system you just spin the
> > old one down and spin up a new one.
> > All of my deployments do this automatically, and I just get an email
> > when it's detected and again when it's done.
> >
> > It's not a panacea, but what I spend in hosting costs for cloud services
> > would easily be dwarfed by the costs of colo for my own boxes and the
> > time & effort spent to monitor them and replace something when it goes
> > awry.
> >
> > Your point about mysql is valid.  MySQL is not very good in a situation
> > where the storage is remote, like on an NFS or s3fs mount.  If the link
> > goes down, MySQL will never recover without a reboot.
> >
> > You're generally much better off using different technologies and
> > rethinking your application.  In general, if I come to a point in my
> > design where I'm looking at an RDS as the solution, I tend to wonder
> > where I've failed in my design.  For long term storage or anything that
> > doesn't need high availability, but needs the structure an RDS provides,
> > sometimes there is no alternative.  Usually though, there is.  In most
> > cases it's just a matter of thinking differently about your data.  If I
> > must go with an RDS, I do make sure that it has local storage as in
> > Amazon SimpleRDS.
> >
> > One final thing to note.  You should not assume that you are running on
> > anything approaching modern hardware.  An Amazon t1.micro instance has
> > about the same specs as my 3 year old cellphone.  Something approaching
> > modern specs is going to cost you about $0.35/hr vs the $0.004/hr of a
> > micro instance.
> >
> > On the whole, it's better to think of these instance things as dedicated
> > task processors rather than modern hardware that can run umpteen million
> > services.  You spin one up to run a specific task in a complicated work
> > flow.  If you do it correctly, you load balance that workflow across
> > multiple parallel instances and spin them down when the job is complete.
> >
> > For example I have a customer who is a professional
> > photographer/videographer for high end clients (models, celebs and other
> > people with more dollars than sense).
> > He needs to index and process an absolutely huge amount of photos and
> > videos; there is no way he could do this by hand.
> >
> > I built a dedicated facial recognition & tagging system running on AWS.
> > I based my design for this service on something similar I did for a
> > missing kids service in China.
> >
> > In all it's comprised of 1 web server, 1 database server, 1 facial
> > recognition engine, 1 image reprocessing engine and a whole bunch of S3
> > storage (with automated backup to Glacier). When he uploads new photos to
> > the site, the upload is sent direct to S3.  The website sends a message
> > to the facial recognition engine which then begins to process the images
> > & videos and look for who's in them.  The actual engines are kept offline
> > until needed and the control service spins up n mod 10 instances for
> > images and n for videos.  In other words, the number of instances running
> > is entirely dependent upon the number of images & videos that need to be
> > processed.  The control server will spin them back down when their
> > workload is complete.
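
The sizing rule described above might look something like the following
minimal sketch. It assumes that "n mod 10" is shorthand for roughly one
worker per ten pending images (n divided by 10, rounded up) and one worker
per pending video; that reading is an interpretation, not something stated
in the post. IMAGES_PER_INSTANCE and instances_needed are illustrative
names.

```python
import math

# Assumption: one image worker handles a batch of about ten images.
IMAGES_PER_INSTANCE = 10


def instances_needed(num_images, num_videos,
                     images_per_instance=IMAGES_PER_INSTANCE):
    """Return (image_workers, video_workers) for a pending batch."""
    if num_images < 0 or num_videos < 0:
        raise ValueError("counts must be non-negative")
    # Round up so a partial batch still gets a worker.
    image_workers = math.ceil(num_images / images_per_instance)
    # One worker per video, as described in the post.
    video_workers = num_videos
    return image_workers, video_workers
```

With nothing pending this returns (0, 0), which matches the engines being
kept offline until they are needed.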
> >
> > Once indexing is complete, a reference to the file along with the tags
> > created is batched up and sent to the image reprocessing engine.  This
> > service will embed the tags as metadata directly into the file.  The
> > tags and a file reference are then stored in the database for later
> > queries by the website.  The decision to embed the tags directly in the
> > files is actually a failsafe in case the DB becomes unrecoverable or the
> > backup is too stale.  If that happens, you can just re-index the images
> > without rerunning the workflow.
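
As a rough illustration of that failsafe, rebuilding the database index
from the tags stored in the files could be as simple as the sketch below.
It assumes some way to read the embedded tags back out of a file;
read_embedded_tags is a hypothetical callable standing in for whatever
metadata reader is actually used, and rebuild_index is likewise an
illustrative name.

```python
def rebuild_index(files, read_embedded_tags):
    """Rebuild a tag -> [file reference] index from the files themselves.

    `files` is an iterable of file references (for example S3 keys).
    `read_embedded_tags` is a hypothetical callable that returns the list
    of tags embedded in one file.
    """
    index = {}
    for ref in files:
        for tag in read_embedded_tags(ref):
            index.setdefault(tag, []).append(ref)
    return index


if __name__ == "__main__":
    # Stand-in for real files: a dict of reference -> embedded tags.
    fake_store = {"a.jpg": ["alice", "bob"], "b.jpg": ["bob"]}
    print(rebuild_index(fake_store, fake_store.get))
```

No facial recognition pass is involved here; the only cost is reading the
metadata of each file once.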
> >
> > In the time since I built this, there is now a "cloud based" workflow
> > service from Amazon targeted at exactly this sort of workflow. If he ever
> > decides to make a 2.0 version I'll be leveraging that instead.
> >
> > Back to the original point...
> >
> > In the year and a half that this has been operational, there has been no
> > downtime and no "lost" data or images.  The work load is medium to high
> > but very spiky. Averaged out, he's putting about 10GB of data into the
> > system daily.  Time is money; he doesn't want to have to deal with ANY
> > downtime.
> >
> > On the other hand, not a single one of these instances has had uptime in
> > excess of a week.  It seems like something is always going wrong, but
> > when I built it, I built it to self heal.  When I went into this I was
> > well aware of uptime & availability issues from cloud providers,
> > especially AWS.  But I try to never design any system where too many eggs
> > are in a single basket.  So I built this in a distributed workload
> > fashion with proper monitoring, alerting & repair scripts.  The DB is
> > non-responsive?  Spin it down, spin up a new one.  The webserver is down?
> > Deploy a new one and repoint the DNS to it.  File missing from S3?  Call
> > Glacier and tell it to bring it back.
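
A minimal sketch of one pass of that kind of repair script is below. It is
not the actual script from this deployment. The health checks and
replacement actions are passed in as plain callables, so is_healthy,
replace and notify are all hypothetical placeholders for real checks,
provider API calls and email alerts.

```python
def heal(components, notify=print):
    """Check each component once and replace any that fail.

    `components` maps a name to an (is_healthy, replace) pair of callables.
    `notify` is called when a failure is detected and again when the
    replacement is done. Returns the names of the replaced components.
    """
    replaced = []
    for name, (is_healthy, replace) in components.items():
        try:
            healthy = is_healthy()
        except Exception:
            # A health check that blows up counts as a failure.
            healthy = False
        if healthy:
            continue
        notify(f"{name}: failure detected, replacing")
        replace()
        notify(f"{name}: replacement complete")
        replaced.append(name)
    return replaced
```

In practice something like this would run on a schedule, and a second
watcher would check that the scheduler itself is still alive.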
> >
> > You can't just move an existing application to the "cloud" and have it do
> > anything other than just "sorta work".  If you're going to be using these
> > systems you need to know what they are, how they work and, most
> > importantly, what their failure modes are.  Every design has to be
> > undertaken with the question "what happens WHEN this part falls down?" in
> > mind.  When you build for this you need to take failure of each component
> > as an inevitability.  It's a foregone conclusion that something will fail
> > at the absolute worst moment. You need to always have something watching
> > for signs of failure and ready to replace it when it does.  Then you also
> > need to have something watching that :)
> >
> >
> > On Wed, Dec 11, 2013 at 7:03 PM, Sasha Pachev <sasha at> wrote:
> >
> >> >Not picking on you but do you honestly think someone hosting their own
> >> >server will have better uptime than using one of the current top tier
> >> >cloud providers?
> >>
> >> I do not work much with clouds, but I have had some experiences that
> >> make me wonder about the stability of the current cloud solutions:
> >>
> >> * I have seen MySQL stuck due to failed I/O several times on Amazon
> >> cloud. Never quite like that on a dedicated machine - not so
> >> spectacularly where every read() syscall would just sit there
> >> indefinitely instead of coming back with some kind of an error.
> >> * Netflix outage due to cloud failure made the news recently. I do not
> >> recall a major news item that had to do with a regular dedicated
> >> server failure. In fact, it was quite exciting - does not happen
> >> often.
> >> * I ran the Big Cottonwood Canyon Half-Marathon this year. When I got
> >> home I went to their website to check the results and got an error
> >> several times. Retried several times after giving it some time to
> >> auto-heal or have the admin take care of it. Then after some time  the
> >> site started loading, but was extremely slow. I saw the domain of the
> >> backend scripts was I realize that a poorly written PHP
> >> script combined with a poorly written MySQL query can produce some
> >> wonderful results, but you can only botch it so much while fetching
> >> only 5K records on modern hardware. I have seen horrendously
> >> inefficient code perform just fine even under load on a normal
> >> dedicated server.
> >>
> >> Now the idea of clouds is great. However, I fear that our ability to
> >> get excited about them exceeds our ability to implement them properly
> >> which is not an easy task.
> >>
> >> --
> >> Sasha Pachev
> >>
> >> Fast Running Blog.
> >>
> >> Run. Blog. Improve. Repeat.
> >>
> >> /*
> >> PLUG:, #utah on
> >> Unsubscribe:
> >> Don't fear the penguin.
> >> */
> >>
> >
