April 2011 Archives

Yes, Amazon’s recent issues were a bit of a spectacle, but a couple of posts I read went beyond the typical Nelson-esque “Heaaaah Haaaa” and looked at things from the perspective of those who design and run the client systems.

Y’all Got What You Deserve

Ted Dziuba’s post argues that relying on the cloud is destined to end in failure.

Pain is nature’s way of telling you that you have just fucked up. It’s a hint to your future self that maybe you should never do that again. Yet, you dumbasses continue to host things full-bore in Amazon.

Amazon — The Purpose of Pain [Ted Dziuba]

We’re Living Proof

Jeff Atwood presents Netflix as the counterexample. It’s VERY interesting, because Netflix deliberately runs a program (the Chaos Monkey) that goes around destroying things to make sure everything is fault-tolerant. Pretty gutsy…

Which, let’s face it, seems like insane advice at first glance. I’m not sure many companies even understand why this would be a good idea, much less have the guts to attempt it. Raise your hand if where you work, someone deployed a daemon or service that randomly kills servers and processes in your server farm.

Working With the Chaos Monkey [Coding Horror]

[Image: on_off.jpg]

As I enter the golden years/era of my software-rigging career, I find that I REALLY value having a convenient off-switch for new features.

If you’re working in a shop of moderate to large size, chances are you can’t get code changed in the production environment in under 30 minutes, or without considerable stress. This is when it pays to have a toggle switch to enable and disable your feature/site/nuclear-release-countdown.

It’s one of those things that’s easy to build but often ignored. When you have the off-switch, you take some of the stress out of production moves.

If things start going south, you just flip the switch and reassess. You don’t need to back out an entire release, hack together a hasty patch, or run around screaming for an hour and a half trying to convince people that your changes need to be backed out.
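To make the off-switch idea concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the `KillSwitch` class, the `render_page` function, and the idea of keeping the flag in a small text file are my assumptions, not any particular library's API. Your shop might keep the flag in a database row or a config service instead. The point is only that the flag is re-read at runtime, so flipping it needs no code change or release.

```python
import os
import tempfile
from pathlib import Path


class KillSwitch:
    """Hypothetical file-backed on/off switch for a feature."""

    def __init__(self, flag_path, default=False):
        self.flag_path = Path(flag_path)
        # Used when the flag file is missing or unreadable.
        # Defaulting to False keeps a new feature off until someone enables it.
        self.default = default

    def is_enabled(self):
        # Re-read the file on every call so a flip takes effect immediately,
        # without a restart or a deployment.
        try:
            value = self.flag_path.read_text().strip().lower()
        except OSError:
            return self.default
        if value in ("on", "1", "true"):
            return True
        if value in ("off", "0", "false"):
            return False
        return self.default

    def set(self, enabled):
        # Flipping the switch is just writing one word to the file.
        self.flag_path.write_text("on" if enabled else "off")


def render_page(switch):
    # The new code path is guarded; the old one stays in place as the fallback.
    if switch.is_enabled():
        return "new feature"
    return "old behavior"


if __name__ == "__main__":
    flag_file = os.path.join(tempfile.mkdtemp(), "new_feature.flag")
    switch = KillSwitch(flag_file)
    print(render_page(switch))  # no flag file yet, so the default applies
    switch.set(True)
    print(render_page(switch))  # feature switched on
    switch.set(False)
    print(render_page(switch))  # things went south, switched back off
```

The one design choice worth thinking about is the default. In this sketch a missing or garbled flag means "off", on the theory that the old behavior is the known-good one.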