Two Factor Auth: Allow AWS IAM users to manage their own MFA devices

02 September 2014 #

(all info and screenshots are from 09/02/2014)

In light of all the recent incidents involving attackers taking control of companies' root AWS accounts, I, like most everyone I know who manages any sort of infrastructure, have been re-auditing accounts and stepping up efforts to get everyone on our teams to turn on MFA (multi-factor authentication). With MFA enabled, a username/password combo alone is no longer enough to log in as you. An additional "factor" is required to confirm the user's identity - typically a code from a synchronized number sequence. This has been standard practice in larger companies and capital-E Enterprise for many years, and is now starting to be taken seriously by folks operating at a smaller scale and in the cloud. No one wants to be the next tragedy.

MFA (or 2-factor auth) has traditionally been embodied by RSA tokens attached to a keychain or a badge lanyard. These days, your phone can act as an adequate substitute.

Turning on MFA for your root AWS account is fairly easy:

mfa device for root acct

However, it took me an unfortunate amount of time to figure out how to allow users created as IAM accounts to manage their own MFA devices. Setting people's devices up by hand through the root account was simply not an acceptable solution. Even at our size it was going to be a major headache, especially for our remote employee.

In the end, it's all documented in AWS docs, but it's a bit buried, and multiple steps are involved. Hopefully this post saves you some time.

Just The Right Amount

The critical thing is to give everyone JUST what they need and no more. Since you've already secured your root account, you can likely curtail the breach of an IAM account reasonably quickly, but it's best if the account can wreak minimal havoc in the first place. For example, if a compromised account was able to fiddle with the credentials of other users, the exposure and cleanup effort would increase greatly.

Unfortunately, the IAM permissions policy system is rather arcane. That is an undesirable property for a security-related system to have (easy to get wrong), but alas, it's the one we've got.

IAM policies are made up of combinations of JSON blobs ("stanzas"), each containing a unique identifier, an effect (Allow or Deny), an action, and a resource to which the effect/action combo should be applied. There's a whole bunch of documentation on the subject here, so I won't spend too much time elucidating it. Let's cut straight to what we need.

MFA Device Permissions

When you create an IAM user, by default they are unable to do anything at all. When you pull up the IAM dashboard (where you have to go in order to set up your MFA device), you literally just see permissions errors everywhere:

no permissions by default

"Well, that sucks," I thought, looking over a co-worker's shoulder. Googling "allow IAM user to manage own mfa device" turns up this lovely page: Example Policies for Administering IAM Resources. Under the heading "Allow Users to Manage Their Own Virtual MFA Devices (AWS Management Console)", we find an example policy that should do the trick.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowUsersToCreateDeleteTheirOwnVirtualMFADevices",
      "Effect": "Allow",
      "Action": ["iam:*VirtualMFADevice"],
      "Resource": ["arn:aws:iam::ACCOUNT-ID-WITHOUT-HYPHENS:mfa/${aws:username}"]
    },
    {
      "Sid": "AllowUsersToEnableSyncDisableTheirOwnMFADevices",
      "Effect": "Allow",
      "Action": [
        "iam:DeactivateMFADevice",
        "iam:EnableMFADevice",
        "iam:ListMFADevices",
        "iam:ResyncMFADevice"
      ],
      "Resource": ["arn:aws:iam::ACCOUNT-ID-WITHOUT-HYPHENS:user/${aws:username}"]
    },
    {
      "Sid": "AllowUsersToListVirtualMFADevices",
      "Effect": "Allow",
      "Action": ["iam:ListVirtualMFADevices"],
      "Resource": ["arn:aws:iam::ACCOUNT-ID-WITHOUT-HYPHENS:mfa/*"]
    },
    {
      "Sid": "AllowUsersToListUsersInConsole",
      "Effect": "Allow",
      "Action": ["iam:ListUsers"],
      "Resource": ["arn:aws:iam::ACCOUNT-ID-WITHOUT-HYPHENS:user/*"]
    }
  ]
}

Since this is in no way obvious, I will also note that the account ID is found on the "Security Credentials" page of the root AWS account.
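If you manage more than one account, it may be easier to generate the policy than to hand-edit the ARNs. Here's a minimal sketch of that idea; the function name is mine, and the account ID below is a made-up placeholder:

```python
import json


def mfa_policy(account_id):
    """Render the self-service MFA policy for a given AWS account ID."""
    arn = "arn:aws:iam::%s" % account_id
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowUsersToCreateDeleteTheirOwnVirtualMFADevices",
                "Effect": "Allow",
                "Action": ["iam:*VirtualMFADevice"],
                "Resource": ["%s:mfa/${aws:username}" % arn],
            },
            {
                "Sid": "AllowUsersToEnableSyncDisableTheirOwnMFADevices",
                "Effect": "Allow",
                "Action": [
                    "iam:DeactivateMFADevice",
                    "iam:EnableMFADevice",
                    "iam:ListMFADevices",
                    "iam:ResyncMFADevice",
                ],
                "Resource": ["%s:user/${aws:username}" % arn],
            },
            {
                "Sid": "AllowUsersToListVirtualMFADevices",
                "Effect": "Allow",
                "Action": ["iam:ListVirtualMFADevices"],
                "Resource": ["%s:mfa/*" % arn],
            },
            {
                "Sid": "AllowUsersToListUsersInConsole",
                "Effect": "Allow",
                "Action": ["iam:ListUsers"],
                "Resource": ["%s:user/*" % arn],
            },
        ],
    }, indent=2)


# example: paste the output into the IAM policy editor
print(mfa_policy("123456789012"))
```

Note that `${aws:username}` is an IAM policy variable that gets resolved at evaluation time, so it must be left verbatim - only the account ID changes per account.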

aws account ids

This appears to be sufficient to let users find themselves in the "Users" menu, click the "Manage MFA Device" button, and go through the rest of the process.

test user's mfa button

Passwords etc

I also found it useful to give our users the ability to manage the rest of their own credentials. The relevant policy stanzas can be found here.

Surprisingly, the default "Password Policy" on our AWS account was set to allow passwords as short as 6 characters with no additional requirements. Even with MFA enabled, you'll want to crank that up to something quite a bit more robust.
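For reference, these are roughly the knobs you can turn. The keys follow the IAM UpdateAccountPasswordPolicy API parameter names; the specific values are just my suggestion, not an AWS recommendation:

```python
# Settings for a stricter IAM account password policy.
# Keys map onto the IAM UpdateAccountPasswordPolicy API parameters;
# the values here are my own suggestion, not an AWS default.
password_policy = {
    "MinimumPasswordLength": 14,        # the default allows as few as 6
    "RequireUppercaseCharacters": True,
    "RequireLowercaseCharacters": True,
    "RequireNumbers": True,
    "RequireSymbols": True,
    "AllowUsersToChangePassword": True, # pairs with the self-service policies above
}

for key, value in sorted(password_policy.items()):
    print("%s = %s" % (key, value))
```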

Keeping the robots at bay

One other important aspect of our setup is the fact that only humanoid users are able to manage their own credentials. We have a number of automation-related "bot" accounts whose security policies are tailored specifically to their purpose - the backup user only has access to a specific S3 bucket, the dnsupdater user only has access to a specific Route53 zone, etc. Even with this limited set of permissions, it's important to make it difficult for an attacker to gain control of these users. They do not have passwords, and they are never granted permissions to manage their own credentials. This is accomplished by attaching the policies described above to a humans group and only adding users with a verified heartbeat to that group.
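For illustration, a bot policy can be extremely narrow. This is a sketch of what the backup user's policy might look like - the bucket name is hypothetical, and your bot may need a different set of actions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowBackupBotAccessToOneBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
      "Resource": [
        "arn:aws:s3:::example-backup-bucket",
        "arn:aws:s3:::example-backup-bucket/*"
      ]
    }
  ]
}
```

Crucially, nothing in here touches `iam:*`, so even a fully compromised bot can't create credentials or escalate.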

Enforcing a Policy

We have a policy of not allowing access to any AWS resources without an MFA device enabled. However, a policy is only as good as its enforcement. A brief google didn't turn up any automated tools for the job, though I did not try very hard. I did find that the AWS CLI tool has an aws iam get-credential-report command, which returns a base64-encoded CSV file containing information about all the IAM users' credentials. One of the columns is mfa_active, so the data is all there to automatically enforce an MFA policy.

(NB: you have to run aws iam generate-credential-report beforehand. Full docs are here.)

For example, the following python snippet (available as a gist here) will parse the contents of the report and tell you who doesn't have MFA enabled. All you have to do is chmod +x the file to make it executable, then pipe the report into it like so: aws iam get-credential-report | ./scripts/parse_credential_report.py.

#!/usr/bin/env python
from sys import stdin
import json
import base64

report = json.loads(stdin.read())
table = base64.b64decode(report["Content"]).splitlines()
head = table[0].split(",")
table = table[1:]

for row in table:
    user = dict(zip(head, row.split(",")))
    # you now have a dictionary with keys like `user`, `mfa_active`,
    # and `password_last_changed`
    print "%s %s" % (user["user"], user["mfa_active"])
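From there, actually failing a build (or nagging a chat channel) when someone is out of compliance is only a few more lines. Here's a sketch that builds on the same report layout; the function name is mine, not from any tool:

```python
def users_without_mfa(head, rows):
    """Given the credential report's header fields and data rows,
    return the usernames that have MFA disabled."""
    offenders = []
    for row in rows:
        user = dict(zip(head, row.split(",")))
        # the credential report uses the strings "true"/"false"
        if user.get("mfa_active") == "false":
            offenders.append(user["user"])
    return offenders


# tiny example with fake report data
head = ["user", "mfa_active"]
rows = ["alice,true", "bob,false"]
print(users_without_mfa(head, rows))  # → ['bob']
```

Wire that into the parsing script above and exit non-zero when the list is non-empty, and you have a crude enforcement check you can run from cron or CI.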

For our current team size, growth rate, and compliance needs, this is sufficient. I did come across an example of what a fully fleshed-out tool would look like in the excellent DevOps Weekly: The Guardian's gu-who, for performing account audits on GitHub accounts.

Low-hassle HTTP metrics with Tigertonic and Go-metrics

07 February 2014 #

First things first: What the shit is tigertonic?

Tigertonic is a framework for making webservices in Go written by Richard Crowley (I have contributed a bug fix or a feature here and there). Its defining characteristic is that it allows you to translate functions which take and return specific Go types into http.Handler implementations that understand and return JSON payloads. Define your signature, pass it into the correct Tigertonic wrapper, and out comes a web service that takes in JSON, unmarshals it to the input type, passes it to your handler, then takes your handler's return value and marshals it into JSON for the response.

It's similar to JAX-RS/Jersey annotations, but with much less code, and with most of the ugly bits hidden from the framework's user.

Check out the README for more info. Richard has also written and spoken about Tigertonic on various occasions. It's all well worth reading.

Here's an example of a very simple tigertonic service:

type Book struct {
        Author, Title string
}

// this takes a Book object and returns an empty body
func PutBook(u *url.URL, h http.Header, book *Book) (status int, responseHeaders http.Header, _ interface{}, err error) { ... }

// this takes an empty body and returns a Book object
func GetBook(u *url.URL, h http.Header, _ interface{}) (status int, responseHeaders http.Header, book *Book, err error) { ... }

func main() {
        mux := tigertonic.NewTrieServeMux()
        mux.Handle("GET", "/books/{book_id}", tigertonic.Marshaled(GetBook))
        mux.Handle("PUT", "/books/{book_id}", tigertonic.Marshaled(PutBook))

        server := tigertonic.NewServer("localhost:34334", mux)
        log.Fatal(server.ListenAndServe())
}

(full code is here)

So You Want Some Metrics

At Opsmatic we strive to be a "learning organization" - we want to learn something from every release, every change, every customer interaction. An important component of that philosophy is an obsession with measuring things. Jim, our CEO, wants "If you can't measure it, don't ship it" written on his headstone when the time is right. No joke.

One of the things we wanted to measure was the number of requests served by our API. While we were at it, we thought we'd grab the timing data too for operational purposes.

go-metrics and Tigertonic

Richard is adamant about everything in Tigertonic reducing to an implementation of http.Handler, and with good reason: doing so enables the Handler that actually performs the business logic to be wrapped in any number of completely orthogonal Handlers that handle all sorts of other concerns - logging, CORS rules, authentication... and metrics! (The README lists the available handlers.) The separation of concerns afforded by this approach is truly refreshing.

Go-metrics is a library, also maintained by Richard, that provides similar capabilities to Coda Hale's great Java metrics library. It makes it very easy to time and count things, as well as to extract the data from the timers and counters.

Tigertonic comes with a few wrappers that hook up our Handlers directly to these metrics. We're going to look at a couple in particular: Timed and CountedByStatusXX. The former is a very thin wrapper around the functionality of a go-metrics Timer - it just times the request and records the reading (time.Now() is evaluated when the defer statement is reached, at the start of the request, so the deferred UpdateSince records the full handler duration):

func (t *Timer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        defer t.UpdateSince(time.Now())
        t.handler.ServeHTTP(w, r)
}

The latter is a bit more involved, but is also ultimately a thin wrapper around some go-metrics primitives; it counts the number of requests that result in a given class of response codes - 2XX, 5XX, etc. You can look at the code here.

Adding a counter is done by calling tigertonic.Counted(yourHandlerHere, ...). Since the return value is also an http.Handler, you can pass it to Tigertonic's multiplexer or really anything that operates on http.Handler - including the stdlib HTTP server.

Putting it all together

The goal at the outset was to easily capture metrics on all our endpoints. How are we doing on that?

Quite well, it turns out. All we have to do to achieve the goals is some wrapping:

func wrapHandler(name string, h http.Handler) http.Handler {
        return tigertonic.CountedByStatusXX(
                tigertonic.Timed(
                        tigertonic.ApacheLogged(h),
                        name,
                        metrics.DefaultRegistry,
                ),
                name,
                metrics.DefaultRegistry,
        )
}

Then we invoke this wrapper before registering our handlers:

mux.Handle("GET", "/books/{book_id}", wrapHandler("get-book", tigertonic.Marshaled(GetBook)))
mux.Handle("PUT", "/books/{book_id}", wrapHandler("put-book", tigertonic.Marshaled(PutBook)))

ET VOILA. We need to give our handlers names for the purposes of metrics collection, so we create a little wrapper function that takes a name and a Handler and wraps it in all the properly named metrics collectors. When we need to add more handlers, we wrap those too and the data shows up for free. In the instrumented version of the code you can see that I've also made a call to metrics.Log, which spawns a reporter goroutine off into the background, printing out the stats every 10 seconds. There are a number of more useful reporters available - for example, I've contributed a Librato reporter which posts the metrics to the Librato API.

Slightly More Advanced

The full Opsmatic version of the above code is included below for additional illustration. It is expanded to include the name of the service, some CORS defaults, and two versions of the wrap method - one that includes a call to tigertonic.Marshal and one that does not; we need the latter to accommodate a couple of endpoints we have that do not return JSON.

type OpsmaticService struct {
        serviceName    string
        allowedOrigins []string
        allowedHeaders []string
}

func NewOpsmaticService(name string, origins []string, headers []string) *OpsmaticService {
        return &OpsmaticService{name, origins, headers}
}

func NewDefaultOpsmaticService(name string) *OpsmaticService {
        return NewOpsmaticService(name, []string{"[redacted]"}, []string{"Authorization"})
}

func (self *OpsmaticService) WrapHandler(name string, h http.Handler) http.Handler {
        cors := tigertonic.NewCORSBuilder().AddAllowedOrigins(self.allowedOrigins...).AddAllowedHeaders(self.allowedHeaders...)

        return cors.Build(
                tigertonic.CountedByStatusXX(
                        tigertonic.Timed(
                                tigertonic.ApacheLogged(h),
                                fmt.Sprintf("%s-%s", self.serviceName, name),
                                metrics.DefaultRegistry,
                        ),
                        fmt.Sprintf("%s-%s", self.serviceName, name),
                        metrics.DefaultRegistry,
                ),
        )
}

func (self *OpsmaticService) MarshalAndWrapHandler(name string, f interface{}) http.Handler {
        return self.WrapHandler(name, tigertonic.Marshaled(f))
}

Conclusion

Using this little bit of boilerplate code, we can readily instrument new endpoints as they come online without cluttering the code with counters and timers. Using the aforementioned Librato reporter, we get graphs for newly deployed endpoints instantly and with zero additional wrangling. It's quite a nice setup: a fairly modest amount of code up front, and very little marginal effort per new endpoint. We hope that you enjoy it as well.

The Myth of the Uninterrupted Programmer

17 November 2013 #

This post about office noise level and distractions came through my inbox, and a particular voice in the comments section caught my eye.

"Show me an office with caves and I'll show you my resume"

Plenty of comments followed echoing this sentiment.

While I agree that stretches of concentration are important for figuring out a specific task, I think this chorus reflects a serious misunderstanding many engineers have about their value as members of an organization - one that results in a tremendous amount of waste.

Sure, constant interruptions and context switches are exhausting and difficult. I'm not suggesting that we should spend all day turning from one conversation to another. It's easy to overdo meetings and office shenanigans. However, a healthy amount of interaction and socialization has some very important benefits.

Interruptions cause you to retrace your steps - this is often good

There is a much less edifying real-life counterpoint to the widely romanticized deeply concentrated programmer. It's that of a programmer spending 4 hours trying to track down a confusing, elusive bug, only to figure it all out 5 minutes after walking away from it. I've done it, I've seen it, and I continue doing it and seeing it.

There's a very simple explanation for this phenomenon: in order to be able to reason about an algorithm, especially a complex one, we have to take a whole load of things for granted - the stack, the configuration, the interfaces on top of which we're working.

An incorrect assumption is a common source of confusion and infuriating debugging. If you're lucky, the false assumption will be illuminated by a debugger or a log line. However, the longer you've been staring at the same problem, the more likely you are to miss something much simpler. That helper function you stubbed out earlier while testing something else? Yeah, that's still there. You'll feel real dumb when you remember.

Interruptions - planned or unplanned - cause you to "resurface" and to have to re-engage the problem almost from scratch. Part of that process is rebuilding that chain of assumptions. Stepping back from a problem and seeing the bigger picture is often much more productive than spinning down in the bowels of your code.

(Here's a great talk by Joe Damato with a pretty good discussion of discovering violations of your basic assumptions.)

Re-reading your own code is the best way to write readable code

If you're writing a bunch of code in a hurry, and especially if you're doing so while fighting through bugs, you're likely leaving a disaster zone in your wake. Even if you think you're writing "clean code" and writing tests to go along with it, there are probably sections in your code that barely make any sense by the time you've gotten them to do what you want.

Pair programming is one way of solving this - your passenger will point at the screen and call you out for getting too fancy or too casual with your single-letter variables. I'm still torn on pair programming, but I do think it's a great idea to re-read your own code regularly, for reasons related to the first section.

While an interruption causing you to lose context can be annoying, the forced re-construction of context can point out flaws in your reasoning and force you to recognize sections of code that are hard to read - because you'll have trouble reading them too.

Your peer has likely seen the same problem before

We spend a lot of time talking about sharing code and know-how in the OSS community. We've also been putting lots of emphasis on DRY - "Don't Repeat Yourself." Well, it's more like DRO - "Don't repeat others." This broader message applies to your peers as well. When you're dealing with OSS code and you find a bug you can't sort out, you ask the internet and see if anyone else has had the same problem. For whatever reason, we find this easy, but we find turning to our neighbor and asking the same thing difficult - PROBABLY because we're afraid of the stigma of interrupting them. So we spin our wheels. Awesome.

Don't forget that someone in the room is very likely to have used the same software and tools you're using, seen similar problems in the same or similar systems, or, if you're really lucky, wrote the damn thing in the first place.

Interruptions often come with an opportunity to ask your colleagues - they may well be interrupted too.

Are you even solving the correct problem?

Many conversations between engineers about productivity make it sound like the goal of programming is to write as many lines of code as possible. This has been reinforced by stories of companies like Google that were "run by the engineers." I believe this has caused people to imagine the original Google employees all furiously writing code for 16 hours a day without uttering a word to each other or anyone else, inevitably producing the world's best search engine.


Photo by Paul Simpson

This is pure professional hubris. Hubris is all I hear when engineers bitch about product and project managers interrupting them with all their "process." Sure, it's easy to overdo, but it brings us back to that whole "know your business" thing.

Sure, if you sit in your little cave for 16 hours, you're going to write a whole bunch of code. But... what did you just produce? Sure, it's "correct" in the strict engineering sense of the word - the right inputs produce the right outputs, etc. But is it correct in the context of a product? Did you actually build something people will want? Does it work, as in, does it behave the way a customer would expect? Chances are it does not, because it's hard to build things for humans without talking to them.

The reality of the matter is that Google's early engineers were successful because they were good at all those other things as well, not because they ignored everything around them and ground code.

How hard are you concentrating, anyway?

You can tell engineers don't REALLY mind being interrupted by just looking at the constant shitpile of activity on HackerNews, Twitter, Google Plus, IRC, etc. It's not about interruptions. It's just flat out whining. We don't like getting out of our comfort zone and thinking about things we're not good at thinking about. Stop coming up with excuses and get better at it.

Interruptions force you to ship.

There's no disputing that interruptions and context switches are painful and difficult, but knowing that they're coming can have a positive impact - if you anticipate only having a couple of hours before you're interrupted, you will work in more incremental chunks, which lend themselves better to testing, documentation, abstraction, etc. These are all good things.

For example - there are guests coming over for dinner shortly, so I'm just going to wrap this up and post it. It's too long as is.

tl;dr

Sitting in a dark basement in silence is great for leveling up your World of Warcraft character. It's no way to build good, usable software. There's no substitute for good communication.

A reliable, simple way to get a PDF out of Showoff

17 November 2013 #

Perpetually agonized by actually using Keynote or Powerpoint to make slides, I continue to use Showoff to make my slide decks. Unfortunately, the codebase appears a bit neglected, and certain features have stopped working very well over the course of re-installs. I have neither the Ruby-fu nor the time nor the patience to figure out why PDF generation has stopped working (I actually don't think that particular feature ever worked for me at all), so I've had to resort to trickery.

I am posting this here because I keep forgetting how to do this and having to blindly figure it out each time. Hopefully my own blog will be an obvious enough place to look. This has only been tested on a Mac using Chrome, but it looks like Safari will work too, with a bit of tweaking.

  1. Add the following to a CSS file that is included in your preso:

    #preso {
        width: 11in;
        height: 8in;  /* this may need to be lowered slightly for Safari */
    }
    .slide {
        width: 11in;
        height: 8in;  /* this may need to be lowered slightly for Safari */
    }

  2. Run showoff serve from your repo
  3. Go to http://localhost:9090/singlepage (obviously the port may vary if you used -p)
  4. Use your browser's Print function to generate a PDF

DONE. Happy PDFin.

print dialog

How Do I DevOps?

11 June 2013 #

There is lots of talk about what DevOps is and means, even a Wikipedia page, to which I may soon give some much needed love. However, a friend recently asked if I knew anyone worth hiring for a "devops" role, and I found myself asking clarifying questions about the sort of person he had in mind. Seemed worth writing down.

The friend was looking for engineers. So what does it mean for an engineer to be devops-y?

TL;DR

  1. Understand the Whole Company as a System
  2. Respect Other Functions Within The Organization Profoundly
  3. Have a Strong Sense of Personal Accountability

Build your software like you give a shit about the people whose jobs and lives are affected by it.

1. Understand the Whole Company as a System

bottles!!
Photo by verifex

Your company has inputs (money, labor, etc) and outputs (product, money, etc). I've grown to loathe the phrase "above my pay grade" because it tends to betray a complete lack of interest in the big picture. Hanging around my new colleague Jim, aka Mr Manager, I've recently started to identify things as "tactical" vs "strategic." Strategic is the big picture: where is the company going; what are the company's goals; what will make or break our success. Tactical is the every day: what features are left on the current project and which one should I work on next; how much time should I spend on this bug, what with the massive deadline looming; hell, should I even be looking at bugs?

If you don't have a good grip on how you and your project fit into the bigger picture of the company, you are always tactical. Tactical can quickly become boring, repetitive, and unrewarding. It's also a nice way to never grow as an individual. In the DevOps picture, it means you probably don't make good judgment calls about what is and isn't important, so you distribute your time poorly. Your colleagues probably notice; they probably don't like it.

This is a great segue to:

2. Respect Other Functions Within The Organization Profoundly

For our immediate purposes, we can focus on just the ops team, but it applies well beyond. Understanding and respecting the priorities and needs of non-technical teams and taking them seriously helps greatly reduce the number of surprises on both sides. Also, if you're really living number 1 above, you probably won't be surprised that your goals are very closely related.

But back to your relationship with the ops team (or, if you're living in devops dream land, your colleagues, since you're all part of the combined devops utopia, right?) What makes them tick? What wakes or keeps them up at night? What makes their job harder? Easier? I like to make it personal: how have I made their lives better or worse?

Let's look inwards for a moment: what if someone is asking these questions about me? Well, I'm a software engineer. I grind code for a living. I get some requirements (new product spec, a bug, something I think up in my free time and don't tell anyone about, etc), figure out how to meet those requirements, write some code, and push it to production.

What are the things that make me happy while performing these functions? Well, there's a whole bunch of them, but they can all be summed up very easily: lack of friction. A relatively low number of things I have to do beyond my core activities in order to get to the end; a limited number of context switches. A clean, consistent, reproducible dev environment. A responsive, intelligible build system. A mostly-automated way of moving my code through various environments.

What has ops done for me?

Well, shit, I'm actually mad spoiled. Flickr was a PHP site with a well oiled deploy machine that we've all heard about - since you didn't need to restart anything to get your code out (an under-appreciated side effect of the way PHP is traditionally served), we'd literally just push a button and the new code got rsynced to the boxes while also keeping a nice, visible record of the what, when, and why (a form of this is now available to the masses in the form of Etsy's Deployinator). SimpleGeo and Urban Airship use(d) Puppet and Chef respectively to great success, and there was an ever-improving set of tools available to make it easier to start working on a project and to test it as I went along. When I was done, it got reviewed, merged, built and sent off to a package repo, then deployed to production using automation. I spent most of my time actually debugging or writing code, not shepherding it around environments or struggling to get it to run in the first place. It's also easy to forget the little things that helped keep computers out of my way - federated logins etc.

These are just the more salient examples - specific things ops has done to make my life easier; it is by no means an exhaustive list of what I see as the core strength of my prior ops teams.

What have I done for ops?

an opsian, elbow deep in
'it'
Photo by Business Insider

Let's look at what my teams at each of these orgs did that I think was helpful to and appreciated by the ops teams. This is in no particular order, and I'm going to forego the names of the organizations because there's a ton of overlap.

  • Painstakingly instrumented our services so that their state could be more easily examined in the wild
  • Pumped as much data as we could into the monitoring tools kindly provided us
  • Thoughtfully considered what metrics and properties were helpful in determining the health of each particular system being worked on. Business people might call this a KPI; Mathias Meyer called it a "Soul Metric" in his Monitorama talk.
  • Carefully set up alerts that interpreted the above to try to minimize noise and non-actionable alerts.
  • Learned at least enough about the configuration management tools to be able to submit pull requests for desired changes in production without personal involvement and hand holding from someone on the ops team.
  • Considered and tested how the software being written behaved itself before an emergency - how is failover handled? how are configuration changes handled?
  • Automated or helped automate parts of the process that were difficult to remember or tedious.
  • Worked on tools in our spare time that made any of the above easier.

Broadly, we tried to be sensitive to how the operators interacted with the thing in production and how reasonable the experience was - during changes, during outages and failures, etc. We focused on operability.

Why did we do all this?

3. Have a Strong Sense of Personal Accountability

Because it felt like the right thing to do. When people got woken up at three in the morning because something I had deployed broke in a confusing, difficult-to-debug way, it felt bad. I wanted it to be less confusing the next time. If we're being honest with ourselves, it probably helped with the motivation that I woke up too and was just as frustrated and annoyed.

Go back to #2 and think: "Do people in the other organizations have the right tools to perform their jobs?" The better the tools, the less friction there is, and the more quickly people can perform their reactive tasks (ops responding to pages; marketing compiling a traffic report that the CEO suddenly needs for a board meeting; support dealing with a massive DDoS or spam influx). The less time people spend reacting, the better - reacting is by definition tactical, and spending all your time in tactical mode, as we've covered, is not great. The list in section 2 was focused on ops, but a lot of the same stuff, especially the tools bit, applies to other teams as well.

It'll never be perfect, but often the smallest change makes the biggest difference. Re-arranging a dashboard ever-so-slightly could be the difference between someone getting RSI while trying to track down spammers until late at night and them going home in time for dinner. A good DevOps engineer, in my mind, is one who feels personally responsible and accountable for the parts of his or her job that have an effect on colleagues' happiness and success. Remember, everyone likes going home for dinner.

Conclusion

Coming back to what this all means for a software engineer: it's all about the big picture. In an organization whose primary output is software, everybody depends on how well that software is equipped to help them succeed in their particular job. Understanding your effect on these needs and striving to meet them - that's what DevOps means to me.

Further Reading

  • DevOps: These Soft Parts - a post by John Allspaw about the soft skills involved in making DevOps-style cooperation work
  • Developing Operability (slides) - a talk by Richard Crowley with specific advice for smoothing the journey of code to production for both devs and operators; more on the meaning of "DevOps" (warning: a wall of text)
  • DevOps - The Title Match - a post by John Vincent on a common misconception about the organizational meaning of DevOps

It's a train.. no, it's a computer.. can't it be both??

02 May 2013 #

I am delighted to let you spread the word about an amazing innovation from Lian Li, the acclaimed maker of computer cases. They have thrown caution to the wind and finally introduced the thing we've all been waiting for - the Choo Choo Train Computer Case.

COMPUTER TRAIN!

Yes. Yes. Let that sink in. It's a computer case shaped like an old steam-powered locomotive. It has a 300 watt power supply in the front section, and the cart can fit a Mini ITX motherboard, a slim optical drive, and a single internal hard drive. One might point out that these are somewhat weak specs as far as cases go, but hey, IT'S A FUCKING TRAIN.

But wait. There's more. No, seriously, there's more.

I saw that the case had 5-star reviews, so I clicked to see what proud owners had to say about it.

This SKU, which ends in an S, does NOT move compared to the more expensive SKU that ends in an L. It has no motor, it's just a case that looks like a train. The more expensive model [...] actually has a motor and a transmission, and comes with extra rails, so it will roll back and forth when the computer is turned on.

Yup. Lian Li's product page for this puppy is epic. Not only is there a more expensive version that moves (and comes with "Rail x6" instead of "Rail x1"), there's a limited edition one that has an atomizer. That's right. It makes steam!

power train

Basically, I'm spent just thinking about this. The amount of space accommodated by the case isn't ideal for the plans I have for an HTPC (I was on Newegg for a reason, after all), and I definitely couldn't handle "Rail x6" and a computer case scooting back and forth, but y'all know what to get me for my birthday now. I'll make it work.

Unfortunately, it's out of stock on Newegg and I can't find it for sale anywhere, so I fear that the opportunity may have passed. Who knows when Lian Li will elect to share their genius with us again? I'll probably end up having to purchase one of these guys on eBay for thousands of dollars as a collector's item years from now.

Some Love For Ishmael

29 April 2013 #

Back in the days of fire fighting and database optimizing at Flickr, when I could debate the merits of different MVCC options comfortably, I built a little tool called Ishmael to help us make sense of mk-query-digest data more easily (apparently, the project has been moved to the "Percona Toolkit" and renamed pt-query-digest). Tim Denike made some improvements during his remaining time at Flickr after I had left, and then Asher Feldman took the project with him to The Wikimedia Foundation. Eventually, he sent in a large enough pull request that I simply did not have the capacity to test it - I, after all, have not used MySQL in anger in ages. So I did the natural thing and made Asher a collaborator on the repo.

This past week, during a moment of vanity, I noticed that there were quite a few more stars on the repo than there had been. I wondered what might have caused it, and shrugged. Then on Sunday the DevOps Weekly email provided the answer: Asher had written a post about MariaDB on Wikimedia's blog, in which he mentions their use of Ishmael in comparing performance between old and new database versions. It is a good read for anyone interested in database migrations and upgrades, especially "doing it live!"

Everyone, look, this is my "proud open source moment" face.

Readmeme - a README generator for those in doubt

23 March 2013 #

Programmers have been lamenting each others' inability to properly document software for many decades. Some recent examples include:

  • Tom Preston Werner's call for README-driven development

    a beautifully crafted library with no documentation is also damn near worthless.

  • Ted Nyman's recent Basho Chats diatribe about software expressing itself poorly (video not yet uploaded)
  • Peter Morgan's eloquent Go issue report

    Is that all we want to do is explain bits of code, and its runningz what it does.. Hopefully there is a lot of coders looking at it cos we can auto gen documentation..

Powerful stuff. There's been a corresponding surge in tools to assist in creating documentation. Some examples:

Still, I see repositories with READMEs like "Gathers and relays system metrics" (not to pick on anyone in particular, just an example) and ones that skip straight to installation instructions and licensing info. That's bad.

The README is a project's face to the world. It should tell the audience what the project does and what motivated its creation at a minimum. Not to belabor the already-well-made point too much further, I've created a tool for myself and others to help formulate a basic README when inspiration betrays us.

It's called readmeme - and yet again, Richard Crowley helped me name it. It is apt. I hope it starts a meme of informative READMEs.

For more info on the project, check its README (get it?)

The 3 Little Mistakes

03 December 2012 #

There are many mistakes people make when programming for the web. Here are three that I've seen everywhere I've worked. I think they deserve extra attention because they are relatively easy to avoid, but very difficult to recover from.

Encodings

Even if you only ever have US-based users, enough folks have accents in their names and all sorts of other reasons to introduce non-ASCII characters into your systems. Enforce UTF-8 at all levels, and especially at storage time. Fixing an encoding issue is difficult, and usually involves rewriting all the data. It is unpleasant, no matter the underlying datastore.
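A minimal sketch of the "enforce at all levels" idea (the helper name is hypothetical, not from any particular codebase): normalize every inbound value to UTF-8 before it reaches the datastore, and fail loudly on anything that doesn't decode.

```python
def to_utf8(value):
    """Return UTF-8 encoded bytes, rejecting anything ambiguous.

    Accepts str or bytes; bytes are decoded strictly so that garbage
    fails loudly at write time instead of being stored silently.
    """
    if isinstance(value, bytes):
        # Strict decode: raises UnicodeDecodeError on invalid input.
        value = value.decode("utf-8")
    return value.encode("utf-8")

# A name with an accent round-trips cleanly.
stored = to_utf8("Beyoncé")
assert stored.decode("utf-8") == "Beyoncé"
```

The same discipline applies at the database layer (e.g. declaring UTF-8 as the connection and column character set), so no layer gets a chance to reinterpret the bytes.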

Limits and Pagination on API Requests

This is pretty straightforward, but I've seen it neglected in practice quite a few times. Whether it's a POST where the batch size isn't bounded or a GET that returns "all the _____," it inevitably becomes a nightmare that is difficult to fix due to clients not expecting to have to paginate. This is particularly exacerbated in APIs used by mobile apps - a total fix requires getting all the client apps to upgrade to a new version of the library AND for all their users to upgrade.
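The fix is cheap if it ships on day one. Here's a sketch (hypothetical names, not from any real API) of the server-side half: clamp the requested page size so no client can ask for everything at once, and hand back a cursor for the next page.

```python
MAX_LIMIT = 100  # hard ceiling the server enforces, whatever the client asks for

def paginate(items, offset=0, limit=MAX_LIMIT):
    """Return one bounded page plus the offset of the next page (or None)."""
    limit = max(1, min(limit, MAX_LIMIT))  # never trust the client's limit
    page = items[offset:offset + limit]
    next_offset = offset + limit if offset + limit < len(items) else None
    return page, next_offset

rows = list(range(250))
# An oversized request gets clamped to MAX_LIMIT instead of returning it all.
page, next_offset = paginate(rows, offset=0, limit=9999)
```

Because clients receive a `next_offset` from the very first response, they're built to paginate from the start - which is exactly the expectation that's so painful to retrofit later.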

Timezones

This issue is similar to the encoding one - a matter of consistency. Storing everything in UTC forces consistency on the data. Not doing so leaves an opening for rows in the same table to be written with different time zones, making querying and comparisons more complicated and possibly expensive. Almost without exception, you only want to think about timezones at display or query time; it's much easier to deal with DST when your data is normalized to the same, consistent, season-immune representation.
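A sketch of the boundary (function names are hypothetical): normalize to UTC at write time, and convert to the user's zone only when rendering. Assumes Python 3.9+ for `zoneinfo`.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib in Python 3.9+

def to_storage(dt):
    """Normalize any timezone-aware datetime to UTC before it is written."""
    return dt.astimezone(timezone.utc)

def for_display(dt_utc, tz_name):
    """Convert a stored UTC timestamp to a user's zone at render time."""
    return dt_utc.astimezone(ZoneInfo(tz_name))

# 9:30 on the US west coast in December (PST, UTC-8)...
local = datetime(2012, 12, 3, 9, 30, tzinfo=ZoneInfo("America/Los_Angeles"))
stored = to_storage(local)                        # ...lands as 17:30 UTC
shown = for_display(stored, "America/New_York")   # ...renders as 12:30 EST
```

Every row carries the same season-immune representation, so comparisons and range queries never need to reason about DST.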

Cal's book has a lot of really good info on other details that merit attention when working on the Internets.

Just Use Your Words

19 September 2012 #

It is becoming increasingly plausible that I have a partially torn meniscus in my left knee. The unusual, difficult-to-place pain, the occasional painful popping and "catching" in the joint, the frequent unprompted discomfort.. I run into people on a regular basis playing soccer - any number of recent collisions could damage a ligament.

But the ligament is not the issue.

The issue is the reluctance to call it what it is - the orthopedic surgeon dictated "possible meniscus irregularity" into his recorder. The physical therapist opted for the cautious "possibility of some meniscus involvement." Having already been primed, I pressed a bit: "A torn meniscus, you mean?" "Yes, a partial tear."

Come on guys, I'm not going to die of a torn meniscus. The upper bound on recovery from a meniscus repair is 4 months, and I'm 26 - it would likely take less than that. Surely they don't expect me to break down crying right there in the office when confronted with the news..

Why am I writing about this harmless, albeit frustrating, experience? I do have a reason: it reminded me of a great piece The Economist ran on the subject of "euphemisms" last December. It is particularly relevant in the current political climate, especially the very last section titled "Little white lies."

The piece ends with a call to abandon euphemisms for a day. I highly recommend this exercise, having lived it long before this article was published. You'll piss some people off, but at least they (and you) will know where you stand.

The Lonely Wait

07 September 2012 #

As I exited the Civic Center metro stop on this beautiful sunny morning, I glanced up Hyde street. There was no 19 bus to be seen up the street, which meant it was at least 5 minutes away. At best, it would make it to my destination at 8th and Bryant around the same time as I would on foot.

Rounding the corner onto 8th, just past the 19 stop, I came upon 20 to 30 people, mostly men, mostly in their mid-20s, lined up against a fence. Nobody was talking, everybody had headphones on, some were smoking. They were waiting for the Zynga shuttle to take them to their office 6 blocks away, about 2/3 of a mile down 8th street. 15 minutes by foot, 5 by bike.

At their feet started a freshly painted bike lane, possibly the widest in the city.

Somewhere far away, in an empty bar completely unaware of the human tragedy unfolding at 8th and Market, the jukebox played "Don't Know What You Got Till It's Gone."

Windshield: Display Incoming Geo Data Using PolyMaps

05 September 2012 #

While discussing what to do for a Free Friday project, Neil and I decided we wanted to build some sort of visualization related to the location data UA had been processing. I quickly thought of the dashboard that Jon Rohan wrote for SimpleGeo, which would plot the coordinates that API requests were targeted at as those requests came in. After finding Jon's code, I realized that the front end portion was going to be very easy to adapt, as well as make generic. Having obtained Jon's blessing, I tidied the code up a bit and open-sourced it. Of course, the backend code that supplies our geo data will remain closed source and proprietary, but there is an example data source included to help others get started.

I called the project Windshield because the points look like bugs that show up on the glass over the course of a long drive. The source is here. I have an example up here.

I could make this a wordy post about programming practices, JavaScript, and technology, but this was a really simple project. Besides, other people did most of the hard work:

  • PolyMaps did all the map parts of it.
  • CloudMade made the gorgeous tiles.
  • Jon Rohan wrote the code I aped heavy-handedly to get a grip on the thing.
  • I made the function that supplies the points pluggable so that it was easy to test and extend, so HURRAY for higher order functions.

Probably the most surprising thing was Aaron submitting a pull request hours after I had open sourced the damn thing to make the demo work correctly in Firefox. Thanks, Aaron!

Things I know I still need to do:

  • Make it easier to manipulate the map once it's created (perhaps return something from the windshield function)
  • Explore the concept of a PolyMaps "layer"
    • Can I create a custom layer for my points and use the DOM element for that layer to more efficiently prune points over time? The present implementation of selecting all circle tags, then removing the parent of their parent of their parent brings the browser to its knees
  • Remove points over time in a FIFO arrangement? Would require quite a bit more JavaScript than I presently have appetite for, but who knows...

Highly relevant:

On "Infrastructure for Startups"

01 September 2012 #

One of Paul's Slides

Conference talks on the subject of infrastructure are often lacking in actionable advice - especially for fledgling startups. I am shamefully guilty of this myself.

A notable exception is a recent talk by Paul Hammond, my old manager and good friend. His Velocity 2012 talk titled "Infrastructure for Startups" was a refreshing dose of pragmatism, drawn from the experience of building and growing TypeKit. Paul ran Flickr Engineering before that - he has street cred for weeks. Unfortunately, video of the talk does not appear to be available, though that may change soon according to Paul.

Though I am prone to lengthy rants about building things the "right" way and am often heard advocating more rigorous planning at the start, I can't agree more with most of what Paul says - a 2-3 person startup just doesn't have the time to be mucking around with anything but the product they're building. This being 2012, there is an army of service providers ready to share the burden - for much less than the opportunity cost of building everything yourself.

Don't forget to measure

I'd add one more thing to Paul's lists (here and here) - you need good graphs right away. I'm surprised Paul didn't mention this after "all performance problems have been on things we don't yet measure." Good metrics collection and display are critical to both business success and technical efficiency. The easier it is to put together dashboards that zero in on meaningful metrics and correlations, the more you'll do it, and the more quickly you'll identify inefficiencies and opportunities.

I've yet to hear a favorable review of the baked-in EC2 monitoring tool (CloudWatch), so I, as usual, recommend the slick, easy to use, gorgeous Librato Metrics for these purposes. As a bonus, the product comes with some basic alerting features (which I haven't tried yet, in the interest of honesty), so it may help stall or obviate the need to set up Nagios or one of the related monsters. All the tools for getting data in have already been written.

Speaking of alerts, PagerDuty is another no-brainer for small teams starting to set up more fine-grained monitoring. Big surprise: Librato has PagerDuty integration.

I had some experience with the competing Cloudkick product and sadly don't have many kind words, although much has probably changed since our last interaction.

Start, New Game

26 August 2012 #

I'm tired of WordPress and the associated bullshit (upgrades, vulnerabilities, etc), so I'm finally ripping the band-aid off and moving my blog to a Jekyll-generated static site hosted on S3. I will not be migrating the old content, as that would be an epic pain in the ass, the benefits of which do not seem substantial to me. The old site will live on, its database and directories set to read-only mode, to serve what little Google traffic still washes up there.

ONWARDS!