Thoughts Evoked By CircleCI's July 2015 Outage

19 July 2015 #

After a bit of downtime, CircleCI’s team was kind enough to post a very detailed Post Mortem. I’m a post mortem junkie, so I always appreciate it when companies are honest enough to openly discuss what went wrong.

I also greatly enjoy analyzing these things, especially through the complex systems lens. Each one of these posts is an opportunity to learn and to reinforce otherwise abstract concepts.

NOTE: This post is NOT about what the CircleCI team should or shouldn’t have done - hindsight is always 20/20, complex systems are difficult, and hidden interactions actually are hidden. Everyone’s infrastructures are full of traps like the one that ensnared them, and some days, you just land on the wrong square. Basically, that PM made me think of stuff, so here is that stuff. Nothing more.

Database As A Queue

The post mortem states:

Our build queue is not a simple queue, but must take into account customer plans and container allocation in a complex platform. As such, it’s built on top of our main database.

As soon as I read that, I knew exactly what happened. I’d lived this exact problem before, so here’s that story:

At Flickr, we would put everything into MySQL until it didn’t work anymore. This included the Offline Tasks queue (aside: good grief, this post was written in 2008). One day, we had an issue that slowed down the processing of tasks. The queue filled up like it was supposed to, but when we finished fixing the original problem, we noticed that the queue was not draining. In fact, it was still filling up at almost the same rate as during the outage.

When you put tasks into MySQL, you have to index them, presumably by some date field, so that you can fetch the oldest tasks efficiently. If you have additional ways you want to slice your queues, which both CircleCI and Flickr did, that index probably spans several columns. Inserting rows into RDBMS indexes is relatively expensive and usually involves at least some locking. Dequeueing jobs also touches the index, so even marking jobs as in progress or deleting them on completion runs into the same locks. Now you have contention from a bunch of producers and consumers on a single resource, and each update to it keeps getting more expensive and time consuming. Before long, you’re spending more time updating the job index than actually performing the jobs. The “queue” essentially fails to perform one of its most basic functions.
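
To make the contention concrete, here’s a minimal sketch of the usual database-as-a-queue pattern in Go (the table, column, and index details are hypothetical, not CircleCI’s or Flickr’s actual schema). Every enqueue, claim, and completion touches the same status/created_at index, which is exactly where the locking piles up under load:

package dbqueue

import (
        "database/sql"
        "time"
)

// Enqueue inserts a job; the INSERT must also update the (status, created_at)
// index that the dequeue query depends on.
func Enqueue(db *sql.DB, payload string) error {
        _, err := db.Exec(
                "INSERT INTO jobs (payload, status, created_at) VALUES (?, 'pending', ?)",
                payload, time.Now())
        return err
}

// ClaimOldest grabs the oldest pending job inside a transaction. The
// SELECT ... FOR UPDATE and the status change both contend on the same index
// as every producer calling Enqueue.
func ClaimOldest(db *sql.DB) (id int64, payload string, err error) {
        tx, err := db.Begin()
        if err != nil {
                return 0, "", err
        }
        defer tx.Rollback()

        row := tx.QueryRow(
                "SELECT id, payload FROM jobs WHERE status = 'pending' ORDER BY created_at LIMIT 1 FOR UPDATE")
        if err = row.Scan(&id, &payload); err != nil {
                return 0, "", err
        }
        if _, err = tx.Exec("UPDATE jobs SET status = 'running' WHERE id = ?", id); err != nil {
                return 0, "", err
        }
        return id, payload, tx.Commit()
}

With one or two workers this is fine; the trouble starts when a fleet of producers and consumers all hammer Enqueue and ClaimOldest at once, because every one of those statements has to update the same index.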

Maybe my reading is not quite right on the CircleCI issue, but I’d bet it was something very similar.

In the aftermath of that event at Flickr, we swapped the MySQL table out for a bunch of lists in Redis. There were pros and cons involved, of course, and we had to replace the job processing logic completely. Redis came with its own set of challenges (failover and data durability being the big ones), but it was a much better tool for the job. In 2015, Redis almost certainly isn’t the first thing I’d reach for, but options are plentiful for all sorts of use cases.
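
For contrast, here’s roughly what the Redis flavor looks like, sketched in Go with the redigo client (the key and function names are made up for illustration, not Flickr’s actual implementation). LPUSH and BRPOP are cheap list operations with no shared index to maintain:

package redisqueue

import "github.com/garyburd/redigo/redis"

// Enqueue pushes a job payload onto the head of a Redis list.
func Enqueue(c redis.Conn, payload string) error {
        _, err := c.Do("LPUSH", "jobs:pending", payload)
        return err
}

// Dequeue blocks for up to timeoutSecs waiting for a job to appear at the
// tail of the list. BRPOP returns a [key, value] pair.
func Dequeue(c redis.Conn, timeoutSecs int) (string, error) {
        reply, err := redis.Strings(c.Do("BRPOP", "jobs:pending", timeoutSecs))
        if err != nil {
                return "", err
        }
        return reply[1], nil
}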

Coupling at the Load Balancer

(image: too many papers, from nrc.gov)

First we tried to stop new builds from joining the queue, and we tried it from an unusual place: the load balancer. Theoretically, if the hooks could not reach us, they couldn’t join the queue. A quick attempt at this proved ill-advised: when we reduced capacity to throttle the hooks naturally they significantly outnumbered our customer traffic, making it impossible for our customers to reach us and effectively shutting down our site.

I don’t actually think that’s an “unusual” place to start at all. If one of the problems is that updates to the queue are becoming too expensive and every additional update exacerbates the problem, start eliminating updates!

The rest of that paragraph is also not unusual at all. It hints at some details about the CircleCI infrastructure that you would find in an overwhelming majority of infrastructures.

  • The public site and the GitHub hooks endpoint share a load balancer
  • The processes serving the site and the GitHub hooks run on the same hardware (likely in the same process, as they’re probably just endpoints in the same app)
  • There is no way to turn off one without turning off the other

Everyone that knows me knows I LOVE to talk about “unnecessary coupling” in complex systems. This is a really good example.

The two functions have key differences - for one, their audience. Let’s focus on that. The hooks serve an army of robots residing somewhere in Github’s datacenter. The site serves humans. As a general rule, robots can always wait, but making humans on the internet wait for anything is a big no-no. To me, this is a natural place to split things up, all the way through. You can still use the same physical load balancer or ELB instance, but you could make two paths through it - one for the human oriented stuff, another for the robots. Sure, there’ll probably be some coupling farther down the line, like when both processes query the same databases. But at least now the site will only go down if the database is actually inaccessible, not when it has a single contended resource that has nothing to do with serving the site.
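
Here’s a minimal sketch of the shape I have in mind, in Go (the handler names and ports are hypothetical, not CircleCI’s actual code): two separate listeners that can share a load balancer and a database, but can be throttled or shut off independently of each other.

package main

import (
        "log"
        "net/http"
)

func serveSite(w http.ResponseWriter, r *http.Request) {
        // render pages for humans
}

func serveHook(w http.ResponseWriter, r *http.Request) {
        // accept a webhook and enqueue a build for the robots
}

func main() {
        // Human-facing site on one listener.
        site := http.NewServeMux()
        site.HandleFunc("/", serveSite)

        // Robot-facing hooks endpoint on another. The load balancer can route,
        // throttle, or blackhole each listener independently.
        hooks := http.NewServeMux()
        hooks.HandleFunc("/hooks/github", serveHook)

        go func() { log.Fatal(http.ListenAndServe(":8080", site)) }()
        log.Fatal(http.ListenAndServe(":8081", hooks))
}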

A Long Aside: Traffic Segregation At Opsmatic

I do obsess over this stuff, and we’ve already had our fair share of outages with very similar causes, so I want to talk a bit about how traffic is currently handled at Opsmatic. This section admits to flavors of the same issues described above, to further drive home the point that no one’s infrastructure is perfect, certainly not ours. It’s also meant to demonstrate that following some very high-level guidelines built on prior learning can go a long way towards improving an infrastructure’s posture in the face of unexpected issues, especially surges.

There are three entry points into Opsmatic:

  • opsmatic.com is our company’s website and the actual product app
  • api.opsmatic.com is our REST API, which has historically been used mostly by the app (that’s changing quickly)
  • ingest.opsmatic.com is the API to which our collection agents talk

Here’s an ugly drawing to help you along:

(diagram)

The first two are configured to talk to the same AWS Elastic Load Balancer (ELB). The ELB forwards traffic on ports 80 and 443 to a pool of instances where nginx is listening. nginx in turn directs the requests: traffic to (www.)opsmatic.com goes to one process (a Django app run under gunicorn), while traffic to api.opsmatic.com goes through a completely different pipeline, where it’s teed off to the appropriate backend depending on the URL pattern. Currently, most of the API traffic is actually coming from humans using the app. As we flesh out, expand, and document our REST API, that’s bound to change, at which point we may put even more buffer between the two traffic streams - separate nginx processes with appropriate tuning, possibly even separate hardware.

The third ingest.opsmatic.com subdomain is pointed at a completely different ELB. That’s our equivalent for the Github hooks - the agents are always running, always sending heartbeats, always sending updates. An unexpected surge in traffic - for example, an enormous new customer spinning up agents on their whole fleet of servers all at once without warning us - could certainly overwhelm the currently provisioned hardware. At the moment, this would take the app down as well - while the Opsmatic backend is extremely modular, we currently run all those pieces on the same machines. This limits the operational overhead at the expense of introducing gratuitously unnecessary coupling.

However, just having the separate ELB gives us recourse in the event of a sudden surge in robot traffic: we can just blackhole THAT traffic at the ELB and continue serving site and API read traffic. The robots would be mad, and the data you were browsing would gradually get more and more stale, but it beats the big ugly 500 page.

The Opsmatic agent is also built to accumulate data locally if it can’t phone home, so the robots would build up a local version of the change history without losing any data or timestamp accuracy. When we were back up, they’d eventually backfill all that data. This event itself could cause a stampede, but we’ve found it to be a real nice luxury to have.
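
A rough sketch of the idea, not the actual Opsmatic agent code: stamp each event locally when it’s recorded, buffer it whenever the ingest endpoint is unreachable, and drain the backlog in order once sends start succeeding again.

package agentsketch

import "time"

// Event is a single observation stamped with the time it was recorded locally.
type Event struct {
        Timestamp time.Time
        Body      string
}

// Spool accumulates events while the ingest endpoint is unreachable and
// backfills them, oldest first, once sends start succeeding again.
type Spool struct {
        pending []Event
        send    func(Event) error // e.g. an HTTP POST to the ingest API
}

// Record stamps the event locally, so timestamp accuracy survives an outage,
// then attempts to flush the backlog.
func (s *Spool) Record(body string) {
        s.pending = append(s.pending, Event{Timestamp: time.Now(), Body: body})
        s.Flush()
}

// Flush drains the backlog in order; on the first failure it stops and keeps
// the rest for the next attempt, so no data is lost.
func (s *Spool) Flush() {
        for len(s.pending) > 0 {
                if err := s.send(s.pending[0]); err != nil {
                        return
                }
                s.pending = s.pending[1:]
        }
}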

The modularity combined with reasonably healthy automation allows us to regain our balance quickly. If a certain service is overloading a shared database, we can kill just that service while we work out what’s going on or scrambling to add capacity.

Every Incident Is A Push Towards Self Improvement

The next time this sort of event does happen, we’d likely follow up with a few more steps that have been put off solely due to resource constraints:

  • Split up stack-role into smaller pieces, likely along the lines of “human-facing services” and “robot-facing services”. That is, physically separate services that deal with agent traffic from services that deal with human traffic. Possibly we’d go a step further and split up web services from background job processors that pull work from queues.
  • Split the opsmatic.com and api.opsmatic.com load balancers up
  • A bunch of auxiliary work on various internal tools to better accommodate the fragmentation

The upshot - we currently have a bit of coupling and resource sharing going on for things that really shouldn’t be coupled, but it’s only because we’ve postponed actually splitting everything up in favor of other projects. We are:

  • Seconds away from being able to blackhole automation traffic in favor of preserving the app, as well as turning off any background processing that might be causing issues - we can just let that queue grow, turn the service on and off as we try different fixes, etc.
  • A few minutes of fast typing away from adding capacity while most of our customers likely don’t even know anything is amiss
  • A few more minutes of fast typing from completely decoupling robot traffic from human traffic so that the next surge doesn’t affect the app at all

Hey, that’s pretty good! If we have to fight a fire, at least we can fight it mostly calmly. That, in and of itself, is huge. Being able to isolate the problem and say “OK, this is the problem, it is not the whole infrastructure, it is contained to a particular set of actions, and now we’re going to work on it” does wonders for morale during an outage. I do not envy what the CircleCI team must have felt when attempts to bring back the queue took down the main site.

I used the word “posture” earlier - I have a very specific property in mind when I use that word. It’s not so much “how resilient to failures is our infrastructure?” but rather “how operator-friendly is our infrastructure during an incident?” Things like well-labeled kill switches, well-segmented traffic, and well-behaved background and batch processing systems that operate independently from the transactional part of the app go a long way towards decreasing stress levels during incidents.

Conclusion?.. What is this post, even..

This turned into a bit of a rambling piece. Hope you found it interesting. Here are my key takeaways:

  • You can use a database as a queue, but you should keep a close eye on the timing data for the “work about work” your database is doing just to get jobs in and out. One day, you’re going to have a bad time. That is ok. It’ll make you stronger.
  • It pays to think about the sources of traffic to your infrastructure and how they interact with each other. Over time, it pays even more to have parallel, as-decoupled-as-time-allows paths through your system, any of which can be shut off in isolation.
  • Every infrastructure is a work in progress; computers are hard, and distributed systems are even harder

A Story and Some Tips For Sustainable OSS Projects

12 July 2015 #

This past week Kyle Kingsbury tweeted about being flooded with pull requests caused by changes to the InfluxDB API. Coincidentally, I had just spent several hours over the July 4th weekend dealing with the same problem in go-metrics, albeit on a smaller scale. I think these are symptoms of a very common problem with OSS projects.

A bit of history

The Metrics library has a very simple core API made up of various metrics-related interfaces - you can create metrics, push in new values, and read the metrics’ current values and aggregates. Simple and beautiful.
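
For readers who haven’t seen it, the core API really is that small. Here’s a minimal sketch from memory (check the go-metrics README for the authoritative surface):

package main

import (
        "time"

        "github.com/rcrowley/go-metrics"
)

func main() {
        // Create metrics and register them under names...
        c := metrics.NewCounter()
        metrics.Register("jobs.completed", c)

        t := metrics.NewTimer()
        metrics.Register("job.duration", t)

        // ...push new values in...
        c.Inc(1)
        t.Update(250 * time.Millisecond)

        // ...and read current values and aggregates back out.
        _ = c.Count()
        _ = t.Mean()
}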

The library was originally put together by the epic Richard Crowley while he was working at Betable. He was starting to experiment with using Go for services, and needed a way to keep track of them. Finding no satisfactory equivalent to Coda Hale’s metrics library for Java, Richard made his own. Folks quickly wrote adapters to push metrics into their time series system of choice - I wrote one for Librato. Richard happily merged the PRs.

The core features were built, everything worked reasonably well, and Richard moved on to a job that doesn’t use Go nearly as heavily. Several months later, I noticed go-metrics had 20+ open pull requests. I pinged Richard and offered to help maintain the project. We were using it heavily, and were happy to pay our dues. Richard immediately made me and Wade, a Betable employee, collaborators on the repository. I started looking over the PRs.

The Paralysis

(image: too many papers, cropped from photo by wheatfields)

I quickly realized that I was not qualified to review a good chunk of the PRs:

  • Update for InfluxDB 0.9
  • Fallback to old influxdb client snapshot
  • Update influxdb client

“I don’t know jack about InfluxDB,” I thought. “How am I supposed to decide what gets merged and what doesn’t?” There was a Riemann client in there too. Who am I to judge a Riemann client lib?

I had also observed that the InfluxDB API was still changing quite a bit. I remembered that there had previously been a wave of PRs about InfluxDB. Wait, was this the same wave?

Another issue that gave me pause was that I had no idea how many people were already using this library with Influx, expecting the current client to continue working. How many builds would break? Go’s notoriously loosey-goosey dependency management made it likely that as soon as I merged any API changing PR, I would get another PR changing it back the next day.

There was also a PR about adding a Riemann client. Welp, I don’t use that regularly either..

Clarity

In the summer of 2012, I did a brief contracting stint with Librato. Among other things, I helped build a Java client library. They also asked me to tie that client into Coda’s library, so I obliged and submitted a PR. Coda replied fairly tersely:

Really cool functionality, but I’ve been declining further modules for the main Metrics distribution. I suggest you run this as your own project. I’ll be adding a section in the Metrics documentation with links to related libraries, and this should definitely be in it.

At the time, I thought “Well, that kinda sucks. I want my code up there, with the cool kids’ code in the really popular library.” Now, literally three years later, I understood exactly why Coda made that move. He didn’t use Librato. He had no idea what would make a good or bad Librato client. It was just more surface area to support. He had enough to worry about with core Metrics and Dropwizard features, keeping up with JVM changes and compatibility issues, etc., never mind other projects.

The Path Forward

(image: well fitted pieces, cropped from photo by matthewbyrne)

Though Kyle points out that this may not be the best approach for every project, it seemed very clear to me that the only way the go-metrics lib could continue to be maintained, at least by Wade and me, was to modularize it and move any external dependencies out to their own libraries - with their own maintainers, and hopefully their own communities. It’s not going to make the “moving target API” problem any easier, but it’ll put the solution into the hands of the people who are actually interacting with the problem and have a vested interest in achieving and maintaining a palatable solution. It removes me, Richard, and Wade, completely disinterested and uninitiated bystanders, from the critical path to a solution.

At the end of the day, it’s just Separation of Concerns. It’s just good organization. The task is broken up into small semi-independent pieces with responsibility for each piece given to the person with the most interest in that piece. There’s a corresponding and very palpable feeling of psychological relief. “Review the PRs for go-metrics” is no longer this huge nebulous task that will require a huge amount of context and deep understanding of some additional system. I know the core APIs. I can evaluate changes to that fairly quickly.

Practical Tips For Maintainers

If you find yourself maintaining a small OSS project with a fairly well defined scope and API, here are some tips to keep yourself sane (some of these are more general, not specific to the above story):

  • Always have a buddy. If your project gets any traction and you start seeing community adoption, find one or more particularly enthusiastic users and convince them to help carry the load. We all want to take care of our baby projects, but real life is what it is. People change jobs, have health issues, go on lengthy vacations, start families, become vampires. Some combination of those things will likely make your interest in any given project oscillate, and you should have a framework in place for making sure you don’t create another zombie on GitHub.
  • Resist dependencies. If someone creates a PR which brings in a new library, especially code that talks to something over the network - a server or SaaS of some kind - strongly consider pushing the author towards starting their own library. If this is not possible due to a lack of APIs, invest the time in adding hooks instead (there’s a sketch of what I mean after this list). It’ll be worth it.
  • Have a concise contribution policy. This will greatly reduce the burden of having to reply to PRs that suffer from obvious code quality issues. It is an absolute MUST to have a pre-written set of rules to appeal to instead of having to post seemingly arbitrary responses to individual PR authors.
  • Enforce guidelines automatically whenever possible. We are living in a remarkable age. The tools available to maintainers are simply amazing. With the help of services like GitHub, TravisCI, CodeClimate, etc., there’s no need to maintain a mailing list, apply patches by hand, or set up some jury-rigged system for running tests. It’s all free, and it’s all great. Use it. go-metrics and go-tigertonic do not take advantage of the OSS ecosystem, and I am about to fix that. One other small note here: you should make it very easy to replicate locally the exact process that the build is going to perform. There should be a Makefile or something similar containing the one command that the build tool is going to run, so that folks can validate their branches easily without having to wait on the CI tool to run against their PR.
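
To make the “hooks instead of dependencies” point concrete, here’s the kind of seam I have in mind, sketched in Go (the names are hypothetical, not go-metrics’ actual API): the core library exposes a small interface and a registration point, and the InfluxDB- or Riemann-specific code lives in somebody else’s repository that merely implements it.

package metricslib

// Snapshot is whatever the core library knows how to export.
type Snapshot map[string]float64

// Reporter is the hook: anything that can ship a snapshot somewhere.
// InfluxDB, Riemann, Librato, etc. clients live in their own repositories,
// implement this interface, and are never imported by the core library.
type Reporter interface {
        Report(Snapshot) error
}

var reporters []Reporter

// Register is called by the application (or by an external reporter package)
// to wire a reporter into the core library.
func Register(r Reporter) {
        reporters = append(reporters, r)
}

// Publish hands the current snapshot to every registered reporter; error
// handling is left to the reporters themselves in this sketch.
func Publish(s Snapshot) {
        for _, r := range reporters {
                _ = r.Report(s)
        }
}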

Hopefully you find our experience with maintaining and reviving go-metrics helpful, and this story helps you avoid similar pitfalls. Happy hacking.

A failure months in the making

08 November 2014 #

This is the story of an outage that occurred on September 25th 2014 and has previously been discussed in the context of blameless post mortems on the PagerDuty blog.

If you attended Surge 2014, you may have noticed something strange: a man was sitting on one of the cube-shaped stools in the Fastly expo area hunched over his laptop almost the entire day, and well into the evening hours. Even if you didn’t notice, and even if you weren’t even AT the conference, you may be curious about this man. The security guard certainly was, as he made his rounds after dark, long after everyone had left the expo area..

That man was yours truly; I was fixin’ stuff. This is the story of what happened.

The Outage

On September 24th Opsmatic was one of the many AWS customers to receive one of these emails:

One or more of your Amazon EC2 instances are scheduled to be rebooted for required host maintenance. The maintenance will occur sometime during the window provided for each instance. Each instance will experience a clean reboot and will be unavailable while the updates are applied to the underlying host. This generally takes no more than a few minutes to complete.

The EC2 Event Console confirmed that quite a few instances in our infrastructure would be affected:

(screenshot: reboot schedule)

All the servers would be rebooted early Friday or Saturday morning SF time.. while I was at the conference. There was not much certainty in the exact timing or order of the reboots (the windows were 4 hours long), but we did eventually discover some good news:

  • Any instances using EBS for their root volume could be put through a stop/start cycle in advance of the window to avoid the reboot. When you “stop” an instance, you’re essentially destroying it, but the EBS volume survives. When you “start” it back up, you get no guarantees about which “host” will receive the instance that will then boot that volume. This is where “ephemeral” drives get their names - they are attached to the “host” and do not survive a stop/start.
  • Any instances provisioned after the notifications went out would not need to be rebooted. As we later learned, the reboots were necessary for Amazon to roll out a patch to Xen which fixed XSA 108. Many hypervisor “hosts” were already running patched code, so Amazon would simply put new instances on already-patched hosts.

Since every single piece of Opsmatic’s infrastructure is redundant at least at the instance level, we quickly concluded that this was actually not that big of a deal:

  • All of our nodes used EBS root volumes, so they could be stop/started
  • Most of our nodes do not use ephemeral storage for anything important
  • The affected nodes that DID use ephemeral storage were Cassandra nodes. Since we use a replication factor of 3, we could afford to have any one of them rebooted at a time.

We briefly debated pre-emptively re-provisioning the Cassandra nodes anyway, but decided that it was better to just let the reboot happen. Copying data is time consuming, and the reboots were hours away. We would get up just before the maintenance window started and gracefully stop Cassandra on the node about to be rebooted, out of an over-abundance of caution.

To minimize the amount of odd-hours activity, we decided to stop/start all the stateless nodes that were scheduled to be rebooted on our own terms, during business hours. Since I was already at a conference, I’d take care of it in order to minimize disruption to the rest of the team back home, cranking away.

At around 13:50 PDT I started the process. I stop/started one of our NAT nodes without incident. Then things get a little murky.

For some reason, I decided to actually replace one of the nodes, but I don’t remember why. I did not make any record of my reasoning. It is entirely possible that I got distracted between the last node and the next one and went to reprovision it instead of just doing a stop/start cycle. It’s also possible there was some other issue with the node, and I simply failed to document it.

At about 14:15 PDT, I terminated one of our “stack” nodes (they run all the services that power the Opsmatic app) and then went to replace it.

We had provisioned our AWS infrastructure using Chef Metal so replacing the node should have been as simple as terminating it and then “converging” the infrastructure - a single, global command that does not take any parameters other than the declaration of what your infrastructure should look like (number of nodes in each cluster, etc). Chef, in theory, would detect that the “stack” cluster was missing a node and provision a new one to replace it.

So that is what I did. Replacing a node in our infrastructure is a routine operation that we had practiced several times without incident.

At 14:20 PDT Opsmatic went down in flames. The Chef run restarted every single instance in our infrastructure.

Talk about a “Game Day”…

(image: pages galore)

As soon as the instances came back up, we scrambled to make sure that all the services were back to normal. We were down for a total of about 30 minutes, in part because there were certain parts of the recovery process that were not as smoothly automated as we had thought; these defects became very apparent during the previously un-tested “restart the entire infrastructure” scenario.

The Causes

Once service was restored, we started trying to figure out what the hell had happened. Meanwhile, the delightful Surge lightning talks were drawing uproarious laughter in the main ballroom behind me.

As I scrolled frantically through the log from my fateful Chef run, I saw a bunch of lines like this:

[2014-09-25T21:18:39+00:00] WARN: Machine ******.opsmatic.com (i-*******
on fog:AWS:************:*********) was started but SSH did not come up.
Rebooting machine in an attempt to unstick it ...

One per server. We quickly confirmed in the #chef IRC channel that this was a bug - because Chef could not establish an SSH connection to these nodes, it decided to reboot them. That, apparently, should not have happened.

[2014-09-25T18:30:13-0400]
<johnewart> Ah, well -- you managed to uncover a bug by doing that
<johnewart> we should only reboot it if it's within the first 10 minute window
<johnewart> like, you create, and then try to run again 5 minutes later and it can't connect

After a bit more digging, we sorted out that chef-metal had been relying on the ubuntu user being present on all our machines along with a specific private key. Something had caused the home directory for the ubuntu user to be deleted.

At this point I remembered something: a LONG time ago, before Opsmatic even had a name, I had done some experiments with AWS. As part of that, I had a bootstrapping scheme which relied on the same ubuntu user (standard practice when provisioning Ubuntu AMIs), but also included a recipe called remove_default_users which nuked the ubuntu user once bootstrap was complete.

This bootstrap process was never used for anything serious - the initial iteration of Opsmatic’s infrastructure was one big server at an MSP; from there, we moved straight to the Chef-driven AWS setup. However, that small bit of cruft persisted in our chef-repo.

My hunch was correct. Although remove_default_users was never part of any roles or run lists in the new infrastructure, we were able to confirm that it was applied to all the nodes on August 31st (just a couple of days after the last time we had practiced replacing a node) by performing a search in Opsmatic itself:

(screenshot: chef report)

However, by the time of the outage it was once again absent from all run lists. So how did it get there on August 31st and how was it ultimately removed? That would take another couple of weeks to figure out.

The remove_default_users recipe was clearly dead weight; we had gotten a little sloppy and let a bit of invisible technical debt accumulate. In order to prevent the same thing from happening again, we immediately deleted the recipe. This had another nice side effect: the next time the recipe appeared in a run list, Chef would fail. We have good visibility into those failures in Opsmatic, so we would be able to react and debug “in the moment.”

That exact thing happened on October 14th: as I was doing some refactoring in our cookbooks and roles, I found Chef failing because it could not find remove_default_users. I knew I was about to find something important - something slippery, elusive, confusing, and damaging. Indeed.

The recipe was originally part of a cookbook called base - a collection of resources that needed to be applied to all nodes. As we moved to a “more-than-one-node” setup, we started using Chef roles to define run lists. The base cookbook was pulled apart and reconstituted as a role to be included in other roles. There was a step in the refactor where “parity” was achieved - the role was made to replicate the previous behavior exactly. At that point, the role was copied into another file called base-original.json to be used as a reference as pieces of it were pulled into other cookbooks etc. Many edits were then made to the role in the base.json file.

The base-original.json file stuck around in the roles directory.

But here’s the thing about a role file: unlike cookbooks, the name of the role doesn’t just come from the filename; it comes from the name field defined inside.

$ head roles/base.json 
{
  "name": "base",
  "description": "base role configures all the defaults every host should have",
  "json_class": "Chef::Role",
...
$ head roles/base-original.json 
{
  "name": "base",
  "description": "base role configures all the defaults every host should have",
  "json_class": "Chef::Role",
...

The majority of time spent working on Chef is spent working on cookbooks, so it’s easy to forget the subtle differences in behavior with roles.

So what had happened was this: while modifying something else about the base role, I had assumed that base and base-original were different roles that were both in use. I had modified both files and uploaded them both to the Chef server, first base, then base-original. In reality, they both updated the same role, and the base-original content won out because it was uploaded second. Chef ran at least once with this configuration, deleting the ubuntu user. Some time later, someone who DID know that base-original was not to be uploaded made yet more changes and only uploaded base, wiping remove_default_users out once more. By the time the epic reboot happened, it was gone from the run list again, leaving us to scratch our heads.

Because the ubuntu user was created by the provisioning process and not explicitly managed by Chef, it was not re-created.

Whoever ran chef-metal next was going to cause a global reboot. It just so happened that I did it from a conference and ended up spending my evening plugged into an expo booth’s outlet.

(image: outage selfie)

Remediations and Learnings

Computers are Hard

Managing even a small infrastructure requires discipline, precision, and thoroughness. The smallest bit of cruft can combine with other bits of cruft to form a cruft snowball (cruftball?) of considerable heft over a relatively short time period.

Cookbooks vs Roles

This sort of failure is exactly what’s behind the trend towards “role cookbooks” replacing the role primitive. Having a recipe that is simply a collection of other recipes is functionally identical to a role, but has a few advantages - namely versioning (enough said) and behavior consistent with resource cookbooks. Having a recipe named base-original.rb would have had no effect on a recipe named base.rb.

chef-metal

While the theory behind chef-metal sounds good, we have started switching away from it. Bugs and maturity are the immediate problems, but it would be foolish to act like those don’t exist in all software, including whatever other scheme we end up using. This single bug is not why we’re migrating away.

The theory behind chef-metal itself sounds good, and it’s the “right” sort of automation, i.e. it’s not just scripting steps normally performed by a human. However, it was very alarming how easily a very localized, routine change, one that had been executed successfully fairly recently, turned into a global disaster. This is a big red flag for any system. It is an indicator of unnecessary coupling. Every time we wanted to add any node to our infrastructure, however minor and auxiliary, we’d have to perform an operation that touches everything. Having witnessed the potential for disaster, we’d get a healthy dose of The Fear each time. In the long run, if we’re afraid to perform simple tasks with the provisioning system, we’re not going to provision and replace nodes as frequently. Whenever you stop doing something regularly, you become bad at it. Routine operations should have routine consequences.

There are also more tactical concerns: “can’t SSH to this server, better reboot it” sounds EXACTLY like automating a manual ops process, and a bad one at that. Then there’s the security angle: even with the bug fixed, chef-metal still requires SSH access to the servers it manages with elevated credentials. In other words, you have to keep the provisioning user (ubuntu in our case) around on your instances forever. We strongly dislike that - it adds another little bit to the surface area. Sure, you need to be on a private network in order to get to SSH in the first place, but it’s another hidden back door that’s easy to neglect. We’d rather not have it.

We haven’t had much time to think about it, but this approach may work much better when applied at the container level, one step removed from the actual infrastructure. We may investigate it in the future. For now, our infrastructure is small, homogenous and simple enough that we will simply be switching to a more “transactional” provisioning process.

Documenting and Finishing Big Migrations Quickly

A huge part of this was just technical debt - recipes, cookbooks, and roles left over through consecutive refactors. Even in a “simple” infrastructure, success and safety depend on a vast set of shared assumptions about how things work. As individuals change the systems’ behavior, the changes have to be explicit, easy to understand, and easy to remember. Pieces left around from “the old way” make it easy to act on a no-longer-valid assumption.

Things We Should Add To Opsmatic

We’re constantly improving teams’ visibility into changes and important events in their infrastructure. That we were able to pinpoint when a particular recipe was applied was great, but the experience also illuminated some gaps in our view of CM (e.g. role/run list changes, and some “meta” features to surface such changes). We’re hard at work converting what we learned into real improvements in the product.

Parting Thoughts

As soon as we recovered from this outage, I thought “I’m going to have to write about this.” It is a great example of a complex system failure, “like the ones you read about.” It served as a great, rapid refresher course on complex system theory; it reminded us that we have to minimize coupling and interactions within our systems constantly and ruthlessly.

If you enjoyed this story (you sadist), you’ll probably like the following posts and books in the broader literature.

  • The Field Guide to Understanding Human Error by Sidney Dekker, and pretty much anything else by Dekker on the subject of human error and human factors.
  • Normal Accidents by Charles Perrow - a great introduction to complex systems, complete with great anecdotes from a number of different fields.
  • Make It Easy by Camille Fournier is a great concise post on the importance of designing systems and processes with the operator in mind.
  • Kitchen Soap Blog by John Allspaw is a great source for keeping abreast of developments in complex system failure, as well as ops and ops management in general.
  • Amazon’s Epic 2011 Post Mortem - I mentioned this post in my Surge 2011 talk because it read so much like parts of the description of the Three Mile Island nuclear accident in Normal Accidents.

Two Factor Auth: Allow AWS IAM users to manage their own MFA devices

02 September 2014 #

(all info and screenshots are from 09/02/2014)

In light of all the recent incidents involving attackers taking control of a company’s root AWS account, I and most everyone I know who manages any sort of infrastructure have been re-auditing accounts and stepping up efforts to get everyone on our teams to turn on MFA (multi-factor authentication). MFA makes it impossible for someone to log in as you with just a username/password combo. An additional “factor” is required to confirm the user’s identity - typically a code from a synchronized number sequence. This has been standard practice in larger companies and capital-E Enterprise for many years, and is now starting to be taken seriously by folks operating at a smaller scale and in the cloud. No one wants to be the next tragedy.

MFA (or 2-factor auth) has traditionally been embodied by RSA tokens attached to a keychain or a badge lanyard. These days, your phone can act as an adequate substitute.

Turning on MFA for your root AWS account is fairly easy:

(screenshot: mfa device for root acct)

However, it took me an unfortunate amount of time to figure out how to allow users created as IAM accounts to manage their own MFA devices. Setting people’s devices up by hand through the root account was simply not an acceptable solution. Even at our size it was going to be a major headache, especially for our remote employee.

In the end, it’s all documented in AWS docs, but it’s a bit buried, and multiple steps are involved. Hopefully this post saves you some time.

Just The Right Amount

The critical thing is to give everyone JUST what they need and no more. Since you’ve already secured your root account, you can likely curtail the breach of an IAM account reasonably quickly, but it’s best if the account can wreak minimal havoc in the first place. For example, if a compromised account was able to fiddle with the credentials of other users, the exposure and cleanup effort would increase greatly.

Unfortunately, the IAM permissions policy system is rather arcane. That is an undesirable property for a security-related system to have (easy to get wrong), but alas, it’s the one we’ve got.

IAM Policies are made up of combinations of JSON blobs (“stanzas”) each containing a unique identifier, an effect (Allow, Deny), an action, and a resource to which the effect/action combo should be applied. There’s a whole bunch of documentation on the subject here so I won’t spend too much time elucidating it. Let’s cut straight to what we need.

MFA Device Permissions

When you create an IAM user, by default they are unable to do literally anything. When you pull up the IAM dashboard (where you have to go in order to set up your MFA device), you just see permissions errors everywhere:

(screenshot: no permissions by default)

“Well, that sucks,” I thought, looking over a co-worker’s shoulder. Googling “allow IAM user to manage own mfa device” turns up this lovely page: Example Policies for Administering IAM Resources. Under the heading “Allow Users to Manage Their Own Virtual MFA Devices (AWS Management Console)”, we find an example policy that should do the trick.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowUsersToCreateDeleteTheirOwnVirtualMFADevices",
      "Effect": "Allow",
      "Action": ["iam:*VirtualMFADevice"],
      "Resource": ["arn:aws:iam::ACCOUNT-ID-WITHOUT-HYPHENS:mfa/${aws:username}"]
    },
    {
      "Sid": "AllowUsersToEnableSyncDisableTheirOwnMFADevices",
      "Effect": "Allow",
      "Action": [
        "iam:DeactivateMFADevice",
        "iam:EnableMFADevice",
        "iam:ListMFADevices",
        "iam:ResyncMFADevice"
      ],
      "Resource": ["arn:aws:iam::ACCOUNT-ID-WITHOUT-HYPHENS:user/${aws:username}"]
    },
    {
      "Sid": "AllowUsersToListVirtualMFADevices",
      "Effect": "Allow",
      "Action": ["iam:ListVirtualMFADevices"],
      "Resource": ["arn:aws:iam::ACCOUNT-ID-WITHOUT-HYPHENS:mfa/*"]
    },
    {
      "Sid": "AllowUsersToListUsersInConsole",
      "Effect": "Allow",
      "Action": ["iam:ListUsers"],
      "Resource": ["arn:aws:iam::ACCOUNT-ID-WITHOUT-HYPHENS:user/*"]
    }
  ]
}

Since this is in no way obvious, I will also note that the account ID is found on the "Security Credentials" page of the root AWS account.

(screenshot: aws account ids)

This appears to be sufficient to let users find themselves in the "Users" menu, click the "Manage MFA Device" button, and go through the rest of the process.

(screenshot: test user's mfa button)

Passwords etc

I also found it useful to give our users the ability to manage the rest of their own credentials. The relevant policy stanzas can be found here: http://docs.aws.amazon.com/IAM/latest/UserGuide/Credentials-Permissions-examples.html#creds-policies-credentials

Surprisingly, the default "Password Policy" on our AWS account was set to allow passwords as short as 6 characters with no additional requirements. Even with MFA enabled, you'll want to crank that up to something quite a bit more robust.

Keeping the robots at bay

One other important aspect of our setup is that only humanoid users are able to manage their own credentials. We have a number of automation-related "bot" accounts that have security policies tailored specifically to their purpose - the backup user only has access to a specific S3 bucket, the dnsupdater user only has access to a specific Route53 zone, etc. Even with this limited set of permissions, it's important to make it difficult for an attacker to gain control of these users. They do not have passwords, and they are never granted permissions to manage their own credentials. This is accomplished by attaching the policies described above to a humans group and only adding users with a verified heartbeat to that group.

Enforcing a Policy

We have a policy of not allowing access to any AWS resources without an MFA device enabled. However, a policy is only as good as its enforcement. I did a brief Google search and didn't find any automated tools to do the job, though I did not try very hard. I did find that the AWS CLI tool (http://aws.amazon.com/cli/) has an aws iam get-credential-report command, which returns a base64-encoded CSV file containing information about all the IAM users' credentials. One of the columns is mfa_active, so the data is all there to automatically enforce an MFA policy.

(NB: you have to run aws iam generate-credential-report beforehand. Full docs are here: http://docs.aws.amazon.com/IAM/latest/UserGuide/credential-reports.html)

For example, the following Python snippet (available as a gist at https://gist.github.com/mihasya/a1fd1c4bbef04495a12b) will parse the contents of the report and tell you who doesn't have MFA enabled. All you have to do is chmod +x the file to make it executable, then pipe the report into it like so: aws iam get-credential-report | ./scripts/parse_credential_report.py

#!/usr/bin/env python
from sys import stdin
import json
import base64

report = json.loads(stdin.read())
table = base64.b64decode(report["Content"]).splitlines()
head = table[0].split(",")
table = table[1:]

for row in iter(table):
    user = dict(zip(head, row.split(",")))
    # you now have a dictionary with keys like user, mfa_active,
    # and password_last_changed
    print "%s %s" % (user["user"], user["mfa_active"])

For our current team size, growth rate, and compliance needs, this is sufficient. I did come across an example of what a fully fleshed-out tool would look like in the excellent DevOps Weekly: The Guardian’s gu-who, for performing account audits on GitHub accounts.

Low-hassle HTTP metrics with Tigertonic and Go-metrics

07 February 2014 #

First things first: What the shit is tigertonic?

Tigertonic is a framework for making webservices in Go, written by Richard Crowley (I have contributed a bug fix or feature here and there). Its defining characteristic is that it allows you to translate functions that take and return specific Go types into http.Handler implementations that understand and return JSON payloads. Define your signature, pass it into the correct Tigertonic wrapper, and out comes a web service that takes in JSON, unmarshals it to the input type, passes it to your handler, then takes the return value from your handler and marshals it into JSON for the response.

It’s similar to JAX-RS/Jersey annotations, but with much less code, and with most of the ugly bits hidden from the framework’s user.

Check out the README for more info. Richard has also written and spoken about Tigertonic on various occasions. It’s all well worth reading.

Here’s an example of a very simple tigertonic service:

type Book struct {
        Author, Title string
}

// this takes a Book object and returns an empty body
func PutBook(u *url.URL, h http.Header, book *Book) (status int, responseHeaders http.Header, _ interface{}, err error){ ... } 
// this takes an empty body and returns a Book object
func GetBook(u *url.URL, h http.Header, _ interface{}) (status int, responseHeaders http.Header, book *Book, err error) { ... }

func main() {
        mux := tigertonic.NewTrieServeMux()
        mux.Handle("GET", "/books/{book_id}", tigertonic.Marshaled(GetBook))
        mux.Handle("PUT", "/books/{book_id}", tigertonic.Marshaled(PutBook))

        server := tigertonic.NewServer("localhost:34334", mux)
        log.Fatal(server.ListenAndServe())
}

(full code is here)

So You Want Some Metrics

At Opsmatic we strive to be a “learning organization” - we want to learn something from every release, every change, every customer interaction. An important component of that philosophy is an obsession with measuring things. Jim, our CEO, wants “If you can’t measure it, don’t ship it” written on his headstone when the time is right. No joke.

One of the things we wanted to measure was the number of requests served by our API. While we were at it, we thought we’d grab the timing data too for operational purposes.

go-metrics and Tigertonic

Richard is adamant about everything in Tigertonic reducing to an implementation of http.Handler, and with good reason: doing so enables the Handler that actually performs the business logic to be wrapped in any number of completely orthogonal Handlers that handle all sorts of other concerns - logging, CORS rules, authentication.. and metrics! (the README lists the available handlers.) The separation of concerns afforded by this approach is truly refreshing.

Go-metrics is a library, also maintained by Richard, that provides similar capabilities to Coda Hale’s great Java metrics library. It makes it very easy to time and count things, as well as to extract the data from the timers and counters.

Tigertonic comes with a few wrappers that hook up our Handlers directly to these metrics. We’re going to look at a couple in particular: Timed and CountedByStatusXX. The former is a very thin wrapper around the functionality of a go-metrics Timer - it just times the request and records the reading:

func (t *Timer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
        defer t.UpdateSince(time.Now())
        t.handler.ServeHTTP(w, r)
}

The latter is a bit more involved, but is also ultimately a thin wrapper around some go-metrics primitives; it counts the number of requests that result in a given class of response codes (2XX, 5XX, etc.). You can look at the code here.

Adding a counter is done by calling tigertonic.Counted(yourHandlerHere, ...). Since the return value is also an http.Handler, you can pass it to Tigertonic’s multiplexer or really anything that operates on http.Handler - including the stdlib HTTP server.

Putting it all together

The goal at the outset was to easily capture metrics on all our endpoints. How are we doing on that?

Quite well, it turns out. All we have to do to achieve the goals is some wrapping:

func wrapHandler(name string, h http.Handler) http.Handler {
        return tigertonic.CountedByStatusXX(
                tigertonic.Timed(
                        tigertonic.ApacheLogged(h),
                        name,
                        metrics.DefaultRegistry,
                ),
                name,
                metrics.DefaultRegistry,
        )
}

Then we invoke this wrapper before registering our handlers:

mux.Handle("GET", "/books/{book_id}", wrapHandler("get-book", tigertonic.Marshaled(GetBook)))
mux.Handle("PUT", "/books/{book_id}", wrapHandler("put-book", tigertonic.Marshaled(PutBook)))

ET VOILA. We need to give our handlers some names for the purposes of metrics collection, so we create a little wrapper function that takes that name and a Handler and wraps it in all the properly named metrics collectors. When we need to add more handlers, we wrap those too and the data shows up for free. In the instrumented version of the code you can see that I’ve also made a call to metrics.Log, which spawns a reporter goroutine off into the background, printing out the stats every 10 seconds. There are a number of more useful reporters available - for example, I’ve contributed a Librato reporter which posts the metrics to the Librato API.

Slightly More Advanced

The full Opsmatic version of the above code is included below for additional illustration. It is expanded to include the name of the service, some CORS defaults, and two versions of the wrap method - one that includes a call to tigertonic.Marshaled and one that does not; we need the latter to accommodate a couple of endpoints we have that do not return JSON.

type OpsmaticService struct {
        serviceName    string
        allowedOrigins []string
        allowedHeaders []string
}

func NewOpsmaticService(name string, origins []string, headers []string) *OpsmaticService {
        return &OpsmaticService{name, origins, headers}
}

func NewDefaultOpsmaticService(name string) *OpsmaticService {
        return NewOpsmaticService(name, []string{"[redacted]"}, []string{"Authorization"})
}

func (self *OpsmaticService) WrapHandler(name string, h http.Handler) http.Handler {
        cors := tigertonic.NewCORSBuilder().AddAllowedOrigins(self.allowedOrigins...).AddAllowedHeaders(self.allowedHeaders...)

        return cors.Build(
                tigertonic.CountedByStatusXX(
                        tigertonic.Timed(
                                tigertonic.ApacheLogged(h),
                                fmt.Sprintf("%s-%s", self.serviceName, name),
                                metrics.DefaultRegistry,
                        ),
                        fmt.Sprintf("%s-%s", self.serviceName, name),
                        metrics.DefaultRegistry,
                ),
        )
}

func (self *OpsmaticService) MarshalAndWrapHandler(name string, f interface{}) http.Handler {
        return self.WrapHandler(name, tigertonic.Marshaled(f))
}

Conclusion

Using this little bit of boilerplate code, we can readily instrument new endpoints as they come online without cluttering the code with counters and timers. Using the aforementioned Librato reporter, we get graphs for new endpoints that we deploy instantly and with zero additional wrangling. It’s quite a nice setup that required a fairly modest amount of code and requires very minimal marginal effort on new endpoints. We hope that you enjoy it as well.

The Myth of the Uninterrupted Programmer

17 November 2013 #

This post about office noise level and distractions came through my inbox, and a particular voice in the comments section caught my eye.

“Show me an office with caves and I’ll show you my resume”

Plenty of comments followed echoing this sentiment.

While I agree that stretches of concentration are important for figuring out a specific task, I think that this chorus is at the heart of a serious misunderstanding many engineers have about their value as members of an organization, and that it results in a tremendous amount of waste.

Sure, constant interruptions and context switches are exhausting and difficult. I’m not suggesting that we should spend all day turning from one conversation to another. It’s easy to overdo meetings and office shenanigans. However, a healthy amount of interaction and socialization has some very important benefits.

Interruptions cause you to retrace your steps - this is often good

There is a much less edifying real-life counterpoint to the widely romanticized deeply concentrated programmer. It’s that of a programmer spending 4 hours trying to track down a confusing, elusive bug, only to figure it all out 5 minutes after walking away from it. I’ve done it, I’ve seen it, and I continue doing it and seeing it.

There’s a very simple explanation for this phenomenon: in order to be able to reason about an algorithm, especially a complex one, we have to assume and take a whole load of things for granted. The stack, the configuration, the interfaces on top of which we’re working.

An incorrect assumption is a common source of confusion and infuriating debugging. If you’re lucky, the false assumption will be illuminated by a debugger or a log line. However, the longer you’ve been staring at the same problem, the more likely you are to miss something much simpler. That helper function you stubbed out earlier while testing something else? Yeah, that’s still there. You’ll feel real dumb when you remember.

Interruptions - planned or unplanned - cause you to “resurface” and to have to re-engage the problem almost from scratch. Part of that process is rebuilding that chain of assumptions. Stepping back from a problem and seeing the bigger picture is often much more productive than spinning down in the bowels of your code.

(Here’s a great talk by Joe Damato with a pretty good discussion of discovering violations of your basic assumptions)

Re-reading your own code is the best way to write readable code

If you’re writing a bunch of code in a hurry, and especially if you’re doing so while fighting through bugs, you’re likely leaving a disaster zone in your wake. Even if you think you’re writing “clean code” and writing tests to go along with it, there are probably sections in your code that barely make any sense by the time you’ve gotten them to do what you want.

Pair programming is one way of solving this - your passenger will point at the screen and call you out for getting too fancy or too casual with your single-letter variables. I’m still torn on pair programming, but I do think it’s a great idea to re-read your own code regularly, for reasons related to the first section.

While an interruption causing you to lose context can be annoying, the forced re-construction of context can point out flaws in your reasoning and force you to recognize sections of code that are hard to read - because you’ll have trouble reading them too.

Your peer has likely seen the same problem before

We spend a lot of time talking about sharing code and know-how in the OSS community. We’ve also been putting lots of emphasis on DRY - “Don’t Repeat Yourself.” Well, it’s more like DRO - “Don’t repeat others.” This broader message applies to your peers as well. When you’re dealing with OSS code and you find a bug you can’t sort out, you ask the internet and see if anyone else has had the same problem. For whatever reason, we find this easy, but we find turning to our neighbor and asking the same thing difficult - PROBABLY because we’re afraid of the stigma of interrupting them. So we spin our wheels. Awesome.

Don’t forget that someone in the room is very likely to have used the same software and tools you’re using, seen similar problems in the same or similar systems, or, if you’re really lucky, wrote the damn thing in the first place.

Interruptions often come with an opportunity to ask your colleagues - they may well be interrupted too.

Are you even solving the correct problem?

Many conversations between engineers about productivity make it sound like the goal of programming is to write as many lines of code as possible. This has been reinforced by stories of companies like Google which were “run by the engineers.” I believe this has caused people to imagine the original Google employees all furiously writing code for 16 hours a day without uttering a word to each other or anyone else, inevitably producing the world’s best search engine.


(photo by Paul Simpson)

This is pure professional hubris. Hubris is all I hear when engineers bitch about product and project managers interrupting them with all their “process.” Sure, it’s easy to overdo, but it brings us back to that whole “know your business” thing.

Sure, if you sit in your little cave for 16 hours, you’re going to write a whole bunch of code. But… what did you just produce? Sure, it’s “correct” in the strict engineering sense of the word - the right inputs produce the right outputs, etc. But is it correct in the context of a product? Did you actually build something people will want? Does it work, as in, does it behave the way a customer would expect? Chances are it does not, because it’s hard to build things for humans without talking to them.

The reality of the matter is that Google’s early engineers were successful because they were good at all those other things as well, not because they ignored everything around them and ground code.

How hard are you concentrating, anyway?

You can tell engineers don’t REALLY mind being interrupted by just looking at the constant shitpile of activity on HackerNews, Twitter, Google Plus, IRC, etc. It’s not about interruptions. It’s just flat out whining. We don’t like getting out of our comfort zone and thinking about things about which we’re not that good at thinking. Stop coming up with excuses and get better at it.

Interruptions force you to ship.

There’s no disputing that interruptions and context switches are painful and difficult, but knowing that they’re coming can have a positive impact - if you anticipate only having a couple of hours before you’re interrupted, you will work in more incremental chunks, which lend themselves better to testing, documentation, abstraction, etc. These are all good things.

For example - there are guests coming over for dinner shortly, so I’m just going to wrap this up and post it. It’s too long as is.

tl;dr

Sitting in a dark basement in silence is great for leveling up your World of Warcraft character. It’s no way to build good, usable software. There’s no substitute for good communication.

A Reliable, Simple Way to Get a PDF Out of Showoff

17 November 2013 #

Perpetually agonized by actually using Keynote or Powerpoint to make slides, I continue to use Showoff to make my slide decks. Unfortunately, the codebase appears a bit neglected, and certain features have stopped working very well over the course of re-installs. I have neither the Ruby-fu nor the time nor the patience to figure out why PDF generation has stopped working (I actually don’t think that particular feature ever worked for me at all), so I’ve had to resort to trickery.

I am posting this here because I keep forgetting how to do this and having to blindly figure it out each time. Hopefully my own blog will be an obvious enough place to look. This has only been tested on a Mac using Chrome, but it looks like Safari will work too with a bit of tweaking.

  1. Add the following to a CSS file that is included in your preso:
    #preso {
        width: 11in;
        height: 8in;  /* this may need to be lowered slightly for Safari */
    }
    .slide {
        width: 11in;
        height: 8in;  /* this may need to be lowered slightly for Safari */
    }
  2. Run showoff serve from your repo
  3. Go to http://localhost:9090/singlepage (obviously the port may vary if you used -p)
  4. Use your browser’s Print function to generate a PDF

DONE. Happy PDFin.

print dialog

How Do I DevOps?

11 June 2013 #

There is lots of talk about what DevOps is and means, even a Wikipedia page, to which I may soon give some much needed love. However, a friend recently asked if I knew anyone worth hiring for a “devops” role, and I found myself asking clarifying questions about the sort of person he had in mind. Seemed worth writing down.

The friend was looking for engineers. So what does it mean for an engineer to be devops-y?

TL;DR

  1. Understand the Whole Company as a System
  2. Respect Other Functions Within The Organization Profoundly
  3. Have a Strong Sense of Personal Accountability

Build your software like you give a shit about the people whose jobs and lives are affected by it.

1. Understand the Whole Company as a System

bottles!!
Photo by verifex

Your company has inputs (money, labor, etc) and outputs (product, money, etc). I’ve grown to loathe the phrase “above my pay grade” because it tends to betray a complete lack of interest in the big picture. Hanging around my new colleague Jim, aka Mr Manager, I’ve recently started to identify things as “tactical” vs “strategic.” Strategic is the big picture - where is the company going; what are the company’s goals; what will make or break our success. Tactical is the everyday - what features are left on the current project and which one should I work on next; how much time should I spend on this bug, what with the massive deadline looming; hell, should I even be looking at bugs?

If you don’t have a good grip on how you and your project fit into the bigger picture of the company, you are always tactical. Tactical can quickly become boring, repetitive, and un-rewarding. It’s also a nice way to never grow as an individual. In the DevOps picture, it means you probably don’t make judgment calls well with regards to what is and isn’t important, distributing your time poorly. Your colleagues probably notice; they probably don’t like it.

This is a great segue to:

2. Respect Other Functions Within The Organization Profoundly

For our immediate purposes, we can focus on just the ops team, but it applies well beyond. Understanding and respecting the priorities and needs of non-technical teams and taking them seriously helps greatly reduce the number of surprises on both sides. Also, if you’re really living number 1 above, you probably won’t be surprised that your goals are very closely related.

But back to your relationship with the ops team (or, if you’re living in devops dreamland, your colleagues, since you’re all part of the combined devops utopia, right?). What makes them tick? What wakes or keeps them up at night? What makes their job harder? Easier? I like to make it personal: how have I made their lives better or worse?

Let’s look inwards for a moment: what if someone is asking these questions about me? Well, I’m a software engineer. I grind code for a living. I get some requirements (new product spec, a bug, something I think up in my free time and don’t tell anyone about, etc), figure out how to meet those requirements, write some code, and push it to production.

What are the things that make me happy while performing these functions? Well, there’s a whole bunch of them, but they can all be summed up very easily: lack of friction. A relatively low number of things I have to do beyond my core activities in order to get to the end; a limited number of context switches. A clean, consistent, reproducible dev environment. A responsive, intelligible build system. A mostly-automated way of moving my code through various environments.

What has ops done for me?

Well, shit, I’m actually mad spoiled. Flickr was a PHP site with a well-oiled deploy machine that we’ve all heard about - since you didn’t need to restart anything to get your code out (an under-appreciated side effect of the way PHP is traditionally served), we’d literally just push a button and the new code got rsynced to the boxes while also keeping a nice, visible record of the what, when, and why (a form of this is now available to the masses as Etsy’s Deployinator). SimpleGeo and Urban Airship use(d) Puppet and Chef respectively to great success, and there was an ever-improving set of tools available to make it easier to start working on a project and to test it as I went along. When I was done, it got reviewed, merged, built and sent off to a package repo, then deployed to production using automation. I spent most of my time actually debugging or writing code, not shepherding it around environments or struggling to get it to run in the first place. It’s also easy to forget the little things that helped keep computers out of my way - federated logins, etc.
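To make that push-button deploy a little more concrete, here’s a minimal sketch of the rsync-and-log idea in Python. The host names, paths, and log location are all made up for illustration - this is nowhere near what Flickr or Deployinator actually ran, just the shape of the thing.

    #!/usr/bin/env python
    """Minimal sketch of a push-button deploy: rsync the build to each web
    host and append a record of the what, when, and why. Every name here
    is hypothetical."""
    import subprocess
    from datetime import datetime, timezone
    from getpass import getuser

    WEB_HOSTS = ["www1.example.com", "www2.example.com"]  # made-up hosts
    SRC = "/srv/build/current/"   # local build to push
    DEST = "/var/www/site/"       # docroot on each web host
    DEPLOY_LOG = "deploys.log"    # the visible record of what/when/why

    def deploy(reason):
        for host in WEB_HOSTS:
            # --delete keeps the remote docroot an exact mirror of the build
            subprocess.check_call(
                ["rsync", "-az", "--delete", SRC, f"{host}:{DEST}"])
        with open(DEPLOY_LOG, "a") as log:
            stamp = datetime.now(timezone.utc).isoformat()
            log.write(f"{stamp} {getuser()} deployed: {reason}\n")

    if __name__ == "__main__":
        deploy("ship the new photo page")

The point isn’t the tooling; it’s that shipping was one button plus a record anyone could read afterwards.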

These are just the more salient examples - specific things ops has done to make my life easier; it is by no means an exhaustive list of what I see as the core strength of my prior ops teams.

What have I done for ops?

an opsian, elbow deep in 'it'
Photo by Business Insider

Let’s look at what my teams at each of these orgs did that I think was helpful to and appreciated by the ops teams. This is in no particular order, and I’m going to forego the names of the organizations because there’s a ton of overlap.

  • Painstakingly instrumented our services so that their state could be more easily examined in the wild
  • Pumped as much data as we could into the monitoring tools kindly provided us
  • Thoughtfully considered what metrics and properties were helpful in determining the health of each particular system being worked on. Business people might call this a KPI; Mathias Meyer called it a “Soul Metric” in his Monitorama talk.
  • Carefully set up alerts that interpreted the above to try to minimize noise and non-actionable alerts.
  • Learned at least enough about the configuration management tools to be able to submit pull requests for desired changes in production without personal involvement and hand holding from someone on the ops team.
  • Considered and tested how the software being written behaved itself before an emergency - how is failover handled? how are configuration changes handled?
  • Automated or helped automate parts of the process that were difficult to remember or tedious.
  • Worked on tools in our spare time that made any of the above easier.

Broadly, we tried to be sensitive to how the operators interacted with the thing in production and how reasonable the experience was - during changes, during outages and failures, etc. We focused on operability.
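As a tiny, concrete example of the instrumentation bullet above: emitting counters and timers over the StatsD line protocol is only a few lines. This is a hedged sketch - the metric names and the statsd host are invented, and in a real service you’d reach for an existing client library rather than hand-rolling UDP.

    import socket
    import time

    STATSD_ADDR = ("statsd.example.com", 8125)  # hypothetical host and port
    _sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(name, value=1):
        # StatsD counter: "name:value|c"
        _sock.sendto(f"{name}:{value}|c".encode(), STATSD_ADDR)

    def timing(name, ms):
        # StatsD timer: "name:value|ms"
        _sock.sendto(f"{name}:{int(ms)}|ms".encode(), STATSD_ADDR)

    def process(photo):
        """Stand-in for the real work; replace with actual application code."""
        time.sleep(0.01)

    def handle_upload(photo):
        start = time.time()
        incr("photos.upload.attempted")
        try:
            process(photo)
            incr("photos.upload.succeeded")
        except Exception:
            incr("photos.upload.failed")  # the kind of metric alerts key off of
            raise
        finally:
            timing("photos.upload.ms", (time.time() - start) * 1000)

The protocol doesn’t matter; what matters is that every interesting code path emits something an operator can graph and alert on.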

Why did we do all this?

3. Have a Strong Sense of Personal Accountability

Because it felt like the right thing to do. When people got woken up at three in the morning because something I had deployed broke in a confusing, difficult-to-debug way, it felt bad. I wanted it to be less confusing the next time. If we’re being honest with ourselves, it probably helped with the motivation that I woke up too and was just as frustrated and annoyed.

Go back to #2 and think, “Do people in the other organizations have the right tools to perform their jobs?” The better the tools, the less friction there is, the more quickly people can perform their reactive tasks (ops responding to pages; marketing compiling a traffic report that the CEO suddenly needs for a board meeting; support dealing with a massive DDoS or spam influx). The less time people spend reacting, the better - reacting is by definition tactical, and spending all your time in tactical mode, as we’ve covered, is not great. The list in section 2 was focused on ops, but a lot of the same stuff, especially the tools bit, applies to other teams as well.

It’ll never be perfect, but often the smallest change makes the biggest difference. Re-arranging a dashboard ever-so-slightly could be the difference between someone getting RSI while trying to track down spammers until late at night and them going home in time for dinner. A good DevOps engineer in my mind is one that feels personally responsible and accountable for the parts of his or her job that have an effect on colleagues’ happiness and success. Remember, everyone likes going home for dinner.

Conclusion

Coming back to what this all means for a software engineer: it’s all about the big picture. In an organization whose primary output is software, everybody depends on how well that software is equipped to help them succeed in their particular job. Understanding your effect on these needs and striving to meet them - that’s what DevOps means to me.

Further Reading

  • DevOps: These Soft Parts A post by John Allspaw about the soft skills involved in making DevOps-style cooperation work
  • Developing Operability (slides) A talk by Richard Crowley with specific advice for smoothing the journey of code to production for both devs and operators; more on the meaning of “DevOps” (warning: a wall of text)
  • DevOps - The Title Match A post by John Vincent on a common misconception about the organizational meaning of DevOps

It's a train.. no, it's a computer.. can't it be both??

02 May 2013 #

I am delighted to let you spread the word about an amazing innovation from Lian Li, the acclaimed maker of computer cases. They have thrown caution to the wind and finally introduced the thing we’ve all been waiting for - the Choo Choo Train Computer Case.

COMPUTER TRAIN!

Yes. Yes. Let that sink in. It’s a computer case shaped like an old steam-powered locomotive. It has a 300 watt power supply in the front section, and the cart can fit a Mini ITX motherboard, a slim optical drive, and a single internal hard drive. One might point out that these are somewhat weak specs as far as cases go, but hey, IT’S A FUCKING TRAIN.

But wait. There’s more. No, seriously, there’s more.

I saw that the case had 5-star reviews, so I clicked to see what proud owners had to say about it.

This SKU, which ends in an S, does NOT move compared to the more expensive SKU that ends in an L. It has no motor, it’s just a case that looks like a train. The more expensive model [] actually has a motor and a transmission, and comes with extra rails, so it will roll back and forth when the computer is turned on.

Yup. Lian Li’s product page for this puppy is epic. Not only is there a more expensive version that moves (and comes with “Rail x6” instead of “Rail x1”), there’s a limited edition one that has an atomizer. That’s right. It makes steam!

power train

Basically, I’m spent just thinking about this. The amount of space accommodated by the case isn’t ideal for the plans I have for an HTPC (I was on Newegg for a reason, after all), and I definitely couldn’t handle “Rail x6” and a computer case scooting back and forth, but y’all know what to get me for my birthday now. I’ll make it work.

Unfortunately, it’s out of stock on Newegg and I can’t find it for sale anywhere, so I fear that the opportunity may have passed. Who knows when Lian Li will elect to share their genius with us again? I’ll probably end up having to purchase one of these guys on eBay for thousands of dollars as a collector’s item years from now.

Some Love For Ishmael

29 April 2013 #

Back in the days of fire fighting and database optimizing at Flickr, when I could debate the merits of different MVCC options comfortably, I built a little tool called Ishmael to help us make sense of mk-query-digest data more easily (apparently, the project has been moved to the “Percona Toolkit” and renamed pt-query-digest). Tim Denike made some improvements during his remaining time at Flickr after I had left, and then Asher Feldman took the project with him to The Wikimedia Foundation. Eventually, he sent in a large enough pull request that I simply did not have the capacity to test it - I, after all, have not used MySQL in anger in ages. So I did the natural thing and made Asher a collaborator on the repo.

This past week, during a moment of vanity, I noticed that there were quite a few more stars on the repo than there had been. I wondered what might have caused it, and shrugged. Then on Sunday the DevOps Weekly email provided the answer: Asher had written a post about MariaDB on Wikimedia’s blog, in which he mentions their use of Ishmael in comparing performance between old and new database versions. It is a good read for anyone interested in database migrations and upgrades, especially “doing it live!”

Everyone, look, this is my “proud open source moment” face.

Readmeme - a README generator for those in doubt

23 March 2013 #

Programmers have been lamenting each others’ inability to properly document software for many decades. Some recent examples include:

  • Tom Preston-Werner’s call for README-driven development

    a beautifully crafted library with no documentation is also damn near worthless.

  • Ted Nyman’s recent Basho Chats diatribe about software expressing itself poorly (video not yet uploaded)
  • Peter Morgan’s eloquent Go issue report

    Is that all we want to do is explain bits of code, and its runningz what it does.. Hopefully there is a lot of coders looking at it cos we can auto gen documentation..

Powerful stuff. There’s been a corresponding surge in tools to assist in creating documentation. Some examples:

Still I see repositories with READMEs like “Gathers and relays system metrics” (not to pick on anyone in particular, just an example) and ones that skip straight to installation instructions and licensing info. That’s bad.

The README is a project’s face to the world. It should tell the audience what the project does and what motivated its creation at a minimum. Not to belabor the already-well-made point too much further, I’ve created a tool for myself and others to help formulate a basic README when inspiration betrays us.

It’s called readmeme - and yet again, Richard Crowley helped me name it. It is apt. I hope it starts a meme of informative READMEs.

For more info on the project, check its README (get it?)

The 3 Little Mistakes

03 December 2012 #

There are many mistakes people make when programming for the web. Here are three that I’ve seen everywhere I’ve worked. I think they deserve extra attention because they are relatively easy to avoid, but very difficult to recover from.

Encodings

Even if you only ever have US-based users, enough folks have accents in their names and all sorts of other reasons to introduce non-ASCII characters into your systems. Enforce UTF-8 at all levels, and especially at storage time. Fixing an encoding issue is difficult, and usually involves re-writing all the data. It is unpleasant, no matter the underlying datastore.
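As a hedged sketch of what “enforce UTF-8 at all levels” can look like at the application boundary (the database side - connection and column charsets, utf8mb4 in MySQL, and so on - still has to match):

    def to_utf8_text(raw):
        """Accept bytes or str from the outside world and return text that is
        guaranteed to round-trip as UTF-8, raising instead of silently mangling."""
        if isinstance(raw, bytes):
            # errors='strict' is the default: bad byte sequences blow up here,
            # at the edge, instead of corrupting stored data
            return raw.decode("utf-8")
        # Round-tripping a str catches oddities like lone surrogates
        return raw.encode("utf-8").decode("utf-8")

    # Only ever hand the datastore validated text:
    display_name = to_utf8_text(b"Jos\xc3\xa9")  # -> "José"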

Limits and Pagination on API Requests

This is pretty straightforward, but I’ve seen it neglected in practice quite a few times. Whether it’s a POST where the batch size isn’t bounded or a GET that returns “all the _____,” it inevitably becomes a nightmare that is difficult to fix due to clients not expecting to have to paginate. This is particularly exacerbated in APIs used by mobile apps - a total fix requires getting all the client apps to upgrade to a new version of the library AND for all their users to upgrade.
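For the GET case, here’s a rough sketch of the kind of clamped, cursor-style pagination that saves you from “return all the things” later. The parameter names, the cap, and the list standing in for a datastore are all arbitrary:

    MAX_LIMIT = 100  # hard server-side cap, no matter what clients ask for

    def list_widgets(storage, limit=25, cursor=0):
        """Return at most `limit` items starting at `cursor`, plus the cursor
        for the next page (None when there is nothing left)."""
        limit = max(1, min(int(limit), MAX_LIMIT))
        page = storage[cursor:cursor + limit]
        next_cursor = cursor + limit if cursor + limit < len(storage) else None
        return {"items": page, "next_cursor": next_cursor}

    widgets = [f"widget-{i}" for i in range(1000)]
    first = list_widgets(widgets, limit=10_000)   # greedy client, clamped to 100
    second = list_widgets(widgets, cursor=first["next_cursor"])

Baking the limit and the next-page cursor into the response from day one means clients never get to assume the full set fits in one call.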

Timezones

This issue is similar to the encoding one - a matter of consistency. Storing everything in UTC forces consistency on the data. Not doing so leaves an opening for rows in the same table to be written with different time zones, making querying and comparisons more complicated and possibly expensive. Almost without exception, you only want to think about timezones at display or query time; it’s much easier to deal with DST when your data is normalized to the same, consistent, season-immune representation.
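In Python terms, the “store UTC, convert at the edges” rule looks roughly like this (zoneinfo has been in the standard library since 3.9):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    # Storage time: always timezone-aware, always UTC
    created_at = datetime.now(timezone.utc)

    # Display time: convert for whoever happens to be looking
    local = created_at.astimezone(ZoneInfo("America/Los_Angeles"))

    # Comparisons stay trivial because every stored value shares a zone
    assert created_at == local  # same instant, different representation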

Cal’s book has a lot of really good info on other details that merit attention when working on the Internets.

Just Use Your Words

19 September 2012 #

It is becoming increasingly plausible that I have a partially torn meniscus in my left knee. The unusual, difficult-to-place pain, the occasional painful popping and “catching” in the joint, the frequent unprompted discomfort.. I run into people on a regular basis playing soccer - any number of recent collisions could damage a ligament.

But the ligament is not the issue.

The issue is the reluctance to call it what it is - the orthopedic surgeon dictated “possible meniscus irregularity” into his recorder. The physical therapist opted for the cautious “possibility of some meniscus involvement.” Having already been primed, I pressed a bit: “A torn meniscus, you mean?” “Yes, a partial tear.”

Come on, guys, I’m not going to die of a torn meniscus. The upper bound on recovery from a meniscus repair is 4 months, and I’m 26 - it would likely take less than that. Surely they don’t expect me to break down crying right there in the office when confronted with the news..

Why am I writing about this harmless, albeit frustrating, experience? I do have a reason: it reminded me of a great piece The Economist ran on the subject of “euphemisms” last December. It is particularly relevant in the current political climate, especially the very last section titled “Little white lies.”

The piece ends with a call to abandon euphemisms for a day. I highly recommend this exercise, having lived it long before this article was published. You’ll piss some people off, but at least they (and you) will know where you stand.

The Lonely Wait

07 September 2012 #

As I exited the Civic Center metro stop on this beautiful sunny morning, I glanced up Hyde street. There was no 19 bus to be seen up the street, which meant it was at least 5 minutes away. At best, it would make it to my destination at 8th and Bryant around the same time as I would on foot.

Rounding the corner onto 8th, just past the 19 stop, I came upon 20 to 30 people, mostly men, mostly in their mid 20s, lined up against a fence. Nobody was talking, everybody had headphones on, some were smoking. They were waiting for the Xynga shuttle to take them to their office 6 blocks away, about 2/3 of a mile down 8th street. 15 minutes by foot, 5 by bike.

At their feet started a freshly painted bike lane, possibly the widest in the city.

Somewhere far away, in an empty bar completely unaware of the human tragedy unfolding at 8th and Market, the jukebox played “Don’t Know What You Got Till It’s Gone.”

Windshield: Display Incoming Geo Data Using PolyMaps

05 September 2012 #

While discussing what to do for a Free Friday project, Neil and I decided we wanted to build some sort of visualization related to the location data UA had been processing. I quickly thought of the dashboard that Jon Rohan wrote for SimpleGeo, which would plot the coordinates that API requests were targeted at as those requests came in. After finding Jon’s code, I realized that the front end portion was going to be very easy to adapt, as well as make generic. Having obtained Jon’s blessing, I tidied the code up a bit and open-sourced it. Of course, the backend code that supplies our geo data will remain closed source and proprietary, but there is an example data source included to help others get started.

I called the project Windshield because the points look like bugs that show up on the glass over the course of a long drive. The source is here. I have an example up here.

I could make this a wordy post about programming practices, javascript, and technology, but this was a really simple project. Besides, other people did most of the hard work:

  • PolyMaps did all the map parts of it.
  • CloudMade made the gorgeous tiles.
  • Jon Rohan wrote the code I aped heavy-handedly to get a grip on the thing.
  • I made the function that supplies the points pluggable so that it was easy to test and extend (roughly the idea sketched just after this list), so HURRAY for higher order functions.
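The actual project is JavaScript on top of PolyMaps, but the pluggable-data-source idea is easy to sketch in Python: the display loop takes whatever point-supplying function it’s handed, so tests and demos can pass canned data while production passes the real stream. The names here are illustrative, not Windshield’s API.

    import random

    def windshield(get_points, plot):
        """Drive the display from whatever point source it is handed.
        `get_points` yields (lat, lon) pairs; `plot` draws a single point."""
        for lat, lon in get_points():
            plot(lat, lon)

    def fake_points(n=5):
        # Canned data source - handy for tests and demos
        for _ in range(n):
            yield (random.uniform(-90, 90), random.uniform(-180, 180))

    windshield(fake_points, plot=lambda lat, lon: print(f"plot {lat:.2f},{lon:.2f}"))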

Probably the most surprising thing was Aaron submitting a pull request, hours after I had open-sourced the damn thing, to make the demo work correctly in Firefox. Thanks, Aaron!

Things I know I still need to do:

  • Make it easier to manipulate the map once it’s created (perhaps return something from the windshield function)
  • Explore the concept of a PolyMaps “layer”
    • Can I create a custom layer for my points and use the DOM element for that layer to more efficiently prune points over time? The present implementation of selecting all circle tags, then removing the parent of their parent of their parent, brings the browser to its knees.
  • Remove points over time in a FIFO arrangement? Would require quite a bit more javascript than I presently have appetite for, but who knows…

Highly relevant:

On "Infrastructure for Startups"

01 September 2012 #

One of Paul's Slides

Conference talks on the subject of infrastructure are often lacking in actionable advice - especially for fledgling startups. I am shamefully guilty of this myself.

A notable exception is a recent talk by Paul Hammond, my old manager and good friend. His Velocity 2012 talk titled “Infrastructure for Startups” was a refreshing dose of pragmatism, drawn from the experience of building and growing TypeKit. Paul ran Flickr Engineering before that - he has street cred for weeks. Unfortunately, video of the talk does not appear to be available, though that may change soon according to Paul.

Though I am prone to lengthy rants about building things the “right” way and am often heard advocating more rigorous planning at the start, I can’t agree more with most of what Paul says - a 2-3 person startup just doesn’t have the time to be mucking around with anything but the product they’re building. This being 2012, there is an army of service providers ready to share the burden - for much less than the opportunity cost of building everything yourself.

Don’t forget to measure

I’d add one more thing to Paul’s lists (here and here) - you need good graphs right away. I’m surprised Paul didn’t mention this after “all performance problems have been on things we don’t yet measure.” Good metrics collection and display are critical to both business success and technical efficiency. The easier it is to put together dashboards that zero in on meaningful metrics and correlations, the more you’ll do it, and the more quickly you’ll identify inefficiencies and opportunities.

I’ve yet to hear a favorable review of the baked-in EC2 monitoring tools (CloudWatch), so I, as usual, recommend the slick, easy to use, gorgeous Librato Metrics for these purposes. As a bonus, the product comes with some basic alerting features (haven’t tried them yet, in the interest of honesty), so it may help stall or obviate the need to set up Nagios or one of the related monsters. All the tools for getting data in have already been written.

Speaking of alerts, PagerDuty is another no-brainer for small teams starting to set up more fine-grained monitoring. Big surprise: Librato has PagerDuty integration.

I had some experience with the competing Cloudkick product and sadly don’t have many kind words, although much has probably changed since our last interaction.

Start, New Game

26 August 2012 #

I’m tired of WordPress and the associated bullshit (upgrades, vulnerabilities, etc), so I’m finally ripping the band-aid off and moving my blog to a Jekyll-generated static site hosted on S3. I will not be migrating the old content, as that would be an epic pain in the ass, the benefits of which do not seem substantial to me. The old site will live on, its database and directories set to read-only mode, to serve what little Google traffic still washes up there.

ONWARDS!