Chasing the unicorn - zero-downtime Rails deploys

Background

At Fosubo we use Amazon Opsworks to host our Rails servers. Our configuration consists of two c3.large servers running 16 Rails workers each, both connected to a shared load balancer.

Code merged into master is automatically deployed using the default Opsworks deploy command.

This worked fine in the early days when we had a low volume of requests. Since then, however, our traffic has increased dramatically and we have started to see exceptions on deploys: while a deploy is running, some users see 500 pages. When you deploy multiple times a day, this adds up to quite a few unhappy users.

Our goal: To eliminate exceptions on deploys and be able to deploy multiple times a day with no downtime.

After digging into this, it turns out that zero-downtime deploys with Rails are quite difficult and there are three major issues:

  1. Multiple Rails migration processes executing simultaneously are subject to a race condition.
  2. Rails’ prepared statement cache is stale during the window between the schema change and the restart of the Rails servers, which results in SQL exceptions.
  3. Code that references removed columns will also fail during this window.

Here’s how I addressed the first concern via pull request to Rails core.

Part 1 - Fixing concurrent migrations in Rails

The default Opsworks deploy pushes the new code to every instance and runs the migrations on all of them at the same time.

It turns out that running migrations on all deploy targets simultaneously is actually quite dangerous.

Let’s simulate what happens when we run a simple migration (in this case adding an index) in multiple simultaneous processes:

class FirstTestDummy < ActiveRecord::Migration
  # Add a hash index to a large table (500,000 rows)
  def up
    add_index :review_requests, :url_code, using: :hash
  end
end

Launching three migration processes in parallel:

for run in {1..3}
do
  bundle exec rake db:migrate &
done

One migration succeeds, but the other two fail:

PG::UniqueViolation: ERROR:  duplicate key value violates unique constraint "pg_class_relname_nsp_index"
DETAIL:  Key (relname, relnamespace)=(hash_index_review_requests_on_url_code, 5605052) already exists.

Result: disaster

After reproducing this I dug into the Rails source and was surprised to find that Rails migrations were basically running with scissors and performed absolutely no locking whatsoever. That is to say, if you run them in parallel they will execute concurrently and tread on each other’s toes with undefined and potentially dangerous results.

Now, this bug may not be apparent with quick migrations because it’s probable that one machine will be slightly faster and finish before the others even start. However, if you have a migration that takes a long time to complete (or if you get unlucky) you may hit a race condition as multiple instances simultaneously try to make the same changes.
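The check-then-run window is easy to see in plain Ruby. A minimal simulation, with no database involved: a Set stands in for the schema_migrations table, and a short sleep stands in for a slow migration:

```ruby
require "set"

ran_versions = Set.new  # stands in for the schema_migrations table
runs = 0
counter_lock = Mutex.new  # protects only the counter, not the check-then-run

workers = 2.times.map do
  Thread.new do
    # Both workers check before either has recorded the version...
    unless ran_versions.include?("20160101000000")
      sleep 0.05  # ...because the "migration" takes time to run
      ran_versions << "20160101000000"
      counter_lock.synchronize { runs += 1 }
    end
  end
end
workers.each(&:join)

puts runs  # 2 - the migration ran twice
```

Without a lock around the whole check-and-run sequence, both workers see "not yet run" and both execute the migration.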

For the above case the result is some failed deploys - annoying but not catastrophic. However, let’s look at another example:

class SecondTestDummy < ActiveRecord::Migration
  # John has too much money in his account because of the bug we recently fixed, make sure to reset it.
  def up
    john = User.find(1)
    john.money -= 100
    john.save!
  end
end

The expectation for migrations is that each one runs exactly once. However, if we repeated our ‘deploy’ with the migration above, John’s balance could be reduced by 200 or 300, rather than the intended 100.
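A plain-Ruby sketch of the double run, with a local variable standing in for John’s balance:

```ruby
# The migration body is not idempotent: every run deducts another 100.
balance = 500

fix_johns_balance = -> { balance -= 100 }  # the migration's up method

fix_johns_balance.call  # the first deploy target runs the migration
fix_johns_balance.call  # a racing second target runs it again

puts balance  # 300 - John lost 200, not the intended 100
```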

Result: catastrophe

As developers we would like to have confidence that a migration will run exactly once, regardless of the deployment method.

Thankfully I managed to get this problem fixed in upstream Rails - I filed an issue on the Rails project page about it, and submitted a patch that uses advisory locking to enforce that only one migration can run at a time.

Kudos to sgriff, tenderlove, and the other Rails maintainers for reviewing my code and helping to get this PR merged in so quickly. A two-day turnaround is pretty fast for a project of this size.

Now if you try to run multiple migrations in parallel, only one will acquire the lock and run the migration - the others will take no action and instead fail with a ConcurrentMigrationError, which can be rescued to allow a clean exit.
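To illustrate the first-one-wins pattern the patch uses: Rails takes a database-level advisory lock, but the same behaviour can be sketched with a file lock via flock so it runs anywhere. The error class below is a local stand-in, not ActiveRecord’s actual ConcurrentMigrationError:

```ruby
require "tmpdir"

# Stand-in for ActiveRecord's ConcurrentMigrationError.
class ConcurrentMigrationError < StandardError; end

def acquire_migration_lock(file)
  # LOCK_NB makes flock fail immediately instead of waiting in line.
  unless file.flock(File::LOCK_EX | File::LOCK_NB)
    raise ConcurrentMigrationError, "another migration is running"
  end
end

lock_path = File.join(Dir.tmpdir, "migration.lock")
first  = File.open(lock_path, File::RDWR | File::CREAT)
second = File.open(lock_path, File::RDWR | File::CREAT)

acquire_migration_lock(first)      # the first process wins and migrates
begin
  acquire_migration_lock(second)   # the second cannot get the lock
rescue ConcurrentMigrationError
  puts "migration already running, exiting cleanly"
end
```

The losing process takes no action and exits cleanly, which is exactly the behaviour you want from a fleet of deploy targets racing to migrate.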

This patch won’t be in production until at least the next Rails release, so for the time being I wrote a gem called opsworks_interactor that works around the issue by using rolling deploys.
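A rolling deploy takes each instance out of the load balancer, deploys to it, and reattaches it before moving to the next, so at least one instance is always serving traffic. A hypothetical outline of that loop - the method names are invented for illustration and are not the opsworks_interactor API:

```ruby
# Hypothetical rolling-deploy loop. detach/deploy/attach stand in for
# the real load-balancer and deploy API calls.
Instance = Struct.new(:name, :log) do
  def detach; log << "#{name}: detached from load balancer"; end
  def deploy; log << "#{name}: deployed and migrated"; end
  def attach; log << "#{name}: reattached, serving traffic"; end
end

def rolling_deploy(instances)
  instances.each do |instance|
    instance.detach  # drain traffic away from this instance
    instance.deploy  # safe to migrate and restart workers now
    instance.attach  # only move on once it is back in rotation
  end
end

log = []
rolling_deploy([Instance.new("web1", log), Instance.new("web2", log)])
puts log
```

Because only one instance deploys at a time, this also sidesteps the concurrent-migration race until the upstream fix ships.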

Result: disaster averted