Asynchronicity in Elixir - Best effort vs. Guaranteed execution
We often run into a scenario in web applications where we need to do some work at “some point soon”, but we don’t want to make the user wait right now.
As a simple example, let's say a user completes a website signup and we want to send them a welcome email. We don't want to make the user wait for this, or have their request fail if the email fails to send for whatever reason.
Luckily, this is Erlang! Processes are super cheap, why don’t we use a Task?
You can run a fire-and-forget task like this:

```elixir
def signup(user) do
  # Task.start/3 takes the argument list as its third parameter
  Task.start(EmailService, :send_welcome_email, [user])
  :ok
end
```
Problem solved, right?
Well… not quite.
Erlang was originally developed to run on machines that had long uptimes.
You’d have a telephony box in the middle of the woods somewhere with two nodes in it.
Every two years you'd send an engineer out to do an upgrade. That upgrade would have been thoroughly tested on identical hardware beforehand, and would be applied via OTP's hot code reloading, upgrading all your GenServer code and performing any necessary state transformations without stopping a single process.
This code might run for years without being rebooted.
In this scenario, you can be reasonably sure that any spawned task will get executed.
Contrast this to how we run Elixir in production today:
A lot of us are using ephemeral containers (e.g. Docker) or VMs.
We don’t use hot code reloading to do deploys since it’s time-intensive to get that right and we’re willing to trade a little downtime for faster development speed.
We do deploys by booting an entirely new version of the VM and throwing the old one away.
This means that currently running processes and in-memory state are thrown away on every deploy, which can be multiple times per day.
This means every deploy carries a small chance of losing any in-flight Task, which is especially problematic if:
- The task takes a long time
- The task has a likelihood of failing and we might want to automatically or manually retry it
We need to classify our tasks into two different categories:
- Best effort
- Guaranteed execution
An example of a best-effort task:
Let's say a customer gives us their address during signup, and some time later we need to show them a pretty map with their property on it. Our software must already handle the case where we can't geocode the property (invalid address, etc.).
A regular Task is probably OK for this because the job is:
- Of relatively short duration
- Not mission-critical
- Potentially high volume
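For best-effort work like this, a supervised fire-and-forget task is a small step up from a bare `Task.start/3`: crashes get logged and the task is tied to your supervision tree, but nothing is persisted. A minimal sketch (the `Geocoder` module and supervisor name are illustrative assumptions):

```elixir
defmodule Geocoder do
  # Hypothetical geocoder; a real one would call an external API
  # and handle the "address cannot be geocoded" case.
  def geocode(address), do: IO.puts("geocoding #{address}")
end

# Normally this goes in your application's supervision tree as
# {Task.Supervisor, name: MyApp.TaskSupervisor}.
{:ok, _sup} = Task.Supervisor.start_link(name: MyApp.TaskSupervisor)

# Fire-and-forget: if the task crashes, the supervisor logs it and moves on.
# The task is still lost if the VM is shut down mid-deploy.
Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
  Geocoder.geocode("10 Downing Street")
end)
```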
An example of a task that requires guaranteed execution:
Adding a customer to the CRM and sending them a welcome email after signup. This task must complete, or the business risks losing a multi-thousand-pound deal. Additionally, it might fail and require automatic or even manual retries, so the job needs to hang around for a while.
In order to guarantee execution of tasks we need to persist data about them somewhere outside of the BEAM, so that if the VM restarts we can read that data out of the database and guarantee that the job runs.
We need something with the following characteristics:
- Holds state outside of the BEAM
- Really good at keeping data safe
- Does not lose data on restarts
We need something like a BASE for our DATA.
Can anybody think of anything that fulfils these requirements?
OMG! THE DATABASE.
Postgres is good enough
There are a lot of people who flinch when they hear “database-backed job queue”, and with good reason.
Delayed::Job is a famous database-backed queue from Ruby-land and it is famously appalling at scale. It maxes out at around 100 jobs/s even on huge database boxes.
Traditionally the community has reached for Redis to solve this problem; some well-known examples are Resque and Sidekiq for Ruby, and Exq for Elixir.
However this comes with the overhead of having to manage another service. For many small apps and especially for beginners, this seems like overkill.
In addition, Redis is an in-memory key-value store; it is not designed for durable, transactional storage. It's not ACID-compliant. You can force it to persist everything to disk with synchronous writes, but then you lose much of the performance it's known for.
A database is the ideal solution to our problem. But can we make it fast?
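The persistence side can be as simple as a single table. As a rough sketch, an Ecto migration might look like this (the table and column names are illustrative assumptions, not any particular library's schema):

```elixir
# A minimal Ecto migration sketch for persisting jobs outside the BEAM.
defmodule MyApp.Repo.Migrations.CreateJobs do
  use Ecto.Migration

  def change do
    create table(:jobs) do
      # Serialized {module, function, args} describing the work to do
      add :mfa, :binary, null: false
      add :enqueued_at, :utc_datetime, null: false
      # Set when a run fails, so failed jobs can be retried later
      add :failed_at, :utc_datetime
    end
  end
end
```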
Luckily this is no longer 2008 and there are now several Postgres-specific features we can take advantage of to negate these downsides:
- FOR UPDATE SKIP LOCKED
- Advisory locking
The Que library for Ruby uses some of these features and has been benchmarked at just under 10,000 jobs per second.
With this kind of performance I no longer see a place for Redis. If you need more than 10,000 jobs/s then Redis is probably not the right solution for you either. You are well into the territory of needing a “real” queue system like Kafka or ActiveMQ at that point.
Rihanna is a fast, reliable and easy-to-use Postgres-backed distributed job queue for Elixir.
It is designed for the following very common use-case:
- I have a simple Phoenix/Raxx app with a database (probably > 90% of Elixir deployments in the wild)
- I want to run some task asynchronously so I don’t make the user wait in a request
- I want to be sure that this task is going to run even if I deploy my app and I want to be able to retry the task if it fails
Rihanna is a drop-in solution with no dependencies on any other services. It is based on Ruby’s Que library and uses advisory locks for speed. Que has been benchmarked at up to 10,000 jobs per second and Rihanna’s performance should be similar if not better since this is Elixir, not Ruby.
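Usage looks roughly like this, based on Rihanna's documented MFA-style API (the supervisor options shown are illustrative; check the README for exact setup):

```elixir
# In your application's supervision tree. Rihanna runs its own Postgres
# connection, so it needs database config (values here are assumptions).
children = [
  {Rihanna.Supervisor, [postgrex: [database: "myapp_db"]]}
]

# Enqueue a job anywhere in your code. The job is written to Postgres
# before it runs, so it survives a deploy and can be retried on failure.
Rihanna.enqueue({EmailService, :send_welcome_email, [user_id]})
```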