How to handle the lack of transactions across microservices using the Saga pattern

One of the challenges of using microservices is the lack of transactions in operations that span multiple services. We will see how to solve it using a standard pattern.


What are transactions?

Many operations require modifying data in different parts of the system. For example, paying an invoice may require changing its status, modifying an account balance, registering it in a bookkeeping system, etc. These changes should either all happen together or all be cancelled. Imagine how bad it would be if, due to an error, the money was taken out of the account but the invoice remained unpaid or the payment wasn’t recorded.

Transactions handle this. They track all the changes that are made and roll them back in case of error. This way the data cannot end up in an inconsistent state. This is usually done at the database level, and in the application code we can configure which changes we want to include in the transaction.
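
As an illustration, here is a minimal sketch of a local transaction using Python's built-in sqlite3 module. The table names, the pay_invoice function and the simulate_error flag are assumptions made up for this example and are not part of any real system.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database with one account and one unpaid invoice."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.execute("CREATE TABLE invoice (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute("INSERT INTO account (id, balance) VALUES (1, 1000)")
    conn.execute("INSERT INTO invoice (id, status) VALUES (1, 'UNPAID')")
    conn.commit()
    return conn


def pay_invoice(conn, account_id, invoice_id, amount, simulate_error=False):
    """Apply both changes in one transaction; return True if it was committed."""
    try:
        # Used as a context manager, the connection commits if the block
        # finishes normally and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, account_id),
            )
            if simulate_error:
                raise RuntimeError("simulated failure between the two updates")
            conn.execute(
                "UPDATE invoice SET status = 'PAID' WHERE id = ?",
                (invoice_id,),
            )
        return True
    except RuntimeError:
        return False
```

If the simulated error is raised, the deduction from the account is rolled back together with everything else, so the balance and the invoice status stay as they were.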

What is the challenge with microservices?

The operations described above may be done in different services (invoice, account, payment, ledger…). Each of them may use a different schema, a database on a different server or even a completely different type of database. In these cases we can use transactions inside a service but not across multiple services.

The aim then is to achieve eventual consistency. Some of the data may be in an inconsistent state for a few seconds and the application code has to handle it. There are several ways of doing this, and one of the most common patterns is Saga.

Let’s see an example

We could divide the previous example into a few steps:

  1. Check that the account has enough balance
  2. Check that the invoice is active and hasn’t been paid yet
  3. Deduct the amount from the account
  4. Change the status of the invoice
  5. Do the actual payment
  6. Store the payment in the ledger

Note that I have put the validations at the top. This way we check that we can perform the changes before starting them, so we avoid modifying data and restoring it later when we know from the beginning that the whole operation cannot be done. These validations should be repeated later by the operations that modify the data, because the data may have changed in the meantime.

Also please note that, among the steps that modify data, I have put the most restrictive operation first. Different processes may try to pay different invoices with the same account, so let’s modify the account balance first. This way we will notice any lack of funds as soon as possible.
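
One possible way to repeat the balance check inside the operation that modifies the data is a conditional update, so that the check and the deduction happen in a single statement. This is only a sketch using sqlite3. The account table and the function names are assumptions for the example.

```python
import sqlite3


def open_account_db(initial_balance):
    """Create an in-memory database with a single account (id 1)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.execute("INSERT INTO account (id, balance) VALUES (1, ?)", (initial_balance,))
    conn.commit()
    return conn


def deduct_if_enough_funds(conn, account_id, amount):
    """Deduct the amount only if the balance covers it; return True on success."""
    with conn:
        cursor = conn.execute(
            # The balance check is part of the UPDATE itself, so a payment
            # made in parallel cannot slip in between check and deduction.
            "UPDATE account SET balance = balance - ? WHERE id = ? AND balance >= ?",
            (amount, account_id, amount),
        )
    # No row is updated when the account is missing or the funds are too low.
    return cursor.rowcount == 1
```

With this approach, two processes paying different invoices from the same account cannot both succeed if there is only enough money for one of them.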

How can we handle the errors in this example?

There may be different types of errors. For example, the payment API may be down, so we cannot process the payment right now but could do it after a few minutes. Or there may be a bug that prevents writing to the ledger, which would take longer to fix. Depending on the case we may decide either to undo all the previous steps (backward-recovery) or to retry the pending ones later (forward-recovery). Forward-recovery is the easier of the two, so let’s focus on backward-recovery:

If there was an error in step 3 while taking the money from the account (for example, another user may have made another payment in parallel), we could simply cancel the whole operation, as nothing has been modified yet. If the error was instead in step 5 (imagine that the banking system is down for a few minutes), we would also have to revert steps 3 and 4 so that the account balance and the invoice go back to their previous state.

However, we cannot simply restore the account balance to its previous amount because other changes may have happened in parallel. For example, the account owner may have topped it up in the meantime. What we can do instead is increase the account balance by the amount that we had deducted before. It would work like this:

  1. The account balance was initially $1000.
  2. We deducted $100 for paying an invoice and the balance became $900.
  3. The account owner topped up $300 and the balance became $1200.
  4. The payment of the invoice failed and we credited the $100 back. The balance became $1300.

This way of reverting changes is called a compensating transaction. We have to be careful and ensure that the change is idempotent, because a retry mechanism may apply the operation more than once (e.g. after a timeout) and it should increase the balance only once. We could do this, for example, by sending an operation id so the service can check whether it has already received it.
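
The example above, including the idempotency check, could be sketched as follows. The IdempotentAccount class and the operation ids are hypothetical, and a real service would keep the processed ids in its database rather than in memory.

```python
class IdempotentAccount:
    """In-memory account that applies each operation id at most once."""

    def __init__(self, balance):
        self.balance = balance
        self._applied_operation_ids = set()

    def _apply_once(self, operation_id, delta):
        # A repeated operation id means a retry of something already applied.
        if operation_id in self._applied_operation_ids:
            return False
        self.balance += delta
        self._applied_operation_ids.add(operation_id)
        return True

    def deduct(self, operation_id, amount):
        return self._apply_once(operation_id, -amount)

    def credit(self, operation_id, amount):
        # Used both for top-ups and for compensating a previous deduction.
        return self._apply_once(operation_id, amount)


account = IdempotentAccount(1000)
account.deduct("invoice-1-deduct", 100)        # balance: 900
account.credit("owner-top-up-1", 300)          # balance: 1200
account.credit("invoice-1-compensate", 100)    # balance: 1300
account.credit("invoice-1-compensate", 100)    # retry is ignored, still 1300
```

Note that the compensation adds the deducted amount back instead of writing the old balance, which is why the top-up made in parallel is not lost.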

How to implement it

The first thing to do is to track the status of the whole operation so that we can recover if any part of the system fails, for example by storing it in the database and/or in a log that lets us see the history of the changes (e.g. Kafka).

As mentioned before, in the case of forward-recovery we just have to check later if it is possible to do the operation that failed. This could be done with a scheduler that gets the failed operations from the database or from a queue, and tries them again.
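
A minimal sketch of such a retry loop is shown below. The retry_pending function, its parameters and the operation names are assumptions for illustration. A real scheduler would also wait between attempts and persist the attempt count, which is left out here to keep the example short.

```python
from collections import deque


def retry_pending(pending_operations, execute, max_attempts=3):
    """Retry each failed operation up to max_attempts times.

    Returns two lists: the operations that finally succeeded and the ones
    that were given up on (candidates for backward-recovery or manual review).
    """
    queue = deque((operation, 1) for operation in pending_operations)
    succeeded, given_up = [], []
    while queue:
        operation, attempt = queue.popleft()
        try:
            execute(operation)
            succeeded.append(operation)
        except Exception:
            if attempt < max_attempts:
                # Put it back at the end of the queue for a later attempt.
                queue.append((operation, attempt + 1))
            else:
                given_up.append(operation)
    return succeeded, given_up


calls = {}


def flaky_execute(operation):
    """Hypothetical handler: the payment works on the third call, the ledger never does."""
    calls[operation] = calls.get(operation, 0) + 1
    if operation == "payment-42" and calls[operation] < 3:
        raise ConnectionError("payment API is down")
    if operation == "ledger-7":
        raise RuntimeError("cannot write to the ledger")


result = retry_pending(["payment-42", "ledger-7"], flaky_execute)
```

In this run the payment succeeds on its third attempt, while the ledger operation exhausts its attempts and is reported as given up.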

Let’s see how to do backward-recovery with two types of architectures:

1) In service orchestration architectures:

In this case one service manages the communication between the others, which work separately without knowing about each other. It is usually modelled as a state machine in which we specify the actions that should happen when everything goes well, and which compensating transactions should be executed otherwise.
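
A very reduced orchestrator could look like the sketch below. Each step pairs an action with its compensating transaction, and on failure the completed steps are compensated in reverse order. All the names here (run_saga, the step functions, the ctx dictionary) are assumptions for the example. In a real system each function would call another service, and the saga state would be persisted.

```python
def run_saga(steps, ctx):
    """Run the steps in order; on failure, compensate completed ones in reverse."""
    compensations = []
    for name, action, compensation in steps:
        try:
            action(ctx)
        except Exception:
            for undo in reversed(compensations):
                undo(ctx)
            return "COMPENSATED", name
        if compensation is not None:
            compensations.append(compensation)
    return "COMPLETED", None


def deduct_balance(ctx):
    if ctx["balance"] < ctx["amount"]:
        raise ValueError("insufficient funds")
    ctx["balance"] -= ctx["amount"]


def refund_balance(ctx):
    # Adds the amount back instead of restoring the old balance.
    ctx["balance"] += ctx["amount"]


def mark_invoice_paid(ctx):
    ctx["invoice"] = "PAID"


def mark_invoice_unpaid(ctx):
    ctx["invoice"] = "UNPAID"


def execute_payment(ctx):
    if ctx["bank_down"]:
        raise ConnectionError("banking system unavailable")


def record_in_ledger(ctx):
    ctx["ledger"].append(ctx["amount"])


PAY_INVOICE_SAGA = [
    ("deduct_balance", deduct_balance, refund_balance),
    ("mark_invoice_paid", mark_invoice_paid, mark_invoice_unpaid),
    ("execute_payment", execute_payment, None),
    ("record_in_ledger", record_in_ledger, None),
]
```

Calling run_saga(PAY_INVOICE_SAGA, ctx) with ctx["bank_down"] set to True fails at execute_payment, then marks the invoice as unpaid again and refunds the balance.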

2) In service choreography architectures:

This is more challenging because the logic to revert the whole operation is distributed across all the services, and each of them has to know which ones to call when something goes wrong. Adding more services may also complicate things, as we have to figure out which existing services need to change. In this case we need an operation id that is sent across the services so they know how to retrieve the information they need to make the changes.
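
To make this more concrete, here is a simplified sketch in which the services only communicate through events that carry the operation id. The in-memory EventBus, the event names and the service classes are all assumptions for the example. In practice the events would travel through a message broker and would be delivered asynchronously.

```python
from collections import defaultdict


class EventBus:
    """Synchronous in-memory stand-in for a message broker."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, operation_id):
        for handler in list(self._handlers[event_type]):
            handler(operation_id)


class AccountEvents:
    def __init__(self, bus, balance):
        self.bus = bus
        self.balance = balance
        self.deductions = {}  # operation id -> deducted amount
        bus.subscribe("PaymentFailed", self.on_payment_failed)

    def deduct(self, operation_id, amount):
        self.balance -= amount
        self.deductions[operation_id] = amount
        self.bus.publish("BalanceDeducted", operation_id)

    def on_payment_failed(self, operation_id):
        # The operation id tells us how much to give back, and removing
        # the entry makes a repeated event harmless.
        amount = self.deductions.pop(operation_id, None)
        if amount is not None:
            self.balance += amount


class InvoiceEvents:
    def __init__(self, bus):
        self.bus = bus
        self.status = {}  # operation id -> invoice status
        bus.subscribe("BalanceDeducted", self.on_balance_deducted)
        bus.subscribe("PaymentFailed", self.on_payment_failed)

    def on_balance_deducted(self, operation_id):
        self.status[operation_id] = "PAID"
        self.bus.publish("InvoiceMarkedPaid", operation_id)

    def on_payment_failed(self, operation_id):
        if operation_id in self.status:
            self.status[operation_id] = "UNPAID"


class PaymentEvents:
    def __init__(self, bus, bank_available):
        self.bus = bus
        self.bank_available = bank_available
        bus.subscribe("InvoiceMarkedPaid", self.on_invoice_marked_paid)

    def on_invoice_marked_paid(self, operation_id):
        event = "PaymentCompleted" if self.bank_available else "PaymentFailed"
        self.bus.publish(event, operation_id)
```

There is no central coordinator here: the account and invoice services each decide how to compensate when they receive a PaymentFailed event for an operation id they know.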

Conclusion

We have seen different approaches to solving the problem depending on the business logic, the error type and the architecture type. The solutions are more complicated than using a simple transaction in a monolith, but they have advantages: they allow us to decouple the data between services and deploy them separately without losing data integrity.

Have you used them? If so, what has your experience been? As usual, I would love to read about it in the comments section below.

Rafael Borrego

Consultant and security champion specialised in Java, with experience in architecture and team management in both startups and big corporations.

Disclaimer: the posts are based on my own experience and may not reflect the views of my current or any previous employer
