We should design and build systems so that any change can be easily reverted, even if the people who worked on it are not available at the time. Imagine, for example, that a change starts causing memory issues on a Saturday evening. You want to be confident that the support teams can revert to a previous version without downtime and without involving multiple teams. The changes below are easy to implement and will give you peace of mind.

API changes

Changes to avoid

API releases should not contain breaking changes such as adding new mandatory fields to an endpoint or changing the format of the response. A breaking change forces multiple consumers to be updated, and it may not be straightforward to release all of them at the same time, especially if they are managed by different teams.
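As a rough illustration of the alternative, a new field can be introduced as optional with a default, so existing consumers keep working unchanged. This is only a sketch: the `parse_create_order` function, the field names and the `"EUR"` default are all hypothetical.

```python
from typing import Any


def parse_create_order(payload: dict[str, Any]) -> dict[str, Any]:
    """Parse a hypothetical 'create order' request body."""
    # Fields that existed in the first release stay mandatory.
    missing = [f for f in ("customer_id", "amount") if f not in payload]
    if missing:
        raise ValueError(f"missing mandatory fields: {missing}")
    return {
        "customer_id": payload["customer_id"],
        "amount": payload["amount"],
        # Added in a later release: optional, with a default (assumed here),
        # so consumers that do not send it yet are not broken.
        "currency": payload.get("currency", "EUR"),
    }
```

An old consumer that only sends `customer_id` and `amount` still gets a valid order, while a new consumer can start sending `currency` whenever it is ready.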

API design

API endpoints and their request and response bodies should not be designed around a specific consumer (e.g. endpoints that just return data filtered for a specific case), as they may soon need to change to support other consumers. They should also hide any implementation detail, so that even the underlying infrastructure can change (e.g. retrieving data from Redis or Elasticsearch instead of from a database) without impacting any consumer.
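One possible way to keep the storage hidden is to have the endpoint depend on a small interface and build the public response from it. The sketch below uses in-memory stand-ins instead of a real database or Redis client; `ProductStore`, `DatabaseStore`, `CacheStore` and `list_products` are hypothetical names.

```python
from typing import Protocol


class ProductStore(Protocol):
    """What the endpoint needs from storage, and nothing more."""

    def find_by_category(self, category: str) -> list[dict]: ...


class DatabaseStore:
    """Stand-in for a relational database (here just a list of rows)."""

    def __init__(self, rows: list[dict]) -> None:
        self._rows = rows

    def find_by_category(self, category: str) -> list[dict]:
        return [r for r in self._rows if r["category"] == category]


class CacheStore:
    """Stand-in for a cache such as Redis (here a dict keyed by category)."""

    def __init__(self, by_category: dict[str, list[dict]]) -> None:
        self._by_category = by_category

    def find_by_category(self, category: str) -> list[dict]:
        return self._by_category.get(category, [])


def list_products(store: ProductStore, category: str) -> dict:
    """Build the public response; internal fields never leak into it."""
    return {
        "items": [
            {"id": p["id"], "name": p["name"]}
            for p in store.find_by_category(category)
        ]
    }
```

Because `list_products` only exposes `id` and `name`, swapping one store for the other does not change what consumers receive.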

Versioning

Versioning the URLs allows an old and a new version of the API to run in parallel until all the consumers have had time to update. This can be done by having duplicated endpoints inside the same API or by deploying separate instances with the old and new versions. The latter approach is usually easier to maintain, as the code can get complicated if it has to support different versions, and it is also easier to release a change to either version as they can evolve separately.
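To make the idea concrete, here is a minimal sketch of the first option (duplicated endpoints inside the same API) using a plain dictionary as a router rather than any real web framework. The paths, handlers and response shapes are hypothetical.

```python
def get_customer_v1(customer_id: int) -> dict:
    """Old contract: a single 'name' field."""
    return {"id": customer_id, "name": "Ada Lovelace"}


def get_customer_v2(customer_id: int) -> dict:
    """New contract: the name is split, which would break v1 consumers."""
    return {"id": customer_id, "first_name": "Ada", "last_name": "Lovelace"}


# Both versions stay routable until every consumer has moved to /v2.
ROUTES = {
    "/v1/customers": get_customer_v1,
    "/v2/customers": get_customer_v2,
}


def dispatch(path: str) -> dict:
    """Route '/<version>/customers/<id>' to the handler for that version."""
    prefix, _, customer_id = path.rpartition("/")
    handler = ROUTES.get(prefix)
    if handler is None:
        raise LookupError(f"no route for {path}")
    return handler(int(customer_id))
```

With separate instances the same split would live in the deployment configuration instead of in one routing table, which keeps each codebase simpler.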

Contract testing

Contract tests detect breaking changes between APIs, so you are automatically notified if a change in a service impacts any of its consumers. This is very useful before and during releases, as these tests can run as part of the pipelines and abort them if they find a breaking change.
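Dedicated tools exist for this, but the core idea can be sketched in a few lines: the consumer records the fields and types it relies on, and the provider's pipeline checks its responses against that record. The contract format and the `contract_violations` helper below are hypothetical simplifications, not the API of any real tool.

```python
# What one consumer relies on: field name -> expected type.
CONSUMER_CONTRACT = {"id": int, "name": str}


def contract_violations(response: dict, contract: dict[str, type]) -> list[str]:
    """Return a description of every way the response breaks the contract."""
    problems = []
    for field, expected in contract.items():
        if field not in response:
            problems.append(f"missing field: {field}")
        elif not isinstance(response[field], expected):
            actual = type(response[field]).__name__
            problems.append(
                f"wrong type for {field}: expected {expected.__name__}, got {actual}"
            )
    # Extra fields in the response are fine: adding fields is not breaking.
    return problems
```

A pipeline step could fail the build whenever the returned list is not empty.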

Database changes

Some database changes are difficult to roll back, e.g. deleting a column or changing its content or type.

Column deletion

Imagine deleting a column and having to roll back the following day. Not only will you have to recover the data from a backup, but the values written in the last few hours will not be available anywhere. Instead, the deletion can be done in three steps:

1. Modify the code so it keeps populating the column but no longer reads it. This way the values will still be present if a rollback is needed.
2. Make the column optional and remove any remaining usage in the code so it is no longer populated. This can help detect unexpected issues, e.g. an old report that nobody knows about and that still uses that table.
3. Do the actual column deletion.
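The three steps above could look roughly like this with an in-memory SQLite database. The `users` table, the `legacy_code` column and the function names are hypothetical. The column is declared nullable from the start because SQLite cannot relax a NOT NULL constraint in place; in other databases step 2 would include that schema change.

```python
import sqlite3


def save_user_release_1(conn, user_id, name, legacy_code):
    """Step 1: keep writing legacy_code so a rollback still finds fresh values."""
    conn.execute(
        "INSERT INTO users (id, name, legacy_code) VALUES (?, ?, ?)",
        (user_id, name, legacy_code),
    )


def save_user_release_2(conn, user_id, name):
    """Step 2: stop writing the column; this needs it to be nullable."""
    conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name))


def load_user(conn, user_id):
    """From step 1 onwards, reads never touch legacy_code."""
    row = conn.execute(
        "SELECT id, name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return {"id": row[0], "name": row[1]}


# Step 3: shown but not executed here. It would run in a later release, once
# nothing reads or writes the column (SQLite needs version 3.35 or later).
DROP_LEGACY_COLUMN = "ALTER TABLE users DROP COLUMN legacy_code"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, legacy_code TEXT)"
)
save_user_release_1(conn, 1, "Ada", "A-1")  # written while step 1 is live
save_user_release_2(conn, 2, "Grace")       # written while step 2 is live
```

Rolling back from step 2 to step 1, or from step 1 to the original code, needs no data recovery because the column and its values are still there.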

Column change

Imagine you change an enum value or a date format and it breaks some critical month-end reports. It would be quite hard for the ops teams to figure out how to roll back. As in the previous case, these changes can be done in multiple steps:

1. Create a column with the new format, and modify the code to write to both the new and old columns and to read only from the new one.
2. Make the old column optional and modify the code so it no longer writes to it.
3. Delete the old column.
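As an illustration of the first step, the sketch below changes a date format by writing both columns and reading only the new one, again with an in-memory SQLite database. The `orders` table, the column names and the two date formats are assumptions made for the example.

```python
import sqlite3
from datetime import date


def save_order(conn, order_id: int, created: date) -> None:
    """Dual write: the old column keeps its format so old readers still work."""
    conn.execute(
        "INSERT INTO orders (id, created_legacy, created_iso) VALUES (?, ?, ?)",
        (order_id, created.strftime("%d/%m/%Y"), created.isoformat()),
    )


def load_created(conn, order_id: int) -> date:
    """New code reads only the new column."""
    row = conn.execute(
        "SELECT created_iso FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return date.fromisoformat(row[0])


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, created_legacy TEXT, created_iso TEXT)"
)
save_order(conn, 1, date(2024, 1, 31))
```

If the release has to be reverted, the previous version finds the old column fully populated in the format it expects.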

Queue/topic changes

Messages usually have to be modified over time to add more fields, so the code should be configured to ignore fields it does not expect, which lets new and old consumers of the message coexist. If a field needs to change, a new one can be added and the code modified to read only from the new one. In this case it may not be viable to delete the old fields, as a topic may have several consumers that may want to reprocess all the messages, e.g. when Kafka topics are used as a single source of truth.
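A consumer written this way might look like the sketch below, which ignores unknown fields and prefers a new field over the one it replaces. The event shape and the field names (`payment_id`, `amount`, `amount_cents`) are hypothetical. The fallback to the old field is an extra assumption, included so that old messages can still be reprocessed.

```python
import json


def read_payment_event(raw: str) -> dict:
    """Read only the fields this consumer needs; ignore everything else."""
    message = json.loads(raw)
    if "amount_cents" in message:
        # New field, added alongside the old one rather than replacing it.
        cents = message["amount_cents"]
    else:
        # Old messages replayed from the topic only carry 'amount'.
        cents = round(message["amount"] * 100)
    # Unknown fields are simply not read, so producers can keep adding them.
    return {"payment_id": message["payment_id"], "amount_cents": cents}
```

The same function handles messages from before and after the change, so neither producers nor other consumers have to be released at the same time.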

Summary

We have seen some practices that will make your deployments more robust and ensure that support teams can easily roll back and restore the system to a previous state without involving product teams outside normal working hours. Do you use these or similar practices? If so, I would love to read about them in the comments.

Rafael Borrego

Consultant and security champion specialised in Java, with experience in architecture and team management in both startups and big corporations.

Disclaimer: the posts are based on my own experience and may not reflect the views of my current or any previous employer.


7 steps to make releases backward compatible
