Ran into a clever blog post with a Starbucks analogy showing how a traditional, synchronous transaction can be replaced with a series of decoupled, asynchronously executing steps:
Most devs should be intimately familiar with the process of acquiring coffee. 😛
What I especially like about the blog (and the linked post Starbucks Does Not Use Two-Phase Commit by Gregor Hohpe, which is from 2004!) is how they encourage you to completely re-think a naïve, all-or-nothing synchronous approach to common business problems. 100% transactionally perfect units of work are often not worth the cost of those guarantees (and are basically impossible in a distributed microservices world anyway). Just ask Pat Helland, who worked at Microsoft in the ’90s and headed the team responsible for creating their distributed transaction manager, then later wrote the influential Life beyond Distributed Transactions: an Apostate’s Opinion.
Once you leave the safety of atomic transactions, how do you deal with faults and errors when an action has been ‘partially completed’ by one or more services? Gregor suggests a few options:
- Ignore – the error is not important, so simply write it off (rarely used for critical business data, but for dashboard/reports or other ephemeral data, it might be OK)
- Retry – good to use when you’ve had an infrastructure failure (network down, database locked, etc.), but not the best choice for business rule violations (see: Definition of Insanity). The one exception to this is when you expect short-term race conditions because of eventual consistency – a retry will often give them time to settle down.
- Compensating action – a common approach in a message-based microservice world; we issue a command to clean up after ourselves, leaving the system in an eventually consistent, proper state

Compensating actions are the most interesting option, as they are intimately tied to the business itself – what constitutes a proper response to an error condition for a given business context?
In the coffee blog, the reimbursement process is an example of a compensating action. The cashier rings up your order and passes the cup to the barista. Suddenly they discover they are out of coffee (gasp)! In this case, the service responsible for handling the command “Make a coffee” (the barista) issues a failure event (apologizes to the customer and informs the cashier), and the cashier responds by crediting the cost back to your card (compensating action). It’s a lengthy, disruptive process, so you want to make sure the chance of failure is low – compensating actions are not always a good fit when failure rates are expected to be high.
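To make the retry-versus-compensate distinction concrete, here is a minimal Python sketch of the coffee order. Everything in it is made up for illustration (`Cashier`, `Barista`, `place_order`, the two exception types, the `glitches` counter that simulates infrastructure hiccups); it is not taken from either blog post, just one way the flow could look.

```python
from dataclasses import dataclass, field


class TransientError(Exception):
    """Infrastructure hiccup (hypothetical): worth retrying."""


class OutOfCoffee(Exception):
    """Business failure (hypothetical): retrying will not help."""


@dataclass
class Cashier:
    ledger: list = field(default_factory=list)

    def charge(self, customer: str, amount: float) -> None:
        self.ledger.append((customer, -amount))

    def refund(self, customer: str, amount: float) -> None:
        # Compensating action: a new entry that offsets the earlier charge.
        # The original charge is never rolled back or deleted.
        self.ledger.append((customer, amount))

    def balance(self, customer: str) -> float:
        return sum(amount for name, amount in self.ledger if name == customer)


@dataclass
class Barista:
    beans: int = 0
    glitches: int = 0  # simulated transient failures before the real attempt

    def make_coffee(self) -> str:
        if self.glitches > 0:
            self.glitches -= 1
            raise TransientError("espresso machine busy")
        if self.beans <= 0:
            raise OutOfCoffee("no beans left")
        self.beans -= 1
        return "coffee"


def place_order(cashier, barista, customer, price, max_attempts=3):
    """Charge first, then try to make the coffee; refund if that fails."""
    cashier.charge(customer, price)  # step 1 commits on its own
    for _ in range(max_attempts):
        try:
            return barista.make_coffee()
        except TransientError:
            continue  # retry (a real system would back off between attempts)
        except OutOfCoffee:
            break  # business rule violation: do not retry
    cashier.refund(customer, price)  # compensate for the committed charge
    return None


if __name__ == "__main__":
    cashier = Cashier()
    print(place_order(cashier, Barista(beans=1, glitches=2), "ann", 4.0))
    print(place_order(cashier, Barista(beans=0), "bob", 4.0))
    print(cashier.ledger)
```

Note that the ledger keeps both the charge and the refund. That is the essence of a compensating action: the system ends up in a consistent state by adding a corrective step, not by pretending the first step never happened.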
In the manufacturing world, we should strive to identify scenarios where a “fire-and-forget” approach is applicable. Do we really need to block the user after they complete an operation while a WIP (Work In Progress) is routed to the next step, or while a document prints?
If different services are handling each aspect (operation history, routing dispatch logic, document evaluation and printing), and the services are independent of each other, a single transaction spanning all of them becomes impossible anyway.
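As a rough sketch of what fire-and-forget could look like, the Python below uses an in-process `queue.Queue` and a worker thread as a stand-in for a real message broker. The event names (`route_wip`, `print_document`) and the functions `complete_operation` and `start_worker` are hypothetical, chosen only to mirror the scenario above.

```python
import queue
import threading


def complete_operation(events: queue.Queue, wip_id: str) -> str:
    """Publish the follow-up work and return to the user immediately."""
    events.put(("route_wip", wip_id))
    events.put(("print_document", wip_id))
    return f"operation on {wip_id} completed"


def start_worker(events: queue.Queue, handled: list) -> threading.Thread:
    """Start a background consumer; a None event tells it to stop."""

    def run() -> None:
        while True:
            event = events.get()
            if event is not None:
                handled.append(event)  # stand-in for real routing/printing
            events.task_done()
            if event is None:
                return

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    events = queue.Queue()
    handled = []
    worker = start_worker(events, handled)
    print(complete_operation(events, "WIP-42"))  # returns without waiting
    events.join()  # only the demo waits, so it can show the result
    print(handled)
    events.put(None)
    worker.join()
```

The caller never blocks on routing or printing; it only pays for two `put` calls. The trade-off is that any failure now surfaces inside the worker, after the user has moved on.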
We still need to think about the consequences of a failure at each hop, and what the proper response should be. Most of the time, the answer lies in how the business runs, not how the code runs! These are the kinds of scenarios Business Analysts can help uncover, by discussing with the customer what options are available and valid from a business perspective.