WebCat’s Transition to Step Functions

WebCat is Clouden’s domain registrar service. Our customers register and renew domains and manage their DNS records. This is simple on the surface, but involves quite a few integration points. We interact with external services like the .fi top-level-domain registry and external DNS services. We also have an order processing system that has grown more complicated over time.

In this article, we describe how we originally implemented our system with Amazon SNS and SQS and later transitioned to Step Functions. The goal was to enable more advanced features, such as processing of multi-item orders. We also made our system more observable and easier to extend.

Original solution based on Amazon SNS and SQS

When we initially launched WebCat, customers could register one domain at a time. We implemented all integrations using Amazon’s SNS and SQS messaging services. SNS is a publish-and-subscribe service for real-time one-to-many messaging. SQS is a message queue service for eventual one-to-one messaging.

Amazon SNS is useful for keeping services decoupled from each other. The sending service doesn’t have to know anything about the receiving services. It just publishes a message to a specific topic, which acts as the API endpoint. We can develop services independently and replace them with new implementations as necessary.

Amazon SQS is useful for making sure that usage peaks don’t cause service outages. Requests are queued and integration services process them as quickly as possible. For instance, we can queue hundreds or thousands of REFRESH_DOMAIN requests immediately. The integration service, which connects to the .fi top-level-domain registry, processes one request at a time and deletes it from the queue.
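The one-request-at-a-time pattern described above can be sketched roughly as follows. This is a minimal illustration, not our production code: the function name and the handle_request callback are hypothetical, and sqs is assumed to be a boto3 SQS client (or anything with the same receive_message and delete_message methods).

```python
import json


def process_next_request(sqs, queue_url, handle_request):
    """Receive at most one message, handle it, then delete it from the queue.

    Returns True if a message was processed, False if the queue was empty.
    `sqs` is assumed to be a boto3 SQS client; `handle_request` is a
    hypothetical callback that talks to the external registry.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,  # process one request at a time
        WaitTimeSeconds=10,     # long polling to avoid busy-waiting
    )
    messages = response.get("Messages", [])
    if not messages:
        return False

    message = messages[0]
    handle_request(json.loads(message["Body"]))

    # Delete only after successful handling. If handle_request raises,
    # the message becomes visible again after the visibility timeout.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return True
```

Deleting the message only after the handler returns means a crashed worker does not lose the request; it is simply retried later.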

Shortcomings in the original solution

Although Amazon SNS and Amazon SQS work quite well for their intended purposes, they also have weaknesses when used to implement service integrations.

Due to these weaknesses, we originally developed a complementary service that uses DynamoDB to keep track of requests and their statuses. When AWS released Step Functions, we realized that it provides the same thing out-of-the-box. Consequently, we decided to transition our system to use Step Functions.

Transitioning to Step Functions

Step Functions are a way to build and execute state machines in the cloud. When you use Step Functions to implement a service integration, the state machine defines basic operating logic.

A state machine execution flows through each state of the machine and typically calls small Lambda functions in a predefined order. The state machine may include Choice states which act similarly to if-then statements. It may also include Map states which iterate over arrays like for-loops. It can also respond to error conditions in the same way as try-catch clauses work.
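To make the if-then, for-loop and try-catch analogy concrete, here is a small sketch of a state machine definition in Amazon States Language, written as a Python dictionary that serializes to the JSON format Step Functions expects. The state names, the Lambda ARNs and the itemCount input field are all hypothetical; this is not WebCat’s actual state machine.

```python
import json

# Hypothetical Lambda ARNs, used for illustration only.
REGISTER_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:registerDomain"
FINISH_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:finishOrder"
FAILURE_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:handleFailure"

ORDER_STATE_MACHINE = {
    "StartAt": "HasItems",
    "States": {
        # Choice state: behaves like an if-then statement.
        "HasItems": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.itemCount", "NumericGreaterThan": 0, "Next": "ProcessItems"}
            ],
            "Default": "EmptyOrder",
        },
        # Map state: iterates over the items array like a for-loop.
        "ProcessItems": {
            "Type": "Map",
            "ItemsPath": "$.items",
            "ResultPath": "$.results",
            "Iterator": {
                "StartAt": "RegisterDomain",
                "States": {
                    "RegisterDomain": {"Type": "Task", "Resource": REGISTER_ARN, "End": True}
                },
            },
            # Catch: behaves like the catch part of a try-catch clause.
            "Catch": [
                {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "HandleFailure"}
            ],
            "Next": "FinishOrder",
        },
        "FinishOrder": {"Type": "Task", "Resource": FINISH_ARN, "End": True},
        "HandleFailure": {"Type": "Task", "Resource": FAILURE_ARN, "End": True},
        "EmptyOrder": {"Type": "Succeed"},
    },
}


def undefined_targets(machine):
    """Return transition targets that do not name a state in this machine."""
    states = machine["States"]
    targets = [machine["StartAt"]]
    for state in states.values():
        targets.extend(state[key] for key in ("Next", "Default") if key in state)
        for rule in state.get("Choices", []) + state.get("Catch", []):
            targets.append(rule["Next"])
    return [target for target in targets if target not in states]


definition_json = json.dumps(ORDER_STATE_MACHINE, indent=2)
```

The undefined_targets helper is only a local sanity check for typos in state names before deploying; it does not replace the validation Step Functions performs itself.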

Step Functions let you define Activities, which are similar to SQS message queues. An Activity State waits for an external service to process the Activity and report success or failure. The operation is synchronous, which means that the next Lambda function receives the activity result as its input.
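An external worker for an Activity might look roughly like the sketch below. It assumes sfn is a boto3 Step Functions client, whose get_activity_task, send_task_success and send_task_failure calls are used here; the function name, the worker name and the handler callback are hypothetical.

```python
import json


def process_one_activity_task(sfn, activity_arn, handler, worker_name="example-worker"):
    """Poll the Activity once, run the handler, and report the outcome.

    Returns True if a task was processed, False if the poll returned no task.
    `sfn` is assumed to be a boto3 Step Functions client; `handler` is a
    hypothetical callback that takes the task input and returns a result.
    """
    task = sfn.get_activity_task(activityArn=activity_arn, workerName=worker_name)
    token = task.get("taskToken")
    if not token:
        return False  # the long poll ended without a task

    try:
        result = handler(json.loads(task["input"]))
    except Exception as exc:
        # Reporting failure lets the state machine react (retry, catch, fail).
        sfn.send_task_failure(taskToken=token, error=type(exc).__name__, cause=str(exc))
    else:
        # The output becomes the result of the Activity state.
        sfn.send_task_success(taskToken=token, output=json.dumps(result))
    return True
```

The task token ties the worker’s answer back to the waiting execution, which is what makes the Activity look synchronous from the state machine’s point of view.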

Overall, Step Functions provide a useful abstraction that can correspond to real-world concepts like order processing and other end-user actions. They reduce the amount of low-level code you need to write in Lambda functions. They also make your business logic more observable and organize it into well-defined state machines.

Anatomy of an Execution

When a customer submits an order, we create a state machine execution to fulfill it. An execution has a state and a unique name. The initial state is always Running and eventually it becomes Succeeded, Failed, Timed out or Aborted. We use the execution name to identify the order.

To create a state machine execution, we need to know the state machine name and define the execution input as a JSON object. This is roughly equivalent to sending a JSON message to an SNS topic or an SQS queue. Since the execution has a unique name, we can also track it through its lifetime. At the end we will know the final execution status and receive the final output.
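As a rough sketch, starting an execution for an order could look like this. The function name and the input fields (orderId, items) are hypothetical, and sfn is assumed to be a boto3 Step Functions client. Note that the StartExecution API identifies the state machine by its ARN and takes the input as a JSON string.

```python
import json


def start_order_execution(sfn, state_machine_arn, order_id, items):
    """Start one execution per order, using the order ID as the execution name.

    `sfn` is assumed to be a boto3 Step Functions client. The input shape
    is a hypothetical example, not WebCat's actual order format.
    """
    response = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        name=order_id,  # unique name, so the order can be tracked later
        input=json.dumps({"orderId": order_id, "items": items}),
    )
    return response["executionArn"]
```

The returned execution ARN can be passed to describe_execution later if the status or output is needed.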

However, we are rarely interested in the final status and the final output of a Step Functions execution. Instead, we include a final Lambda function state as part of the state machine. In the normal case, this Lambda function receives the final output, finishes handling the order and captures any pending credit card charge. This way the final Lambda function is part of the execution and we can observe it in the same place as the rest of the state machine.

Error handling

What happens when a Step Functions execution fails? For that purpose we have a separate Lambda function that subscribes to CloudWatch Events outside the state machine. It receives the final status and error message from failed Step Functions executions and finishes handling the order with an error status. It also cancels any pending credit card charges.
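A handler of this kind might be sketched as follows. The event pattern uses the standard source and detail-type of Step Functions execution status change events; the handler function and the fail_order callback are hypothetical. This sketch reads only the execution name and status from the event, so retrieving the error message is left out.

```python
# Event pattern for a CloudWatch Events rule that matches unsuccessful executions.
EVENT_PATTERN = {
    "source": ["aws.states"],
    "detail-type": ["Step Functions Execution Status Change"],
    "detail": {"status": ["FAILED", "TIMED_OUT", "ABORTED"]},
}

FAILURE_STATUSES = {"FAILED", "TIMED_OUT", "ABORTED"}


def handle_status_change(event, fail_order):
    """Mark the order as failed when its execution ended unsuccessfully.

    `fail_order` is a hypothetical callback that would finish the order with
    an error status and cancel any pending credit card charge.
    Returns the order ID that was failed, or None if the event was ignored.
    """
    detail = event.get("detail", {})
    status = detail.get("status")
    if event.get("source") != "aws.states" or status not in FAILURE_STATUSES:
        return None

    order_id = detail["name"]  # the execution name identifies the order
    fail_order(order_id, status)
    return order_id
```

Checking the status inside the handler as well as in the rule pattern keeps the function safe even if the rule is later broadened to match all status changes.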

Under normal conditions, our Step Functions executions don’t fail. When an individual service integration fails, we don’t throw an exception that would fail the entire execution. Instead, we return an Error attribute in the output JSON to indicate an error. This makes it possible to use Step Functions logic like Choice states to handle errors in various ways, such as calling a specific Lambda function depending on the type of the error and where it occurred.
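The idea of returning an Error attribute instead of throwing can be illustrated with the sketch below. The exception class, the handler, the shape of the Error object and the state names are all assumptions made for this example; only the use of an Error attribute and a Choice state comes from the description above.

```python
class DomainTakenError(Exception):
    """Hypothetical error raised when the registry reports the domain as taken."""


def register_domain_handler(event, register):
    """Lambda-style handler that reports failure in its output instead of raising.

    `register` is a hypothetical function that calls the registry.
    """
    domain = event["domain"]
    try:
        register(domain)
    except DomainTakenError as exc:
        # No exception leaves the handler, so the execution keeps running.
        return {"domain": domain, "Error": {"type": "DomainTaken", "message": str(exc)}}
    return {"domain": domain}


# Choice state that routes on the Error attribute. Rules are evaluated in
# order, so the IsPresent rule guards the later access to $.Error.type.
CHECK_ERROR_STATE = {
    "Type": "Choice",
    "Choices": [
        {"Variable": "$.Error", "IsPresent": False, "Next": "Continue"},
        {"Variable": "$.Error.type", "StringEquals": "DomainTaken", "Next": "NotifyDomainTaken"},
    ],
    "Default": "HandleOtherError",
}
```

Because the Lambda function always returns normally, the decision about what an error means moves out of the code and into the state machine definition, where it is visible.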

For example, we might be processing an order to register ten domains. An error may occur while registering one of the domains because it was snatched by someone else a few milliseconds earlier. In this case, we don’t want to fail the entire execution. We want to continue processing and register as many of the remaining domains as possible. The final output will indicate whether each registration was successful or not, and we calculate the final credit card charge based on that information.
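Calculating the final charge from per-item results could be as simple as the following sketch. It assumes each result is a dictionary that contains an Error attribute only when that item failed, and that every item has the same price; both the result shape and the price are made up for illustration.

```python
from decimal import Decimal


def calculate_charge(results, unit_price):
    """Charge only for items whose result has no Error attribute.

    Assumes a single flat price per item, which is a simplification.
    """
    successful = [result for result in results if "Error" not in result]
    return unit_price * len(successful)


# Ten domains ordered, one of which was taken by someone else.
results = [{"domain": f"example{i}.fi"} for i in range(9)]
results.append({"domain": "taken.fi", "Error": {"type": "DomainTaken"}})
charge = calculate_charge(results, Decimal("9.90"))
```

Using Decimal rather than float avoids rounding surprises when money amounts are multiplied and summed.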

Nested Step Functions

It is sometimes useful to split a state machine into several nested state machines. In our case, the top-level state machine is usually an Order. It processes each item of the order and executes an inner state machine to perform the related action, such as a domain registration. We call these inner state machines Actions.

This separation lets us detach service integration details from order processing. When we add new service integrations, we define them as new state machines. The top-level state machine doesn’t have to know anything about the details, just the state machine name that it needs to execute.
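One way to express this nesting is the Step Functions service integration for starting another execution and waiting for it to finish. The sketch below builds such a Task state and wraps it in a Map state over the order items. The helper function, the state names and the ARN are hypothetical, and our actual definitions may differ.

```python
def action_state(action_state_machine_arn):
    """Build a Task state that runs an inner Action state machine and waits for it."""
    return {
        "Type": "Task",
        # ".sync:2" waits for the inner execution and returns its output as parsed JSON.
        "Resource": "arn:aws:states:::states:startExecution.sync:2",
        "Parameters": {
            "StateMachineArn": action_state_machine_arn,
            "Input.$": "$",  # pass the current order item to the Action
        },
        "OutputPath": "$.Output",  # keep only the inner execution's output
        "End": True,
    }


# Hypothetical ARN of an Action state machine.
REGISTER_DOMAIN_ACTION = "arn:aws:states:eu-west-1:123456789012:stateMachine:RegisterDomain"

PROCESS_ITEMS_STATE = {
    "Type": "Map",
    "ItemsPath": "$.items",
    "ResultPath": "$.results",
    "Iterator": {
        "StartAt": "RunAction",
        "States": {"RunAction": action_state(REGISTER_DOMAIN_ACTION)},
    },
    "End": True,
}
```

The Order state machine only refers to the Action by its ARN, so a new integration can be added by deploying a new state machine and pointing an item type at it.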

Step Functions can also call many cloud service integrations directly. For instance, you can read and write data in DynamoDB without writing any code. This makes it increasingly feasible to implement all business logic as nested Step Functions, without necessarily having to write any Lambda functions at all.
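For example, a Task state can write an item to DynamoDB through the built-in putItem integration, with no Lambda function involved. The table name and attribute names in this sketch are hypothetical.

```python
import json

SAVE_ORDER_STATE = {
    "Type": "Task",
    "Resource": "arn:aws:states:::dynamodb:putItem",
    "Parameters": {
        "TableName": "Orders",  # hypothetical table name
        "Item": {
            # The ".$" suffix takes the value from the state input.
            "orderId": {"S.$": "$.orderId"},
            "status": {"S": "COMPLETED"},
        },
    },
    # None serializes to JSON null: discard the DynamoDB response
    # and pass the state's input through unchanged.
    "ResultPath": None,
    "End": True,
}

save_order_json = json.dumps(SAVE_ORDER_STATE, indent=2)
```

The item uses DynamoDB’s attribute value format, so each value is wrapped in a type descriptor such as S for strings.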

Conclusion and some advice

To conclude, here’s some general advice based on our experiences in implementing Step Functions. Your mileage may vary, but we encourage you to consider these points when planning your own implementation.

DO

DON’T

Thanks for reading and we hope this article was useful for you! If you’d like to see how our system works in practice, we warmly welcome you to register your next .fi domain at WebCat.