Yesterday, Amazon had a brief Lambda service outage in the Ireland region, affecting some of Clouden's systems for about 1.5 hours. Since we are currently testing a number of new services, this was a good opportunity to see how they survive outages in cloud infrastructure. We take outages seriously but we also think they are sometimes unavoidable, and we trust Amazon to keep them as short as possible when they do happen. Here is a short technical analysis of what happened from our perspective.
The most important finding for us was that all of Clouden's services came back online automatically after the outage, without any manual intervention. We observed some problems during the outage, when Lambda functions failed to execute, but afterwards everything operated normally again. This is one of the many advantages of our serverless cloud architecture, which does not require server reboots or similar recovery operations.
We did, however, notice room for improvement in some of our systems. During the outage, some internal API calls that we had assumed would always succeed started failing. The caller could not tell an error caused by the outage apart from an HTTP error returned intentionally by the API. We've now improved the error handling in these situations.
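To illustrate the kind of distinction described above, here is a minimal sketch in Python. It is not our actual implementation: the `Response` type, the exception names, and the convention that intentional errors carry an `"error"` field in the JSON body are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


class ApiError(Exception):
    """Error the API returned on purpose (hypothetical name)."""

    def __init__(self, status: int, message: str):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


class InfrastructureError(Exception):
    """The platform failed before the API could answer (hypothetical name)."""


@dataclass
class Response:
    """Simplified stand-in for an HTTP response (assumption for this sketch)."""
    status: int
    body: Optional[dict]


def classify_response(response: Response) -> dict:
    """Return the body on success, otherwise raise a distinguishable error."""
    if 200 <= response.status < 300:
        return response.body or {}
    # Assumption: intentional errors always include an "error" field in the
    # body, so anything else is treated as an infrastructure failure.
    if isinstance(response.body, dict) and "error" in response.body:
        raise ApiError(response.status, str(response.body["error"]))
    raise InfrastructureError(f"unexpected response with status {response.status}")
```

With a split like this, a caller could retry or queue the request on `InfrastructureError` while passing an `ApiError` straight back to the user.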
We also noticed that during the outage, some test users received quite a few email messages as certain systems failed and recovered in rapid succession. We will be improving the automated sending of emails by batching multiple messages into one email when possible, so that users won't be overwhelmed by many individual emails.
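The batching idea can be sketched roughly as follows. This is only an illustration under our own assumptions: the `batch_notifications` function and the digest format are hypothetical, and a real system would also need to decide how long to collect messages before sending.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def batch_notifications(events: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Combine pending (recipient, message) pairs into one email body per recipient."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for recipient, message in events:
        grouped[recipient].append(message)

    digests: Dict[str, str] = {}
    for recipient, messages in grouped.items():
        if len(messages) == 1:
            # A single message is sent unchanged.
            digests[recipient] = messages[0]
        else:
            # Several messages become one digest, in the order they occurred.
            lines = [f"{len(messages)} notifications:"]
            lines += [f"- {m}" for m in messages]
            digests[recipient] = "\n".join(lines)
    return digests


pending = [
    ("a@example.com", "Service X failed"),
    ("a@example.com", "Service X recovered"),
    ("b@example.com", "Service Y failed"),
]
digests = batch_notifications(pending)
```

Here the first recipient would get one email listing both events instead of two separate emails.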
We hope there won't be any more cloud infrastructure outages in the near future, but we are also confident that our systems are well equipped to handle them when they do occur.