On Tuesday, February 7, 2017, Upserve experienced a serious incident that affected many of our Breadcrumb Point of Sale customers that were in service at the time. The underlying system failure started at roughly 21:00 EST, and was resolved by 23:30 EST. While recovering from the service interruption, we also encountered several related issues which complicated both our incident response and the Breadcrumb customer experience.
Purpose of this Document
Following a significant service interruption, it is Upserve’s practice to share a written analysis explaining the root cause of the problem, the steps taken to resolve it, measures to prevent recurrence, and lessons learned. If you have further questions, please contact your dedicated Account Manager, or reach our 24x7x365 support team by calling 888-514-4644, visiting support.breadcrumb.com, or emailing firstname.lastname@example.org.
iPads with the Breadcrumb POS app keep a connection open to a cloud messaging service in order to send data for checks as they change, and to bring back the data for changes made on other iPads at the same location. The cloud messaging service has dependencies on several other components, including the use of Redis as an in-memory message broker for communication with connected iPads. When the connection between the iPad and the messaging service is lost, Breadcrumb goes into Offline Mode. Although certain functionality is not available in offline mode, customers are still able to create and modify checks, and accept payments.
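The client behavior described above can be sketched in pseudocode-style Python. This is an illustrative model, not the actual Breadcrumb app code: the class and method names are hypothetical, and the real app's networking is far more involved.

```python
import queue

class POSClient:
    """Hypothetical sketch of the client-side logic described above: when
    the link to the cloud messaging service drops, the app enters offline
    mode, continues to accept check changes locally, and replays them when
    the connection comes back."""

    def __init__(self):
        self.online = True
        self.pending = queue.Queue()  # check changes awaiting sync

    def connection_lost(self):
        # Entering offline mode: creating/modifying checks and taking
        # payments still work, but cross-iPad sync is paused.
        self.online = False

    def record_change(self, check_update):
        if self.online:
            self.send(check_update)         # push to the message broker
        else:
            self.pending.put(check_update)  # buffer locally until reconnected

    def connection_restored(self):
        self.online = True
        while not self.pending.empty():     # replay buffered changes
            self.send(self.pending.get())

    def send(self, check_update):
        pass  # placeholder for the real network call
```

In this model, offline mode is simply a local buffering state; the trade-off is that changes made on one iPad are not visible on others until the connection is restored.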
Main issue and timeline
The root cause of the incident was degraded network performance of the backend Breadcrumb POS client message broker, a Redis cluster, which caused processing to slow down and back up. The issue was resolved by replacing the system.
At 21:00 EST, Redis experienced degraded message processing throughput even as more data was coming through due to the dinner rush. Not all customers were initially affected because the system remained up and was able to process messages, albeit at the slower rate. Our engineering team did not receive an immediate alert because we had inadequate monitors in place to detect the slower rate, and the problem became worse because iPads impacted by the slowness resent their original messages to the service, thereby flooding the system. Customers who force-quit and restarted the Breadcrumb app were not able to log in.
We became aware of a widespread issue by 21:15 EST; simultaneously, our support team began seeing a huge spike in call volume.
Initially, we believed that the issue was with our payment gateway, so we focused on verifying the health of our payment systems. At 21:22 EST, we determined that our payment systems were healthy, and turned our focus to the cloud messaging service. By 21:38 EST, we were able to confirm unusual system metrics, but were not yet aware of the cause.
A minor change to this service had been deployed earlier in the day, so at 21:45 EST we rolled that change back in order to rule it out as a cause. Redeploying the service initially allowed several stuck clients to come back; however, the symptom returned quickly. At this point we forced all customers who were still connected into offline mode so that they would not be impacted by the more extreme issue while we continued to investigate the failure of the cloud messaging service. By 22:10 EST we confirmed that all customers were disconnected, then allowed several test customers to reconnect and monitored the system health. From 22:23 EST until 22:40 EST, we allowed more customers to connect, moving in batches of hundreds, until thousands of customers were connected; however, we stopped those reconnections as it became clear that the slow message processing had returned and most customers were still better off in offline mode.
At 22:45 EST we were able to isolate the slow message processing that we were seeing to the Redis component. We immediately started bringing up a new Redis instance, and we put the currently connected customers back into offline mode so that they could continue to create checks and take payments. From 23:00 EST, when our new Redis instance was ready, to 23:15 EST, we allowed connections from hundreds and then thousands of customers, while we confirmed that our message processing rate had returned to normal. At 23:20 EST we brought all locations back online, and by 23:30 EST we confirmed that the fix was working for all customers and our message processing rate was back to normal.
In addition to the root cause, several other factors hampered our ability to respond, and caused our customers additional pain.
There were inadequate alarms on expected message processing rates for the critical backend service. Although we would have been alerted if our message broker had stopped functioning altogether, we had little visibility into a state of degraded performance. Such an alarm would have started the engineering investigation more quickly and, more importantly, would have focused our attention on the failing Redis component without having to rule out other possible “red herring” causes for the issue.
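The distinction between a "broker is down" alarm and a "broker is slow" alarm can be illustrated with a minimal sketch. The function, baseline, and thresholds below are hypothetical, not our actual monitoring configuration; the point is that the check compares a recent moving average against an expected rate rather than only testing liveness.

```python
from collections import deque

def degraded_throughput(samples, baseline, window=5, threshold=0.5):
    """Hypothetical sketch of a degraded-performance alarm: fire when the
    recent average message-processing rate falls below a fraction of the
    expected baseline, even though the broker is still up and processing
    messages. `samples` is a list of recent per-minute processed counts."""
    recent = deque(samples, maxlen=window)   # keep only the last N samples
    avg = sum(recent) / len(recent)
    return avg < baseline * threshold

# Broker is up, but processing at roughly 40% of the expected rate:
degraded_throughput([400, 420, 390, 410, 380], baseline=1000)  # fires (True)
```

A liveness-only check would stay green in this scenario, which is exactly the gap described above.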
The message service had no fallback system to handle failure. Instead of having a standby cluster, or a load balancer that automatically routed connections to a healthy system, we had to provision a new system manually during the outage.
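A minimal sketch of the missing failover path, assuming a health check that can distinguish a healthy broker from a degraded one (the function names here are illustrative, not our production tooling):

```python
def choose_broker(primary, standby, ping):
    """Hypothetical sketch of a primary/standby arrangement: a caller-supplied
    health check (`ping`) decides whether new connections go to the primary
    or fail over to the standby, rather than a replacement system being
    provisioned by hand mid-incident."""
    if ping(primary):
        return primary
    if ping(standby):
        return standby  # automatic failover to the warm standby
    raise RuntimeError("no healthy message broker available")
```

In practice the health check would be something like a short-timeout Redis PING, and products such as Redis Sentinel perform this promotion automatically; this sketch only shows the decision the outage forced us to make manually.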
The Breadcrumb app did not go into offline mode automatically as would be expected, causing us to spend time forcing everyone into offline mode. Also, Breadcrumb customers that quit and restarted the app were not able to complete their login at all until the issue was resolved.
A large number of customers called our support line, which was staffed appropriately to achieve our service levels on a typical Tuesday night at this time. Our target service level for support is to answer 80% of inbound support calls within 30 seconds. In the month of January, more than 70% of inbound calls to support achieved this service level. However, on the evening of the service interruption, the inbound call volume exceeded our capacity. Recorded messages did not effectively alleviate customer concerns, and customers were understandably frustrated by excessive hold times in the queue. Some customers were not aware of our website, status.breadcrumb.com, which provides regular updates on issues impacting our service via web, SMS text message and email. Some customers attempted to load “status.breadcrumbpos.com,” which led to a broken link.
What we’re doing
Immediately after the outage was resolved, we improved our alert systems to let us know if the backend POS service message broker enters a state of degraded performance.
We’ve deployed standby Redis instances to reduce the time it takes to switch systems in the event of slowness or failure. In the coming weeks, we’ll evaluate and select from several stronger high-availability data products that include automated switching in the event of failure.
We’ve reproduced the issues that prevented iPads from automatically going into offline mode in this circumstance, and that prevented the app from coming back up after being quit and restarted in offline mode. We have prioritized fixes for these issues, to be included in version 2.7.6 of the Breadcrumb app. We’ve also prioritized collecting detailed metrics about offline mode, to get better ongoing visibility into any issue involving offline mode.
We have also expanded our standard incident response playbooks to address the gaps in issue investigation and resolution that this service interruption exposed.
Our support team has improved the way we use pre-recorded messages in the event of an excessive call queue. In the coming weeks, we’re working with our call center technology provider to implement live announcements on our support line to provide transparency into a caller’s place in the queue, and estimated hold time. While the February incident damaged our service level performance this month, customers should continue to expect that 80 percent of calls will be answered within 30 seconds.
We take the reliability of our service extremely seriously, and we can do much better at preventing downtime for our customers. We continue to place a high degree of scrutiny on the availability of all components that deliver the Breadcrumb Point of Sale service, and on the speed and effectiveness of our response to support inquiries on this mission-critical technology.