On Sunday, February 19, 2017, Upserve experienced a serious incident that affected many Breadcrumb Point of Sale customers in service from 13:40 to 14:35 EST. This incident was caused by a database failure, a completely different backend system from the prior incident on February 7, which was a Redis failure. Even so, the two incidents share several opportunities to improve. Some improvements have been completed already, and more will be completed over the coming two weeks. These improvements will significantly increase overall reliability and reduce the likelihood of any recurrence of these problems.
Purpose of this Document
Following a significant service interruption, it’s Upserve’s practice to share a written analysis explaining the root cause of the problem, the steps taken to resolve it, measures to prevent recurrence, and lessons learned. If you have further questions, please contact your dedicated Account Manager, or reach our 24x7x365 support team by calling 888-514-4644, visiting support.breadcrumb.com or emailing firstname.lastname@example.org.
The Breadcrumb POS iPad app communicates with a backend cloud service that uses a database to persist check changes. When the database is unavailable, Breadcrumb goes into Offline Mode, which allows customers to operate normally to create and modify checks, and accept payments.
Main issue and timeline
On February 19, we experienced two major failures: our primary point of sale database became unresponsive, and the iPad Offline Mode didn’t work as designed. The root issue was resolved by replacing our primary point of sale database with a standby replica.
At 13:40 EST, our point of sale database began experiencing degraded processing capabilities. Our monitors were designed to detect a total database failure and did not detect the degraded performance. Separately, a bug in the Breadcrumb POS iPad app’s Offline Mode handling meant that not all customers entered Offline Mode, because the backend service remained up, though it was not responding properly.
Within 6 minutes, at 13:46 EST, engineers became aware of the issue, as multiple services became unavailable, and our support team quickly received a large number of calls about the issue.
At 14:00 EST, after it became clear that not all customers were automatically going into Offline Mode, we forced the backend service to disconnect clients to put iPads into Offline Mode so that they could continue to create and modify checks and swipe credit cards.
After ruling out several potential causes by 14:15 EST, we turned our attention to the database service where we quickly noticed stuck queries that explained the unresponsiveness of the backend service. At 14:22 EST, after attempts to clear the stuck queries failed and our diagnostic queries were also blocked, we determined the point of sale database was not functional and immediately promoted our standby replica to become the new primary database.
The new database was live at 14:28 EST. After confirming the health of the backend service, we allowed all customers to reconnect 5 minutes later at 14:33 EST. Most were back online by 14:35 EST. Support confirmed that customers’ service had been restored and we were back to normal by 14:46 EST.
Comparison to February 7 incident, and actions we’re taking
The incident on February 7 was caused by a different backend storage system failing, but in both cases iPads remained connected and did not automatically go into Offline Mode. On March 8, we will release Breadcrumb POS version 126.96.36.199, which resolves this bug and adds other improvements to Offline Mode.
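The failure mode common to both incidents, a client that stays connected to a backend that has stopped answering, can be illustrated with a small sketch. This is not the Breadcrumb app’s code: the class name, the 5-second deadline, and the three-response threshold are all assumptions chosen for illustration. The idea is that a live connection alone should not keep a client online; slow or missing responses should count as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OfflineModeDecider:
    """Hypothetical client-side rule for entering Offline Mode."""

    response_deadline_s: float = 5.0  # assumed deadline, not from the source
    max_slow_responses: int = 3       # assumed threshold, not from the source
    slow_streak: int = 0
    offline: bool = False

    def record(self, connected: bool, response_time_s: Optional[float]) -> bool:
        """Record one request outcome; return True if the client should be offline.

        response_time_s is None when no response arrived at all.
        """
        if not connected:
            # A dropped connection is the case that was already handled.
            self.offline = True
            return self.offline

        too_slow = (
            response_time_s is None
            or response_time_s > self.response_deadline_s
        )
        if too_slow:
            # Connected but unresponsive: count it instead of ignoring it.
            self.slow_streak += 1
            if self.slow_streak >= self.max_slow_responses:
                self.offline = True
        else:
            # A timely response resets the streak and brings the client back.
            self.slow_streak = 0
            self.offline = False
        return self.offline
```

Requiring several consecutive slow responses, rather than one, is a design choice that avoids flipping into Offline Mode on a single network hiccup.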
In both incidents there were inadequate alarms on an unresponsive data store. In this case, we should have alarmed when queries to the point of sale database did not return within the expected time, so that our investigation would have begun sooner and started closer to the root cause. Those alarms are now in place.
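As a rough illustration of the kind of alarm described here, the sketch below times a probe query and raises an alarm when it is slow, not only when it fails outright. The function name, the `run_probe` callable, and the threshold are hypothetical. It also assumes the probe enforces its own client-side timeout and raises an exception when that timeout expires, since a stuck query might otherwise never return.

```python
import time
from typing import Callable


def should_alarm(
    run_probe: Callable[[], None],
    expected_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True if the probe query failed or exceeded the expected time.

    run_probe is a hypothetical callable that executes a cheap query and
    raises on error or on its own client-side timeout.
    """
    start = clock()
    try:
        run_probe()
    except Exception:
        # Errors and timeouts both mean the data store is not healthy.
        return True
    elapsed = clock() - start
    # A query that succeeds but is slow indicates degraded performance.
    return elapsed > expected_s
```

Passing the clock as a parameter keeps the check easy to test with fixed timestamps.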
Over the last three days, and in consultation with our infrastructure partner, we audited all systems and dependencies behind Breadcrumb POS, and began a comprehensive reliability initiative to significantly increase always-on availability and improve the speed and effectiveness of our response. By March 3, we will have added monitoring coverage for every gap this audit identified. For services or components of the system without an adequate high availability design, we’ve identified immediate improvements which will be completed by March 10, as well as longer term changes that will simplify our network architecture and remove any single points of failure.
We know Breadcrumb POS is a critical service for our customers, and that no amount of downtime is acceptable. We take our always-on responsibility seriously, and we’re sorry that we let you down. Our entire engineering team is focused on improving Breadcrumb Point of Sale availability, improving our ability to respond effectively when incidents occur, and making our availability more visible to you, our customers. We’re committed to operating the most highly available and reliable restaurant point of sale system in the market.
Related support phone system issues
At the same time the above incident occurred, our third-party phone system partner experienced an outage resulting from a large volume of inbound telephone calls to support.
The week prior to the incident, we had made an improvement to allow callers to hear their place in the hold queue, and the expected hold time to speak with an agent. We made this change to provide more transparency, as our service level expectation is to answer 80% of all inbound support calls within 30 seconds. (We achieved this service level on more than 70% of calls in January.)
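For readers unfamiliar with this kind of metric, the arithmetic behind an "80% of calls answered within 30 seconds" target can be sketched as follows. The function name and the sample wait times are made up for illustration and are not Upserve data.

```python
from typing import Sequence


def service_level(wait_times_s: Sequence[float], threshold_s: float = 30.0) -> float:
    """Fraction of calls answered within threshold_s seconds."""
    if not wait_times_s:
        raise ValueError("at least one call is required")
    answered_in_time = sum(1 for wait in wait_times_s if wait <= threshold_s)
    return answered_in_time / len(wait_times_s)


# Illustrative wait times in seconds (not real data).
sample_waits = [5, 12, 45, 30, 90]
meets_target = service_level(sample_waits) >= 0.80
```

With these sample values, three of the five calls are answered within 30 seconds, so the sample falls short of an 80% target.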
When a major incident occurs, our support team strategy is to quickly turn on recorded messages explaining the issue and recommended actions, to reduce the live agent queue. However, as a result of the recent change to provide hold queue announcements, our recorded message feature was not immediately available.
We also send out frequent status updates during an incident, which can be received automatically via text and email messages. To subscribe to those messages, visit status.breadcrumb.com.
As the call queue grew, our phone system partner began to experience their own outage. This affected not only our support team, but other customers who depend on our phone technology provider. Some callers heard a ‘call cannot be completed’ network error message.
We’ve taken several steps to prevent this phone system problem from recurring:
- First, we resolved the issue limiting our ability to activate recorded messages.
- Second, we selected a vendor and plan to upgrade to a larger phone system next quarter to better support our scale of growth, while also providing more advanced features to customers, such as the ability to receive a callback without having to wait on hold.
- Third, we will soon provide a real-time view into our support performance and call queue metrics on our status website.
We’re committed to delivering on high standards of excellence and providing transparency into our support team’s performance.