Root Cause Analysis (RCA)
Intermittent: May 19–24, 2017
Type: Pulse, LeadManager, DIQ
Several clients reported being unable to click on the Dashboard or Normal View, or experienced latency when transferring calls. Some clients also saw "Service Unavailable" error messages.
After resolution and a meeting to discuss client impact, Velocify leadership received communication from our partner regarding the latency issues noted above.
Some customers observed increased error rates for HTTP requests serving TwiML fetch and status callbacks from Friday, May 19 through Monday, May 29. The request failures were due to network connectivity issues in routes through the affected availability zone in the US East data center. Requests routed through the affected availability zone failed at a 3% higher rate than other zones for TwiML fetch, media fetch, and status callbacks, collectively known as Webhooks.
These HTTP requests pass through internal proxy servers that relay them to external customer application servers. During the incident window, proxies in the affected availability zone experienced a 3% higher failure rate than those in other zones, indicated by a 502 HTTP response status code.
These failed requests never reached the customer application server. Preliminary reports from our network provider indicate that network routes from the affected availability zone to certain external data centers were impacted at a higher rate.
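The relay path described above can be sketched as follows. This is a hypothetical illustration, not our actual proxy code: the zone names, the `deliver_webhook` helper, and the simulated proxies are all invented for the example. It shows the failure mode (a proxy in the affected zone surfacing a 502 because its network route to the customer server is broken) and a simple zone-failover retry.

```python
# Hypothetical sketch of availability-zone failover for Webhook delivery:
# if a proxy in one zone returns 502, retry through a proxy in another zone.
# Proxy behavior is simulated by plain functions; names are illustrative.

def deliver_webhook(payload, proxies):
    """Try each zone's proxy in turn; return the first non-502 response."""
    for zone, proxy in proxies.items():
        status = proxy(payload)
        if status != 502:  # 502 = proxy could not reach the upstream server
            return zone, status
    raise RuntimeError("all availability zones failed")

# Simulated proxies: the affected zone fails, a healthy zone succeeds.
proxies = {
    "us-east-1a": lambda p: 502,   # affected zone: network route broken
    "us-east-1b": lambda p: 200,   # healthy zone reaches the customer server
}

zone, status = deliver_webhook({"CallSid": "CA123"}, proxies)
print(zone, status)  # us-east-1b 200
```

In the real system the proxy call would be an outbound HTTP request; the point of the sketch is that a 502 from one zone's proxy need not fail the whole delivery if another zone still has a working route.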
Engineering evacuated the proxy servers from the affected availability zone to resolve the incident. We will add tooling to help Support teams better identify the rate of Webhook request failures affecting our customers, including analytics that identify anomalies for specific customers by availability zone, IP address, and related attributes. We will also add monitoring and alerting for Webhook requests to detect network anomalies affecting specific availability zones, network address ranges, and destinations.
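The monitoring described above amounts to comparing each zone's Webhook failure rate against the fleet-wide baseline. A minimal sketch, assuming request logs are available as (zone, status) pairs; the function names, sample data, and 3% threshold (chosen to mirror the 3% elevation seen in this incident) are illustrative, not our production alerting logic:

```python
# Hypothetical sketch of per-zone 502-rate monitoring: compute each
# availability zone's failure rate and flag zones whose rate exceeds the
# overall rate by a threshold. Data and zone names are illustrative.
from collections import Counter

def failure_rates(requests):
    """requests: iterable of (zone, status) -> {zone: 502 failure rate}."""
    totals, failures = Counter(), Counter()
    for zone, status in requests:
        totals[zone] += 1
        if status == 502:
            failures[zone] += 1
    return {z: failures[z] / totals[z] for z in totals}

def anomalous_zones(requests, threshold=0.03):
    """Flag zones whose rate is more than `threshold` above the overall rate."""
    rates = failure_rates(requests)
    overall = sum(1 for _, s in requests if s == 502) / len(requests)
    return [z for z, r in rates.items() if r - overall > threshold]

# 1,000 requests per zone: the affected zone fails 8%, the healthy zone 1%.
sample = ([("us-east-1a", 502)] * 80 + [("us-east-1a", 200)] * 920
          + [("us-east-1b", 502)] * 10 + [("us-east-1b", 200)] * 990)
print(anomalous_zones(sample))  # ['us-east-1a']
```

The same grouping could be applied to customer IDs, IP ranges, or destinations to produce the per-attribute anomaly analytics mentioned above.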
We apologize for the impact this incident has had on our customers. Improving the resilience and robustness of our systems is our first priority, and we are applying the lessons learned from these events to implement improvements that will prevent failures of this kind from impacting our customers in the future.