All times Pacific Daylight Time
10/09/2017 8:30 am – 10:30 am
Clients reported being unable to log in to LeadManager and/or experiencing extreme slowness while navigating through the application.
Internal system monitors showed that LeadManager Core Services were down.
Details / Root Cause:
New code supporting a feature was deployed as part of the 17.9 release. This code was written to check every lead for specific information used in the interface. Because it called LeadManager Core Services once per lead, it placed an excessive load on those services. The resulting volume temporarily slowed them down, leaving users unable to log in or experiencing slowness when navigating within the UI.
Engineering changed the 17.9 code to disable this call to Core Services and released a hotfix, which was deployed to Production environments.
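To illustrate the load pattern described above, the following is a minimal sketch rather than the actual LeadManager code. All names (`CoreServiceStub`, `build_lead_view`, `check_core_per_lead`) are hypothetical, and the on/off switch is only an assumption about how such a call could be disabled. It shows how one call per lead makes the request volume grow with the number of leads, and how disabling the call removes that load.

```python
from typing import Dict, List, Optional


class CoreServiceStub:
    """Hypothetical stand-in for a core service; it only counts requests."""

    def __init__(self) -> None:
        self.request_count = 0

    def lookup(self, lead_id: int) -> str:
        # Each call represents one request hitting the core service.
        self.request_count += 1
        return f"info-{lead_id}"


def build_lead_view(
    core: CoreServiceStub,
    lead_ids: List[int],
    check_core_per_lead: bool = True,  # hypothetical switch, not a real setting
) -> List[Dict[str, Optional[object]]]:
    """Build interface rows, optionally calling the core service once per lead."""
    rows: List[Dict[str, Optional[object]]] = []
    for lead_id in lead_ids:
        # With the switch on, N leads produce N core-service requests.
        extra = core.lookup(lead_id) if check_core_per_lead else None
        rows.append({"lead_id": lead_id, "extra": extra})
    return rows


if __name__ == "__main__":
    leads = list(range(500))

    busy_core = CoreServiceStub()
    build_lead_view(busy_core, leads, check_core_per_lead=True)
    print("requests with per-lead check:", busy_core.request_count)

    quiet_core = CoreServiceStub()
    build_lead_view(quiet_core, leads, check_core_per_lead=False)
    print("requests with check disabled:", quiet_core.request_count)
```

In this sketch a single page showing 500 leads generates 500 requests, which multiplies again by the number of concurrent users.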
- Engineering currently follows code review best practices during the Software Development Lifecycle (SDLC), but the overall impact of the original code was not foreseen. Best practices have been expanded to include more rigorous use cases covering the system-wide implications of code changes.
- Engineering teams added performance counters and logging that provide more data and earlier alerts for situations like this one.
- Add appropriate Production monitors and alerts for Core and other services. These alerts are designed to notify engineering teams before any user impact occurs.
- Update runbooks to address situations like this one, to minimize time to resolution and eliminate user impact as quickly as possible.
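
As a rough illustration of the kind of early alert described in the items above, the sketch below tracks recent response times in a sliding window and flags when the average exceeds a threshold. It is not the monitoring actually deployed; the class name `LatencyMonitor`, the window size, and the threshold values are all assumptions chosen for the example.

```python
from collections import deque
from typing import Deque


class LatencyMonitor:
    """Hypothetical sliding-window monitor for service response times."""

    def __init__(
        self,
        window_size: int = 100,       # number of recent samples kept (assumed)
        threshold_ms: float = 500.0,  # alert when the average exceeds this (assumed)
        min_samples: int = 10,        # avoid alerting on too little data (assumed)
    ) -> None:
        self.samples: Deque[float] = deque(maxlen=window_size)
        self.threshold_ms = threshold_ms
        self.min_samples = min_samples

    def record(self, latency_ms: float) -> None:
        # The deque drops the oldest sample automatically once it is full.
        self.samples.append(latency_ms)

    def average_ms(self) -> float:
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def should_alert(self) -> bool:
        # Alert only once enough samples exist and the average is too high.
        if len(self.samples) < self.min_samples:
            return False
        return self.average_ms() > self.threshold_ms


if __name__ == "__main__":
    monitor = LatencyMonitor(window_size=5, threshold_ms=200.0, min_samples=3)
    for latency in (100.0, 100.0, 100.0, 900.0, 900.0):
        monitor.record(latency)
        print(latency, monitor.should_alert())
```

Alerting on a windowed average rather than on single slow requests is one way to catch a sustained slowdown early while ignoring isolated spikes.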