Provation Apex is currently experiencing technical difficulties
Incident Report for Provation
Postmortem

Provation Root Cause Analysis for 5/3/21 Apex Service Disruptions

What Caused Outage #1? - Four Provation Apex databases unavailable

During routine database maintenance, a change to the database configuration file completed successfully on all but four production customer databases. The four failures were caused by a bug within Microsoft’s code. Due to the nature of the bug, it was not exposed until a database was restarted, which happened on 5/3/21 at 6:54 AM CDT, after routine database maintenance was initiated on all Provation Apex databases. During normal maintenance operations, databases are restarted using a phased approach, which normally incurs no downtime. In this case, once the restart occurred, the four databases affected by the bug experienced a service disruption. In an attempt to resolve this issue, Microsoft’s on-call engineer manually revised the configuration file.

What Caused Outage #2? - Multiple Provation Apex databases unavailable

While mitigating the incorrect configuration file, Microsoft’s on-call engineer accidentally introduced an extra hidden space/character into the configuration file. This caused periodic service disruptions on a rolling basis, beginning on 5/3/21 at 12:03 PM CDT. Once Microsoft’s on-call engineer identified and fixed the corrupted configuration file, all databases were back online and available at 1:55 PM CDT.

Repair Items & Closing

To prevent any similar issues:

Microsoft has:

  • Identified a code fix and deployed it across the Azure platform to avert future issues with the configuration file during database maintenance for all Microsoft Azure customers
  • Created an alert to proactively detect when the integrity of the configuration file is compromised
  • Removed the manual step from configuration file updates to avoid operator-introduced syntax errors
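
To illustrate the kind of integrity check described above, the following is a minimal sketch in Python of scanning a text configuration file for hidden characters such as the one behind Outage #2. It is not Microsoft’s actual alert, whose implementation is not described in this report. The function name, the allow-list and the sample settings are all hypothetical.

```python
import unicodedata

# Assumption: the only whitespace a healthy config file should contain.
ALLOWED_WHITESPACE = {" ", "\t", "\r"}

# Unicode categories that are usually invisible in an editor:
# Cc = control, Cf = format (zero-width space, BOM), Zs/Zl/Zp = separators.
SUSPICIOUS_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}


def find_hidden_characters(text):
    """Return a list of (line, column, description) for suspicious characters."""
    findings = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch in ALLOWED_WHITESPACE:
                continue
            if unicodedata.category(ch) in SUSPICIOUS_CATEGORIES:
                # Control characters have no Unicode name, so fall back to the code point.
                name = unicodedata.name(ch, f"U+{ord(ch):04X}")
                findings.append((line_no, col_no, name))
        # Plain trailing spaces or tabs are also easy to miss by eye.
        body = line.rstrip("\r")
        trimmed = body.rstrip(" \t")
        if trimmed != body:
            findings.append((line_no, len(trimmed) + 1, "TRAILING WHITESPACE"))
    return findings


if __name__ == "__main__":
    # Hypothetical config content with a zero-width space and a trailing space.
    sample = "max_connections=100\nlog_level=info\u200b\ntimeout=30 \n"
    for line_no, col_no, description in find_hidden_characters(sample):
        print(f"line {line_no}, column {col_no}: {description}")
```

A check along these lines could run before a configuration change is applied, so that a file with unexpected characters is rejected rather than discovered at the next restart.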

Provation has:

  • Set up additional auditing and alerting of Microsoft support, maintenance and database activity
  • Leveraged enhanced Microsoft alerting of database connection failures and Azure resource health
  • Created a new internal tool to build upon Microsoft’s enhanced database alerting, whereby a backend process periodically logs into databases and validates connectivity to each database
  • Committed its Engineering teams to ongoing research and analysis to ensure Apex maintains the highest level of availability and resiliency
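
The connectivity validation mentioned in the list above could look roughly like the following Python sketch. It is an illustration only, not Provation’s internal tool: the function names, the 60-second interval and the use of `SELECT 1` as the probe query are assumptions, and `sqlite3` stands in for the real database driver so that the example is self-contained.

```python
import sqlite3
import time


def probe(connect):
    """Open a connection, run a trivial query, and return (ok, detail)."""
    try:
        conn = connect()
        try:
            cur = conn.cursor()
            cur.execute("SELECT 1")
            cur.fetchone()
        finally:
            conn.close()
        return True, "ok"
    except Exception as exc:  # any failure is treated as "unreachable"
        return False, f"{type(exc).__name__}: {exc}"


def check_all(connectors):
    """Probe every database once; return {name: (ok, detail)}."""
    return {name: probe(connect) for name, connect in connectors.items()}


def run_forever(connectors, interval_seconds=60, alert=print):
    """Probe on a fixed interval and raise an alert for each failure."""
    while True:
        for name, (ok, detail) in check_all(connectors).items():
            if not ok:
                alert(f"database {name} failed connectivity check: {detail}")
        time.sleep(interval_seconds)


if __name__ == "__main__":
    def unreachable():
        # Simulates a database that refuses connections.
        raise ConnectionError("refused")

    # Hypothetical inventory: name -> zero-argument function returning a connection.
    demo = {
        "customer-a": lambda: sqlite3.connect(":memory:"),
        "customer-b": unreachable,
    }
    print(check_all(demo))
```

Probing each database individually matters here because both outages affected only a subset of databases, which a single platform-level health signal might not surface.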

We apologize for the impact this had on our customers. We remain committed to our purpose: To empower Providers to deliver quality healthcare for all. This means providing you with a great user experience with Provation Apex.

Posted May 20, 2021 - 23:21 CDT

Resolved
This incident has been resolved.
Posted May 03, 2021 - 13:55 CDT
Update
We are actively investigating the issue with Microsoft Azure engineers. More information will be posted as it becomes available.
Posted May 03, 2021 - 13:22 CDT
Investigating
Provation Apex is currently experiencing a partial outage, resulting in some users finding it unreachable. Investigation is underway and we will post an update as soon as we know more.

Click the "Subscribe to Updates" button on this page to get email updates sent to your inbox whenever a change is made to this page.
Posted May 03, 2021 - 12:03 CDT
This incident affected: Provation Apex.