Partial outage - errors saving while in procedure documentation
Incident Report for Provation
Postmortem

Postmortem: Sporadic Error Saving Notes & Printing Issues

Incident Summary

On March 4 at 11:37 CST, Apex customers began experiencing sporadic errors when saving notes, along with printing issues. Investigation revealed that 3 of the 4 Apex instances were not successfully processing requests with larger payloads. All Apex instances were cleared, and the issue was resolved at 13:35 CST.

Root Cause

The root cause was a lack of available disk space on certain Apex instances.

Detailed Analysis

  1. Disk Space Shortage:
* The lack of available disk space was identified as the primary issue.
* Apex instances were unable to process larger payloads due to insufficient disk space.
* This degraded overall system performance and caused sporadic errors for users.
  2. Excessive Log Files:
* Further investigation revealed that log files were consuming a significant amount of disk space.
* These log files were not being deleted frequently enough, which led to the disk space shortage.
* Increasing Apex traffic accelerated the accumulation of log files.
  3. Log File Management:
* The team had not adjusted the log file deletion frequency to match the increased Apex traffic.
* As a result, log files were not being purged at an acceptable rate.
* No alerting mechanism existed to warn the team when available disk space ran low.
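
The report does not say how disk usage was measured during the investigation. As an illustration only, the following minimal Python sketch shows one way to check how much space log files occupy and how much of the disk is still free. The function names and the `*.log*` file pattern are assumptions made for this example, not details of the Apex environment.

```python
import shutil
from pathlib import Path


def log_dir_bytes(log_dir: str, pattern: str = "*.log*") -> int:
    """Total size in bytes of files under log_dir whose names match pattern.

    The "*.log*" pattern is an assumption; it matches names such as
    app.log and app.log.1, and subdirectories are searched recursively.
    """
    return sum(
        p.stat().st_size
        for p in Path(log_dir).rglob(pattern)
        if p.is_file()
    )


def free_fraction(path: str) -> float:
    """Fraction (0.0 to 1.0) of the filesystem holding path that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total
```

Comparing `log_dir_bytes(...)` with the total reported by `shutil.disk_usage(...)` gives a quick indication of whether logs are the dominant consumer of space on an instance.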

Corrective Actions

  1. Immediate Disk Space Cleanup:
* The team performed an emergency cleanup to free up disk space on affected Apex instances.
* Old log files were removed to alleviate the shortage.
  2. Log Rotation and Deletion Strategy:
* A log rotation and deletion strategy was implemented.
* Log files are now rotated and deleted at regular intervals based on traffic patterns.
* The deletion frequency is adjusted dynamically to accommodate increased traffic.
  3. Alerting System Enhancement:
* An alerting system was set up to notify the team when disk space reaches critical levels.
* Alerts are triggered based on predefined thresholds to prevent future incidents.
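
The report does not name the tooling behind these corrective actions. The sketch below is a hedged illustration of the two ideas, age-based log deletion and threshold-based alerting, using only the Python standard library. The function names, the `*.log*` pattern, and the 20% and 10% thresholds are hypothetical values chosen for the example.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def purge_old_logs(log_dir, max_age_days, now=None):
    """Delete log files in log_dir last modified more than max_age_days ago.

    Returns the list of deleted paths. The "*.log*" pattern is an
    assumption; only the top level of log_dir is scanned.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    deleted = []
    for p in sorted(Path(log_dir).glob("*.log*")):
        if p.is_file() and p.stat().st_mtime < cutoff:
            p.unlink()
            deleted.append(p)
    return deleted


def alert_level(free_fraction, warn_below=0.20, critical_below=0.10):
    """Map the free-space fraction to an alert level.

    The default thresholds are illustrative, not the values used in
    the actual alerting system.
    """
    if free_fraction < critical_below:
        return "critical"
    if free_fraction < warn_below:
        return "warning"
    return "ok"
```

A scheduled job could call `purge_old_logs` with a shorter `max_age_days` when `alert_level` reports a warning, which is one simple way to approximate a deletion frequency that adapts to traffic.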

Preventive Measures

  1. Capacity Planning:
* Regular capacity planning exercises will be conducted to anticipate resource needs.
* Disk space requirements will be reviewed and adjusted as necessary.
  2. Automated Log Management:
* Automated log management tools will be evaluated to ensure timely rotation and deletion.
* Log file sizes will be monitored regularly, and retention policies adjusted accordingly.
  3. Documentation and Training:
* The log management process will be documented and team members trained on it.
* Training will ensure that everyone understands the importance of disk space management.
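
As a rough illustration of the capacity planning arithmetic, the Python sketch below estimates how long the remaining disk space would last at a given log growth rate, and how many days of logs fit within a fixed space budget. The function names and all figures in the usage note are hypothetical, not measurements from the incident.

```python
import math

GIB = 1024 ** 3


def days_until_full(free_bytes, daily_growth_bytes):
    """Days until the disk fills, assuming constant daily growth.

    Returns infinity when usage is not growing.
    """
    if daily_growth_bytes <= 0:
        return math.inf
    return free_bytes / daily_growth_bytes


def retention_days_for_budget(budget_bytes, daily_log_bytes):
    """Whole days of logs that fit inside a fixed disk budget."""
    if daily_log_bytes <= 0:
        raise ValueError("daily_log_bytes must be positive")
    return math.floor(budget_bytes / daily_log_bytes)
```

For example, with 40 GiB free and logs growing by 5 GiB per day, `days_until_full(40 * GIB, 5 * GIB)` returns `8.0`, and a 20 GiB log budget at the same rate allows `retention_days_for_budget(20 * GIB, 5 * GIB)` = `4` days of retention.
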
Posted Mar 21, 2024 - 18:01 CDT

Resolved
This incident has been resolved.
Posted Mar 04, 2024 - 13:35 CST
Update
Apex fully functional.
Posted Mar 04, 2024 - 12:39 CST
Update
We are continuing to investigate the issue.
Posted Mar 04, 2024 - 11:43 CST
Investigating
Currently investigating
Posted Mar 04, 2024 - 11:38 CST
This incident affected: Provation Apex.