Postmortem: Sporadic Errors Saving Notes & Printing Issues
Incident Summary
On March 7th at 09:19 CST, Apex customers began experiencing sporadic errors when saving notes, along with printing issues. Investigation revealed that 1 of the 4 Apex instances was not successfully processing larger payloads. All Apex instances were cleared, and the issue was resolved at 10:00 CST.
Root Cause
The root cause was a lack of available disk space on certain Apex instances.
Detailed Analysis
* The lack of available disk space was identified as the primary issue.
* Apex instances were unable to process larger payloads due to insufficient disk space.
* This impacted the overall system performance and caused sporadic errors for users.
* Further investigation revealed that log files were consuming a significant amount of disk space.
* These log files were not being deleted frequently enough, leading to the disk space shortage.
* The increasing Apex traffic contributed to the accumulation of log files.
* The team had not adjusted the log file deletion frequency based on the increased Apex traffic.
* As a result, log files were not being purged at an acceptable rate.
* No alerting mechanism existed to warn the team when available disk space ran low.
Corrective Actions
* The team performed an emergency cleanup to free up disk space on affected Apex instances.
* Old log files were removed to alleviate the shortage.
* A log rotation and deletion strategy was implemented.
* Log files are now rotated and deleted at regular intervals based on traffic patterns.
* The deletion frequency is adjusted dynamically to accommodate increased traffic.
* An alerting system was set up to notify the team when disk space reaches critical levels.
* Alerts are triggered based on predefined thresholds to prevent future incidents.
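
The log deletion and threshold-based alerting described above can be illustrated with a minimal Python sketch. It is not the team's actual implementation. The 80% and 90% thresholds, the `*.log` file pattern, and the function names are all assumptions made for illustration.

```python
import shutil
import time
from pathlib import Path


def classify_disk_usage(used_fraction, warn_at=0.80, critical_at=0.90):
    """Map a used-space fraction (0.0 to 1.0) to an alert level.

    The default thresholds are illustrative assumptions, not the
    values configured in production.
    """
    if used_fraction >= critical_at:
        return "critical"
    if used_fraction >= warn_at:
        return "warning"
    return "ok"


def check_disk(path="/"):
    """Return (alert level, used fraction) for the filesystem containing path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    used_fraction = usage.used / usage.total
    return classify_disk_usage(used_fraction), used_fraction


def purge_old_logs(log_dir, max_age_days, now=None):
    """Delete *.log files in log_dir last modified more than max_age_days ago.

    Returns the list of deleted paths. The "*.log" pattern is a
    hypothetical naming convention.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    deleted = []
    for entry in Path(log_dir).glob("*.log"):
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            entry.unlink()
            deleted.append(entry)
    return deleted
```

A scheduled job could call `check_disk` on the volume that holds the logs, notify the team on a "warning" or "critical" result, and run `purge_old_logs` with a retention period suited to current traffic.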
Preventive Measures
* Regular capacity planning exercises will be conducted to anticipate resource needs.
* Disk space requirements will be reviewed and adjusted as necessary.
* Automated log management tools will be evaluated to ensure timely rotation and deletion.
* Log file sizes will be monitored regularly, and retention policies will be adjusted accordingly.
* The log management process will be documented, and team members will be trained on it.
* All team members will be made aware of the importance of disk space management.
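
One way to connect log size monitoring to retention policy is sketched below in Python. The idea is to measure how much space the logs occupy and derive how many days of logs fit within a fixed disk budget. The function names, the budget concept, and the 1 to 30 day clamp are assumptions for illustration and do not come from the incident itself.

```python
from pathlib import Path


def log_dir_size_bytes(log_dir):
    """Total size in bytes of all regular files under log_dir (recursive)."""
    return sum(p.stat().st_size for p in Path(log_dir).rglob("*") if p.is_file())


def retention_days(daily_log_bytes, budget_bytes, min_days=1, max_days=30):
    """Days of logs that fit in budget_bytes at the current daily volume.

    The result is clamped to [min_days, max_days]. Both bounds are
    hypothetical policy limits.
    """
    if daily_log_bytes <= 0:
        return max_days  # no measurable log growth: keep the maximum
    days = budget_bytes // daily_log_bytes
    return max(min_days, min(max_days, int(days)))
```

As an illustrative calculation, with a 20 GiB log budget, a daily volume of 2 GiB gives `retention_days(2 * 2**30, 20 * 2**30) == 10`. If traffic doubles the daily volume to 4 GiB, the same call returns 5, so retention shrinks as traffic grows rather than staying fixed.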