An earlier post showed the recovery from a critical alert about Nutanix cluster storage. What can get you into a situation where your Nutanix cluster is not able to recover from a node failure?
Of course, there can be several reasons for this behaviour.
- A node failure (or disk failure) can leave too little available storage to rebuild resiliency. When this is the reason, you should seriously consider expanding your cluster.
- Too many workloads are deployed on the cluster and it can no longer handle the increased demand. Expanding the cluster may be a good option here as well, but keep in mind that it can take a few weeks for the new hardware to arrive at your site.
Recently, one of my customers had to deal with another cause: external components in your IT infrastructure can also have an impact on your Nutanix infrastructure. The customer was struggling with the storage for their backup jobs. Due to a lack of available storage on the backup target, jobs could not be completed. Because of the failed jobs, the backup software was not able to release the recovery points created by earlier backup runs, as the retention rules did not allow it.
The result looked like the picture below: almost 50 percent of the used capacity was occupied by snapshots.
Solving this is probably not something you want to do very often. You have to interfere with backup schedules and retention policies, and I ended up deleting some jobs I normally would have kept. Freeing space on the backup storage was not that easy either: I tried to respect the running jobs, but the freed space was filled with new backup data almost immediately. In the end I was able to safely delete some snapshots from the data protection solution.
Besides that, I cleaned up the protection policies (removing retired VMs from synchronization).
When checking the garbage report with ‘curator_cli display_garbage_report’ I could see the effect of my actions: almost 9 TB should be freed by the Nutanix cluster.
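For reference, this is roughly how I ran the report. It is executed from one of the Controller VMs; the IP address below is just a placeholder for your own CVM.

```
# SSH to one of the Controller VMs (the IP is only an example)
ssh nutanix@10.0.0.10

# Show how much reclaimable garbage Curator has identified
curator_cli display_garbage_report
```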
It is around 11PM right now. I will check back tomorrow morning for the results.
After a good night's sleep
Sometimes it is wise to just let the system run its Curator tasks. About 9 hours later I checked the cluster again, and it was quite a relief.
Storage resiliency was re-established, with more to come: there is still garbage to be cleaned up by the system.
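If you prefer to verify this from the command line instead of Prism, something along the lines of the sketch below should work. The ncli syntax is from memory, so double-check it against your AOS version before relying on it.

```
# Check the cluster's fault tolerance at the node level
# (ncli syntax from memory, verify for your AOS version)
ncli cluster get-domain-fault-tolerance-status type=node

# Re-run the garbage report to see how much is still pending cleanup
curator_cli display_garbage_report
```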