Nutanix & Resiliency: saving the day

Sometimes, I am really happy with the resiliency built into the Nutanix NCI solution. Storage usage only increases, and sometimes customers come very close to the resilient capacity. This should be no problem, but when performing maintenance using LCM, it can happen that a host does not return to the cluster due to a failed firmware update.

A host that does not return to the cluster, combined with high storage utilization, gives nasty results, as shown in the picture below. As you may know, the vertical line marks the resilient capacity.
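To make the idea of resilient capacity a bit more concrete, here is a minimal sketch in Python. It assumes a simplified model in which the resilient capacity is the total capacity minus the largest node(s) the cluster must be able to lose. The real Nutanix calculation also accounts for replication factor, failure domain and reserved overhead, so treat this as a rough mental model only. The function names and the node sizes are hypothetical.

```python
def resilient_capacity(node_capacities_tib, failures_to_tolerate=1):
    """Simplified model: capacity left after losing the largest node(s).

    This is an assumption for illustration, not Nutanix's exact formula.
    """
    if failures_to_tolerate >= len(node_capacities_tib):
        raise ValueError("cannot tolerate the loss of every node")
    # Assume the worst case: the biggest nodes are the ones that fail.
    largest = sorted(node_capacities_tib, reverse=True)[:failures_to_tolerate]
    return sum(node_capacities_tib) - sum(largest)


def can_rebuild_after_failure(used_tib, node_capacities_tib, failures_to_tolerate=1):
    """True if the used data still fits on the surviving nodes."""
    return used_tib <= resilient_capacity(node_capacities_tib, failures_to_tolerate)


# Hypothetical four-node cluster with 20 TiB of usable capacity per node.
nodes = [20.0, 20.0, 20.0, 20.0]
print(resilient_capacity(nodes))               # 60.0
print(can_rebuild_after_failure(55.0, nodes))  # True
print(can_rebuild_after_failure(65.0, nodes))  # False
```

In this model, a cluster that sits above the line has nowhere to put the rebuilt copies when a host stays away, which is the situation described above.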

If the affected host is away for a long time (multiple hours), this space is not reclaimed immediately once it comes back. While the host was gone, Nutanix made sure that all data was again stored twice on the remaining nodes. When the node returns to the cluster, the data on this node is therefore stored three times on the cluster. Curator detects this and, as a background process, removes the extra copies so that the space becomes available again.
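The sequence described above (rebuild while the node is away, extra copies when it returns, cleanup in the background) can be illustrated with a small toy model. This is only a sketch of the behaviour as described in this post, assuming two copies of every piece of data. It is not how Curator is actually implemented, and all names in it are hypothetical.

```python
RF = 2  # assumed replication factor: two copies of every extent


def rebuild(extents, failed_node, healthy_nodes):
    """Re-create copies lost on failed_node on other healthy nodes.

    The stale copy on the failed node stays in the list, because the data
    is still on its disks while the node is away.
    """
    for replicas in extents.values():
        live = [n for n in replicas if n != failed_node]
        candidates = [n for n in healthy_nodes if n not in replicas]
        while len(live) < RF and candidates:
            new_node = candidates.pop(0)
            replicas.append(new_node)
            live.append(new_node)


def trim(extents, returned_node):
    """Remove copies above RF, preferring the copy on the returned node."""
    reclaimed = 0
    for replicas in extents.values():
        while len(replicas) > RF:
            victim = returned_node if returned_node in replicas else replicas[-1]
            replicas.remove(victim)
            reclaimed += 1
    return reclaimed


# Hypothetical cluster with nodes A-D; node B fails during an update.
extents = {"e1": ["A", "B"], "e2": ["B", "C"], "e3": ["C", "D"]}
rebuild(extents, failed_node="B", healthy_nodes=["A", "C", "D"])

# Node B returns: the extents it held now exist three times.
print(sum(len(r) for r in extents.values()))  # 8 copies instead of 6
print(trim(extents, returned_node="B"))       # 2 copies reclaimed
print(extents)
```

The gap between the two counts is the space that stays occupied until the background cleanup has run, which is why the capacity only recovers gradually.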

The left picture below shows the status at the moment the node came back online. The screenshot on the right shows the status one hour later. You can see that the available capacity increased by almost 1 TB. It can take a couple of hours before the resiliency status is no longer critical.

A few hours later, the data resiliency showed OK. Finally, we can tolerate a node failure again.

So, the cluster was in a critical state for a couple of hours. However, all user VMs were unaffected by this. All virtual machines remained highly available and there was no performance impact. This kind of resiliency is a real value of the Nutanix Cloud Infrastructure.

Donders.IT
