Thursday, October 8, 2020

Troubleshooting Proxmox pve-cluster down

Out of the blue, one of my Proxmox nodes went offline in the web UI. All the VMs on that node were still running, though, and the node was reachable via ssh, so the network was not the issue. After searching around, I learned that the cause was pve-cluster being down on that node, which made the node invisible to the rest of the cluster.

Looking around for clues, I found that /etc/pve was empty, and "ls /etc/pve" returned "transport endpoint is not connected". journalctl showed a matching error from pmxcfs, as below:

"host1 pmxcfs[29698]: [main] crit: fuse_mount error: Transport endpoint is not connected"
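
To make this check repeatable, here is a minimal shell sketch that probes a directory and reports whether it shows the same "transport endpoint is not connected" symptom. The function names is_stale_mount_msg and check_mount are my own hypothetical helpers, not part of Proxmox, and the sketch only assumes that ls prints the same message I saw above.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when an error message matches the
# "Transport endpoint is not connected" symptom (case-insensitive).
is_stale_mount_msg() {
    printf '%s\n' "$1" | grep -qi 'transport endpoint is not connected'
}

# Hypothetical helper: probes a directory with ls and prints one of
# "stale", "error", or "ok" depending on what ls writes to stderr.
check_mount() {
    dir=$1
    # Capture stderr only; LC_ALL=C keeps the message in English.
    err=$(LC_ALL=C ls "$dir" 2>&1 >/dev/null)
    if is_stale_mount_msg "$err"; then
        echo "stale"
    elif [ -n "$err" ]; then
        echo "error"
    else
        echo "ok"
    fi
}
```

On the affected node, "check_mount /etc/pve" should print "stale" if the symptom matches mine. The pmxcfs messages themselves can be pulled from the journal with "journalctl -u pve-cluster".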

After searching around some more, I found out that the mount behind /etc/pve was gone, and that I needed to remount /etc/pve to get pve-cluster running again.

Normal and forced unmounts did not work. Thanks to this post, I found that a lazy unmount did. So in order to restart pve-cluster, I had to lazy unmount /etc/pve first, then restart pve-cluster.

# umount -l /etc/pve

# systemctl restart pve-cluster

# systemctl status pve-cluster

...

Active: active 

...


And my node is now back online in the Proxmox web UI.
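
For next time, the steps above can be wrapped in a small script. This is only a sketch of what worked for me: the run and recover_pve_cluster functions and the DRY_RUN variable are hypothetical names of my own, and the script assumes the stale /etc/pve mount is the only problem. With DRY_RUN=1 it prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: repeat the manual recovery steps in order and
# stop at the first one that fails.
recover_pve_cluster() {
    # Lazy unmount, since normal and forced unmounts did not work for me.
    run umount -l /etc/pve || return 1
    # Restarting the service remounts /etc/pve.
    run systemctl restart pve-cluster || return 1
    # Prints "active" and returns 0 once the service is up.
    run systemctl is-active pve-cluster
}
```

Run it as root on the affected node by calling recover_pve_cluster, or set DRY_RUN=1 first to preview the commands.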



