Johnny Zhang
There are many useful log files on a ESX server to help you find what cause a problem. However, there may be to many of them how do you know which one to look at? I will base on my own experiences to explain what I think is important. I will start with HA logs. HA is configured on vCenter. However, once configured it will function without vCenter. It will build it's own HA domain (This domain always use the name 'vmware'). If you have more than 5 servers in a HA cluster, there will be 5 primaries and the rest will be secondary. The log file location after ESX 3.5 U2 is at /var/log/vmware/aam directory
The most impportant log would be "vmware_server_name.log". You can find many HA related info or errors from here.
Even with 5 primaries, there will be only one cluster manager which is the one holding all the rules. You can find which one is the cluster manager by

less vmware_server_name.log | grep -i "VMWareClusterManager submitted to run on node "| uniq
You can see how HA changed the cluster manager from host to host, and the last record is the current manager.
You can also find out if there was a isolation event happened by
less vmware_bs-bcs-h132.log | grep -i ISOLATED
When HA detect heartbeat failure on one host, it will try to find out if this is a agent failure or host failure. The isolation event will only trigger when there is a host failure. HA does this by ping the host after the heartbeat is gone
less vmware_bs-bcs-h132.log | grep -i "Ping Node results:"
When the host replys the ping it will mark as ALIVE. When the host is not responding it will mark as DEAD and HA event will trigger. (More to come in the future)
1 Response
  1. Anonymous Says:

    thanks you verymuch, it was helpful to me.

    SureshK


Post a Comment