Johnny Zhang
I was working with a BCS customer on a case. The customer originally planned to extend a Windows 2003 VM's OS disk to a larger size. However, after extending the VMDK file with the VI Client and extending the NTFS volume inside Windows (since this was a Windows 2003 guest, the customer needed a helper VM to do the job), the VM failed to power on, and the customer found a shocking fact: there was a snapshot. The customer then tried to shrink the VMDK file back to the original size, and a number of other things were tried as well. So by the time we started to work on this issue, the VMDK file was broken in many different ways, along with the snapshot. To make things worse, the last and only backup was created right after the extension of the VMDK file.

The end of the story: we did recover all the data on the VMDK. The VMDK was not good enough to power on as a boot disk, but that was good enough to save the day. Here is how we did it:

First, list the size of the flat file by running "ls -l" in the VM directory.
Keep in mind that the number we need is the size of the flat file. In my test setup it's 8589934592. We need this number later. Now we need to open the VMDK descriptor file (the pointer), in my case snap_test.vmdk:
Note that we need the CID and the RW number.
The RW number is the flat disk size divided by 512.
In my case 8589934592 / 512 = 16777216. If this number is wrong, you will not be able to power on the VM. (So we fix this part first.)
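The calculation above can be scripted so the number is not mistyped. This is only a minimal sketch: the function name rw_sectors is made up for illustration, and it assumes 512-byte sectors as in the arithmetic above.

```sh
#!/bin/sh
# Convert a flat-file size in bytes (from "ls -l") into the sector count
# that belongs on the descriptor's RW line. Assumes 512-byte sectors.
rw_sectors() {
    bytes=$1
    # A flat file whose size is not a whole number of sectors is suspicious.
    if [ $((bytes % 512)) -ne 0 ]; then
        echo "size is not a multiple of 512" >&2
        return 1
    fi
    echo $((bytes / 512))
}

rw_sectors 8589934592   # prints 16777216
```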

Now let's open the snapshot VMDK file. snap-test-000001.vmdk

In this file the "parentCID" should be the same as the CID in snap-test.vmdk, in my case 4cc5f033. If this is wrong, the VM will not power on. (We fixed this one as well.) The RW size will also be wrong, since it only records the original size. Change it to the same size as the parent RW (16777216).
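A quick way to confirm the CID chain before powering anything on is to compare the two descriptor files. The sketch below assumes the descriptors contain plain "CID=..." and "parentCID=..." lines; the function names are hypothetical helpers, not VMware tools.

```sh
#!/bin/sh
# Print the value of a "key=value" line from a VMDK descriptor file.
# $1 = key (CID or parentCID), $2 = descriptor file
descriptor_field() {
    sed -n "s/^$1=//p" "$2" | head -n 1
}

# Succeed only if the snapshot's parentCID equals the base disk's CID.
# $1 = base descriptor, $2 = snapshot descriptor
check_cid_chain() {
    base_cid=$(descriptor_field CID "$1")
    parent_cid=$(descriptor_field parentCID "$2")
    if [ -n "$base_cid" ] && [ "$base_cid" = "$parent_cid" ]; then
        echo "OK: parentCID matches CID $base_cid"
    else
        echo "MISMATCH: base CID=$base_cid, snapshot parentCID=$parent_cid"
        return 1
    fi
}
```

Call it as check_cid_chain followed by the base descriptor and the snapshot descriptor. It only reads the files; any fix is still made by hand in an editor.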
Now we need a helper VM with this disk attached as a second VMDK (not the boot disk). When the OS boots up, it will automatically run check disk against the disk and fix the errors. You should now be able to mount the disk inside the helper VM and copy the data over.
Johnny Zhang
In vCenter 4.0 we now have some easy-to-use monitoring tools for ESX server performance. One of the charts is memory usage.
This usage is NOT the active memory currently used by the ESX server or the VM, so when you see very high usage, don't panic.
In fact, this chart shows the total "Consumed Host Memory" of all the VMs running on the host.
As we can see from the graph, the "Consumed Host Memory" is 2277.00MB and the "Active Guest Memory" is only 122.00MB. Why such a big difference?

This is by design: the "Consumed Host Memory" shows the highest amount of memory used by that VM.
An OS such as Windows will touch all the memory assigned to it during startup, so most likely you will see that the consumed memory is close to, or a little more than, the memory assigned to the VM (the overhead to power on the VM is also added here). The ESX server will NOT take the memory back if there is no resource contention on the server. The reason is that once the VM needs more memory, it can access it faster than by sending a request to the vmkernel again.
If you want to know the real-time memory usage, you can find it in the advanced chart.
Johnny Zhang
Note: This tip is based on communications between engineering and myself. In most cases there is no need to make any change; this is just for you to know. (It is also based only on VI3; I'm not sure if there are changes in vSphere.)

Sometimes we see ESX hosts disconnected from vCenter. From the vpxa log files:

['App' 7644 error] [VpxdVmomi] Got vmacore exception: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

This error is common in environments where hosts are under load.

There could be multiple causes for this kind of connection request failure:

1. Network glitch.
2. The proxy server on hostd is not able to respond to the request due to stress.
3. The hostd process is busy or has crashed at the time the request arrives.

vCenter has a host connection pool that stores idle connections between the ESX server and vCenter. So when it needs to connect to the server, vCenter does not have to create the connections all over again (which is where we can hit the issues above). The default pool size is 20. You can increase this (for example, to 50).

You need to add to vpxd.cfg files:


Personally, I believe this will help in a large setup by reducing the connection requests. Please note that active connections are not kept in this pool, so in most cases 20 is good enough.
Johnny Zhang
vSphere changes the way the console OS is stored. It now runs from a VMDK file; it's a real virtual machine. There are times you may want to know where the root VMDK file for your console OS is. Here are two fast ways to locate it.

1. From /proc/vmware: "rootFsVMDKPath" under /proc/vmware will give you the location of your root VMDK

2. Use of "vsd" command: You can use "vsd -g" to list your root VMDK file.
Johnny Zhang
Please keep in mind that cleaning up events is not recommended. Do it ONLY when you are running out of space because both the vpx_event_arg and vpx_event tables are filled with many GB of data.
Always shut down the vCenter service before you do anything with the database.
You can check the space used by each table:

********************************************
use "name_of_the_vcdb"
CREATE TABLE #TemptableforspaceUsed
(name SYSNAME,
rows bigINT,
reserved VARCHAR(20),
data VARCHAR(20),
index_size VARCHAR(20),
unused VARCHAR(20))
GO
INSERT #TemptableforspaceUsed
EXEC sp_MSforeachtable 'sp_spaceused "?"'

select *
from #TemptableforspaceUsed

drop table #TemptableforspaceUsed

******************************************

You can now clean the event tables.

use "name_of_the_vcdb"
truncate table vpx_event_arg;
delete from vpx_event;

Because of the foreign keys on vpx_event, you cannot truncate it and have to use "delete" instead. You must remove this data before starting the vCenter service; the service might not start if some data is left with missing links. Please also keep in mind that "delete" always logs every change, so you will see the transaction log grow, sometimes hugely, depending on the amount of data you have.
Johnny Zhang
Many of us believe "vmware -v" shows the current version of your ESX server (with all the patches applied). This is actually not true. The command only shows the version of one component.
In ESX 3.x "vmware -v" will show you the version number of "VMware-esx-vmx"
And in ESX 4.x "vmware -v" will show you the version of "vmware-esx-vmware-release"

In addition, vCenter will show the version of your "vmware-hostd"
Please keep in mind those numbers can all be different.
Johnny Zhang
You always want to know more about your servers, your network settings, and your storage options. "esxcfg-info" is the best place to dig things up. Want to know more about your storage? Type:
"esxcfg-info -s | less -RS"
You can check your VMFS alignment here: when the starting sector is set to 128, you know your VMFS is aligned correctly (VMFS aligned on a 64KB boundary).
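The reason 128 is the magic number is simple arithmetic: 128 sectors x 512 bytes = 65536 bytes = 64KB. A small sketch of that check, assuming 512-byte sectors (the function name is hypothetical):

```sh
#!/bin/sh
# Report whether a partition's starting sector falls on a 64KB boundary.
# Assumes 512-byte sectors: 128 * 512 = 65536 bytes = 64KB.
is_aligned_64k() {
    start_sector=$1
    offset_bytes=$((start_sector * 512))
    if [ $((offset_bytes % 65536)) -eq 0 ]; then
        echo "aligned"
    else
        echo "not aligned"
    fi
}

is_aligned_64k 128   # prints: aligned
is_aligned_64k 63    # prints: not aligned
```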
You can also map the LUN back to your service console device. The output also shows the type of file system on the LUN; in this case fb is VMFS.
Johnny Zhang
There are times we want to find out which drivers are configured to load when the VMkernel boots.
You can find this out by typing:
"esxcfg-module -q"
You can also check which drivers are currently loaded:
"esxcfg-module -l"
Johnny Zhang
Every time you create a VMFS datastore, a copy of the VMFS metadata is written to your LUN with the following information:
  • Block size
  • Number of extents
  • Volume capacity
  • VMFS version
  • Label
  • VMFS UUID
You can use "vmkfstools -P -h /vmfs/volumes/LUN_Label" to query the file system information.

Note: information based on Infrastructure 3 DSA Manual
Johnny Zhang
It takes only a few clicks to get your vCenter tables and relations into a Visio file. (I'm using Visio 2007; I think older versions will work the same way. Tested on Visio 2007 and SQL 2005 with vCenter 4.0.)

Open Visio 2007 --> File --> New --> Software and Database --> Database Model Diagram (Metric). This will open a new file for you.
Now click on Database --> Reverse Engineer

Pick "Microsoft SQL Server" from Installed Visio drivers
Now click on New --> System Data Source (Applies to this machine only) --> Next --> pick "SQL Server" for the data source driver --> Next --> Finish

Now you will see the New Data Source box.
Give it a name, and the name/IP of your vCenter database server (the normal way you set up your ODBC connection to your vCenter database). Make sure you change the database to your vCenter database.
Now you should see the ODBC connection that you just created --> click Next. It will ask for the connection password. Type it in and click OK.
You might not want views or stored procedures, so you can uncheck them.
Click Next and select all tables.
Pick "Yes, add the shapes to the current page." A few more clicks and you should have your vCenter tables in your Visio file.
Johnny Zhang
You can do a quick check on your ESX server right after it powers on. This log is a little different from the one we know (/var/log/messages). It is also called the messages log, and it is located at /var/log/initrdlogs/messages

tail -n 30 messages (check the last 30 lines of the messages log)
We can see that all VMkernel modules were loaded and the system was "(forcing normal run)". Then we enter the service console boot process.
Johnny Zhang
You can find the members of the HA cluster in the vmware-sites file. This file is located in the /etc/opt/vmware/aam/ directory.
[root@bs-bcs-h132 aam]# cat vmware-sites
FULLTIME_SITES_TID 00000030
+ 1:8042,8042,8043 bs-bcs-h132 vmware #FT_Agent_Port=8045
+ 2:8042,8042,8043 bs-bcs-h133 vmware

This file lists all members of the HA cluster and the ports used. One helpful thing here is the "+" sign, which means the host is currently in the HA cluster. If you see a "-" sign, the host is not connected to the HA cluster (either it is powered off or the agent is not running).
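If the cluster is large, the membership state can be pulled out of the file with awk. This is a sketch only: it assumes every member line has the layout shown in the sample above (sign, id:ports, host name), including lines marked with "-", which the sample does not show. The function name is hypothetical.

```sh
#!/bin/sh
# Summarize HA membership from a vmware-sites file.
# $1 = path to the file (on a host: /etc/opt/vmware/aam/vmware-sites)
# Assumed layout per member line: <+|-> <id:ports> <hostname> ...
ha_members() {
    awk '$1 == "+" { print $3, "in cluster" }
         $1 == "-" { print $3, "not connected" }' "$1"
}
```

For the sample file above, ha_members would print one "in cluster" line each for bs-bcs-h132 and bs-bcs-h133; the header line is ignored because it starts with neither sign.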

NicInfo is a file in the same directory. It lists all the available NICs that the HA cluster can see and use.
[root@bs-bcs-h132 aam]# cat NicInfo
interface { ipaddress:10.21.49.132 subnetmask:255.255.252.0 }
Johnny Zhang
Normally, when you hit a VMotion issue, or a 64-bit VM cannot be powered on from your 64-bit ESX server, you might be asked to reboot the host and check in the BIOS whether HV (Hardware Virtualization) is enabled. You can check this without rebooting your server:
esxcfg-info | grep -i "hv support"

It will return a number from 0 to 3:
  • 0 is Not present

  • 1 is Not supported

  • 2 is disabled

  • 3 is enabled

So in my case, my server supports HV but it is disabled in the BIOS.
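For scripts that check many hosts, the number can be translated back into the meanings listed above. A minimal sketch; the function name is made up, and it takes only the number because the exact layout of the esxcfg-info output line is not shown here.

```sh
#!/bin/sh
# Translate the "HV Support" value (0 to 3) into its meaning.
hv_status() {
    case "$1" in
        0) echo "not present" ;;
        1) echo "not supported" ;;
        2) echo "disabled" ;;
        3) echo "enabled" ;;
        *) echo "unknown value: $1"; return 1 ;;
    esac
}

hv_status 2   # prints: disabled
```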
Johnny Zhang
Sometimes a hardware device just won't work on an ESX server. There are many possible reasons for that. We'd like to first find out whether the device is supported and the right device driver is loaded. You can check the HCL online, but what if you are in front of the ESX server and it does not have an Internet connection?
You can check the device against the vmware-devices.map file. This file is located in the /etc/vmware directory on your ESX server.
Let's use a NIC on my server as an example:
esxcfg-nics -l
We can see the device is Broadcom Corporation NetXtreme II 5706 Gigabit Ethernet and the driver is bnx2.
Now we will check this against the vmware-devices.map file:
grep -i "bnx2" vmware-devices.map
We can see the line "device,0x14e4,0x164a,nic,NetXtreme II 5706 Gigabit Ethernet,bnx2.o". So we know the device is supported and the loaded driver is correct. Now we need to look elsewhere for the problem.
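The map lines are comma-separated, so awk can list every device a given driver covers. This is a sketch based only on the sample line above: the column meanings (PCI vendor ID, PCI device ID, class, name, driver) are inferred from that one line, the function name is hypothetical, and a device name containing a comma would break the parsing.

```sh
#!/bin/sh
# List devices in a vmware-devices.map style file that use a given driver.
# $1 = driver name without the .o suffix, $2 = path to the map file
# Assumed columns: device,<vendor id>,<device id>,<class>,<name>,<driver>.o
devices_for_driver() {
    awk -F, -v drv="$1.o" '$1 == "device" && $6 == drv {
        printf "%s:%s %s (%s)\n", $2, $3, $5, $4
    }' "$2"
}
```

Running devices_for_driver bnx2 /etc/vmware/vmware-devices.map on the sample line would print "0x14e4:0x164a NetXtreme II 5706 Gigabit Ethernet (nic)".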
Johnny Zhang
I think we have all spent time searching around, trying to understand why we can't power on a VM within the HA cluster and how many of those so-called slots each host has. The HA calculation is a chapter of its own. The good news is that in vCenter 4.0 you no longer need to calculate everything yourself: the "Advanced Runtime Info" for HA gives you that information.
Just click on the cluster --> Summary --> in the HA section, click on "Advanced Runtime Info".

Johnny Zhang
vCenter has Web Access, which allows administrators to manage vCenter through the browser; it also accepts API calls to vCenter. The web server behind it is Tomcat. You don't need to worry about the Tomcat settings since VMware ensures it will never break! :) OK, if you really want to take a look under the hood, you can!
You will need to find the file:
"C:\Program Files\VMware\Infrastructure\VirtualCenter Server\tomcat\conf\tomcat-users.xml"

and add:

Now you need to restart the "VMware Infrastructure Web Access" service.
By default Tomcat uses port 8086.
You can now open a browser and point it to
http://server_name:8086/
Click on "Tomcat Manager".
Type in your user name and password. In my case it is "admin"/"test".
You are now logged into Tomcat Manager.

Note: This will only work on VI3, since vSphere uses a different version of Tomcat.


Johnny Zhang
vMA (vSphere Management Assistant) is a very powerful tool that allows you to centralize your ESX/ESXi command access. You can download it from here, or you can import it from your vCenter (please note that vMA will only run on a 64-bit host).
Deploying from URL
1. In the vSphere Client, choose Virtual Appliance > Deploy.
2. When prompted by the wizard, click Deploy from URL and enter the following URL: http://www.vmware.com/go/importvma/vma4.ovf
You can deploy this like any other VM. Once it is deployed, you can log in as "vi-admin".
The first step is to add the hosts that you want to manage with vMA:
sudo vifp addserver server_name

In my case I added two ESX 4.0 and two ESX 3.5 hosts to vMA. We can list those servers with:
sudo vifp listservers

Now we can pick the server we want to manage:
vifpinit server_name
and now we can pass commands to that host.
We will talk more about vMA later.
Johnny Zhang
Each ESX server keeps a VMotion history going back to when it first booted. You can find it with:
cat /proc/vmware/migration/history
It shows whether the VM was migrated to this host or from this host. In my case VM1244 was migrated from this host to host 10.21.51.133. This IP is the VMotion IP address. The VMotion id is a very important part: both hosts involved in the VMotion will have the same VMotion id, so you can track the VM from one host to another by using this id.
For example, my VMotion id is 1253064812716632, so I can search for it in the vmkernel logs on both hosts.
On the source:
grep 1253064812716632 vmkernel
From the log we can see:
src ip = <10.21.51.132> dest ip = <10.21.51.133> Dest wid = 1111. The source always shows the VMotion IP addresses of both the source and destination hosts. It also shows the new world ID for the VM on the new host, in this case 1111.
On destination:
grep 1253064812716632 vmkernel
On destination we see:
src ip = <10.21.51.132> dest ip = <0.0.0.0> Dest wid = -1 using SHARED swap
Note that on the destination, the destination IP always shows 0.0.0.0 and the world id -1. This is another way for us to tell that this is the destination host.
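That rule is easy to apply automatically when sorting through grep output from several hosts. A minimal sketch, assuming the log lines contain the "dest ip = <...>" text exactly as shown above; the function name is hypothetical.

```sh
#!/bin/sh
# Guess which side of a VMotion a vmkernel log line came from,
# based on the "dest ip" field: 0.0.0.0 means the destination host.
vmotion_role() {
    case "$1" in
        *"dest ip = <0.0.0.0>"*) echo "destination" ;;
        *"dest ip = <"*)         echo "source" ;;
        *)                       echo "unknown" ;;
    esac
}

vmotion_role "src ip = <10.21.51.132> dest ip = <10.21.51.133> Dest wid = 1111"
# prints: source
vmotion_role "src ip = <10.21.51.132> dest ip = <0.0.0.0> Dest wid = -1 using SHARED swap"
# prints: destination
```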
We will talk more about how to track VMotion from other files.
Johnny Zhang
Not everyone will find this useful, but it's just cool to know that we can do this!
Yes, you can use a VNC client to connect to your VM without installing a VNC server on it. (Please note that not all VNC clients seem to connect, but I tested TightVNC and it works: http://www.tightvnc.com/download.html) All you need to do is add some lines to your .vmx file:

RemoteDisplay.vnc.enabled = "True"
RemoteDisplay.vnc.port = "7001"
RemoteDisplay.vnc.password = "test"

You can use any port and password.

When ready, connect to your ESX server using "esx_server_name_or_ip:port".
The port is the one you put into your .vmx file; in my case it is 7001.
Click on "Connect" and it will ask for the password. In my case it is "test".

Click on "OK".
You are now connected to the VM.
Johnny Zhang
In vSphere the new vDS can make your life much easier. However, if anything goes wrong during configuration, you lose all your networking (including your management network). If you still have iLO or KVM access, you can follow these steps to get access back:

Step 1: Log on to the ESX host.

Step 2: Create a new temporary vSS (tmpSwitch) and Port Group (vswifPg)
esxcfg-vswitch -a tmpSwitch
esxcfg-vswitch -A vswifPg tmpSwitch

Step 3: Move uplink from vDS to vSS
esxcfg-vswitch -l (to get DVSwitch, DVPort, and vmnic names)
esxcfg-vswitch -Q vmnic0 -V (unlink vmnic0 from vDS)
esxcfg-vswitch -L vmnic0 tmpSwitch (link to vswitch)

Step 4: Move vswif from vDS to vSS
esxcfg-vswif -l (get vswif IP address, netmask, dvPort id, etc.)
esxcfg-vswif -d vswif0
esxcfg-vswif -a vswif0 -i <ip_address> -n <netmask> -p vswifPg

Check or edit the default gateway address by editing /etc/sysconfig/network, or add a default gateway address with:
route add default gw <gateway_ip>
Johnny Zhang
In ESX 3.5, the CPU scheduler logically partitions a host's physical CPUs into cells. By default there are 4 cores per cell, for scalability reasons. The scheduler can make decisions locally within a cell without affecting other cells. However, with the introduction of 6-core CPUs, this may lead to some cells spanning sockets, and you might experience performance issues when VMs are using those cells. If you are using those CPUs, you can change the cell size from either the VI Client or the command line.
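To see why a cell size of 4 is a problem on 6-core sockets, it helps to lay the cells out. The sketch below uses a simplified model that is an assumption, not confirmed scheduler behavior: cores are numbered consecutively socket by socket, and cells are filled with consecutive cores. The function name is hypothetical.

```sh
#!/bin/sh
# List the cells that straddle a socket boundary under a simplified model.
# $1 = cores per socket, $2 = cell size, $3 = total cores in the host
spanning_cells() {
    cores_per_socket=$1
    cell_size=$2
    total_cores=$3
    cell=0
    first=0
    while [ "$first" -lt "$total_cores" ]; do
        last=$((first + cell_size - 1))
        if [ "$last" -ge "$total_cores" ]; then
            last=$((total_cores - 1))
        fi
        # A cell spans sockets when its first and last core sit on different sockets.
        if [ $((first / cores_per_socket)) -ne $((last / cores_per_socket)) ]; then
            echo "cell $cell (cores $first-$last) spans sockets"
        fi
        cell=$((cell + 1))
        first=$((first + cell_size))
    done
}

spanning_cells 6 4 12   # prints: cell 1 (cores 4-7) spans sockets
```

With the cell size set to 6 (spanning_cells 6 6 12) nothing is printed, because every cell then fits inside one socket.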

Using the VI Client:

  1. Select the Configuration tab in the VI client.
  2. Select Advanced Settings.
  3. Select VMkernel.
  4. In the right pane, locate VMkernel.Boot.cpuCellSize .
  5. Change the value to 6 .
This will take effect the next time the ESX host is rebooted.

From the command line interface (on classic ESX):
1. Enter:
esxcfg-advcfg --set-kernel 6 cpuCellSize
2. Reboot the ESX host.

From the remote command line interface (on ESXi):
1. Enter:
vicfg-advcfg --set-kernel 6 cpuCellSize
2. Reboot the ESXi host.

Note: ESX 4.0 uses per-pCPU locks instead of the cell scheduler, so you will not see the socket-spanning performance issue on ESX 4.0.
Johnny Zhang
By default, users can try to log into a Linux server (or in this case an ESX server) as many times as they want. Someone can sit there all day trying to crack the password, or just write a script and let it do the trick. You can change this behavior by adding the following lines to /etc/pam.d/system-auth:

auth required /lib/security/pam_tally.so no_magic_root
account required /lib/security/pam_tally.so deny=3 no_magic_root

This will lock out the user after 3 failed attempts.
(Keep in mind you might want to allow more than 3 attempts before locking users out, just in case you forget your password.)

You can also set up a log to monitor this.

To create the file for logging failed login attempts, execute the following commands:
touch /var/log/faillog
chown root:root /var/log/faillog
chmod 600 /var/log/faillog

Note: This will only work with VI3, since PAM on Red Hat 5 (which the ESX 4.x service console is based on) does not work with those options.

Johnny Zhang
There are times you just want to find where all those logs are. (Don't we all wish all the logs were in one centralized location?)

Here is the list of default log locations for vCenter and some of the plug-ins:

vCenter
C:\Documents and Settings\All Users\Application Data\VMware\VMware VirtualCenter\Logs\*

Web server component of VirtualCenter
C:\Program Files\VMware\Infrastructure\VirtualCenter Server\tomcat\logs\*

License manager (for a vCenter that manages ESX 3.x servers)
C:\WINDOWS\Temp\lmgrd.log

VMware Update Manager
C:\Documents and Settings\All Users\Application Data\VMware\VMware Update Manager\Logs\*

VMware Update Manager UMDS
C:\Windows\Temp\vmware-downloadservice-log4cpp

VMware Enterprise Converter
C:\Documents and Settings\All Users\Application Data\VMware\VMware Converter Enterprise\Logs\*

VMware Guided Consolidation
C:\Documents and Settings\All Users\Application Data\VMware\VMware Capacity Planner\Logs\*

VI Client
C:\Documents and Settings\<username>\Local Settings\Application Data\VMware
Johnny Zhang
If you are on an ESX host service console (ESX 3.x or ESX 4.0) and want a quick summary of the datastores this host can see, but either don't have access to the VI Client or are simply too lazy to launch it, you can get it by typing:
vmware-vim-cmd hostsvc/datastore/listsummary

You can get more information about a specific datastore with:
vmware-vim-cmd hostsvc/datastore/info datastore_name
Johnny Zhang
Some guest OS processes can send information to ESX or ESXi hosts via VMware Tools. These are called setinfo messages, and they are written to the VM's configuration file, known as the .vmx file. If you want to restrict this, you can add isolation.tools.setinfo.disable = "True" to your .vmx file.
Please keep in mind that after you set this, you will not see the VM's IP address, DNS name, etc. from either the ESX server or vCenter, since that information will no longer be passed to the host through VMware Tools.
Johnny Zhang
Anyone who works with ESX servers knows that sometimes you need to restart the hostd service to "refresh" information on ESX. Normally we do this by typing "service mgmt-vmware restart".
However, sometimes this is not good enough. Every time you restart hostd, it maps to 4 files under
/var/lib/vmware/hostd/stats.
If any of those files is corrupted for any reason, restarting hostd will not help. Every time hostd is restarted, the service checks whether those files exist, and if not, it creates them before starting the service. So we can remove those files to make sure the hostd service starts from scratch:
"rm -rf /var/lib/vmware/hostd/stats/*" Make sure you are in the right directory when you run the rm command, especially with the -rf switch.
Johnny Zhang
There are many useful log files on an ESX server to help you find what caused a problem. However, there may be too many of them, so how do you know which one to look at? I will explain what I think is important, based on my own experience. I will start with the HA logs. HA is configured in vCenter; however, once configured, it functions without vCenter. It builds its own HA domain (this domain always uses the name 'vmware'). If you have more than 5 servers in an HA cluster, there will be 5 primaries and the rest will be secondaries. After ESX 3.5 U2 the log files are located in the /var/log/vmware/aam directory.
The most important log is "vmware_server_name.log". You can find a lot of HA-related information and errors there.
Even with 5 primaries, there is only one cluster manager, which is the one holding all the rules. You can find which one is the cluster manager with:

less vmware_server_name.log | grep -i "VMWareClusterManager submitted to run on node "| uniq
You can see how HA moved the cluster manager from host to host; the last record is the current manager.
You can also find out whether an isolation event happened with:
less vmware_bs-bcs-h132.log | grep -i ISOLATED
When HA detects a heartbeat failure on a host, it tries to find out whether this is an agent failure or a host failure. The isolation event is only triggered when there is a host failure. HA does this by pinging the host after the heartbeat is gone:
less vmware_bs-bcs-h132.log | grep -i "Ping Node results:"
When the host replies to the ping, it is marked as ALIVE. When the host does not respond, it is marked as DEAD and an HA event is triggered. (More to come in the future.)
Johnny Zhang
Since ESX 3.5 U2, all versions of ESX support the Remote CLI. The vifs tool in the Remote CLI is useful when you need to get log files from a remote host.

C:\Program Files\VMware\VMware VI Remote CLI\bin>vifs.pl -D /host --server bs-bcs-h132.bsl.vmware.com --username root

Once you know what is there, you might want to get some of those files. Let's say I want to get hostd.log from this server and put the file under my c:/ directory. I can use:

C:\Program Files\VMware\VMware VI Remote CLI\bin>vifs.pl -g /host/hostd.log c:/ --server bs-bcs-h132.bsl.vmware.com --username root