Performing Basic Troubleshooting for ESXi Hosts
Your ESXi hosts are the most important physical resources in your virtual datacenter. They provide the platform upon which all the VMs are supported and from which they obtain their resources. When there is a problem with an ESXi host, that problem will likely affect many VMs.
In this section, I will begin by identifying general troubleshooting guidelines for ESXi hosts. Then I will discuss troubleshooting common installation issues and how you should avoid them. I will continue by discussing the ongoing monitoring of the health of your ESXi host. Finally, I will discuss how you can export diagnostic information to examine for yourself and especially to send to the VMware Technical Support Team.
Identifying General ESXi Host Troubleshooting Guidelines
Your vSphere is unique, just as everyone’s vSphere is unique, but there are some guidelines that you can follow to effectively troubleshoot your ESXi hosts. You can use these general guidelines to determine more specific steps for your own organization. The following sections document some basic troubleshooting guidelines for ESXi.
Learn How to Access Support Mode
Tech Support Mode (TSM) consists of a command-line interface that you can use to troubleshoot abnormalities on ESXi hosts. You can access it by logging in to the Direct Console User Interface (DCUI) or by logging in remotely using Secure Shell (SSH). It is provided by VMware specifically for the purpose of troubleshooting issues that cannot be resolved through more conventional means, such as the vSphere Client, vCLI, or PowerCLI. It is generally used with the assistance of the VMware Technical Support Team.
To enable TSM from the DCUI, follow the steps in Activity 6-1.
Activity 6-1 Enabling TSM from the DCUI
- Access the DCUI of your ESXi host.
- Press F2 and enter your username and password, and then press F2 again to proceed, as shown in Figure 6-1.
Figure 6-1 Logging On to the DCUI
- Scroll to Troubleshooting Options, as shown in Figure 6-2, and press Enter.
Figure 6-2 Selecting Troubleshooting Options
- Select Enable ESXi Shell and press Enter. The panel on the right should now show that ESXi Shell Is Enabled, as shown in Figure 6-3.
Figure 6-3 Enabling ESXi Shell
- Select Enable SSH and press Enter to also enable remote TSM through SSH, and then view the panel on the right to confirm the change.
- Optionally, you can configure a timeout to enhance security if the logged-in user should walk away. To enable a timeout, select Modify ESXi Shell Timeout, press Enter, and configure your desired timeout value, as shown in Figure 6-4.
Figure 6-4 Modifying ESXi Shell Timeout
- Press Esc three times to return to the main DCUI screen.
You can also enable TSM from the security profile of your host in the vSphere Web Client. To illustrate how these are tied together, I am going to demonstrate that TSM is now enabled and then show how to disable it from the vSphere Web Client. To access the settings of the security profile of your ESXi host, follow the steps outlined in Activity 6-2.
Activity 6-2 Configuring TSM from the vSphere Client
- Log on to your vSphere Web Client and select Hosts and Clusters.
- Select the host on which you want to configure TSM and (if necessary) open the Summary tab. Note the warnings that SSH and the ESXi Shell are enabled, as shown in Figure 6-5.
Figure 6-5 Confirming That SSH and ESXi Shell Are Enabled
- Click the Manage tab, then the Settings tab, and select Security Profile. Scroll down to Services and note that the services of SSH and ESXi Shell are listed, which indicates that they can be controlled from here. Select Edit and then ESXi Shell; then click Stop, as shown in Figure 6-6. (You should also change the startup policy to Start and Stop Manually.)
Figure 6-6 Configuring the ESXi Shell and SSH Services
- Select SSH, click Stop, and then click OK.
- Click the Summary tab for the host and note that the warnings are no longer there.
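After enabling or stopping the SSH service, you might want a quick check from your workstation that the change took effect. The following is a minimal sketch, assuming that SSH on the host listens on the default TCP port 22; the function name `tcp_port_open` is hypothetical and is not part of any VMware tool.

```python
import socket


def tcp_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: treat all of these as "not open".
        return False
```

Calling `tcp_port_open("esxi01.example.com")` with your own host's name or IP address (the name here is only a placeholder) should return True while SSH is running and False after you stop the service. A result of False can also mean that a firewall is blocking the port, so treat it as a hint rather than proof.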
Know How to Retrieve Logs
One thing that computers and networking components are good at is keeping track of what has happened to them, who or what made it happen, and when it happened. This information is stored in logs. Although there is generally no need for you to understand all the verbose information that is in every log, it is important that you know where to find logs and how to export them when needed. In this section, I will explore where you can access logs for your most essential vSphere components.
There are two locations on your ESXi hosts from which you can access logs: your DCUI and your vSphere Web Client. As I said before, it’s not essential that you understand all the information in the log, but what’s important is your ability to access it when working with a VMware Support person. I will briefly describe how to access logs in each of these locations.
To access the logs from your DCUI, you should access your host’s DCUI and then select View System Logs. From this screen, you can select from six different logs, as shown in Figure 6-7.
- Syslog: Logs messages from the VMkernel and other system components to local files or to the remote host
- VMkernel: Used to determine uptime and availability statistics
- Config: Potentially useful in the case of a host hang, crash, or authentication issue
- Management Agent (hostd): Logs specific to the host services that connect your vSphere Client to your ESXi host
- Virtualcenter Agent (vpxa): Additional logs that appear when your ESXi host is connected to and managed by a vCenter
- VMware ESXi Observation Log (vobd): Logs changes to the configuration of your host and their result
Figure 6-7 Viewing Logs on the DCUI
You can view each of these logs by simply pressing the number associated with it. For example, you can view the VMkernel log by pressing 2. Figure 6-8 is an example of a VMkernel log. When you are finished viewing the log, press Q to return to the previous screen.
Figure 6-8 Viewing the VMkernel Log
To access your host’s logs using your vSphere Web Client, log on to your host (not your vCenter). You can log on to your host using its hostname or IP address. After you log on to your vSphere Web Client, click your host, then click Monitor, and finally click Log Browser, where you can view hostd, VMkernel, and shell logs, as well as others as shown in Figure 6-9.
Figure 6-9 Viewing Logs on a Single Host
Troubleshooting Common Installation Issues
For your hosts to function well in your vCenter, you must first install them properly. As discussed in Chapter 1, “Planning, Installing, Configuring, and Upgrading vCenter Server and VMware ESXi,” there are many different ways to install the software for an ESXi host, including interactive installation, USB key, scripted, or even loaded directly into the memory of the host. That makes this objective a very broad one indeed. With that in mind, I will list three of the most common installation issues and how you should address them.
Troubleshooting Boot Order
If you are installing ESXi, you might need to reconfigure BIOS settings. The boot configuration in BIOS is likely to be set to CD-ROM and then ordered by the list of drives available in your computer. You can change this setting by reconfiguring the boot order in BIOS or by selecting a boot device from the boot selection menu. If you change this in the BIOS, it will affect all subsequent boots. If you change it in the boot selection menu, it will affect only the current boot.
Troubleshooting License Assignment
Suppose you have a vSphere key that allows for 16 processors. Now, suppose that you attempt to install that key on a host that has 32 processors. You might assume that the key would install but only enable the host to use the processors covered by the key. In fact, you will not be able to install the key on that host. In addition, you will not be able to install license keys that do not cover all the features that you have enabled for a host (for example, DRS, Host Profile, fault tolerance, and so on). To address the issue, you should do one of the following:
- Obtain and assign the appropriate key with a larger capacity.
- Upgrade your license edition to cover the features that you are using on your host.
- Disable the features that are not covered by the key that you are attempting to assign.
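The assignment rule described above can be sketched as a simple check. This is only an illustrative model, not VMware's licensing logic; the `LicenseKey` and `Host` types, their field names, and the feature strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LicenseKey:
    """Hypothetical model of a key: processor capacity plus covered features."""
    cpu_capacity: int
    features: frozenset = field(default_factory=frozenset)


@dataclass(frozen=True)
class Host:
    """Hypothetical model of a host: processor count plus enabled features."""
    cpu_count: int
    enabled_features: frozenset = field(default_factory=frozenset)


def assignment_problems(key: LicenseKey, host: Host) -> list:
    """Return the reasons the key cannot be assigned (empty list means it can)."""
    problems = []
    # The key must cover every processor in the host; partial coverage fails.
    if host.cpu_count > key.cpu_capacity:
        problems.append(
            f"key covers {key.cpu_capacity} processors, host has {host.cpu_count}"
        )
    # Every feature enabled on the host must be covered by the key.
    uncovered = sorted(host.enabled_features - key.features)
    if uncovered:
        problems.append("features not covered by key: " + ", ".join(uncovered))
    return problems
```

Each entry in the returned list maps to one of the remedies in the list above: a capacity problem calls for a larger key, and a feature problem calls for either an upgraded edition or disabling the uncovered features.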
Troubleshooting Plug-Ins
As you might know, plug-ins are used in the vCenter, so it might seem unusual to discuss them under this heading. However, if you think about it, the services to the VMs are actually provided by the hosts and are only controlled by the vCenter. In addition, plug-ins that fail to enable can be frustrating, so troubleshooting them warrants discussion here.
In cases where plug-ins are not working, you have several troubleshooting options. You should first understand that plug-ins that run on the Tomcat server have extension.xml files that contain the URL of the application that can be accessed by the plug-in. These files are located in C:\Program Files\VMware\Infrastructure\VirtualCenter Server\extensions. If your vCenter Server and your vSphere Web Client are not on the same domain, or if the hostname of the plug-in server is changed, the clients will not be able to access the URL, and then the plug-in will not enable. You can address this issue by replacing the hostname in the extension file with the IP address of the plug-in server.
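The hostname-to-IP substitution described above amounts to rewriting one part of a URL. The following sketch shows that rewrite on a URL string only; it does not parse or modify an extension.xml file, and the function name `replace_host` and the sample URL are hypothetical. It assumes an IPv4 address.

```python
from urllib.parse import urlsplit, urlunsplit


def replace_host(url: str, ip_address: str) -> str:
    """Return url with its hostname swapped for ip_address, keeping any port.

    Any user:password@ prefix in the original URL is dropped.
    """
    parts = urlsplit(url)
    # Preserve an explicit port if the original URL had one.
    netloc = ip_address if parts.port is None else f"{ip_address}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, `replace_host("https://plugin01.example.com:8443/app/config.xml", "192.0.2.10")` returns `https://192.0.2.10:8443/app/config.xml`. Back up the extension file before you edit it by hand.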
Monitoring ESXi System Health
You can use your vSphere Client to monitor the state of your host hardware components. The host health monitoring tool allows you to monitor the health of many hardware components including CPU, memory, fans, temperature, voltage, power, network, battery, storage, cable/interconnect, software, watchdog, and so on. Actually, the specific information that you will obtain will vary somewhat by the sensors available in your server hardware.
The host health monitoring tool will gather and present data using Systems Management Architecture for Server Hardware (SMASH) profiles. SMASH (isn’t that a fun acronym!) is an industry standard specification. You can obtain more information about SMASH at http://www.dmtf.org/standards/smash. You can monitor the host health status by connecting your vSphere Client directly to your host and selecting Configuration and then Health Status, as shown in Figure 6-10. As you might imagine, you are looking for a green check mark here. The status will turn yellow or red if the component violates a performance threshold or is not performing properly. Generally speaking, a yellow indicator signifies degraded performance, and a red indicator signifies that the component has either stopped or has tripped the highest (worst) threshold possible.
Figure 6-10 Viewing Health Status on a Specific Host
You can also monitor your host’s health by logging on to your vCenter with your vSphere Web Client, selecting the host, and then clicking the Monitor tab and finally the Hardware Status tab, as shown in Figure 6-11.
Figure 6-11 Viewing Hardware Status on a Host Through vCenter
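If you record the sensor states that the health status view reports, the color convention described above can be summarized in a few lines. This is a sketch only; the `Health` enumeration, the function names, and the sensor names are hypothetical and do not come from any VMware API.

```python
from enum import Enum


class Health(Enum):
    GREEN = 0   # component is operating normally
    YELLOW = 1  # degraded performance
    RED = 2     # stopped, or the highest (worst) threshold was tripped


def worst_status(sensor_states: dict) -> Health:
    """Return the most severe state among the sensors (GREEN if there are none)."""
    return max(sensor_states.values(), key=lambda s: s.value, default=Health.GREEN)


def needs_attention(sensor_states: dict) -> list:
    """Return the sorted names of sensors that are not GREEN."""
    return sorted(
        name for name, state in sensor_states.items() if state is not Health.GREEN
    )
```

The overall state of a host is only as good as its worst sensor, which is why `worst_status` takes the maximum severity rather than an average.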
Exporting Diagnostic Information
If you have an issue that warrants contacting VMware technical support, the technicians might ask you to send them a log or two. If they want to see multiple logs, the easy way to send them “everything you’ve got” is to generate a diagnostic bundle. That sounds like more work for you, doesn’t it? Actually, it’s a very simple task that you can perform on your vCenter through your vSphere Web Client. I will discuss this briefly here and then I will discuss it in more detail in Chapter 7, “Monitoring a vSphere Implementation.”
To export a diagnostic data bundle, you can use either a host log-in, as detailed in Activity 6-3, or a vCenter log-in, as detailed in Activity 6-4.
Activity 6-3 Exporting Diagnostic Information from a Host Log-In
- Log on to your host with your vSphere Client.
- Click your ESXi host in the console pane. Then select File, then Export, and finally Export System Logs, as shown in Figure 6-12.
Figure 6-12 Exporting System Logs from a Single Host
- Specify the system logs that you want to be exported, likely as directed by the VMware Support Team, as shown in Figure 6-13, and click Next.
Figure 6-13 Selecting Logs to Export
- Enter or select Browse to find the location to which you want to download the file, as shown in Figure 6-14.
Figure 6-14 Selecting the Location for Exported Logs
- You can view the progress of your System Log Bundle as it is downloaded to the destination, as shown in Figure 6-15.
Figure 6-15 Viewing the Progress of a System Log Bundle on a Single Host
Activity 6-4 Exporting Diagnostic Information from a vCenter Log-In
- Log on to your vCenter with your vSphere Web Client.
- Click your root object. Then select Monitor, then System Logs, and finally Export System Logs, as shown in Figure 6-16.
Figure 6-16 Exporting System Logs from vCenter
- Specify the hosts that you want to include in the log bundle and whether you want to include the vCenter and Web Client logs as well, as shown in Figure 6-17, and click Next. These decisions will likely be directed by the VMware Support Team.
Figure 6-17 Specifying Hosts for Log Creation
- Choose whether you want to gather performance data, as directed by the VMware Support Team, and select Generate Log Bundle, as shown in Figure 6-18.
Figure 6-18 Generating the Log Bundle
- Select Download Log Bundle and choose the download destination for your logs, as shown in Figure 6-19.
Figure 6-19 Selecting the Destination Location for Exported Logs
- You can view your logs at the download destination, as shown in Figure 6-20.
Figure 6-20 Viewing the Download Destination
Performing Basic vSphere Network Troubleshooting
Your vSphere network should connect your VMs to each other and also allow your VMs to connect to physical resources outside your vSphere. In addition, your network should provide a management port (or multiple management ports) that allows you to control your hosts and VMs. Finally, your network might very well be involved with your storage, if you are using IP storage options such as Internet Small Computer System Interface (iSCSI) storage-area networks (SANs) or Network File System (NFS) datastores.
Because your vSphere network is such an integral part of your virtual datacenter, you should understand the network components and their correct configuration so that you can troubleshoot them when necessary. In this section, I will discuss verifying and troubleshooting network configuration including your VMs, port groups, and physical network adapters. In addition, I will discuss identifying the root cause of a network issue based on troubleshooting information.
Verifying Network Configuration
At the very least, your network configuration should include a VMkernel port for management; otherwise, you won’t be able to control the host remotely. In fact, one is provided for you with the default installation of an ESXi host. If you are using vSSs, you will need at least one VMkernel management port on each host. If you are using a vDS, you will need at least one VMkernel management port on the vDS. Of course, it is possible to configure more than one management port, and that is certainly recommended on a vDS. Another option is to configure one VMkernel port but then configure it to use more than one physical NIC (vmnic). In addition, you might have additional VMkernel ports for a myriad of reasons, including an additional heartbeat network for high availability (HA), IP storage (iSCSI or NFS), fault tolerance (FT) logging for vSphere fault tolerance, Virtual SAN, and vMotion.
Other than the VMkernel ports, the rest of the ports on a switch will be used for uplinks to the physical world or for VM port groups; most will likely be used for VM port groups. The correct use of VM port groups enables you to get more options out of a single switch (vSS or vDS) by assigning different attributes to different port groups. As you know, with vDSs, you can even assign different attributes at the individual port level. VM port groups give you options on which to connect a VM.
Verifying your network configuration consists of viewing your network with an understanding of how all of these virtual components are linked together. Only by understanding how it should be connected will you be able to troubleshoot any configuration issue. Figure 6-21 shows one of the views you can use through your vSphere Web Client to manage the networking of your host.
Figure 6-21 Managing the Networking of a vSS
Verifying a Given Virtual Machine Is Configured with the Correct Network Resources
As I mentioned earlier, port groups give you options on which to connect a VM. In my opinion, you can really see this more clearly from the VM’s standpoint. In Figure 6-22, I right-clicked a VM and then selected Edit Settings. As you can see, I have a list of port groups from which to choose for the virtual network interface card (vNIC) on this VM called Network adapter 1. These port groups are all VM port groups on this switch or on the vDS to which this host is connected. Also, note the Device Status check boxes at the top right of the screen. These should be selected on an active connection. When the VM is connected to the appropriate port group, it can be configured with the correct network resources. If it is not on the correct port group, many issues could result, including having the wrong security, traffic shaping, NIC teaming options, or even having a total lack of connectivity.
Figure 6-22 Viewing a VM’s Network Configuration
Troubleshooting Virtual Switch and Port Group Configuration Issues
Just connecting the VM to a port group does not guarantee that you get the desired configuration. What if the port group itself is not configured properly? You should understand that any configuration options on a vSS will be overridden by conflicting options on a port group of the same switch. In addition, any options on a port group of a vDS will be overridden by conflicting options on a specific port. I covered these options in Chapter 2, “Planning and Configuring vSphere Networking,” so I will not go into great detail about security, traffic shaping, NIC teaming, and so on, but Figure 6-23 shows the general area in which you can find them on a vDS. The main point here is to verify that you have set the properties appropriately for the VMs that are connected to the port group.
Figure 6-23 Port Group Settings on a vDS
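The override behavior described above (a port group setting wins over a conflicting switch setting, and on a vDS a port setting wins over a conflicting port group setting) can be modeled as a layered merge. This is a sketch of the precedence rule only; the function name `effective_policy` and the setting names are hypothetical, not VMware identifiers.

```python
def effective_policy(*levels: dict) -> dict:
    """Merge policy levels from least to most specific; later levels win.

    Pass the levels in order, for example (switch, port_group) for a vSS
    or (port_group, port) for a vDS.
    """
    merged = {}
    for level in levels:
        # Only the settings a level explicitly defines override inherited ones.
        merged.update(level)
    return merged
```

For example, if the switch allows MAC address changes but the port group rejects them, `effective_policy({"mac_changes": "Accept", "forged_transmits": "Accept"}, {"mac_changes": "Reject"})` yields a policy that rejects MAC address changes and still accepts forged transmits. When a VM behaves unexpectedly, checking the most specific level first usually saves time.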
Troubleshooting Physical Network Adapter Configuration Issues
It can’t all be virtual! At some point, you have to connect your vSphere to the physical world. The point at which the data moves out of the host and into the physical world can be referred to as a physical network adapter, a vmnic, or an uplink. Because the configuration of this point of reference is for a piece of physical equipment, the available settings are what you might expect for any other physical adapter, namely speed, duplex, wake on LAN, and so on, as shown in Figure 6-24.
Figure 6-24 Settings for a Physical Adapter
Identifying the Root Cause of a Network Issue Based on Troubleshooting Information
I’ve seen and written about many different models of troubleshooting that look great on paper, but might be overkill for the real world. Also, VMware doesn’t subscribe to a certain five-step or seven-step model of troubleshooting with regard to the exam. That said, you should be able to “think through” a troubleshooting question based on what you know about virtual networking.
In general, a VM’s network performance is dependent on two things: its application workload and your network configuration. Dropped network packets indicate a bottleneck in the network. Slow network performance could be a sign of load-balancing issues or the lack of load balancing altogether.
You’ll know if you have high latency and slow network performance; there is no hiding that! How will you know if you have dropped packets? You can use esxtop, resxtop, or the Advanced performance charts to examine dropped transmit (droppedTx) and dropped receive (droppedRx) packets. (These should be zero, or very close to it, if you don’t have a bottleneck on this resource.) I will discuss the use of resxtop in the next chapter, “Monitoring a vSphere Implementation.”
If these utilities indicate that there is an issue, you can verify or adjust each of the following to address the issue:
- Verify that each of the VMs has VMware Tools installed.
- Verify that vmxnet3 vNIC drivers are being used wherever possible.
- If possible, place VMs that communicate to each other frequently onto the same host on the same switch in the same subnet so they can communicate without using the external network at all.
- Verify that the speed and duplex settings on your physical NICs are what you expected.
- Use separate physical NICs to handle different types of traffic, such as VM, iSCSI, vMotion, and so on.
- If you are using 1 Gbps NICs, consider upgrading to 10 Gbps NICs or using Link Aggregation Groups (LAGs).
- Use vNIC drivers that are TSO-capable (as I discussed in Chapter 2).
Of course, this is not an exhaustive list, but it’s a good start toward better virtual network performance. You should apply each of these potential solutions “one at a time” and retest. In this way, you can determine the root cause of your network issue, even as you are fixing it.
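If you export dropped-packet counters (for example, from the Advanced performance charts) and want to retest after each change, a small script can flag the adapters that still show drops. The sketch below assumes a hypothetical data layout: each adapter maps to the counts `droppedTx`, `tx`, `droppedRx`, and `rx` for one sampling interval, where `tx` and `rx` are packets that were not dropped. The function names are also hypothetical.

```python
def drop_rate(dropped: int, delivered: int) -> float:
    """Fraction of packets dropped out of all packets attempted (0.0 if none)."""
    attempted = dropped + delivered
    return dropped / attempted if attempted else 0.0


def flag_bottlenecks(samples: dict, threshold: float = 0.0) -> list:
    """Return sorted adapter names whose Tx or Rx drop rate exceeds threshold."""
    flagged = []
    for name, counts in samples.items():
        tx_rate = drop_rate(counts["droppedTx"], counts["tx"])
        rx_rate = drop_rate(counts["droppedRx"], counts["rx"])
        if tx_rate > threshold or rx_rate > threshold:
            flagged.append(name)
    return sorted(flagged)
```

The default threshold of 0.0 flags any drops at all, which matches the guidance that these counters should be zero or very close to it. Raise the threshold slightly if you want to ignore an occasional stray drop.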
Performing Basic vSphere Storage Troubleshooting
As you know, it’s possible for a VM to be given visibility to its actual physical storage locations, as with a physical compatibility raw device mapping (RDM). That said, it should not be the norm in your virtual datacenter. In most cases, you will use either a Virtual Machine File System (VMFS) datastore or an NFS datastore, either of which hides the specifics of the actual physical storage from the VM. Also, you may begin to use a Virtual SAN.
Regardless of what type of storage you use, you will need to configure it properly to get your desired result. In this section, I will discuss verifying storage configuration. I will also cover troubleshooting many aspects of storage, including storage contention issues, overcommitment issues, and iSCSI software initiator issues. In addition, I will discuss storage reports and storage maps that you can use for troubleshooting. Finally, you will learn how to identify the root cause of a storage issue based on troubleshooting information.
Verifying Storage Configuration
Your vCenter includes two views that will assist you in verifying your storage configuration: the Manage, Storage link in Hosts and Clusters view and the Storage view. Each of these tools lists information about your storage, and there is some overlap with regard to what these tools list. If you are focusing on what a host can see, then you might use the Manage, Storage link, as shown in Figure 6-25.
Figure 6-25 The Manage, Storage Link in Hosts and Clusters View
Click Refresh to make sure that you are seeing the latest information. You can use the Manage, Storage link to quickly identify the storage adapters and storage devices that are accessible to that host. In addition, you can view the status, type, capacity, free space, and so on, for each one. You can even customize what you show by right-clicking at the top of a column and selecting only what you want to see, as shown in Figure 6-26.
Figure 6-26 Customizing the Manage, Storage Link
The Storage view allows you to see some of the same information as the Manage, Storage link, but also much, much more detail about datastores. You can determine which hosts are connected to each datastore, but that is not the primary focus. Instead, the primary focus is detailed information about the datastores to which the hosts are connected.
You should click the Refresh link to make sure that you are seeing the latest information. Figure 6-27 shows the Storage view with a datastore selected in the Navigator (left pane) and the Summary tab selected in the details pane. As you can see, you can also show many more tabs. For example, the Related Objects tab in Figure 6-28 shows the hosts that have visibility to this datastore.
Figure 6-27 The Storage View Summary Tab
Figure 6-28 The Related Objects Tab
Troubleshooting Storage Contention Issues
To troubleshoot storage contention issues, you should focus on the storage adapters that connect your hosts to their datastores. As you know from Chapter 3, “Planning and Configuring vSphere Storage,” you can provide multipathing for your storage to relieve contention issues. The settings for multipathing of your storage are in the Storage view. Click Manage and then Settings and then Connectivity and Multipathing; finally, click your host to show the Multipathing Details, as shown in Figure 6-29. You can change path selection policy after clicking Edit Multipathing, as shown in Figure 6-30.
Figure 6-29 Settings for Multipathing of Storage
Figure 6-30 Configuring Multipathing in the Storage View
Troubleshooting Storage Overcommitment Issues
As you continue to grow your vSphere, and your hosts and VMs are competing for the same resources, many factors can begin to affect storage performance. They include excessive SCSI reservations, path thrashing, and inadequate LUN queue depth. This section briefly discusses each of these issues.
Excessive Reservations Cause Slow Host Performance
Some operations require the system to get a file lock or a metadata lock in VMFS. They might include creating or expanding a datastore, powering on a VM, creating or deleting a file, creating a template, deploying a VM from a template, creating a new VM, migrating a VM with vMotion, changing a vmdk file from thin to thick, and so on. These types of operations create a short-lived SCSI reservation, which temporarily locks the entire LUN or at least the metadata database. As you can imagine, excessive SCSI reservations caused by activity on one host can cause performance degradation on other servers that are accessing the same VMFS. Actually, ESXi 5.x does a much better job of handling this issue than legacy systems did, because only the metadata is locked and not the entire LUN.
If you have older hosts and you need to address this issue, you should ensure that you have the latest BIOS updates installed on your hosts and that you have the latest host bus adapter (HBA) firmware installed across all hosts. You should also consider using more, smaller logical unit numbers (LUNs) rather than fewer, larger LUNs for your datastores. In addition, you should reduce the number of VM snapshots because they can cause numerous SCSI reservations. Finally, follow the Configuration Maximums document and reduce the number of VMs per LUN to the recommended maximum, even if you have seen that you can actually add more than that figure.
Path Thrashing Causes Slow Performance
Path thrashing is most likely to occur on active-passive arrays. It’s caused by two hosts attempting to access the same LUN through different storage processors. The result is that the LUN is often seen as not available to both hosts. The default setting for the Path Selection Policy (PSP) of Most Recently Used will generally keep this from occurring. In addition, ensure that all hosts that share the same set of LUNs on the active-passive arrays use the same storage processor. Properly configured active-active arrays do not cause path thrashing.
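The check that all hosts reach a shared LUN through the same storage processor can be sketched as follows. The input layout (a LUN name mapped to each host's storage processor), the function name `thrashing_risks`, and the sample names are hypothetical; you would fill in the data from your own storage views.

```python
def thrashing_risks(lun_access: dict) -> dict:
    """Return the LUNs that are reached through more than one storage processor.

    lun_access maps a LUN name to a dict of {host name: storage processor name}.
    The result maps each at-risk LUN to the sorted processors in use.
    """
    risks = {}
    for lun, by_host in lun_access.items():
        processors = set(by_host.values())
        # More than one processor for the same LUN is the thrashing condition.
        if len(processors) > 1:
            risks[lun] = sorted(processors)
    return risks
```

An empty result means every shared LUN is accessed through a single storage processor, which is the condition recommended above for active-passive arrays.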
Troubleshooting iSCSI Software Initiator Configuration Issues
If your ESXi host generates more commands to a LUN than the LUN can handle, the excess commands are queued by the VMkernel. This situation causes increased latency, which can affect the performance of your VMs. It is generally caused by an improper setting of LUN queue depth, the setting of which varies by the type of storage. You should determine the proper LUN queue depth for your storage from your vendor documentation and then adjust your Disk.SchedNumReqOutstanding parameter accordingly.
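The arithmetic behind this queuing is simple: any commands beyond the LUN queue depth wait in the VMkernel. The sketch below is a deliberately simplified model with a hypothetical function name; it ignores timing and multiple hosts, and the numbers you pass in should come from your own measurements and vendor documentation.

```python
def queued_in_vmkernel(outstanding_commands: int, lun_queue_depth: int) -> int:
    """Commands that exceed the LUN queue depth and must wait in the VMkernel."""
    if lun_queue_depth <= 0:
        raise ValueError("lun_queue_depth must be positive")
    # Commands up to the queue depth are sent to the LUN; the rest wait.
    return max(0, outstanding_commands - lun_queue_depth)
```

With 48 outstanding commands and a queue depth of 32, `queued_in_vmkernel(48, 32)` returns 16. A result that stays above zero under normal load suggests that the queue depth setting deserves a closer look.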
Troubleshooting Storage Reports and Storage Maps
As you have already noticed, you can use a great number of reports and tools for troubleshooting vSphere. In most cases, you are going to be better off learning how to use the vSphere Web Client. Many of the latest features are available only through the Web Client, such as Cross-Host vMotion. Also, the Windows-based vSphere Client is “on its way out.”
That said, there are a few exceptions. For example, at the time of this writing, you cannot view maps of any kind through the vSphere Web Client. Because of this, I will present this section on the Windows-based vSphere Client.
You can use the Storage Views tab on the vSphere Client in reports view to gather a tremendous amount of data about your storage. You can get this same data from the vSphere Web Client, but vSphere Client offers just another location to see a lot of data. In addition, on your Windows-based vSphere Client, you can use the maps view to see a graphical representation of the relationships between the objects in your vSphere. In fact, you can view storage reports and maps for every object in your datacenters except for the networking objects, which have their own reports and maps. This section briefly discusses the use of these storage reports and maps.
Using your Storage Views tab, you can display storage reports to view storage information for any object except networking. For example, you can view datastores and LUNs used by a VM, the adapters that are used to access the LUN, and even the status of the paths to the LUNs. To access storage reports from the Storage Views tab, follow the steps outlined in Activity 6-5.
Activity 6-5 Viewing Storage Reports
- Log on to your vCenter with your vSphere Client.
- In the console pane, select the object on which you want to view connected storage (in this case, VM-02), and then open the Storage Views tab and click the Reports button, as shown in Figure 6-31.
Figure 6-31 The Storage Views Tab and Reports Button
- Select View and then Filtering to display the Show All [Category of Items] or click the amazingly small drop-down arrow, as shown in Figure 6-32.
Figure 6-32 Choosing the Display on the Storage Views Tab
- Move the cursor over the column heading to view the description of each attribute, as shown in Figure 6-33.
Figure 6-33 Viewing Column Descriptions
As you can see, Storage Reports can give you a lot of information about your datastores, but all the information is in the form of text. The problem is that we (people) don’t think in text; we think in pictures. We can generally understand a situation better if someone will take the time to “draw us a picture.”
In essence, that’s just what VMware has done with the Maps view of the Storage Views tab. You can use the view to display a graphical representation of every object that relates to your storage. For example, you can tell whether a specific VM has access to a host that has access to a storage processor that has access to a LUN, and whether or not there is a datastore on the LUN. To use your Maps view on your Storage Views tab, follow the steps outlined in Activity 6-6.
Activity 6-6 Viewing Storage Maps
- Log on to your vCenter with your vSphere Client.
- In the console pane, select the object on which you want to view connected storage objects (in this case, VM-03), and then open the Storage Views tab and click the Maps button, as shown in Figure 6-34.
Figure 6-34 Viewing Maps in Storage Views
- You can choose the objects that you would like to display on your map.
- You can also hover your mouse pointer over an object for a few seconds to see the “callout” that gives a detailed description of that object.
Identifying the Root Cause of a Storage Issue Based on Troubleshooting Information
After you have obtained information from the reports and maps provided by your vCenter, you can use your knowledge of your systems to compare what you are viewing to what should be occurring. One “catch-22” is that the time that you are most likely to need the information is also the time at which it is most likely to be unavailable. For this reason, consider printing a copy of your storage maps when everything is running smoothly to be kept on hand for a time when you need to troubleshoot. Then if you have access to the current maps, you can compare what you are seeing with what you have in print. However, if you can no longer use the tools, you have the printed map to use as an initial guide until you can access the current configuration.
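If, in addition to a printed map, you keep a written record of the relationships the map shows, the comparison of a known-good baseline with the current state can be sketched as a set difference. The record format here (pairs of connected object names), the function name `diff_storage_map`, and the sample names are hypothetical; vCenter does not export maps in this form.

```python
def diff_storage_map(baseline: set, current: set) -> dict:
    """Compare two sets of (source, target) relationships from a storage map."""
    return {
        "missing": sorted(baseline - current),  # in the baseline, gone now
        "added": sorted(current - baseline),    # new since the baseline
    }
```

A relationship reported as missing, such as a host that no longer reaches a datastore, is a natural starting point when you look for the root cause.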
Performing Basic Troubleshooting for HA/DRS Clusters and vMotion/Storage vMotion
If you think about it, the technologies that are engaged when you use vMotion, Storage vMotion, HA, and DRS are amazing! These are reliable technologies and services as long as they are configured properly with all that is required and as long as that configuration stays in place. Troubleshooting them is therefore just a matter of knowing what is required in order for them to operate properly and then verifying that the correct configurations still exist in your vSphere. In this section, I will discuss the steps involved in verifying the configurations of vMotion, Storage vMotion, HA, and DRS. In addition, I will discuss how to troubleshoot the most common issues associated with these services and how to identify the root cause of the issue so as to make only the appropriate changes.
Identifying HA/DRS and vMotion Requirements
HA/DRS and vMotion requirements might seem at first to be too many topics to discuss all at once, but the reason that I can cover them all “rather simultaneously” is that the requirements are much the same for each of these features. At least the host requirements are much the same, but the VM requirements vary some from feature to feature. First, I will discuss the requirements that are the same, and then I will discuss some requirements that apply to only one or two of these features, but not all three.
The requirements for all of HA, DRS, and vMotion are the following:
- All hosts must have NICs of at least 1 Gbps.
- All hosts must share the same datastores or data space. These can be VMFS, NFS, or even RDMs.
- All hosts must have access to the same physical networks.
Additional requirements that apply to vMotion and DRS, but not to HA, are as follows:
- All hosts must have compatible CPUs.
- The VMs on the hosts must not have any locally attached CD-ROMs or ISOs that are loaded.
- The VMs cannot have a connection to an internal switch with no uplinks.
- The VMs’ swap file must either be shared by the hosts or must be created before migration can begin. Solid state drives (SSDs) can now be used for the swap files.
- If the VM uses an RDM, it must be accessible to the source and destination hosts.
None of this should really seem any different than what I discussed previously in Chapter 5, “Establishing and Maintaining Service Levels,” but the main point here is that the second bulleted list does not apply to HA. I want to make this clear: HA does not use vMotion in any way, shape, or form!
HA provides for the automatic restart of VMs when the host that they were on has failed. At that point, the VMs can be restarted on another host as long as the host meets the requirements in the first set of bullet points. It doesn’t matter at that point whether the CPUs of the host are compatible. All that matters is that the VMs are protected and that the hosts are in the same HA cluster with a shared datastore and 1 Gbps or higher links.
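The split between the two lists can be sketched in a few lines of Python. This is only a memory aid under stated assumptions, not a VMware API: the `Environment` fields and the `unmet_requirements` function are hypothetical names, and the values would have to be collected by hand or by your own tooling.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Hand-collected facts about a pair of hosts and one VM (hypothetical model)."""
    slowest_nic_mbps: int
    hosts_share_datastore: bool
    hosts_share_physical_networks: bool
    cpus_compatible: bool
    local_cdrom_or_iso_connected: bool
    on_internal_switch_without_uplinks: bool
    swap_file_available_to_destination: bool
    rdm_visible_to_both_hosts: bool  # set to True when the VM has no RDM


def unmet_requirements(env, feature):
    """Return the requirements from the two lists above that are not met."""
    if feature not in ("HA", "DRS", "vMotion"):
        raise ValueError(f"unknown feature: {feature}")
    problems = []
    # First list: applies to HA, DRS, and vMotion alike.
    if env.slowest_nic_mbps < 1000:
        problems.append("NICs slower than 1 Gbps")
    if not env.hosts_share_datastore:
        problems.append("no shared datastore")
    if not env.hosts_share_physical_networks:
        problems.append("hosts are not on the same physical networks")
    # Second list: applies to vMotion and DRS, but not to HA.
    if feature in ("DRS", "vMotion"):
        if not env.cpus_compatible:
            problems.append("incompatible CPUs")
        if env.local_cdrom_or_iso_connected:
            problems.append("locally attached CD-ROM or ISO")
        if env.on_internal_switch_without_uplinks:
            problems.append("VM connected to an internal switch with no uplinks")
        if not env.swap_file_available_to_destination:
            problems.append("swap file not available to the destination host")
        if not env.rdm_visible_to_both_hosts:
            problems.append("RDM not accessible to both hosts")
    return problems


# Incompatible CPUs and a mounted local ISO block vMotion but not an HA restart.
env = Environment(1000, True, True, False, True, False, True, True)
print("HA:", unmet_requirements(env, "HA"))
print("vMotion:", unmet_requirements(env, "vMotion"))
```

The point of keeping the second group of checks behind the `feature` test is the same as the point made above: HA never evaluates the vMotion-specific requirements.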
That leaves us with Storage vMotion. You should clearly understand that when you Storage vMotion a VM’s files, the VM’s state is not moved from one host to another. Therefore, a list of requirements for “all hosts” is not needed because only one host is involved.
For Storage vMotion to be successful, the following requirements must be met:
- The host must have access to both the source and the destination datastores.
- A minimum of one 1 Gbps link is required.
- The VM’s disks must be in persistent mode or be RDMs.
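The three requirements above can be expressed as a short checklist function. This is a minimal sketch, assuming you have already gathered the facts yourself; `storage_vmotion_blockers`, its parameters, and the disk mode strings are hypothetical and do not correspond to any VMware tool.

```python
def storage_vmotion_blockers(host_datastores, source, destination,
                             fastest_link_mbps, disk_modes):
    """Return the reasons (if any) that a Storage vMotion would be blocked.

    host_datastores   -- set of datastore names that the single host can access
    source            -- datastore that currently holds the VM's files
    destination       -- datastore to which the files should move
    fastest_link_mbps -- speed of the fastest available link, in Mbps
    disk_modes        -- dict of disk name -> "persistent", "rdm", or another mode
    """
    blockers = []
    for datastore in (source, destination):
        if datastore not in host_datastores:
            blockers.append(f"host cannot access datastore {datastore}")
    if fastest_link_mbps < 1000:
        blockers.append("no link of at least 1 Gbps")
    for disk, mode in disk_modes.items():
        if mode not in ("persistent", "rdm"):
            blockers.append(f"{disk} is in {mode} mode")
    return blockers


# Hypothetical datastore and disk names, used only for illustration.
print(storage_vmotion_blockers({"DS1", "DS2"}, "DS1", "DS2", 1000,
                               {"disk1": "persistent", "disk2": "nonpersistent"}))
```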
Verifying vMotion/Storage vMotion Configuration
Now that I have identified what you must have configured in order for vMotion to be successful versus what you must have configured in order for Storage vMotion to be successful, I’ll examine where you would look to verify that the proper configuration exists. Because these are two different types of migration, I continue to treat them independently of each other. I will first discuss verifying vMotion configuration and then verifying Storage vMotion configuration.
Verifying vMotion Configuration
As you might remember, to succeed with vMotion, you will need to have a VMkernel port on a switch that is associated with each of the hosts that are involved in the vMotion. In addition, the VMkernel port will need to be enabled for vMotion, and the IP addresses of the hosts should be in the same subnet (point-to-point is best). In addition, consistency is a key factor, so unless you are using a vDS (which guarantees consistency of port group naming), you should ensure that your port group names are identical, including correct spelling and case sensitivity.
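The networking checks just described (vMotion enabled, same subnet, identical port group names) can be sketched with Python's standard `ipaddress` module. The dictionary layout, the `vmotion_network_problems` function, and the sample addresses and port group names are all assumptions made for illustration; nothing here queries a real host.

```python
import ipaddress


def vmotion_network_problems(src, dst):
    """Compare hand-collected network facts for two hosts.

    Each host is a dict with:
      "vmotion_enabled" -- bool for the VMkernel port
      "vmk_address"     -- VMkernel address with prefix, such as "10.0.1.11/24"
      "port_groups"     -- set of VM port group names on the host
    """
    problems = []
    for label, host in (("source", src), ("destination", dst)):
        if not host["vmotion_enabled"]:
            problems.append(f"vMotion is not enabled on the {label} VMkernel port")

    src_net = ipaddress.ip_interface(src["vmk_address"]).network
    dst_net = ipaddress.ip_interface(dst["vmk_address"]).network
    if src_net != dst_net:
        problems.append(f"VMkernel ports are in different subnets: {src_net} and {dst_net}")

    # Port group names must match exactly, including case.
    dst_by_lowercase = {name.lower(): name for name in dst["port_groups"]}
    for name in sorted(src["port_groups"] - dst["port_groups"]):
        near_match = dst_by_lowercase.get(name.lower())
        if near_match is not None:
            problems.append(f"port group '{name}' differs only by case from '{near_match}'")
        else:
            problems.append(f"port group '{name}' is missing on the destination")
    return problems


source_host = {"vmotion_enabled": True, "vmk_address": "10.0.1.11/24",
               "port_groups": {"Production"}}
dest_host = {"vmotion_enabled": True, "vmk_address": "10.0.1.12/24",
             "port_groups": {"production"}}
for problem in vmotion_network_problems(source_host, dest_host):
    print(problem)
```

The case-only comparison is there because a name such as `production` looks right at a glance but does not match `Production` on a standard switch.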
In addition to the networking requirement, your hosts must have shared datastores. You can verify whether two hosts share the same datastore by looking at the Related Objects for the datastore in Storage view and then selecting Hosts, as shown in Figure 6-35.
Figure 6-35 Verifying Whether Hosts Share the Same Datastore
Verifying HA Network Configuration
To verify the requirements for HA to function, you should start with the cluster settings. Because the purpose of the cluster is to provide for HA, DRS, or both, it would seem logical that you should check those settings first. However, because I’m following the exam blueprint “to the letter,” I will discuss that in our next topic.
What else should you verify to ensure that HA will be able to function? You should look at the vmnics used on the hosts and ensure that they are 1 Gbps or better. As you should remember from Chapter 2, you can modify the properties of the switch by opening Manage and then Networking for the host. After you have done this, you can click the Physical Adapters tab, as shown in Figure 6-36. You will need at least 1 Gbps (1000 Mb) vmnics to have an effective HA cluster. You should also verify that the hosts share a datastore, as you did with vMotion requirements.
Figure 6-36 Verifying the Speed of the Underlying Network
Verifying HA/DRS Cluster Configuration
Speaking of the cluster configuration, the most general verification that you can make is whether HA/DRS are turned on in the cluster settings. To do this, click your cluster in Hosts and Clusters view and then look under Services for vSphere DRS and vSphere HA. This will allow you to view the current settings of these services, as shown in Figure 6-37. In addition, even if HA is turned on, you should check to make sure that HA monitoring is enabled because it’s possible to turn it off for a maintenance event. Finally, ensure that the policies that are configured for HA/DRS are what you configured and that you have followed the guidelines mentioned in Chapter 5. For example, check Admission Control Policies for HA and VM affinity rules for DRS.
Figure 6-37 Verifying Cluster Settings for HA and DRS
Troubleshooting HA Capacity Issues
This title is kind of “funny” because I took it straight from the blueprint. What it should say is “Troubleshooting Cluster Capacity Issues That Are Due to HA.” As you know, Admission Control Policy in HA causes each host to reserve enough resources to recover VMs in the case of a host’s failure. This means that if you set your Admission Control Policy too conservatively, you might not be able to start as many VMs as you may have thought possible. For example, changing from a policy that allows for only one host failure to one that allows two host failures can have a dramatic effect on the VM capacity of your cluster, especially in a small cluster. Therefore, without rehashing all of Chapter 5, just verify that the settings that you expect to see are still there.
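A little arithmetic shows why the effect is largest in a small cluster. The sketch below is a deliberate simplification: it assumes that all hosts are identical and that each tolerated host failure sets aside one host’s worth of capacity. Real admission control works from slot sizes or configured percentages, so treat the hypothetical `usable_fraction` function as a rough estimate only.

```python
def usable_fraction(hosts, host_failures_tolerated):
    """Estimate the share of cluster capacity left for running VMs.

    Simplifying assumption: identical hosts, with one full host of capacity
    reserved for each tolerated failure.
    """
    if hosts <= 0 or not 0 <= host_failures_tolerated < hosts:
        raise ValueError("tolerated failures must be at least 0 and fewer than the host count")
    return (hosts - host_failures_tolerated) / hosts


for hosts in (3, 8):
    for failures in (1, 2):
        share = usable_fraction(hosts, failures)
        print(f"{hosts} hosts, {failures} failure(s) tolerated: {share:.0%} usable")
```

Under these assumptions, a three-host cluster drops from about two thirds usable to about one third when you move from one tolerated failure to two, whereas an eight-host cluster drops only from 87.5 percent to 75 percent.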
Troubleshooting HA Redundancy Issues
As you know, HA stands for high availability. This high availability is maintained by the heartbeats that are exchanged between hosts in an HA cluster. When the cluster determines that a host is isolated or has failed, it will follow the isolation response that you have configured. The default isolation response in vSphere 5.x is Leave Powered On, which will leave the VMs powered on with the assumption that they still have the resources that they need. The other options are Power Off and Shut Down. If you have a separate management network or a separate heartbeat network, you can give the host another tool with which to make a more accurate decision with regard to whether to leave the VMs powered on or to shut them down. If you are troubleshooting the configuration of this network, you should examine your network settings to ensure that the network is in place. As a small example, my Management network is on vSwitch0 and vmk0, and my RedundantHeartbeat network is on vSwitch3 and vmk3, as shown in Figure 6-38. Also, although it is not shown, each of these VMkernel ports has its own vmnic.
Figure 6-38 A Small Example of HA Network Redundancy
Interpreting the DRS Resource Distribution Graph and Target/Current Host Load Deviation
VMware used to just say, “Set DRS at Fully Automated, set the Migration Threshold in the center, and trust us.” Then they really didn’t give you native tools to check how well they were doing for you. Now, VMware has given us some very cool tools indeed! In fact, you can tell a lot about DRS from just the Summary tab of the cluster, as shown in Figure 6-39.
Figure 6-39 Viewing the Summary Tab of a Cluster
If that’s not big enough, you can even expand the view by clicking the upper-right corner of the vSphere DRS panel. The result will be a large “carpenter’s level” that leaves no room for misinterpretation as to whether or not the cluster is balanced, as shown in Figure 6-40. Can you tell whether or not the cluster is balanced?
Figure 6-40 Viewing the DRS Panel on the Summary Tab for a Cluster
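The idea behind comparing a current host load deviation to a target can be sketched numerically. The model below is an assumption made for illustration: each host’s load is taken as the sum of its VMs’ demands divided by the host’s capacity, and the deviation is the population standard deviation of those loads. VMware’s exact calculation is not described in this chapter, and the function names, the sample numbers, and the target value are all hypothetical.

```python
from statistics import pstdev


def host_loads(vm_demand_by_host, capacity_by_host):
    """Return host name -> load, where load = total VM demand / host capacity."""
    return {
        host: sum(vm_demand_by_host.get(host, [])) / capacity
        for host, capacity in capacity_by_host.items()
    }


def current_deviation(loads):
    """Population standard deviation of the per-host loads."""
    return pstdev(list(loads.values()))


def is_balanced(loads, target_deviation):
    """Treat the cluster as balanced when the current deviation is within the target."""
    return current_deviation(loads) <= target_deviation


# Hypothetical demands and capacities in MHz.
loads = host_loads(
    {"esxi01": [4000, 4000], "esxi02": [2000]},
    {"esxi01": 10000, "esxi02": 10000},
)
print(loads)
print("balanced:", is_balanced(loads, target_deviation=0.2))
```

In this model, a more aggressive Migration Threshold would correspond to a smaller target value, so the same set of loads is more likely to be reported as imbalanced.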
Troubleshooting DRS Load Imbalance Issues
If you notice a load imbalance, you will want to determine why the imbalance was allowed to happen. It could be that the cluster or some of the VMs in it are not set to Fully Automated. It could also be that it was “intentionally” allowed by the system based on your Migration Threshold or your VM-VM or VM-Host affinity configuration.
In addition, check to make sure that there are no VMs that are using a large amount of resources and that cannot be vMotioned, as that will stop DRS from being effective, especially if they are all on the same host. Finally, you might want to check to see if there is one huge VM that must be on one host or another and seems to throw off the balance no matter where DRS places it. You can view the resources of VMs and compare on the Virtual Machines tab within the Related Objects of your cluster as shown in Figure 6-41. As you can see, I don’t have much running right now.
Figure 6-41 Viewing the Resources of VMs in a Cluster
Troubleshooting vMotion/Storage vMotion Migration Issues
If your vSphere and your VMs meet all the requirements for vMotion, you should be able to vMotion. If you can vMotion, you should also be able to Storage vMotion because vMotion has all of the configuration requirements of Storage vMotion and more. If you cannot vMotion or Storage vMotion, go back through the list of requirements to see what you are missing. You can refresh your memory by reviewing the “Migrating Virtual Machines” section of Chapter 5.
Interpreting vMotion Resource Maps
As I mentioned earlier, people don’t really think in text form, so wouldn’t it be great to have a tool that shows, in an easy-to-read picture, whether your vSphere meets all the requirements to vMotion a VM from one host to another? That’s what the vMotion Resource Map does. You can access a vMotion Resource Map for a VM by simply selecting the VM in the console pane and then opening the Maps tab on the Windows-based vSphere Client, as shown in Figure 6-42. The vMotion Resource Map will show you what resources are currently connected to the VM and whether those resources would be available if the VM were to be vMotioned to another host. If you can “read between the lines,” you will see what is missing and why the VM might not be able to vMotion to another host. In this case, VM-02 is now powered on and connected to a local ISO image on datastore1 of esxi01 and would not have a connection to the same ISO from esxi02.
Figure 6-42 A vMotion Map with an Error
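What the map conveys visually is essentially a set comparison: the resources that the VM uses versus the resources that each candidate host can reach. The sketch below illustrates that reasoning only. The `missing_by_host` function is hypothetical, as are the `SharedDS` datastore and the `Production` network; the host names and the local datastore follow the example above.

```python
def missing_by_host(vm_resources, reachable_by_host, current_host):
    """For each other host, list the VM's resources that the host cannot reach.

    An empty list means that nothing in this simple model blocks a vMotion
    to that host.
    """
    return {
        host: sorted(vm_resources - reachable)
        for host, reachable in reachable_by_host.items()
        if host != current_host
    }


# VM-02 has an ISO mounted from the local datastore1 of esxi01.
vm_02 = {"SharedDS", "Production", "esxi01:datastore1"}
reachable = {
    "esxi01": {"SharedDS", "Production", "esxi01:datastore1"},
    "esxi02": {"SharedDS", "Production", "esxi02:datastore1"},
}
print(missing_by_host(vm_02, reachable, current_host="esxi01"))
```

Local datastores are prefixed with their host name here because every host can have its own `datastore1`, and the two are not the same resource.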
Identifying the Root Cause for a DRS/HA Cluster or Migration Issue Based on Troubleshooting Information
If you know all the configuration pieces that are supposed to be there, you can just start checking them off one by one to determine whether they are present. The nice thing about Storage vMotion and especially about vMotion is that the wizard will validate most of the configuration for you and give you a list of changes that you must make to perform the migration, as shown in Figure 6-43.
Figure 6-43 An Easy-to-Interpret Error Message
By carefully reading the information under Compatibility, you can determine the root cause of the issue that is keeping you from being able to vMotion or Storage vMotion. This intuitive wizard tells you exactly what you need to know, as long as you understand enough about your vSphere to interpret what it’s telling you. Once you fix the issue, you can refresh the map. Figure 6-44 shows the map after the ISO file was unmounted from VM-02; the vMotion should succeed now.
Figure 6-44 A vMotion Map That Indicates Success
The main topics covered in this chapter are the following:
- I began this chapter by discussing basic troubleshooting techniques for ESXi hosts. In particular, I discussed how you can enable the tools that you can use along with the VMware Support Team as a last resort when more conventional tools are not working. In addition, I discussed how you can monitor an ESXi host’s health on the host itself as well as through your vCenter. Finally, I discussed how you can easily export a diagnostic bundle to assist the VMware Support Team in assisting you.
- I then covered basic vSphere network troubleshooting tools and techniques. In particular, I discussed how to verify your network configuration and the configuration of the VMs on your network. In addition, I discussed troubleshooting port group issues and issues with physical network cards. Finally, I covered how to identify the root cause of a network issue based on troubleshooting information.
- I then turned my attention toward troubleshooting vSphere storage. I discussed the tools and techniques that you can use to verify your vSphere storage. In addition, I discussed troubleshooting storage contention issues, overcommitment issues, and iSCSI software initiator configuration issues. I also discussed the proper use of storage reports and storage maps. Finally, I discussed how to identify the root cause of a storage issue based on troubleshooting information.
- I ended this chapter with a discussion of basic troubleshooting for HA/DRS clusters and vMotion/Storage vMotion. In particular, I identified the requirements for each of these features and compared and contrasted them. In addition, I discussed how you can verify the configuration of each of these requirements using the tools provided by your vCenter. Finally, I discussed troubleshooting issues with regard to HA and DRS by using the reports and maps provided by your vCenter.