How to quickly discover a “VM to Resource Pool” mapping

29 01 2010

Everyone knows the situation where a (s)VMotion asks you to select a Resource Pool. I would love to see VMware automatically select the VM’s current Resource Pool, but unfortunately we have to make the selection ourselves since it defaults to the cluster. Leaving it at the default obviously moves the VM out of its Resource Pool.

 

Follow the instructions below to find the VM’s current Resource Pool without having to walk through each Resource Pool manually.

1) Select the “Hosts and Clusters” view
2) Type the VM name in the “Search Field”, press Enter and select the VM

 

 

3) This unfolds the correct Resource Pool on the left side of the screen and gives you exactly the information you need when doing a (s)VMotion.





SCSI Bus Sharing on the VM Boot Disk SCSI Card

18 01 2010

I was recently asked to review a PID (Project Initiation Document) for a new File Server project based on Linux/GFS2 (Global File System 2) hosted on VMware vSphere. I noticed that the PID was talking about a dynamic File Server that could (and even would) grow towards 100 TB of storage space. That number immediately set off alarm bells in my head.

The vSphere Configuration Maximums document states that a VM can have a maximum of 4 SCSI cards.

So my initial design would look like this:

 

  • 1 SCSI Card for the VM Boot Disk (SCSI Bus Sharing: None), which leaves us with
  • 3 SCSI Cards for the RDMs (SCSI Bus Sharing: Virtual or Physical).

Read the rest of this entry »





Unresponsive HP Virtual Connect Manager – vcutil

24 12 2009

After rebooting several Virtual Connect Modules to test the failover behavior, I got myself into a situation where the Virtual Connect Manager became completely unresponsive. In my case the vcutil tool from HP eventually solved my problem, so I want to share some more information on this tool, since I previously knew it only as a Virtual Connect firmware update tool.

In my case the following statements were true:

  • When logging on through the web browser, the interface’s “Loading, please wait…” message wouldn’t disappear.
  • When logging on with SSH, I was able to enter credentials, but after that the CLI never appeared and the session eventually timed out.

Read the rest of this entry »





VMware vCenter Storage Views: Partial/No Redundancy

15 12 2009

While exploring my software iSCSI initiator environment I noticed that all my VMs on every host report a “Partial/No Redundancy” status within the Multipathing Status, even though I have Round Robin in place and thus 2 paths to the storage.

This behavior is a bug as confirmed by VMware Technical Support. The rule for displaying the “Multipathing Status” is as follows:

If there exists 2 or more distinct adapters AND 2 or more distinct targets
    MPStatus = Up (Full Redundancy)
else If there exists at least one path whose status is “Up”
    MPStatus = degraded (Partial/No Redundancy)
else If there exists at least one path whose status is “Unknown”
    MPStatus = unknown (Unknown)
else (for all other cases)
    MPStatus = down (All Paths Down)

With a software iSCSI initiator you only have one adapter, which counts as a single point of failure and therefore results in a “Partial/No Redundancy” status. So based on the current Storage Views API rules, software iSCSI will always be displayed with a degraded status, no matter how many paths are available.
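
To make the rule a bit more tangible, here is a minimal Python sketch of it. This is only my reading of the rule as quoted above, not VMware's actual implementation; the `Path` class, the `multipathing_status` function and the adapter/target names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One storage path (hypothetical model, not a VMware API type)."""
    adapter: str  # e.g. "vmhba33" (illustrative name)
    target: str   # e.g. an iSCSI target identifier (illustrative)
    status: str   # "Up", "Unknown" or anything else for a dead path


def multipathing_status(paths):
    """Apply the display rule quoted above to a list of paths."""
    adapters = {p.adapter for p in paths}
    targets = {p.target for p in paths}
    # Rule 1: two or more distinct adapters AND two or more distinct targets.
    if len(adapters) >= 2 and len(targets) >= 2:
        return "Up (Full Redundancy)"
    # Rule 2: at least one path is "Up".
    if any(p.status == "Up" for p in paths):
        return "degraded (Partial/No Redundancy)"
    # Rule 3: at least one path is "Unknown".
    if any(p.status == "Unknown" for p in paths):
        return "unknown (Unknown)"
    # Rule 4: everything else.
    return "down (All Paths Down)"


if __name__ == "__main__":
    # Software iSCSI: a single adapter with two healthy paths to two targets.
    software_iscsi = [
        Path("vmhba33", "target-a", "Up"),
        Path("vmhba33", "target-b", "Up"),
    ]
    print(multipathing_status(software_iscsi))
```

Because the first check counts distinct adapters, the single software iSCSI adapter can never satisfy it, however many healthy paths sit behind it.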

VMware has an open bug for this case at the moment.





Testing Scenario’s VMware / HP c-Class Infrastructure

4 12 2009

Since my blog post about Understanding HP Flex-10 Mappings with VMware ESX/vSphere is quite a big hit (judging by the page views per day), I decided to also write about the testing scenarios that should all be walked through before taking a design like this into production.

In my blog I stated:

Last word of advice: while implementing a technical environment like this it’s crucial to test every possible failure, from single ESX Host to all the separate components. I’ve wrote very detailed documents about it

So let’s take a look at these testing scenarios, which can be divided into three main subjects:

  • Hardware (ex. power redundancy)
  • Connectivity and failover within the hardware (this is Virtual Connect in my design, but it could also be normal (SAN) switch configurations, depending on the modules that are present in the enclosure)
  • Connectivity and failover within the OS (vSphere Configuration)

As a short introduction: I’ve been working with HP c-Class components ever since the first c7000 enclosure was placed in the Netherlands. In this time I’ve seen many HP c-Class implementations where people just rely on the fact that “everything is redundant” and thus assume that it simply works. As Travis Dane (Under Siege 2) said: “Did you see the body? Assumption is the mother of all F*CK UPS!”
My statement is clear: it isn’t working until you’ve seen the behavior in a failure scenario yourself.

Read the rest of this entry »





Creating easy to identify LeftHand Volumes on ESX/vSphere

3 12 2009

Coming from mostly HP EVA environments, I got used to identifying a Volume/LUN by its LUN number, which was a truly unique identifier. I could always “talk” LUN numbers and be sure that they were understood and unique.
So ever since I started working with our LeftHand environment, I have disliked the way every Volume/LUN is marked with LUN 0.

The goal of this post is to show you how you can easily rename a published LeftHand Volume to something that is easy to recognize and unique.

Read the rest of this entry »





vSphere: Freezing VMs after deleting a volume from the SAN

2 12 2009

We are running a newly designed vSphere 4.0 environment connected to a very big LeftHand iSCSI environment. Lately we discovered some major problems with a couple of VMs totally freezing for about 30 seconds. The problem seemed to occur only on several VMs from one specific host, so it was time to do some research on this host.

The first quick conclusion I could draw was that the vmkernel log was flooded (multiple entries per second) with error messages coming from the Path Selection Policy (PSP).

Dec  2 15:41:13 esxhostname vmkernel: 0:00:37:21.082 cpu14:4118)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device "naa.6000eb36b7210cc2000000000000017a".
Dec  2 15:41:13 esxhostname vmkernel: 0:00:37:21.082 cpu14:4118)WARNING: NMP: nmp_IssueCommandToDevice: I/O could not be issued to device "naa.6000eb36b7210cc2000000000000017a" due to Not found
Dec  2 15:41:13 esxhostname vmkernel: 0:00:37:21.082 cpu14:4118)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.6000eb36b7210cc2000000000000017a": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
Dec  2 15:41:13 esxhostname vmkernel: 0:00:37:21.082 cpu14:4118)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.6000eb36b7210cc2000000000000017a" is blocked. Not starting I/O from device.
Dec  2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu0:4285)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device "naa.6000eb36b7210cc2000000000000017a".
Dec  2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu2:4231)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6000eb36b7210cc2000000000000017a" - issuing command 0x4100010f2e40
Dec  2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu2:4231)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device "naa.6000eb36b7210cc2000000000000017a".
Dec  2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu2:4231)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6000eb36b7210cc2000000000000017a" - failed to issue command due to Not found (APD), try again...
Dec  2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu2:4231)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device "naa.6000eb36b7210cc2000000000000017a": awaiting fast path state update...
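
With a log flooded like this, it helps to quickly see which device the warnings are about. Below is a small Python sketch that counts these NMP/PSP warnings per device. The regular expression is based purely on the log lines shown above, and the function name `count_path_warnings` is my own invention, so treat it as a starting point rather than a complete vmkernel log parser.

```python
import re
from collections import Counter

# Matches the NMP / vmw_psp_rr warnings shown above and captures the device
# name. Both straight and typographic quotes around the name are accepted.
DEVICE_RE = re.compile(
    r'WARNING: (?:NMP|vmw_psp_rr): .*?[Dd]evice ["“](naa\.[0-9a-f]+)["”]'
)


def count_path_warnings(lines):
    """Return a Counter mapping device name -> number of matching warnings."""
    counts = Counter()
    for line in lines:
        match = DEVICE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Assumes the script runs where /var/log/vmkernel is readable.
    with open("/var/log/vmkernel", encoding="utf-8", errors="replace") as log:
        for device, hits in count_path_warnings(log).most_common():
            print(device, hits)
```

A single device with a very high count points at one volume rather than a general storage problem.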

Further investigation showed that a volume had been deleted from the LeftHand SAN and that ESX obviously didn’t handle this well, causing ALL VMs on the troubled host to freeze completely. To the user it only looks like the server is losing its network connection, but in fact it’s a real freeze that lasts from 15 to 30 seconds (in our environment). So to get a grip on the situation I froze (to stay in terms ;) ) all LUN removals, since I first wanted to reproduce this in our life-like test environment.

Read the rest of this entry »





VMotion and Exchange 2010, not supported

20 11 2009

Just a short blog post about Microsoft Exchange 2010 in combination with VMware VMotion. We are running this combination on a vSphere platform and noticed that whenever we VMotion an Exchange 2010 Mailbox server that is part of a DAG (Database Availability Group), the DAG fails.

I’m not an Exchange guru but in short this is what the Database Availability Groups look like. The green databases are active and the blue databases are the passive databases which are spread across the rest of the mailbox servers.

Anyway, the DAG fails because it relies on Windows Failover Clustering, which doesn’t work and, more importantly, isn’t supported with VMotion (the same goes for Microsoft Hyper-V Live Migration).

VMware’s Setup for Failover Clustering and Microsoft Cluster Service manual states:

Before you set up MSCS, review the list of functionality that is not supported for this release, and any
requirements and recommendations that apply to your configuration.
The following environments and functionality are not supported for MSCS setups with this release of vSphere:
- Clustering on iSCSI or NFS disks.
- Mixed environments, such as configurations where one cluster node is running a different version of
ESX/ESXi than another cluster node.
- Clustered virtual machines as part of VMware clusters (DRS or HA).
- Use of MSCS in conjunction with VMware Fault Tolerance.
- Migration with VMotion of clustered virtual machines.

Read the rest of this entry »





VMFS- and Block Size is important for virtual RDM’s

10 11 2009

A short post from me, since I got an error message while working with large RDMs. When I tweeted the message, it turned out that Duncan Epping from VMware had a post ready for which he only had to press the “Publish” button. See his very helpful article over here.

What you might have noticed is that an RDM’s size is displayed as the real size of the physical LUN that it refers to. So for example, when I publish a 1 TB LUN it will show up as a 1 TB VMDK file, even though the actual VMFS volume in which it resides is much smaller (500 GB in this example).

1 TB RDM VMDK (virtual compatibility mode) on VICL01-15

VICL01-15 showing that its actual size is 500 GB

So far so good, you might think, since the VM accepts the 1 TB RDM being connected. The strange thing, though, is that if you try to migrate this VM to another datastore, you get an error stating that the destination VMFS has insufficient disk space available, even though the destination is an empty 500 GB VMFS datastore offering more free space than the original source datastore.
Removing the RDM from the VM, migrating the VM and reconnecting the RDM does work in this situation.

Read the rest of this entry »





VMware VMotion, how fast can we go?

9 11 2009

Lately, while I was testing specific failover behaviors in vSphere, I accidentally discovered that VMotion speeds (MB/s) are logged in /var/log/vmkernel. Now that’s cool!

Issue the command tail -f /var/log/vmkernel and then initiate a VMotion. You should get info like this:

Host VMotioning to (receiving)

Nov  7 21:13:14 xxxxxxxx vmkernel: 10:06:06:18.104 cpu3:9131)VMotionRecv: 226: 1257624621919023 D: Estimated network bandwidth 280.495 MB/s   during pre-copy
Nov  7 21:13:15 xxxxxxxx vmkernel: 10:06:06:18.756 cpu2:9131)VMotionRecv: 1078: 1257624621919023 D: Estimated network bandwidth 280.050 MB/  during page-in

Host VMotioning from (sending)

Nov  9 17:44:00 xxxxxxxx vmkernel: 12:01:47:02.229 cpu12:11150)VMotionSend: 2909: 1257781936902648 S: Sent all modified pages to destination (network bandwidth ~287.381 MB/s)

The last message, “Sent all modified pages to destination (network bandwidth ~xxx.xxx MB/s)”, is the overall counter that rates the whole VMotion action.
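
If you want to collect these numbers over several VMotions instead of watching tail, a few lines of Python can pull them out of the log. This is a rough sketch based only on the three log lines shown above; the function names are made up, and the conversion assumes that the logged MB means 10^6 bytes, which I have not verified.

```python
import re

# Captures the log source (VMotionRecv / VMotionSend) and the bandwidth value.
# "MB/s?" also accepts a unit that lost its trailing "s" in the log output.
BANDWIDTH_RE = re.compile(
    r'(VMotionRecv|VMotionSend):.*?bandwidth\s+~?(\d+(?:\.\d+)?)\s*MB/s?'
)


def vmotion_bandwidths(lines):
    """Return a list of (source, MB/s) tuples found in the given log lines."""
    results = []
    for line in lines:
        match = BANDWIDTH_RE.search(line)
        if match:
            results.append((match.group(1), float(match.group(2))))
    return results


def mb_per_s_to_gbit_per_s(mb_per_s):
    """Convert MB/s to Gbit/s, ASSUMING 1 MB = 10**6 bytes."""
    return mb_per_s * 8 / 1000


if __name__ == "__main__":
    # Assumes the script runs where /var/log/vmkernel is readable.
    with open("/var/log/vmkernel", encoding="utf-8", errors="replace") as log:
        for source, speed in vmotion_bandwidths(log):
            print(source, speed, "MB/s =", mb_per_s_to_gbit_per_s(speed), "Gbit/s")
```

Converting to Gbit/s makes it easy to compare the measured speed against the bandwidth configured for the VMotion network.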

Seeing these MB/s counters, I wondered whether there is any speed limit on VMotion other than, obviously, the network speed limit. Second, I wanted to know whether we are using the full 7 Gb that I configured in our current vSphere environment.

So…… Testing Time! 

Read the rest of this entry »