Unresponsive HP Virtual Connect Manager – vcutil

24 12 2009

After rebooting several Virtual Connect modules to test the failover behavior, I got myself into a situation in which the Virtual Connect Manager became completely unresponsive. In my case HP’s vcutil eventually solved the problem, so I want to give some more information on this tool, since until now I only knew it as a Virtual Connect firmware update tool.

In my case the following statements were true:

  • While logging on through the web interface, the “Loading, please wait…” message wouldn’t disappear.
  • While logging on with SSH, I was able to enter credentials, but after that the CLI never appeared and the session eventually timed out.

Read the rest of this entry »





VMware vCenter Storage Views: Partial/No Redundancy

15 12 2009

While exploring my software iSCSI initiator environment I noticed that all VMs on every host report a “Partial/No Redundancy” status under Multipathing Status, even though I have Round Robin in place and thus 2 paths to the storage.

This behavior is a bug as confirmed by VMware Technical Support. The rule for displaying the “Multipathing Status” is as follows:

If there exists 2 or more distinct adapters AND 2 or more distinct targets
    MPStatus = Up (Full Redundancy)
else If there exists at least one path whose status is “Up”
    MPStatus = degraded (Partial/No Redundancy)
else If there exists at least one path whose status is “Unknown”
    MPStatus = unknown (Unknown)
else (for all other cases)
    MPStatus = down (All Paths Down)

With a software iSCSI initiator you only have one adapter, which is a single point of failure and therefore yields a “Partial/No Redundancy” status. So under the current Storage Views API rules, software iSCSI will always be displayed with a degraded status.
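To make the rule concrete, here is a minimal Python sketch of the decision logic quoted above. It is my own illustration, not VMware code: the Path class, the multipathing_status function and the adapter/target names are hypothetical, and the rule is implemented literally as quoted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """One storage path (hypothetical model, not a VMware API object)."""
    adapter: str  # e.g. "vmhba33" (illustrative name)
    target: str   # e.g. an iSCSI target name (illustrative)
    status: str   # "Up", "Down" or "Unknown"


def multipathing_status(paths):
    """Apply the quoted Storage Views rule literally."""
    adapters = {p.adapter for p in paths}
    targets = {p.target for p in paths}
    if len(adapters) >= 2 and len(targets) >= 2:
        return "Up (Full Redundancy)"
    if any(p.status == "Up" for p in paths):
        return "degraded (Partial/No Redundancy)"
    if any(p.status == "Unknown" for p in paths):
        return "unknown (Unknown)"
    return "down (All Paths Down)"


# Software iSCSI: two healthy paths, but both go through the same adapter.
software_iscsi = [
    Path("vmhba33", "target-a", "Up"),
    Path("vmhba33", "target-b", "Up"),
]
print(multipathing_status(software_iscsi))
```

Because the first condition needs two distinct adapters, the single software iSCSI adapter can never reach “Full Redundancy” under this rule, no matter how many targets or paths are up.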

VMware has an open bug for this case at the moment.





Testing Scenario’s VMware / HP c-Class Infrastructure

4 12 2009

Since my blog post Understanding HP Flex-10 Mappings with VMware ESX/vSphere is quite a big hit (judging by the page views per day), I decided to also write about the testing scenarios that should all be walked through before taking a design like this into production.

In my blog I stated:

Last word of advice: while implementing a technical environment like this it’s crucial to test every possible failure, from single ESX Host to all the separate components. I’ve wrote very detailed documents about it

So let’s take a look at these testing scenarios, which can be divided into three main subjects:

  • Hardware (e.g. power redundancy)
  • Connectivity and failover within the hardware (this is Virtual Connect in my design, but it could also be normal (SAN) switch configurations, depending on the modules present in the enclosure)
  • Connectivity and failover within the OS (vSphere Configuration)

As a short introduction: I’ve been working with HP c-Class components ever since the first c7000 enclosure was placed in the Netherlands. In that time I’ve seen many HP c-Class implementations where people just rely on the fact that “everything is redundant” and thus assume that it simply works. Like Travis Dane (Under Siege 2) said: Did you see the body? Assumption is the mother of all F*CK UPS!
My statement is clear: it isn’t working until you’ve seen the behavior in a failure scenario yourself.

Read the rest of this entry »





Creating easy to identify LeftHand Volumes on ESX/vSphere

3 12 2009

Coming from mostly HP EVA environments, I got used to identifying a Volume/LUN by its LUN number, which was a truly unique identifier. I could always “talk” LUN numbers and be sure that they were understood and unique.
So ever since I started working with our LeftHand environment I have disliked the way that every Volume/LUN is marked as LUN 0.

The goal of this post is to show how you can easily rename a published LeftHand Volume to something that is unique and easy to recognize.

Read the rest of this entry »





vSphere: Freezing VMs after deleting a volume from the SAN

2 12 2009

We are running a newly designed vSphere 4.0 environment connected to a very big LeftHand iSCSI environment. Lately we discovered some major problems with a couple of VMs totally freezing for about 30 seconds. The problem seemed to occur only on several VMs on one specific host, so it was time to do some research on this host.

The first quick conclusion I could draw was that the vmkernel log was flooded (multiple entries per second) with error messages coming from the Path Selection Plug-in (PSP).

Dec  2 15:41:13 esxhostname vmkernel: 0:00:37:21.082 cpu14:4118)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device "naa.6000eb36b7210cc2000000000000017a".
Dec  2 15:41:13 esxhostname vmkernel: 0:00:37:21.082 cpu14:4118)WARNING: NMP: nmp_IssueCommandToDevice: I/O could not be issued to device "naa.6000eb36b7210cc2000000000000017a" due to Not found
Dec  2 15:41:13 esxhostname vmkernel: 0:00:37:21.082 cpu14:4118)WARNING: NMP: nmp_DeviceRetryCommand: Device "naa.6000eb36b7210cc2000000000000017a": awaiting fast path state update for failover with I/O blocked. No prior reservation exists on the device.
Dec  2 15:41:13 esxhostname vmkernel: 0:00:37:21.082 cpu14:4118)WARNING: NMP: nmp_DeviceStartLoop: NMP Device "naa.6000eb36b7210cc2000000000000017a" is blocked. Not starting I/O from device.
Dec  2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu0:4285)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate: Could not select path for device "naa.6000eb36b7210cc2000000000000017a".
Dec  2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu2:4231)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6000eb36b7210cc2000000000000017a" - issuing command 0x4100010f2e40
Dec  2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu2:4231)WARNING: vmw_psp_rr: psp_rrSelectPath: Could not select path for device "naa.6000eb36b7210cc2000000000000017a".
Dec  2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu2:4231)WARNING: NMP: nmp_DeviceAttemptFailover: Retry world failover device "naa.6000eb36b7210cc2000000000000017a" - failed to issue command due to Not found (APD), try again...
Dec  2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu2:4231)WARNING: NMP: nmp_DeviceAttemptFailover: Logical device "naa.6000eb36b7210cc2000000000000017a": awaiting fast path state update...
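A flood like this is easier to judge when the warnings are counted per module and device. The following is a rough Python sketch of how that could be done on a copy of the log. The regular expression is only derived from the lines shown above, and the function name count_warnings is my own. It is not an official log format definition.

```python
import re
from collections import Counter

# Pattern derived from the sample lines above (an assumption, not a spec).
# Both straight and typographic quotes around the device name are accepted.
WARNING_RE = re.compile(
    r'WARNING: (?P<module>\w+): (?P<func>\w+): '
    r'.*?["“](?P<device>naa\.[0-9a-f]+)["”]'
)


def count_warnings(lines):
    """Count WARNING entries per (module, device) pair."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[(match.group("module"), match.group("device"))] += 1
    return counts


SAMPLE = [
    'Dec  2 15:41:13 esxhostname vmkernel: 0:00:37:21.082 cpu14:4118)WARNING: vmw_psp_rr: '
    'psp_rrSelectPath: Could not select path for device "naa.6000eb36b7210cc2000000000000017a".',
    'Dec  2 15:41:13 esxhostname vmkernel: 0:00:37:21.082 cpu14:4118)WARNING: NMP: '
    'nmp_IssueCommandToDevice: I/O could not be issued to device '
    '"naa.6000eb36b7210cc2000000000000017a" due to Not found',
    'Dec  2 15:41:14 esxhostname vmkernel: 0:00:37:22.084 cpu2:4231)WARNING: vmw_psp_rr: '
    'psp_rrSelectPath: Could not select path for device "naa.6000eb36b7210cc2000000000000017a".',
]

for (module, device), total in count_warnings(SAMPLE).items():
    print(module, device, total)
```

A single device showing up in nearly every warning, as in this log, points at one volume rather than a general path problem.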

Further investigation showed that a volume had been deleted from the LeftHand SAN and ESX obviously didn’t handle this well, causing ALL VMs on the troubled host to freeze completely. To the user it only appears as if the server is losing its network connection, but in fact it’s a real freeze that lasts 15 to 30 seconds (in our environment). So to get a grip on the situation I froze (to stay in terms ;) all LUN removals, since I first wanted to reproduce this in our life-like test environment.

Read the rest of this entry »





VMotion and Exchange 2010, not supported

20 11 2009

Just a short blog post about Microsoft Exchange 2010 in combination with VMware VMotion. We are running this combination hosted on a vSphere platform and noticed that whenever we VMotion an Exchange 2010 Mailbox server that is a member of a DAG (Database Availability Group), the DAG fails.

I’m not an Exchange guru but in short this is what the Database Availability Groups look like. The green databases are active and the blue databases are the passive databases which are spread across the rest of the mailbox servers.

Anyway, the DAG fails because it relies on Windows Failover Clustering, which doesn’t work with VMotion and, more importantly, isn’t supported with it (the same goes for Microsoft Hyper-V Live Migration).

VMware’s Setup for Failover Clustering and Microsoft Cluster Service manual states:

Before you set up MSCS, review the list of functionality that is not supported for this release, and any requirements and recommendations that apply to your configuration.
The following environments and functionality are not supported for MSCS setups with this release of vSphere:
- Clustering on iSCSI or NFS disks.
- Mixed environments, such as configurations where one cluster node is running a different version of ESX/ESXi than another cluster node.
- Clustered virtual machines as part of VMware clusters (DRS or HA).
- Use of MSCS in conjunction with VMware Fault Tolerance.
- Migration with VMotion of clustered virtual machines.

Read the rest of this entry »





VMFS- and Block Size is important for virtual RDM’s

10 11 2009

A little post from me, since I got an error message while working with large RDMs. When I tweeted the message, it turned out that Duncan Epping from VMware had a post ready for which he only had to press the “Publish” button. See his very helpful article over here.

What you might have noticed is that an RDM’s size is displayed as the real size of the physical LUN it refers to. So for example, when I publish a 1 TB LUN it shows up as a 1 TB VMDK file, even though the actual VMFS volume in which it resides is much smaller (500 GB in this example).

[Screenshot: 1 TB RDM VMDK (virtual compatibility mode) on VICL01-15]

[Screenshot: VICL01-15 showing that its actual size is 500 GB]

So far so good, you might think, since the VM accepts that you connect this 1 TB RDM. The strange thing, though, is that if you try to migrate this VM to another datastore, it gives an error stating that the destination VMFS has insufficient disk space available, even though the destination is an empty 500 GB VMFS datastore offering more free space than the source datastore.
Removing the RDM from the VM, migrating the VM and reconnecting the RDM does work in this situation.

Read the rest of this entry »





VMware VMotion, how fast can we go?

9 11 2009

Lately, while I was testing specific failover behaviors in vSphere, I accidentally discovered that VMotion speeds (MB/s) are logged in /var/log/vmkernel. Now that’s cool!

Issue the command tail -f /var/log/vmkernel and then initiate a VMotion. You should get info like this:

Destination host (receiving)

Nov  7 21:13:14 xxxxxxxx vmkernel: 10:06:06:18.104 cpu3:9131)VMotionRecv: 226: 1257624621919023 D: Estimated network bandwidth 280.495 MB/s during pre-copy
Nov  7 21:13:15 xxxxxxxx vmkernel: 10:06:06:18.756 cpu2:9131)VMotionRecv: 1078: 1257624621919023 D: Estimated network bandwidth 280.050 MB/s during page-in

Source host (sending)

Nov  9 17:44:00 xxxxxxxx vmkernel: 12:01:47:02.229 cpu12:11150)VMotionSend: 2909: 1257781936902648 S: Sent all modified pages to destination (network bandwidth ~287.381 MB/s)

The last message, “Sent all modified pages to destination (network bandwidth ~xxx.xxx MB/s)”, is the overall figure for the whole VMotion action.

Seeing these MB/s counters, I wondered whether there is any speed limit on VMotion other than, obviously, the network speed. Second, I wanted to know whether we are using the full 7 Gb that I configured in our current vSphere environment.
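To compare the logged figures with a link speed, the values can be pulled out of the log and converted to Gbit/s. Below is a small Python sketch of that idea. The regular expression is only based on the log lines shown above, the function names are mine, and the conversion assumes that 1 MB means 10^6 bytes, which the log itself does not confirm.

```python
import re

# Pattern derived from the sample log lines above (an assumption, not a spec).
BANDWIDTH_RE = re.compile(
    r'(VMotionRecv|VMotionSend):.*?network bandwidth ~?(\d+(?:\.\d+)?) MB/s'
)


def vmotion_bandwidths(lines):
    """Return (direction, MB/s) tuples for every matching log line."""
    results = []
    for line in lines:
        match = BANDWIDTH_RE.search(line)
        if match:
            results.append((match.group(1), float(match.group(2))))
    return results


def mb_per_s_to_gbit_per_s(mb_per_s):
    """Convert MB/s to Gbit/s, assuming 1 MB = 10**6 bytes."""
    return mb_per_s * 8 / 1000


SAMPLE = [
    "Nov  7 21:13:14 xxxxxxxx vmkernel: 10:06:06:18.104 cpu3:9131)VMotionRecv: 226: "
    "1257624621919023 D: Estimated network bandwidth 280.495 MB/s during pre-copy",
    "Nov  9 17:44:00 xxxxxxxx vmkernel: 12:01:47:02.229 cpu12:11150)VMotionSend: 2909: "
    "1257781936902648 S: Sent all modified pages to destination "
    "(network bandwidth ~287.381 MB/s)",
]

for direction, mb_per_s in vmotion_bandwidths(SAMPLE):
    print(f"{direction}: {mb_per_s} MB/s = {mb_per_s_to_gbit_per_s(mb_per_s):.2f} Gbit/s")
```

Under that assumption, the roughly 287 MB/s from the sending host works out to about 2.3 Gbit/s, well below the 7 Gb configured for VMotion.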

So…… Testing Time! 

Read the rest of this entry »





Understanding HP Flex-10 Mappings with VMware ESX/vSphere

4 11 2009

I’ve written this blog as an add-on to Frank Denneman’s blog about Flex-10 which you can find over here.
The goal of this post is to give a clear picture of the Flex-10 port mappings that HP uses to provide its blades with NICs, with a special focus on VMware ESX/vSphere.

The discussed sample in this blog could also be used for a real production environment (in fact, it is ;)

Read the rest of this entry »





VMware VCDX Design Exam, watch out for the “Next”-Button!

31 10 2009

Yesterday I made progress in my pursuit of the VMware VCDX certification by completing the Design Exam in Frankfurt. I have a very important warning about this exam for all of you who are going to take it in the future.

The exam consists of 2 sections: one section in which you get 51 questions and a second section in which you have to draw a design following the customer case on screen (you get 20 minutes for this assignment).
So I finished the first section, pressed “End Exam” in the review screen (just like always in the Pearson VUE exams) and then section 2 popped up, showing me the customer case with all the options for drawing the design. To me it looked like the drawing screen was divided into 2 sections, so after reading the case and seeing the “Next” button at the bottom of the screen I decided to press it to see whether it would open another drawing or additional information.

After pressing the “Next” button my PC started rattling and on my screen the message “Congratulations you have passed the Exam” appeared (with my score on the next line). I was completely shocked, not because I passed the exam but because I still had 17 minutes left for my design, which I hadn’t even touched!

So for all of you who are going to take the VCDX Design Exam, be warned: the “Next” button in Section 2 of the exam actually means “End Exam”!