vmware-esxi interview questions

Top vmware-esxi frequently asked interview questions

How to run VMWare ESX or ESXi in a virtual machine?

Can VMWare ESX or ESXi be installed and used inside a virtual machine?

It can be installed inside VMWare Workstation or Server, but then it doesn't work; the main symptoms are:

  • It runs REALLY slowly.
  • It lets you create VMs, but when powering up them it gives an error stating "You may not power on a virtual machine in a virtual machine".

Source: (StackOverflow)

What are the benefits of "enterprise-level" virtualization?

For a company with modest virtualization needs - VirtualBox is currently doing fine at hosting a few light servers - what would some of the benefits be of moving to a more robust platform?

I'm hoping to shortcut my research a bit - to get a short list of the features enterprise-level virtualization has that VBox and its ilk don't.

Source: (StackOverflow)

How to [politely?] tell software vendor they don't know what they're talking about

Not a technical question, but a valid one nonetheless. Scenario:

HP ProLiant DL380 Gen 8 with 2 x 8-core Xeon E5-2667 CPUs and 256GB RAM running ESXi 5.5. Eight VMs for a given vendor's system. Four VMs for test, four VMs for production. The four servers in each environment perform different functions, e.g.: web server, main app server, OLAP DB server and SQL DB server.

CPU shares configured to stop the test environment from impacting production. All storage on SAN.

We've had some queries regarding performance, and the vendor insists that we need to give the production system more memory and vCPUs. However, we can clearly see from vCenter that the existing allocations aren't being touched, e.g.: a monthly view of CPU utilization on the main application server hovers around 8%, with the odd spike up to 30%. The spikes tend to coincide with the backup software kicking in.

Similar story on RAM - the highest utilization figure across the servers is ~35%.

So, we've been doing some digging, using Process Monitor (Microsoft SysInternals) and Wireshark, and our recommendation to the vendor is that they do some TNS tuning in the first instance. However, this is besides the point.

My question is: how do we get them to acknowledge that the VMware statistics that we've sent them are evidence enough that more RAM/vCPU won't help?

--- UPDATE 12/07/2014 ---

Interesting week. Our IT management have said that we should make the change to the VM allocations, and we're now waiting for some downtime from the business users. Strangely, the business users are the ones saying that certain aspects of the app are running slowly (compared to what, I don't know), but they're going to "let us know" when we can take the system down (grumble, grumble!).

As an aside, the "slow" aspect of the system is apparently not the HTTP(S) element, i.e.: the "thin app" used by most of the users. It sounds like it's the "fat client" installs, used by the main finance bods, that is apparently "slow". This means that we're now considering the client and the client-server interaction in our investigations.

As the initial purpose of the question was to seek assistance as to whether to go down the "poke it" route, or just make the change, and we're now making the change, I'll close it using longneck's answer.

Thank you all for your input; as usual, serverfault has been more than just a forum - it's kind of like a psychologist's couch as well :-)

Source: (StackOverflow)

vSphere education - What are the downsides of configuring VMs with *too* much RAM?

VMware memory management seems to be a tricky balancing act. With cluster RAM, Resource Pools, VMware's management techniques (TPS, ballooning, host swapping), in-guest RAM utilization, swapping, reservations, shares and limits, there are a lot of variables.

I'm in a situation where clients are using dedicated vSphere cluster resources. However, they are configuring the virtual machines as though they were on physical hardware. In turn, this means a standard VM build may have 4 vCPUs and 16GB or more of RAM. I come from the school of starting small (1 vCPU, minimal RAM), checking real-world use and adjusting up as necessary. Unfortunately, many vendor requirements and people unfamiliar with virtualization request more resources than necessary... I'm interested in quantifying the impact of this decision.

Some examples from a "problem" cluster.

Resource pool summary - Looks almost 4:1 overcommitted. Note the high amount of ballooned RAM. enter image description here

Resource allocation - The Worst Case Allocation column shows that these VMs would have access to less than 50% of their configured RAM under constrained conditions. enter image description here

The real-time memory utilization graph of the top VM in the listing above. 4 vCPU and 64GB RAM allocated. It averages under 9GB use. enter image description here

Summary of the same VM enter image description here

  • What are the downsides of overcommitting and overconfiguring resources (specifically RAM) in vSphere environments?

  • Assuming that the VMs can run in less RAM, is it fair to say that there's overhead to configuring virtual machines with more RAM than they actually need?

  • What is the counter-argument to: "if a VM has 16GB of RAM allocated, but only uses 4GB, what's the problem??"? E.g. do customers need to be educated that VMs are not the same as physical hardware?

  • What specific metric(s) should be used to meter RAM usage. Tracking the peaks of "Active" versus time? Watching "Consumed"?

Update: I used vCenter Operations Manager to profile this environment and get some detail on the cluster stats listed above. While things are definitely overcommitted, the VMs are actually so overconfigured with unnecessary RAM that the real (tiny) memory footprint shows no memory contention at the cluster/host level...

My takeaway is that VMs should really be right-sized with a little bit of buffer for OS-level caching. Overcommitting out of ignorance or vendor "requirements" leads to the situation presented here. Memory ballooning seems to be bad in every case, as there is a performance impact, so right-sizing can help prevent this.

Update 2: Some of these VMs are beginning to crash with:

kernel:BUG: soft lockup - CPU#1 stuck for 71s! 

VMware describes this as a symptom of heavy memory overcommitment. So I guess that answers the question.

enter image description here

vCops "Oversized Virtual Machines" report... enter image description here

vCops "Reclaimable Waste" graph...

enter image description here

Source: (StackOverflow)

Troubleshooting latency spikes on ESXi NFS datastores

I'm experiencing fsync latencies of around five seconds on NFS datastores in ESXi, triggered by certain VMs. I suspect this might be caused by VMs using NCQ/TCQ, as this does not happen with virtual IDE drives.

This can be reproduced using fsync-tester (by Ted Ts'o) and ioping. For example using a Grml live system with a 8GB disk:

Linux 2.6.33-grml64:
root@dynip211 /mnt/sda # ./fsync-tester
fsync time: 5.0391
fsync time: 5.0438
fsync time: 5.0300
fsync time: 0.0231
fsync time: 0.0243
fsync time: 5.0382
fsync time: 5.0400
[... goes on like this ...]

That is 5 seconds, not milliseconds. This is even creating IO-latencies on a different VM running on the same host and datastore:

root@grml /mnt/sda/ioping-0.5 # ./ioping -i 0.3 -p 20 .
4096 bytes from . (reiserfs /dev/sda): request=1 time=7.2 ms
4096 bytes from . (reiserfs /dev/sda): request=2 time=0.9 ms
4096 bytes from . (reiserfs /dev/sda): request=3 time=0.9 ms
4096 bytes from . (reiserfs /dev/sda): request=4 time=0.9 ms
4096 bytes from . (reiserfs /dev/sda): request=5 time=4809.0 ms
4096 bytes from . (reiserfs /dev/sda): request=6 time=1.0 ms
4096 bytes from . (reiserfs /dev/sda): request=7 time=1.2 ms
4096 bytes from . (reiserfs /dev/sda): request=8 time=1.1 ms
4096 bytes from . (reiserfs /dev/sda): request=9 time=1.3 ms
4096 bytes from . (reiserfs /dev/sda): request=10 time=1.2 ms
4096 bytes from . (reiserfs /dev/sda): request=11 time=1.0 ms
4096 bytes from . (reiserfs /dev/sda): request=12 time=4950.0 ms

When I move the first VM to local storage it looks perfectly normal:

root@dynip211 /mnt/sda # ./fsync-tester
fsync time: 0.0191
fsync time: 0.0201
fsync time: 0.0203
fsync time: 0.0206
fsync time: 0.0192
fsync time: 0.0231
fsync time: 0.0201
[... tried that for one hour: no spike ...]

Things I've tried that made no difference:

  • Tested several ESXi Builds: 381591, 348481, 260247
  • Tested on different hardware, different Intel and AMD boxes
  • Tested with different NFS servers, all show the same behavior:
    • OpenIndiana b147 (ZFS sync always or disabled: no difference)
    • OpenIndiana b148 (ZFS sync always or disabled: no difference)
    • Linux 2.6.32 (sync or async: no difference)
    • It makes no difference if the NFS server is on the same machine (as a virtual storage appliance) or on a different host

Guest OS tested, showing problems:

  • Windows 7 64 Bit (using CrystalDiskMark, latency spikes happen mostly during preparing phase)
  • Linux 2.6.32 (fsync-tester + ioping)
  • Linux 2.6.38 (fsync-tester + ioping)

I could not reproduce this problem on Linux 2.6.18 VMs.

Another workaround is to use virtual IDE disks (vs SCSI/SAS), but that is limiting performance and the number of drives per VM.

Update 2011-06-30:

The latency spikes seem to happen more often if the application writes in multiple small blocks before fsync. For example fsync-tester does this (strace output):

pwrite(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 1048576, 0) = 1048576
fsync(3)                                = 0

ioping does this while preparing the file:

[lots of pwrites]
pwrite(3, "********************************"..., 4096, 1036288) = 4096
pwrite(3, "********************************"..., 4096, 1040384) = 4096
pwrite(3, "********************************"..., 4096, 1044480) = 4096
fsync(3)                                = 0

The setup phase of ioping almost always hangs, while fsync-tester sometimes works fine. Is someone capable of updating fsync-tester to write multiple small blocks? My C skills suck ;)

Update 2011-07-02:

This problem does not occur with iSCSI. I tried this with the OpenIndiana COMSTAR iSCSI server. But iSCSI does not give you easy access to the VMDK files so you can move them between hosts with snapshots and rsync.

Update 2011-07-06:

This is part of a wireshark capture, captured by a third VM on the same vSwitch. This all happens on the same host, no physical network involved.

I've started ioping around time 20. There were no packets sent until the five second delay was over:

No.  Time        Source                Destination           Protocol Info
1082 16.164096        NFS      V3 WRITE Call (Reply In 1085), FH:0x3eb56466 Offset:0 Len:84 FILE_SYNC
1083 16.164112        NFS      V3 WRITE Call (Reply In 1086), FH:0x3eb56f66 Offset:0 Len:84 FILE_SYNC
1084 16.166060        TCP      nfs > iclcnet-locate [ACK] Seq=445 Ack=1057 Win=32806 Len=0 TSV=432016 TSER=769110
1085 16.167678        NFS      V3 WRITE Reply (Call In 1082) Len:84 FILE_SYNC
1086 16.168280        NFS      V3 WRITE Reply (Call In 1083) Len:84 FILE_SYNC
1087 16.168417        TCP      iclcnet-locate > nfs [ACK] Seq=1057 Ack=773 Win=4163 Len=0 TSV=769110 TSER=432016
1088 23.163028        NFS      V3 GETATTR Call (Reply In 1089), FH:0x0bb04963
1089 23.164541        NFS      V3 GETATTR Reply (Call In 1088)  Directory mode:0777 uid:0 gid:0
1090 23.274252        TCP      iclcnet-locate > nfs [ACK] Seq=1185 Ack=889 Win=4163 Len=0 TSV=769821 TSER=432716
1091 24.924188        RPC      Continuation
1092 24.924210        RPC      Continuation
1093 24.924216        RPC      Continuation
1094 24.924225        RPC      Continuation
1095 24.924555        TCP      nfs > iclcnet_svinfo [ACK] Seq=6893 Ack=1118613 Win=32625 Len=0 TSV=432892 TSER=769986
1096 24.924626        RPC      Continuation
1097 24.924635        RPC      Continuation
1098 24.924643        RPC      Continuation
1099 24.924649        RPC      Continuation
1100 24.924653        RPC      Continuation

2nd Update 2011-07-06:

There seems to be some influence from TCP window sizes. I was not able to reproduce this problem using FreeNAS (based on FreeBSD) as a NFS server. The wireshark captures showed TCP window updates to 29127 bytes in regular intervals. I did not see them with OpenIndiana, which uses larger window sizes by default.

I can no longer reproduce this problem if I set the following options in OpenIndiana and restart the NFS server:

ndd -set /dev/tcp tcp_recv_hiwat 8192 # default is 128000
ndd -set /dev/tcp tcp_max_buf 1048575 # default is 1048576

But this kills performance: Writing from /dev/zero to a file with dd_rescue goes from 170MB/s to 80MB/s.

Update 2011-07-07:

I've uploaded this tcpdump capture (can be analyzed with wireshark). In this case is the NFS server (OpenIndiana b148) and is the ESXi host.

Things I've tested during this capture:

Started "ioping -w 5 -i 0.2 ." at time 30, 5 second hang in setup, completed at time 40.

Started "ioping -w 5 -i 0.2 ." at time 60, 5 second hang in setup, completed at time 70.

Started "fsync-tester" at time 90, with the following output, stopped at time 120:

fsync time: 0.0248
fsync time: 5.0197
fsync time: 5.0287
fsync time: 5.0242
fsync time: 5.0225
fsync time: 0.0209

2nd Update 2011-07-07:

Tested another NFS server VM, this time NexentaStor 3.0.5 community edition: Shows the same problems.

Update 2011-07-31:

I can also reproduce this problem on the new ESXi build

Source: (StackOverflow)

Is there a way to clone an existing VM on an ESXi server without having to re-import it?

I have imported a VM into an ESXi server using the VMware "Converter stand alone utility". However, this process is a little tedious. Since I eventually want several copies of this VM (or, to be more accurate, several more-or-less identical instances of this VM) to be running at the same time, is there a way to clone the already imported VM? Or do I have to re-import for each instance I want?

Source: (StackOverflow)

Are VMware ESXi 5 patches cumulative?

This seems basic, but I'm confused about the patching strategy involved with manually updating standalone VMware ESXi hosts. The VMware vSphere blog attempts to explain this, but the actual process is still not clear to me.

From the blog:
Say Patch01 includes updates for the following VIBs: "esxi-base", "driver10" and "driver 44". And then later Patch02 comes out with updates to "esxi-base", "driver20" and "driver 44". P2 is cumulative in that the "esxi-base" and "driver44" VIBs will include the updates in Patch01. However, it's important to note that Patch02 not include the "driver 10" VIB as that module was not updated.

This VMware Communities post gives a different answer. This one contradicts the other.

Many of the ESXi installations I encounter are standalone and do not utilize Update Manager. It is possible to update an individual host using the patches make available through the VMWare patch download portal. The process is quite simple, so that part makes sense.

The bigger issue is determining what exactly to actually download and install. In my case, I have a good number of HP-specific ESXi builds that incorporate sensors and management for HP ProLiant hardware.

  • Let's say that those servers start with an ESXi build #474610 from 9/2011.
  • Looking at the patch portal screenshot below, there is a patch for ESXi update01, build #623860. There are also patches for builds #653509 and #702118.
  • Coming an old version of ESXi (e.g. vendor-specific build), what is the proper approach to bring the system fully up-to-date? Which patches are cumulative and which need to be applied sequentially? Is installing the newest build the right approach, or do I need to step back and patch incrementally?
  • Another consideration is the large size of the patch downloads. At sites with limited bandwidth, downloading of multiple ~300mb patches is difficult.

enter image description here

Source: (StackOverflow)

When deploying a single server on new hardware, do you virtualize it or not?

There are a few questions that I've found on ServerFault that hint around this topic, and while it may be somewhat opinion-based, I think it can fall into that "good subjective" category based on the below:

Constructive subjective questions:

* tend to have long, not short, answers
* have a constructive, fair, and impartial tone
* invite sharing experiences over opinions
* insist that opinion be backed up with facts and references
* are more than just mindless social fun

So that out of the way.

I'm helping out a fellow sysadmin that is replacing an older physical server running Windows 2003 and he's looking to not only replace the hardware but "upgrade" to 2012 R2 in the process.

In our discussions about his replacement hardware, we discussed the possibility of him installing ESXi and then making the 2012 "server" a VM and migrating the old apps/files/roles from the 2003 server to the VM instead of to a non-VM install on the new hardware.

He doesn't perceive any time in the next few years the need to move anything else to a VM or create additional VMs, so in the end this will either be new hardware running a normal install or new hardware running a single VM on ESXi.

My own experience would lean towards a VM still, there isn't a truly compelling reason to do so other than possibilities that may arise to create additional VMs. But there is the additional overhead and management aspect of the hypervisor now, albeit I have experienced better management capabilities and reporting capabilities with a VM.

So with the premise of hoping this can stay in the "good subjective" category to help others in the future, what experiences/facts/references/constructive answers do you have to help support either outcome (virtualizing or not a single "server")?

Source: (StackOverflow)

Is there still a use for irqbalance on modern hardware?

This question has been asked before, but I believe that the world has changed enough for it to be asked again.

Does irqbalance have any use on today’s systems where we have NUMA-capable CPUs with memory sharing between their cores?

Running irqbalance --oneshot --debug shows that a virtual guest on a modern VMware ESXi environment is sharing the NUMA nodes between cores.

# irqbalance --oneshot --debug 3
Package 0:  numa_node is 0 cpu mask is 0000000f (load 0)
    Cache domain 0:  numa_node is 0 cpu mask is 0000000f  (load 0)
            CPU number 0  numa_node is 0 (load 0)           
            CPU number 1  numa_node is 0 (load 0)
            CPU number 2  numa_node is 0 (load 0)
            CPU number 3  numa_node is 0 (load 0)

irqbalance will in this case detect that it is being run on a NUMA system, and exit. This messes with our process monitoring.

Should we look into running numad instead of irqbalance on such systems?

This is mostly interesting for VMware virtualised servers.

Source: (StackOverflow)

What kind of storage do people actually use for VMware ESX servers?

VMware and many network evangelists try to tell you that sophisticated (=expensive) fiber SANs are the "only" storage option for VMware ESX and ESXi servers. Well, yes, of course. Using a SAN is fast, reliable and makes vMotion possible. Great. But: Can all ESX/ESXi users really afford SANs?

My theory is that less than 20% of all VMware ESX installations on this planet actually use fiber or iSCS SANs. Most of these installation will be in larger companies who can afford this. I would predict that most VMware installations use "attached storage" (vmdks are stored on disks inside the server). Most of them run in SMEs and there are so many of them!

We run two ESX 3.5 servers with attached storage and two ESX 4 servers with an iSCS san. And the "real live difference" between both is barely notable :-)

Do you know of any official statistics for this question? What do you use as your storage medium?

Source: (StackOverflow)

Why are snapshots considered as temporary backups not real backups?

I am using VMware ESXi. In our team we use to provide snapshots for long term backup.

Then we faced issues like memory spillover and the server got hang up.

I started reading in VMware knowledgebase articles and everywhere. Everywhere it was recommended not to have snapshots for a long time.

Even VMware advised to keep snapshots for maximum of three days.

But our team kept asking us to have at least two permanent snapshots (till deleting the VM). Sometimes we may use the VM for a year).

  1. one snapshot is for fresh machine state. (So when we complete testing an application, we will revert back to fresh state and install another application) (If I did not allow that, I may often need to host the VM.)

  2. Next snapshot for keeping the VM in some state (maybe they would have found an issue and keep that state for some time. Or they may install prerequisites for the application and keep the machine ready for testing.)

Logically, their needs seems to be fair. But if I allow that, I am to permit them to hold the snapshots for long time. We are not using our VM as a mail server or database server.

Why is keeping snapshots for long time having an adverse effect?

Why are snapshots considered as temporary backups, not real backups?

Source: (StackOverflow)

VMWare ESXi free?

Much is made of the fact that VMWare's ESXi hypervisor is "free"

As best I can tell, you can install the hypervisor on a host for "free".

Because ESXi does not have a built in management console, you need a program, of some sort, to connect to the ESXi hosts to "manage" them. By "manage" I mean, start, stop, install, reboot and backup vms.

If you install the free ESXi on a host and connect to it via a web browser, you are prompted to download vSphere to manage the host. OK, but vSphere is, as best I can tell, not free. When you install it you are continuously reminded that you have only 60 days to evaluate vSphere.

My question is this: Is there a completely free management tool for ESXi hosts that enables one to:

  • Create VMs
  • Modify VMs settings (memory etc.)
  • Power VMs on and off
  • Backup the VM (via any means)
  • Resore a VM from a backup

Failing that, without licensing something from VMWare, is there any tool that will let you manage your hosts after the 60 day evaluation period of vSphere ends?

I have not found a straightforward explanation of this on VMWare's web site. Does anyone out there know the answer (even better if you can point me to a clear explanation on VMWare's website...)

Source: (StackOverflow)

mount: unknown filesystem type 'linux_raid_member'

I have Suse Linux 12.1 and i am trying to mount a single RAID 1 disk, to explore the files in it. However when mounting it:

 # mount /dev/sdc1 /mnt/test
  mount: unknown filesystem type 'linux_raid_member' 

I started reading around and many advised to just force the filessystem type

  # mount -t ext4 /dev/sdc1 /mnt/test
  mount: /dev/sdc1 already mounted or /mnt/test busy

when trying

 umount /dev/sdc1                 
 umount: /dev/sdc1: not mounted

Could someone provide some advise?

I am running my machines insed an ESXI server and it is a virtual disk. However this should not play, as this disks are not used by any other machines


Source: (StackOverflow)

Hyper-V vs. ESXi vs. XenServer

Can anyone share their experiences (for example, this was great! This failed miserably!) with using the Hyper-V, ESXi, and XenServer virtualization platforms? Cost? Management? features? Handling load and backups and recovery?

And also minimum server requirements?

I thought Xen was a free virtualization platform for Linux. Is there a Xen and a separate XenServer platform?

Opinions and observations would be appreciated for a test rollout for our organization.

Source: (StackOverflow)

What happens when the USB key or SD card I've installed VMware ESXi on fails?

An SD (SDHC) card installed in an HP ProLiant DL380p Gen8 server running VMware ESXi just failed :(

I encountered some ominous looking messages on the vCenter console and in the HP ProLiant ILO event log...

Lost connectivity to the device ... backing the boot filesystem. As a result, host configuration changes will not be saved to persistent storage.

enter image description here

Embedded Flash/SD-CARD: Error writing media 0, physical block 848880: Stack Exception.

enter image description here

VMware advocates the use of USB and SD (SDHC) boot devices for ESXi. It was one of the main reasons the smaller footprint ESXi was developed (versus the older ESX). I've spent much time highlighting the differences between ESXi's installable and embedded modes to coworkers and clients. However, these failures do seem to happen. In this case, this is my third instance.

Luckily, this is a vSphere cluster with SAN storage. What steps should be taken to remediate this failure?

Source: (StackOverflow)