Trouble-shooting a memory-bound VM; OR: Adventures in stupidity


I had fun at work this week.  I spent a lot of time chasing my tail, trying to figure out why one of our SQL servers was frequently memory-bound.  Note that I have removed server names and other identifying material from screenshots.  One of the greatest frustrations in this new job is finding things my former colleagues either a) completely fucked up, or b) never noticed were a problem.  A good example (which I wish I'd documented as well as my story below):  A SQL server (this very same one, as it happens), averaging 99% CPU consumption, 24/7.  One of my former colleagues (the one with DBA aspirations) decided that the problem was a corrupt SQL installation, and had intended to rebuild the machine from scratch.  I thought this sounded like bollocks, and with a bit of googling, learning, reading BOL and testing, discovered that one of the databases had a table that needed an index applied.  CPU usage fell to approximately 3% average.  A rebuild of the machine would never have fixed that.

Anyway… on to today's tale.

Background:  I have been asked to prepare one of our servers for a SQL version upgrade (from 2008 to 2008 R2).  As part of this prep work, performance baselines were taken.  The performance log files were parsed by an automated tool called PAL (http://pal.codeplex.com/).  PAL found that this VM was severely memory-bound.


 

This conclusion was supported by casual observation in Task Manager:

 

Manual analysis of the performance log files did not reveal the source of the RAM constraints.  All counters suggested that the total memory used by processes was on the order of 2GB.

 

The VM has been configured with 8GB of RAM.  Neither the automated nor the manual analysis made sense.  How can a system using only 2GB of 8GB be memory-bound?  The flat-line nature of the graphs bothered me.  My experience has been that flat lines are usually the result of an artificial constraint (e.g. bandwidth consumption limits on WAN traffic, imposed by network shaping).
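The "flat line means artificial cap" hunch can be turned into a rough check.  This is only a sketch of one possible heuristic, not something PAL or Perfmon does: the function name, the thresholds and the sample values are all made up for illustration.

```python
def looks_capped(samples, tolerance=0.02, min_fraction=0.8):
    """Return True if most samples sit within `tolerance` of the series maximum.

    Hypothetical heuristic: a counter that spends most of its time pressed
    against its own peak is probably hitting a ceiling of some kind.
    """
    if not samples:
        return False
    peak = max(samples)
    if peak <= 0:
        return False
    near_peak = sum(1 for s in samples if s >= peak * (1 - tolerance))
    return near_peak / len(samples) >= min_fraction


# Invented samples (MB), NOT taken from the performance logs in this post.
capped = [1790, 1842, 1843, 1843, 1842, 1843, 1843, 1843, 1843, 1843]
organic = [900, 1200, 1500, 1100, 1750, 1300, 1600, 1000, 1400, 1843]

print(looks_capped(capped))   # most samples hug the peak
print(looks_capped(organic))  # only one sample reaches the peak
```

A check like this only says that something is holding the counter down.  It says nothing about what, which is where the rest of the digging comes in.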

 

Since this server is a SQL box, I investigated SQL’s memory usage (counter: SQLServer:Memory Manager\Target Server Memory (KB)):

 

SQL appeared to be consuming up to approximately 1.8GB of RAM. 

 

By default, SQL Server's max server memory setting is 2,147,483,647MB, which lets an instance consume up to roughly 2PB of RAM.  I checked to see if SQL had been “held back” by a non-standard configuration:

 

Someone had imposed a limit of 6.7GB of RAM usage on this SQL Server instance.  Why this particular number was chosen is unknown to me, but in the context of the current problem, it did not appear to be a contributor: if this limit were the constraint, we would’ve seen SQL consuming far more than 2GB of RAM before hitting it.  In short, automated and manual analysis of the performance logs did not show what was consuming this system’s memory.
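For anyone wondering where the 2PB figure comes from, and why I ruled out the 6.7GB cap, here is the arithmetic as a small sketch.  The default of 2,147,483,647MB is SQL Server's documented default for max server memory.  The `is_binding` helper and its 5% headroom are my own assumptions for illustration, not a SQL Server rule.

```python
# SQL Server's default "max server memory (MB)" value.
DEFAULT_MAX_SERVER_MEMORY_MB = 2_147_483_647


def mb_to_pb(mb):
    """Convert megabytes to petabytes (MB -> GB -> TB -> PB)."""
    return mb / 1024 ** 3


def is_binding(cap_mb, observed_mb, headroom=0.05):
    """Assumed rule of thumb: a cap is only a plausible bottleneck
    if usage is pressed up against it (within `headroom`)."""
    return observed_mb >= cap_mb * (1 - headroom)


configured_cap_mb = 6.7 * 1024  # the 6.7GB limit found on this instance
observed_mb = 1.8 * 1024        # roughly what SQL appeared to be using

print(round(mb_to_pb(DEFAULT_MAX_SERVER_MEMORY_MB), 2))  # about 2 PB
print(is_binding(configured_cap_mb, observed_mb))         # nowhere near the cap
```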

 

I ran a Sysinternals tool called RAMMap:

 

RAMMap showed that drivers were consuming 5.5GB of RAM!  On a VM, this is very unusual.  VMware Tools loads some drivers that are designed to help the hypervisor shuffle resources between VMs.  One of these is the balloon driver, which under normal conditions is used by the host to create an artificial RAM constraint at the guest level.  The balloon driver requests memory from the guest OS and pins it, so the OS sees that RAM as in use and unavailable; this forces the OS and its applications to give up unused memory (and stops user apps from trying to reclaim it).  In turn, the physical memory behind the balloon is released to the host to allocate to VMs that need more RAM.

At this point, I felt that the issue was at the VMware layer.

 

The Resource Allocation view provided a clue:

 

 

VMware’s performance overview showed another flat-line graph:

 

 

Using the advanced graph to show balloon usage for this VM confirmed my suspicion:

 

 

Another flat line.  The balloon driver was the culprit.  But it didn’t make sense.  The host is not memory-constrained, and other VMs are not hitting their memory limits. 

 

I investigated the VM’s configuration and found this:

 

Despite giving the VM 8GB of memory, it had been artificially constrained to only use 2GB.  In order to achieve this, the balloon driver kicked in, consuming the difference.  There is really no good reason for this, and my suspicion is that the VM was created from a template that had this constraint applied to it, and that whoever provisioned it did not remove the limit.  It has been running on effectively 2GB of RAM since.
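The numbers line up if you treat the limit as the real ceiling.  A minimal sketch of that arithmetic follows; the function names are mine, and the -1 convention for "unlimited" is the one PowerCLI reports in MemLimitMB.  The gap is an upper bound on what the hypervisor has to take back, so the balloon itself can be somewhat smaller (RAMMap showed about 5.5GB held by drivers).

```python
def effective_memory_mb(configured_mb, limit_mb):
    """Memory the host will actually back for the VM.

    limit_mb == -1 means "unlimited", i.e. no cap below configured memory.
    """
    if limit_mb == -1:
        return configured_mb
    return min(configured_mb, limit_mb)


def reclaim_gap_mb(configured_mb, limit_mb):
    """Memory the guest thinks it has but the host will not back."""
    return configured_mb - effective_memory_mb(configured_mb, limit_mb)


configured = 8 * 1024  # VM configured with 8GB
limit = 2 * 1024       # limit found in the VM's resource settings

print(effective_memory_mb(configured, limit))  # 2048
print(reclaim_gap_mb(configured, limit))       # 6144
print(reclaim_gap_mb(configured, -1))          # 0
```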

 

I checked the “Unlimited” checkbox (no outage required).  This removes the hard limit, and tells the balloon driver to release RAM to the OS.  I could see the driver immediately released some RAM to the OS, but it wasn’t the 5.5GB I was hoping for:

 

 

The process of releasing RAM from the balloon is a slow one.  Googling suggests it can take days to finally release all of the memory to the guest.  A restart of the VM might accelerate this.  We planned an outage to restart the VM, but as it turns out, at 0130 the next morning, the balloon was released entirely:

 


 

Next steps

If one VM has been misconfigured, it stands to reason there might be others.  I used a PowerCLI command to identify VMs that have memory limitations imposed on them:

Get-VM | Get-VMResourceConfiguration | where {$_.MemLimitMB -ne '-1'} | foreach {$_.VM.Name + " " + $_.VM.MemoryMB + " " + $_.MemLimitMB}

 

Server1 1024 1024
Server2 4096 4096
Server3 4000 4000
Server4 1024 1024
Server5 2048 2048
Server6 4096 2000
Server7 2048 2000
Server8 2048 2048
Server9 2048 2048
Server10 2048 2048
Server11 1024 1024
Server12 2048 2048
Server13 2048 2048
Server14 2048 2048
Server15 4096 4096
Server16 2000 2000
Server17 2048 2048

Why anyone would configure a VM with an amount of RAM, and then set a hard limit on that VM for the same amount of RAM is beyond me.  I also do not understand the logic of configuring RAM or RAM limits using values that do not fall on standard boundaries (e.g. 2000 instead of 2048).

We can see here that Server6 has been configured with 4GB of RAM, yet is only allowed to consume 2GB of that 4GB.  I would not be surprised if this VM is also experiencing low-memory conditions.
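Eyeballing the list works for 17 VMs but not for a few hundred.  Here is a rough sketch of how the PowerCLI output (name, configured MB, limit MB) could be filtered down to only the VMs whose limit is below their configured memory.  The helper names are hypothetical and the sample rows are a subset of the list above.

```python
def parse_line(line):
    """Split 'name configured_mb limit_mb'; the name may contain spaces."""
    name, configured, limit = line.rsplit(None, 2)
    return name, int(configured), int(limit)


def under_limited(lines):
    """Yield (name, configured_mb, limit_mb, shortfall_mb) for VMs whose
    limit is lower than their configured memory."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines in pasted output
        name, configured, limit = parse_line(line)
        if limit != -1 and limit < configured:
            yield name, configured, limit, configured - limit


# A few rows from the output above.
sample = """
Server1 1024 1024
Server6 4096 2000
Server7 2048 2000
"""

for row in under_limited(sample.splitlines()):
    print(row)
```

Run against the three sample rows, this flags Server6 (2096MB short) and Server7 (48MB short) and ignores Server1, whose limit matches its configured memory.  A malformed line would raise a ValueError, which is fine for a one-off check.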

 

Information on memory limits can also be exposed via the vSphere GUI (but this does not show the VM’s configured memory):

 

Performance counters for Server6 should be recorded and analysed.  If this VM is memory-bound, then the limit should be removed.  If it is not memory-bound, then it does not make sense to present it with 4GB of RAM when it only needs 2GB.

Broader next steps would be to review VM configurations across the infrastructure.  We should also consider taking baseline performance counters to assess if current workload requirements are being met.

 

All in all, this was a great learning exercise for me, and I feel that I a) accomplished something useful, and b) identified misconfigurations on other VMs before they became problematic.  But I'm also very disappointed that the people trusted with managing this infrastructure in the past really didn't do as good a job as they could've. 

VMware – I moved it? I copied it?

I was just talking with my mate Andrew about this today.  For those of you who have no clue WTF I'm talking about: when you open a VM from file in VMware Workstation and ESX/i, it prompts you with a strange little question.  Did you move the VM or copy it?  And what does it matter?

Rather than explain it here, here's an excellent article on the topic.  I stumbled across this while looking for information on SQL configurationfile.ini settings.

Fixing slow VMware Workstation shutdown

This happens to me all the time at work.  I've got half a dozen VMs open, then I realise it's going-home time and want to pack up and get out of there!  Even though I pause my VMs, and it looks like they're suspended, and I can close Workstation, it's still doing stuff in the background.  The disk churns like crazy.  It once took fifteen freaking minutes for it to be done doing whatever it was doing.

This hint from Bryon Brewer worked a treat!  Thank you, Bryon!

Online training

As the one or two regular readers of this blog would know, about six months ago I moved into a new role.  Instead of being a 3rd-level-tech-cum-team-leader-cum-project-manager-cum-network-admin-cum-you-name-it-I'm-it, I'm now a server engineer.  This is good for me in lots of ways, but my favourite thing about it is that I get to focus on just one thing instead of having to spread myself across so many disciplines.

It also means I'm getting exposed to a lot of technologies my former role didn't expose me to.  Rightly or wrongly, my workplace's IT infrastructure is largely outsourced, so I've never had the exposure to directly administering some of the fun parts of our infrastructure – eg Exchange, VMware, to a lesser extent, SQL.  That I'm getting exposed to it now is a good thing.  That I need to know it all now is not quite so good.  Of particular "joy" is the fact that I've had to take over a departed team mate's SQL responsibilities.  I've always had a bit of a hate-hate relationship with SQL.  I know just enough to get by, and possibly just enough to be dangerous.  But since nobody else wants it, and I'm "new" to the new IT organisation, I've been lumped with it.  So I am now something of a reluctant DBA.

I spent the first six months in this job hating/resenting having to deal with SQL on a daily basis.  Largely, I regard SQL as something that must be defeated.  They say it is best to truly know one's enemy.  So last week, I decided I had to do something about it.  I've decided I'm going to learn SQL.

Since BlueScope's training budget is zero, I knew I'd have to pony up the $$$ myself.  Classroom training for SQL is around the $5K mark, and I just don't have that sort of money to spend.  I started looking at online/CBT training.  Two companies stood out:  CBT Nuggets and TrainSignal.

TrainSignal's demo system lets you view a presentation.  In this case, it was an 18 minute segment on SQL.  It was well-presented and I found myself wanting to buy the course.  Around $600USD.  I noted they also have a MCITP:EA course, listed at $1000 or thereabouts.  So that's $1600 for Windows plus SQL.  Seems like a lot of dough.  The good news is that you can download the content to view it offline, which made it very appealing.

CBT Nuggets' demo, quite simply, sucks.  You get a TWO MINUTE taster of a presentation.  Hardly enough to get an idea of how good the product is.  Interestingly, however, they have a 24 hour subscription for $24.  And it gives you access to their whole library.  So I figured, what the hell, I've got a few spare bucks in my PayPal account, why not?  It was well worth it.  I got to see not only their SQL 2008 training, but their Windows, VMware, Exchange etc etc.  The thing about CBT Nuggets is that their content is not available for offline viewing.  It's a streaming model.  If you buy a course (eg SQL), you get access to it for four months.  Four months just isn't enough for me.  I'm not that motivated, and with a product I'm new to, I need to go over the material many times.  I had a look at their other courses.  Their MCITP:EA course is $1500.  The Exchange course is $600.  That's a lot of money.  And I know that the next twelve months will see me covering all sorts of topics, not just Windows or just SQL or whatever.  I saw then that they have a 12 month subscription for $2000.  It lets you access their entire library for 12 months.

Both providers offer access to Transcender exams, and pricewise, they're about the same (except for CBT's MCITP:EA course which is inexplicably $600 more than TS).  Offline access was really important to me, but so was having access to an entire library of training material.  As interested as I was in the TS products, their sales people just didn't impress me.  They weren't interested in modifying their packages or offering discounts unless I spent over $1300.  Even then, that would only get me a couple courses.

In the end, I decided a wide library was better than offline access.  I pulled out my credit card and signed up for the CBT Nuggets 12 month subscription.  I'm very glad I did.  I've gone through 3 SQL presentations now and have learned heaps.  It's already helped me answer some pressing work questions.

Emboldened by my experience with CBT, I started looking at Safari Books Online.  They also have a subscription model – one with limits (cheaper) and one that's unlimited access to their library for 12 months.  They're running a deal at the moment, where you can get the 12 month unlimited access deal for $399 USD.  I went for it.  Again, I'm glad I did.  They have an iPad app that lets me read their books on my iPad.  You can't read them offline using the iPad (or maybe you can, I haven't looked hard enough), but the point is you can read it on the go.  Eg if you're a commuter stuck on a train or bus or have time to kill or whatever.

So for $2400, I have 12 months access to a large CBT library and a huge textbook library.  That's half the cost of a single face-to-face training course!  My credit card is hurting, but I'm feeling much more positive about SQL and also about my ability to improve my skills at work.  I'd better get stuck back into study now.  I don't want to waste these subscriptions!

Revisiting VMware study

I'm off work today, sick.  Many thanks to the knob at the desk across the hallway from me who came to work sick and spent the whole day yesterday coughing and sneezing into the communal air.  Did I say thank you?  Actually, no, I meant FUCK YOU.  You turd.

Since I am at home, and have time to myself, I thought I'd do a bit of experimentation with Veeam, a backup product for virtualised environments.  Of course, this meant resurrecting my VMware study lab, which I am ashamed to say, has sat idle for the last.. ohhh… at least six months.  Give or take another six months.  In the process of setting up Veeam, I figured I'd try to reproduce the VM environment at work.  This involves configuring a NAS device and using it as a backup target.  Hmm.  I have a VM called "FreeNAS".  Funnily enough, it runs FreeNAS.  A NAS.  That's free.  I set about configuring Active Directory integration on the device and it was playing silly-buggers, so I figure OK, what the hell.. I'll just create a new one.  I downloaded the latest version of FreeNAS, quickly configured it and configured the AD integration.  Or so I thought.  It refused to work, giving me a vague "service couldn't start" message.  Super-helpful.  So I googled, and discovered this cautionary tale, which in turn pointed to this cautionary tale.

You can guess what happened to my domain controller VM, right?  Shame on the FreeNAS GUI designers who didn't think that "hostname" could be interpreted to mean "hostname of the DC" instead of "hostname of the freenas box".  So, with my AD hosed, and not enough care factor to try to fix it, I'm recreating my AD environment too.  Bugger.