Navigating your way around capacity management is not and easy task, especially at a large company where it seems almost impossible to get your arms wrapped around it. HA – I picture a large tree and trying to hug it, not quite able to lock your fingers on the other side! It’s really kind of like that. You got most of it, but you are always reaching. At times you need to step back and re-evaluate your angle or approach. Over the last year or so I’ve been working with the capacity management team to choose exactly the right metrics to determine the best way to evaluate capacity. Last week one cluster, according to vROPs, was in desperate need of capacity, we were running into our buffers; however when we looked closely in our review meeting we noticed that the reason we were out of capacity was due to CPU Demand. This spun off a number of weekly meetings to re consider our approach or angle to see if we can get our fingers locked. In all honesty, this wasn’t an oversight, we have a pretty smart group of people and we meet regularly to review. Everyone on our team has the same goal and these types of discussions make sure we are staying on target; however we did realize that we needed a deeper understanding of the different types of capacity models and how to apply them as policies across the virtual infrastructure. So let’s start with a quick level set and go from there. All right, here we go!
This model is capacity based on the configured amount of resources assigned to a VM or VMs in a cluster. The consensus is that this model should be used for production environments where you have important workloads, and you want to be able to keep resources for fail-over, and you want to make sure you don’t over commit by too much. You decide your over commitment ratio and set that in the policy. This is the most conservative capacity model.
The Demand model is often used in Test/Development environments where you don’t necessarily care about over allocation, and you really want to get as many guests as possible in the environment. If you are using this model you probably don’t care if the hosts are running hot. You will likely be way over allocated but again you don’t care because you want to run this for highest possible VM density.
Memory Consumed model
This model allows you to see the memory resources used just like you would in the vSphere client. It shows the active memory, plus shared memory pages, plus recently touched memory. All the memory overhead.
So which one do we choose? That’s an excellent question. In all likely hood, we are going to look at all these models and how they affect capacity. We have, and I’m guessing you do too, clusters with mixed workloads or due to licensing considerations clusters where you have to mix test/dev hosts with production hosts. So its not so easy to just pick one or the other and go with it, especially when you have to scale up the environment to meet the needs of the company. Our team decided to start to implement different policies specific to the cluster and workloads in those clusters. The polices will include different allocation over-commit ratios for CPU/Memory and Disk. Some policies will account for all three models others will just be one or a combination. What’s really great is vRealize Operations is so flexible its really easy to dial in capacity just the way you want it. One other decision we made that you might want to consider is that we will only rely on the data in vROPs for capacity management. We wont look at what vCenter is showing for cluster resources used to determine if we can “fit” more VMs in. Capacity management is not easy, it takes time to collect metric data, analyze it and then tweak it so you are sure you can make the best decisions. Sometimes those decisions can save (or cost) your company a significant amount of money. The good news is there is no magic going on there. If you put in the work and use a great tool like vRealize Operation Manager you will get to a point where real value will be realized with vROPs. Now that our team has determined to use a combination of models, we can then begin to adjust policies and review data that’s already been collected to make sure we are using metrics that meet our needs. I’d love to hear how others are using vROPs to determine capacity and some of the challenges and success you have encountered. If you read this and want to share, add a comment.
I’d like to thank Hicham Mourad for his help with some questions and his guidance along the way. He is a really smart guy, and Im thankful I can reach out to him when I need to. 🙂
It Seems like the default policy considers both allocation and demand. but when it comes to environment analysis I believe it is giving the capacity remaining and time remaining based on allocations