Capacity Management in a Cloud Environment

 


Like many in 2020, I found myself working exclusively from home and having a bit more time on my hands in the evenings. I was much more fortunate than many as I had my whole family at home, and we were all working or studying remotely. I decided that I wanted to make something of the time in isolation (those of us trained in physics may remember Isaac Newton’s productive period in 1665 when he left London for the countryside during a plague outbreak). My team does database cloud capacity management in a converged, VMware environment. My theory was that the public cloud providers faced capacity management issues similar to the ones that we were solving. So our team attended the various online vendor conferences and took a lot of online training (one good thing that came out of 2020) to see if the larger cloud world was experiencing similar problems and what we might be able to learn to make our analyses and processes better.

After comparing what cloud providers were seeing with their customers against our own experience, I came to believe that the challenges of capacity management are similar even if the terms used are different. What a private cloud team might describe as running short on capacity on a particular machine, a public cloud customer might describe as cost creep. The key themes that I found for all cloud environments were:

1. The need to build a model of capacity that all stakeholders can understand from their perspective.

2. Application teams may not really know what they need when they provision capacity.

3. Demanding applications must be dealt with differently.

4. Cleanup does not naturally happen. Wasted capacity does.

5. There are many different opinions on what IAAS, PAAS and others really provide.

The first common theme was the need to build a model of capacity that all stakeholders can understand from their perspective. Whether the reviewer is a finance person or an application system manager, handing them a list of 150 servers or 212 containers for a usage review usually does not produce effective results. Why? I believe that very few people can look at a list of host names or container names (or even service instances for that matter) and translate it into what those resources are doing for them. Our own first attempt was exactly that kind of list, and it did not work. So we enhanced the capacity model, starting from the list of servers and containers and merging in data from the CMDB and from the database and operating system monitoring tools. Basically, we looked everywhere we could find information that would put a resource on the network into context for someone who needed to review capacity usage. When speaking to application teams, it helps to identify which databases are on those servers, the database version in use (so they can see which ones are the old ones that they migrated off of to meet risk requirements), each database's role, and so on. When speaking to people involved with costs, you start with the resources that generate the bill (disk, CPU, memory, etc.) and then map those to the individual application teams involved, along with the versions used. In both cases, each group saw what they were concerned about and could map it to an application or user community. That helped them evaluate whether a resource was still needed and also see where they might have to make changes, such as migrating off older versions of the software (which may be another report that they are getting from risk or architecture). It comes down to a model that speaks everyone's language and a tailored set of reports that helps each group understand what they have, rather than an itemized bill that speaks only of resources used.
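As a rough illustration of the idea above, the sketch below joins a bare host list with CMDB and monitoring data and then rolls it up by application. Every host name, field name, and data value here is invented for the example; a real model would pull these from whatever CMDB and monitoring tools are in place, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw inventory: the kind of bare list nobody can review.
inventory = [
    {"host": "dbsrv001", "cpus": 8, "memory_gb": 64},
    {"host": "dbsrv002", "cpus": 16, "memory_gb": 128},
    {"host": "dbsrv003", "cpus": 4, "memory_gb": 32},
]

# Hypothetical context keyed by host name (assumed CMDB and monitoring extracts).
cmdb = {
    "dbsrv001": {"application": "Payments", "role": "production"},
    "dbsrv002": {"application": "Payments", "role": "test"},
}
monitoring = {
    "dbsrv001": {"db_version": "v19"},
    "dbsrv002": {"db_version": "v12"},
    "dbsrv003": {"db_version": "v12"},
}


def build_capacity_model(inventory, cmdb, monitoring):
    """Attach application, role, and database version to each inventory row."""
    model = []
    for row in inventory:
        host = row["host"]
        # Hosts missing from a source stay visible as UNKNOWN instead of vanishing.
        context = {"application": "UNKNOWN", "role": "UNKNOWN", "db_version": "UNKNOWN"}
        context.update(cmdb.get(host, {}))
        context.update(monitoring.get(host, {}))
        model.append({**row, **context})
    return model


def summarize_by_application(model):
    """Roll resources up to the application level for a usage review."""
    summary = defaultdict(
        lambda: {"hosts": 0, "cpus": 0, "memory_gb": 0, "db_versions": set()}
    )
    for row in model:
        entry = summary[row["application"]]
        entry["hosts"] += 1
        entry["cpus"] += row["cpus"]
        entry["memory_gb"] += row["memory_gb"]
        entry["db_versions"].add(row["db_version"])
    return dict(summary)


model = build_capacity_model(inventory, cmdb, monitoring)
summary = summarize_by_application(model)
for app, entry in sorted(summary.items()):
    print(app, entry["hosts"], "hosts,", entry["cpus"], "CPUs,",
          entry["memory_gb"], "GB, versions:", sorted(entry["db_versions"]))
```

The same merged rows could feed a cost view (grouped by billed resource) or an application view (grouped by application and version). The UNKNOWN bucket is deliberate: resources that nobody can place are often the first ones worth asking about.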

The next theme that I found was that application teams may not know what they really need when first moving to a cloud environment or building a new application. They are typically good at coming up with wonderful features and ideas to impress users, but when you ask them how many CPUs or how much memory they need, you are often speaking a foreign language to them. Many will go to their vendors, who have a real incentive to ask for a Porsche to run their product and make it look good, while the infrastructure world is under pressure to deliver the big savings and high utilization that were advertised for the cloud environment. There is often a long series of meetings with a lot of jockeying before a configuration is finally selected. The challenge here is that there are a lot of assumptions (or even guesses) about how well the application will be accepted (most have heard of things going viral on the internet) and about what features people might think of next. This brings us back to two goals that compete in the same place: high utilization to realize cost savings, and the ability to scale up quickly. Often the result is that an application must move to a different environment to keep up with its performance needs, which takes time and energy from both the application team and the infrastructure teams.

An assumption that many teams make is that one architecture will fit all applications. This may work in a world with only a handful of similar applications, but most larger companies have a broad portfolio that usually follows the 80/20 or 90/10 rule. Typically, only a handful of applications drive the business, have large user bases, or need maximum performance. So, while most applications will fit into the cost-effective, high-density environment designed for the masses, it is important to have a higher performance environment or specialty options available rather than a single one-size-fits-all solution.

The next theme is unfortunately that cleanup does not happen naturally, but waste does. In public clouds this is often referred to as cost increases; in the private cloud it often shows up as capacity shortages or unexpected growth. In most cases, developers are allowed to provision systems through automation for their tasks, but no one is watching to clean things up when they are no longer needed (perhaps that goes back to management not being able to understand what is out there in terms that they can relate to). So when a team completes a special development project that it requested resources for, or moves to the next version of the database, web server or operating system to meet standards from architecture or risk, no one wants to get rid of the old resources right away (perhaps they want to see if the new ones really work). If this is not watched over, the more years you spend in the cloud, the more old junk you accumulate, much like a garage or attic. The key here is to show the people responsible for paying the bills or justifying their use of private cloud resources what they are using in terms that they relate to, so they can make the right decisions.
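One way to keep the attic from filling up is a periodic report of cleanup candidates sent to each owner. The sketch below is only an illustration under assumed inputs: the resource records, the `last_active` and `superseded_by` fields, and the 90-day idle threshold are all hypothetical choices, not values from any particular tool.

```python
from datetime import date, timedelta

# Hypothetical resource records; in practice these would come from the capacity model.
resources = [
    {"name": "pay-db-v12", "owner": "Payments", "last_active": date(2021, 1, 10),
     "superseded_by": "pay-db-v19"},
    {"name": "pay-db-v19", "owner": "Payments", "last_active": date(2021, 1, 14),
     "superseded_by": None},
    {"name": "poc-analytics", "owner": "Research", "last_active": date(2020, 6, 1),
     "superseded_by": None},
]


def find_cleanup_candidates(resources, today, idle_days=90):
    """Flag resources that look idle or that have a known replacement."""
    cutoff = today - timedelta(days=idle_days)
    candidates = []
    for res in resources:
        reasons = []
        if res["last_active"] < cutoff:
            reasons.append("idle since " + res["last_active"].isoformat())
        if res.get("superseded_by"):
            reasons.append("replaced by " + res["superseded_by"])
        if reasons:
            candidates.append(
                {"name": res["name"], "owner": res["owner"], "reasons": reasons}
            )
    return candidates


# A fixed review date keeps the example reproducible.
candidates = find_cleanup_candidates(resources, today=date(2021, 1, 15))
for cand in candidates:
    print(cand["owner"], cand["name"], "; ".join(cand["reasons"]))
```

The report only nominates candidates and states a reason in the owner's terms; the decision to retire something stays with the people who pay for it or justify it.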

The final theme that I found is that there are many different opinions on what IAAS, PAAS and the flock of *AAS cloud acronyms really provide. Application teams tend to read the many glowing articles on what can be done in the cloud, and they assume that they somehow get all of these wonderful possibilities when they move to the cloud. In my experience, you do not get what you want; instead, you get what you architect and design into the system. Everything from backups that meet your corporate requirements to failover automation to firewall security needs to be planned and implemented using the appropriate vendor's tools, as none of it comes magically out of a box. Most cloud providers offer a myriad of options and possibilities for operating systems, disk speeds, supported applications and even settings. The challenge is taking the many thousands of options and turning them into a list of possible configurations that meet your requirements and work well with the vendor's environment.

Finally, one thing that I did find unique to building your own private cloud is the battle with the converged infrastructure teams over the use of overcommit. Many moves to a cloud environment start when some big name consultants come in and tell your board of directors how incredibly low the utilization in your legacy data centers is and how much company capital IT is wasting. Sensitive to this, the on-premises cloud vendors have found ways to promise the same resource (be it CPU, memory or IO bandwidth) to multiple applications or virtual machines at the same time. The thought is that it is statistically improbable that the applications sharing these resources will all use them at the same time. This is usually a good bet for things such as web servers, which respond quickly to requests arriving from the web over the course of a day. It is not such a good bet for database servers, where some queries may take seconds to process, or for applications that tend to have sharp spikes during the day (everyone logs in at 9 am, or everyone wants to check their balances while shopping during the holidays). For those, the assumption that the load will spread itself out evenly over the day is invalid. The difficulty is that if everyone puts CPU or memory demands on the system at once, it starts swapping and spends its time moving processes in and out of memory, or the processes back up on one another, holding locks on rows of data and stacking up work to the point where the system cannot keep up. So this is a case in which you should profile each application or product (databases, for example, often allocate large memory areas at startup and do not release them, making swap more likely if memory is overcommitted) and make the correct decision for that application, as opposed to relying on vendors' general guidelines of what is possible based on their lab environments.
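The arithmetic behind that bet can be sketched with a few lines of code. The example below uses made-up numbers: a hypothetical host with 16 physical cores, three virtual machines allocated 8 vCPUs each, and four demand samples per machine. It compares the overcommit ratio with the peak coincident demand, which is the figure that decides whether the bet fails.

```python
def overcommit_ratio(allocated_cpus, physical_cpus):
    """Total CPUs promised to guests divided by CPUs that physically exist."""
    return sum(allocated_cpus) / physical_cpus


def peak_coincident_demand(demand_by_vm):
    """Sum demand across VMs per sample; return (peak total, sample index)."""
    totals = [sum(samples) for samples in zip(*demand_by_vm.values())]
    peak = max(totals)
    return peak, totals.index(peak)


# Hypothetical host and guests; all figures are invented for illustration.
physical_cpus = 16
allocated = {"db_a": 8, "db_b": 8, "web": 8}

# CPUs in use at four sample times; the third sample is the 9 am login spike.
demand_by_vm = {
    "db_a": [1, 1, 8, 2],
    "db_b": [1, 1, 7, 2],
    "web": [2, 2, 3, 2],
}

ratio = overcommit_ratio(allocated.values(), physical_cpus)
peak, peak_index = peak_coincident_demand(demand_by_vm)
sample_count = len(next(iter(demand_by_vm.values())))
average = sum(sum(samples) for samples in demand_by_vm.values()) / sample_count

print("overcommit ratio:", ratio)
print("average demand:", average, "of", physical_cpus, "CPUs")
print("peak demand:", peak, "CPUs at sample", peak_index)
print("oversubscribed at peak:", peak > physical_cpus)
```

With these numbers the average demand is half of the physical capacity, which looks comfortable, yet the two databases spike together and the peak exceeds what the host can deliver. A real profile would use many more samples and would cover memory and IO as well as CPU.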

These are some of the main themes that I have seen in both the public and private cloud worlds from a capacity perspective. I believe each environment needs to be studied and modeled (yes, we database types love to build a database for everything) so that you can run analyses to see where your capacity issues lie. Note that capacity issues really take two forms. The first is performance, where you find that a given application is too much for its current location (it needs to be moved to a better environment or perhaps is a candidate for physical servers in the cloud). The second is overall capacity management (the numbers that ensure you can provide enough resources, both today and in the future, to a given container or virtual machine). It is a never-ending series of analyses since, once you solve one problem, there is another to address. The model helps you identify the problems, after which you can use the tools of your environment (moving containers, migrating to new containers or perhaps moving to a new architecture) to ensure that the environment is ready for tomorrow.

 
