Machine Learning Challenge: Rightsize in 14 Days
In recent years, the classic mathematics-based models behind optimization recommendations have been augmented with machine learning. This has significantly improved the scope and accuracy of existing tools' recommendations and enabled the creation of new ones.
Having reviewed the ML-based rightsizing offerings of the top cloud providers, the Cloud Competency team from Ukraine (Rostyslav Myronenko, Yevhen Nadin, Oleksandr Onsha, Bohdan Onsha) took on the challenge of creating their own POC. We claimed we would deliver the initial version of a cloud-agnostic, ML-based rightsizing solution in the shortest possible time: just two weeks.
Here is what we got as a result!
Background and Investigation
Still, the cloud providers' native rightsizing and optimization tools have several limitations. For the top providers on the EPAM list (AWS, Microsoft Azure, GCP), the main ones are:
- These tools are not applicable for multi-cloud environments
- Most of these tools work with fixed, non-customizable thresholds and timeframes, which makes it impossible to fine-tune the analytics for specific workloads
- The analysis is based mainly on performance data from the same service (for example, compute instance optimization relies only on instance performance metrics)
Kick-Off and Success Criteria
Effective rightsizing and optimization is a multi-step, multi-layered task. It needs to go beyond the standard focus on the performance of individual resources and take into account parameters at the account, region, and whole-cloud levels. This is important because the performance and value of specific resources are often affected not only by their own load and lifecycles, but also by the load of neighboring services, the number of users, and the specifics of the timeframe. Machine learning is well suited to detecting all such dependencies and building experience-based recommendations for such infrastructures.
We defined the following success criteria:
- The data for rightsizing includes:
- the standard instance performance metrics (CPU, Disk operations, Network)
- instance series (types)
- instance timelines
- The rightsizing suggestions used to train the ML models are generated based on typical Maestro user flows
- The rightsizing suggestions should include recommendations on:
- Scaling up and down
- Stopping
- Scheduling
The Approach
Pandas is a powerful Python-based data analysis and manipulation tool that can fairly be called an industry standard. We used it for the initial data set processing: finding min, max, median, and average values, detecting dispersion, grouping, filtering, etc. It also supports working with time series, and time is one of the core dimensions of resource lifecycle analysis.
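The kind of initial processing described above can be sketched with Pandas as follows; the column names and sample values here are illustrative, not the POC's actual data:

```python
import pandas as pd

# Hypothetical raw metrics: one row per instance per timepoint.
df = pd.DataFrame({
    "instance_id": ["i-1", "i-1", "i-1", "i-2", "i-2", "i-2"],
    "timestamp": pd.to_datetime([
        "2021-06-01 00:00", "2021-06-01 01:00", "2021-06-01 02:00",
        "2021-06-01 00:00", "2021-06-01 01:00", "2021-06-01 02:00",
    ]),
    "cpu_pct": [12.0, 85.0, 90.0, 3.0, 2.5, 4.0],
})

# Per-instance summary statistics: min, max, median, mean, dispersion.
summary = df.groupby("instance_id")["cpu_pct"].agg(
    ["min", "max", "median", "mean", "std"]
)
print(summary)

# Time-based grouping: hourly averages across all instances.
hourly = df.set_index("timestamp").resample("h")["cpu_pct"].mean()
print(hourly)
```

The same pattern scales to the other metrics (disk operations, network) by adding columns and aggregating them in the same `groupby`/`resample` calls.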
After the analysis, we got data structured by core parameters and split by time and values:
This is how the data is passed to the ML models, which are based on the Gradient Boosting approach. Briefly, it builds predictions from ensembles of decision trees. A tree goes from branches (the observed facts about the items) to leaves (the conclusions about the items' values). A very simplified gradient-boosted tree for our POC would look like this:
For each data set, numerous trees are created in order to find the most relevant dependencies and conclusions.
The ML models build these trees automatically to find the dependencies and output predictions as probability scores between 0 (“absolutely impossible”) and 1 (“will definitely happen”).
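To sketch this prediction step, here is how a gradient-boosting classifier produces such 0-to-1 scores. We use scikit-learn's `GradientBoostingClassifier` with synthetic features and labels as a stand-in for the POC's actual model and training data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)

# Hypothetical features: [avg CPU %, network KB/s, hour of day].
X = rng.random((200, 3)) * [100.0, 500.0, 24.0]
# Synthetic label: the instance was shut down when CPU was low
# and the hour was in the second half of the day.
y = ((X[:, 0] < 30) & (X[:, 2] > 12)).astype(int)

model = GradientBoostingClassifier(n_estimators=50).fit(X, y)

# predict_proba returns a score between 0 ("impossible") and 1 ("certain").
p_shutdown = model.predict_proba([[5.0, 10.0, 23.0]])[0, 1]
print(f"shutdown probability: {p_shutdown:.2f}")
```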
To illustrate the approach application, here is a general flow of analysis for scheduling calculation:
- Collect instance metrics for several days (4 in our POC). We used data for Azure instances collected with standard Azure tools
- Sort the data by days and metrics (Pandas)
- Get the shutdown probability for each timepoint (ML)
- Aggregate the timepoints into periods, using a specific probability threshold (we used 0.2) (Pandas)
- Create scheduling recommendations for the highlighted periods (Pandas)
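The last two steps above can be sketched in Pandas roughly like this, assuming the ML stage has already produced hourly shutdown probabilities (the values below are made up) and that a timepoint joins a shutdown period when its probability exceeds the 0.2 threshold:

```python
import pandas as pd

# Hourly shutdown probabilities for one instance (hypothetical ML output).
probs = pd.Series(
    [0.05, 0.1, 0.6, 0.7, 0.65, 0.1, 0.05, 0.8],
    index=pd.date_range("2021-06-01 00:00", periods=8, freq="h"),
)

THRESHOLD = 0.2  # the threshold used in the POC

# Mark timepoints above the threshold, then label consecutive runs.
off = probs > THRESHOLD
run_id = (off != off.shift()).cumsum()

# Collapse each run of above-threshold timepoints into a (start, end) period.
periods = []
for _, run in probs[off].groupby(run_id[off]):
    periods.append((run.index[0], run.index[-1]))

for start, end in periods:
    print(f"recommend shutdown from {start} to {end}")
```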
Implementation
We decided to build our rightsizing mechanism on AWS (not only because our Architect is an AWS adherent, but also to demonstrate the “cloud agnosticism” of the solution). The standard AWS services were assembled into the following architecture:
The instance data is stored in an S3 bucket and processed by Python code (using the Pandas library) deployed as AWS Lambda functions. The ML models are stored and trained in AWS SageMaker, which is invoked by the Lambda functions and returns the new values to them.
Once the recommendations are acquired, they are rendered in a human-readable format on a dedicated web UI:
To Sum Up
Two weeks after starting our POC, we already have a mechanism that:
- Identifies the necessity of instance shutdown, scale-up, and scale-down
- Creates recommendations on possible schedules that would optimize instance utilization
- Can be used for resources hosted with different cloud providers
- Can be customized so that the applied thresholds correspond to the specific infrastructure's purposes
- Can be integrated into the Maestro UI and reporting, or into any other tool whose API allows it.
This is only a small part of the analytics that can be done to optimize and rightsize virtual infrastructures. Despite strict deadlines and limited capacity, and while working on top of our standard product development tasks, we got a very promising result and a good start with Machine Learning for effective cloud management.
We are going to proceed, of course.
Keep watching ;)