Machine Learning Challenge: Rightsize in 14 Days

In recent years, the classic mathematics-based models used to create optimization recommendations have been enhanced with Machine Learning. This has significantly improved the scope and accuracy of existing tools' recommendations and led to the creation of new ones.

Having reviewed the ML-based rightsizing offerings of the top cloud providers, the Cloud Competency team from Ukraine (Rostyslav Myronenko, Yevhen Nadin, Oleksandr Onsha, Bohdan Onsha) took on the challenge of creating their own POC. We claimed we would deliver the initial version of a cloud-agnostic, ML-based rightsizing solution in the shortest possible time: just two weeks.

Here is what we got as a result!

Background and Investigation

Modern cloud management needs to cover a wide range of tasks and expectations from businesses and users of all kinds. Resource and cost optimization, effective performance, reliability, and fault tolerance are among them. Naturally, every cloud provider offers its own tools to help users fine-tune their infrastructures and react properly.

However, there are several limitations that cloud providers' native rightsizing and optimization tools struggle with. For the top providers on the EPAM list (AWS, Microsoft Azure, GCP), the main ones are:
  • These tools are not applicable to multi-cloud environments
  • Most of them work with non-customizable thresholds and timeframes, which prevents fine-tuning the analytics for specific workloads
  • The analysis is performed mostly on performance data within the same service (for example, compute instance optimization is based only on instance performance metrics)
We, the Maestro team from Ukraine (Rostyslav Myronenko, Yevhen Nadin, Oleksandr Onsha, Bohdan Onsha, Svitlana Podkopaieva), ran a market investigation and concluded that the perfect tool is not out there. So, we set out to build it ourselves. However, we still had our standard Maestro roadmap, an approved yearly budget, and other constraints. So, we made a bet: create an ML-based rightsizing POC within a two-week sprint, with a team of one solution architect and three developers.


Kick-Off and Success Criteria

Effective rightsizing and optimization is a multi-step, multi-layered task. It needs to go beyond the standard focus on the performance of individual resources and take into account parameters at the account, region, and whole-cloud levels. This matters because the performance and value of a specific resource are often affected not only by its own load and lifecycle, but also by the load of neighboring services, the number of users, and the specifics of the timeframe. Machine learning is a perfect fit for detecting all these kinds of dependencies and building experience-based recommendations for such infrastructures.

The amount of data used in the ML processes, and the scope of the recommendations received, ultimately depends on the time and effort you are ready to spend to get the first results. Given our available capacity and strict deadlines, we highlighted the following key points for our POC:
  • The data for rightsizing includes:
    • the standard instance performance metrics (CPU, Disk operations, Network)
    • instance series (types)
    • instance timelines
  • The rightsizing suggestions used to train the ML models are generated based on typical Maestro user flows
  • The rightsizing suggestion should include recommendations on:
    • Scaling up and down
    • Stopping
    • Scheduling

 


The Approach

The main tools we selected were the Pandas library for data analysis and Gradient Boosting, powered by the XGBoost library, for machine learning.

Pandas is a powerful Python-based data analysis and manipulation tool that can fairly be called an industry standard. We used it for the initial data set processing: finding min, max, median, and average values, detecting dispersions, grouping, filtering, etc. It also supports time series, and time is one of the core dimensions of resource lifecycle analysis.
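For example, the kind of per-instance aggregation described above takes only a few lines of Pandas. The column names below are illustrative, not the actual Maestro schema:

```python
import pandas as pd

# Illustrative raw metrics: one CPU reading per instance per timestamp.
raw = pd.DataFrame({
    "instance_id": ["i-1", "i-1", "i-1", "i-2", "i-2", "i-2"],
    "timestamp": pd.to_datetime([
        "2023-05-01 00:00", "2023-05-01 01:00", "2023-05-01 02:00",
        "2023-05-01 00:00", "2023-05-01 01:00", "2023-05-01 02:00",
    ]),
    "cpu_pct": [3.0, 5.0, 4.0, 80.0, 95.0, 85.0],
})

# Per-instance summary: min, max, median, mean -- the same kind of
# aggregation used as the first data-set processing step.
summary = raw.groupby("instance_id")["cpu_pct"].agg(["min", "max", "median", "mean"])
print(summary)
```

The same `groupby`/`agg` pattern extends naturally to grouping by hour or day when the timestamp column is used as an index.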

After the analysis, we got data structured by core parameters and split by time and values:

 

(Chart: histogram of the structured metric data, split by time and values)

This structured data is then passed to the ML models, which are based on the Gradient Boosting approach. Briefly, Gradient Boosting builds predictions from decision trees. A tree leads from its branches (the observed facts about the items) to its leaves (the conclusions about the items' values). A very simplified gradient boosted tree for our POC would look like this:

 


For each data set, numerous trees are created in order to find the most relevant dependencies and conclusions.

The ML models automatically build these trees to find the dependencies and return predictions as probability scores between 0 (“absolutely impossible”) and 1 (“will definitely happen”).

The resulting data can be again processed with Pandas to get a new set of conclusions.

To illustrate the approach in action, here is the general flow of analysis for scheduling calculation:
  • Collect instance metrics for several days (4 in our POC). We used data for Azure instances collected with standard Azure tools
  • Sort the data by days and metrics (Pandas)
  • Get the shutdown probability for each timepoint (ML)
  • Aggregate the timepoints into periods, using a specific probability threshold (we used 0.2) (Pandas)
  • Create scheduling recommendations for the highlighted periods (Pandas)
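The aggregation step above (turning per-timepoint probabilities into contiguous periods using the 0.2 threshold) can be sketched in Pandas like this; the probability values are invented for the example:

```python
import pandas as pd

# Hourly shutdown probabilities for one instance, as returned by the ML step.
probs = pd.Series(
    [0.05, 0.1, 0.9, 0.85, 0.95, 0.1, 0.05, 0.8],
    index=pd.date_range("2023-05-01", periods=8, freq="h"),
)

THRESHOLD = 0.2  # the threshold we used in the POC
can_stop = probs > THRESHOLD

# Assign a group id that increments every time the flag flips,
# then keep only the contiguous runs where shutdown is recommended.
groups = (can_stop != can_stop.shift()).cumsum()
periods = [
    (run.index[0], run.index[-1])
    for _, run in can_stop.groupby(groups)
    if run.iloc[0]  # keep only the "can stop" runs
]
for start, end in periods:
    print(f"recommend stop: {start} .. {end}")
```

The resulting `(start, end)` pairs become the scheduling recommendations for the highlighted periods.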

Implementation

We decided to build our rightsizing mechanism on AWS (not only because our Architect is an AWS adherent, but also to demonstrate the cloud agnosticism of the solution). Standard AWS services were combined into the following architecture:

(Diagram: POC architecture built on standard AWS services)


The instance data is stored in an S3 bucket and processed by Python code (using the Pandas library) deployed as AWS Lambda functions. The ML models are trained and hosted in Amazon SageMaker, which is invoked by the Lambda functions and returns the predictions to them.
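A minimal sketch of the Lambda-to-SageMaker hop, assuming a tabular endpoint that accepts CSV input; the endpoint name and payload layout are placeholders, not the actual POC configuration:

```python
import csv
import io

def metrics_to_csv(rows):
    """Serialize metric rows into the CSV body a tabular SageMaker
    endpoint typically accepts."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

def get_shutdown_scores(rows, endpoint_name="rightsizing-poc"):
    """Invoke the SageMaker endpoint from inside a Lambda function.

    The endpoint name is a placeholder; boto3 is imported lazily so the
    serialization helper above stays usable without AWS credentials.
    """
    import boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=metrics_to_csv(rows),
    )
    # The response body carries the probability scores for each input row.
    return response["Body"].read().decode("utf-8")
```

In the real deployment, IAM permissions for `sagemaker:InvokeEndpoint` are attached to the Lambda execution role.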

Once the recommendations are acquired, they are rendered in a human-readable format in a dedicated web UI:

(Screenshot: the recommendations table in the web UI)

To Sum Up

Two weeks after starting our POC, we already have a mechanism that:
  • Identifies when instances should be shut down, scaled up, or scaled down
  • Creates recommendations on possible schedules that would optimize instance utilization
  • Can be used for resources hosted with different cloud providers
  • Can be customized, so that the applied thresholds correspond to the purposes of a specific infrastructure
  • Can be integrated into Maestro UI and reporting, or into any other tool whose API allows it

This is only a small part of the analytics that can be done to optimize and rightsize virtual infrastructures. Despite strict deadlines and limited capacity, and while working on top of our standard product development tasks, we got a very promising result and a good start for Machine Learning in effective cloud management.

We are going to proceed, of course.

Keep watching ;)
