AI/ML usage for SRE Journey
- rohitsinhalala
- Jun 18, 2021
Updated: Jun 18, 2022
When it comes to setting up AI/ML, the SRE journey can be defined by the following:
- Identification of automation and optimization use cases
- Toil reduction
- Simpler log-based analysis for better decision making
- Visibility Engineering --> the effective way of doing things
- Proper use of SREs' time
One key aspect that is often overlooked is the overloading of SREs. Since SRE is the new, unique thing coming in, organizations tend to overload their SREs in general.
Most of the time, data is simply handed to the SREs to analyze. There have been instances where SREs are given tons of data and left on their own to find a needle in a haystack.
This is exactly where AI/ML and predictive analysis can help. The two primary areas are the following.
Firstly, analysis of the areas where automation can be done and/or where existing automation needs to improve. There are instances where automation is done just for the sake of it. General ticket/incident analysis is done by every organization, but very seldom do we see an organization investing in an AI-based solution, experimenting with it, and using it to really change its ways of working and its processes. A very simple example is automatic ticket triaging: an ML-based solution reads through each ticket and works out the technical error it points to. Once something like that is in place, problem analysis becomes more real-time and effective. SREs can just look at the result and take decisions, with trends and patterns available at their fingertips.
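To make the ticket triaging idea concrete, here is a minimal sketch in Python. It assumes a small set of historical tickets that have already been labelled with a technical category, and it uses a tiny Naive Bayes text classifier written with the standard library only. The class name `TicketTriager`, the categories, and the sample tickets are all made up for illustration; a real setup would likely train a proper ML library on far more data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a ticket and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TicketTriager:
    """Tiny multinomial Naive Bayes classifier with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of tickets seen
        self.vocab = set()

    def train(self, labelled_tickets):
        """Learn word frequencies from (ticket text, category) pairs."""
        for text, category in labelled_tickets:
            tokens = tokenize(text)
            self.word_counts[category].update(tokens)
            self.doc_counts[category] += 1
            self.vocab.update(tokens)

    def predict(self, text):
        """Return the category with the highest log-probability for this ticket."""
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # prior for the category
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_category, best_score = category, score
        return best_category


# Hypothetical, hand-labelled historical tickets.
HISTORY = [
    ("Database connection timeout on orders service", "database"),
    ("Deadlock detected in database while running query", "database"),
    ("Disk usage at 95 percent on app server", "storage"),
    ("No space left on device when writing logs to disk", "storage"),
    ("SSL certificate expired for public API gateway", "certificate"),
    ("TLS handshake failed, certificate not trusted", "certificate"),
]

triager = TicketTriager()
triager.train(HISTORY)
print(triager.predict("Query failed: database connection refused"))
```

Counting the predicted categories over a week of tickets is then enough to surface the trends and patterns mentioned above.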
Secondly, the underutilization of log analytics. Logs are the source of all data analysis, and nothing is better at providing the truth. Even incidents today are raised from static thresholds. If a solution is put in place with a dynamic, AI-based threshold that monitors the system and alerts when needed (or even takes action), that would be a solution for the future. It is effective for API-based and container-based monitoring as well as for infrastructure and application monitoring. Automation is only the end product; the real crux lies in the analysis and in how quickly that analysis is done. Log-based analysis, coupled with AI-based algorithms, can make prediction a reality. Gone then are the days when we would require incident-based ITIL processes. Prediction would change operations forever, and SREs would be at the forefront of this transformation, since it benefits service reliability and the automation of toil scenarios the most.
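As a rough illustration of the difference between a static and a dynamic threshold, the sketch below learns its alert level from recent history instead of using a fixed number. It is a plain statistical baseline (rolling mean plus `k` standard deviations), not an AI model, and it only stands in for whatever algorithm a real solution would use. The class name `DynamicThreshold`, its parameters, and the sample error counts are assumptions made up for this example.

```python
from collections import deque
from statistics import mean, stdev


class DynamicThreshold:
    """Flags a value that sits far above the recent rolling baseline (illustrative only)."""

    def __init__(self, window=30, k=3.0, min_samples=5):
        self.values = deque(maxlen=window)  # recent "normal" observations
        self.k = k                          # how many standard deviations count as abnormal
        self.min_samples = min_samples      # wait for some history before alerting

    def observe(self, value):
        """Return True if value breaches the threshold learned from recent history."""
        is_anomaly = False
        if len(self.values) >= self.min_samples:
            threshold = mean(self.values) + self.k * stdev(self.values)
            is_anomaly = value > threshold
        if not is_anomaly:
            # Only normal values update the baseline, so one spike does not raise the bar.
            self.values.append(value)
        return is_anomaly


# Hypothetical error counts per minute, parsed from application logs.
error_counts = [4, 5, 6, 5, 4, 5, 6, 5, 40, 5]

detector = DynamicThreshold()
alerts = [minute for minute, count in enumerate(error_counts) if detector.observe(count)]
print(alerts)  # → [8]
```

The same detector would adapt on its own if normal traffic slowly grew, which is exactly what a static threshold cannot do.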

The SREs then have the prediction, the trends and patterns, and the raw data. This is sufficient for them to take decisions. The methodology is useful in the application domain too, and even more so in the security domain, so a holistic solution can be achieved. Visibility engineering also plays a part here: it ensures that clarity is restored and that simple dashboards are put in place for taking simple decisions. In the next blog, we shall talk about simplifying the toil in operational tasks that are not recorded, and how that adds to the release engineering principle.