Anomaly detection is a monitoring mechanism in which a system tracks the key metrics of a business and alerts users whenever a metric deviates from normal behaviour. Conventionally, businesses use a fixed set of thresholds and mark any metric that crosses one as an anomaly. However, this method is reactive: by the time a business recognizes a threshold violation, the damage may already have multiplied. What is needed is a system that constantly monitors data streams for anomalous behaviour and alerts users in real time so that they can act promptly.
Anomaly detection algorithms can analyse huge volumes of historical data to establish a ‘normal’ range, and raise red flags when data points fall outside the tolerable range.
A good anomaly detection system should be able to perform the following tasks:
- Identifying the signal type & selecting an appropriate model
- Forecasting thresholds
- Anomaly identification & scoring
- Finding root cause by correlating various identified anomalies
- Obtaining feedback from users to check quality of anomaly detection
- Re-training of the model with new data
Identification of signal type: The first task is to identify the type of signal, for instance whether the data has cyclicity or a trend component. Deep learning models usually do not perform well on sparse data or small volumes of data; for these types of signals, a simple ARIMA or XGBoost model with the right feature engineering might be a better option. For large volumes of data with clear cyclicity, deep learning models are a good choice.
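As a rough illustration of this selection step, the sketch below checks for cyclicity with a plain autocorrelation at a candidate seasonal lag and combines it with the data volume. The function names, the `min_points` cut-off and the `min_autocorr` cut-off are all hypothetical values chosen for the example, not recommendations from this article.

```python
from statistics import mean


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mu = mean(series)
    denom = sum((x - mu) ** 2 for x in series)
    if denom == 0:
        return 0.0  # a constant series has no cyclic structure
    num = sum((series[i] - mu) * (series[i + lag] - mu) for i in range(n - lag))
    return num / denom


def suggest_model_family(series, season_lag, min_points=500, min_autocorr=0.5):
    """Pick a model family from data volume and cyclicity.

    min_points and min_autocorr are illustrative assumptions only.
    """
    cyclic = autocorrelation(series, season_lag) >= min_autocorr
    if len(series) >= min_points and cyclic:
        return "deep learning"
    return "ARIMA or XGBoost"
```

In practice the cut-offs would be tuned per metric, and a trend test would sit alongside the cyclicity check.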
Forecasting thresholds: After every re-training, the model forecasts the threshold limits. These limits are calculated from statistics of the latest training data, such as the mean, median and variance. Assuming the values are approximately normally distributed, the threshold for the next point to be forecast is set according to a given confidence level.
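A minimal sketch of this idea, assuming the recent values are roughly normally distributed: the band is the mean plus or minus a z-score times the standard deviation, where the z-score comes from the chosen confidence level. The function name `threshold_band` and the default confidence of 0.95 are illustrative assumptions; a production system would typically centre the band on the model's forecast rather than the historical mean.

```python
from statistics import NormalDist, mean, stdev


def threshold_band(history, confidence=0.95):
    """Return (lower, upper) limits for the next point.

    Assumes `history` is approximately normal; needs at least two values.
    """
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    mu = mean(history)
    sigma = stdev(history)
    # Two-sided band: e.g. confidence 0.95 uses the 97.5th percentile (z ≈ 1.96).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return mu - z * sigma, mu + z * sigma
```

A higher confidence level widens the band, so fewer points are flagged.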
Anomaly identification and scoring: Anomalies are identified whenever a particular metric moves beyond the specified threshold. However, it is important to quantify the magnitude of the deviation, in order to prioritize which anomaly should be investigated or resolved first. In the scoring phase, each anomaly is scored by the magnitude of its deviation from the median, or by how long the metric stays away from normal behaviour. The larger the deviation, the higher the score.
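The two scoring signals described here could be computed along these lines. This is only a sketch: the function name `score_anomalies`, the choice to scale the deviation by half the band width, and the tuple layout are assumptions made for the example.

```python
from statistics import median


def score_anomalies(values, lower, upper):
    """Return (index, magnitude, run_length) for each out-of-band point.

    magnitude:  distance from the median, in units of half the band width
                (an illustrative scaling, not a standard one).
    run_length: how many consecutive points have been out of band so far.
    """
    half_width = (upper - lower) / 2
    if half_width <= 0:
        raise ValueError("upper must be greater than lower")
    center = median(values)
    results = []
    run_length = 0
    for index, value in enumerate(values):
        if lower <= value <= upper:
            run_length = 0  # back to normal, so the run ends
            continue
        run_length += 1
        magnitude = abs(value - center) / half_width
        results.append((index, magnitude, run_length))
    return results
```

Sorting the result by magnitude or by run length then gives a simple priority order for investigation.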
Finding root cause by correlating various identified anomalies: Often, it is difficult to identify the root cause by looking at each metric in a silo. Putting all the anomalies together gives a more complete picture of the situation. Consider the example of a sudden increase in traffic on a set of towers for a telecom operator. Looked at one by one, each tower just shows an unexplained spike. But by putting them on a map, it can be seen that the tower in the centre was shut down due to a technical problem, which led to the increase in traffic on all the neighbouring towers. This increase could be temporary, so the operator does not need to take any permanent action, such as increasing investment in infrastructure, based on this anomaly. To stitch together the entire story, one needs to lay out all the anomalies together and understand the context by correlating them with multiple data sources.
Feedback from users to check quality of anomaly detection: Anomaly detection systems are usually designed around tight bounds to highlight deviations quickly, but in the process they sometimes raise many false alarms. In fact, false positives are known to be one of the most prevalent issues in anomaly detection. The importance of letting end users change the status of a data point from anomaly to normal should not be underrated. After receiving this feedback, models need to be updated or retrained to prevent the identified false positives from recurring.
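One very crude way to act on such feedback, shown here only as a sketch, is to widen the threshold limits so that values users have relabelled as normal fall inside them. This is a stand-in for proper retraining, not a replacement for it, and the function name `apply_feedback` is hypothetical.

```python
def apply_feedback(lower, upper, marked_normal):
    """Widen (lower, upper) to cover values users relabelled as normal.

    marked_normal: values that were flagged as anomalies but that a user
    has since marked as normal. A simplification of real retraining.
    """
    if not marked_normal:
        return lower, upper
    return min(lower, min(marked_normal)), max(upper, max(marked_normal))
```

Widening the limits trades fewer false alarms for a higher risk of missing real anomalies, so a real system would more likely feed the relabelled points back into the training data.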
Re-training of the model with new data: The system needs to re-train on new data continuously to adapt to newer trends. The pattern itself may change because of a change in the operating environment, rather than because of anomalous behaviour. However, there should be a balance in the mechanism. Updating the model too frequently consumes an excessive amount of computational resources, while updating it too rarely lets the model drift away from the actual trend.
Overall, anomaly detection has gained importance in recent years, due to the exponential growth of available data and the absence of effective mechanisms to use this data. Anomaly detection systems are well suited to identifying significant deviations while ignoring insignificant noise in the ocean of data, giving businesses the right alarms and insights at the right time.
This article was originally published at TechBullion.
Pratap Dangeti is the Principal Data Scientist at CrunchMetrics. He has close to 9 years of experience in analytics across domains such as banking, IT, credit & risk, manufacturing, hi-tech, utilities and telecom. His technical expertise includes Statistical Modelling, Machine Learning, Big Data, Deep Learning, NLP, and artificial intelligence. As a hobbyist, he has written 2 books in the field of Machine Learning & NLP.