Anomaly detection, also known as outlier detection, is the process of identifying extreme points or observations that deviate significantly from the rest of the data. These extreme points often have an interesting story to tell; by analyzing them, one can understand the extreme operating conditions of a system. Examples of anomalies include fraudulent banking transactions, a sudden increase in the failure rate of fintech transactions after a new software upgrade, or a surprising spike in purchase volume for an e-commerce product caused by a technical glitch that mistakenly displayed a price far below the regular one. Anomalies can thus occur in many forms across real business scenarios. A proper anomaly detection system should be able to distinguish signal from noise, to avoid raising too many false positives in the discovery process.
Supervised learning is the scenario in which a model is trained on labeled data and the trained model then predicts on unseen data, whereas in unsupervised learning no labels are available to train on. Each methodology has its advantages and disadvantages: supervised learning models produce highly accurate results, whereas unsupervised learning models do not improve their performance over time. One rule of thumb is that whenever a point falls beyond the 99th percentile, it can be classified as an anomaly using unsupervised learning, but this method is too simplistic.
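The 99th-percentile rule of thumb can be sketched in a few lines. This is an illustrative example on synthetic data, not part of the method described later; the variable names are my own.

```python
import numpy as np

# Synthetic data standing in for a real metric (illustrative only)
rng = np.random.default_rng(42)
values = rng.normal(loc=100, scale=10, size=1000)

# Flag everything beyond the 99th percentile as an anomaly
threshold = np.percentile(values, 99)
anomalies = values > threshold

print(f"99th percentile: {threshold:.2f}, flagged points: {anomalies.sum()}")
```

By construction this always flags roughly the top 1% of points, regardless of whether they are genuinely anomalous, which is exactly why the rule is too simplistic on its own.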
There is a third methodology called semi-supervised learning, which combines supervised and unsupervised learning. Here, we explain it in detail with an application to time-series data.
In the above-shown diagram, a regression model (such as a random forest or XGBoost regressor) is trained on the past 10 days of data to predict the next 2 days, using various explanatory variables such as an hour-of-day flag, a day-of-week flag, some lag variables, etc. Once the values are predicted by the model, upper and lower bounds are calculated from the standard deviation observed over the 2-day prediction window. The quantity added in either direction to the predicted values is computed as 1.96 * standard deviation: the upper limit is the predicted central value + 1.96 * standard deviation, and the lower limit is the predicted central value - 1.96 * standard deviation.
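The first stage described above can be sketched as follows. This is a minimal illustration on a synthetic hourly series; the feature names (`hour`, `dow`, `lag24`) and the 10-day/2-day split are taken from the text, while the data itself and the choice of `RandomForestRegressor` hyperparameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic hourly series covering 12 days (illustrative stand-in for real data)
rng = np.random.default_rng(0)
n_hours = 12 * 24
ts = pd.DataFrame({
    "y": 50 + 10 * np.sin(2 * np.pi * np.arange(n_hours) / 24)
         + rng.normal(0, 2, n_hours)
})
ts["hour"] = np.arange(n_hours) % 24           # hour-of-day flag
ts["dow"] = (np.arange(n_hours) // 24) % 7     # day-of-week flag
ts["lag24"] = ts["y"].shift(24)                # 24-hour lag variable
ts = ts.dropna()

train = ts.iloc[:-48]   # past ~10 days for training
test = ts.iloc[-48:]    # next 2 days to score

features = ["hour", "dow", "lag24"]
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(train[features], train["y"])

pred = model.predict(test[features])
sigma = np.std(test["y"].to_numpy() - pred)    # std dev over the 2-day window

upper = pred + 1.96 * sigma                    # upper bound
lower = pred - 1.96 * sigma                    # lower bound
```

The bounds `upper` and `lower` are then overlaid on the actual values, as described next.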
Once the upper and lower bounds are calculated, they are overlaid on the actual values, and an anomaly is highlighted whenever the real value breaches either the upper or lower limit. The reason this is a semi-supervised learning model is:
- The model trained on actual data is a supervised regression model; this is only the first part of the whole process.
- In the second stage, anomalies are predicted (1 represents an anomalous data point, 0 a non-anomalous one) whenever the actual value breaches either the upper or lower bound. This stage resembles an unsupervised learning model, where the multiplier values are fixed (e.g., 1.96, 2.56, etc.). The only variable in this phase is the standard deviation, which is not under the model's control.
In this way, semi-supervised learning is used to predict anomalous data points in time-series data. Even though the second phase of the model is unsupervised, and the bounds do not learn from the data, the user can manually reset the values to adjust to the signal. For example, if the limits are too tight and throw too many anomalies, the user can widen the bands by changing the constant from 1.96 to 2.56. Similarly, if the bounds are too broad and every real anomaly gets camouflaged, the user can tighten the limits from 1.96 to, say, 1.63, so that the bounds fit the signal type.
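The second stage, including the adjustable multiplier, can be sketched as a small helper. The function name `flag_anomalies` and the synthetic predictions are my own; the 1.63/1.96/2.56 multipliers come from the text.

```python
import numpy as np

def flag_anomalies(actual, predicted, k=1.96):
    """Return 1 for points breaching predicted +/- k * residual std, else 0."""
    sigma = np.std(actual - predicted)
    upper = predicted + k * sigma
    lower = predicted - k * sigma
    return ((actual > upper) | (actual < lower)).astype(int)

# Synthetic 2-day window of hourly values with one injected spike (illustrative)
rng = np.random.default_rng(1)
predicted = np.full(48, 100.0)
actual = predicted + rng.normal(0, 5, 48)
actual[10] += 30  # a clear anomaly

flags_tight = flag_anomalies(actual, predicted, k=1.63)    # tighter bounds
flags_default = flag_anomalies(actual, predicted, k=1.96)
flags_wide = flag_anomalies(actual, predicted, k=2.56)     # wider bounds
```

Because the multiplier `k` only scales the same residual standard deviation, tightening it can only add flagged points and widening it can only remove them, which is the manual adjustment described above.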
Summon the power of Augmented Analytics to help you identify risks and business incidents in real-time.
Pratap Dangeti is the Principal Data Scientist at CrunchMetrics. He has close to 9 years of experience in the field of analytics across domains such as banking, IT, credit & risk, manufacturing, hi-tech, utilities, and telecom. His technical expertise includes Statistical Modelling, Machine Learning, Big Data, Deep Learning, NLP, and artificial intelligence. As a hobbyist, he has written 2 books in the field of Machine Learning & NLP.