Anomaly Detection Using Isolation Forest
Recently I have started working on anomaly detection techniques. Anomaly detection is one of the less explored areas in data science. In this article, let us brush up on the fundamentals of anomaly detection and then dive deep into the Isolation Forest technique, which wins hands down over the conventional alternatives.
Anomaly detection deals with the identification of unusual patterns or behaviour that don't conform to the usual trend.
It is applied in a wide range of areas: signal processing, automation in manufacturing, chemical reaction monitoring and so on. Here we will narrow down to finding anomalous data points.
Anomaly detection using conventional statistical methods
Statisticians have, since the 1950s, come up with different methods for anomaly detection. Here are the three most widely used statistical methods. Since our main focus is on Isolation Forest, we will not discuss these methods in detail, though I will give pointers; if you're interested, go ahead and take a look.
Though the above tests have been used in industry for a long time, they have the following limitations:
- They work well only for distributions that are close to normal
- They detect only one anomaly per test
- They work well only on 1-dimensional data
Need for other anomaly detection methods
1. Most real-world data deviates significantly from normal
Let us see an example.
At Grofers, we once analysed data on the total action time of vendors (in layman's terms, this is similar to delivery time). When I plotted a histogram of the distribution, I got a distribution that significantly deviates from normal.
Thus, we can't go with the blunt assumption that all real-world data is close to a normal distribution. We need an algorithm that can deal with distributions that aren't normal.
2. A dataset can have multiple anomalies
Conventional statistical methods generally detect only one anomaly per test, but in most cases multiple anomalies are present in the dataset. So we need an algorithm that can detect multiple anomalies in one go.
3. A data point can depend on many features
Most real-world phenomena involve a significant number of dependent variables/features. Thus we require an algorithm that can overcome the curse of dimensionality. (Feel free to read Section 5.3 to understand how Isolation Forest overcomes this problem.)
Isolation forest method
Isolation Forest detects anomalies by randomly partitioning the domain space. Yeah, you heard me right: it works similarly to the decision tree algorithm, where we start at a root node and keep partitioning the space. In an Isolation Forest, though, we partition randomly, unlike decision trees where the partition is based on information gain.
Partitions are created by randomly selecting a feature and then randomly picking a split value between the maximum and minimum values of that feature. We keep creating partitions until we isolate every point (in most cases we also set a limit on the number of partitions, i.e. the height of the tree).
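The random partitioning described above can be sketched in a few lines of Python. This is an illustrative toy, not the library implementation: it follows a single point down one randomly built tree and returns the number of splits needed to isolate it.

```python
import random

def isolation_tree_depth(points, point, depth=0, max_depth=10):
    """Number of random splits needed to isolate `point` from `points`."""
    if len(points) <= 1 or depth >= max_depth:
        return depth
    # Randomly pick a feature, then a split value between its min and max
    feature = random.randrange(len(point))
    values = [p[feature] for p in points]
    lo, hi = min(values), max(values)
    if lo == hi:  # all remaining points identical on this feature
        return depth
    split = random.uniform(lo, hi)
    # Keep only the side of the split that still contains `point`
    if point[feature] < split:
        points = [p for p in points if p[feature] < split]
    else:
        points = [p for p in points if p[feature] >= split]
    return isolation_tree_depth(points, point, depth + 1, max_depth)
```

Averaged over many random trees, a far-away point needs fewer splits to isolate than a point sitting inside a cluster, which is exactly the signal Isolation Forest exploits.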
Now let us visualise how a normal point differs from an anomalous point in our feature space.
In the image above, the red points denote the anomalous points whereas the blue ones denote the normal points.
Clearly, normal points are more clustered, while an anomalous point sits far away from the others. Thus, while randomly partitioning the domain space, an anomaly will be isolated in a smaller number of partitions than a normal point. A smaller number of partitions means a shorter distance from the root node (fewer edges need to be travelled from the root node to the terminal node).
The concept discussed above is well illustrated by the images below.
So the path length clearly indicates whether a point is normal or anomalous. (The path length of a point x is the number of edges x traverses in an isolation tree, from the root node until the traversal terminates at an external node.)
Isolation Forest is an ensemble method: we create multiple isolation trees (generally 100 trees will suffice) and take the average of all the path lengths. This average path length then decides whether a point is anomalous or not.
The anomaly score is given by the following formula:

s(x, n) = 2^(-E(h(x)) / c(n))

where:
n- Number of data points
c(n) – the average path length of an unsuccessful search in a binary search tree. (Feel free to go through pages 660–661 in this book for the derivation.) For n > 2 it can be approximated as c(n) = 2H(n−1) − 2(n−1)/n, where H(i) ≈ ln(i) + 0.5772156649 (Euler's constant).
Now observe: we grow an isolation tree by randomly choosing a feature and randomly partitioning, which is very similar to a binary search tree. Thus we can approximate the average path length at which a point is isolated by the average path length of an unsuccessful search in a binary search tree, and we use c(n) as the reference.
If you're having a hard time with binary search trees, just think of c(n) as a reference metric: it normalises the score to lie between 0 and 1.
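For concreteness, here is a small Python sketch of c(n) and the anomaly score, following the formulas from the original Isolation Forest paper (the harmonic number is approximated with ln(i) plus Euler's constant):

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Average path length of an unsuccessful search in a binary
    search tree built on n points (used to normalise path lengths)."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    """s(x, n) = 2 ** (-E(h(x)) / c(n)); scores near 1 flag anomalies."""
    return 2.0 ** (-avg_path_length / c(n))
```

When the average path length equals c(n), the score is exactly 0.5; shorter paths push the score towards 1 and longer paths push it towards 0.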
Note that it is always better to represent the score between 0 and 1, because the score can then be read like a probability. For example, if a data point gets an anomaly score of 0.8, we can interpret it as roughly an 80% chance of the point being anomalous.
E(h(x)) – the average of the path lengths of x across the isolation forest
- If the score is close to 1, the point is anomalous
- If the score is close to 0, it is a normal observation
- A score near 0.5 indicates the point has no clear distinction from normal observations
Let us now implement the Isolation Forest algorithm in Python using the sklearn library.
Let us first run it on a synthetic dataset, and then discuss a real-world example from the vendor TAT dataset.
Anomaly detection on synthetic dataset using Python
I will take you through the code and we will interpret it as we go.
First, we import the necessary libraries.
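The original import cell is not shown here; a minimal version might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
```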
Next, we generate data for our purpose. Here we generate 4 different datasets: 2 with a standard normal distribution and 2 with a uniform distribution. The former will be our normal (usual) points while the latter will be the outliers. Two of each are generated because one will be used for training and the other for testing.
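A sketch of the data generation step (the exact sizes and scales used in the original are assumptions here):

```python
import numpy as np

rng = np.random.RandomState(42)

# Two clustered "normal" datasets: standard normal, scaled down
X_train_normal = 0.3 * rng.randn(100, 2)
X_test_normal = 0.3 * rng.randn(100, 2)

# Two scattered "outlier" datasets: uniform over a wider range
X_train_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
X_test_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
```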
We will plot and visualise the dataset to get a better understanding.
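The visualisation step might look like this (self-contained, so the toy data is regenerated):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(42)
X_normal = 0.3 * rng.randn(100, 2)              # clustered usual points
X_outliers = rng.uniform(-4, 4, size=(20, 2))   # scattered outliers

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(X_normal[:, 0], X_normal[:, 1], c="blue", s=20, label="normal")
ax.scatter(X_outliers[:, 0], X_outliers[:, 1], c="red", s=20, label="outliers")
ax.legend()
ax.set_title("Synthetic data: normal points vs outliers")
plt.show()
```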
Now we come to the important part, where we call the Isolation Forest algorithm, train it on our training data and then generate predictions.
We first append the outlier data to the normal data, for both the training and the test datasets.
We then fit the Isolation Forest algorithm. Here we have two parameters. random_state just sets the random seed, so that the same trees are generated every time we run it.
contamination is our assumption about the fraction of anomalies in the dataset. This number is set by the intuition of domain experts, who generally have a rough idea of the fraction of anomalies in their data. For most real-life cases, a value of 0.1 works.
Once we are done fitting, we generate predictions. Here y_train holds the predictions on the training dataset while y_test corresponds to the test data.
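Putting the fitting and prediction steps together (a self-contained sketch; the variable names are assumptions):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Append the outliers to the normal points for train and test sets
X_train = np.r_[0.3 * rng.randn(100, 2), rng.uniform(-4, 4, size=(20, 2))]
X_test = np.r_[0.3 * rng.randn(100, 2), rng.uniform(-4, 4, size=(20, 2))]

clf = IsolationForest(contamination=0.1, random_state=42)
clf.fit(X_train)

# predict() returns +1 for normal points and -1 for anomalies
y_train = clf.predict(X_train)
y_test = clf.predict(X_test)
```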
Now we will plot and visualise the predictions of the Isolation Forest algorithm on our toy dataset.
First we will look at the training predictions.
Now let us look at the test predictions.
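Plotting the test predictions might look like this (self-contained sketch):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = np.r_[0.3 * rng.randn(100, 2), rng.uniform(-4, 4, size=(20, 2))]
X_test = np.r_[0.3 * rng.randn(100, 2), rng.uniform(-4, 4, size=(20, 2))]

clf = IsolationForest(contamination=0.1, random_state=42).fit(X_train)
y_test = clf.predict(X_test)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(X_test[y_test == 1, 0], X_test[y_test == 1, 1],
           c="blue", s=20, label="predicted normal")
ax.scatter(X_test[y_test == -1, 0], X_test[y_test == -1, 1],
           c="red", s=20, label="predicted anomaly")
ax.legend()
plt.show()
```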
Hurray!! We have implemented the Isolation Forest algorithm.
Isolation Forest can be useful in various real-world applications.
1. When anomalies affect your machine learning algorithm
All machine learning algorithms assume the ideal scenario that the values of the input features are representative of the real world.
Any data comes with unavoidable errors, but anomalies can deviate from real-world behaviour so much that they change the apparent distribution of the data. This can hurt the performance of the machine learning algorithm, which ends up misinterpreting the relation between inputs and output.
This will be very well evident in the next point.
2. When anomalies mislead your analysis
At Grofers, we use the TAT of a vendor for various operations. TAT is the total action time: the time between the moment a purchase order is placed and the moment the order is delivered.
Say we analyse the standard deviation of the TAT for a particular vendor. Let us see the result of our analysis in two cases: data with anomalies, and data without.
We first calculate the standard deviation of the feature with the anomalies still in it.
Now we apply the Isolation Forest algorithm and remove the outliers. After removing the outliers, we calculate the standard deviation again.
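The TAT values themselves are internal to Grofers, so here is the same before/after comparison on a synthetic stand-in (normal values around 48 hours plus a few manual-entry errors):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Synthetic stand-in for a vendor's TAT values (hours): mostly ~48h,
# plus a few manual-entry errors far outside the usual range
tat = np.r_[rng.normal(48, 4, 500), [500.0, 800.0, 1000.0]].reshape(-1, 1)

std_with = tat.std()

clf = IsolationForest(contamination=0.01, random_state=0).fit(tat)
mask = clf.predict(tat) == 1          # keep only predicted inliers
std_without = tat[mask].std()
```

A handful of extreme entries inflates the standard deviation many times over; after removing the predicted outliers, the statistic reflects the actual spread of the vendor's delivery times.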
We clearly see that the presence of anomalies changes the essence of the data. Thus we use Isolation Forest to remove outliers before feeding the data to any algorithm or analysis.
3. Anomalous points can reveal mistakes in a process
Manual errors are inevitable in data management. Isolation Forest can detect them, since manual errors mostly sit far from the normal data points in the domain space.
Let us see an example in the next part, where we detect manual errors in the TAT dataset.
Applying Isolation Forest on TAT data
We use the TAT dataset again. Due to manual errors, anomalous values may exist in the dataset. The job is to detect these anomalies and remove them.
Six months of vendor data were taken into consideration. The data was grouped outlet-wise, since delivery time depends on the distance of the outlet from the vendor's station. Contamination was set to 0.01.
Predictions were then generated.
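A hedged sketch of the group-wise setup, using synthetic stand-in data and hypothetical column names (the real dataset and schema are internal to Grofers):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Hypothetical TAT table: two outlets with different usual delivery times
df = pd.DataFrame({
    "outlet": np.repeat(["outlet_a", "outlet_b"], 300),
    "tat_hours": np.r_[rng.normal(24, 3, 300), rng.normal(72, 6, 300)],
})
# Inject a couple of manual-entry errors
df.loc[5, "tat_hours"] = 900.0
df.loc[450, "tat_hours"] = 0.1

# Fit one forest per outlet, since the usual TAT varies by outlet
df["prediction"] = df.groupby("outlet")["tat_hours"].transform(
    lambda s: IsolationForest(contamination=0.01, random_state=0)
    .fit_predict(s.to_frame())
)
```

Rows flagged with -1 are the suspected manual errors for their outlet and can be dropped before downstream analysis.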
Let us take two vendors and visualise the predictions.
The algorithm was able to detect the anomalies efficiently and accurately.
Anomaly detection using Isolation forest was originally published in Lambda – The Grofers Engineering Blog on Medium.