
Anomaly Detection Using Isolation Forest

Recently I have started working on anomaly detection techniques. Anomaly detection is one of the less explored areas in Data Science. In this article, let us brush up on the fundamentals of anomaly detection and then dive deep into the Isolation Forest technique, the one that wins hands down over the conventional alternatives.

Anomaly Detection

Anomaly detection deals with identifying unusual patterns or behaviour that do not conform to the usual trend.

It is applied in a wide range of areas: signal processing, automation in manufacturing, chemical reaction monitoring, etc. Here we will narrow down to finding anomalous data points in a dataset.

Anomaly detection using conventional statistical methods

Statisticians have been coming up with methods for anomaly detection since the 1950s. Here are three of the most widely used statistical tests. Since our main focus is on Isolation Forest, we will not discuss these methods in detail, but the names below are pointers: if you're interested, go ahead and take a look.

  1. Dixon's Q test
  2. Bartlett's test
  3. Grubbs' test

Though the above tests have been used in industry for a long time, they have the following limitations:

  • They work well only for near-normal distributions
  • They detect only one anomaly per test
  • They work well only on 1-dimensional data

Need for other anomaly detection methods

1. Most real-world data deviates significantly from the normal distribution

Let us see an example.

At Grofers, we once analysed the Total Action Time (TAT) of vendors (in layman's terms, this is similar to delivery time). When I plotted the histogram of the data, I clearly got a distribution that deviates significantly from the normal.

Example of a Distribution that deviates from Normal distribution

Thus, we cannot go with the blunt assumption that all real-world data is close to a normal distribution. We need an algorithm that can deal with data which isn't normally distributed.

2. Dataset can have multiple anomalies

Conventional statistical methods generally detect only one anomaly per test, but in most cases there are multiple anomalies present in the dataset. So we need an algorithm that can detect multiple anomalies in one go.

3. A data point can depend on a lot of features

Most real-world phenomena are described by a significant number of variables/features. Thus we require an algorithm that can overcome the curse of dimensionality. (Feel free to read Section 5.3 to understand how Isolation Forest overcomes this problem.)

Isolation forest method

Isolation Forest detects anomalies by randomly partitioning the domain space. Yes, you heard me right: it works similarly to the decision tree algorithm, where we start with a root node and keep on partitioning the space. In Isolation Forest, however, we partition randomly, unlike decision trees where the split is based on information gain.

Partitions are created by randomly selecting a feature and then randomly choosing a split value between the minimum and maximum values of that feature. We keep creating partitions until we isolate all the points (in most cases we also set a limit on the number of partitions, i.e. the height of the tree). A toy sketch of this idea is given below.
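
To make this concrete, here is a toy sketch (not the library implementation) of how a single random partitioning pass could isolate one point; the function name and structure are my own illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def isolation_path_length(X, x, height_limit=10, depth=0):
    # Stop when the point is isolated or the height limit is reached.
    if len(X) <= 1 or depth >= height_limit:
        return depth
    feature = rng.integers(X.shape[1])             # pick a random feature
    lo, hi = X[:, feature].min(), X[:, feature].max()
    if lo == hi:                                   # cannot split a constant feature
        return depth
    split = rng.uniform(lo, hi)                    # random split between min and max
    mask = X[:, feature] < split
    side = X[mask] if x[feature] < split else X[~mask]
    return isolation_path_length(side, x, height_limit, depth + 1)

# A normal cloud of points plus one obvious outlier at (8, 8).
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[8.0, 8.0]]])
print(isolation_path_length(X, X[-1]))   # the outlier typically needs fewer splits
print(isolation_path_length(X, X[0]))    # a normal point typically needs more
```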

Now let us visualise how a normal point differs from an anomalous point in our feature space.

Visualising anomalies

In the above image, the red points denote the anomalous points whereas the blue ones denote the normal points.

Clearly, a normal point sits in a dense cluster while an anomalous point lies far away from the other points. Thus, while randomly partitioning the domain space, an anomaly will be isolated in a smaller number of partitions than a normal point. A smaller number of partitions means a shorter distance from the root node (i.e. fewer edges need to be traversed from the root node to the terminal node).

The concept discussed above is evident in the images given below.

Isolating a normal point
Isolating an anomalous point

An anomaly will be isolated in a smaller number of partitions than a normal point.

So clearly, the path length indicates whether a point is normal or anomalous. (The path length h(x) of a point x is the number of edges x traverses in an isolation tree, from the root node until the traversal terminates at an external node.)

Isolation Forest is an ensemble method: we create multiple isolation trees (generally 100 trees will suffice) and take the average of all the path lengths. This average path length then decides whether a point is anomalous or not.

Anomaly score-

The anomaly score of a point x is given by the following formula:

s(x, n) = 2^( -E(h(x)) / c(n) )

where

n – the number of data points

c(n) – the average path length of an unsuccessful search in a Binary Search Tree (feel free to go through pages 660 and 661 in this book for the derivation)

Now observe: we grow an isolation tree by randomly choosing a feature and randomly partitioning, which is very similar to a Binary Search Tree. Thus we can approximate the average path length for a node termination by the average path length of an unsuccessful search in a Binary Search Tree, and we use c(n) as the reference.

If you're having a hard time with Binary Search Trees, just think of c(n) as a reference metric: it normalises the score to lie between 0 and 1.

Note that it is always better to represent the score between 0 and 1, because it can then be read like a probability. For example, if a data point gets an anomaly score of 0.8, we can interpret it as having roughly an 80% chance of being anomalous.

E(h(x)) – the average of the path lengths of x over all the trees in the Isolation Forest

  • A score close to 1 indicates an anomalous point
  • A score close to 0 indicates a normal observation
  • A score around 0.5 indicates the point has no clear distinction from normal observations
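
If you want to see the normalisation in action, here is a small sketch of the score computation; the helper names are mine, and c(n) uses the usual harmonic-number approximation:

```python
import numpy as np

EULER_MASCHERONI = 0.5772156649

def c(n):
    # Average path length of an unsuccessful search in a BST with n points.
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + EULER_MASCHERONI) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    # s(x, n) = 2 ** (-E(h(x)) / c(n))
    return 2.0 ** (-avg_path_length / c(n))

# A short average path length pushes the score towards 1 (anomalous),
# a long one pushes it towards 0 (normal).
print(anomaly_score(2.0, 256))    # short path  -> high score
print(anomaly_score(12.0, 256))   # longer path -> lower score
```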

Let us now implement the Isolation Forest algorithm in Python using the sklearn library.

Let us first run it on a synthetic dataset and then discuss a real-world example from the vendor TAT dataset.

Anomaly detection on synthetic dataset using Python

I will take you through the code and we will interpret the results as we go.

First, we import the necessary libraries.

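A minimal sketch of what these imports could look like (numpy, matplotlib and scikit-learn are assumed to be installed):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)  # fixed seed so that every run is reproducible
```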

Next, we generate the data. Here we generate 4 different datasets: 2 drawn from a standard normal distribution and 2 from a uniform distribution. The former will essentially be our normal (usual) points while the latter will be the outliers. Two of each are generated because one will be used for training and the other for testing.

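A sketch of how this data could be generated; the exact sizes and ranges are my assumptions, not necessarily the ones used in the original run:

```python
# Normal (usual) points: standard-normal blobs, one set for training
# and one for testing.
X_normal_train = rng.randn(100, 2)
X_normal_test = rng.randn(100, 2)

# Outliers: uniformly scattered over a much wider range, again one set
# for training and one for testing.
X_outliers_train = rng.uniform(low=-6, high=6, size=(20, 2))
X_outliers_test = rng.uniform(low=-6, high=6, size=(20, 2))
```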

We will plot and visualise the dataset to get a better understanding.

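One possible way to plot it (blue for the normal points, red for the outliers):

```python
plt.scatter(X_normal_train[:, 0], X_normal_train[:, 1], c="blue", label="normal")
plt.scatter(X_outliers_train[:, 0], X_outliers_train[:, 1], c="red", label="outliers")
plt.legend()
plt.title("Synthetic dataset")
plt.show()
```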

Plotting the dataset

Now we come to the important part, where we call the Isolation Forest algorithm, train it on our training data and then generate predictions.

We first append the outlier data to the normal data for both the training and the testing datasets.

We then fit the Isolation Forest algorithm. Here we have two parameters. random_state just sets the random seed, so that the same trees are generated every time we run it.

Contamination is the assumed fraction of anomalies in the dataset. This number is usually set from the intuition of domain experts, who generally have a rough idea of the fraction of anomalies in the data. For most real-life cases, a value of 0.1 works.

Once we are done fitting, we generate predictions. Here y_train holds the predictions on the training dataset while y_test holds those for the test data.

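A sketch of the fit-and-predict step described above; the variable names follow the description, the hyperparameter values are assumptions:

```python
# Stack the normal and outlier points into the final training and test sets.
X_train = np.r_[X_normal_train, X_outliers_train]
X_test = np.r_[X_normal_test, X_outliers_test]

# contamination is the assumed fraction of anomalies; random_state fixes the seed.
clf = IsolationForest(contamination=0.1, random_state=42)
clf.fit(X_train)

# predict() returns +1 for points judged normal and -1 for anomalies.
y_train = clf.predict(X_train)
y_test = clf.predict(X_test)
```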

Now we will plot and visualise the predictions of the Isolation Forest algorithm on our toy dataset.

First we will look at the training predictions.

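The training predictions can be visualised along these lines:

```python
# Colour each training point by its predicted label: red for -1 (anomaly),
# blue for +1 (normal).
colours = np.where(y_train == -1, "red", "blue")
plt.scatter(X_train[:, 0], X_train[:, 1], c=colours)
plt.title("Training predictions")
plt.show()
```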

Visualising training predictions

Now let us look at the test predictions.

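And the same for the test predictions:

```python
colours = np.where(y_test == -1, "red", "blue")
plt.scatter(X_test[:, 0], X_test[:, 1], c=colours)
plt.title("Test predictions")
plt.show()
```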

Visualising test predictions

Hurray! We have implemented the Isolation Forest algorithm.

Grofers’ use-case

Isolation Forest can be useful in various real-world applications.

  1. When anomalies affect your Machine learning algorithm-

Every machine learning algorithm assumes the ideal scenario that the values of the input features are representative of the real world.

Any data comes with unavoidable errors, but anomalies can deviate from real-world behaviour to such an extent that they change the apparent distribution of the data. This can hurt the performance of the machine learning algorithm, which ends up misinterpreting the relationship between inputs and output.

This will be very well evident in the next point.

2. When anomalies mislead your analysis-

At Grofers, we use the TAT of a vendor for various operations. TAT is the Total Action Time: the total time between when the purchase order is placed and when the order is delivered.

Say we analyse the standard deviation of the TAT for a particular vendor. Let us see the result of our analysis in two cases: data with anomalies and data without.

We first calculate the standard deviation of the feature with the anomalies still present.

Data with anomaly

Now we apply the Isolation Forest algorithm and remove the outliers. After removing them, we calculate the standard deviation again.

Data after removing anomalies

We clearly see that the presence of anomalies changes the essence of the data. Thus we use Isolation Forest to remove the outliers before feeding the data to any algorithm or analysis. A small sketch of this comparison is given below.
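
As an illustration, the comparison could be done along these lines; the DataFrame and its tat column are hypothetical stand-ins for the real vendor data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical stand-in for one vendor's TAT values (hours), with a few
# manual-entry errors mixed in.
tat = np.concatenate([np.random.normal(24, 3, size=200), [200, 350, 500]])
df = pd.DataFrame({"tat": tat})

clf = IsolationForest(contamination=0.1, random_state=42)
labels = clf.fit_predict(df[["tat"]])        # -1 = anomaly, +1 = normal

print("std with anomalies:   ", df["tat"].std())
print("std without anomalies:", df.loc[labels == 1, "tat"].std())
```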

3. Anomalous points can reveal mistakes in a process

Manual errors are inevitable in data management. Isolation Forest can detect them, since manual errors mostly lie far from the normal data points in the domain space.

Let us see an example in the next part, where we detect manual errors in the TAT dataset.

Applying Isolation Forest on TAT data

We use the TAT dataset again. Due to manual errors, anomalous values may exist in the dataset. The job is to detect the anomalies and remove them.

Six months of vendor data was taken into consideration. The data was grouped outlet-wise, since the delivery time depends on the distance of the outlet from the vendor's station. Contamination was set to 0.01.

The predictions were then generated outlet by outlet; a rough sketch of this group-wise setup is shown below.
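
This sketch assumes a DataFrame with hypothetical outlet_id and tat columns; the real column names and loading step will differ:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical loading step; the actual source of the 6-month data will differ.
df = pd.read_csv("vendor_tat_6_months.csv")   # columns assumed: outlet_id, tat

def flag_anomalies(group):
    # One Isolation Forest per outlet, since TAT depends on the outlet's distance.
    clf = IsolationForest(contamination=0.01, random_state=42)
    group = group.copy()
    group["is_anomaly"] = clf.fit_predict(group[["tat"]]) == -1
    return group

df = df.groupby("outlet_id", group_keys=False).apply(flag_anomalies)
```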

Let us take two vendors and visualise the predictions.

Detecting anomalies for each vendor outlet-wise
Detecting anomalies for each vendor outlet-wise

The algorithm was able to detect the anomalies efficiently and accurately.

References

  1. Liu, Ting and Zhou, "Isolation Forest" (ICDM 2008): https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf?q=isolation-forest
  2. scikit-learn documentation, sklearn.ensemble.IsolationForest: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
