An Introduction to Outlier Detection Methods using Python

An outlier is a data point that lies at an abnormal distance from the other observations in a random sample from a population, i.e., one that differs significantly from the rest of the data. Outliers may be due to natural variability in the measurement, or they may indicate experimental errors.

Types of Outliers :

Outliers come in three types, depending on their context:

1. Point outlier: If an individual data point can be considered anomalous with respect to the rest of the data, it is termed a point outlier. A classic application is intrusion detection in computer networks.

2. Contextual outliers: If a data instance is anomalous in a specific context (but not otherwise), it is termed a contextual outlier. The attributes of the data objects are divided into two groups :
⦁ Contextual attributes: define the context, e.g., time & location
⦁ Behavioural attributes: the characteristics of the object used in the outlier evaluation, e.g., temperature

3. Collective outliers: If a collection of data points is anomalous with respect to the entire data set, it is termed a collective outlier. For example, a long run of identical readings from a sensor can be anomalous as a group even though each individual reading is within the normal range.

In machine learning, outliers can be a serious issue when training an algorithm.
They can spoil and mislead the training process, resulting in longer training times and less accurate models. Outlier detection may be defined as the process of detecting, and then removing, outliers from a given data set.

There are several methods to identify outliers in a dataset. In this article, we focus on the three most commonly used :

1. Interquartile range (IQR)
2. DBSCAN clustering
3. Isolation forest

We can apply each of the above methods to the Weight-Height dataset in a Jupyter notebook to find the outliers.

Output :

The figure below shows a scatterplot of the variables ‘Height’ and ‘Weight’. A scatterplot lets us visualize the relationship between two quantitative variables.
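The scatterplot can be produced with a sketch like the following. Since the original CSV is not included here, the snippet synthesises a stand-in Height/Weight sample of similar shape (the units and the generated values are assumptions, not the real data):

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; safe outside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for the Weight-Height data: heights in inches, weights in pounds.
rng = np.random.default_rng(0)
height = rng.normal(66, 4, 500)
weight = 3.0 * height - 60 + rng.normal(0, 12, 500)

fig, ax = plt.subplots()
ax.scatter(height, weight, s=10)
ax.set_xlabel("Height")
ax.set_ylabel("Weight")
fig.savefig("height_weight_scatter.png")
```

In a notebook, `plt.show()` would render the figure inline instead of saving it to a file.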

The interquartile range (IQR) is a measure of statistical dispersion obtained by dividing a data set into quartiles; it is also called the midspread or H‑spread. It shows how the data are spread about the median. The data are sorted in ascending order and then divided into quartiles.

IQR is calculated as the difference between the 75th percentile (Q3) and the 25th percentile (Q1). Mathematically, it is given as :

IQR = Q3 − Q1

The IQR can be used to identify outliers by defining a new range on the sample values. The range is conventionally given as :

[Q1 − 1.5 × IQR, Q3 + 1.5 × IQR]

Any data point lying outside this range is considered an outlier.
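A minimal numeric sketch of these formulas (the small sample is made up for illustration):

```python
import numpy as np

# A tiny sample where 102 is an obvious outlier.
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])

q1, q3 = np.percentile(data, [25, 75])   # 25th and 75th percentiles
iqr = q3 - q1                            # IQR = Q3 - Q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(iqr, lower, upper, outliers)       # → 1.75 9.375 16.375 [102]
```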

Let’s apply the interquartile range (IQR) method to our dataset.

Using the IQR, we calculate the upper and lower limits with the formulas mentioned above.

Output :

Now, let’s see the data points above the upper limit & the extreme upper limit, i.e., the outliers.

Output :
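The filtering step might look like the following sketch. The Weight column is synthesised as a stand-in for the actual CSV, and the extreme upper limit is assumed to use the conventional 3 × IQR fence:

```python
import numpy as np
import pandas as pd

# Stand-in for the Weight column of the Weight-Height dataset (pounds).
rng = np.random.default_rng(1)
df = pd.DataFrame({"Weight": rng.normal(161, 32, 1000)})

q1, q3 = df["Weight"].quantile([0.25, 0.75])
iqr = q3 - q1
upper_limit = q3 + 1.5 * iqr            # standard upper fence
extreme_upper_limit = q3 + 3.0 * iqr    # assumed "extreme" fence (3 x IQR)

outliers = df[df["Weight"] > upper_limit]
extreme_outliers = df[df["Weight"] > extreme_upper_limit]
print(len(outliers), "above upper limit,",
      len(extreme_outliers), "above extreme upper limit")
```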

This method is also called Extreme Value analysis.


DBSCAN is a non-parametric, density-based outlier detection technique for one-dimensional or multi-dimensional feature spaces. It is based on the DBSCAN (density-based spatial clustering of applications with noise) clustering method. In DBSCAN, there are three classes of points, which are given as follows :
⦁ Core points: points that have at least min_samples neighbours within a distance eps
⦁ Border points: points that lie within eps of a core point but have fewer than min_samples neighbours themselves
⦁ Noise points: points that are neither core nor border points; these are treated as outliers

DBSCAN requires the following parameters :
⦁ eps: the maximum distance between two points for one to be considered a neighbour of the other
⦁ min_samples: the minimum number of points in a neighbourhood for a point to qualify as a core point

Let’s apply the DBSCAN method to our dataset.

Output :

Let’s now plot the clusters we have got. Red points are captured as outliers by model and blue points show as normal observations.
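A runnable sketch using scikit-learn's DBSCAN. The synthetic Height/Weight sample and the two injected outliers are stand-ins for the real dataset, and eps and min_samples would need tuning on the actual data:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Stand-in Height/Weight cloud plus two implausible injected points.
rng = np.random.default_rng(0)
X = rng.normal([66, 160], [4, 30], size=(500, 2))
X = np.vstack([X, [[80, 600], [40, 20]]])

# eps is a distance, so standardise the features first.
Xs = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(Xs)

# Points labelled -1 are noise, i.e. the detected outliers; plotting
# X[labels == -1] in red and the rest in blue gives the red/blue figure described above.
print("outliers found:", (labels == -1).sum())
```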

Isolation forest is a non-parametric outlier detection technique based on binary decision trees, for one-dimensional or multi-dimensional feature spaces. In this technique, outliers are isolated by building decision trees over random features. The random partitioning produces noticeably shorter paths for outliers, since outliers are few in number and have attribute values that differ greatly from normal instances, so they are separated from the rest of the data in fewer splits.

Hence, when a forest of random trees collectively produces shorter path lengths for some points, those points are highly likely to be outliers.

The IsolationForest algorithm requires some of the following parameters :
⦁ n_estimators: the number of trees in the forest
⦁ max_samples: the number of samples drawn to train each tree
⦁ contamination: the expected proportion of outliers in the data set

Let’s apply the Isolation Forest method to our dataset.

Output :

Let’s now plot the clusters we have got. Red points are captured as outliers by model and blue points show as normal observations.
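A sketch with scikit-learn's IsolationForest, on the same kind of synthetic stand-in data; the contamination value is an assumption about the outlier fraction, not a value from the original notebook:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in Height/Weight cloud plus two implausible injected points.
rng = np.random.default_rng(0)
X = rng.normal([66, 160], [4, 30], size=(500, 2))
X = np.vstack([X, [[80, 600], [40, 20]]])

iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
pred = iso.fit_predict(X)   # -1 = outlier, 1 = normal observation

# Plotting X[pred == -1] in red and the rest in blue gives the red/blue figure.
print("outliers found:", (pred == -1).sum())
```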

In this article, we learned methods for identifying outliers using quantitative techniques, and how to implement those methods in Python.

