
Oversampling and undersampling in data analysis

Oversampling and undersampling in data analysis are techniques used to adjust the class distribution of a data set (i.e. the ratio between the different classes/categories represented).

Oversampling and undersampling are opposite and roughly equivalent techniques. They both involve using a bias to select more samples from one class than from another.

The usual reason for oversampling is to correct for a bias in the original dataset. One scenario where it is useful is when training a classifier using labelled training data from a biased source, since labelled training data is valuable but often comes from unrepresentative sources.

For example, suppose we have a sample of 1000 people of which 66.7% are male. We know the general population is 50% female, and we may wish to adjust our dataset to represent this. Simple oversampling will select each female example twice, and this copying will produce a balanced dataset of 1333 samples with roughly 50% female. Simple undersampling will drop some of the male samples at random to give a balanced dataset of 666 samples, again with 50% female.
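A minimal NumPy sketch of both operations on this example follows; the array name, fixed seed, and exact 667/333 split are illustrative assumptions rather than part of the text above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Illustrative labels: 667 male (1) and 333 female (0) samples.
y = np.array([1] * 667 + [0] * 333)
male_idx = np.flatnonzero(y == 1)
female_idx = np.flatnonzero(y == 0)

# Simple oversampling: select each female example twice.
over_idx = np.concatenate([male_idx, np.repeat(female_idx, 2)])
print(len(over_idx), (y[over_idx] == 0).mean())  # 1333 samples, ~50% female

# Simple undersampling: keep a random male subset the size of the female class.
kept_male = rng.choice(male_idx, size=len(female_idx), replace=False)
under_idx = np.concatenate([kept_male, female_idx])
print(len(under_idx), (y[under_idx] == 0).mean())  # 666 samples, 50% female
```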

There are also more complex oversampling techniques, including the creation of artificial data points, as described below.


Oversampling techniques for classification problems

SMOTE

There are a number of methods available to oversample a dataset used in a typical classification problem (using a classification algorithm to classify a set of images, given a labelled training set of images). The most common technique is known as SMOTE: Synthetic Minority Over-sampling Technique. To illustrate how this technique works, consider some training data which has s samples and f features in its feature space; for simplicity, assume the features are continuous. As an example, consider a dataset of birds. The feature space for the minority class we want to oversample could be beak length, wingspan, and weight (all continuous). To oversample, take a sample from the dataset and consider its k nearest neighbors (in feature space). To create a synthetic data point, take the vector between the current data point and one of those k neighbors, multiply this vector by a random number x which lies between 0 and 1, and add it to the current data point.
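The interpolation step can be sketched in a few lines of NumPy; scikit-learn's NearestNeighbors is assumed here for the neighbour search, and the toy "bird" features, k = 5, and all names are illustrative rather than part of the original SMOTE specification.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)

# Toy minority-class samples: columns are beak length, wingspan, weight.
X_min = rng.normal(loc=[4.0, 60.0, 1.2], scale=[0.5, 5.0, 0.2], size=(20, 3))

k = 5
# k + 1 neighbours because each point is its own nearest neighbour.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, neighbor_idx = nn.kneighbors(X_min)

def smote_point(i):
    """Synthesize a point between sample i and a random one of its k neighbours."""
    j = rng.choice(neighbor_idx[i, 1:])   # skip self at column 0
    x = rng.random()                      # random number between 0 and 1
    return X_min[i] + x * (X_min[j] - X_min[i])

synthetic = np.array([smote_point(i) for i in range(len(X_min))])
print(synthetic.shape)  # (20, 3): one synthetic sample per original point
```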

ADASYN

The adaptive synthetic sampling approach, or ADASYN algorithm, builds on the methodology of SMOTE by shifting the importance of the classification boundary toward those minority examples that are difficult to learn. ADASYN uses a weighted distribution over the minority class examples according to their level of difficulty in learning: more synthetic data is generated for minority class examples that are harder to learn.
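Under the usual formulation of ADASYN, a minority sample's difficulty is the fraction of majority-class points among its k nearest neighbours, and each sample receives a share of the synthetic budget proportional to that fraction; the sketch below computes those shares, with the function name and k = 5 as illustrative assumptions. The synthetic points themselves are then generated as in SMOTE.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_weights(X, y, minority_label, k=5):
    """Share of the synthetic budget assigned to each minority sample."""
    X_min = X[y == minority_label]
    # k + 1 neighbours over the full dataset; column 0 is the point itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)
    # r_i: fraction of majority-class points among the k neighbours of sample i;
    # minority samples surrounded by the majority class count as harder to learn.
    r = np.array([(y[row[1:]] != minority_label).mean() for row in idx])
    return r / r.sum()  # assumes some minority sample has a majority neighbour
```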


Undersampling techniques for classification problems

Cluster centroids

Cluster centroids is a method that replaces clusters of samples with the cluster centroids of a K-means algorithm, where the number of clusters is set by the level of undersampling.
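A minimal sketch with scikit-learn's KMeans, where an oversized majority class is replaced by 50 centroids (the names, sizes, and cluster count are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_centroid_undersample(X_majority, n_keep, seed=0):
    """Replace the majority class with the centroids of n_keep k-means clusters."""
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=seed).fit(X_majority)
    return km.cluster_centers_

rng = np.random.default_rng(0)
X_maj = rng.normal(size=(1000, 2))                       # oversized majority class
X_maj_reduced = cluster_centroid_undersample(X_maj, 50)  # undersampled to 50 points
print(X_maj_reduced.shape)                               # (50, 2)
```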

Tomek links

Tomek links remove unwanted overlap between classes: majority-class instances are removed until all minimally distanced nearest-neighbor pairs are of the same class. A Tomek link is defined as follows: given an instance pair (x_i, x_j), where x_i ∈ S_min, x_j ∈ S_max, and d(x_i, x_j) is the distance between x_i and x_j, the pair (x_i, x_j) is called a Tomek link if there is no instance x_k such that d(x_i, x_k) < d(x_i, x_j) or d(x_j, x_k) < d(x_i, x_j). In this way, if two instances form a Tomek link then either one of these instances is noise or both are near a border. Thus, one can use Tomek links to clean up overlap between classes. By removing overlapping examples, one can establish well-defined clusters in the training set and lead to improved classification performance.
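For two distinct points, the definition above is equivalent to the pair being mutual nearest neighbours with different labels, which gives a simple detection routine; the sketch below returns the majority-side members of each link (one common cleaning rule), and all names in it are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_link_majority_indices(X, y, majority_label=1):
    """Indices of majority-class samples that participate in a Tomek link."""
    # Nearest neighbour of every point other than itself (column 0 is self).
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)
    nearest = idx[:, 1]

    drop = set()
    for i, j in enumerate(nearest):
        # Tomek link: mutual nearest neighbours belonging to different classes;
        # collect the majority-class member of each such pair.
        if nearest[j] == i and y[i] != y[j] and y[i] == majority_label:
            drop.add(i)
    return sorted(drop)
```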

Undersampling with ensemble learning

A recent study shows that combining undersampling with ensemble learning can achieve better results; see IFME: information filtering by multiple examples with under-sampling in a digital library environment.



Additional techniques

It's possible to combine oversampling and undersampling techniques into a hybrid strategy. Common examples include SMOTE combined with Tomek links, or SMOTE combined with Edited Nearest Neighbors (ENN). Additional ways of learning from imbalanced datasets include weighting training instances, introducing different misclassification costs for positive and negative examples, and bootstrapping.
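For instance, imbalanced-learn (see Implementations below) ships the SMOTE-plus-Tomek hybrid directly; a usage sketch on a synthetic 10:1 dataset follows, where the class ratio and seeds are arbitrary choices.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

# Synthetic dataset with roughly a 10:1 class imbalance.
X, y = make_classification(n_samples=1100, weights=[10 / 11], random_state=0)
print(Counter(y))  # about 1000 majority vs. 100 minority samples

# SMOTE oversampling followed by Tomek-link cleaning.
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # approximately balanced
```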



Implementations

A variety of data re-sampling techniques are implemented in the imbalanced-learn package, which is compatible with Python's scikit-learn interface. The re-sampling techniques fall into four categories: undersampling the majority class, oversampling the minority class, combining over- and under-sampling, and ensemble sampling.
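A usage sketch of the package's shared fit_resample interface, with samplers from the under- and over-sampling categories (exact class availability can vary across imbalanced-learn versions):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler, TomekLinks

X, y = make_classification(n_samples=1100, weights=[10 / 11], random_state=0)

# Every sampler exposes the same fit_resample(X, y) entry point.
for sampler in (RandomOverSampler(random_state=0),
                SMOTE(random_state=0),
                RandomUnderSampler(random_state=0),
                TomekLinks()):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))
```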



See also

  • Sampling (statistics)


References

  • Chawla, Nitesh V. (2010). "Data Mining for Imbalanced Datasets: An Overview". In: Maimon, Oded; Rokach, Lior (eds.), Data Mining and Knowledge Discovery Handbook, Springer, pp. 875-886. doi:10.1007/978-0-387-09823-4_45. ISBN 978-0-387-09823-4.
  • Lemaître, G.; Nogueira, F.; Aridas, C. K. (2017). "Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning". Journal of Machine Learning Research, vol. 18, no. 17, pp. 1-5.
  • Rahman, M. M.; Davis, D. N. (2013). "Addressing the Class Imbalance Problem in Medical Datasets". International Journal of Machine Learning and Computing, vol. 3, no. 2, pp. 224-228.
