artificial intelligence

Data Cleansing & Data Transformation

psanchezcri

10/11/2016


“Machine Learning” and “Data Science” are trending concepts. There are competitions and websites (such as Kaggle) where data scientists can analyze large datasets to solve real problems using Machine Learning techniques. These techniques are applied to huge amounts of information to learn the relationships between its features.

Machine Learning algorithms use all the values in the dataset. If we have a “dirty” dataset with a lot of mistakes and issues, these algorithms will learn poorly. It is necessary to fix the issues first and then apply the Machine Learning algorithm.

Data Cleansing

 

What kind of issues affect the quality of the data?

  • Invalid values: Some features only admit a known set of values, e.g. gender must be either “F” (Female) or “M” (Male). In this case it is easy to detect wrong values.
  • Formats: This is the most common issue. It is possible to get the same value in different formats, like a name written as “Name, Surname” or as “Surname, Name”.
  • Attribute dependencies: The value of one feature depends on the value of another feature. For example, in a school dataset the “number of students” depends on whether the person “is a teacher”; if someone is not a teacher, he/she can’t have any students.
  • Uniqueness: It is possible to find repeated data in features that only allow unique values. For example, we can’t have two products with the same identifier.
  • Missing values: Some features in the dataset may have blank or null values.
  • Misspellings: Incorrectly written values.
  • Misfielded values: When a feature contains the values of another one.


How can I detect and fix these issues?

There are a great many methods that you can use to find these issues. For instance:

  • Visualization: Visualizing all the values of each feature, or a random sample of them, to check whether they look right.
  • Outlier analysis: Analyzing whether a value could be a human error, e.g. a 300-year-old person in the “age” feature.
  • Validation code: It is possible to write code that checks whether the data is valid; for example, for uniqueness, checking that the length of a feature equals the length of its vector of unique values (see the sketch below).
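As a rough illustration, the uniqueness, invalid-value and outlier checks could look like this in Python with pandas (the dataset and column names are invented for the example):

```python
import pandas as pd

# Hypothetical dataset: the column names and values are made up for this example.
df = pd.DataFrame({
    "product_id": [1, 2, 2, 4],
    "gender":     ["F", "M", "X", "F"],
    "age":        [34, 29, 300, 41],
})

# Uniqueness: the number of values should equal the number of unique values.
has_duplicates = len(df["product_id"]) != df["product_id"].nunique()
print("duplicate identifiers found:", has_duplicates)

# Invalid values: flag anything outside the known set of valid codes.
print(df[~df["gender"].isin(["F", "M"])])

# Outlier analysis: a 300-year-old person is almost certainly a human error.
print(df[df["age"] > 120])
```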

We can apply many methods to fix the different issues:

  • Misspelled data: Replacing incorrect values with the most similar value in the feature.
  • Uniqueness: Replacing one of the repeated values with a value that is not yet present in the feature.
  • Missing data: Handling missing data is a key decision. We can replace null values with the mean, median or mode of the feature (see the sketch after this list).
  • Formats: Using the same number of decimals, the same date format, and so on.
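A minimal sketch of the missing-value and format fixes with pandas (the columns and values are invented; median or mode imputation would work the same way):

```python
import pandas as pd
import numpy as np

# Toy data with a missing price and day/month/year date strings (illustrative only).
df = pd.DataFrame({
    "price": [10.5, np.nan, 12.0, 11.25],
    "date":  ["10/11/2016", "11/11/2016", "12/11/2016", "13/11/2016"],
})

# Missing data: replace nulls with the mean of the feature (median/mode are alternatives).
df["price"] = df["price"].fillna(df["price"].mean())

# Formats: fix the number of decimals and normalize all dates to a single format.
df["price"] = df["price"].round(2)
df["date"] = pd.to_datetime(df["date"], dayfirst=True).dt.strftime("%Y-%m-%d")

print(df)
```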

 

Data Transformation

 

Is it possible to transform the features to gain more information?

There are many methods that add information to the algorithm:

  • Data Binning or Bucketing: A pre-processing technique used to reduce the effects of minor observation errors. The sample is divided into intervals and each value is replaced by a categorical value (its bin), as sketched below.

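For example, with pandas we might bin a made-up series of daily returns into three buckets; the bin edges and labels are arbitrary choices for the illustration:

```python
import pandas as pd

# Illustrative returns; the bin edges and labels are arbitrary choices.
returns = pd.Series([-0.031, -0.002, 0.004, 0.012, 0.027])

# Divide the sample into intervals and replace each value by its bin label.
binned = pd.cut(returns, bins=[-1.0, -0.01, 0.01, 1.0], labels=["down", "flat", "up"])
print(binned.tolist())   # ['down', 'flat', 'flat', 'up', 'up']
```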

  • Indicator variables: This technique converts categorical data into boolean values by creating indicator (dummy) variables. If a feature has more than two values (n), we have to create n-1 columns, as in the sketch below.

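A possible sketch with pandas, where drop_first=True keeps the n-1 indicator columns mentioned above (the feature and its values are invented):

```python
import pandas as pd

# Hypothetical categorical feature with n = 3 distinct values.
df = pd.DataFrame({"sector": ["bank", "tech", "energy", "tech"]})

# One indicator column per value, dropping the first so that n - 1 columns remain.
dummies = pd.get_dummies(df["sector"], prefix="sector", drop_first=True)
print(dummies)
```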

  • Centering & Scaling: We can center the data of one feature by substracting the mean to all values. If we want to scale the data we should divide the centered feature by the standard deviation:

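In other words, each value x becomes (x - mean) / standard deviation. A minimal sketch with pandas:

```python
import pandas as pd

# Toy feature; in practice this would be one column of the cleansed dataset.
x = pd.Series([2.0, 4.0, 6.0, 8.0])

centered = x - x.mean()        # centering: subtract the mean
scaled = centered / x.std()    # scaling: divide by the standard deviation

print(scaled.mean(), scaled.std())   # approximately 0 and 1
```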

  • Other techniques: There are more ways to obtain extra information. For example, we can group the outliers under the same value, or replace each value with the number of times it appears in the feature (see the sketch below).

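As a rough sketch of both ideas (the series and the cap threshold are invented for the example):

```python
import pandas as pd

# Group the outliers under the same value, e.g. cap ages above an (illustrative) threshold.
ages = pd.Series([25, 31, 44, 300, 28])
ages_capped = ages.clip(upper=100)

# Replace each value with the number of times it appears in the feature (frequency encoding).
s = pd.Series(["A", "B", "A", "C", "A", "B"])
counts = s.map(s.value_counts())

print(ages_capped.tolist())   # [25, 31, 44, 100, 28]
print(counts.tolist())        # [3, 2, 3, 1, 3, 2]
```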

This post is based on a Udemy course that I recommend you take a look at in order to learn more about “Data Science”.
