post list
QuantDare
categories
artificial intelligence

What is the difference between Artificial Intelligence and Machine Learning?

ogonzalez

artificial intelligence

Random forest: many is better than one

xristica

artificial intelligence

Non-parametric Estimation

T. Fuertes

artificial intelligence

Classification trees in MATLAB

xristica

artificial intelligence

Applying Genetic Algorithms to define a Trading System

aparra

artificial intelligence

Graph theory: connections in the market

T. Fuertes

artificial intelligence

Learning with kernels: an introductory approach

ogonzalez

artificial intelligence

SVM versus a monkey. Make your bets.

P. López

artificial intelligence

Clustering: “Two’s company, three’s a crowd”

libesa

artificial intelligence

Euro Stoxx Strategy with Machine Learning

fjrodriguez2

artificial intelligence

Visualizing Fixed Income ETFs with T-SNE

j3

artificial intelligence

Hierarchical clustering, using it to invest

T. Fuertes

artificial intelligence

Markov Switching Regimes say… bear or bullish?

mplanaslasa

artificial intelligence

“K-Means never fails”, they said…

fjrodriguez2

artificial intelligence

What is the difference between Bagging and Boosting?

xristica

artificial intelligence

Outliers: Looking For A Needle In A Haystack

T. Fuertes

artificial intelligence

Machine Learning: A Brief Breakdown

libesa

artificial intelligence

Stock classification with ISOMAP

j3

artificial intelligence

Sir Bayes: all but not naïve!

mplanaslasa

artificial intelligence

Returns clustering with k-Means algorithm

psanchezcri

artificial intelligence

Confusion matrix & MCC statistic

mplanaslasa

artificial intelligence

Reproducing the S&P500 by clustering

fuzzyperson

artificial intelligence

Random forest vs Simple tree

xristica

artificial intelligence

Clasificando el mercado mediante árboles de decisión

xristica

artificial intelligence

Árboles de clasificación en Matlab

xristica

artificial intelligence

Redes Neuronales II

alarije

artificial intelligence

Análisis de Componentes Principales

j3

artificial intelligence

Vecinos cercanos en una serie temporal

xristica

artificial intelligence

Redes Neuronales

alarije

artificial intelligence

Caso Práctico: Multidimensional Scaling

rcobo

artificial intelligence

Data Cleansing & Data Transformation

psanchezcri

10/11/2016

No Comments
Data Cleansing & Data Transformation

“Machine Learning” or “Data Science” are trending concepts. There are different websites (like Kaggle) where Data Scientists can analyse big datasets to resolve some real problems using Machine Learning techniques. These techniques are applied to huge amounts of information, to learn the relationships between its features.

Machine Learning algorithms use all the values of the dataset. If we have a “dirty” dataset with a lot of mistakes and issues, these algorithms will not learn as effectively. It’s therefore neccesary to fix the issues first and then apply the Machine Learning algorithm.

Data Cleansing

 

What kind of issues affect the quality of data?

  • Invalid values: Some datasets have well-known values, e.g. gender must only have “F” (Female) and “M” (Male). In this case it’s easy to detect wrong values.
  • Formats: The most common issue. It’s possible to get values in different formats like a name written as “Name, Surname” or “Surname, Name”.
  • Attribute dependencies: When the value of a feature depends on the value of another feature. For example, if we have some school data, the “number of students” is related to whether the person “is teacher?”. If someone is not a teacher he/she can’t have any students.
  • Uniqueness: It’s possible to find repeated data in features that only allow unique values. For example, we can’t have two products with the same identifier.
  • Missing values: Some features in the dataset may have blank or null values.
  • Misspellings: Incorrectly written values.
  • Misfielded values: When a feature contains the values of another.

Data Cleansing

How can I detect and fix these issues?

There are a great deal of methods that you can use to find these issues. For instance:

  • Visualisation: Visualising all the values of each feature, or taking a random sample to see if it’s right.
  • Outlier analysis: Analysing if data can be a human error. E.g. a 300 year old person in the “age” feature.
  • Validation code: It’s possible to create a code that checks if the data is right. For example, in uniqueness, checking if the length of the data is the same as the length of the vector of unique values.

We can apply many methods to fix the different issues:

  • Misspelled data: Replacing incorrect fields by the most similar value in the feature.
  • Uniqueness: Switching one of the repeated field with another value that is not in the feature.
  • Missing data: Handling missing data is a key decision. We can change null values with the mean, median or mode of the feature.
  • Formats: Having the same number of decimals, the same format in the dates …

 

Data Transformation

 

Is it possible to transform the features to gain more information?

There are many methods that add information to the algorithm:

  • Data Binning or Bucketing: A pre-processing technique used to reduce the effects of minor observation errors. The sample is divided into intervals and replaced by categorical values.

Data binning

  • Indicator variables: This technique converts categorical data into boolean values by creating indicator variables. If we have more than two values (n) we have to create n-1 columns.

Indicator variables

  • Centering & Scaling: We can centre the data of one feature by substracting the mean to all values. To scale the data, we should divide the centered feature by the standard deviation:

Centering and scaling

  • Other techinques: For example, we can group the outliers with the same value or replace the value with the number of times that it appears in the feature:

Data Transformation

Data Transformation

This post is based on a Udemy course that I recommend for learning more about “Data Science”.

Tweet about this on TwitterShare on LinkedInShare on FacebookShare on Google+Email this to someone

add a comment

wpDiscuz