“Machine Learning” and “Data Science” are trending concepts. There are websites (like Kaggle) where Data Scientists can analyse big datasets to solve real problems using Machine Learning techniques. These techniques are applied to huge amounts of information to learn the relationships between its features.
Machine Learning algorithms use all the values in the dataset. If we have a “dirty” dataset with a lot of mistakes and issues, these algorithms will not learn as effectively. It’s therefore necessary to fix the issues first and then apply the Machine Learning algorithm.
What kind of issues affect the quality of data?
- Invalid values: Some features have a well-known set of valid values, e.g. gender must only contain “F” (Female) or “M” (Male). In this case it’s easy to detect wrong values.
- Formats: The most common issue. It’s possible to get values in different formats like a name written as “Name, Surname” or “Surname, Name”.
- Attribute dependencies: When the value of a feature depends on the value of another feature. For example, in school data the “number of students” feature depends on the “is teacher?” feature: if someone is not a teacher, he/she can’t have any students.
- Uniqueness: It’s possible to find repeated data in features that only allow unique values. For example, we can’t have two products with the same identifier.
- Missing values: Some features in the dataset may have blank or null values.
- Misspellings: Incorrectly written values.
- Misfielded values: When a feature contains the values of another feature.
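To make these issue types concrete, here is a small hypothetical “dirty” dataset (all records invented for illustration) where several of the problems above appear at once:

```python
# A hypothetical "dirty" dataset illustrating the issues above.
people = [
    {"id": 1, "name": "Smith, Anna", "gender": "F",     "age": 34},
    {"id": 2, "name": "John Doe",    "gender": "M",     "age": None},  # missing value, different name format
    {"id": 2, "name": "Mary Jones",  "gender": "F",     "age": 28},    # repeated id (uniqueness issue)
    {"id": 3, "name": "Pete Brown",  "gender": "X",     "age": 41},    # invalid gender value
    {"id": 4, "name": "Lisa Whyte",  "gender": "Femal", "age": 300},   # misspelling and outlier
]
```

Each comment marks the issue that a real cleaning pass would have to detect and repair.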
How can I detect and fix these issues?
There are many methods that you can use to find these issues. For instance:
- Visualisation: Visualising all the values of each feature, or taking a random sample, to check whether they are right.
- Outlier analysis: Analysing whether a value is likely to be a human error, e.g. a 300 year old person in the “age” feature.
- Validation code: It’s possible to write code that checks whether the data is valid. For example, for uniqueness, checking whether the length of the feature equals the length of its vector of unique values.
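A minimal sketch of these detection methods in plain Python, using invented sample values (the allowed-value set and age range are assumptions for illustration):

```python
# Hypothetical features containing known problems.
ages = [34, 28, 300, 45]            # outlier: 300
genders = ["F", "M", "X", "F"]      # invalid value: "X"
product_ids = [101, 102, 102, 104]  # duplicate: 102

# Invalid values: compare against the set of allowed values.
invalid = [g for g in genders if g not in {"F", "M"}]

# Outlier analysis: flag values outside a plausible range.
outliers = [a for a in ages if not 0 <= a <= 120]

# Uniqueness: the feature is unique when its length equals
# the length of its vector of unique values.
is_unique = len(product_ids) == len(set(product_ids))

print(invalid)    # ['X']
print(outliers)   # [300]
print(is_unique)  # False
```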
We can apply many methods to fix the different issues:
- Misspelled data: Replacing incorrect values with the most similar valid value in the feature.
- Uniqueness: Replacing one of the repeated values with another value that is not already in the feature.
- Missing data: Handling missing data is a key decision. We can replace null values with the mean, median or mode of the feature.
- Formats: Enforcing a consistent representation, e.g. the same number of decimals or the same date format.
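A sketch of the missing-data fix, replacing nulls with the mean of the known values (the “age” values are invented; median or mode from the `statistics` module work the same way):

```python
from statistics import mean

# Hypothetical "age" feature with missing values (None).
ages = [34, None, 28, 41, None, 34]
known = [a for a in ages if a is not None]

# Replace nulls with the mean of the observed values.
fill = mean(known)                          # 34.25
cleaned = [fill if a is None else a for a in ages]
print(cleaned)  # [34, 34.25, 28, 41, 34.25, 34]
```

Whether the mean, median or mode is the right fill value depends on the feature’s distribution, which is why this is called a key decision.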
Is it possible to transform the features to gain more information?
There are many methods that add information to the algorithm:
- Data Binning or Bucketing: A pre-processing technique used to reduce the effects of minor observation errors. The sample is divided into intervals, and each value is replaced by the interval it falls into (a categorical value).
- Indicator variables: This technique converts categorical data into boolean values by creating indicator variables. If a feature has more than two values (n), we have to create n-1 columns (one value acts as the baseline).
- Centering & Scaling: We can centre the data of a feature by subtracting the mean from all its values. To scale the data, we divide the centred feature by its standard deviation.
- Other techniques: For example, we can group the outliers under a single value, or replace each value with the number of times it appears in the feature.
This post is based on a Udemy course that I recommend for learning more about “Data Science”.