Data science | Data preprocessing using scikit-learn | California Housing Prices dataset
Data preprocessing is an extremely important step in machine learning and deep learning. We cannot just dump raw data into a model and expect it to perform well. Even if we build a complex, well-structured model, its performance is only as good as the data we feed it. Thus, we need to process the raw data to boost the performance of our models.
Scikit-learn is a widely-used machine learning library for Python. It has gained tremendous popularity among data science practitioners thanks to the variety of algorithms and its easy-to-understand syntax. In addition to ready-to-use algorithms, scikit-learn also provides useful functions and methods for data preprocessing.
There are a number of data preprocessing methodologies, but our main focus here is on:
1) Data Encoding
2) Normalization
3) Standardization
4) Imputing Missing Values
5) Discretization
Dataset Description
Here I have used the 'California Housing Prices' dataset. It contains information such as longitude and latitude, ocean proximity, population, number of bedrooms, number of rooms, house price, and so on.
The dataset contains numeric as well as categorical data, has columns on different scales, and contains missing values. So it is a perfect dataset for practicing preprocessing.
Dataset: California Housing Prices dataset
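Before any preprocessing, the dataset needs to be loaded. Here is a minimal sketch using pandas; the file name housing.csv is an assumption, so point it at wherever your copy of the dataset lives.

import pandas as pd

# Load the California Housing Prices dataset
# (the file name is an assumption; adjust the path to your own copy of the CSV)
housing = pd.read_csv("housing.csv")

# Quick look at the columns, data types and missing values
print(housing.head())
housing.info()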
Data Encoding
Encoding is the process of converting data, or a given sequence of characters, symbols, alphabets, etc., into a specified format. Decoding is the reverse process of encoding: extracting the original information from the converted format. In machine learning, an example is representing the gender categories male and female as 1 and 0. There are two common types of data encoding: 1) label encoding and 2) one-hot encoding.
1) Label Encoding
If a column in the dataset has more than one category, we can use a Label Encoder to convert those categories into numerical features. The Label Encoder assigns a unique number to each category.
As you can see, the 'median_house_value' column has 3842 distinct values, which are nothing but house price ranges. After applying the Label Encoder, the data is labeled: the 500001 price range is converted to 3841, the 137500 range to 959, the 162500 range to 1209, and so on.
The classes_ attribute helps us map each numerical label back to its original category (index 0: the 14999 price range, index 1: the 17500 price range, and so on).
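As a minimal sketch (assuming the dataset has been loaded into a DataFrame named housing, as in the loading step above), label encoding with scikit-learn looks like this:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

housing = pd.read_csv("housing.csv")  # assumed file name

le = LabelEncoder()

# fit_transform learns the unique values of the column and maps each one to an integer
housing["median_house_value_label"] = le.fit_transform(housing["median_house_value"])

# classes_ stores the original values; the position of each value is its label,
# so classes_[0] is the category encoded as 0, classes_[1] as 1, and so on
print(le.classes_[:5])
print(housing[["median_house_value", "median_house_value_label"]].head())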
2) One-Hot Encoding
One-hot encoding does the same job in a different way. Where the Label Encoder assigns a particular number to each category, the One Hot Encoder assigns a whole new column to each category. So if a column has 3 categories, the One Hot Encoder will add 3 more columns to your dataset.
Which one to use depends entirely on the dataset and on how the model behaves. The One Hot Encoder increases the dimensionality, but it is usually the safer choice: with label encoding the model may treat the numerical labels as ordered and compare them with each other, which leads to wrong assumptions. That is why one-hot encoding is used more often in the real world, but I advise you to experiment with both.
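Here is a minimal sketch of one-hot encoding the categorical ocean_proximity column with scikit-learn's OneHotEncoder (the DataFrame name housing is again assumed from the loading step):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

housing = pd.read_csv("housing.csv")  # assumed file name

# On scikit-learn versions older than 1.2, use sparse=False instead of sparse_output=False
ohe = OneHotEncoder(sparse_output=False)

# Each category in 'ocean_proximity' becomes its own 0/1 column
encoded = ohe.fit_transform(housing[["ocean_proximity"]])

encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out(["ocean_proximity"]))
print(encoded_df.head())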
Normalization
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling. The formula for normalization is X_norm = (X − X_min) / (X_max − X_min), where X_max and X_min are the maximum and the minimum values of the feature, respectively.
Normalization with MinMaxScaler
For each value in a feature, MinMaxScaler subtracts the minimum value of the feature and then divides by the range. The range is the difference between the original maximum and the original minimum.
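A minimal sketch of min-max normalization with MinMaxScaler; the columns chosen here are just for illustration:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

housing = pd.read_csv("housing.csv")  # assumed file name

scaler = MinMaxScaler()

# After scaling, every value in these columns lies between 0 and 1
cols = ["total_rooms", "population"]
housing[cols] = scaler.fit_transform(housing[cols])
print(housing[cols].describe())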
Standardization
Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation (i.e. standard deviation = 1).
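A minimal sketch of standardization with StandardScaler; after the transform, each selected column has (approximately) zero mean and unit standard deviation:

import pandas as pd
from sklearn.preprocessing import StandardScaler

housing = pd.read_csv("housing.csv")  # assumed file name

scaler = StandardScaler()

# Centre each column around 0 with a standard deviation of 1
cols = ["median_income", "housing_median_age"]
housing[cols] = scaler.fit_transform(housing[cols])
print(housing[cols].mean())
print(housing[cols].std())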
Imputing Missing Values
Missing data are values that are not recorded in a dataset. They can be a single missing value in one cell or an entire missing observation (row). Missing data can occur in both continuous variables (e.g. the height of students) and categorical variables (e.g. the gender of a population).
We can handle missing values in two ways: (1) remove the rows that contain missing values, or (2) fill in the values using some strategy, for example with an imputer.
Simple Imputer
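A minimal sketch of filling missing values with SimpleImputer, assuming the missing values sit in the total_bedrooms column (as in the common version of this dataset) and using the column mean as the fill value:

import pandas as pd
from sklearn.impute import SimpleImputer

housing = pd.read_csv("housing.csv")  # assumed file name

# strategy can be "mean", "median", "most_frequent" or "constant"
imputer = SimpleImputer(strategy="mean")

# Replace missing entries in 'total_bedrooms' with the column mean
housing[["total_bedrooms"]] = imputer.fit_transform(housing[["total_bedrooms"]])
print(housing["total_bedrooms"].isna().sum())  # should now be 0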
Discretization
Data discretization is the process of converting continuous data into discrete buckets by grouping it. By doing this we limit the number of possible states; essentially, we convert numerical features into categorical ones.
There are 3 types of discretization available in scikit-learn: (1) Quantile Discretization Transform, (2) Uniform Discretization Transform, and (3) KMeans Discretization Transform. All three can be applied with scikit-learn's KBinsDiscretizer, as sketched under each heading below.
Quantile Discretization Transform
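A minimal sketch using KBinsDiscretizer with strategy="quantile", which places the bin edges so that each bin holds roughly the same number of observations (the column and number of bins are just examples):

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

housing = pd.read_csv("housing.csv")  # assumed file name

# 10 bins, each containing roughly the same number of rows
kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
housing["income_bin"] = kbins.fit_transform(housing[["median_income"]]).ravel()
print(housing["income_bin"].value_counts())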
Uniform Discretization Transform
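The same sketch with strategy="uniform", where every bin has the same width across the original value range:

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

housing = pd.read_csv("housing.csv")  # assumed file name

# 10 equal-width bins spanning the min-max range of the column
kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform")
housing["income_bin"] = kbins.fit_transform(housing[["median_income"]]).ravel()
print(housing["income_bin"].value_counts().sort_index())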
KMeans Discretization Transform
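And with strategy="kmeans", where the bin edges come from a one-dimensional k-means clustering of the column, so the values in each bin are closest to the same cluster centre:

import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

housing = pd.read_csv("housing.csv")  # assumed file name

# Bin edges are placed by k-means clustering on the column values
kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="kmeans")
housing["income_bin"] = kbins.fit_transform(housing[["median_income"]]).ravel()
print(housing["income_bin"].value_counts().sort_index())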
Source code GitHub