Data Preprocessing with Orange tool

Anjali Bhavsar
Sep 13, 2021

Data Preprocessing

Data preprocessing is the process of transforming raw data into an understandable format. It is an important step in data mining because algorithms rarely work well on raw data. The quality of the data should therefore be checked before applying machine learning or data mining algorithms.

Discretization

Data discretization is a method of translating the values of a continuous attribute into a finite set of intervals with minimal loss of information. It simplifies the data by substituting interval labels for the numeric values. For example, the values of the ‘age’ variable may be replaced by interval labels such as (0–10, 11–20, …) or by conceptual labels (kid, youth, adult, senior). Discretization is supervised when it uses the class data and unsupervised when it does not. Either form can proceed in one of two directions: a ‘top-down splitting strategy’ or a ‘bottom-up merging strategy’.

The original dataset before performing discretization
Data after performing discretization

As the images above show, the continuous values in the original dataset are replaced, after discretization, by a finite set of intervals.
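To make the idea concrete, here is a minimal sketch of one unsupervised strategy, equal-width binning, in plain Python. It does not use Orange and is not necessarily the method used for the images above; the function name `equal_width_bins` and the sample ages are made up for illustration.

```python
def equal_width_bins(values, n_bins):
    """Replace each numeric value with the label of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so one closed interval covers everything.
        return [f"[{lo:g}, {hi:g}]"] * len(values)
    labels = []
    for v in values:
        # Clamp the index so the maximum value lands in the last interval.
        idx = min(int((v - lo) / width), n_bins - 1)
        left = lo + idx * width
        right = left + width
        # Only the last interval is closed on the right.
        closing = "]" if idx == n_bins - 1 else ")"
        labels.append(f"[{left:g}, {right:g}{closing}")
    return labels


ages = [3, 17, 25, 42, 58, 63]  # hypothetical sample, not from the article's dataset
print(equal_width_bins(ages, 3))
```

Each value is swapped for the interval it falls into, so six distinct ages collapse into three labels.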

Continuization

Given a data table, continuization returns a new table in which the discrete attributes are replaced with continuous ones or removed.

  • Binary variables are transformed into 0.0/1.0 or -1.0/1.0 indicator variables, depending on the argument zero_based.
  • Multinomial variables are treated according to the argument multinomial_treatment.
  • Discrete attributes with only one possible value are removed.
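The first and last rules can be sketched in plain Python as follows. This is an illustration under assumptions, not Orange's implementation: the helper names `continuize_binary` and `drop_constant_columns` and the sample table are hypothetical, and only the `zero_based` parameter name is borrowed from the description above.

```python
def continuize_binary(column, categories, zero_based=True):
    """Encode a two-valued column as 0.0/1.0, or as -1.0/1.0 if zero_based is False."""
    if len(categories) != 2:
        raise ValueError("expected exactly two categories")
    low = 0.0 if zero_based else -1.0
    # The first category maps to the low value, the second to 1.0.
    return [low if v == categories[0] else 1.0 for v in column]


def drop_constant_columns(table):
    """Drop columns with a single distinct value.

    Assumption of this sketch: "one possible value" is judged from the
    values actually observed in the column.
    """
    return {name: col for name, col in table.items() if len(set(col)) > 1}


sex = ["male", "female", "female", "male"]  # hypothetical sample
print(continuize_binary(sex, ["female", "male"]))                    # [1.0, 0.0, 0.0, 1.0]
print(continuize_binary(sex, ["female", "male"], zero_based=False))  # [1.0, -1.0, -1.0, 1.0]

table = {"sex": sex, "ship": ["titanic"] * 4}
print(list(drop_constant_columns(table)))  # ['sex']
```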

Continuize_Indicators

The variable is replaced by indicator variables, each corresponding to one value of the original variable. For each value of the original attribute, only the corresponding new attribute will have a value of one and others will be zero. This is the default behaviour.

For example, the “titanic” dataset has a feature “status” with the values “crew”, “first”, “second” and “third”, in that order. Its value for the 10th row is “first”. Continuization replaces the variable with the variables “status=crew”, “status=first”, “status=second” and “status=third”.

For the 10th row, the variable “status=first” has the value 1 and the other three variables have the value 0.
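The indicator treatment can be sketched in a few lines of plain Python. This is an illustration of the idea rather than Orange's code; the function name `indicator_columns` and the four sample rows are made up, while the “status” values come from the example above.

```python
def indicator_columns(name, column, categories):
    """Build one 0.0/1.0 indicator column per category, named 'name=category'."""
    return {
        f"{name}={c}": [1.0 if v == c else 0.0 for v in column]
        for c in categories
    }


status_values = ["crew", "first", "second", "third"]
status = ["third", "crew", "first", "second"]  # hypothetical rows, not the real dataset

encoded = indicator_columns("status", status, status_values)

# Inspect the row whose original value is "first" (index 2 in this sample).
row = {col_name: values[2] for col_name, values in encoded.items()}
print(row)
```

In the printed row, only “status=first” is 1.0 and the other three indicators are 0.0, which matches the behaviour described for the 10th row.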

Normalization

Normalization scales the values of an attribute so that they fall within a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally required when attributes are on different scales; otherwise, an equally important attribute on a smaller scale can lose its influence because another attribute has values on a larger scale. We use the Normalize function to perform normalization.

All values lie in the range 0 to 1 after normalization
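As a rough sketch of what scaling to a range means, here is min-max normalization in plain Python. It is written without Orange, so it shows the arithmetic rather than the library's own code; the function name `min_max_normalize` and the sample values are assumptions for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to new_min.
        return [new_min for _ in values]
    return [
        (v - lo) / (hi - lo) * (new_max - new_min) + new_min
        for v in values
    ]


fares = [10, 20, 30, 50]  # hypothetical sample values
print(min_max_normalize(fares))             # [0.0, 0.25, 0.5, 1.0]
print(min_max_normalize(fares, -1.0, 1.0))  # [-1.0, -0.5, 0.0, 1.0]
```

The same formula covers both ranges mentioned above: only `new_min` and `new_max` change.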

Randomization

With randomization, the preprocessor takes a data table and returns a new table in which the data is shuffled. The Randomize function from the Orange library is used to perform randomization.

The output value is randomized
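The effect of shuffling can be sketched with the standard library alone. This is a plain-Python illustration, not Orange's Randomize; the function name `shuffle_rows`, the `seed` parameter and the sample rows are assumptions made for the example.

```python
import random


def shuffle_rows(rows, seed=None):
    """Return a new list with the rows in random order; the input is left untouched."""
    rng = random.Random(seed)  # a fixed seed makes the shuffle repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


rows = [("crew", 0), ("first", 1), ("second", 1), ("third", 0)]  # hypothetical rows
print(shuffle_rows(rows, seed=42))
```

The shuffled table holds exactly the same rows as the original, only in a different order, and passing the same seed reproduces the same order.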
