Data normalization using scikit-learn

Normalization is needed when the features of a dataset are on different scales. It makes model training less sensitive to the scale of the features. This is also called feature scaling.

The `preprocessing` package in scikit-learn provides the `normalize` function. Here is how to apply L1 and L2 normalization with it.

L1 normalization

L1 normalization transforms the data such that the sum of absolute values for each sample (row in a dataframe) is equal to 1.

For example, let’s create a small pandas dataframe.

import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'A' : [-10,-8,-6,-4,-2, 0],
                   'B': [0,2,4,6,8,10]
                  })
df

preprocessing.normalize(df, norm='l1')

array([[-1. ,  0. ],
       [-0.8,  0.2],
       [-0.6,  0.4],
       [-0.4,  0.6],
       [-0.2,  0.8],
       [ 0. ,  1. ]])

We can see above that for each row, the sum of absolute values is equal to 1.
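The claim about row sums can also be checked numerically. The following is a minimal sketch, assuming NumPy is available alongside pandas and scikit-learn; the names `normalized`, `row_sums`, `values`, and `manual` are illustrative and not part of the original example. It also reproduces the transformation by hand, dividing each row by the sum of its absolute values.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'A': [-10, -8, -6, -4, -2, 0],
                   'B': [0, 2, 4, 6, 8, 10]})

normalized = preprocessing.normalize(df, norm='l1')

# Sum of absolute values in each row (sample); every entry should be close to 1
row_sums = np.abs(normalized).sum(axis=1)
print(row_sums)

# Manual equivalent: divide each row by its L1 norm.
# This assumes no row is all zeros, which holds for this dataframe.
values = df.to_numpy(dtype=float)
manual = values / np.abs(values).sum(axis=1, keepdims=True)
print(np.allclose(manual, normalized))
```

The manual version is only meant to show what the `l1` option computes; in practice `normalize` is the simpler choice.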

The `normalize` function returns a NumPy array. However, we generally want a pandas dataframe with the same column names as the original dataframe. This is how we do it.

df_l1 = pd.DataFrame(preprocessing.normalize(df, norm='l1'), columns=df.columns)
df_l1

     A    B
0 -1.0  0.0
1 -0.8  0.2
2 -0.6  0.4
3 -0.4  0.6
4 -0.2  0.8
5  0.0  1.0
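One detail worth noting: wrapping the array with only `columns=df.columns` gives the result a fresh default index. That is harmless here because the dataframe already has the default index, but it would drop custom row labels. A possible sketch of a small wrapper that keeps both columns and index is shown below; `normalize_df` is a hypothetical helper name, not a scikit-learn function, and the letter index is made up for illustration.

```python
import pandas as pd
from sklearn import preprocessing


def normalize_df(frame, norm='l2', axis=1):
    """Hypothetical helper: normalize a dataframe and keep its labels."""
    values = preprocessing.normalize(frame, norm=norm, axis=axis)
    # Reuse both the column names and the row index of the input
    return pd.DataFrame(values, columns=frame.columns, index=frame.index)


# Same data as above, but with an illustrative custom index
df = pd.DataFrame({'A': [-10, -8, -6, -4, -2, 0],
                   'B': [0, 2, 4, 6, 8, 10]},
                  index=list('abcdef'))

df_l1 = normalize_df(df, norm='l1')
print(df_l1)
```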

L2 normalization

The `normalize` function performs L2 normalization by default, i.e., when no value is passed to its `norm` argument.

L2 is the Euclidean norm. The transformation rescales the data so that the sum of squares of the values in each sample (row in a dataframe) is equal to 1.

We will print out the dataframe first.

df

Then we normalize it. We pass the `norm` argument explicitly for clarity.

df_l2 = pd.DataFrame(preprocessing.normalize(df, norm='l2'), columns=df.columns)
df_l2

          A         B
0 -1.000000  0.000000
1 -0.970143  0.242536
2 -0.832050  0.554700
3 -0.554700  0.832050
4 -0.242536  0.970143
5  0.000000  1.000000
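As a quick check of the L2 result, each row should have a Euclidean length of 1. The sketch below, which assumes NumPy is available, verifies that, confirms that omitting `norm` gives the same result as `norm='l2'`, and reproduces the transformation by hand. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'A': [-10, -8, -6, -4, -2, 0],
                   'B': [0, 2, 4, 6, 8, 10]})

l2_values = preprocessing.normalize(df, norm='l2')

# Euclidean length of each row (sample); every entry should be close to 1
row_lengths = np.linalg.norm(l2_values, axis=1)
print(row_lengths)

# Leaving out `norm` falls back to the default, which is 'l2'
default_values = preprocessing.normalize(df)
print(np.allclose(default_values, l2_values))

# Manual equivalent: divide each row by its Euclidean norm.
# This assumes no row is all zeros, which holds for this dataframe.
values = df.to_numpy(dtype=float)
manual = values / np.linalg.norm(values, axis=1, keepdims=True)
print(np.allclose(manual, l2_values))
```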

The `axis` argument of `normalize` defaults to 1. With `axis=0`, the transformation acts on columns (features) instead of rows (samples).

df_l2 = pd.DataFrame(preprocessing.normalize(df, norm='l2', axis=0), columns=df.columns)
df_l2

         A        B
0 -0.67420  0.00000
1 -0.53936  0.13484
2 -0.40452  0.26968
3 -0.26968  0.40452
4 -0.13484  0.53936
5  0.00000  0.67420
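To see the effect of `axis=0` more concretely, the following sketch (assuming NumPy is available; variable names are illustrative) checks that each column now has a Euclidean length of 1, while the rows generally do not.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'A': [-10, -8, -6, -4, -2, 0],
                   'B': [0, 2, 4, 6, 8, 10]})

col_values = preprocessing.normalize(df, norm='l2', axis=0)

# Euclidean length of each column (feature); both should be close to 1
col_lengths = np.linalg.norm(col_values, axis=0)
print(col_lengths)

# Row lengths are no longer normalized when axis=0 is used
row_lengths = np.linalg.norm(col_values, axis=1)
print(row_lengths)
```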

In short, the code for normalization looks like this.

import pandas as pd
from sklearn import preprocessing
df = pd.DataFrame({'A' : [-10,-8,-6,-4,-2, 0],
                   'B': [0,2,4,6,8,10]
                  })
                  
df_l2 = pd.DataFrame(preprocessing.normalize(df, norm='l2'),
                     columns=df.columns)

Analytics Notes