Tag: preprocessing

  • Data regularization using scikit-learn

    The `preprocessing` package in scikit-learn provides a `normalize` function. Here is how to use it for L1 and L2 normalization.

    L1 normalization

    L1 normalization transforms the data such that the sum of absolute values for each sample (row in a dataframe) is equal to 1.

    For example, let’s create a small pandas dataframe.

    import pandas as pd
    from sklearn import preprocessing
    df = pd.DataFrame({'A' : [-10,-8,-6,-4,-2, 0],
                       'B': [0,2,4,6,8,10]
                      })
    df
        A   B
    0 -10   0
    1  -8   2
    2  -6   4
    3  -4   6
    4  -2   8
    5   0  10
    preprocessing.normalize(df, norm='l1')
    array([[-1. ,  0. ],
           [-0.8,  0.2],
           [-0.6,  0.4],
           [-0.4,  0.6],
           [-0.2,  0.8],
           [ 0. ,  1. ]])

    We can see above that for each row, the sum of absolute values is equal to 1.
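
    The row-sum property can also be checked programmatically. The following is a minimal sketch, assuming NumPy is available alongside pandas and scikit-learn; the names `row_sums` and `manual` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'A': [-10, -8, -6, -4, -2, 0],
                   'B': [0, 2, 4, 6, 8, 10]})

l1 = preprocessing.normalize(df, norm='l1')

# Sum of absolute values along each row (axis=1); every entry should be 1
row_sums = np.abs(l1).sum(axis=1)

# Manual equivalent: divide each row by the sum of its absolute values
X = df.to_numpy()
manual = X / np.abs(X).sum(axis=1, keepdims=True)

print(np.allclose(row_sums, 1.0))
print(np.allclose(l1, manual))
```

    The manual version works here because no row is entirely zero, so no division by zero occurs.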

    The `normalize` function returns a NumPy array. However, we generally want a pandas dataframe with the same column names as the original dataframe. This is how we do it.

    df_l1 = pd.DataFrame(preprocessing.normalize(df, norm='l1'), columns=df.columns)
    df_l1
         A    B
    0 -1.0  0.0
    1 -0.8  0.2
    2 -0.6  0.4
    3 -0.4  0.6
    4 -0.2  0.8
    5  0.0  1.0
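
    Wrapping the array this way keeps the column names but resets the row index to 0, 1, 2, and so on. If the original dataframe has a meaningful index, it can be passed along as well. The sketch below assumes a hypothetical dataframe `df_idx` with string row labels, which is not part of the example above.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical dataframe with a non-default index (labels are illustrative)
df_idx = pd.DataFrame({'A': [-10, -8, -6],
                       'B': [0, 2, 4]},
                      index=['x', 'y', 'z'])

# Pass both columns and index so the result lines up with the original
df_l1 = pd.DataFrame(preprocessing.normalize(df_idx, norm='l1'),
                     columns=df_idx.columns,
                     index=df_idx.index)

print(df_l1)
```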

    L2 normalization

    The `normalize` function performs L2 normalization by default, i.e., when we do not pass any value for its `norm` argument.

    L2 is the Euclidean norm. It transforms the data such that the sum of squares of the values in each sample (row in a dataframe) is equal to 1.

    We will print out the dataframe first.

    df
        A   B
    0 -10   0
    1  -8   2
    2  -6   4
    3  -4   6
    4  -2   8
    5   0  10

    Then we normalize it, passing the `norm` argument explicitly for clarity.

    df_l2 = pd.DataFrame(preprocessing.normalize(df, norm='l2'), columns=df.columns)
    df_l2
              A         B
    0 -1.000000  0.000000
    1 -0.970143  0.242536
    2 -0.832050  0.554700
    3 -0.554700  0.832050
    4 -0.242536  0.970143
    5  0.000000  1.000000
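
    As a quick check of the two claims above (that L2 is the default, and that each row ends up with unit Euclidean length), here is a minimal sketch assuming NumPy is available. The names `default`, `explicit`, `row_norms`, and `manual` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'A': [-10, -8, -6, -4, -2, 0],
                   'B': [0, 2, 4, 6, 8, 10]})

default = preprocessing.normalize(df)              # no norm argument given
explicit = preprocessing.normalize(df, norm='l2')  # norm spelled out

# Euclidean length of each row after normalization
row_norms = np.linalg.norm(default, axis=1)

# Manual equivalent: divide each row by the square root of its sum of squares
X = df.to_numpy()
manual = X / np.sqrt((X ** 2).sum(axis=1, keepdims=True))

print(np.allclose(default, explicit))
print(np.allclose(row_norms, 1.0))
print(np.allclose(default, manual))
```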

    The `axis` argument of the `normalize` function defaults to 1. With axis=0, the transformation acts on columns (features) instead of rows (samples).

    df_l2 = pd.DataFrame(preprocessing.normalize(df, norm='l2', axis=0), columns=df.columns)
    df_l2
             A        B
    0 -0.67420  0.00000
    1 -0.53936  0.13484
    2 -0.40452  0.26968
    3 -0.26968  0.40452
    4 -0.13484  0.53936
    5  0.00000  0.67420
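
    With axis=0 it is each column, not each row, that ends up with a sum of squares equal to 1. A minimal sketch of that check, assuming NumPy is available; `col_l2`, `sum_sq`, and `manual` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'A': [-10, -8, -6, -4, -2, 0],
                   'B': [0, 2, 4, 6, 8, 10]})

col_l2 = preprocessing.normalize(df, norm='l2', axis=0)

# Sum of squares down each column (axis=0); one value per feature
sum_sq = (col_l2 ** 2).sum(axis=0)

# Manual equivalent: divide each column by its Euclidean norm
X = df.to_numpy()
manual = X / np.linalg.norm(X, axis=0)

print(np.allclose(sum_sq, 1.0))
print(np.allclose(col_l2, manual))
```

    Note that the rows of this result no longer have unit length; only the columns do.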

    In short, the code for normalization would look like this.

    import pandas as pd
    from sklearn import preprocessing
    df = pd.DataFrame({'A' : [-10,-8,-6,-4,-2, 0],
                       'B': [0,2,4,6,8,10]
                      })
                      
    df_l2 = pd.DataFrame(preprocessing.normalize(df, norm='l2'),
                         columns=df.columns)
