Author: trunild@gmail.com

  • Loops in Python: For Loop

    Loops are used when we want to repeat a set of commands several times (iterations). For example, when you want to apply a set of operations to the elements of a list, one element at a time, you can use a Python loop.

    There are two types of loops:

    • For loop
    • While loop

    For loop: 

    The for loop is used when we want to run a set of commands a known or defined number of times. A few of the ways in which we can use the for loop are as follows:

    Python
    for i in range(10):
        print(i)

    Here, the range(10) function creates numbers in sequence starting from 0 and ending at 9. The value of i will thus start at 0 and increase by 1 in each iteration, stopping at 9.

    The general format of the range function is:

    Python
    range(start, stop, step)

    By default, start is 0 and step is 1.
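
    For example, here is a small illustrative loop that counts from 2 up to, but not including, 10 in steps of 2:

    Python
    for i in range(2, 10, 2):
        print(i)
    2
    4
    6
    8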

    Another way of running a for loop is to iterate directly over a list:

    Python
    mylist = ["a", "b", "c"]
    for i in mylist:
        print(i)
    a
    b
    c

    Here, the for loop iterates over each element in the list mylist. In the first iteration the value of i is “a”; in the second iteration i has the value “b”; and so on.

    The enumerate function allows us to iterate over the elements of a list while simultaneously yielding the index of the current element in each iteration.

    Python
    mylist = ["a", "b", "c"]
    for i, element in enumerate(mylist):
        print(f'index: {i}, element: {element}')
    index: 0, element: a
    index: 1, element: b
    index: 2, element: c

    This is useful when you need to access the elements of other lists at the same index inside the loop.
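
    As a small illustration (with made-up lists letters and numbers), the index from enumerate can be used to pick the corresponding element of a second list:

    Python
    letters = ["a", "b", "c"]
    numbers = [10, 20, 30]
    for i, letter in enumerate(letters):
        print(f'{letter} -> {numbers[i]}')
    a -> 10
    b -> 20
    c -> 30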

  • List in Python

    Lists in Python are collections of objects. They can include any Python objects, such as numbers, strings, other lists, and dictionaries. A list is defined by square brackets as follows:

    Python
    my_list = []

    The above example shows an empty list. Following is an example of a list with numbers:

    Python
    my_list = [1,2,4,65,43]
    my_list
    [1, 2, 4, 65, 43]

    Lists can also contain strings.

    Python
    list_of_str = ['abc', 'goku', 'vada_pav', 'chicken_rice']
    list_of_str
    ['abc', 'goku', 'vada_pav', 'chicken_rice']

    A single list can also contain different types of objects. Following is an example of a list containing a number, a string, another list, and a dictionary.

    Python
    list_mix = [3, 4, 'burger', ['Chicago', 23, 'Nile'], {'city_names': ['Delhi', 'London'], 'food_items': ['idli', 'mala-hotpot', 'rojak']}]
    list_mix
    [3,
     4,
     'burger',
     ['Chicago', 23, 'Nile'],
     {'city_names': ['Delhi', 'London'],
      'food_items': ['idli', 'mala-hotpot', 'rojak']}]

    When a command line becomes too long, we can break the line after a comma in the list.

    Python
    list_mix = [3,
                4,
                'burger',
                ['Chicago', 23, 'Nile'],
                {'city_names': ['Delhi', 'London'],
                 'food_items': ['idli', 'mala-hotpot', 'rojak']}]
    list_mix
    [3,
     4,
     'burger',
     ['Chicago', 23, 'Nile'],
     {'city_names': ['Delhi', 'London'],
      'food_items': ['idli', 'mala-hotpot', 'rojak']}]

    List indexing

    List indices start at zero, as do all indices in Python. Following is the general format for indexing a list.

    my_list[from_index:to_index:step]

    The to_index is not inclusive. It means that the list item at to_index will not be included in the result of the indexing. We will see this with an example of a list of the numbers 0 to 19.

    Python
    a = list(range(20))
    a
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
     
    Python
    a[2:18:3]
    [2, 5, 8, 11, 14, 17]

    Here, we sliced the list from index 2 up to (but not including) index 18 with a step size of 3. The default step size is 1.

    We will now see different ways of indexing/slicing a list.

    From a defined index (e.g., 5) to the end of the list.

    Python
    a[5::]
    [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

    From the start of the list to a defined index (e.g., 15), including the item at that index.

    Python
    a[:15+1]
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
    Python
    a[:16]
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

    The last element of a list can be indexed as -1, the second-to-last element as -2, and so on. Following are a few examples of indexing a list using negative numbers.

    Python
    a[-1]
    19
    Python
    a[-10:-1]
    [10, 11, 12, 13, 14, 15, 16, 17, 18]
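
    The step can also be negative, in which case the list is walked backwards. For example (an illustrative extension of the slicing format above), a[::-1] returns a reversed copy of the whole list:

    Python
    a[::-1]
    [19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]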

  • Join two or more arrays using `numpy.concatenate`


    np.concatenate is used for concatenating NumPy arrays.
    We will discuss here some of its functionality.
    It takes a sequence of two or more arrays as its input argument, along with some keyword arguments, one of which we will cover here.

    The simplest usage of np.concatenate is as follows:

    Python
    import numpy as np
    
    # make two arrays
    a = np.array([1,2,3])
    b = np.array([4,5,6])
    
    # concatenate
    c = np.concatenate([a,b], axis=0)
    
    # print all the arrays and their shapes
    print('\n\na:\n', a)
    print('\n\nb:\n', b)
    print('\n\nc:\n', c)
    
    print('\nshape of a = ', a.shape)
    print('\nshape of b = ', b.shape)
    print('\nshape of c = ', c.shape)
    a:
     [1 2 3]
    
    b:
     [4 5 6]
    
    c:
     [1 2 3 4 5 6]
    
    shape of a =  (3,)
    
    shape of b =  (3,)
    
    shape of c =  (6,)
    

    Both of the inputs in the above code are one-dimensional; hence, they have just one axis. np.concatenate has a keyword argument, axis, whose default value is 0. In the above code, we have written it explicitly for clarity.

    Let’s generate similar arrays with 2 dimensions. For this we need to provide the list of numbers inside another list to the np.array function.

    Python
    a = np.array([[1,2,3]])
    b = np.array([[4,5,6]])
    
    print('\n\nshape of a = ', a.shape)
    print('\n\nshape of b = ', b.shape)
    shape of a =  (1, 3)
    shape of b =  (1, 3)

    We can now see that the shape of the arrays a and b is (1, 3), meaning they have 1 row and 3 columns.

    The rows correspond to the 0 (zeroth) axis and the columns correspond
    to the 1 (first) axis.

    Let’s now say we want to stack the two arrays on top of each other. The resultant array should give us two rows and three columns.

    We use the concatenate method as follows:

    Python
    c = np.concatenate([a,b], axis=0)
    
    print('\n\na:\n', a)
    print('\n\nb:\n', b)
    print('\n\nc:\n', c)
    
    print('\n\nshape of c = ', c.shape)
    a:
     [[1 2 3]]
    
    b:
     [[4 5 6]]
    
    c:
     [[1 2 3]
     [4 5 6]]
    
    shape of c =  (2, 3)

    To stack the two arrays side by side to get an array of shape (1, 6), set the keyword argument axis equal to 1.

    Python
    c = np.concatenate([a,b], axis=1)
    
    print('\n\na:\n', a)
    print('\n\nb:\n', b)
    print('\n\nc:\n', c)
    
    print('\n\nshape of c = ', c.shape)
    a:
     [[1 2 3]]
    
    b:
     [[4 5 6]]
    
    c:
     [[1 2 3 4 5 6]]
    
    shape of c =  (1, 6)

    When axis=0, the number of columns in each of the arrays needs to be the same.
    Likewise, when axis=1, the number of rows in each of the arrays needs to be the same.

    You may concatenate as many arrays as you like in one statement, provided their shapes are compatible for concatenation.

    Python
    c = np.concatenate([a,b,a,a,b,a,b], axis=0)
    
    print('\n\na:\n', a)
    print('\n\nb:\n', b)
    print('\n\nc:\n', c)
    
    print('\n\nshape of c = ', c.shape)
    a:
     [[1 2 3]]
    
    b:
     [[4 5 6]]
    
    c:
     [[1 2 3]
     [4 5 6]
     [1 2 3]
     [1 2 3]
     [4 5 6]
     [1 2 3]
     [4 5 6]]
    
    shape of c =  (7, 3)
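
    If the shapes are not compatible, np.concatenate raises a ValueError. A minimal sketch (reusing a from above together with a hypothetical 2-row array):

    Python
    # a has shape (1, 3); a 2-row array cannot be joined to it along axis=1,
    # because the row counts (1 vs 2) do not match
    try:
        np.concatenate([a, np.ones((2, 3))], axis=1)
    except ValueError as err:
        print('concatenation failed:', err)
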
  • Convert class labels to categories using keras


    Class labels can be converted to a one-hot encoded array using keras.utils.to_categorical.
    The resultant array has a number of rows equal to the number of samples, and a number of columns equal to the number of classes.

    Let’s take an example of an array containing labels.
    First we import numpy to create the labels array, and then define the labels array.

    import numpy as np
    labels = np.array([0,0,1,2,1,3,2,1])

    The labels contain four categories.

    np.unique(labels)
    array([0, 1, 2, 3])

    To convert the labels to a one-hot encoded array, execute the following:

    import tensorflow as tf
    labels_encoded = tf.keras.utils.to_categorical(labels)
    print(labels_encoded)
    [[1. 0. 0. 0.]
     [1. 0. 0. 0.]
     [0. 1. 0. 0.]
     [0. 0. 1. 0.]
     [0. 1. 0. 0.]
     [0. 0. 0. 1.]
     [0. 0. 1. 0.]
     [0. 1. 0. 0.]]

    This encoded array can be used for training a multiclass classification model.
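
    The encoding can be reversed with np.argmax, which returns the column index of the 1 in each row (a quick check added here for illustration; it is not part of the original example):

    recovered = np.argmax(labels_encoded, axis=1)
    print(recovered)
    [0 0 1 2 1 3 2 1]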

  • Change elements of an array based on a condition using np.where


    Let’s say we want to convert multiple categorical variables into binary variables by selecting one category as “0” and the rest as “1”.

    Or we may want to change the values of an array based on a condition, as in the ReLU function, where all negative values are converted to zero and the rest stay the same.

    We can do this using the np.where function.

    Let’s take an array of letters from “A” to “E”. We want the letter “C” to be labelled as “0” and the rest of the letters to be labelled as “1”. Following is how we do it:

    import numpy as np
    # create array
    a = np.array(['A', 'B', 'C', 'D', 'E'])
    
    # convert to binary labels
    b = np.where(a == 'C', 0, 1)
    
    print(f'a = {a}')
    print(f'b = {b}')
    a = ['A' 'B' 'C' 'D' 'E']
    b = [1 1 0 1 1]

    Now, let’s say we have a three-dimensional array of numbers of size (20, 4, 4) and we want to emulate the ReLU function, where all negative numbers in the array are set to 0 and the rest remain the same.

    We will generate an array of random numbers from the standard normal distribution for our example and apply the np.where function to do the transformation.

    a = np.random.normal(size=(20, 4, 4))
    b = np.where(a < 0, 0, a)
    
    # print part of the arrays for understanding
    print(a[0])
    print(b[0])
    [[ 1.45872533 -0.24965688 -1.11663205 -0.65852554]
     [-1.13076242 -0.49868332 -0.46350182 -0.02889719]
     [-0.99350298  0.88240974  0.87975654 -0.28836425]
     [-0.10684949 -0.88570172  1.70835701 -0.16105656]]
    
    [[1.45872533 0.         0.         0.        ]
     [0.         0.         0.         0.        ]
     [0.         0.88240974 0.87975654 0.        ]
     [0.         0.         1.70835701 0.        ]]

    It is clear that the negative numbers in the array have been converted to 0.

    We have only used the equality (==) and less-than (<) conditions in the above examples, but we may use any comparison operator as needed. For example, we can take the log2 value of those numbers in an array that are greater than or equal to a certain number.

    a = np.random.normal(size=(20, 4, 4)) * 10
    b = np.where(a >= 5, np.log2(a), a)
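
    One caveat worth noting: np.where evaluates both branches over the whole array, so np.log2 also sees the negative entries and emits RuntimeWarnings (the selected values are still correct). A minimal way to silence them, assuming the same a as above:

    # suppress the warnings from log2 of non-positive entries;
    # np.where still picks the original values at those positions
    with np.errstate(invalid='ignore', divide='ignore'):
        b = np.where(a >= 5, np.log2(a), a)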

  • Histograms are the best way to visualize distribution of data points


    Data distribution plots help visualize how quantitative data points are spread over the range of their values. The distribution of quantitative data can be shown in various ways, such as box plots, violin plots, histograms, and scatter plots with artificially introduced deviations (jitter) to depict the density of the data points. However, I think the relatively simple-looking histogram is the best way we can visualize the data distribution, because it shows exactly how many data points are present in a given range of numbers.

    Box plots are perhaps the worst, because they only depict certain benchmark intervals like the minimum, maximum, 25th percentile, median, and 75th percentile. We can’t really know the distribution of the data points within these percentile ranges.

    The violin plot fares a bit better in giving us an idea of how many data points there are near a value. However, the smoothing of the width of the violin curve gives us a false impression of the presence of data where there isn’t any. In the figure we can see the same data points plotted in various forms. The violin plot’s smoothing effect means the violin never depicts the absence of data in a range.

    The scatter plot with random deviations clearly shows that certain ranges contain no data points, but it fails to show us how many data points there are in a given range of values.

    The histogram solves all these problems. By adjusting the number of bins, we can see exactly how many data points there are in a range of values and thus visualize the exact data distribution.
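
    As a minimal sketch (with synthetic, made-up data, since the original figure is not reproduced here), a bimodal sample illustrates the point: the empty bins between the two clusters are plainly visible in a histogram:

    import numpy as np
    import matplotlib.pyplot as plt
    
    # two well-separated clusters; the gap between them shows up as empty bins
    data = np.concatenate([np.random.normal(0, 1, 500),
                           np.random.normal(8, 1, 500)])
    
    plt.hist(data, bins=40)
    plt.xlabel('value')
    plt.ylabel('count')
    plt.show()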

  • How to save python objects using joblib

    Oftentimes you would want to save Python objects for later use: for example, a dataset you constructed that could be used for several projects, a transformer object with specific parameters you want to apply to different data, or even a machine learning model you trained.

    This is how to do it. First, we will create some dummy data using numpy.

    import numpy as np
    a = np.random.chisquare(45, size=(10000, 50))

    Let’s say you want to transform it using the quantile transformer and save the transformed data for later use. Following is how you would transform it.

    from sklearn.preprocessing import QuantileTransformer
    qt = QuantileTransformer(n_quantiles=1000, random_state=10)
    a_qt = qt.fit_transform(a)

    Save python objects

    To save a_qt using joblib, the dump method is used.

    import joblib
    joblib.dump(a_qt, 'out/a_qt.pckl')

    You can even save the transformer object qt.

    joblib.dump(qt, 'out/qt.pckl')

    Load python objects

    To load the objects we use the joblib method load.

    # load array
    b_qt = joblib.load('out/a_qt.pckl')
    
    # load transformer
    qt2 = joblib.load('out/qt.pckl')

    You can verify that the saved and loaded objects are the same by printing the arrays or the classes of the objects.

    print('Shape of a_qt: ', a_qt.shape)
    print('Shape of b_qt: ', b_qt.shape)
    
    print('Class of qt', type(qt))
    print('Class of qt2', type(qt2))
    Shape of a_qt:  (10000, 50)
    Shape of b_qt:  (10000, 50)
    Class of qt <class 'sklearn.preprocessing._data.QuantileTransformer'>
    Class of qt2 <class 'sklearn.preprocessing._data.QuantileTransformer'>
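
    For a stricter check than comparing shapes, np.array_equal compares the arrays element-wise (a quick sanity check added here; numpy was imported as np above):

    print('Arrays identical: ', np.array_equal(a_qt, b_qt))
    Arrays identical:  True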

  • Data normalization using scikit-learn


    The `preprocessing` package in scikit-learn provides a normalization function. Following is how we can apply the L1 and L2 norms.

    L1 normalization

    L1 normalization transforms the data such that the sum of absolute values for each sample (row in a dataframe) is equal to 1.

    For example, let’s create a small pandas dataframe.

    import pandas as pd
    from sklearn import preprocessing
    df = pd.DataFrame({'A' : [-10,-8,-6,-4,-2, 0],
                       'B': [0,2,4,6,8,10]
                      })
    df
        A   B
    0 -10   0
    1  -8   2
    2  -6   4
    3  -4   6
    4  -2   8
    5   0  10
    preprocessing.normalize(df, norm='l1')
    array([[-1. ,  0. ],
           [-0.8,  0.2],
           [-0.6,  0.4],
           [-0.4,  0.6],
           [-0.2,  0.8],
           [ 0. ,  1. ]])

    We can see above that for each row, the sum of absolute values is equal to 1.
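
    We can confirm this numerically by summing the absolute values along each row (a quick check added for illustration; it assumes numpy is imported as np):

    import numpy as np
    print(np.abs(preprocessing.normalize(df, norm='l1')).sum(axis=1))
    [1. 1. 1. 1. 1. 1.]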

    The normalization function returns a NumPy array. However, generally we would want it to be a pandas dataframe with the same column names as the original dataframe. This is how we do it.

    df_l1 = pd.DataFrame(preprocessing.normalize(df, norm='l1'), columns=df.columns)
    df_l1
         A    B
    0 -1.0  0.0
    1 -0.8  0.2
    2 -0.6  0.4
    3 -0.4  0.6
    4 -0.2  0.8
    5  0.0  1.0

    L2 normalization

    The normalization function does ‘L2’ normalization by default, i.e., when we do not give any value to the norm argument of the normalize function.

    It is the Euclidean norm, which transforms data such that the sum of squares of values in each sample (row in a dataframe) is equal to 1.

    We will print out the dataframe first.

    df
        A   B
    0 -10   0
    1  -8   2
    2  -6   4
    3  -4   6
    4  -2   8
    5   0  10

    And, normalize.

    We will give the `norm` argument to the function for clarity.

    df_l2 = pd.DataFrame(preprocessing.normalize(df, norm='l2'), columns=df.columns)
    df_l2
              A         B
    0 -1.000000  0.000000
    1 -0.970143  0.242536
    2 -0.832050  0.554700
    3 -0.554700  0.832050
    4 -0.242536  0.970143
    5  0.000000  1.000000
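
    As with the L1 case, we can confirm that the sum of squares in each row equals 1 (again a quick check added for illustration):

    print((df_l2 ** 2).sum(axis=1))
    0    1.0
    1    1.0
    2    1.0
    3    1.0
    4    1.0
    5    1.0
    dtype: float64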

    The normalize function has a default `axis` value equal to 1. If axis=0, then the transformation acts on columns (features) instead of rows (samples).

    df_l2 = pd.DataFrame(preprocessing.normalize(df, norm='l2', axis=0), columns=df.columns)
    df_l2
             A        B
    0 -0.67420  0.00000
    1 -0.53936  0.13484
    2 -0.40452  0.26968
    3 -0.26968  0.40452
    4 -0.13484  0.53936
    5  0.00000  0.67420

    In short, the code for normalization would look like this.

    import pandas as pd
    from sklearn import preprocessing
    df = pd.DataFrame({'A' : [-10,-8,-6,-4,-2, 0],
                       'B': [0,2,4,6,8,10]
                      })
                      
    df_l2 = pd.DataFrame(preprocessing.normalize(df, norm='l2'),
                         columns=df.columns)

  • How to plot metabolite concentrations in R?


    The data source

    Let’s say we have the final concentrations of the following metabolites from three strains of bacteria:

    • Glucose
    • Ethanol
    • Succinate
    • Acetate
    • Formate
    • Lactate

    Each strain was grown in triplicate, so we have three values of each metabolite for each strain.

    Load tidyverse

    We will first load the tidyverse library which contains all the functions that we are going to need here.

    library(tidyverse)

    Read data

    The data for this example is stored in an Excel file, which we read first.

    data <- readxl::read_excel('data/fermentation_products.xlsx')
    data
    ## # A tibble: 18 × 4
    ##    Product   Strain_A Strain_B Strain_C
    ##    <chr>        <dbl>    <dbl>    <dbl>
    ##  1 Glucose       1.59     1.73     1.72
    ##  2 Glucose       1.65     1.82     1.68
    ##  3 Glucose       1.45     1.86     1.74
    ##  4 Succinate     0.08     0.07     1.2 
    ##  5 Succinate     0.1      0.05     1.23
    ##  6 Succinate     0.11     0.08     1.3 
    ##  7 Lactate       0.26     0.12     0.15
    ##  8 Lactate       0.28     0.1      0.14
    ##  9 Lactate       0.32     0.09     0.18
    ## 10 Formate       1.29     1.65     2.72
    ## 11 Formate       1.34     1.74     2.71
    ## 12 Formate       1.38     1.87     2.6 
    ## 13 Acetate      10.5     14.3      8.73
    ## 14 Acetate      10.7     15.6      9.47
    ## 15 Acetate       9.4     16.3      6.52
    ## 16 Ethanol      15.7     21.2     30.3 
    ## 17 Ethanol      14.3     19.0     32.2 
    ## 18 Ethanol      13.6     22.8     33.8

    The data, as seen above, has four columns. The first column, “Product”, names the metabolite; each of the other columns contains the measured concentration of that metabolite, in grams per litre, for one strain. Each metabolite occurs three times in the ‘Product’ column, as there are three replicates.

    Rearrange data

    We will rearrange our data so as not to have a separate column for each strain; instead we will have three columns: Product, Strain, and Amount.

    This is called pivoting, which is done by the pivot_longer function. The resulting data will have nine instances of each metabolite in the ‘Product’ column.

    data %>%
      pivot_longer(!Product,
                   names_to='Strain',
                   values_to='Amount')
    ## # A tibble: 54 × 3
    ##    Product   Strain   Amount
    ##    <chr>     <chr>     <dbl>
    ##  1 Glucose   Strain_A   1.59
    ##  2 Glucose   Strain_B   1.73
    ##  3 Glucose   Strain_C   1.72
    ##  4 Glucose   Strain_A   1.65
    ##  5 Glucose   Strain_B   1.82
    ##  6 Glucose   Strain_C   1.68
    ##  7 Glucose   Strain_A   1.45
    ##  8 Glucose   Strain_B   1.86
    ##  9 Glucose   Strain_C   1.74
    ## 10 Succinate Strain_A   0.08
    ## # ℹ 44 more rows

    Plot

    The rearranged data can now be easily plotted as a bar graph. Here is how we combine the rearrangement and plotting statements in one piped block of code.

    data %>%
      pivot_longer(!Product,
                   names_to='Strain',
                   values_to='Amount') %>%
      ggplot(mapping = aes(x=Product,
                           y = Amount,
                           fill=Strain)) +
      geom_bar(stat="identity",
               position = 'dodge')+
      theme_bw()

    The above plot may look all right, but it doesn’t represent our data correctly: the three replicates are drawn on top of one another.

    For data such as ours, we should plot the mean of the three replicates, with the standard deviations of the measurements as error bars. So, we first need to calculate the means and standard deviations in our data.

    Let’s start again by reading the data.

    data <- readxl::read_excel('data/fermentation_products.xlsx')
    data
    ## # A tibble: 18 × 4
    ##    Product   Strain_A Strain_B Strain_C
    ##    <chr>        <dbl>    <dbl>    <dbl>
    ##  1 Glucose       1.59     1.73     1.72
    ##  2 Glucose       1.65     1.82     1.68
    ##  3 Glucose       1.45     1.86     1.74
    ##  4 Succinate     0.08     0.07     1.2 
    ##  5 Succinate     0.1      0.05     1.23
    ##  6 Succinate     0.11     0.08     1.3 
    ##  7 Lactate       0.26     0.12     0.15
    ##  8 Lactate       0.28     0.1      0.14
    ##  9 Lactate       0.32     0.09     0.18
    ## 10 Formate       1.29     1.65     2.72
    ## 11 Formate       1.34     1.74     2.71
    ## 12 Formate       1.38     1.87     2.6 
    ## 13 Acetate      10.5     14.3      8.73
    ## 14 Acetate      10.7     15.6      9.47
    ## 15 Acetate       9.4     16.3      6.52
    ## 16 Ethanol      15.7     21.2     30.3 
    ## 17 Ethanol      14.3     19.0     32.2 
    ## 18 Ethanol      13.6     22.8     33.8

    Get mean values

    We will use two functions, group_by and summarize to get the mean values of the replicate experiments.

    n_data <- data %>%
      group_by(Product) %>%
      summarize(A = mean(Strain_A),
                B = mean(Strain_B),
                C = mean(Strain_C),
                )
    n_data
    ## # A tibble: 6 × 4
    ##   Product         A       B      C
    ##   <chr>       <dbl>   <dbl>  <dbl>
    ## 1 Acetate   10.2    15.4     8.24 
    ## 2 Ethanol   14.5    21.0    32.1  
    ## 3 Formate    1.34    1.75    2.68 
    ## 4 Glucose    1.56    1.80    1.71 
    ## 5 Lactate    0.287   0.103   0.157
    ## 6 Succinate  0.0967  0.0667  1.24

    In the above table, we see that we now have only one instance of each metabolite and each value represents the mean value of the metabolite in a particular strain.

    We also assigned the column names so that they contain only the names of the strains.

    This data is now suitable for pivot_longer, as seen before.

    n_data_long <- n_data %>%
      pivot_longer(!Product,
                   names_to = 'Strain',
                   values_to = 'mean')
      
    n_data_long
    ## # A tibble: 18 × 3
    ##    Product   Strain    mean
    ##    <chr>     <chr>    <dbl>
    ##  1 Acetate   A      10.2   
    ##  2 Acetate   B      15.4   
    ##  3 Acetate   C       8.24  
    ##  4 Ethanol   A      14.5   
    ##  5 Ethanol   B      21.0   
    ##  6 Ethanol   C      32.1   
    ##  7 Formate   A       1.34  
    ##  8 Formate   B       1.75  
    ##  9 Formate   C       2.68  
    ## 10 Glucose   A       1.56  
    ## 11 Glucose   B       1.80  
    ## 12 Glucose   C       1.71  
    ## 13 Lactate   A       0.287 
    ## 14 Lactate   B       0.103 
    ## 15 Lactate   C       0.157 
    ## 16 Succinate A       0.0967
    ## 17 Succinate B       0.0667
    ## 18 Succinate C       1.24

    Get standard deviation

    Now that we have the means, we will calculate the standard deviations in a similar way.

    First get the standard deviation values.

    sds <- data %>%
      group_by(Product) %>%
      summarize(A = sd(Strain_A),
                B = sd(Strain_B),
                C = sd(Strain_C))
    
    sds
    ## # A tibble: 6 × 4
    ##   Product        A      B      C
    ##   <chr>      <dbl>  <dbl>  <dbl>
    ## 1 Acetate   0.690  1.01   1.53  
    ## 2 Ethanol   1.11   1.92   1.73  
    ## 3 Formate   0.0451 0.111  0.0666
    ## 4 Glucose   0.103  0.0666 0.0306
    ## 5 Lactate   0.0306 0.0153 0.0208
    ## 6 Succinate 0.0153 0.0153 0.0513

    And rearrange by pivot_longer.

    sds_long <- sds %>%
      pivot_longer(!Product,
                   names_to = 'Strain',
                   values_to = 'stdev')
    
    sds_long
    ## # A tibble: 18 × 3
    ##    Product   Strain  stdev
    ##    <chr>     <chr>   <dbl>
    ##  1 Acetate   A      0.690 
    ##  2 Acetate   B      1.01  
    ##  3 Acetate   C      1.53  
    ##  4 Ethanol   A      1.11  
    ##  5 Ethanol   B      1.92  
    ##  6 Ethanol   C      1.73  
    ##  7 Formate   A      0.0451
    ##  8 Formate   B      0.111 
    ##  9 Formate   C      0.0666
    ## 10 Glucose   A      0.103 
    ## 11 Glucose   B      0.0666
    ## 12 Glucose   C      0.0306
    ## 13 Lactate   A      0.0306
    ## 14 Lactate   B      0.0153
    ## 15 Lactate   C      0.0208
    ## 16 Succinate A      0.0153
    ## 17 Succinate B      0.0153
    ## 18 Succinate C      0.0513

    Now, we will add the standard deviation values into the data table that contains the mean values.

    n_data_long <- n_data_long %>%
      mutate(stdev = sds_long$stdev)
    
    n_data_long
    ## # A tibble: 18 × 4
    ##    Product   Strain    mean  stdev
    ##    <chr>     <chr>    <dbl>  <dbl>
    ##  1 Acetate   A      10.2    0.690 
    ##  2 Acetate   B      15.4    1.01  
    ##  3 Acetate   C       8.24   1.53  
    ##  4 Ethanol   A      14.5    1.11  
    ##  5 Ethanol   B      21.0    1.92  
    ##  6 Ethanol   C      32.1    1.73  
    ##  7 Formate   A       1.34   0.0451
    ##  8 Formate   B       1.75   0.111 
    ##  9 Formate   C       2.68   0.0666
    ## 10 Glucose   A       1.56   0.103 
    ## 11 Glucose   B       1.80   0.0666
    ## 12 Glucose   C       1.71   0.0306
    ## 13 Lactate   A       0.287  0.0306
    ## 14 Lactate   B       0.103  0.0153
    ## 15 Lactate   C       0.157  0.0208
    ## 16 Succinate A       0.0967 0.0153
    ## 17 Succinate B       0.0667 0.0153
    ## 18 Succinate C       1.24   0.0513

    Plot

    Now this data is suitable for plotting. As above, we will use ggplot with its geom_bar function.

    Plot means

    We will first plot the means and then add the other elements step by step.

    n_data_long %>%
      ggplot(mapping = aes(x = Product,
                           y = mean,
                           fill = Strain)) +
      geom_bar(stat='identity',
               position = 'dodge',
               width = 0.8)

    Plot means and standard deviations

    To the above plot we will add a layer of error bars.

    n_data_long %>%
      ggplot(mapping = aes(x=Product, y=mean, fill=Strain)) +
      geom_bar(stat='identity',
               position = 'dodge',
               width = 0.8) +
      geom_errorbar(mapping = aes(x = Product,
                                  ymin = mean - stdev,
                                  ymax = mean + stdev),
                    width = 0.2,
                    position = position_dodge(width=0.8))

    Theme

    We will use the theme_bw theme.

    n_data_long %>%
      ggplot(mapping = aes(x=Product, y=mean, fill=Strain)) +
      geom_bar(stat='identity',
               position = 'dodge',
               width = 0.8) +
      geom_errorbar(mapping = aes(x = Product,
                                  ymin = mean - stdev,
                                  ymax = mean + stdev),
                    width = 0.2,
                    position = position_dodge(width=0.8)) +
      theme_bw()

    Add axes labels and plot title

    n_data_long %>%
      ggplot(mapping = aes(x=Product, y=mean, fill=Strain)) +
      geom_bar(stat='identity',
               position = 'dodge',
               width = 0.8) +
      geom_errorbar(mapping = aes(x = Product,
                                  ymin = mean - stdev,
                                  ymax = mean + stdev),
                    width = 0.2,
                    position = position_dodge(width=0.8)) +
      theme_bw() +
      labs(title = 'Fermentation product titres',
           y = 'Concentration (g/l)')

    Facet wrap

    The above plot looks good, but some of the products are too low in concentration: we can barely see their bars. The facet_wrap function can be used to separate the plot into subplots so that each product gets its own subplot.

    n_data_long %>%
      ggplot(mapping = aes(x=Product, y=mean, fill=Strain)) +
      geom_bar(stat='identity',
               position = 'dodge',
               width = 0.8) +
      geom_errorbar(mapping = aes(x = Product,
                                  ymin = mean - stdev,
                                  ymax = mean + stdev),
                    width = 0.2,
                    position = position_dodge(width=0.8)) +
      theme_bw() +
      labs(title = 'Fermentation product titres',
           y = 'Concentration (g/l)') +
      facet_wrap(vars(Product), nrow=2, scales='free')

    The above graph looks much better.

    Combining everything into functions

    When we use the above code in our work, it would be very inconvenient to run all the lines of code each time we want to make the graph. Instead, we will write functions so that we can run all the above code in one line and get the plot.

    We will divide the code into two functions, one for data wrangling and another for data visualization. A third function will call these two functions, so that we can just call the third function and draw the graph.

    The three functions are written below.

    # data wrangling
    transform_data <- function(data){
      
      # calculate means and pivot longer
      n_data_long <- data %>%
        group_by(Product) %>%
        summarize(A = mean(Strain_A),
                  B = mean(Strain_B),
                  C = mean(Strain_C),
        ) %>%
        pivot_longer(!Product,
                     names_to = 'Strain',
                     values_to = 'mean')
      
      # calculate st dev and pivot longer
      sds <- data %>%
        group_by(Product) %>%
        summarize(A = sd(Strain_A),
                  B = sd(Strain_B),
                  C = sd(Strain_C)) %>%
        pivot_longer(!Product,
                     names_to = 'Strain',
                     values_to = 'stdev')
      
      
      # add the st dev column to the table of means
      n_data_long <- n_data_long %>%
        mutate(stdev = sds$stdev)
      
      return(n_data_long)
      }
      
    
    # data plotting
    ferm_plot <- function(data){
      
      #plot
      data %>%
        ggplot(mapping = aes(x=Product, y=mean, fill=Strain)) +
        geom_bar(stat='identity',
                 position = 'dodge',
                 width = 0.8) +
        geom_errorbar(mapping = aes(x = Product,
                                    ymin = mean - stdev,
                                    ymax = mean + stdev),
                      width = 0.2,
                      position = position_dodge(width=0.8)) +
        theme_bw() +
        labs(title = 'Fermentation product titres',
             y = 'Concentration (g/l)') +
        facet_wrap(vars(Product), nrow=2, scales='free')
      
      }
    
    
    # call above two functions
    plot_data <- function(data){
      n_data_long <- transform_data(data)
      ferm_plot(n_data_long)
    }

    Now that we have these three functions, we can plot the graph in two lines.

    • Read the data.
    • Call the plot_data function.
    # utilize the above function
    data <- readxl::read_excel('data/fermentation_products.xlsx')
    
    plot_data(data)

    And we get the plot.
