[Data Analysis] Fill the blanks in data with Spline Interpolation & Python

    INTRODUCTION

     

    If some elements of a data set are missing, as in Fig. 1, how can we fill in the blanks?

    Fig. 1. Time Series Data

    There are many methods for filling them in; choosing one is a matter of 'Data Science'. Let me introduce one method: cubic spline interpolation.
     
    For those who aren't familiar with spline interpolation, the Wikipedia article linked below explains it.
    https://en.wikipedia.org/wiki/Spline_interpolation
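
    As a point of comparison for the cubic spline, the simplest of the many filling methods is probably linear interpolation. The sketch below uses pandas' Series.interpolate on a made-up series (the values are hypothetical and are not taken from the post's data):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings with one blank in the middle.
readings = pd.Series([10.0, np.nan, 14.0, 15.0])

# Linear interpolation: the blank becomes the midpoint of its two neighbours.
filled = readings.interpolate(method="linear")

print(filled.tolist())
```

    A straight line ignores any curvature in the data, which is one reason to try a spline instead.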

     

     

    The result and the entire code are shown below.

     

    RESULT

    Fig. 2. Result

    CODE

    import pandas as pd
    import numpy as np
    import scipy.interpolate as interpolate
    import matplotlib.pyplot as plt
    import copy
    import math
    
    class Spline_Interpolation:
        
        def __init__(self):
            
            self.data = pd.read_csv('tmp.csv')
            self.time = []
            self.data_list = []
            self.data_list_splined = []
            self.time_extended = np.arange(0, 23, 0.1)
            self.label = ["PM10", "PM25", "O3", "NO2", "CO", "SO2"]
            self.unit = ["mcg/m3", "mcg/m3", "ppm", "ppm", "ppm", "ppm"]
            self.blank = []
            self.prediction = []
            self.duplicate = copy.deepcopy(self.data)
            self.round_criterion = [0, 0, 3, 3, 1, 3]
            
        def Whole_plot(self):
            
            for idx in range(1, 7):
                
                self.draw_spl_plot(idx)
                
                self.Fill_Blank(idx)
                
            print(self.duplicate)
            
            
        def Initialized(self):
            
            self.time = []
            self.data_list = []
            self.blank = []
                
            
        def draw_spl_plot(self, num):
            
            self.Initialized()
            
            for idx in range(24):
                
                # iloc[row, column] reads one cell by position without chained indexing
                if not math.isnan(self.data.iloc[idx, num]):
                    
                    self.time.append(idx)
                    
                    self.data_list.append(self.data.iloc[idx, num])
                    
                else:
                    
                    self.blank.append(idx)
            
            spl = interpolate.interp1d(self.time, self.data_list, kind='cubic')
            
            self.data_list_splined = spl(self.time_extended)
            
            self.prediction = spl(self.blank)
            
            plt.subplot(2, 3, num)
            
            plt.plot(self.time, self.data_list, "o", self.time_extended, self.data_list_splined, '--', self.blank, self.prediction, '*')
            
            plt.title(self.label[num-1])
            
            plt.xlabel('time [h]')
            
            plt.ylabel(self.unit[num-1])
            
            if num == 6:
            
                plt.tight_layout()
                plt.show()
                
                
        def Fill_Blank(self, num):
            
            cri = self.round_criterion[num-1]
            
            for idx in range(len(self.blank)):
                
                self.duplicate.loc[self.blank[idx], self.label[num-1]] = round(self.prediction[idx], cri)
                
            
    SPL = Spline_Interpolation()
    
    SPL.Whole_plot()

     

     

    CODE EXPLANATION

     

    Let's go through it piece by piece.

    import pandas as pd
    import numpy as np
    import scipy.interpolate as interpolate
    import matplotlib.pyplot as plt
    import copy
    import math

    You can easily apply the method with the 'SciPy' package and visualize the result with 'Matplotlib'.
    'pandas' is used to read and write the data (a .csv file in this case).
     

    class Spline_Interpolation:
        
        def __init__(self):
            
            self.data = pd.read_csv('tmp.csv')
            self.time = []
            self.data_list = []
            self.data_list_splined = []
            self.time_extended = np.arange(0, 23, 0.1)
            self.label = ["PM10", "PM25", "O3", "NO2", "CO", "SO2"]
            self.unit = ["mcg/m3", "mcg/m3", "ppm", "ppm", "ppm", "ppm"]
            self.blank = []
            self.prediction = []
            self.duplicate = copy.deepcopy(self.data)
            self.round_criterion = [0, 0, 3, 3, 1, 3]

    I prefer 'OOP' (Object-Oriented Programming), so I tried to use a 'class'. Note, however, that using a 'class' is not the same thing as OOP.
    self.duplicate holds a copy that receives the filled-in values, so that the original data in self.data stays untouched; that is why I used a deep copy.
     
    You can check the difference between a 'shallow copy' and a 'deep copy' at the link below.
    https://www.baeldung.com/cs/deep-vs-shallow-copy
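
    The effect of the deep copy can be checked with a small sketch. The one-column DataFrame below is a hypothetical stand-in for the contents of tmp.csv, and the filled value 32.0 is made up:

```python
import copy

import numpy as np
import pandas as pd

# Hypothetical stand-in for the contents of tmp.csv.
original = pd.DataFrame({"PM10": [30.0, np.nan, 34.0]})

# Independent copy: writes to `filled` do not touch `original`.
filled = copy.deepcopy(original)
filled.loc[1, "PM10"] = 32.0  # made-up value for the blank

print(original["PM10"].isna().sum())  # blanks left in the original
print(filled["PM10"].isna().sum())    # blanks left in the copy
```

    For a DataFrame, original.copy() also makes a deep copy by default, so either spelling should work here.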
     

    for idx in range(24):

        if not math.isnan(self.data.iloc[idx, num]):

            self.time.append(idx)

            self.data_list.append(self.data.iloc[idx, num])

        else:

            self.blank.append(idx)

    Blanks in the csv file are read as 'NaN', so inside the 'for' loop I detect them with the 'math.isnan()' function.
    You can access a single element of a pandas DataFrame by position with data.iloc[row, column].
    Therefore, if an element is not NaN, its value is appended to data_list and its index to the time list; if it is NaN, its index is appended to the blank list, to be filled in later by prediction.
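
    The same split into known and blank positions can also be done without an explicit loop. The following is only a sketch of that alternative, using pandas' isna() on a hypothetical column; the variable names are made up and merely mirror the roles of self.time, self.data_list and self.blank:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings with two blanks.
column = pd.Series([12.0, np.nan, 15.0, 18.0, np.nan, 20.0])

missing = column.isna().to_numpy()          # boolean mask, True where blank
hours = np.arange(len(column))

known_hours = hours[~missing]               # plays the role of self.time
known_values = column.to_numpy()[~missing]  # plays the role of self.data_list
blank_hours = hours[missing]                # plays the role of self.blank

print(known_hours.tolist(), blank_hours.tolist())
```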
     

    spl = interpolate.interp1d(self.time, self.data_list, kind='cubic')

    self.data_list_splined = spl(self.time_extended)

    self.prediction = spl(self.blank)

    Next, once the classification is finished, the cubic spline is fitted to the known points (time, data_list) and then evaluated both on the dense time grid (time_extended) and at the blank positions.
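
    To see the fit-then-evaluate pattern in isolation, here is a minimal sketch on hypothetical samples of y = x**3 with the value at x = 3 left out. Because the samples come from a cubic, the prediction should land very close to 27. The sketch also shows that, with its default settings, interp1d raises an error for points outside the range of the known data:

```python
import numpy as np
from scipy import interpolate

# Hypothetical samples of y = x**3 with the value at x = 3 missing.
known_x = np.array([0.0, 1.0, 2.0, 4.0, 5.0])
known_y = known_x ** 3
blank_x = np.array([3.0])

# kind='cubic' needs at least four known points.
spl = interpolate.interp1d(known_x, known_y, kind="cubic")

prediction = spl(blank_x)
print(prediction)

# By default interp1d refuses to evaluate outside the known range.
try:
    spl(6.0)
except ValueError as err:
    print("out of range:", err)
```

    In the code above this matters when the first or last hour of a column is blank, because self.blank then contains a point outside the known range.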
     

    plt.subplot(2, 3, num)

    plt.plot(self.time, self.data_list, "o", self.time_extended, self.data_list_splined, '--', self.blank, self.prediction, '*')

    plt.title(self.label[num-1])

    plt.xlabel('time [h]')

    plt.ylabel(self.unit[num-1])

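    The result is then visualized with 'Matplotlib': known points are drawn as circles, the spline as a dashed line and the predicted values as stars, and each subplot gets its title and unit.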
     

    cri = self.round_criterion[num-1]

    for idx in range(len(self.blank)):

        self.duplicate.loc[self.blank[idx], self.label[num-1]] = round(self.prediction[idx], cri)

    The values in the data have a fixed number of decimal places, so each prediction is rounded to the number of digits that matches its column.
    The 'round' function is called as round(number, number of digits).
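
    The rounding and write-back step can be tried on its own with the sketch below. The table, the blank position and the predicted value are all hypothetical; only the pattern of writing round(value, digits) into the copy through .loc follows the code above:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one blank at row 1, and a made-up prediction for it.
table = pd.DataFrame({"O3": [0.031, np.nan, 0.036]})
blank = [1]
prediction = np.array([0.03347])

digits = 3  # decimal places to keep for this column
for row, value in zip(blank, prediction):
    table.loc[row, "O3"] = round(float(value), digits)

print(table)
```

    Iterating with zip over the blank positions and the predictions avoids indexing both lists by hand.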
     

    Thank you for reading!
