[Data Analysis] Fill the blanks in data with Spline Interpolation & Python

INTRODUCTION
RESULT
CODE
CODE EXPLAIN
Thank you for Watching !

INTRODUCTION

If some elements of a dataset are missing, as in Fig. 1, how can we fill in the blanks?

There are many methods for filling them; it is a common problem in data science. Let me introduce one of them: cubic spline interpolation.

For those who are not familiar with spline interpolation, the Wikipedia article linked below explains it.
https://en.wikipedia.org/wiki/Spline_interpolation
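
To make the idea concrete before the full program, here is a minimal sketch on toy data. The values are made up for illustration and are not taken from the dataset used in this post. It builds a cubic interpolant through the known samples with scipy.interpolate.interp1d and evaluates it at one missing position.

```python
import numpy as np
from scipy import interpolate

# Toy data (hypothetical): samples of y = x**3 with the value at x = 3 missing.
known_x = np.array([0, 1, 2, 4, 5, 6])
known_y = known_x.astype(float) ** 3

# Build a piecewise-cubic interpolant through the known samples.
spline = interpolate.interp1d(known_x, known_y, kind='cubic')

# Evaluate it at the missing position.
filled = float(spline(3))
print(filled)
```

Because the toy samples lie on a cubic curve, the estimate should land very close to 27. Real measurements will not be reproduced this exactly.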

The result and the complete code are shown below.

RESULT

CODE

import pandas as pd
import numpy as np
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt
import copy
import math

class Spline_Interpolation:
    
    def __init__(self):
        
        self.data = pd.read_csv('tmp.csv')
        self.time = []
        self.data_list = []
        self.data_list_splined = []
        self.time_extended = np.arange(0, 23, 0.1)
        self.label = ["PM10", "PM25", "O3", "NO2", "CO", "SO2"]
        self.unit = ["mcg/m3", "mcg/m3", "ppm", "ppm", "ppm", "ppm"]
        self.blank = []
        self.prediction = []
        self.duplicate = copy.deepcopy(self.data)
        self.round_criterion = [0, 0, 3, 3, 1, 3]
        
    def Whole_plot(self):
        
        for idx in range(1, 7):
            
            self.draw_spl_plot(idx)
            
            self.Fill_Blank(idx)
            
        print(self.duplicate)
        
        
    def Initialized(self):
        
        self.time = []
        self.data_list = []
        self.blank = []
            
        
    def draw_spl_plot(self, num):
        
        self.Initialized()
        
        for idx in range(24):
            
            if not math.isnan(self.data.iloc[idx, num]):
                
                self.time.append(idx)
                
                self.data_list.append(self.data.iloc[idx, num])
                
            else:
                
                self.blank.append(idx)
        
        spl = interpolate.interp1d(self.time, self.data_list, kind='cubic')
        
        self.data_list_splined = spl(self.time_extended)
        
        self.prediction = spl(self.blank)
        
        plt.subplot(2, 3, num)
        
        plt.plot(self.time, self.data_list, "o", self.time_extended, self.data_list_splined, '--', self.blank, self.prediction, '*')
        
        plt.title(self.label[num-1])
        
        plt.xlabel('time [h]')
        
        plt.ylabel(self.unit[num-1])
        
        if num == 6:
        
            plt.tight_layout()
            plt.show()
            
            
    def Fill_Blank(self, num):
        
        cri = self.round_criterion[num-1]
        
        for idx in range(len(self.blank)):
            
            self.duplicate.loc[self.blank[idx], self.label[num-1]] = round(self.prediction[idx], cri)
            
        
SPL = Spline_Interpolation()

SPL.Whole_plot()
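
For comparison, pandas also offers a built-in route that may be enough when no plot is needed. The sketch below is not the code of this post: it uses a made-up series and Series.interpolate with method='cubic', which pandas hands over to SciPy.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series with two interior blanks (the values lie on a cubic curve).
hours = np.arange(8)
values = pd.Series(0.5 * hours**3 - 2.0 * hours + 1.0)
values.iloc[[2, 5]] = np.nan

# method='cubic' is forwarded to scipy.interpolate.interp1d, so SciPy must be installed.
filled = values.interpolate(method='cubic')
print(filled)
```

The class-based version above is still useful when you want the dense curve for plotting and a separate precision per column.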

CODE EXPLAIN

Let's go through it part by part.

import pandas as pd
import numpy as np
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt
import copy
import math

The interpolation itself is easy to do with the SciPy package, and the result is visualized with Matplotlib.
pandas is used to read and write the data (a .csv file in this case).

class Spline_Interpolation:
    
    def __init__(self):
        
        self.data = pd.read_csv('tmp.csv')
        self.time = []
        self.data_list = []
        self.data_list_splined = []
        self.time_extended = np.arange(0, 23, 0.1)
        self.label = ["PM10", "PM25", "O3", "NO2", "CO", "SO2"]
        self.unit = ["mcg/m3", "mcg/m3", "ppm", "ppm", "ppm", "ppm"]
        self.blank = []
        self.prediction = []
        self.duplicate = copy.deepcopy(self.data)
        self.round_criterion = [0, 0, 3, 3, 1, 3]

I prefer OOP (object-oriented programming), so I tried to use a class here. Note, however, that using a class is not the same thing as OOP.
self.duplicate is a deep copy of the data: the predicted values are written into it, so the original self.data stays unchanged.

You can read about the difference between a shallow copy and a deep copy at the link below.
https://www.baeldung.com/cs/deep-vs-shallow-copy
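
As a quick illustration of that difference, here is a small sketch. The nested list and the one-column table are made-up examples, not data from this post.

```python
import copy

import pandas as pd

# A nested list makes the difference visible.
original = [[1, 2], [3, 4]]
shallow = copy.copy(original)      # new outer list, same inner lists
deep = copy.deepcopy(original)     # inner lists are copied as well

shallow[0][0] = 99                 # also changes original[0][0]
deep[1][1] = -1                    # original is unaffected

print(original)  # [[99, 2], [3, 4]]

# For a DataFrame, deepcopy gives an independent table to write into.
frame = pd.DataFrame({"PM10": [30.0, None, 34.0]})
duplicate = copy.deepcopy(frame)
duplicate.loc[1, "PM10"] = 32.0
print(frame["PM10"].isna().sum())  # 1: the original still has its blank
```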

for idx in range(24):

    if not math.isnan(self.data.iloc[idx, num]):

        self.time.append(idx)

        self.data_list.append(self.data.iloc[idx, num])

    else:

        self.blank.append(idx)

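In the for loop, blanks in the CSV file are read as NaN, so I detect them with the math.isnan() function.
You can access an element of a pandas DataFrame by position with data.iloc[idx_1, idx_2]; the chained form data.iloc[idx_1][idx_2] returns the same value here, but newer pandas versions may warn about it.
So if an element is not NaN, its index is appended to the time list and its value to data_list; if it is NaN, its index is appended to the blank list, to be filled in by the prediction later.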
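
The same split can also be written without an explicit loop. The following is only a sketch of that alternative, using a made-up column in place of one pollutant from the CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two blanks, standing in for one pollutant.
column = pd.Series([30.0, np.nan, 34.0, 35.0, np.nan, 31.0])

missing = column.isna()                       # boolean mask, True where blank
time = np.flatnonzero(~missing.to_numpy())    # row positions with a value
blank = np.flatnonzero(missing.to_numpy())    # row positions to predict
data_list = column[~missing].to_numpy()       # the known values themselves

print(time, blank, data_list)
```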

spl = interpolate.interp1d(self.time, self.data_list, kind='cubic')

self.data_list_splined = spl(self.time_extended)

self.prediction = spl(self.blank)

Next, once the values are sorted into these lists, the cubic spline interpolant is built from time and data_list, then evaluated on the fine time grid (time_extended) and at the blank positions.
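
One caveat worth knowing: by default interp1d only evaluates inside the range of the known points. The sketch below uses made-up values in which the last hour is blank, to show what happens in that case and one possible workaround. Whether extrapolation is acceptable depends on your data.

```python
import numpy as np
from scipy import interpolate

# Hypothetical case: hour 5 is blank, so it lies outside the known range 0..4.
time = [0, 1, 2, 3, 4]
data_list = [30.0, 32.0, 31.0, 35.0, 34.0]
blank = [5]

spl = interpolate.interp1d(time, data_list, kind='cubic')

try:
    spl(blank)
    out_of_range = False
except ValueError:
    # interp1d refuses to evaluate outside [time[0], time[-1]] by default.
    out_of_range = True
print(out_of_range)

# One possible workaround: allow extrapolation explicitly (use with care).
spl_extra = interpolate.interp1d(time, data_list, kind='cubic',
                                 fill_value='extrapolate')
print(spl_extra(blank))
```

In the code of this post, the same error would appear if the first or last row of a column were blank, because time_extended and blank would then reach outside the known hours.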

plt.subplot(2, 3, num)

plt.plot(self.time, self.data_list, "o", self.time_extended, self.data_list_splined, '--', self.blank, self.prediction, '*')

plt.title(self.label[num-1])

plt.xlabel('time [h]')

plt.ylabel(self.unit[num-1])

The result is then visualized with Matplotlib, with the pollutant name as the title and its unit as the y-axis label.
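
The same 2 x 3 layout can also be built with Matplotlib's object-oriented interface. This is a sketch of that alternative, not the code of this post: the plotted curve is a made-up placeholder, and the Agg backend is selected only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

labels = ["PM10", "PM25", "O3", "NO2", "CO", "SO2"]
units = ["mcg/m3", "mcg/m3", "ppm", "ppm", "ppm", "ppm"]

# One Axes object per pollutant, arranged in 2 rows and 3 columns.
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
hours = np.arange(24)

for ax, label, unit in zip(axes.flat, labels, units):
    # Placeholder curve (made up); real code would plot the measured and splined values.
    ax.plot(hours, np.sin(hours / 4.0), "o--")
    ax.set_title(label)
    ax.set_xlabel("time [h]")
    ax.set_ylabel(unit)

fig.tight_layout()
# Call fig.savefig(...) or plt.show() here to store or display the figure.
```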

cri = self.round_criterion[num-1]

for idx in range(len(self.blank)):

    self.duplicate.loc[self.blank[idx], self.label[num-1]] = round(self.prediction[idx], cri)

However, each column of the data has a fixed number of decimal places, so the predictions are rounded to the corresponding number of digits.
The round function is used as round(number, number of digits).
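
A short sketch of how round behaves, with made-up prediction values and the same per-column precisions as round_criterion:

```python
# round(number, ndigits) keeps ndigits digits after the decimal point.
print(round(0.04567, 3))   # 0.046
print(round(41.6, 0))      # 42.0 (still a float when ndigits is given)
print(round(41.6))         # 42 (an int when ndigits is omitted)

# Applying one precision per column, as round_criterion does in the post.
round_criterion = [0, 0, 3, 3, 1, 3]
predictions = [41.6, 18.2, 0.04567, 0.0213, 0.46, 0.00349]  # made-up values
rounded = [round(p, n) for p, n in zip(predictions, round_criterion)]
print(rounded)  # [42.0, 18.0, 0.046, 0.021, 0.5, 0.003]
```

Note that with 0 digits the filled PM10 and PM25 values remain floats such as 42.0 rather than integers.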


Chulyong Lee
