INTRODUCTION
If some data points are missing, as in Fig. 1, how can we fill in the blanks?

There are many ways to fill them, and choosing one is a typical data science task. Let me introduce one method: cubic spline interpolation.
If you are not familiar with spline interpolation, the Wikipedia article linked below explains it.
https://en.wikipedia.org/wiki/Spline_interpolation
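To make the idea concrete, here is a minimal sketch on synthetic data. The sine samples and variable names are made up for illustration and are not taken from the dataset in this post. It fits a cubic interpolant through a few known points and evaluates it at a point that was not sampled.

```python
import numpy as np
from scipy.interpolate import interp1d

# Hypothetical sample: a smooth signal observed at a few points only
known_x = np.array([0.0, 1.0, 2.0, 4.0, 5.0, 6.0])   # x = 3 is "missing"
known_y = np.sin(known_x)

# Build a piecewise-cubic interpolant through the known points
spline = interp1d(known_x, known_y, kind="cubic")

# Evaluate it at the missing point and compare with the true value
estimate = float(spline(3.0))
print(f"estimated value at x=3: {estimate:.3f}, true value: {np.sin(3.0):.3f}")
```

The estimate is only as good as the assumption that the signal is smooth between the known points.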
Check the result and the entire code below.
RESULT

CODE
import pandas as pd
import numpy as np
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt
import copy
import math


class Spline_Interpolation:
    def __init__(self):
        self.data = pd.read_csv('tmp.csv')
        self.time = []
        self.data_list = []
        self.data_list_splined = []
        # Fine time grid used to draw the smooth curve
        self.time_extended = np.arange(0, 23, 0.1)
        self.label = ["PM10", "PM25", "O3", "NO2", "CO", "SO2"]
        self.unit = ["mcg/m3", "mcg/m3", "ppm", "ppm", "ppm", "ppm"]
        self.blank = []
        self.prediction = []
        # Filled values go into the copy; self.data stays untouched
        self.duplicate = copy.deepcopy(self.data)
        # Number of decimal places kept for each pollutant
        self.round_criterion = [0, 0, 3, 3, 1, 3]

    def Whole_plot(self):
        for idx in range(1, 7):
            self.draw_spl_plot(idx)
            self.Fill_Blank(idx)
        print(self.duplicate)

    def Initialized(self):
        self.time = []
        self.data_list = []
        self.blank = []

    def draw_spl_plot(self, num):
        self.Initialized()
        # Split the 24 hourly rows into known values and blanks
        # (column 0 is assumed to hold the time, columns 1-6 the pollutants)
        for idx in range(24):
            if not math.isnan(self.data.iloc[idx, num]):
                self.time.append(idx)
                self.data_list.append(self.data.iloc[idx, num])
            else:
                self.blank.append(idx)
        # Fit the cubic spline on the known points only
        spl = interpolate.interp1d(self.time, self.data_list, kind='cubic')
        self.data_list_splined = spl(self.time_extended)
        self.prediction = spl(self.blank)
        plt.subplot(2, 3, num)
        plt.plot(self.time, self.data_list, "o", self.time_extended, self.data_list_splined, '--', self.blank, self.prediction, '*')
        plt.title(self.label[num-1])
        plt.xlabel('time [h]')
        plt.ylabel(self.unit[num-1])
        # Show the figure once the last of the six panels is drawn
        if num == 6:
            plt.tight_layout()
            plt.show()

    def Fill_Blank(self, num):
        cri = self.round_criterion[num-1]
        for idx in range(len(self.blank)):
            self.duplicate.loc[self.blank[idx], self.label[num-1]] = round(self.prediction[idx], cri)


SPL = Spline_Interpolation()
SPL.Whole_plot()
CODE EXPLANATION
Let's go through it piece by piece.
import pandas as pd
import numpy as np
import scipy.interpolate as interpolate
import matplotlib.pyplot as plt
import copy
import math
You can easily run the interpolation with the SciPy package and visualize the result with Matplotlib.
pandas is used to read and write the data (a .csv file in this case).
class Spline_Interpolation:
    def __init__(self):
        self.data = pd.read_csv('tmp.csv')
        self.time = []
        self.data_list = []
        self.data_list_splined = []
        self.time_extended = np.arange(0, 23, 0.1)
        self.label = ["PM10", "PM25", "O3", "NO2", "CO", "SO2"]
        self.unit = ["mcg/m3", "mcg/m3", "ppm", "ppm", "ppm", "ppm"]
        self.blank = []
        self.prediction = []
        self.duplicate = copy.deepcopy(self.data)
        self.round_criterion = [0, 0, 3, 3, 1, 3]
I prefer OOP (object-oriented programming), so I wrapped the code in a class, although simply using a class does not by itself make code object-oriented.
self.duplicate receives the filled-in values while self.data keeps the original data untouched, so I used a deep copy.
You can check the difference between a shallow copy and a deep copy at the link below.
https://www.baeldung.com/cs/deep-vs-shallow-copy
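As a small illustration of why the copy matters, the sketch below uses a made-up two-row frame (not the real CSV) to compare a plain assignment with copy.deepcopy. Filling the blank in the deep copy leaves the original unchanged.

```python
import copy
import pandas as pd

# Hypothetical two-row frame standing in for the CSV data
original = pd.DataFrame({"PM10": [30.0, float("nan")]})

alias = original                     # same object, no copy at all
backup = copy.deepcopy(original)     # independent copy of the data

# Fill the blank in the copy only
backup.loc[1, "PM10"] = 35.0

print(alias is original)                  # True: both names refer to one object
print(original["PM10"].isna().iloc[1])    # True: the original still has its blank
print(backup.loc[1, "PM10"])              # 35.0
```

For a DataFrame, calling original.copy() should give the same kind of independent copy, since it copies deeply by default.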
for idx in range(24):
    if not math.isnan(self.data.iloc[idx, num]):
        self.time.append(idx)
        self.data_list.append(self.data.iloc[idx, num])
    else:
        self.blank.append(idx)
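In the for loop, blanks in the CSV file are read as NaN, so I detect them with the math.isnan() function.
You can access a single value of a pandas DataFrame by position with data.iloc[row, column]. The chained form data.iloc[row][column] relies on positional indexing of a Series, which newer pandas versions deprecate.
So if an element is not NaN, its hour and value are appended to the time and data_list lists; if it is NaN, its hour is appended to the blank list, to be filled with a prediction later.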
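The same classification can also be written without an explicit loop. This is only an alternative sketch on a made-up column, using a boolean mask from isna(); the variable names mirror the ones in the class, but the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two blanks, standing in for one pollutant
column = pd.Series([12.0, np.nan, 15.0, 14.0, np.nan, 11.0])

missing = column.isna().to_numpy()          # boolean mask, True where blank
hours = np.arange(len(column))

time = hours[~missing]                      # hours with a measurement
data_list = column.to_numpy()[~missing]     # the measurements themselves
blank = hours[missing]                      # hours to be predicted

print(time.tolist())       # [0, 2, 3, 5]
print(data_list.tolist())  # [12.0, 15.0, 14.0, 11.0]
print(blank.tolist())      # [1, 4]
```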
spl = interpolate.interp1d(self.time, self.data_list, kind='cubic')
self.data_list_splined = spl(self.time_extended)
self.prediction = spl(self.blank)
Once the values are classified, we build the cubic spline from the known points and evaluate it both on the fine time grid (for the curve) and at the blank hours (for the predictions).
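One caveat worth knowing: with its default settings interp1d only interpolates, so it raises a ValueError when asked for a point outside the range of the known times (for example, when the very first or last hour is blank). The sketch below, using made-up numbers, separates the blanks that fall inside the known range from those that do not. How to treat the ones outside (leave them blank, or pass fill_value="extrapolate") is a judgment call that this post does not cover.

```python
import numpy as np
from scipy.interpolate import interp1d

# Hypothetical data: hour 0 and hour 3 are blank
time = np.array([1, 2, 4, 5, 6])
data_list = np.array([20.0, 22.0, 27.0, 25.0, 21.0])
blank = np.array([0, 3])

spl = interp1d(time, data_list, kind="cubic")

# Only blanks between the first and last known hour can be interpolated
inside = (blank >= time.min()) & (blank <= time.max())
fillable = blank[inside]        # hour 3
not_fillable = blank[~inside]   # hour 0, outside the known range

prediction = spl(fillable)
print(fillable.tolist(), not_fillable.tolist())

# Asking for hour 0 directly fails with the default settings
try:
    spl(0)
    raised = False
except ValueError as err:
    raised = True
    print("out of range:", err)
```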
plt.subplot(2, 3, num)
plt.plot(self.time, self.data_list, "o", self.time_extended, self.data_list_splined, '--', self.blank, self.prediction, '*')
plt.title(self.label[num-1])
plt.xlabel('time [h]')
plt.ylabel(self.unit[num-1])
Then the result is visualized with Matplotlib, with each subplot titled by its pollutant and labeled with its unit.
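For comparison, here is a hedged sketch of the same panel layout using Matplotlib's object-oriented interface (plt.subplots) instead of repeated plt.subplot calls. The two panels and their values are invented for illustration, and the Agg backend is selected only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical labels and data for two of the six panels
labels = ["PM10", "O3"]
units = ["mcg/m3", "ppm"]
hours = np.arange(6)
values = [
    np.array([30, 32, 35, 33, 31, 29]),
    np.array([0.021, 0.025, 0.030, 0.028, 0.024, 0.020]),
]

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, label, unit, y in zip(axes, labels, units, values):
    ax.plot(hours, y, "o--")        # markers joined by a dashed line
    ax.set_title(label)
    ax.set_xlabel("time [h]")
    ax.set_ylabel(unit)

fig.tight_layout()
```

To view the figure, drop the matplotlib.use line and call plt.show(), or save it with fig.savefig.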
cri = self.round_criterion[num-1]
for idx in range(len(self.blank)):
    self.duplicate.loc[self.blank[idx], self.label[num-1]] = round(self.prediction[idx], cri)
However, each measurement is recorded with a fixed number of decimal places, so the prediction is rounded to the matching precision for its column.
The round function is called as round(number, number_of_digits).
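A few quick examples of round follow, along with a possible alternative: DataFrame.round accepts a dictionary of column names and decimal places, so every column can be rounded to its own precision in one call. The frame and its values here are hypothetical.

```python
import pandas as pd

# Built-in round: the second argument is the number of decimal places
print(round(0.04567, 3))   # 0.046
print(round(41.6, 0))      # 42.0 (still a float when digits are given)
print(round(41.6))         # 42 (an int when digits are omitted)

# Hypothetical frame: round each column to its own precision in one call
df = pd.DataFrame({"PM10": [41.63], "O3": [0.04567], "CO": [0.47]})
rounded = df.round({"PM10": 0, "O3": 3, "CO": 1})
print(rounded.iloc[0].tolist())
```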