TY - JOUR
T1 - A probability distribution-based approach to impute missing values in hourly PM10 concentration
AU - Mohamad Yunus, Rossita
AU - Zubairi, Yong Z.
AU - Hasan, Masud
PY - 2018
Y1 - 2018
N2 - PM10 data are one of the basic inputs in assessing air quality. However, most data series are unreliable for analysis because they contain a significant number of missing records. This study adopts a probability distribution-based approach to impute missing values in hourly PM10 concentration data, aiming to preserve the statistical properties of the observed data. The study considers twenty-three monitoring stations in west Peninsular Malaysia. Since the log-normal distribution fits the hourly PM10 concentration data reasonably well for most stations, this distribution, with its estimated parameters, was used to randomly generate a dataset equal in size to the missing data. Because observations are correlated between stations, a well-established spatial imputation method used in many applications is the k-stations-average, whereby a missing value is filled with the average of data from the k nearest stations. Alternatively, in this study, missing values are filled with random numbers generated from the log-normal distribution that were first sorted according to the rank of the inverse distance weighted average from four neighbouring stations. The distribution-based imputed data show statistical properties more similar to those of the observed data than data imputed using the 4-stations-average (with inverse distance weight) method. The means and medians of the observed and imputed datasets were found to be very similar. However, the log-normal distribution slightly underestimates the 95th and 99th percentiles. Additionally, the distribution-based imputed data fail to maintain statistical properties similar to those of the observed data when at least 5% additional observed data are made missing.
AB - PM10 data are one of the basic inputs in assessing air quality. However, most data series are unreliable for analysis because they contain a significant number of missing records. This study adopts a probability distribution-based approach to impute missing values in hourly PM10 concentration data, aiming to preserve the statistical properties of the observed data. The study considers twenty-three monitoring stations in west Peninsular Malaysia. Since the log-normal distribution fits the hourly PM10 concentration data reasonably well for most stations, this distribution, with its estimated parameters, was used to randomly generate a dataset equal in size to the missing data. Because observations are correlated between stations, a well-established spatial imputation method used in many applications is the k-stations-average, whereby a missing value is filled with the average of data from the k nearest stations. Alternatively, in this study, missing values are filled with random numbers generated from the log-normal distribution that were first sorted according to the rank of the inverse distance weighted average from four neighbouring stations. The distribution-based imputed data show statistical properties more similar to those of the observed data than data imputed using the 4-stations-average (with inverse distance weight) method. The means and medians of the observed and imputed datasets were found to be very similar. However, the log-normal distribution slightly underestimates the 95th and 99th percentiles. Additionally, the distribution-based imputed data fail to maintain statistical properties similar to those of the observed data when at least 5% additional observed data are made missing.
U2 - v1/uploads/files/4_Portal%20Content/2_%20Statistics/MyStats/2017/Abstract/2_c_ii_-Fullpaper-Dr_Rossita_M__Yunus.pdf
DO - v1/uploads/files/4_Portal%20Content/2_%20Statistics/MyStats/2017/Abstract/2_c_ii_-Fullpaper-Dr_Rossita_M__Yunus.pdf
M3 - Article
SP - 1
EP - 6
JO - Department of Statistics Malaysia
JF - Department of Statistics Malaysia
ER -