A probability distribution-based approach to impute missing values in hourly PM10 concentration

Rossita Mohamad Yunus, Yong Z. Zubairi, Masud Hasan

    Research output: Contribution to journal › Article › peer-review

    Abstract

    PM10 data are one of the basic inputs in assessing air quality. However, most data series are unreliable for analysis because they contain a significant number of missing records. This study adopts a probability distribution-based approach to impute missing values in hourly PM10 concentration data, and aims to preserve statistical properties similar to those of the observed data. The study considers twenty-three monitoring stations in west Peninsular Malaysia. Since the log-normal distribution fits the hourly PM10 concentration data reasonably well for most stations, this distribution, with its estimated parameters, was used to randomly generate a dataset equal in size to the number of missing values. For correlated observations between stations, a well-established spatial imputation method used in many applications is the k-stations-average, whereby a missing value is filled with the average of data from the k nearest stations. In this study, missing values are instead filled with random numbers generated from the log-normal distribution, after these numbers have been sorted according to the rank of the inverse distance weighted average from four neighbouring stations. The distribution-based imputed data show statistical properties more similar to those of the observed data than do the data imputed using the 4-stations-average (with inverse distance weight) method. The means and medians of the observed and imputed datasets were found to be very similar. However, the log-normal distribution slightly underestimates the 95th and 99th percentiles. Additionally, the distribution-based imputed data fail to maintain statistical properties similar to those of the observed data when at least 5% additional observed data are made missing.
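
    The rank-matching step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `idw_average` and `impute_lognormal_ranked` are hypothetical, the log-normal parameters are assumed to be estimated by maximum likelihood on the log scale (the abstract does not state the estimator), and the neighbouring stations are assumed to have complete, positive readings at every hour where the target station is missing.

```python
import numpy as np

def idw_average(neighbour_values, distances):
    """Inverse-distance-weighted average across stations (rows are hours)."""
    weights = 1.0 / np.asarray(distances, dtype=float)
    return np.asarray(neighbour_values, dtype=float) @ (weights / weights.sum())

def impute_lognormal_ranked(target, neighbour_values, distances, rng=None):
    """Fill NaNs in `target` with log-normal draws ordered like the IDW average.

    target           : 1-D array of hourly PM10 values, NaN where missing
    neighbour_values : 2-D array (hours x stations), assumed complete
    distances        : distance from the target to each neighbouring station
    """
    rng = np.random.default_rng() if rng is None else rng
    target = np.asarray(target, dtype=float)
    missing = np.isnan(target)
    n_missing = int(missing.sum())
    if n_missing == 0:
        return target.copy()

    # Assumed estimator: maximum likelihood on the logs of the observed values.
    logs = np.log(target[~missing])
    mu, sigma = logs.mean(), logs.std()

    # Draw exactly as many values as there are gaps and sort them ascending.
    draws = np.sort(rng.lognormal(mean=mu, sigma=sigma, size=n_missing))

    # Rank (0 = smallest) of the IDW average at each missing hour.
    idw = idw_average(np.asarray(neighbour_values, dtype=float)[missing], distances)
    ranks = np.argsort(np.argsort(idw))

    # The hour with the r-th smallest IDW average receives the r-th smallest draw.
    filled = target.copy()
    filled[missing] = draws[ranks]
    return filled
```

    Under these assumptions, the imputed values follow the fitted log-normal distribution while their ordering in time follows the neighbouring stations, which is the property the abstract contrasts with filling each gap directly with the 4-stations-average.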
