Energy Demand Prediction

This notebook aims to predict the aggregate energy demand from a selected list of locations in the City of Helsinki using both the classical Box-Jenkins method and an advanced Deep Learning model.

The data is fetched from the Nuuka open API, courtesy of Avoindata.fi. Visit our Github repository at github.com/quan-possible/energy-demand-prediction.

Authors: Bruce Nguyen and Son Le

Table of Contents

  1. Exploratory Data Analysis
    1. Data Cleaning
    2. Data Visualization
  2. Baseline Model
  3. SARIMAX
    1. Model Identification
    2. Model Estimation
    3. Model Validation
  4. LSTM
    1. Feature Engineering
      1. Holidays Data Generation
      2. Cyclical feature encoding
      3. Feature Vector Normalization
    2. Modeling
  5. Further Readings

Exploratory Data Analysis

Data Cleaning

We start by reading the data and getting familiar with it.

From what we can see, the data contains the electricity consumption from various locations in Helsinki in the time period in question. However, since we are interested only in the aggregate demand by date (without location), the task is to sum up the values from those locations.

In order to do that, we need to get more information about the energy consumption of each location and remove any abnormalities, if need be.

Unfortunately, the view we created reveals several problems with the data. The most obvious one is that the unit is not the same for all data points: in fact, only one place has its electricity measured in $m^{3}$. We can fix this with a simple mask.
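As a rough sketch of that mask, assuming the raw DataFrame is called df and reports each reading's unit in a 'unit' column (both names are assumptions about the Nuuka schema), one option is simply to keep the kWh rows:

```python
# Hypothetical column name 'unit'; adjust to the actual Nuuka schema.
# Keep only the rows reported in kWh, dropping the single m^3 location.
df = df[df["unit"] == "kWh"]
```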

In addition, the number of recorded values is not the same for all locations, meaning that there are missing values for each location. Handling this will take a bit more effort, starting with creating a new DataFrame whose columns are the locations.
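One way to build that wide table, assuming hypothetical column names 'timestamp', 'locationName' and 'value', is a pivot:

```python
# Reshape the long-format data so each location becomes its own column,
# indexed by timestamp (column names are assumptions).
wide = df.pivot_table(index="timestamp", columns="locationName", values="value")
```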

The NaN values are now visible; we only need to decide what to do with them. Our approach is to drop the columns (locations) with more than 100 missing entries, since such columns are no longer significant to our analysis.
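Continuing the sketch above, the dropping step could look like this:

```python
# Drop locations (columns) with more than 100 missing entries.
missing_counts = wide.isna().sum()
wide = wide.loc[:, missing_counts <= 100]
```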

Concerning the rest of the columns, we can simply interpolate the missing values linearly. This method is simple but effective, given that the number of missing points is small.
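With pandas, the interpolation is a one-liner:

```python
# Fill the remaining gaps by linear interpolation along the time index.
wide = wide.interpolate(method="linear")
```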

The only step left is to aggregate all the values into a single 'energy consumption' column. We will also rename the columns for convenience.
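A minimal sketch of the aggregation, using 'consumption' as shorthand for the energy consumption column:

```python
# Sum over all locations into a single consumption series with a clear name.
energy = wide.sum(axis=1).to_frame(name="consumption")
```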

Finally, the data is in our desired shape. We only need to save our work now.

Data Visualization

First, we can take a look at the consumption during each year.

The seasonality within each year is quite predictable: demand peaks in winter and spring, then bottoms out during summer.

Let us zoom in on the plot further and investigate monthly, weekly and daily time periods.

Upon examining the plot, we can see a very visible pattern that repeats itself every 7 days. Therefore, it can be deduced that there is a weekly seasonality within the time series. We can further explore this pattern by looking at the days of the week during the time period.

From the result, we can see that the energy consumption level is at its height during weekdays, then drops significantly during weekends. This very much fits our expectation, since factories and workplaces, which follow such a schedule, are among the largest energy consumers.

In the following section, we will further investigate the time series by decomposing it into three components: trend, seasonality, and the remainder.

If we are looking at an additive decomposition, this can be mathematically formulated as $$ y_t = S_t + T_t + R_t, $$ where $y_t$ is the data, $S_t$ is the seasonal component, $T_t$ is the trend-cycle component, and $R_t$ is the remainder component, all at period $t$. Similarly, a multiplicative decomposition would be written as $$ y_t = S_t \times T_t \times R_t. $$

We first apply the classical decomposition method implemented in the seasonal_decompose function from the statsmodels package. For clarity, we only use one year of data.
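A minimal sketch of that call, assuming the cleaned series lives in energy["consumption"] with a daily DatetimeIndex (the year sliced out here is purely illustrative):

```python
from statsmodels.tsa.seasonal import seasonal_decompose

# Classical additive decomposition with a weekly period on one year of data
# (the year chosen here is illustrative).
one_year = energy["consumption"].loc["2019"]
decomposition = seasonal_decompose(one_year, model="additive", period=7)
decomposition.plot();
```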

Now, we can see the trend and the seasonality clearly. Concerning the residuals, they resemble white noise to some extent, indicating that the model is a reasonable fit. However, they also exhibit some abnormalities with clear patterns, especially in the summer period. This suggests that the residuals erroneously absorb part of the trend component - a "leakage". The problem mostly stems from the decomposition method itself, which is a rather old and simplistic technique.

Fortunately, we can overcome this by replacing it with a more sophisticated decomposition method called STL decomposition. The technique is implemented in the STL class in the same package.
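A sketch of the STL call on the same one-year slice; robust=True is an optional choice that downweights outliers such as holidays:

```python
from statsmodels.tsa.seasonal import STL

# STL decomposition with the same weekly period; robust=True downweights
# outliers such as holidays.
stl_result = STL(one_year, period=7, robust=True).fit()
stl_result.plot();
```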

We can see that the quality of the decomposition is much better. That said, many of the patterns we saw earlier are still very much present. This means we have some detrending and deseasonalizing to do in the next step - modelling.

Baseline Model

In every modeling process, there needs to be a baseline model whose results can be used to assess the more sophisticated models. Such a model should be easy to implement while providing an adequate result. In this case, a simple linear regression model suffices.

Conceptually, the energy consumption is the forecast variable $y$, while the time stamp is the predictor variable $x$ for this specific model. We can mathematically formulate the regression as follows: $$ y_t = \beta_0 + \beta_1 t + \epsilon_t, $$ where $y_t$ is the data and $t$ is the time index.

We start by splitting the data into a train set and a test set. To make the problem more challenging, we intentionally design our test set to overlap with the winter holiday season, which helps us judge more accurately how the models perform when there is a strong presence of consecutive outliers.

In order to view both the in-sample and out-of-sample predictions, we set aside a part of the time series that contains data from both the train set and the test set, called demo. We will plot on the demo set at the end of this section.
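A minimal sketch of the baseline fit, assuming train and test are the splits described above with a 'consumption' column (variable names are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regress consumption on a simple integer time index.
t_train = np.arange(len(train)).reshape(-1, 1)
t_test = np.arange(len(train), len(train) + len(test)).reshape(-1, 1)

baseline = LinearRegression().fit(t_train, train["consumption"].values)
pred = baseline.predict(t_test)

# Out-of-sample root mean squared error in kWh.
rmse = float(np.sqrt(np.mean((test["consumption"].values - pred) ** 2)))
print(f"Baseline RMSE: {rmse:,.0f} kWh")
```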

Now, we know that we need to do much better than a root mean squared error of 101,738 kWh!

SARIMAX

Model Identification

Moving on to predicting the time series using SARIMAX models. Theoretically, they are simply Seasonal ARIMA models combined with exogenous variables, which helps model external effects such as holidays. One shorthand notation for the SARIMA model is $$ \text{SARIMA } (p,d,q) \times (P,D,Q)_S, $$ where $p$ = non-seasonal autoregressive (AR) order, $d$ = non-seasonal differencing, $q$ = non-seasonal moving average (MA) order, $P$ = seasonal AR order, $D$ = seasonal differencing, $Q$ = seasonal MA order, and $S$ = time span of the repeating seasonal pattern. The procedure of identifying, estimating, and validating an ARIMA model for a time series is known as the Box-Jenkins method.

In order to apply the model, we first need to guarantee that the time series is stationary, or can be transformed into a stationary one. Simply put, a stationary time series is one whose properties do not depend on time: its mean, variance, and autocorrelation structure do not change over time.

We can see from the initial plot that the time series is clearly not stationary, with visible seasonality and possibly a trend. To be more precise, however, we can employ a formal stationarity test. Here we use the Augmented Dickey-Fuller test, one of the most commonly used test statistics.
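The test is available in statsmodels; a minimal sketch on the aggregated series:

```python
from statsmodels.tsa.stattools import adfuller

# Augmented Dickey-Fuller test; the null hypothesis is that the series has a
# unit root, i.e. that it is non-stationary.
adf_stat, p_value, *_ = adfuller(energy["consumption"])
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")
```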

We can see that the p-value is bigger than $\alpha = 0.05$. Therefore, the time series is unsurprisingly not stationary.

In order to make the data stationary, we need to remove the trend and seasonal components. We can do that by differencing the data to the desired order. Intuitively, this is analogous to taking the derivative of a function.
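With pandas, one non-seasonal difference followed by one weekly seasonal difference can be sketched as:

```python
# One non-seasonal difference followed by one seasonal difference at lag 7,
# corresponding to d = 1 and D = 1 with a weekly season.
differenced = energy["consumption"].diff().diff(7).dropna()
```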

Now, the data looks much more appropriate for SARIMAX modeling. We can check again with the Dickey-Fuller test.

Since the p-value is much smaller than $\alpha = 0.05$, the differenced time series is stationary. It follows that the suitable values for the terms $d$ and $D$ are both 1.

We now proceed to check the correlation structure within the transformed data using Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots. This step is crucial in order to identify the suitable order of the AR and MA terms, both non-seasonal and seasonal.
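The correlograms can be produced directly from statsmodels; a short sketch on the differenced series:

```python
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Correlograms of the differenced series; 40 lags cover several weekly cycles.
plot_acf(differenced, lags=40)
plot_pacf(differenced, lags=40);
```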

From the plots, we find a complex pattern with unexpected increases and decreases in the values of the functions. Therefore, we cannot rule out any particular orders for the non-seasonal AR and MA terms.

With regard to the seasonal terms, the plots show the expected behaviour, with spikes at lags that are multiples of seven, the number of days in a week. Given the tapering autocorrelation function and the sinusoidal shape of the partial autocorrelation function, it can be hypothesized that $P = 1$ and $Q = 2$. Let us check our guesses in the next section, where we apply the model.

Model Estimation

For the exogenous variables, one prime candidate is the dates of public holidays in Finland. We can procure such data using the holidays package.
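A sketch of building a binary holiday indicator with that package, assuming energy has a daily DatetimeIndex:

```python
import pandas as pd
import holidays

# Binary exogenous variable: 1 on Finnish public holidays, 0 otherwise.
fi_holidays = holidays.Finland()
holiday_exog = pd.Series(
    [int(day in fi_holidays) for day in energy.index],
    index=energy.index,
    name="holiday",
)
```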

Now we can finally apply the model. In order to expedite the process, we use grid search, which involves fitting many different parameter combinations and selecting the best one according to a criterion. For ARIMA models, the Akaike Information Criterion (AIC) is a widely used estimator, and we will use it for this task.
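A sketch of the grid search, with $d = D = 1$ and $S = 7$ fixed from the analysis above; the grid bounds, and the names train and holiday_exog, are assumptions:

```python
import itertools
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Fit a small grid of candidate orders and record each model's AIC.
candidates = []
for p, q, P, Q in itertools.product(range(3), range(3), range(2), range(3)):
    try:
        fit = SARIMAX(
            train["consumption"],
            exog=holiday_exog.loc[train.index],
            order=(p, 1, q),
            seasonal_order=(P, 1, Q, 7),
        ).fit(disp=False)
        candidates.append(((p, 1, q), (P, 1, Q, 7), fit.aic))
    except Exception:
        continue

candidates.sort(key=lambda c: c[2])  # lowest AIC first
```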

We can see that there is not a big variation between the AICs, i.e. most of the models are of similar quality. Let us list the estimated best models.

According to the AIC score, the best model to choose is $$ \text{SARIMA } (1,1,2) \times (1,1,2)_7. $$

Even though this model violates the parsimony principle, the margin is small enough, and we gain some flexibility in return.

Model Validation

To determine the goodness of fit of the model, we can examine its residuals against the standard assumption: the error term $\epsilon_t$ should be white noise drawn from a fixed distribution with a constant mean and variance.

We can check this by looking at various plots showing the distribution of the residuals. In addition, the Ljung-Box test can be used to do this more precisely.
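Both checks are available in statsmodels; a sketch, where best_fit stands for the fitted SARIMAX results object chosen above (the name is an assumption):

```python
from statsmodels.stats.diagnostic import acorr_ljungbox

# Standard residual diagnostic plots plus the Ljung-Box test.
best_fit.plot_diagnostics(figsize=(12, 8))
print(acorr_ljungbox(best_fit.resid, lags=list(range(1, 15))))
```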

From the plots, the residuals appear to be normally distributed around 0 - which is the condition we need - with slightly heavy tails. However, looking at the Ljung-Box statistics, we must reject the null hypothesis that the residuals are independently distributed, since the p-values are smaller than $\alpha = 0.05$ for some lags from 6 onwards.

Nevertheless, let us use this model to predict on the test set and judge it for ourselves.

The result looks very satisfactory! During the first two weeks of the month, the forecast values fit the actual ones well, possibly with the exception of the 6th of December - Finland's Independence Day.

With regard to the winter holiday season, the model unfortunately did not do as well. Contrary to the first two weeks, the only day whose value is predicted more accurately is the 27th, despite the addition of the "holiday" variable. This shows the challenge of forecasting during exceptional time periods. Nevertheless, the model still shows promise in forecasting when the data behaves predictably.

LSTM

Feature Engineering

By preparing the data well in advance, we can simplify the models and improve their performance to a great extent with comparatively little effort. This means coming up with better representations of the data or adding more useful information, i.e. engineering features.

Holidays Data Generation

To get started, the exogenous 'holiday' variable can be reused entirely from the SARIMAX model.

Cyclical feature encoding

On its own, the raw datetime information is not very useful for the learning task. Therefore, we need a way to encode our knowledge of the periodicity of the time series into the data. This can be achieved by modelling time on a circular scale, i.e. by mapping the datetime values through trigonometric functions.
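A sketch of such an encoding for the weekly and yearly cycles, assuming a feature DataFrame named features with a daily DatetimeIndex:

```python
import numpy as np

# Sine/cosine encoding so that, e.g., Sunday and Monday end up close together
# in feature space.
dow = features.index.dayofweek
doy = features.index.dayofyear

features["dow_sin"] = np.sin(2 * np.pi * dow / 7)
features["dow_cos"] = np.cos(2 * np.pi * dow / 7)
features["doy_sin"] = np.sin(2 * np.pi * doy / 365.25)
features["doy_cos"] = np.cos(2 * np.pi * doy / 365.25)
```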

Feature Vector Normalization

Since there are big discrepancies between the scales of the different features, we need to normalize them to a common scale, without distorting the differences in the ranges of values or losing information.
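One possible choice for this is min-max scaling; a sketch with scikit-learn:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Scale every feature to [0, 1] (MinMaxScaler is one possible choice here).
# Fitting the scaler on the training portion only would avoid test-set leakage.
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(features),
    index=features.index,
    columns=features.columns,
)
```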

Modeling

Certainly, we need to split the dataset into train, validation, and test sets first. The test set covers the same time period as the one used for the SARIMAX model, so we can compare the performance more easily.

In order for the LSTM to work, we need to create sliding windows from the data, so that each data point's features are a window of past values of a fixed size. In this project, the width of the feature window is 28 days, or 4 weeks, while there is only one label for each window.

A window of data
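A sketch of the windowing step, assuming train_scaled and test_scaled are the scaled train and test splits as 2-D arrays with consumption in column 0:

```python
import numpy as np

# Build sliding windows: the features of each sample are the previous 28 days
# and the label is the next day's consumption.
def make_windows(values, width=28, target_col=0):
    X, y = [], []
    for i in range(width, len(values)):
        X.append(values[i - width:i])
        y.append(values[i, target_col])
    return np.array(X), np.array(y)

X_train, y_train = make_windows(train_scaled)
X_test, y_test = make_windows(test_scaled)
```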

The only remaining auxiliary step is to convert the data into PyTorch tensors.
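For example, continuing the sketch above:

```python
import torch

# Convert the NumPy arrays to float32 tensors for PyTorch.
X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.float32).unsqueeze(-1)
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test, dtype=torch.float32).unsqueeze(-1)
```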

We can now finally get to building the neural network! Our model has 2 hidden LSTM layers with 512 neurons each. The input and output layers have 4 and 1 nodes, corresponding to the shapes of the feature and label vectors, respectively. The Adam optimizer is used, and the learning rate is adjusted dynamically by a scheduler.

The architecture is inspired by OmerS. Learning Pytorch LSTM Deep Learning with M5 Data.
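A minimal sketch of such a network in PyTorch; ReduceLROnPlateau is only one possible choice for the scheduler mentioned above:

```python
import torch
from torch import nn

# Two stacked LSTM layers with 512 hidden units each, an input of 4 features
# per time step, and a single output node.
class DemandLSTM(nn.Module):
    def __init__(self, n_features=4, hidden_size=512, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.lstm(x)          # out: (batch, window, hidden_size)
        return self.fc(out[:, -1, :])  # predict from the last time step

model = DemandLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
criterion = nn.MSELoss()
```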

So far, so good. Now we need to check the error of the model on the test set, with the predictions converted back to the original, unnormalized scale.

This is only a marginal improvement over the SARIMAX model. In fact, considering the far greater complexity of deep neural networks, the LSTM model arguably performs worse than the classical method. This is in agreement with the literature on time series forecasting, where deep learning methods have often failed to deliver on their promises.

Nevertheless, we can further examine how well the model fits the data by visualizing the results, starting with all the available data.

The forecasts appear to fit the observed values well, but worse than those of the SARIMAX, with consistent under-prediction of the peak values. Let us look at the predictions on the test set and compare them with those of the SARIMAX.

Despite a better fit at some time points, the model consistently under-predicts the values by a wide margin. Concerning the holiday period, the forecasts have the same defects as those of the SARIMAX model.

In conclusion, even though the presented models achieved reasonable accuracy, time series forecasting still poses many challenges. That said, this remains a very active research area, with much room for improvement in the future.

Further Readings