Homework 2&3

Part A

In this part, we use the consumption values from 168 and 48 hours earlier as naive predictions for the next day's consumption.

library("data.table")
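# Read the hourly real-time consumption export
consumption = fread("GercekZamanliTuketim-01012016-19052020.csv")
# Rename the third column (the consumption figures) to 'value'
setnames(consumption,names(consumption)[3],'value')
# Parse the date (Tarih) and take the hour from the time (Saat) column
consumption[,date:=as.Date(Tarih,'%d.%m.%Y')]
consumption[,hour:=as.numeric(substr(Saat,1,2))]
consumption=consumption[,list(date,hour,value)]
# Values use '.' as thousands separator and ',' as decimal mark
consumption[,value:=gsub(".", "",value, fixed = TRUE)]
consumption[,value:=as.numeric(gsub(",", ".",value, fixed = TRUE))]
head(consumption)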

##          date hour    value
## 1: 2016-01-01    0 26277.24
## 2: 2016-01-01    1 24991.82
## 3: 2016-01-01    2 23532.61
## 4: 2016-01-01    3 22464.78
## 5: 2016-01-01    4 22002.91
## 6: 2016-01-01    5 21957.08

We shift the series by 48 and 168 hours to obtain the two predictions, then drop the rows with missing (NA) values that the shift creates at the start of the series.

# Shift data and clear N/A
consumption[,lag_48:=shift(value,48)]
consumption[,lag_168:=shift(value,168)]
full_consumption = consumption[complete.cases(consumption)]
head(full_consumption)

##          date hour    value   lag_48  lag_168
## 1: 2016-01-08    0 28602.02 29189.27 26277.24
## 2: 2016-01-08    1 27112.37 27614.02 24991.82
## 3: 2016-01-08    2 25975.34 26578.97 23532.61
## 4: 2016-01-08    3 25315.55 25719.19 22464.78
## 5: 2016-01-08    4 25128.15 25864.63 22002.91
## 6: 2016-01-08    5 25356.22 25918.59 21957.08
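
To make the shifting step concrete, here is a minimal sketch on a toy table. The names `toy`, `lag_2` and `toy_complete` and the six values are made up for illustration and are not part of the homework data; the lag is 2 instead of 48 or 168 only to keep the table small.

```r
library(data.table)

# Toy hourly series; `toy` and `lag_2` are illustrative names only
toy <- data.table(hour = 0:5, value = c(10, 12, 11, 13, 15, 14))

# shift() defaults to type = "lag": row i receives the value from row i - 2
toy[, lag_2 := shift(value, 2)]

# The first 2 rows have no earlier value to copy, so lag_2 is NA there;
# complete.cases() keeps only the rows without any NA
toy_complete <- toy[complete.cases(toy)]
print(toy_complete)
```

The same mechanism explains why `full_consumption` starts on 2016-01-08: the first 168 rows have no value from one week earlier and are dropped.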

We define the test period as all dates from 1 March 2020 onward (inclusive). As summary statistics, we report the Mean Absolute Percentage Error (MAPE) together with the 0.10, 0.25, 0.50, 0.75 and 0.90 quantiles of the absolute percentage errors (APE), to better understand how the errors are distributed.
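
Written out, the two error measures take the following form. The symbols are our own notation: y_t is the actual consumption at hour t, \hat{y}_t is the lagged value used as its prediction, and n is the number of hours in the test period.

```latex
\mathrm{APE}_t = 100 \times \frac{|y_t - \hat{y}_t|}{|y_t|},
\qquad
\mathrm{MAPE} = \frac{1}{n} \sum_{t=1}^{n} \mathrm{APE}_t
```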

# Filter "test period" & calculate APE and MAPE for lag_48 and lag_168
test_period = full_consumption[date >= as.Date('2020-03-01')]
test_period[, APE_48:=abs(value-lag_48)/abs(value)*100]
test_period[, APE_168:=abs(value-lag_168)/abs(value)*100]
MAPE_48 = mean(test_period$APE_48)
MAPE_168 = mean(test_period$APE_168)

# Display summary statistics
summary(test_period$APE_48)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00083  1.82062  6.15767  9.43840 12.79949 56.26744

summary(test_period$APE_168)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00399  2.01333  4.57090  5.89412  8.49889 36.98680

quantile(test_period$APE_48, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))

##        10%        25%        50%        75%        90% 
##  0.7104752  1.8206243  6.1576726 12.7994874 24.9016425

quantile(test_period$APE_168, probs = c(0.1, 0.25, 0.5, 0.75, 0.9))

##        10%        25%        50%        75%        90% 
##  0.7727876  2.0133265  4.5708979  8.4988936 12.6391592

To compare the APE values of the two approaches visually, we draw a boxplot.

# Plot a boxplot for APE_48 and APE_168
boxplot(test_period$APE_48, test_period$APE_168, names=c("APE_48","APE_168"))

From the boxplot, we can see that the interquartile range for the 7-day (168-hour) lag is narrower, so its errors are less dispersed. It also has a smaller median error (4.57% versus 6.16%) and a lower MAPE (5.89% versus 9.44%), so it is the more accurate of the two. The absolute percentage error for the 2-day lag also has more outliers than that of the 7-day lag. We observe a high number of outliers: religious and other holidays affect electricity consumption, so these naive predictions may not account for such dates. We conclude that the consumption at the same hour of the previous week is most likely a better forecast than the consumption from two days earlier.
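
The outlier comparison read off the boxplot could also be checked numerically. The following is a minimal sketch, assuming the default 1.5 × IQR whisker rule that `boxplot()` applies; `count_outliers` is a hypothetical helper and the two vectors are toy values, not the homework data.

```r
# Hypothetical helper: count the points beyond the 1.5 * IQR whiskers,
# i.e. the points that boxplot() draws individually by default
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Toy error vectors for illustration only
ape_a <- c(1, 2, 3, 4, 100)
ape_b <- c(1, 2, 3, 4, 5)

count_outliers(ape_a)
count_outliers(ape_b)
```

On the data above, the same helper would be called as `count_outliers(test_period$APE_48)` and `count_outliers(test_period$APE_168)`.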