Compiled on 2020-09-21 by Maroun Bou Sleiman.

1 Introduction

This project is a data-driven attempt to reconstruct the Lebanese energy sector using openly accessible data. I have never seen a day (for 32 years and counting) where this sector was able to provide consistent electricity 24/7 in the whole country - a primordial feat that the majority of countries achieve. I will not go into details into the economic and political reasons behind this incompetence as there are already many reports about the subject.

The (few) reports I read about the electricity sector in Lebanon explore at the trends in production and fossil fuel consumption from a global perspective. To my knowledge, there is no analysis of the experience from the end-user’s perspective. The EDL (Electricite du Liban) website contains information about hourly feeding information by electrical station and outlet, which I obtained (through automated scraping of their website).

The power grid is essentially split into three main components: the power plant (producer), the power station or electrical substation, and the outlets coming out of each substation. The EDL website has very scarce information about the power plants, and even less about the substations, and almost nothing about the outlets. For example, besides the name of these entities, we have very little information about their geographical positions and technical parameters. That’s really bad for a multi-billion dollar endeavor. So we have to work with what we get, and we’ll use data-science to infer other information.

1.1 Main questions

  • What data are available on the topic?
  • Is the available data sufficient and of good quality?
  • Can we get any meaningful insights based on the available data?
  • What data should be available in the future and in what form?

1.2 Main Data Sources

  • Electricite du Liban (EDL) website.
  • Central Administration for Statistics of the Presidency of the Countil of Ministers, or CAS.

1.3 Caveats

  • As you’ll see later, the data is sparse and of generally low quality. This reflects on the quality of the analysis. Any inference from this data should be taken with a grain of salt.

1.4 Setup

This notebook was run on Renku. Here’s the link to the project. The project is using the “R basic template 4.0” with a few differences. Extra debian packages installed (see Dockerfile), extra R packages(see install.R).

Data from the EDL website was fetched using a custom web crawler/scraper and using this command renku run ./R/edl_feeding_crawler.R. See README.rmd for more information.

Then there is a post-processing step renku run ./R/edl_feeding_postprocess.R. Here, some minor modifications are done to the data and a feeble attempt to geolocate these stations based on a simple search using the tidygeocoder package. I had to do this since, not surprisingly, I could not find a better resource for the Lebanese power grid.

2 Analysing the EDL feeding data

2.1 Loading the pre-processed data

First load some R packages and the data, then make sure that the date column (dateoffeeding) is recognized as a date field (using the lubridate package).

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(cowplot)) # for arranging and combining plots
suppressPackageStartupMessages(library(igraph))
suppressPackageStartupMessages(library(visNetwork))
suppressPackageStartupMessages(library(leaflet))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(plotly)) # for interactive plots
suppressPackageStartupMessages(library(readxl)) # for later
load("../data/edl_feeding_crawler_output_postprocessed.RData", verbose = T)
## Loading objects:
##   feeding_all
##   outlet_meta
##   stations_dt
feeding_all <- feeding_all[-which(grepl("^Error", feeding_all[,1])),]
feeding_all$dateoffeeding <- ymd(feeding_all$dateoffeeding)

2.2 General glance at the data

Some stats: we have 85 main stations, 1008 outlets, and 170554 data points from 3985 days from a temporal range of 4306 days.

Earliest data point is 2008-10-20, latest data point is 2020-08-04.

The following table shows the station metadata. This data was manually obtained by by examining the source code of the EDL website. The station_id and global_id were created by me.

DT::datatable(stations_dt, caption = "Station metadata")

And here is the outlet metadata.

DT::datatable(outlet_meta, caption = "Outlet metadata")

Here’s the distrubution of the number of outlets per substation:

outlet_meta_table <- table(outlet_meta$station_id)
g <- outlet_meta %>% group_by(station_id) %>% summarise(n_outlets = length(id)) %>%
  ggplot(aes(n_outlets)) +
  geom_histogram(binwidth = 1, color = "white", fill = "grey") +
  theme_cowplot() +
  xlab("Number of outlets per station") +
  ylab("Number of stations")

ggplotly(g)

And here’s how the data looks like. 0 = no electricity, 1 = electricity

DT::datatable(head(feeding_all, n=100), caption = "First 100 records of data")

For convenience, one can reshape the dataset to a long format (larger in size). Then look at a subset of the data.

timecols <- paste0("time", 0:23)
feeding_all_reshaped <- feeding_all %>% gather(time, power, paste0("time", 0:23) )
feeding_all_reshaped$time <- as.numeric(gsub("time", "",feeding_all_reshaped$time))
feeding_all_reshaped$dateoffeeding_lub <- as_datetime(ymd(feeding_all_reshaped$dateoffeeding))
hour(feeding_all_reshaped$dateoffeeding_lub) <- feeding_all_reshaped$time

set.seed(123)
random_subsample <-sample(unique(feeding_all_reshaped$feedername), 50)

feeding_all_reshaped[feeding_all_reshaped$feedername %in% random_subsample, ] %>% 
  ggplot(aes(x = dateoffeeding_lub, y = feedername)) +
  geom_tile(aes(fill = factor(power) )) +
  theme_cowplot()+
  scale_fill_manual(values = c("red", "green")) +
  geom_vline(xintercept = as_date("2020-08-04"), color = "black") +
  xlab("Date")

The black vertical line is the day the Beirut port explosion occured. As you can see above, the data is sparse (a lot of missing points). There are periods where no data is available at all. Maybe they coincide with vacations of the data-entry guy?

A closer look at more recent times 2020-today:

feeding_all_reshaped[(feeding_all_reshaped$feedername %in% random_subsample), ] %>% 
  filter(dateoffeeding_lub > "2020-01-01") %>%
  ggplot(aes(x = dateoffeeding_lub, y = feedername)) +
  geom_tile(aes(fill = factor(power) )) +
  theme_cowplot()+
  scale_fill_manual(values = c("red", "green")) +
  geom_vline(xintercept = as_date("2020-08-04"), color = "black") +
  xlab("Date")

Summary view of the global trend (blue):

summ <- feeding_all_reshaped %>% group_by(dateoffeeding, feederid) %>% summarise(hoursOn = sum(power == 1), hoursOff = sum(power == 0))
ggplot(summ, aes(x = dateoffeeding, y = hoursOn)) + 
  geom_line(aes(group = feederid), alpha = 0.3) +
  geom_smooth(fill = "white") +
  scale_y_continuous(breaks = seq(0,24,4), limits = c(0,24)) +
  geom_vline(xintercept = as_date("2020-08-04"), color = "red") +
  theme_cowplot() +
  ylab("Hours of power per day") +
  xlab("Date")

Zoom in on recent times, and include the real data in the background (in black):

summ %>% filter(dateoffeeding > "2020-01-01") %>% 
  ggplot(aes(x = dateoffeeding, y = hoursOn)) + 
  geom_line(aes(group = feederid), alpha = 0.3) +
  geom_smooth(fill = "white") +
  scale_y_continuous(breaks = seq(0,24,4), limits = c(0,24)) +
  geom_vline(xintercept = as_date("2020-08-04"), color = "red") +
  theme_cowplot() +
  ylab("Hours of power per day") +
  xlab("Date")

Get some derived statistics about the substations (ratio of outlets that are ON at any given time). Also, since there is a large variation in the number of outlets per substation, create arbitrary sub-groups of outlets to facilitate comparisons (Those that have 0-2 substations, those that have 3-6 etc…).

First a histogram to examine the grouping of substations. Based on the available data, a substation can have a different number of outlets at any given time (I cannot tell whether this is an artefact of sparsity or evolution in the network). I have therefore selected the maximum number of outlets for each substation in the histogram below:

summ_substation <- feeding_all_reshaped %>% group_by(dateoffeeding_lub, substationid) %>% summarise(ratio_outlets_on = sum(power == 1)/length(power), n_outlets = length(power)) %>% ungroup() %>% mutate(n_outlet_group = cut_width(n_outlets, width = 3, center = 1.5))
g <- summ_substation %>% group_by(substationid) %>% summarize(group = n_outlet_group[which.max(n_outlets)]) %>%
  ggplot(aes(group)) +
  geom_histogram(stat = "count") +
  theme_cowplot() +
  xlab("Substation group based on number of outlets") +
  ylab("Number of Substations")
ggplotly(g)

General (smoothed) view of the ratios per group:

summ_substation_grouped <- summ_substation %>% group_by(dateoffeeding_lub, n_outlet_group) %>% summarize(mean_ratio_outlets_on = mean(ratio_outlets_on, na.rm = T))

g <- ggplot(summ_substation_grouped, aes(x = dateoffeeding_lub, y = mean_ratio_outlets_on, color = n_outlet_group)) + 
  # geom_line(aes(group = feederid), alpha = 0.3) +
  # geom_line() +
  geom_smooth(se = FALSE) +
  ylim(0,1) +
  geom_vline(xintercept = as_date("2020-08-04"), color = "red") +
  theme_cowplot() +
  ylab("Ratio of outlets online") +
  xlab("Date")

ggplotly(g)

2.3 Section conclusions

It is clear that starting 2020, there was a drop in feeding. This is no surprise and is known, but there is little talk in the media of how much exactly this is occuring. Also, if we know more about the exact locations that are fed by each outlet, we can integrate that data with many other datasets and ask some interesting questions:

  • Are wealthier neighborhoods getting more electricity?
  • What are the environmental considerations?
  • What about private generators? How did their tariffs evolve? what is their impact on the environment as a function of EDL feeding?
  • What effect/cost does that have on the economy?
  • … many other questions

3 Integration with other data sources

How can we explain this dip? This section explores the EDL (Electricite du Liban) data that is made available by the Lebanese Central Administration of Statistics (CAS) in excel workbooks.

3.1 Data download

The EDL data sheet is downloaded from http://www.cas.gov.lb/index.php/thematic-time-series. This is the link to the file: http://www.cas.gov.lb/images/Excel/Thematic%20Time%20Series/EDL%201995-2020.xlsx

We do not know how often this sheet is updated, by whom, and what version it is. Also the sheet will be hard to parse since it is prepared in excel and in a non-machine-friendly format.

download.file("http://www.cas.gov.lb/images/Excel/Thematic%20Time%20Series/EDL%201995-2020.xlsx", "../data/EDL 1995-2020.xlsx")

3.2 Reading the monthly supply of energy.

This sheet (sheet 2) shows the monthly production and consumption of the network. The production is split by different EDL and non-EDL sources. Some notes and criticism:

  • General design of the sheet is bad for data analysis. This is ok for the purpose of manual human inspection, but this is not how a governmental service (in 2020) should report data.
  • No metadata explaining what each column means, how it was calculated, and under what assumptions.
  • While it’s good to be tri-lingual, mixing arabic and latin complicates life. There should be separate table generated (automatically not manually) per language.
  • Two columns have the same name but different values for production and consumption. Example: Energie thermique / Thermal Energy / طاقة حرارية
  • Some columns are derived from others (sums of two or more columns) but this is not clear in the table unless one checks the excel formulas.
  • In many cases, there is no distinction between a missing data point and a zero data point.
  • February is written wront throughout (Febraury)!
  • Note to self: Unit throughout is KWh
dt_in <- readxl::read_xlsx("../data/EDL 1995-2020.xlsx", sheet = "Mensuel  - Monthly - شهري", skip = 5)
# deal with tri-lingual column names

colnames(dt_in) <- sapply(strsplit(colnames(dt_in), "/"), function(x){
  if(length(x) ==2){
    gsub(" +", "", x[1])
  }else{
    gsub(" +", "", x[2])
  }
})
# deal with duplicate column names. First instance is production, second is consumption

for (i in 1:ncol(dt_in)){ 
  m <- which(colnames(dt_in) %in% colnames(dt_in)[i])
 if(length(m) == 3){ 
   colnames(dt_in)[m] <- paste0(c("production_", "purchases_", "consumption_"),colnames(dt_in)[m]  )
 }else if(length(m) == 2){
   wpc <- which(colnames(dt_in) == "Purchases")
   if(min(m) > wpc){
     colnames(dt_in)[m] <- paste0(c("purchases_", "consumption_"),colnames(dt_in)[m]  )
   }else{
     colnames(dt_in)[m] <- paste0(c("production_", "consumption_"),colnames(dt_in)[m]  )
   }
   
 }
}

colnames(dt_in) <- gsub("^Safa$", "production_Safa", colnames(dt_in))


#deal with merged year cells
for(i in 1:nrow(dt_in)){
  if(is.na(dt_in$Year[i])) {
    dt_in$Year[i] <- curYear
  }else{
    curYear <- dt_in$Year[i]
  }
}
# parse the month cell and derive the date
# remove all month rows with "Total"
rowRemove <- which(grepl("Total", dt_in$Month))
dt_in <- dt_in[-rowRemove,]
dt_in$Month <- sapply(strsplit(dt_in$Month, "/"), function(x){
  gsub(" +", "", x[2])
})
dt_in$Date <- lubridate::dmy(paste0("01 ", substr(dt_in$Month,1,3), " ", dt_in$Year))

# remove production and consumption values of 0 and put NA in year 2020
first_zero_date <- dt_in$Date[min(which(dt_in$EDLfactoriesProduction == 0))]
dt_in <- dt_in[dt_in$Date < first_zero_date,]

3.2.1 Examining the data

3.2.1.1 Global production and consumption

Here’s the production and consumption data as a function of time. It seems there are a few missing or wrong data points (for example production 2000-12-01).

g <- ggplot(dt_in, aes(x = Date)) +
  geom_line(aes(y = EDLfactoriesProduction + Purchases)) +
  geom_line(aes(y = Networkconsumption), color = "red") +
  theme_cowplot() +
  scale_x_date(date_breaks = "5 years", date_labels = "%m-%Y")
  # scale_x_date(date_breaks = "1 year")
ggplotly(g)

3.2.2 Energy source breakdown by level

There are three different “levels”:

  • Level 1: “EDLfactoriesProduction”, “Purchases”, “Networkconsumption”
  • Level 2: “production, HydraulicEnergy”, “production, ThermalEnergy”, “purchases, HydraulicEnergy”, “purchases, ThermalEnergy”, “consumption, HydraulicEnergy”, “consumption, ThermalEnergy”
  • Level 3: others…

3.2.2.1 Level 1

# cols <- grep("production_|purchases_", colnames(dt_in), value = T)
cols_level1 <- c("EDLfactoriesProduction", "Purchases")
cols_level2 <- c("production_HydraulicEnergy", "production_ThermalEnergy", "purchases_HydraulicEnergy", "purchases_ThermalEnergy")
cols_level3 <- c("production_Safa", "production_Qadisha","production_Steam", "production_Gas", "production_Mixed",
                 "production_Reciprocatingengines-Zouk+Jieh", "production_Naameh", "purchases_FromSyria", "purchases_FromKarpoweships", "purchases_FromEgypt")

dir_component <- function(x){
  x$direction <- gsub("_(.)+", "", x$element)
  x$component <- gsub("(.)+_", "", x$element)
return(x)
}

dt_breakdown_l1 <- dt_in[,c("Date", cols_level1)] %>% pivot_longer(cols = cols_level1, names_to = "element", values_to = "KWh") %>% dir_component()
dt_breakdown_l2 <- dt_in[,c("Date", cols_level2)] %>% pivot_longer(cols = cols_level2, names_to = "element", values_to = "KWh") %>% dir_component()
dt_breakdown_l3 <- dt_in[,c("Date", cols_level3)] %>% pivot_longer(cols = cols_level3, names_to = "element", values_to = "KWh") %>% dir_component()


g <- ggplot(dt_breakdown_l1, aes(x=Date, y=KWh, fill = element)) +
  geom_bar(stat = "identity", position = "stack", color = "black", lwd = 0.1) +
  # facet_wrap(~direction, ncol = 1) +
  theme_cowplot() +
  theme(legend.position = "none") +
  scale_x_date(date_breaks = "5 years", date_labels = "%m-%Y") +
  ggtitle("Level 1 - Breakdown")
ggplotly(g)

3.2.2.2 Level 2

g <- ggplot(dt_breakdown_l2, aes(x=Date, y=KWh, fill = element)) +
  geom_bar(stat = "identity", position = "stack", color = "black", lwd = 0.1) +
  # facet_wrap(~direction, ncol = 1) +
  theme_cowplot() +
  theme(legend.position = "none") +
  scale_x_date(date_breaks = "5 years", date_labels = "%m-%Y") +
  ggtitle("Level 2 - Breakdown")
ggplotly(g)

3.2.2.3 Level 3

g <- ggplot(dt_breakdown_l3, aes(x=Date, y=KWh, fill = element)) +
  geom_bar(stat = "identity", position = "stack", color = "black", lwd = 0.1) +
  # facet_wrap(~direction, ncol = 1) +
  theme_cowplot() +
  theme(legend.position = "none") +
  scale_x_date(date_breaks = "5 years", date_labels = "%m-%Y") +
  ggtitle("Level 3 - Breakdown")
ggplotly(g)

3.2.3 Focus on more recent times

Level 3 plot as before

g <- ggplot(dt_breakdown_l3[year(dt_breakdown_l3$Date) > 2015,], aes(x=Date, y=KWh, fill = element)) +
  geom_bar(stat = "identity", position = "stack", color = "black", lwd = 0.1) +
  # facet_wrap(~direction, ncol = 1) +
  theme_cowplot() +
  theme(legend.position = "none")
ggplotly(g)

Breakdown by whether power is purchased or produced.

g <- ggplot(dt_breakdown_l3[year(dt_breakdown_l3$Date) > 2015,], aes(x=Date, y=KWh, fill = element)) +
  geom_bar(stat = "identity", position = "stack", color = "black", lwd = 0.1) +
  facet_wrap(~direction, ncol = 1) +
  theme_cowplot() +
  theme(legend.position = "none")
ggplotly(g)

It seems that the most affected element is the Karpowerships. Here’s the percentage of total power from the Karpowerships:

g <- dt_breakdown_l3 %>% group_by(Date) %>% dplyr::summarize(proportion_kpc = 100*(KWh/sum(KWh,na.rm = T))[which(element == "purchases_FromKarpoweships")]) %>%
  filter(year(Date) > 2012) %>%
  ggplot(aes(x = Date, y = proportion_kpc)) +
  geom_bar(stat="identity") +
  theme_cowplot() +
  ylab("Percentage of power from KPC")
ggplotly(g)

3.3 Reading the monthly expenses (treasury transfers from the ministry of Finance)

This table is in the same excel file, sheet name is Transferts - Transfers - تحوي.It contains information from the Ministry of Finance concerning the transfer of money (in Billion Lebanese Pounds) from the treasury (I presume).

3.3.1 Read the table and parse

dt_exp <- readxl::read_xlsx("../data/EDL 1995-2020.xlsx", sheet = "Transferts - Transfers - تحويل",
                            skip = 4)
# deal with tri-lingual column names
colnames(dt_exp) <- sapply(strsplit(colnames(dt_exp), "/"), function(x){
  if(length(x) ==2){
    gsub(" +", "", x[1])
  }else{
    gsub(" +", "", x[2])
  }
})

# deal with tri-lingual first row

dt_exp$BillionLBP <- sapply(strsplit(dt_exp$BillionLBP, "/"), function(x){
  if(length(x) ==2){
    gsub(" +", "", x[1])
  }else if(length(x) == 3){
    gsub(" +", "", x[2])
  }else{
    x[1]
  }
})

# remove last row 
dt_exp <- dt_exp[-nrow(dt_exp),]

# remove the total row, and the detailed ones to keep the four constituents that make the total.

toremove <- c("EDLofwhich:", "Principalpayments","Interestpayments", "BDLGuaranteedLoanpayments", "KPC & SPC", "EGAS", "ReimbursementofKPCandSonatrach", "ReimbursementofKPCandSonatrach")

# remove the detailed rows (that are nested within the main ones)
dt_exp <- dt_exp[-which(dt_exp$BillionLBP %in% toremove),]


tmp_category <- dt_exp$BillionLBP
dt_exp$BillionLBP <- NULL

which_yrs <- which(grepl("^Total", colnames(dt_exp)))
yrs <- gsub("Total", "", colnames(dt_exp)[which_yrs])
names(which_yrs) <- yrs
which_no_yrs <- which(!grepl("^Total", colnames(dt_exp)))

colnames(dt_exp)[which_no_yrs] <- paste0("01-",
                                         substr(colnames(dt_exp)[which_no_yrs], 1,3),
                                         "-",sapply(which_no_yrs, function(x){
  names(which_yrs)[min(which(which_yrs > x))]
}))
dt_exp <- dt_exp[,-which_yrs]

dt_exp$category <- tmp_category
dt_exp <- reshape::melt(as.data.frame(dt_exp),id.vars = "category", variable_name = "Date")
dt_exp$Date <- lubridate::dmy(dt_exp$Date)

DT::datatable(dt_exp)

3.3.2 Notes

  • The fields are very badly explained and there is a hierarchy in them. If I understand correctly:

  • “EDL of which” : this is basically the total of (Debt Service/Eurobonds, Reimbursement for purchase of gas and fuel, Transfers to Electricity Syria). These constituents are in bold in the sheet, and this is the only way one could figure it out (for a non-expert).

  • “Debt Service” is the umbrella term for the debt-related terms: C-loans and Eurobonds (of which printipal and interest payments), Banque Du Liban-Guaranteed Loan payments. This may not be very accurate, and I feel that the formula for calculating the total has changed with time due to some restructuring.

  • “Reimbursement for purchase of gas and fuel” : KPS and SPC (Karpowership and Sonatrach), EGAS

  • Unit is Billion LBP

  • Last data point is December 2019

3.3.3 Some plots

g <- ggplot(dt_exp, aes(x = Date)) +
  geom_bar(aes(y = value, fill = category), position = "stack", stat = "identity") +
  scale_x_date(date_breaks = "5 years", date_labels = "%m-%Y") +
  theme_cowplot() +
  theme(legend.position = "none") +
  ylab("Billions LBP")
  # scale_x_date(date_breaks = "1 year")
ggplotly(g)

Was the KPC supply proportional to money transfers? Doesn’t seem to be the case, but doesn’t have to be the case either. There are so many other factors to take in mind. I am not familiar with the terms of the contracts.

g <- dt_exp %>% filter(category == "Reimbursementforpurchaseofgasandfuel") %>% dplyr::full_join(by = "Date", dt_in %>% select(Date, purchases_FromKarpoweships)) %>%
  mutate(Year = factor(year(Date))) %>% 
  ggplot(aes(x=value, y = purchases_FromKarpoweships, date = Date, color = Year )) +
  geom_point() +
  theme_cowplot() +
  xlab("LBP billions") +
  ylab("Purchased KWh from Karpowerships") +
  theme(legend.position = "none")

ggplotly(g)

3.4 Section conclusions

This effort is an attempt to figure things out based on available (governmental) data. The quality, quantity, form, and content of the data available on the CAS website, in addition to the website itself, clearly indicate that the CAS has poor data management practices. The whole service needs to be radically improved as this service is supposed to be the beacon of sound data management and statistical methodology.

Criticism:

  • Very little data is available in general
  • Data is served in multi-sheet excel workbooks that have been manually assembled.
  • There is little or no metadata about each column, how it was calculated, and based on what source data.
  • There is no traceability and versioning of data

Minimal requirements:

  • Data should be stored in an adequate repository. Many free solutions exist and the CAS has an IT department. There is no excuse not to do better.
  • Data should be formatted in such a way that is machine-readable and in standard formats that are suitable for long-term preservation (xlsx is not a suitable format).
  • Data should not be edited manually unless there is no other way.
  • Statistical methods should be explained.
  • Code (it seems there is no code at the moment) should be open source.

Extra requirements:

  • Data should be made available to the wider public in interactive data showcases and dashboards.

4 Geographical Mapping and Graphing

Here goes my attempt to put things on a map!

First need to get powerplant data. The file ./data/edl_power_plant_2020.xls was created by me from two major sources:

Substations and outlets were mapped in a very naive way using the tidygeocoder package. This was done simply through a string search using the name. So this is of extremely low quality.

Here’s how the data look like:

power_plants <- readxl::read_xls("../data/edl_power_plant_2020.xls", skip = 2)
coords <- do.call(rbind, strsplit(gsub("( )+|( )+", "", power_plants$gps_coordinates), ","))
colnames(coords) <- c("latitude", "longitude")
coords <- apply(coords, MAR = c(1,2), as.numeric)
power_plants <- cbind(power_plants, coords)
DT::datatable(power_plants, width = "100%")

Now let’s infer some connectivity and plot the map. At the moment, we can place very few substations and outlets on the map due to lack of data, unfortunately. Also some links are nonsensical. For example, one substation was placed in Iraq because its name is “Tikrit / تكريت” (a city in Iraq).

graph_df <- data.frame(from = outlet_meta$global_station_id, 
                       to = outlet_meta$global_id, 
                       station_id = outlet_meta$station_id, 
                       outlet_id = outlet_meta$id)
graph_df$from_lat <- stations_dt$latitude[match(outlet_meta$global_station_id, stations_dt$global_id)]
graph_df$from_lng <- stations_dt$longitude[match(outlet_meta$global_station_id, stations_dt$global_id)]

graph_df$to_lat <- outlet_meta$latitude
graph_df$to_lng <- outlet_meta$longitude

vertex_meta <- data.frame(name = unique(c(graph_df$from, graph_df$to)))
vertex_meta$type <- gsub("_(.)+", "", vertex_meta$name)
vertex_meta$internal_id <- gsub("(.)+_", "", vertex_meta$name)

vertex_meta <- split(vertex_meta, vertex_meta$type)
vertex_meta <- lapply(vertex_meta, function(x){
  if(x$type[1] == "station"){
    out <- x
    out$label <- stations_dt$station_name[match(x$name, stations_dt$global_id)]
    out
  }else{
    out <- x
    out$label <- outlet_meta$text[match(x$name, outlet_meta$global_id)]
    out
  }
})
vertex_meta <- do.call(rbind, vertex_meta)
vertex_meta$shape <- plyr::mapvalues(vertex_meta$type, c("outlet", "station"), c("circle", "square"))

# g <- igraph::graph_from_data_frame(graph_df, vertices = vertex_meta)
# 
# visNetwork::visIgraph(g) %>% visIgraphLayout(layout = "layout_nicely") %>%
#   visNodes(size = 10) %>%
#   visOptions(highlightNearest = list(enabled = T, hover = T),
#              nodesIdSelection = T)


lf <- leaflet(stations_dt) %>%
  addTiles() %>%
  leaflet::addAwesomeMarkers(lng = stations_dt$longitude, 
                             lat = stations_dt$latitude, 
                             label = stations_dt$station_name, 
                             icon = awesomeIcons(icon = "bolt", library = "fa"),
                             group = "Substations") %>% 
  leaflet::addCircles(lng = outlet_meta$longitude, lat = outlet_meta$latitude, label = outlet_meta$text,
                      group = "Outlets") %>% 
  leaflet::addAwesomeMarkers(lng = power_plants$longitude, 
                             lat = power_plants$latitude, 
                             label = power_plants$power_plant_group, 
                             icon = awesomeIcons(icon = ifelse(power_plants$type=="Thermal", "industry", "tint"),
                                                 library = rep("fa", nrow(power_plants)),
                                                 iconColor = rep("black", nrow(power_plants)),
                                                 markerColor = rep("orange", nrow(power_plants))),
                             group = "Power plants") %>%
  fitBounds(35, 33, 37, 35)

which_known_path <- which(rowSums(!is.na(graph_df)[,c("from_lat", "from_lng", "to_lat", "to_lng")]) == 4)
for(i in which_known_path){
  lf <- lf %>% leaflet::addPolylines(lng = c(graph_df$from_lng[i], graph_df$to_lng[i]), lat = c(graph_df$from_lat[i], graph_df$to_lat[i]), group = "Connections")
}
lf %>%
  addLayersControl(overlayGroups = c("Power plants", "Substations", "Outlets", "Connections"),
    options = layersControlOptions(collapsed = FALSE))

Orange markers are power plants (whether thermal or hydroelectric). Blue markers are substations. Small red markers are outlets. Note that the substations and outlet locations are guessed programmatically from their names because there is no available data about their locations. Also note that outlets are not supposed to be points, but paths like roads, but this data is not available either.

5 Global conclusions and TODO

This exercise led me to two main conclusions:

  • There is very little data out there. People have to become more data-oriented.
  • Data is lacking sufficient metadata and is not versioned.
  • The data is of low quality and not immediately useable.
  • Data, when available, is not given out in proper formats (file format and table format within the file)

To do list:

  • Talk to EDL and ask for more technical and geographical data
  • Get expenditure data by EDL
  • Attempt to integrate the Electricity Transmission Network from here
  • Look into fuel import data, also made available by CAS

6 R Session Information

sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] readxl_1.3.1     plotly_4.9.2.1   lubridate_1.7.8  leaflet_2.0.3   
##  [5] visNetwork_2.0.9 igraph_1.2.5     cowplot_1.1.0    forcats_0.5.0   
##  [9] stringr_1.4.0    dplyr_0.8.5      purrr_0.3.4      readr_1.3.1     
## [13] tidyr_1.0.3      tibble_3.0.1     ggplot2_3.3.0    tidyverse_1.3.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4.6      lattice_0.20-41   assertthat_0.2.1  digest_0.6.25    
##  [5] plyr_1.8.6        R6_2.4.1          cellranger_1.1.0  backports_1.1.7  
##  [9] reprex_0.3.0      evaluate_0.14     httr_1.4.1        pillar_1.4.4     
## [13] rlang_0.4.6       lazyeval_0.2.2    rstudioapi_0.11   data.table_1.12.8
## [17] Matrix_1.2-18     DT_0.13           rmarkdown_2.1     splines_4.0.0    
## [21] labeling_0.3      htmlwidgets_1.5.1 munsell_0.5.0     broom_0.5.6      
## [25] compiler_4.0.0    modelr_0.1.7      xfun_0.13         pkgconfig_2.0.3  
## [29] mgcv_1.8-31       htmltools_0.4.0   tidyselect_1.1.0  bookdown_0.19    
## [33] reshape_0.8.8     fansi_0.4.1       viridisLite_0.3.0 crayon_1.3.4     
## [37] dbplyr_1.4.3      withr_2.2.0       grid_4.0.0        nlme_3.1-147     
## [41] jsonlite_1.6.1    gtable_0.3.0      lifecycle_0.2.0   DBI_1.1.0        
## [45] magrittr_1.5      scales_1.1.1      cli_2.0.2         stringi_1.4.6    
## [49] farver_2.0.3      fs_1.4.1          xml2_1.3.2        ellipsis_0.3.1   
## [53] generics_0.0.2    vctrs_0.3.0       tools_4.0.0       glue_1.4.1       
## [57] hms_0.5.3         crosstalk_1.1.0.1 yaml_2.2.1        colorspace_1.4-1 
## [61] rvest_0.3.5       knitr_1.28        haven_2.2.0