1. Introduction

Trains are one of the most important modes of transport in Belgium. Passengers complain about their experience on board because it is hard to find a seat and avoid crowded carriages. Crowding wastes both time and resources; for instance, a passenger may miss a train because it is too crowded to board. According to the report, the National Railway Company of Belgium has continued to lose profit in recent years. The Belgian government and the National Railway Company of Belgium intend to develop an app with a dedicated algorithm to predict the occupancy of each train. As developers, we were selected as one of the teams to help design this app.

The app will serve as a daily train information hub where passengers can look up occupancy, times, prices, origins, destinations, prediction accuracy and other historical data. It will also help the government and planners better understand the train system. The company could allocate vehicle types and resources and adjust prices based on occupancy to maximize profit. Passengers, in turn, gain more options and more flexible schedules when taking the train.

Below is the algorithm we developed to predict train occupancy. The work is organized into the following steps: data collection, data wrangling, exploratory analysis, spatial processing, feature engineering, and modeling and validation.

Below are the packages to be loaded, along with helper functions such as the k-nearest-neighbor (knn) distance function.
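
The helper functions in this section call into several packages. The following is a minimal sketch of the loading step, assuming the package list can be inferred from the calls that appear here (ggplot2 for the themes, dplyr for ntile and the pipeline verbs, tidyr for gather, tibble for rownames_to_column, and FNN for get.knnx). The list is an assumption, and the full analysis may need more packages.

```r
# Package list inferred from the functions used in this section (an assumption).
pkgs <- c("ggplot2", "dplyr", "tidyr", "tibble", "FNN")

# Check which packages are installed before attaching them.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]
if (length(not_installed) > 0) {
  message("Install first: ", paste(not_installed, collapse = ", "))
}

# Attach only the packages that are available.
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```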

mapTheme <- function(base_size = 10) {
  theme(
    text = element_text(color = "black"),
    plot.title = element_text(size = 12,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    axis.title = element_blank(),
    strip.background = element_rect(fill = "white", color = "white"),
    strip.text.x = element_text(size = 8),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank()
  )
}

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  axis.ticks=element_blank())
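
# Assign each value to one of five quantile bins, returned as a factor.
q5 <- function(variable) {as.factor(ntile(variable, 5))}

# Return quantile break labels for df[[variable]] as character strings.
# Without rnd the values are rounded to whole numbers first; with rnd = FALSE
# the breaks are formatted to 3 significant digits instead.
qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01,.2,.4,.6,.8), na.rm=T), digits = 3))
  }
}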
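
The break points that qBr converts into legend labels can be checked on a small example. This sketch uses only base R and a made-up data frame; the column name occupancy_pct and its values are hypothetical and are not taken from the project data.

```r
# Toy data frame; the column name `occupancy_pct` is hypothetical.
toy <- data.frame(occupancy_pct = c(0, 10, 20, 30, 40, NA))

# Same probabilities as qBr: the 1st percentile plus four quintile cut points.
probs <- c(.01, .2, .4, .6, .8)
breaks <- quantile(round(toy[["occupancy_pct"]], 0), probs, na.rm = TRUE)

# Character labels suitable for a map legend.
legend_labels <- as.character(breaks)
print(legend_labels)
```

The NA is dropped by na.rm = TRUE, so the breaks are computed from the five observed values only.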

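# Mean distance from each row of measureFrom to its k nearest rows of measureTo.
# Both inputs are coordinate matrices; the result has one row per measureFrom
# point and a single column, pointDistance.
nn_function <- function(measureFrom,measureTo,k) {
  
  nn <-   
    FNN::get.knnx(measureTo, measureFrom, k)$nn.dist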
  
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint)
  
  return(output)  
}
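
To make explicit what nn_function computes, here is a brute-force sketch in base R. It is not the FNN-based implementation above: it swaps in a plain distance calculation that is only practical for small inputs. The helper name knn_mean_dist and the toy coordinates are hypothetical.

```r
# Hypothetical helper: brute-force mean distance to the k nearest points.
knn_mean_dist <- function(measureFrom, measureTo, k) {
  measureFrom <- as.matrix(measureFrom)
  measureTo <- as.matrix(measureTo)
  apply(measureFrom, 1, function(p) {
    # Euclidean distance from point p to every row of measureTo
    d <- sqrt(rowSums(sweep(measureTo, 2, p)^2))
    # Average the k smallest distances
    mean(sort(d)[seq_len(k)])
  })
}

# One origin point and three target points (toy coordinates).
from <- matrix(c(0, 0), ncol = 2)
to <- matrix(c(3, 4,
               6, 8,
               0, 1), ncol = 2, byrow = TRUE)

# Distances are 5, 10 and 1; the two nearest are 1 and 5, so the mean is 3.
knn_mean_dist(from, to, k = 2)
```

In the analysis itself the FNN version is the better choice, because it avoids computing every pairwise distance.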

# Return the most frequent value(s) of x; ties return every tied value.
Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}
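
A short usage sketch shows how Modes behaves when values tie. The definition is repeated so the snippet runs on its own, and the input vectors are made up for illustration.

```r
# Definition repeated from above so the snippet runs on its own.
Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

# 2 and 3 both appear twice, so both are returned.
Modes(c(1, 2, 2, 3, 3))   # returns 2 3

# A single most frequent value.
Modes(c("a", "b", "a"))   # returns "a"
```

Because ties return more than one value, callers that need a single mode have to pick one explicitly, for example the first element.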

2. Data Collection

As the developers, we downloaded the data from a Kaggle dataset that includes a training CSV, a test CSV and a station location dataset. We set the appropriate coordinate reference system and then loaded the map of Belgium with the station locations. Some of the train stations are in Belgium and some are in other countries.

Load the train occupancy data and station locations.