A-musa Data-solution - Slide Presentation: Predicting Airline Passenger Satisfaction Using the Random Forest Algorithm

Outline

What is the Random Forest Algorithm?
Benefits of the Random Forest Algorithm
Predicting Airline Passenger Satisfaction using the Random Forest Algorithm

List of Packages used

library(readr)
library(DiagrammeR)
library(tidyverse)
library(report)
library(gtsummary)
#library(caret)
library(e1071)
library(sjPlot)
library(performance)
#library(ggstatsplot)
library(SmartEDA)
library(dlookr)
library(lme4)
library(lmerTest)
#library(neuralnet)
#library(DataExplorer)
#library(rpivotTable)
# library(ConfusionTableR)
# library(reshape)
# library(mlbench)
# library(Boruta)
# library(rpart)
# library(rpart.plot)
# library(randomForest)
#library(flextable)

Material and Methodology

Data: a secondary dataset obtained from the Kaggle data science platform
Algorithm: Random Forest classifier
The dataset was partitioned into a training set (70% of the data) and a testing set (the remaining 30%)

What is the Random Forest Algorithm?

Random Forest is an ensemble learning method for classification and regression
It combines the predictions of many decision trees (majority vote for classification, averaging for regression) to produce a more accurate and stable prediction
Each tree is built on a bootstrap sample of the rows (bagging), and only a random subset of the features is considered at each split
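The two sources of randomness described above can be sketched in base R. This is only an illustration on a hypothetical toy data frame, not the internals of the randomForest package; the names `toy`, `n_trees`, and `mtry` are assumptions made for this example.

```r
# Illustrative sketch of the two sources of randomness in a random forest.
# `toy` is a hypothetical data frame; it is not the airline data.
set.seed(42)
n <- 1000
toy <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n),
  y  = factor(sample(c("no", "yes"), n, replace = TRUE))
)
predictors <- setdiff(names(toy), "y")

# A common default for classification: consider about sqrt(p) predictors.
mtry <- floor(sqrt(length(predictors)))

n_trees <- 5
draws <- lapply(seq_len(n_trees), function(b) {
  # Bagging: each tree sees a bootstrap sample (rows drawn with replacement).
  boot_rows <- sample.int(n, size = n, replace = TRUE)
  # Feature randomness: only a random subset of predictors is considered.
  # (A real forest redraws this subset at every split, not once per tree.)
  split_vars <- sample(predictors, mtry)
  list(
    oob_fraction = mean(!(seq_len(n) %in% boot_rows)),
    split_vars   = split_vars
  )
})

# Fraction of rows left out of each bootstrap sample ("out-of-bag" rows).
oob <- sapply(draws, function(d) d$oob_fraction)
print(round(oob, 3))
```

Roughly a third of the rows are left out of each bootstrap sample, which is why a forest can estimate its own error on these out-of-bag rows without a separate validation set.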

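Benefits of the Random Forest Algorithm

High accuracy and robustness
Robust to outliers, and can accommodate missing values (the R randomForest package does so through imputation or by dropping incomplete rows)
Reduces variance compared to a single decision tree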

Predicting Airline Passenger Satisfaction

Collect data on airline passengers
Use the Random Forest Algorithm to build a model
Use the model to predict passenger satisfaction

Reading the Data File

library(readr)
Capd <- read_csv("Capd.csv")

Pre-processing

# Capd <- Capd %>% na.omit() # Remove rows with missing values (the result must be assigned)
# 
# # Convert columns to the appropriate data types
# Capd$Gender <- as.factor(Capd$Gender)
# Capd$customer_type <- as.factor(Capd$customer_type)
# Capd$type_of_travel <- as.factor(Capd$type_of_travel)
# Capd$customer_class <- as.factor(Capd$customer_class)
# Capd$flight_distance <- as.integer(Capd$flight_distance)
# Capd$inflight_wifi_service <- as.factor(Capd$inflight_wifi_service)
# Capd$ease_of_online_booking <- as.factor(Capd$ease_of_online_booking)
# Capd$satisfaction <- as.factor(Capd$satisfaction) # a classifier needs a factor response
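The same pre-processing steps can be written more compactly by converting several columns at once. The sketch below runs on a hypothetical three-row stand-in for `Capd`: the column names come from the slide, but the values and the names `capd_demo`, `capd_clean`, and `factor_cols` are invented for illustration.

```r
# Hypothetical stand-in for Capd; values are made up for this example.
capd_demo <- data.frame(
  Gender          = c("Female", "Male", NA),
  customer_type   = c("Loyal Customer", "disloyal Customer", "Loyal Customer"),
  flight_distance = c(460, 235, 1142),
  stringsAsFactors = FALSE
)

# na.omit() returns a new data frame, so the result has to be assigned.
capd_clean <- na.omit(capd_demo)

# Convert several character columns to factors in one step.
factor_cols <- c("Gender", "customer_type")
capd_clean[factor_cols] <- lapply(capd_clean[factor_cols], as.factor)

# Store distances as integers.
capd_clean$flight_distance <- as.integer(capd_clean$flight_distance)

str(capd_clean)
```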

Demographic Characteristics

Characteristic	N = 129,880¹
Gender
Female	65,899 (51%)
Male	63,981 (49%)
customer_type
disloyal Customer	23,780 (18%)
Loyal Customer	106,100 (82%)
customer_class
Business	62,160 (48%)
Eco	58,309 (45%)
Eco Plus	9,411 (7.2%)
¹ n (%)

Estimated Parameters

Capd %>% 
  select(age, flight_distance, cleanliness, departure_delay_in_minutes, arrival_delay_in_minutes) %>% 
  na.omit() %>% 
  report_parameters()

  - age: n = 129487, Mean = 39.43, SD = 15.12, Median = 40.00, MAD = 17.79, range: [7, 85], Skewness = -3.38e-03, Kurtosis = -0.72, 0% missing
  - flight_distance: n = 129487, Mean = 1190.21, SD = 997.56, Median = 844.00, MAD = 767.99, range: [31, 4983], Skewness = 1.11, Kurtosis = 0.27, 0% missing
  - cleanliness: n = 129487, Mean = 3.29, SD = 1.31, Median = 3.00, MAD = 1.48, range: [0, 5], Skewness = -0.30, Kurtosis = -1.01, 0% missing
  - departure_delay_in_minutes: n = 129487, Mean = 14.64, SD = 37.93, Median = 0.00, MAD = 0.00, range: [0, 1592], Skewness = 6.85, Kurtosis = 101.88, 0% missing
  - arrival_delay_in_minutes: n = 129487, Mean = 15.09, SD = 38.47, Median = 0.00, MAD = 0.00, range: [0, 1584], Skewness = 6.67, Kurtosis = 95.12, 0% missing

Cross-Tabulation of Gender by Customer Type

tab_xtab(var.row = Capd$Gender, 
         var.col = Capd$customer_type,
         show.row.prc = TRUE)

Gender	customer_type		Total
Gender	disloyal Customer	Loyal Customer	Total
Female	12843 19.5 %	53056 80.5 %	65899 100 %
Male	10937 17.1 %	53044 82.9 %	63981 100 %
Total	23780 18.3 %	106100 81.7 %	129880 100 %
χ²=124.313 · df=1 · φ=0.031 · p<0.001

Cross-Tabulation of Gender by Customer Class

tab_xtab(var.row = Capd$Gender, 
         var.col = Capd$customer_class,
         show.row.prc = TRUE)

Gender	customer_class			Total
Gender	Business	Eco	Eco Plus	Total
Female	31263 47.4 %	29670 45 %	4966 7.5 %	65899 100 %
Male	30897 48.3 %	28639 44.8 %	4445 6.9 %	63981 100 %
Total	62160 47.9 %	58309 44.9 %	9411 7.2 %	129880 100 %
χ²=20.908 · df=2 · Cramer's V=0.013 · p<0.001
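The effect sizes reported under the two cross-tabulations can be checked from the cell counts alone. The sketch below uses base R's `chisq.test()`; the helper `cramers_v()` and the object names `by_type` and `by_class` are written for this example and are not part of sjPlot.

```r
# Cell counts copied from the two cross-tabulations above.
by_type <- matrix(
  c(12843, 53056,
    10937, 53044),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"),
                  customer_type = c("disloyal Customer", "Loyal Customer"))
)
by_class <- matrix(
  c(31263, 29670, 4966,
    30897, 28639, 4445),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"),
                  customer_class = c("Business", "Eco", "Eco Plus"))
)

# Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1))).
# For a 2 x 2 table this is the phi coefficient.
# chisq.test() applies Yates' continuity correction to 2 x 2 tables by default.
cramers_v <- function(tab) {
  chi2 <- unname(chisq.test(tab)$statistic)
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

print(round(cramers_v(by_type), 3))
print(round(cramers_v(by_class), 3))
```

Both values are close to zero. With roughly 130,000 passengers even a very weak association yields a tiny p-value, so the effect size is a better guide here than the p-value: gender is statistically but not practically associated with customer type or class.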

Conceptual Model used

Feature Selection

# FS <- Boruta(satisfaction ~ ., data = Capd, doTrace = 2)

Data Partition

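# set.seed(1234) # Fix the seed so the random split is reproducible
# 
# # Partition the data into training and testing sets
# # Each row is assigned to group 1 (train) with probability 0.7 or group 2 (test) with probability 0.3
# Model <- sample(2, nrow(Capd), replace = TRUE, prob = c(0.7, 0.3))
# train <- Capd[Model == 1, ]
# test <- Capd[Model == 2, ]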
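Because the partition above assigns each row independently, the resulting split is only approximately 70/30. Where an exact split is preferred, one alternative is to sample row indices without replacement. This is a sketch on a hypothetical data frame (`demo`, `train_demo`, and `test_demo` are names invented here), not the author's procedure.

```r
set.seed(1234)

# Hypothetical stand-in for Capd with 1,000 rows.
demo <- data.frame(
  id = 1:1000,
  satisfaction = factor(sample(c("Not satisfied", "Satisfied"),
                               1000, replace = TRUE))
)

n <- nrow(demo)

# Draw exactly 70% of the row indices, without replacement.
train_rows <- sample.int(n, size = round(0.7 * n))

train_demo <- demo[train_rows, ]
test_demo  <- demo[-train_rows, ]   # every remaining row goes to the test set

c(train = nrow(train_demo), test = nrow(test_demo))
```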

Building the Model

# # Random Forest model
# set.seed(333)
# train <- as.data.frame(train) # convert the tibble to a plain data frame
# 
# # randomForest() fits a classifier when the response is a factor; it has no `method` argument
# rf23 <- randomForest(satisfaction ~ ., data = train, na.action = na.exclude)
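For readers who want to run the model-building step end to end, here is a minimal sketch using the randomForest package on synthetic data. The data frame `demo`, its simulated relationship between `cleanliness` and `satisfaction`, and the choice `ntree = 200` are all assumptions for illustration; none of it reproduces the airline results.

```r
library(randomForest)

set.seed(333)

# Synthetic stand-in for the airline data (values are simulated).
n <- 500
demo <- data.frame(
  age             = sample(18:80, n, replace = TRUE),
  flight_distance = sample(100:4000, n, replace = TRUE),
  cleanliness     = sample(0:5, n, replace = TRUE)
)
# The response must be a factor for randomForest() to fit a classifier.
demo$satisfaction <- factor(
  ifelse(demo$cleanliness + rnorm(n) > 3, "Satisfied", "Not satisfied")
)

# 70/30 split into training and testing rows.
train_rows <- sample.int(n, size = round(0.7 * n))
train_demo <- demo[train_rows, ]
test_demo  <- demo[-train_rows, ]

# Fit the forest; importance = TRUE also stores variable importance measures.
rf_demo <- randomForest(satisfaction ~ ., data = train_demo,
                        ntree = 200, importance = TRUE)

print(rf_demo)        # includes the out-of-bag error estimate
importance(rf_demo)   # which predictors the forest relied on

# Predicted classes for the held-out rows.
pred_demo <- predict(rf_demo, newdata = test_demo)
```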

Training set of the Model

# # Prediction & Confusion Matrix - Train (confusionMatrix() comes from the caret package)
# p <- predict(rf23, train)
# confusionMatrix(p, train$satisfaction)

Testing Set of the Model

# p2 <- predict(rf23, test)
# confusionMatrix(p2, test$satisfaction)
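The accuracy, sensitivity, and specificity that a confusion matrix summarises can be computed by hand, which makes clear what each figure measures. The sketch below uses ten hypothetical labels (`truth_demo` and `pred_demo` are invented for this example) and treats the first factor level as the positive class, which is also caret's default for two-class problems.

```r
lv <- c("Not satisfied", "Satisfied")

# Hypothetical true and predicted labels for ten passengers.
truth_demo <- factor(c("Satisfied", "Satisfied", "Not satisfied", "Not satisfied",
                       "Satisfied", "Not satisfied", "Not satisfied", "Satisfied",
                       "Not satisfied", "Not satisfied"), levels = lv)
pred_demo  <- factor(c("Satisfied", "Not satisfied", "Not satisfied", "Not satisfied",
                       "Satisfied", "Satisfied", "Not satisfied", "Satisfied",
                       "Not satisfied", "Not satisfied"), levels = lv)

# Rows are predictions, columns are the true classes.
cm <- table(Predicted = pred_demo, Actual = truth_demo)

positive <- lv[1]   # "Not satisfied"
negative <- lv[2]   # "Satisfied"

tp <- cm[positive, positive]   # predicted positive, truly positive
fn <- cm[negative, positive]   # predicted negative, truly positive
fp <- cm[positive, negative]   # predicted positive, truly negative
tn <- cm[negative, negative]   # predicted negative, truly negative

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)   # share of true positives that were found
specificity <- tn / (tn + fp)   # share of true negatives that were found

print(cm)
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

Which class counts as "positive" determines which figure is the sensitivity and which is the specificity, so it is worth stating the positive class alongside any reported values.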

Confusion Matrix for Training set

# # train_labels takes the model's predictions (p); passing the true labels twice would always show a perfect matrix
# ConfusionTableR::binary_visualiseR(train_labels = p,
#                                    truth_labels= train$satisfaction,
#                                    class_label1 = "Not satisfied",
#                                    class_label2 = "Satisfied",
#                                    quadrant_col1 = "#28ACB4",
#                                    quadrant_col2 = "#4397D2",
#                                    custom_title = "Confusion Matrix on Airline P",
#                                    text_col= "black")

Confusion Matrix for Testing set

# # train_labels takes the model's predictions (p2); passing the true labels twice would always show a perfect matrix
# ConfusionTableR::binary_visualiseR(train_labels = p2,
#                                    truth_labels= test$satisfaction,
#                                    class_label1 = "Not satisfied",
#                                    class_label2 = "Satisfied",
#                                    quadrant_col1 = "#28ACB4",
#                                    quadrant_col2 = "#4397D2",
#                                    custom_title = "Confusion Matrix on Airline P",
#                                    text_col= "black")

Conclusion

The model achieved an overall accuracy of 95%, with a sensitivity of 97% and a specificity of 93%.
Most customers were not satisfied with the airline service: the Random Forest algorithm classified 57% of participants as not satisfied with the airline's daily operations, especially in the areas of online ticket purchasing/booking and value-added services.

Recommendation

57% of participants reported not being satisfied with the airline service rendered. This is a substantial proportion that may significantly decrease the daily, weekly, or monthly revenue generated.
These findings therefore suggest that the airline industry should endeavor to improve its daily operational services, especially online ticket purchasing/booking and value-added services such as in-flight Wi-Fi, check-in service, and baggage handling.
Doing so should increase the volume of patronage and, as a result, boost market share and profitability.