Machine-Learning-for-Flight-Ticket-Pricing-DST-1

Project Report

Problem Statement

Can we use Machine Learning to help a customer decide the optimal time to purchase a flight ticket?

ABSTRACT

Airlines employ complex, closely guarded algorithms to vary flight ticket prices over time based on several factors, including seat availability, airline capacity, the price of oil, and seasonality. At any point in time, a customer looking to purchase a flight ticket can either buy now or wait in the hope that the price will drop. Because customers lack knowledge of these algorithms, they often default to purchasing a ticket as early as possible rather than trying to optimize their time of purchase. However, vast quantities of flight ticket price data are available on the Internet. Through this project, we hoped to use this data to help customers make their decisions. We created an airline ticket-buying agent that tries to time the purchase of a customer’s flight ticket so as to minimize the price paid. We selected a website from which to scrape data on Indian flights.

Prerequisites

You need to have the following software and libraries installed on your machine before running this project.

Python 3 with Anaconda: this installs IPython Notebook and most of the required libraries, such as sklearn, pandas, seaborn, matplotlib, numpy, scipy, and streamlit.

For more details, refer to the repo path: Web App Model/Flask/requirements.txt

Sample Data

Below is a small sample of our dataset:

Data Overview

Data Source --> Dataset/

Data points --> 330939 rows

Dataset date range --> April 2021 to May 2021

Dataset Attributes:

Price - flight price

departure_time - flight schedule time

arrival_time - arrival time of flight

Airline Cabin - There are three types:

E - Economy

PE - Premium Economy

B - Business

Dept_city - Departure city

Dept_date - Departure Date

arrival_city - Arrival city

stops - Number of stops

duration - Flight duration in minutes

weekday

dept_hours

Dept_flights_time

optimal_hours
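Because a separate model is trained for each cabin class later in this report, the rows have to be split on the cabin code first. The following is only a minimal sketch of that step in plain Python: the `cabin` key, the `split_by_cabin` helper and the sample rows are assumptions made for illustration and are not taken from the dataset or the project code.

```python
from collections import defaultdict

# Cabin codes as listed in the attribute description above.
CABIN_NAMES = {"E": "Economy", "PE": "Premium Economy", "B": "Business"}


def split_by_cabin(rows):
    """Group rows by cabin code, rejecting codes outside E / PE / B."""
    groups = defaultdict(list)
    for row in rows:
        code = row["cabin"]  # hypothetical column name
        if code not in CABIN_NAMES:
            raise ValueError(f"unknown cabin code: {code!r}")
        groups[code].append(row)
    return dict(groups)


# Made-up rows for illustration only; the prices are not from the dataset.
sample_rows = [
    {"cabin": "E", "Price": 4200, "stops": 0},
    {"cabin": "B", "Price": 21500, "stops": 1},
    {"cabin": "E", "Price": 3900, "stops": 1},
]

by_cabin = split_by_cabin(sample_rows)
for code, rows in by_cabin.items():
    print(CABIN_NAMES[code], len(rows))
```

Each group can then be passed to its own training run, which keeps the Economy, Premium Economy and Business models independent of one another.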

BLUEPRINT

The blueprint file follows this pattern: Data --> Data Processing --> EDA --> Training Model --> Test Model & Evaluation --> Model Prediction --> Model Deployment

Machine Learning Framework:

Assume a customer decides to purchase a ticket for a particular flight at time = X hours before departure. The optimal time to purchase the ticket, t_opt, is:

* in the range [X hours before dep., 4 hours before dep.]
* the time in that range at which the flight price is lowest
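The definition of the optimal purchase time can be made concrete with a small sketch. It assumes the price history of one flight is available as (hours before departure, price) pairs; the function name, the tie-breaking rule and the sample quotes are illustrative assumptions, not the project's actual labelling code.

```python
def optimal_purchase_time(quotes, x_hours, cutoff_hours=4):
    """Return the (hours_before_departure, price) pair with the lowest price.

    Only quotes inside the window [x_hours, cutoff_hours] before departure
    are considered, i.e. cutoff_hours <= hours <= x_hours.
    """
    window = [(h, p) for h, p in quotes if cutoff_hours <= h <= x_hours]
    if not window:
        raise ValueError("no quotes inside the purchase window")
    # Lowest price wins; on a tie, prefer the earliest moment (largest h).
    # The tie-breaking rule is an assumption made for this sketch.
    return min(window, key=lambda hp: (hp[1], -hp[0]))


# Made-up quotes: (hours before departure, price).
quotes = [(72, 5200), (48, 4800), (24, 4500), (12, 4700), (6, 4500), (2, 3900)]

best = optimal_purchase_time(quotes, x_hours=48)
print(best)  # (24, 4500)
```

The quote at 2 hours is cheaper but falls after the 4-hour cutoff, and the quote at 72 hours lies before the customer started looking, so neither is eligible.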

We used the LightGBM algorithm, first to predict the optimal time and then to predict the price, for each of the cabin classes. The architecture of each model is shown below:

Predict optimal time architecture for Economy Class

    import numpy as np
    from lightgbm import LGBMRegressor
    from sklearn.metrics import mean_squared_error, r2_score

    # Economy class: optimal-time model
    ec = LGBMRegressor(n_estimators=1200)
    # ec = RandomForestRegressor()
    ec.fit(x_train, y_train)
    pred = ec.predict(x_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)
    print("RMSE : % f" % (rmse))
    print("R2 : % f" % (r2))

RMSE: 0.00000

R2: 1.0000

Predict Price architecture for Economy Class

    # Economy class: price model
    ec_price = LGBMRegressor(n_estimators=1200)
    ec_price.fit(x_train, y_train)
    pred = ec_price.predict(x_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)
    print("RMSE : % f" % (rmse))
    print("R2 : % f" % (r2))

RMSE: 0.141890

R2: 0.869850

Predict Optimal Time architecture for Business Class

    # Business class: optimal-time model
    bs = LGBMRegressor(n_estimators=1000)
    # bs = RandomForestRegressor(n_estimators=100)
    bs.fit(x_train, y_train)
    pred = bs.predict(x_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)
    print("RMSE : % f" % (rmse))
    print("R2 : % f" % (r2))

RMSE: 0.0000

R2: 1.00000

Predict Price architecture for Business Class

    # Business class: price model
    bs_price = LGBMRegressor(n_estimators=1000)
    # bs_price = RandomForestRegressor(n_estimators=100)
    bs_price.fit(x_train, y_train)
    pred = bs_price.predict(x_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)
    print("RMSE : % f" % (rmse))
    print("R2 : % f" % (r2))

RMSE: 0.121185

R2: 0.899513

Predict Optimal Time architecture for Premium Economy Class

    # Premium Economy class: optimal-time model
    pe = LGBMRegressor(n_estimators=1500)
    # pe = CatBoostRegressor()
    # pe = RandomForestRegressor(n_estimators=100)
    pe.fit(x_train, y_train)
    pred = pe.predict(x_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)
    print("RMSE : % f" % (rmse))
    print("R2 : % f" % (r2))

RMSE: 0.00000

R2: 1.00000

Predict Price architecture for Premium Economy Class

    # Premium Economy class: price model
    # Named pe_price so it does not overwrite the optimal-time model `pe`
    pe_price = LGBMRegressor(n_estimators=1500)
    # pe_price = CatBoostRegressor()
    # pe_price = RandomForestRegressor(n_estimators=100)
    pe_price.fit(x_train, y_train)
    pred = pe_price.predict(x_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    r2 = r2_score(y_test, pred)
    print("RMSE : % f" % (rmse))
    print("R2 : % f" % (r2))

RMSE: 0.093673

R2: 0.839008
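For readers who want to check what the RMSE and R2 figures above measure, the two metrics can be written out in plain Python. This is only a sketch with made-up numbers; the project itself computes them with sklearn's `mean_squared_error` and `r2_score`, as shown in the training code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions, not project data.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]

print("RMSE : % f" % rmse(y_true, y_pred))
print("R2 : % f" % r2(y_true, y_pred))
```

An RMSE of 0 together with an R2 of 1 means every test prediction matched its target exactly, while lower R2 values indicate that part of the variance in the target is left unexplained.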

Final Model output on WebApp

Using Flask Heroku Web App


Deployment Steps :

There are two ways to deploy a model on Heroku: through the Heroku CLI or through GitHub. We chose deployment through GitHub.

Step 1 : Create an account on heroku.com

Step 2 : Upload all files to GitHub

Step 3 : Set Deployment method --> GitHub

Step 4 : Connect the app to GitHub

Step 5 : Under Manual deploy, deploy the current state of a branch to this app; the branch should be master.

Step 6 : Resolve any package errors that occur and test your Public URL

Flask Code : Web App Model/Flask/

Public URL : https://prediction-price-for-flight.herokuapp.com/

Streamlit Frontend

The results template is available here: https://docs.google.com/presentation/d/15HfriKFJ5acUQJ1qqCTX2-4JDUT5InTOfC6PH3KF9EE/edit#slide=id.gb69d85bd22_0_12

Demo:

WhatsApp.Video.2021-06-01.at.00.25.23.mp4

Team Members :

We used the RandomForestRegressor algorithm, first to predict the optimal time and then to predict the price. The architecture of each model is shown below:

Predict optimal time architecture

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed: 15.4min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed: 19.7min finished
RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=RandomForestRegressor(bootstrap=True,
                                                   ccp_alpha=0.0,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   max_samples=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators=100,
                                                   n_jobs=None, oob_score=False,
                                                   random_state=None, verbose=0,
                                                   warm_start=False),
                   iid='deprecated', n_iter=10, n_jobs=-1,
                   param_distributions={'max_depth': [5, 10, 15, 20, 50],
                                        'min_samples_split': [2, 3, 5, 10]},
                   pre_dispatch='2*n_jobs', random_state=None, refit=True,
                   return_train_score=False, scoring=None, verbose=2)

Selected best_params_ after hyperparameter tuning : {'min_samples_split': 5, 'max_depth': 20}

Accuracy = 0.9017739133612731
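The "10 candidates, totalling 50 fits" line in the search log follows directly from the search settings. The sketch below reproduces that arithmetic with the standard library only; it mimics how a randomized search draws candidates from list-valued parameters (without replacement) but does not train any model, and the seed is an arbitrary choice for illustration.

```python
import itertools
import random

# Search space and settings as printed in the RandomizedSearchCV log above.
param_distributions = {
    "max_depth": [5, 10, 15, 20, 50],
    "min_samples_split": [2, 3, 5, 10],
}
n_iter = 10  # candidates sampled
cv = 5       # folds per candidate

# Full grid of combinations: 5 * 4 = 20.
names = list(param_distributions)
grid = [
    dict(zip(names, values))
    for values in itertools.product(*param_distributions.values())
]

# Draw n_iter distinct candidates from the grid.
rng = random.Random(0)  # arbitrary seed, for reproducibility of the sketch
candidates = rng.sample(grid, n_iter)

total_fits = len(candidates) * cv
print(len(grid), len(candidates), total_fits)  # 20 10 50
```

Only half of the 20 possible combinations are evaluated, so the reported best_params_ is the best of the sampled candidates rather than of the whole grid.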

Predict price architecture

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed: 14.5min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed: 19.6min finished
RandomizedSearchCV(cv=5, error_score=nan,
                   estimator=RandomForestRegressor(bootstrap=True,
                                                   ccp_alpha=0.0,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   max_samples=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators=100,
                                                   n_jobs=None, oob_score=False,
                                                   random_state=None, verbose=0,
                                                   warm_start=False),
                   iid='deprecated', n_iter=10, n_jobs=-1,
                   param_distributions={'max_depth': [5, 10, 15, 20, 50],
                                        'min_samples_split': [2, 3, 5, 10]},
                   pre_dispatch='2*n_jobs', random_state=None, refit=True,
                   return_train_score=False, scoring=None, verbose=2)

Selected best_params_ after hyperparameter tuning : {'min_samples_split': 5, 'max_depth': 20}

Accuracy = 0.9351213338653643

Final Model output on WebApp

Using Flask Heroku Web App

Deployment Steps :

There are two ways to deploy a model on Heroku: through the Heroku CLI or through GitHub. We chose deployment through GitHub.

Step 1 : Create an account on heroku.com

Step 2 : Upload all files to GitHub

Step 3 : Set Deployment method --> GitHub

Step 4 : Connect the app to GitHub

Step 5 : Under Manual deploy, deploy the current state of a branch to this app; the branch should be master.

Step 6 : Resolve any package errors that occur and test your Public URL

Flask Code : Web App Model/Flask/

Public URL : https://mlflightpred.herokuapp.com/

Demo :

Screen_Recording_20210527-195935_Chrome.mp4

Steps that we performed:

Tools used:

* Python
* PyCharm
* Jupyter Notebook
* Google Colab
* Databricks
* Streamlit
* Flask
* GitHub
* Git Bash
* Sublime Text

Libraries used:

* Pandas
* NumPy
* SciPy
* sklearn
* lightgbm
* Boosting
* selenium
* Matplotlib
* Seaborn
* Plotly
* Cufflinks
Commands that we used for deployment:

git init
git add .
git status
git commit -m "First commit"
git status

heroku create
git remote -v
git push origin master

heroku logs --tail

Procfile:

web: sh setup.sh && streamlit run gh.py

setup.sh:

mkdir -p ~/.streamlit/

echo "\
[server]\n\
headless = true\n\
port = $PORT\n\
enableCORS = false\n\
\n\
" > ~/.streamlit/config.toml

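For readers less familiar with shell quoting, the file that setup.sh writes can be sketched in Python. This is an illustrative equivalent only, not part of the deployment: the `render_streamlit_config` helper is hypothetical, and the fallback port 8501 (Streamlit's default) is an assumption for running the sketch locally, where the PORT variable that Heroku provides is not set.

```python
import os


def render_streamlit_config(port: str) -> str:
    """Build the text that setup.sh writes to ~/.streamlit/config.toml."""
    return (
        "[server]\n"
        "headless = true\n"
        f"port = {port}\n"
        "enableCORS = false\n"
        "\n"
    )


# On Heroku, PORT is set by the platform; the fallback is for local runs only.
port = os.environ.get("PORT", "8501")
config_text = render_streamlit_config(port)
print(config_text)
```

The important line is `port`: the app has to listen on the port the platform assigns, which is why the value is read from the environment instead of being hard-coded.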
Author

-Yasin Shah

DECLARATION

A project report on the Machine-Learning-for-Flight-Ticket-Pricing project, successfully submitted by
