Machine Learning and Big Data in Finance
ICM317 - Machine Learning and Big Data in Finance
Introduction:
This is Part-1 of your project and accounts for 20% of the overall ICM317 assessment.
The project covers material from Lectures 1 to 10 of ICM317. In addition, you need to
complete Part-2 of the project, which accounts for an additional 20% of the overall
ICM317 assessment; details of Part-2 are below. Part-2 will cover material from
Lectures 11 to 20 of ICM317. The two parts of the project are completely independent
of each other. You should work alone on this project and make your own individual
submission.
Project Background: You work in the Debt Origination team at Hyper Big Bank. Your
team helps companies to raise money by finding buyers for Commercial Bonds.
For those unfamiliar with Commercial Bonds, they work like this. Let’s say a company
called Superstar Manufacturing wants to open a new factory and they need to raise
money for the project. Superstar Manufacturing could approach the Debt Origination
team at Hyper Big Bank to create a Bond Issuance. For example, Hyper Big Bank could
work with Superstar Manufacturing (The Bond Issuer) to create 5-year Bonds, where each
Bond has a Face Value of $100 and pays a Coupon of $4 per year.
Hyper Big Bank will then try to find buyers for the Bonds, and the money raised (minus a
fee from Hyper Big Bank) will be passed to Superstar Manufacturing (The Bond Issuer).
Over the course of the next 5 years, Superstar Manufacturing will pay $4 per year to
each Bond Holder for each Bond that they own; this is equivalent to Interest Paid.
At the end of 5 years, Superstar Manufacturing will pay $100 (The Face Value) to each
Bond Holder for each Bond that they own. The Bond Buyer bears the risk that Superstar
Manufacturing will not be able to maintain payments, in which case the Bond Buyer may
lose money. For this reason, Bond Buyers prefer some Bonds over
others depending on the level of the Coupon ($4 per year in this case), the initial Bond
Price and the Bond Buyer's perceived probability of the Bond Issuer defaulting.
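The payment schedule described above can be written out as a short sketch. The cash flows come directly from the example (5 years, $100 Face Value, $4 Coupon). The flat discount rates of 4% and 6% are assumptions added purely for illustration, to show how the value a Bond Buyer places on the same payments falls as the return they require rises.

```python
# Cash flows for the example Bond in the text: 5 years, $100 Face Value, $4 annual Coupon.
face_value = 100.0
coupon = 4.0
years = 5

# One payment per year; the final payment also returns the Face Value.
cash_flows = [coupon] * (years - 1) + [coupon + face_value]


def present_value(flows, annual_rate):
    """Discount year-end cash flows at a flat annual rate (a simplifying assumption)."""
    return sum(cf / (1 + annual_rate) ** t for t, cf in enumerate(flows, start=1))


total_received = sum(cash_flows)                  # undiscounted total paid to the Bond Holder
price_at_4pct = present_value(cash_flows, 0.04)   # required return equal to the Coupon rate
price_at_6pct = present_value(cash_flows, 0.06)   # a more demanding (assumed) Bond Buyer

print("cash flows:", cash_flows)
print("total received:", total_received)
print("value at 4%:", round(price_at_4pct, 2))
print("value at 6%:", round(price_at_6pct, 2))
```

When the assumed discount rate equals the Coupon rate, the discounted value comes back to the Face Value; a Bond Buyer who demands more (for example because they see a higher chance of default) would only pay less than that for the same Bond.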
The Debt Origination team is working on Four Bond issuances for Four Companies; we
will call the Four Bonds [BondA, BondB, BondC, BondD]. Sales of BondA, BondB and
BondC are going very well; however, there is little interest from clients of Hyper Big
Bank in buying BondD.
The Debt Origination Team Manager has asked you, as the Data Science Specialist in
the team, to work on a targeted pitch campaign to increase sales of BondD. The idea is
to reach out to some of the buyers of the other Bonds ([BondA, BondB, BondC]) to encourage
them to also buy BondD. The Debt Origination team has so far collected Data from
2,000 Customers, each of whom has purchased one of [BondA, BondB, BondC] and
some of whom have also purchased BondD. The task for you is to build a Machine
Learning model to predict which customers who have purchased one of [BondA,
BondB, BondC] will also purchase BondD. Your working model can then
be used for a targeted pitch campaign to select future buyers of one of [BondA, BondB,
BondC] to encourage them to also purchase BondD.
The model you will build is an example of a Cross-Selling Model. We know some
customers who buy one of [BondA, BondB, BondC] have also bought BondD. There
may be some Features of the Customer (Original Bond Purchase, Location, Wealth,
Risk Appetite etc.) that make them more likely to buy certain combinations of the
available Bonds. Building a Machine Learning model is a highly efficient way of finding
out which existing customers we should be trying to cross-sell BondD to.
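As a minimal illustration of the cross-selling idea, the sketch below compares how often customers in each original-purchase group also bought BondD. The records are made up for the example and are not taken from the project Data Set; a Machine Learning model generalises this kind of comparison across many Features at once.

```python
from collections import defaultdict

# Made-up (original_bond, bought_bond_d) records, for illustration only.
records = [
    ("BondA", 1), ("BondA", 0), ("BondA", 1), ("BondA", 1),
    ("BondB", 0), ("BondB", 0), ("BondB", 1),
    ("BondC", 0), ("BondC", 1), ("BondC", 0),
]

# bond -> [number who also bought BondD, number of customers]
counts = defaultdict(lambda: [0, 0])
for bond, bought_d in records:
    counts[bond][0] += bought_d
    counts[bond][1] += 1

# Share of each group that also bought BondD.
purchase_rate = {bond: buyers / total for bond, (buyers, total) in counts.items()}

for bond, rate in sorted(purchase_rate.items()):
    print(f"{bond}: {rate:.0%} of customers also bought BondD")
```

With only a handful of rows per group, differences like these can easily be noise, so on real Data the group sizes should be reported alongside the rates.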
The Data Set is provided to you in a CSV file that can be opened in Excel. The customer information has
been anonymized to protect customer privacy. You are provided with the customer
information in 12 columns (each a Feature) which are labelled [Feat0, Feat1, Feat2, …
Feat11]. Of the 2,000 available rows, only 1,500 have been provided to you. An
Independent Model Validation Team (Dr Mininder Sethi) has set aside the remaining
500 rows for its own testing of the model and of the information that you provide.
Project Questions: The Debt Origination Team Manager would like to know if you can
build a Machine Learning Model to predict if a customer who has purchased one of
[BondA, BondB, BondC] will also buy BondD. This is a Classification Problem with two
classes [Class 0=No (Not a Buyer of BondD), Class 1=Yes (Is a Buyer of BondD)]. In
particular, the Debt Origination Team Manager would like you to answer the following
questions.
1- Is it possible to build a predictive Machine Learning Model?
2- If so, which Machine Learning Model would you recommend and why?
3- How accurate is your proposed model on the Data provided?
4- If we are to start collecting Data for fewer than the current 12 Features, then
which Features would you recommend and why? How many Features do we
need to collect Data for? What is the lowest possible number of Features that
maintains 70% accuracy?
5- One of the Features (Feat3) is the Bond that the customer had originally purchased
(one of [BondA, BondB, BondC]). Are previous purchases of any of BondA,
BondB or BondC in particular more indicative of a purchase of BondD?
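One possible way to approach Question 4 is sketched below: rank the Features by importance, then add them one at a time until cross-validated accuracy reaches 70%. This is only an illustration under stated assumptions. The data is synthetic (generated with scikit-learn's `make_classification`, not the project file), and the Random Forest, the importance-based ranking and the 5-fold cross-validation are choices made for the example rather than a prescribed method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real file: 1500 rows, 12 numeric features.
X, y = make_classification(n_samples=1500, n_features=12, n_informative=4,
                           n_redundant=2, random_state=0)
feature_names = [f"Feat{i}" for i in range(12)]

# Rank features, most important first, using a forest fitted on all 12.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]

target_accuracy = 0.70
smallest_k = None
for k in range(1, len(feature_names) + 1):
    top_k = ranking[:k]
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    # Mean accuracy over 5 cross-validation folds, using only the top-k features.
    accuracy = cross_val_score(model, X[:, top_k], y, cv=5, scoring="accuracy").mean()
    print(k, [feature_names[i] for i in top_k], round(accuracy, 3))
    if accuracy >= target_accuracy:
        smallest_k = k
        break

print("smallest number of features reaching the target:", smallest_k)
```

Note that ranking the Features on all rows before cross-validating lets a little information leak into the accuracy estimate; a stricter version would redo the ranking inside each fold. A greedy ranking like this also does not guarantee the true minimum number of Features, only a reasonable candidate.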
The Debt Origination Team Manager would like to see your answers to the questions above
in an executive (summary) report. The Debt Origination Team Manager would also
appreciate any further insights on the Data that you can provide beyond the questions
above. Extra insights might include advice on collecting future Data. The Debt Origination
Team Manager is particularly sensitive about attempted cross-selling of BondD to customers
who are unlikely to follow through with a purchase, so the accuracy of your model when it
predicts Class 1=Yes (Is a Buyer of BondD), that is, its precision on Class 1, is important.
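The distinction between overall accuracy and accuracy on Class 1 predictions (usually called precision) can be seen in a small hand-made example. The labels below are invented for illustration and do not come from the project Data.

```python
# Toy labels for illustration: 1 = buyer of BondD, 0 = not a buyer.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))

# Predicted buyers who really bought, and predicted buyers who did not.
true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)

# Share of all predictions that are correct.
accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)

# Share of "Class 1=Yes" predictions that are correct.
precision_class1 = true_pos / (true_pos + false_pos)

print("accuracy:", accuracy)
print("Class 1 precision:", precision_class1)
```

Here the model is right on 8 of 10 customers overall, but only 3 of the 4 customers it flags as buyers actually bought, so one in four pitches would go to someone who was not a buyer.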
The Debt Origination Team Manager would appreciate the provision of technical details about
your Data decisions. In particular, the manager would be interested to know why you have
chosen a particular Machine Learning model, how you deal with missing Data and outliers, and
how you construct your Training Set and Testing Set.
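The kind of Data decisions mentioned above could be laid out as in the following sketch. Everything in it is an assumption made for illustration: the data is synthetic with 5% of values blanked out, the 80/20 stratified split, median imputation and median/IQR scaling are just one reasonable set of choices, and Logistic Regression is a placeholder model rather than a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the 1500 provided rows, with some values blanked out.
X, y = make_classification(n_samples=1500, n_features=12, n_informative=4,
                           random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan  # pretend about 5% of entries are missing

# Hold out 20% for testing; stratify keeps the class balance the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Median imputation and median/IQR scaling are less sensitive to outliers than
# their mean-based counterparts. Fitting them inside the pipeline means they
# only ever learn from the Training Set.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RobustScaler(),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("test accuracy:", round(model.score(X_test, y_test), 3))
print("Class 1 precision:", round(precision_score(y_test, y_pred, pos_label=1), 3))
```

Keeping the imputer and scaler inside the pipeline is the main design point: if they were fitted on all rows before the split, statistics from the Testing Set would leak into training and the reported accuracy would be optimistic.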
You should make the following submissions.
1- An executive report. The report should be no longer than 9 pages and contain no more
than 3000 words. The report should also contain at least 4 visualizations (charts of
some kind). You may add more visualizations if you wish to, but the final report length
should not exceed 9 pages. If your report exceeds 9 pages, then only the first 9 pages
will be considered for grading. If your report exceeds 3000 words, then only the first
3000 words will be considered for grading.
2- A single Jupyter Notebook that can be used to reproduce the numerical results of your
report. The Jupyter Notebook should also be able to reproduce the visualizations in
your report. Your Jupyter Notebook may contain extra information or insights that are
not in your report. You may use such extra information to develop your own thoughts.
However, any information inside the Jupyter Notebook that is not added into your
written report will NOT be used for grading.