Multitask Machine Learning
COMP9417 Project: Multitask Machine Learning (MML)
Project Description
As a Data Scientist at Predictive Solutions Inc., you have become comfortable working with any type of dataset that
comes your way. Your newest client, a medical researcher in an undisclosed branch at the local hospital, is interested
in utilizing machine learning to understand data obtained from a recent clinical trial they conducted. In this particular
dataset, there are n = 1000 observations and p = 111 features. To ensure privacy of patient data, the features have been
anonymized (that is, the features are generically labelled X1, X2, . . .). The features are a mix of binary, categorical
and continuous valued data that contain information about each patient. In this problem, the outcome is
multivariate, which means that there are multiple target variables to predict as opposed to the usual case in which we
have a single target variable. Each target is a specific medical condition. This sort of problem is known as Multitask
Learning. The data will be released on March 25, 2022.
Description of the Data
The client has provided you with the following data sets in numpy.array format: X train, X test, Y train. You will
need to use best practices to come up with a model that generates predictions for X test, which will be submitted for
evaluation and will count towards your final grade. The X variable consists of tabular data of dimension
1000 × 111 (one row per observation, one column per feature). The Y variable consists of tabular data of dimension 1000 × 11, so that there are 11
binary targets (tasks) that need to be predicted. The loss function used for this problem is the average binary cross
entropy loss, i.e. if $Y_{ij}$ denotes the j-th target for the i-th observation, and $\hat{Y}_{ij}$ is the corresponding prediction from
your model, then the total loss is:
\[
\frac{1}{n} \sum_{i=1}^{n} \frac{1}{11} \sum_{j=1}^{11} L_{XE}(Y_{ij}, \hat{Y}_{ij}),
\]
where
\[
L_{XE}(Y_{ij}, \hat{Y}_{ij}) = -Y_{ij} \log(\hat{Y}_{ij}) - (1 - Y_{ij}) \log(1 - \hat{Y}_{ij})
\]
is the usual binary cross entropy loss.
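As a rough illustration of the loss above, the following Python sketch computes the average binary cross entropy for an n × 11 array of targets and an array of predicted probabilities of the same shape. The function name mean_bce, the clipping constant eps, and the synthetic arrays are assumptions made for this example only; they are not part of the project specification, and the official evaluation code may differ (for example in how it handles predictions of exactly 0 or 1).

```python
import numpy as np


def mean_bce(y_true, y_prob, eps=1e-15):
    """Average binary cross entropy over observations and tasks.

    y_true: array of shape (n, k) with entries in {0, 1}.
    y_prob: array of shape (n, k) with predicted probabilities.
    eps:    hypothetical clipping constant that keeps log() finite.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    if y_true.shape != y_prob.shape:
        raise ValueError("y_true and y_prob must have the same shape")
    # Elementwise L_XE(Y_ij, Yhat_ij).
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    # Inner mean over the k tasks, outer mean over the n observations.
    return losses.mean(axis=1).mean()


# Synthetic stand-ins with the shapes described in the text (not the real data).
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=(1000, 11))
p_demo = np.full((1000, 11), 0.5)  # an uninformative baseline prediction
baseline_loss = mean_bce(y_demo, p_demo)
```

Predicting 0.5 for every entry gives a loss of log 2 regardless of the targets, which can serve as a simple reference point when judging whether a model has learned anything.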
Important Aspects
The following problems should be considered and discussed in detail in your report:
• Data: Perform an extensive exploratory data analysis (EDA). This should include a pre-processing step in which
the data is cleaned. You should pay particular attention to the following questions:
1. Which features are most likely to be predictive of each target variable?
2. What, if any, are the relationships between the target variables?
• Research: Provide a summary of the multi-task learning literature. Be sure to explain rigorously some of the
algorithms that are used. It is a good idea to pick one or two areas to explore further here. The report should
be well written and well referenced.
• Modelling: The approach to modelling is open ended and you should think carefully about the types of models
you wish to deploy. It is generally a bad idea to build a large number of generic models. Instead, you should
think carefully about the models you want to use and how best to build them. Regardless of the models you
choose, you need to: