COMP0050 Assignment
Data
Download the file COMP0050CourseworkData.zip from Moodle. It contains two datasets:
1) bankPortfolios.csv: This dataset contains information about the assets held on the balance sheets of 7783 US commercial banks in the 4th quarter of 2007. The data were collected from Wharton Research Data Services. Each row of the file corresponds to one bank. The first 14 columns represent investments in the following asset classes.
Column 15 contains each bank's debt. Finally, column 16 contains a binary variable, the output variable y that you need to predict (1 denotes default, 0 no default). Information on defaults comes from the list of bank failures published by the Federal Deposit Insurance Corporation (FBL-FDIC) for the period 1/1/2008 - 7/1/2011.
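The column layout described above can be sketched as a small loader. This is only an illustration under stated assumptions: the brief does not say whether the file has a header row, so `has_header` is a hypothetical switch, and the two rows in `toy` are made-up values standing in for the real file.

```python
import csv
import io

N_ASSET_COLS = 14  # columns 1-14: investments per asset class (per the brief)


def load_bank_rows(fileobj, has_header=False):
    """Split each CSV row into (assets, debt, default label).

    has_header is an assumption: the brief does not state whether
    bankPortfolios.csv starts with a header row.
    """
    reader = csv.reader(fileobj)
    if has_header:
        next(reader, None)  # discard the header row
    assets, debt, labels = [], [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = [float(v) for v in row]
        assets.append(values[:N_ASSET_COLS])           # columns 1-14
        debt.append(values[N_ASSET_COLS])              # column 15
        labels.append(int(values[N_ASSET_COLS + 1]))   # column 16: y
    return assets, debt, labels


# Toy stand-in for the real file: two made-up banks with 16 columns each.
toy = io.StringIO(
    ",".join(["1.0"] * 14) + ",10.0,0\n"
    + ",".join(["2.0"] * 14) + ",25.0,1\n"
)
assets, debt, labels = load_bank_rows(toy)
default_rate = sum(labels) / len(labels)  # share of banks with y = 1
print(len(assets), len(assets[0]), default_rate)
```

With the real data, passing `open("bankPortfolios.csv", newline="")` in place of `toy` would be the natural usage. Checking the default rate first gives a quick view of how unbalanced the labels are.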
2) 48_Industry_Portfolios_daily.csv: This dataset comes from Ken French’s website and contains daily equity returns for 48 U.S. industries. Note that the spreadsheet contains two sets of data, corresponding to two ways of computing industry averages (equally weighted or weighted by market capitalization). You are free to select the version you prefer.
Tasks
There are two tasks, one for each dataset:
1. Build a model to predict whether a bank will default. You should compare the performance of different methods (e.g. logistic regression, classification trees/forests) in terms of their ability to correctly predict defaults. You are free to focus on a subset of the data (e.g. a reduced set of features, or a subset of banks) and to manipulate the data as you like, but you should explain your rationale. Your analysis should address the issue of unbalanced data.
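To see why unbalanced data matters for Task 1, here is a minimal sketch on made-up labels (95 non-defaults, 5 defaults; these numbers are illustrative, not taken from the dataset). It computes "balanced" class weights with the heuristic n_samples / (n_classes * class_count), which is also what scikit-learn's `class_weight="balanced"` option uses, and shows that a trivial model can score high accuracy while detecting no defaults. The helper names are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def evaluate(y_true, y_pred, positive=1):
    """Accuracy, plus precision and recall for the positive (default) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels: 95 surviving banks (0) and 5 defaults (1).
y_true = [0] * 95 + [1] * 5
always_no_default = [0] * len(y_true)  # trivial model: never predicts default

weights = balanced_class_weights(y_true)
scores = evaluate(y_true, always_no_default)
print(weights)
print(scores)
```

The trivial model reaches 95% accuracy but has zero recall on defaults, which is why accuracy alone is a poor basis for comparing methods here. Weights like these can be passed to a classifier's loss, or the rare class can be resampled; either choice should be explained in the report.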
2. Focus on a subset of the data and perform a clustering analysis on the daily equity return data. Can you find meaningful interpretations of the clusters? Do certain industries tend to cluster together?
For both tasks, justify whether you want to focus only on subsamples of the data. You are also free to explore other questions related to the data and the tasks that you find interesting, as long as your analysis for Task 1 includes the development of predictive models of defaults.
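One possible starting point for Task 2 is sketched below. It is an assumption, not a method prescribed by the brief: industries are compared with the correlation distance d = sqrt(2(1 - rho)) and merged by average-linkage agglomerative clustering. The industry names and return values are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long return series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def correlation_distance(x, y):
    """d = sqrt(2 * (1 - rho)): 0 if perfectly correlated, 2 if anti-correlated."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - pearson(x, y))))


def agglomerative(returns, n_clusters):
    """Average-linkage clustering of named return series into n_clusters groups."""
    names = list(returns)
    dist = {
        (a, b): correlation_distance(returns[a], returns[b])
        for a in names for b in names if a != b
    }

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up daily returns for four hypothetical industries.
toy_returns = {
    "IndA": [0.010, -0.020, 0.015, 0.000, -0.010],
    "IndB": [0.012, -0.018, 0.020, 0.001, -0.008],
    "IndC": [-0.010, 0.020, -0.015, 0.005, 0.010],
    "IndD": [-0.012, 0.022, -0.010, 0.004, 0.012],
}
clusters = agglomerative(toy_returns, n_clusters=2)
print(clusters)
```

In the toy data, IndA and IndB move together and IndC and IndD move together, so the two groups are recovered. On the real data, other choices (k-means on return features, a different linkage, a different number of clusters) may be just as defensible and are worth comparing.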