ECON6113/STAT6113: Term Project Guidelines
ECON 6113 Term Project Guidelines
econometric models.
Your “client” for this project is a mortgage lender that wishes to engage in risk-based pricing
for its mortgage loans. The first step in establishing risk-based pricing is the construction of a
mortgage “scorecard” that predicts the probability that a loan defaults. You will develop such a
scorecard using data that I provide you derived from the Freddie Mac Single Family Loan-Level
Dataset; the full version of this data “covers approximately 22.19 million fixed-rate mortgages
originated between January 1, 1999 and June 30, 2015 [that were purchased or guaranteed by
Freddie Mac].” I will be providing you with a small random sample of loans from a particular
origination cohort. You are tasked with developing an econometric model that predicts the
probability that a given loan defaults, where a default is defined as becoming 60 days past due or
entering foreclosure at any point within 4 years of origination. In what follows a “good” account
is one that does not default, whereas a “bad” account is one that does default.
The Model
Let $D_i$ be an indicator variable that takes a value of one if loan $i$ defaults within 4 years of
origination and is zero otherwise. This variable is called “BAD_OVER_48_60” in the data file.
Additionally, let $X_i$ be a vector of variables that describe the characteristics of the borrower and
the loan that are available at the time of origination. For the project, you will use the data that I
provide you (the “development data”) to estimate
$$\Pr[D_i = 1 \mid X_i] = G(X_i \beta) \tag{1}$$
Let $\hat{\beta}$ denote the vector of estimated parameters for Equation 1. After you estimate $\hat{\beta}$, your model
will be evaluated on its ability to distinguish between “good” and “bad” accounts as measured
by the Kolmogorov-Smirnov (KS) statistic calculated on an out-of-time sample (the “OOT data”).
The student who builds the model with the highest KS statistic on the OOT sample will have 5
points added to her or his final grade.
Since you will all be working with the same development data, variation in the performance of
the models between students will be driven entirely by differences in how you build your models.
In building your model, you may need to create new variables, such as interaction variables
and transformations. All of your data cleaning and model specification decisions, however,
must be explained and justified in your written report.
When building your model, the User Guide for the mortgage data will be of critical importance;
this guide can be found in the “Files-Term Project-Documentation” folder on the course Canvas
page. This file provides an overview of the loan-level data, describes the file layout, and defines
all of the variables that are included in the data.
Data Cleaning
While Freddie Mac has cleaned up the source files significantly, no dataset is perfect. As an
econometrician it is your responsibility to make sure that the data that you are using is correct;
using data that is riddled with errors can have a seriously detrimental impact on your analysis.
As they say: “garbage in, garbage out.”
To make sure that you are working with a “clean” sample, you will likely want to remove some
observations from the data that appear to have been coded incorrectly or that are missing key
data elements. Some suggestions for cleaning your data are listed below.
• How frequently is a variable missing in the data? Will including the variable result in you
losing a large fraction of the overall sample?
• Plot a histogram of all of the variables you are considering for use in your model. Can
you identify any observations that are likely data entry errors? If so, you should consider
removing those observations from the data.
• When merging datasets, always verify that the merge was performed correctly.
• Calculate the summary statistics for all of the variables that you are considering for use in
your models. Do the means, minima, and maxima “make sense?” For example, economists
often work with variables that must logically be positive (e.g., prices) or bounded (e.g.,
fractions that must be between 0 and 1). If your data do not conform to expectations, you
should inspect the data more closely to understand what is going on. Note that in some
databases, values that appear to be bizarre (e.g., -99999 for a price) actually have a specific
meaning that can be gleaned from the codebook.
• ALWAYS READ THE CODEBOOK! In our case, the codebook for the Freddie Mac data is
called the “User Guide.” I cannot stress this enough. The easiest way to run into trouble
when building an econometric model is to just start throwing variables into a model before
you have any understanding of what those variables actually measure.
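As a rough illustration of the summary-statistics check described above, the sketch below counts missing and special-code entries before computing the mean, minimum, and maximum. It is written in Python only to show the logic; your project code must still be a Stata do-file. The toy price values and the use of -99999 as a special code are illustrative assumptions borrowed from the example in the text. The actual codes for each variable must be taken from the User Guide.

```python
import statistics

# Hypothetical special code, mirroring the "-99999 for a price" example.
# The real codes for each variable must come from the User Guide.
SENTINEL = -99999

def summarize(values, sentinel=SENTINEL):
    """Count missing/special-code entries and summarize the remaining values."""
    valid = [v for v in values if v is not None and v != sentinel]
    return {
        "n_total": len(values),
        "n_flagged": len(values) - len(valid),  # missing or special-code entries
        "mean": statistics.mean(valid),
        "min": min(valid),
        "max": max(valid),
    }

# Toy data (not from the Freddie Mac file): one special code, one missing value.
prices = [250000, 310000, SENTINEL, 180000, None, 420000]
summary = summarize(prices)
print(summary)
```

If the special code were left in, it would drag the mean far below zero, which is exactly the kind of result that should send you back to the codebook.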
Data Sampling
Before building your model, you must use random sampling to split your data into two distinct
pieces: a development sample (70%) and a holdout sample (30%). The development sample is
the set of observations that will be used to build your model, and the holdout sample is the set
of observations on which you are to test the performance of your model.
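One possible way to construct the 70/30 split is sketched below in Python, purely to illustrate the logic; the project code itself belongs in your Stata do-file. The loan identifiers, the function name, and the seed value are hypothetical. Fixing the seed is an assumption made here so that the split can be reproduced when the code is rerun.

```python
import random

def split_sample(loan_ids, dev_share=0.70, seed=6113):
    """Randomly assign loans to a development sample and a holdout sample."""
    rng = random.Random(seed)        # fixed seed makes the split reproducible
    shuffled = list(loan_ids)        # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    cut = round(dev_share * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

loan_ids = list(range(1, 1001))      # hypothetical loan identifiers
dev, holdout = split_sample(loan_ids)
print(len(dev), len(holdout))
```

Every loan lands in exactly one of the two samples, so no observation used to build the model is also used to test it.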
Model Specification
You are free to specify your model as you please. Below, however, are some points that you may
want to consider when specifying your final model.
• The final specification of the model should be based on sound statistical reasoning. Since
you are building a predictive model, this means that you should only include variables in
your models that improve the models’ ability to differentiate between defaulting and non-
defaulting loans. You can identify these variables through traditional hypothesis testing,
an analysis of the model’s overall predictive performance, or some combination of the two
approaches.
• You should think about how to incorporate non-linearities and interactions into your model
to improve model performance.
The Report
To receive credit for the project, you must submit a report via Canvas that contains the following
elements. The report should be written in clear, concise English as if it were aimed at a
professional audience. DO NOT SIMPLY TURN IN A LIST OF BULLET POINTS.
Data Description
In this portion of the report, you should describe the data that you are using to estimate your
model. Detailed information on the nature of the data can be found in the User Guide. If you
imposed any filters to remove likely outliers, those filters should be described in detail in this
section. For example, if you dropped observations with DTIs in excess of 90 because these are
likely data entry errors, you should state this exclusion in this part of the paper.
The Data Description section should include a table with summary statistics (such as the mean,
median, range, and standard deviation) for any of the variables that you include in your final
model. Please note that this section should only contain information on the variables that you
used to build and test your models; you do not need to discuss variables that are included in the
mortgage data that you do not utilize in your analysis.
You must use random sampling to split your data into two distinct pieces: a development sample
(70%) and a holdout sample (30%). The development sample is the set of observations that will
be used to build your model, and the holdout sample is the set of observations on which you
are to test the performance of your model. Describe how these samples were constructed in this
portion of the report.
Model Description
In this portion of the report, you should write down your model in mathematical notation. For
example, if $D_i$ is the default indicator and $X_i$ is the vector of regressors, you should write down
your default model as
$$\Pr[D_i = 1 \mid X_i] = G(X_i \beta)$$
For this project, you will use the logistic link function for $G(\cdot)$.
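For reference, the logistic link is usually written as follows. This is the standard textbook form rather than something taken from the course materials, and $z$ is simply shorthand for the index $X_i \beta$.

```latex
G(z) = \frac{\exp(z)}{1 + \exp(z)} = \frac{1}{1 + \exp(-z)},
\qquad
\ln\!\left( \frac{\Pr[D_i = 1 \mid X_i]}{1 - \Pr[D_i = 1 \mid X_i]} \right) = X_i \beta
```

Under this form the coefficients act linearly on the log-odds of default, not on the probability of default itself.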
Lastly, and perhaps most importantly, this section should clearly state what your model is
designed to do, define the dependent variable, and describe your expectations for the relationship
between each of the variables and mortgage default. For example, if you include DTI
in the model, you need to explain whether or not you expect DTI to increase default risk and
why you expect such a relationship to hold.
Commentary on Initial Model Specification
Discuss in this section how you initially specified your model. Was your specification informed
by economic theory? Do you have expectations for the signs of any of the variables?
Next, discuss any initial testing that you conducted to arrive at your final specification. For example, if you
initially included DTI in your model but found that it was statistically insignificant, state this in
the report.
Final Model Output and Commentary
You must include a regression table that includes the estimated coefficients, standard errors,
and corresponding levels of statistical significance for your final specification. Below that table,
you should discuss and interpret your results. For example, if you included DTI in the model
because you expected higher DTIs to be associated with higher default rates, you should discuss
whether the final model results were consistent with expectations. In this section you should
emphasize the ceteris paribus interpretation of the regression coefficients. Turning back to the
DTI example, if you have a positive and statistically significant coefficient on the DTI term, you
should emphasize that, all else equal, borrowers with higher DTIs are more likely to default. In
this section you also must perform a marginal effects analysis and discuss the impact of the
variables in your model on the probability – not the log-odds – of default.
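To make the distinction between log-odds and probability concrete, the sketch below computes the marginal effect of DTI on the probability of default for a logit model, using the standard result that the effect of regressor k equals its coefficient times p(1 − p). It is written in Python only to illustrate the arithmetic. The coefficient values and the three borrowers are hypothetical and are not estimates from the project data.

```python
import math

def logistic(z):
    """Logistic link: G(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def marginal_effect(beta, x, k):
    """Effect of a small change in x[k] on Pr[D = 1 | x]: beta[k] * p * (1 - p)."""
    p = logistic(sum(b * v for b, v in zip(beta, x)))
    return beta[k] * p * (1.0 - p)

# Hypothetical coefficients: intercept and DTI slope (illustrative values only).
beta = [-4.0, 0.05]
borrowers = [[1.0, 20.0], [1.0, 35.0], [1.0, 50.0]]   # [constant, DTI]

# Average marginal effect of DTI across the (toy) sample.
ame = sum(marginal_effect(beta, x, 1) for x in borrowers) / len(borrowers)

# Finite-difference check of the analytic formula at the first borrower.
h = 1e-6
x0 = borrowers[0]
analytic = marginal_effect(beta, x0, 1)
numeric = (logistic(beta[0] + beta[1] * (x0[1] + h))
           - logistic(beta[0] + beta[1] * x0[1])) / h
print(ame, analytic, numeric)
```

Note that the marginal effect differs across borrowers even though the coefficient on DTI is the same for all of them, which is why the effect on the probability must be reported separately from the coefficient.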
Model Performance Testing
After developing your final model using the development data, you must perform an analysis of
out-of-sample performance and compare this against in-sample performance. For this analysis,
the “score” is simply the predicted probability that a loan defaults based on your model. This
analysis must contain the following elements.
• An assessment of the discriminatory power of your model based on the KS statistic. If $s$
denotes the score, then the KS statistic is defined as
$$\mathrm{KS} \equiv \max_{s} \left( F(s \mid B) - F(s \mid G) \right)$$
where $F(s \mid B)$ and $F(s \mid G)$ are the cumulative distribution functions for the scores conditional
on the account being bad and good, respectively. Calculate KS using your development
data, and then calculate KS using your holdout data. Interpret the results. Does the
ability of your model to differentiate between good and bad accounts appear to be stable
out of sample?
• An assessment of the model’s predictive accuracy. To conduct this assessment, first partition
your scores into ten groups based on the deciles of the PD distribution. Within each decile,
sum over all of the predicted PDs in the decile to calculate the expected number of bads
within the decile. Calculate the expected number of good accounts similarly by summing
over the values of 1 − PD for the accounts that are within the decile. Formally, this expected bad
calculation can be written as
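$$\hat{B}_d = \sum_{i=1}^{N} PD_i \, 1_{id}$$
where $PD_i$ is the estimated probability of default for account $i$, $N$ denotes the total number
of accounts in the sample, and $1_{id}$ is an indicator variable that is equal to 1 if account $i$ is
included in decile or “band” $d$.
The number of expected goods in decile $d$ can be written as
$$\hat{G}_d = \sum_{i=1}^{N} (1 - PD_i) \, 1_{id}$$

| Band | Range | Expected Bad ($\hat{B}_d$) | Actual Bad ($B_d$) | Expected Good ($\hat{G}_d$) | Actual Good ($G_d$) |
|---|---|---|---|---|---|
| 1 | $(0, p_1]$ | $\sum_{i=1}^{N} PD_i 1_{i1}$ | $B_1$ | $\sum_{i=1}^{N} (1 - PD_i) 1_{i1}$ | $N_1 - B_1$ |
| 2 | $(p_1, p_2]$ | $\sum_{i=1}^{N} PD_i 1_{i2}$ | $B_2$ | $\sum_{i=1}^{N} (1 - PD_i) 1_{i2}$ | $N_2 - B_2$ |
| 3 | $(p_2, p_3]$ | $\sum_{i=1}^{N} PD_i 1_{i3}$ | $B_3$ | $\sum_{i=1}^{N} (1 - PD_i) 1_{i3}$ | $N_3 - B_3$ |
| 4 | $(p_3, p_4]$ | $\sum_{i=1}^{N} PD_i 1_{i4}$ | $B_4$ | $\sum_{i=1}^{N} (1 - PD_i) 1_{i4}$ | $N_4 - B_4$ |
| 5 | $(p_4, p_5]$ | $\sum_{i=1}^{N} PD_i 1_{i5}$ | $B_5$ | $\sum_{i=1}^{N} (1 - PD_i) 1_{i5}$ | $N_5 - B_5$ |
| 6 | $(p_5, p_6]$ | $\sum_{i=1}^{N} PD_i 1_{i6}$ | $B_6$ | $\sum_{i=1}^{N} (1 - PD_i) 1_{i6}$ | $N_6 - B_6$ |
| 7 | $(p_6, p_7]$ | $\sum_{i=1}^{N} PD_i 1_{i7}$ | $B_7$ | $\sum_{i=1}^{N} (1 - PD_i) 1_{i7}$ | $N_7 - B_7$ |
| 8 | $(p_7, p_8]$ | $\sum_{i=1}^{N} PD_i 1_{i8}$ | $B_8$ | $\sum_{i=1}^{N} (1 - PD_i) 1_{i8}$ | $N_8 - B_8$ |
| 9 | $(p_8, p_9]$ | $\sum_{i=1}^{N} PD_i 1_{i9}$ | $B_9$ | $\sum_{i=1}^{N} (1 - PD_i) 1_{i9}$ | $N_9 - B_9$ |
| 10 | $(p_9, 1]$ | $\sum_{i=1}^{N} PD_i 1_{i10}$ | $B_{10}$ | $\sum_{i=1}^{N} (1 - PD_i) 1_{i10}$ | $N_{10} - B_{10}$ |

Table 1: An “Expected-Versus-Actual” Table. Here $N_d$ is the number of accounts in band $d$.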
• Construct a table that compares the expected goods (Ĝd) and actual goods (Gd) and the
expected bads (B̂d) and actual bads (Bd) for each of the score bands. Repeat this calculation
for the development sample and the holdout sample. Table 1 is a template for these
calculations. Based on these results, how well does your model predict default within the
bands? Is the accuracy of these predictions stable out-of-sample?
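The two performance calculations above can be sketched as follows. This is a Python illustration of the logic only, since your submitted code must be a Stata do-file, and the 20 scores and outcomes are made-up toy values. Two assumptions are worth flagging. First, because the score is the predicted PD, bad accounts tend to have higher scores than good accounts, so the sketch takes the largest absolute gap between the two empirical CDFs. Second, the bands are formed as equal-count groups by score rank, so tied scores at a band boundary may be split between adjacent bands.

```python
def ks_statistic(scores, bad):
    """Largest absolute gap between the empirical score CDFs of bads and goods."""
    n_bad = sum(bad)
    n_good = len(bad) - n_bad
    pairs = sorted(zip(scores, bad))
    seen_bad = seen_good = 0
    ks = 0.0
    i = 0
    while i < len(pairs):
        s = pairs[i][0]
        # Consume every account tied at this score before comparing the CDFs.
        while i < len(pairs) and pairs[i][0] == s:
            seen_bad += pairs[i][1]
            seen_good += 1 - pairs[i][1]
            i += 1
        ks = max(ks, abs(seen_bad / n_bad - seen_good / n_good))
    return ks

def expected_vs_actual(scores, bad, n_bands=10):
    """Rows of (band, expected bad, actual bad, expected good, actual good)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rows = []
    for d in range(n_bands):
        lo = d * len(order) // n_bands
        hi = (d + 1) * len(order) // n_bands
        members = order[lo:hi]                       # accounts in band d + 1
        exp_bad = sum(scores[i] for i in members)    # sum of PDs
        act_bad = sum(bad[i] for i in members)
        n_d = len(members)
        rows.append((d + 1, exp_bad, act_bad, n_d - exp_bad, n_d - act_bad))
    return rows

# Toy scores (predicted PDs) and outcomes (1 = bad) for 20 hypothetical accounts.
pd_hat = [0.01 * (i + 1) for i in range(20)]
bad = [0] * 15 + [1, 0, 1, 1, 1]

ks = ks_statistic(pd_hat, bad)
table = expected_vs_actual(pd_hat, bad)
print(ks)
for row in table:
    print(row)
```

Running the same two functions on the development sample and on the holdout sample gives the side-by-side comparison that the report asks you to interpret.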
Code Submission
A key portion of the project is running your code on an out-of-time data sample. IF YOUR CODE
DOES NOT RUN ON THIS OUT-OF-TIME SAMPLE, YOU WILL NOT RECEIVE CREDIT
FOR THE PROJECT. To ensure that your code can be run on this out-of-time sample, your code must
be written in a manner that satisfies the requirements listed below. To structure your code so
that it conforms with these expectations, you can follow the sample do-file
“Basic_Regression_OutOfSample_LOGIT_W_TESTS.do”, which is in the “Files-Term-Project-StataCode” folder.
• At the beginning of your do-file, define the directory that contains the data using the following
local macro syntax: local data_dir "ZZZ" where ZZZ is the directory that contains
the data for the project. When I run your code on the out-of-time data, this directory will
be swapped to the directory on my computer where the out-of-time data resides.
• All variable transformations and other data cleaning steps must be performed by calling a
separate do-file from within your main do-file. This do-file must be named “data_cleaning.do”.
To properly score the observations in the out-of-time data, the same variable transformations
and data cleaning steps that were used in the construction of the initial model must
also be applied to the out-of-time observations. When I run your code to score the out-of-time
data, “data_cleaning.do” will be called to perform the necessary steps.
Plagiarism Detection
As a condition of taking this course, all required papers may be subject to submission for textual
similarity review to Turnitin.com via Canvas for the detection of plagiarism. All submitted
papers will be included as source documents in the Turnitin.com reference database solely for
the purpose of detecting plagiarism of such papers. No student papers will be submitted to
Turnitin.com without a student’s written consent and permission. If a student does not provide
such written consent and permission, the instructor may: (i) require a short reflection paper on
research methodology; (ii) require a draft bibliography prior to submission of the final paper; or
(iii) require the cover page and first cited page of each reference source to be photocopied and
submitted with the final paper.
Grading
The grade that you receive for your term project will depend on four components: content (20%),
execution (20%), interpretation (30%), and writing (30%). The term project grading rubric, which
can be found on Canvas in the “Files-Term Project” folder, contains detailed information on how
scores for each of these components are determined.