BUS1040: Foundations of Business Analytics
Homework
This homework consists of 5 problems that require a written response. Submit a PDF to
Gradescope and match each page to the question(s) it answers. You can find detailed
instructions on how to scan and submit your assignments on Canvas. If you fail to match a
page to the corresponding question, the marker will not be able to view your response and you
will be awarded a mark of 0 for that question.
You may not use concepts or material from other classes (say, a linear algebra class you may have taken)
in your solutions. All the problems can be done using only the material from this class, and we will
deduct points from solutions that refer to outside material.
The homework is due by 5pm on Friday, the 4th of November. Late homework will not be accepted
unless special consideration is granted.
1. (10 points) The cost of fitting a model with p basis functions and N data points (say, using QR
factorization) is 2Np² flops. In this exercise, we explore the complexity of carrying out 10-fold cross
validation on the same data set. We assume that N is much larger than p, i.e., N ≫ p, and divide
the data set into 10 folds, each with N/10 data points. The naïve method is to fit 10 different
models, each using 9 of the folds, using the QR factorization, which requires 10 · 2(0.9N)p² = 18Np²
flops. (To evaluate each of these models on the remaining fold requires 2(N/10)p flops, which can
be ignored compared to the cost of fitting the models.) So the naïve method of carrying out 10-fold
cross validation requires, not surprisingly, around 10× the number of flops as fitting a single model.
The method below outlines another way to carry out 10-fold cross-validation. Give the total flop
count for each step, keeping only the dominant terms, and compare the total cost of the method to
that of the naïve method. Let A1, . . . , A10 denote the (N/10) × p blocks of the data matrix associated
with the folds, and let b1, . . . , b10 denote the right-hand sides in the least squares fitting problem.
Step 1: Form the Gram matrices Gi = Ai^T Ai and the vectors ci = Ai^T bi.
Step 2: Form G = G1 + · · · + G10 and c = c1 + · · · + c10.
Step 3: For k = 1, . . . , 10, compute θk = (G − Gk)^{−1} (c − ck).
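For reference, here is a minimal Python sketch of what the three steps compute (the function and
variable names are illustrative, and the sketch is not a substitute for the flop-count analysis the
question asks for):

import numpy as np

def cv_coefficients(A_folds, b_folds):
    """Compute the 10 cross-validation coefficient vectors via Gram matrices.

    A_folds: list of 10 arrays of shape (N/10, p); b_folds: list of 10 arrays of shape (N/10,).
    """
    # Step 1: per-fold Gram matrices Gi = Ai^T Ai and vectors ci = Ai^T bi
    G_list = [A.T @ A for A in A_folds]
    c_list = [A.T @ b for A, b in zip(A_folds, b_folds)]
    # Step 2: sum the per-fold pieces
    G = sum(G_list)
    c = sum(c_list)
    # Step 3: leave fold k out by subtracting its contribution, then solve for theta_k
    return [np.linalg.solve(G - Gk, c - ck) for Gk, ck in zip(G_list, c_list)]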
2. Suppose that x is an N-vector representing time series data. The (one step ahead) prediction problem
is to guess x_{t+1}, based on x_1, . . . , x_t. We will base our prediction x̂_{t+1} of x_{t+1} on the previous M
values, x_t, x_{t−1}, . . . , x_{t−M+1}. (The number M is called the memory or lag of our predictor.) When
the prediction is a linear function,
x̂_{t+1} = β_1 x_t + β_2 x_{t−1} + · · · + β_M x_{t−M+1},
it is called an auto-regressive predictor. (It is possible to add an offset to x̂_{t+1}, but we will leave it
out for simplicity.) Of course we can only use our auto-regressive predictor for M ≤ t ≤ N − 1.
Some very simple and natural predictors have this form. One example is the predictor x̂_{t+1} = x_t,
which guesses that the next value is the same as the current one. Another one is x̂_{t+1} = x_t + (x_t −
x_{t−1}), which guesses what x_{t+1} is by extrapolating a line that passes through x_t and x_{t−1}.
We judge a predictor (i.e., the choice of coefficients β_i) by the mean-square prediction error
J = (1/(N − M)) Σ_{t=M}^{N−1} (x̂_{t+1} − x_{t+1})².
A sophisticated choice of the coefficients β_i is the one that minimizes J. We will call this the
least-squares auto-regressive predictor.
(a) (10 points) Find and report J for the two simple predictors described above, i.e., x̂_{t+1} = x_t
and x̂_{t+1} = x_t + (x_t − x_{t−1}), for both the train and test data. Round your results to 4 decimal
places.
(b) (10 points) Find the matrix A and the vector b for which J = ∥Aβ − b∥²/(N − M). This allows
you to find the coefficients that minimize J , i.e., the auto-regressive predictor that minimizes
the mean-square prediction error on the given time series. Be sure to give the dimensions of A
and b.
(c) (10 points) For M = 2, . . . , 12, find the coefficients that minimize the mean-square prediction
error on the time series x_train given in time_series_data.ipynb. The same file has a second
time series x_test that you can use to test or validate your predictor. Give the values of the
mean-square error on the train and test series for each value of M. What is a good choice of M?
Round your results to 4 decimal places.
Hint. You can use the toeplitz function described on page 65 of the Python companion. It
will make your life a lot easier.
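If you have not used it before, a quick illustration of scipy.linalg.toeplitz (the values here are
arbitrary): each row is the previous row shifted by one, which is exactly the structure of the rows
(x_t, x_{t−1}, . . . , x_{t−M+1}) in the auto-regressive design matrix.

import numpy as np
from scipy.linalg import toeplitz

# toeplitz(c, r) builds a matrix whose first column is c and whose first row is r;
# every descending diagonal is constant.
c = np.array([3, 4, 5, 6])   # first column
r = np.array([3, 2, 1])      # first row (r[0] should equal c[0])
print(toeplitz(c, r))
# [[3 2 1]
#  [4 3 2]
#  [5 4 3]
#  [6 5 4]]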
3. Consider a fitting problem with n = 1, so x(1), . . . , x(N) and y(1), . . . , y(N) are scalars. We consider
two types of closely related models. The first is a piecewise-linear model with knot points at −1
and 1, as described on pages 256-257 of the textbook, and illustrated in Figure 13.8. The second
is a stratified model (see page 272 of the textbook), with three independent affine models, one for
x < −1, one for −1 ≤ x ≤ 1, and one for x > 1. (In other words, we stratify on x taking low, middle,
or high values.)
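For orientation, one common parameterization of a piecewise-linear model with knot points at −1
and 1 uses the basis functions 1, x, (x + 1)_+, and (x − 1)_+, where (u)_+ = max{u, 0}; the sketch
below assumes this form, so check it against pages 256-257 of the textbook before relying on it.

import numpy as np

# Assumed piecewise-linear parameterization (see the note above):
# fhat(x) = theta1 + theta2*x + theta3*max(x + 1, 0) + theta4*max(x - 1, 0)
def piecewise_linear_row(x):
    """Basis-function values for a scalar input x (one row of the design matrix)."""
    return np.array([1.0, x, max(x + 1.0, 0.0), max(x - 1.0, 0.0)])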
(a) (5 points) Are these two models the same? Is one more general than the other?
(b) (5 points) How many parameters does each model have?
(c) (5 points) What can you say about the training set RMS error, and the testing set RMS error
that would be achieved using least squares with these two models?
4. The file lsq_classifier_data.ipynb contains feature n-vectors x1, . . . , xN, and the associated binary
labels, y1, . . . , yN, each of which is either +1 or −1. The feature vectors are stored as an n × N
matrix X with columns x1, . . . , xN, and the labels are stored as an N-vector y. We will evaluate the
error rate on the (training) data X, y and (to check if the model generalizes) a test set Xtest, ytest,
also given in lsq_classifier_data.ipynb.
(a) (10 points) Least squares classifier. Find β, v that minimize
Σ_{i=1}^{N} (xi^T β + v − yi)²
on the training set. Our predictions are then f̂(x) = sign(x^T β + v). Report the classification
error on the training and test sets, i.e., the fraction of examples where f̂(xi) ≠ yi. There is no
need to report the β, v values.
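A minimal sketch of one way to set this up, assuming X and Xtest are stored with examples as
columns as described above (the function name and structure are illustrative only):

import numpy as np

def ls_classifier_error(X, y, X_other, y_other):
    # Fit theta = (beta, v) by least squares; each row of A is [xi^T, 1].
    A = np.c_[X.T, np.ones(X.shape[1])]
    theta = np.linalg.lstsq(A, y, rcond=None)[0]
    beta, v = theta[:-1], theta[-1]

    def error_rate(Xm, ym):
        yhat = np.sign(Xm.T @ beta + v)   # predictions fhat(x) = sign(x^T beta + v)
        return np.mean(yhat != ym)

    return error_rate(X, y), error_rate(X_other, y_other)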
(b) (10 points) Regularized least squares classifier. Now we add regularization to improve the generalization
ability of the classifier. Find β, v that minimize
Σ_{i=1}^{N} (xi^T β + v − yi)² + λ∥β∥²,
where λ > 0 is the regularization parameter, for a range of values of λ. Solve the problem for
the following values of λ: 10^{−1}, 10^0, 10^1, 10^2, 10^3. Based on your results, suggest a reasonable
choice of λ and report the corresponding classification error on the training and test sets. Again,
there is no need to report the β, v values. Hint: plot the training and test set errors against
log10(λ).
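One standard way to handle the regularizer is to fold it into an ordinary least squares problem by
stacking a scaled identity block under the data matrix; a minimal sketch under the same assumptions
as part (a), noting that only β is penalized, not the offset v:

import numpy as np

def regularized_ls_classifier(X, y, lam):
    """Solve: minimize sum_i (xi^T beta + v - yi)^2 + lam * ||beta||^2."""
    n, N = X.shape
    A = np.c_[X.T, np.ones(N)]                    # N x (n+1): feature columns, then offset
    # Regularization block: sqrt(lam)*I acts on beta only, so its last column (for v) is zero.
    A_stack = np.vstack([A, np.sqrt(lam) * np.c_[np.eye(n), np.zeros(n)]])
    b_stack = np.concatenate([y, np.zeros(n)])
    theta = np.linalg.lstsq(A_stack, b_stack, rcond=None)[0]
    return theta[:-1], theta[-1]                  # beta, v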
5. In this exercise, you will replicate the results presented during the lecture. Perform least squares
classification with the handwritten digits dataset (MNIST). The following Python code imports the
data set from sklearn and divides it into train and test sets:
import numpy as np
from sklearn.datasets import fetch_openml
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X.astype(float)
y = y.astype(int)
X_train = X[10000:, :]
X_test = X[:10000, :]
y_train = y[10000:]
y_test = y[:10000]
(a) (10 points) In the lecture example, we used a Boolean classifier to distinguish the digit zero from
the other nine digits. The following Python code demonstrates how to obtain the coefficients
of the Boolean classifier theta_hat.
# Create binary outcomes for train set
Bin_y_train = np.where(y_train == 0, 1, -1)
# Create binary outcomes for test set
Bin_y_test = np.where(y_test == 0, 1, -1)
A_train = np.c_[np.ones(len(y_train)), X_train]
A_test = np.c_[np.ones(len(y_test)), X_test]
theta_hat = np.linalg.pinv(A_train) @ Bin_y_train
Compute a Boolean classifier that distinguishes the digit one from the other nine digits. Report
the train and test confusion matrices.
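If you want a quick way to tabulate the confusion matrices, the sketch below continues the snippet
above; using sklearn.metrics.confusion_matrix is my own choice here, not necessarily how the lecture
produced its tables.

import numpy as np
from sklearn.metrics import confusion_matrix

# Predictions of the Boolean classifier from the snippet above
yhat_train = np.where(A_train @ theta_hat > 0, 1, -1)
yhat_test = np.where(A_test @ theta_hat > 0, 1, -1)

# Rows correspond to the true label (+1 first), columns to the predicted label
print(confusion_matrix(Bin_y_train, yhat_train, labels=[1, -1]))
print(confusion_matrix(Bin_y_test, yhat_test, labels=[1, -1]))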
(b) (10 points) Compute a multi-class classifier that classifies all 10 different digits. Report the
train and test confusion matrices.
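A least squares multi-class classifier can be built from ten one-vs-all Boolean classifiers, predicting
the digit whose classifier output is largest; the sketch below reuses A_train and A_test from part (a),
and whether this matches the lecture's exact construction is left for you to check.

import numpy as np

# Fit one Boolean classifier per digit; the pseudo-inverse is computed once and reused.
P = np.linalg.pinv(A_train)
Theta = np.column_stack([P @ np.where(y_train == k, 1, -1) for k in range(10)])

# Predicted digit = index of the largest classifier output
yhat_train = np.argmax(A_train @ Theta, axis=1)
yhat_test = np.argmax(A_test @ Theta, axis=1)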
(c) (10 points) Create 5000 new features using the method described on page 302 of the textbook.
Fix your random seed to be 0 with the command np.random.seed(0). Compute a 10-class
classifier with the new features and report the train and test confusion matrices.
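A sketch of one plausible version of the random-feature construction is below; the distribution of the
random matrix and whether the original pixels are kept alongside the new features are assumptions on
my part, so match the details to page 302 of the textbook.

import numpy as np

np.random.seed(0)                               # required by the question
R = np.random.randn(5000, X_train.shape[1])     # ASSUMPTION: standard normal entries
H_train = np.maximum(X_train @ R.T, 0)          # 5000 new features per training image
H_test = np.maximum(X_test @ R.T, 0)

# Constant column, original pixels, then random features (keeping the pixels is an assumption)
A_train_rf = np.c_[np.ones(len(y_train)), X_train, H_train]
A_test_rf = np.c_[np.ones(len(y_test)), X_test, H_test]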
(d) (10 points) Note that your dataset is slightly different from the one we used in the lecture
example, although both come from the same source. In particular, you haven't done any
preprocessing of the images, so your confusion matrices will look slightly different. Comment
on what else causes the differences between your results and those presented in the lecture
(there is more to this story than preprocessing).