Advanced Machine Learning GR5242
All problems count the same. No collaboration of any sort (including with peers from other sections or previous years) is allowed; treat the assignment as you would a take-home exam.

Homework submission: please submit your homework by publishing a notebook that cleanly displays your code, results, and plots to PDF. For the non-coding questions, you may typeset answers in LaTeX or markdown, or include neatly scanned answers. Please put everything together into a single PDF file for submission.

This adjusts the gradient most into a given direction. Which?

Problem 3: Compare DBSCAN and MeanShift

This problem is an extension of Homework 1, Problem 8. In that problem you were given a data generator for 3 datasets, on which you had to compare the performance of MeanShift, K-means, and Spectral Clustering. Re-use the data generator given in Homework 1, Problem 8, and now run DBSCAN on the three datasets (we want to compare DBSCAN to MeanShift on the same datasets). The corresponding Python function DBSCAN is imported as follows:

from sklearn.cluster import DBSCAN

The DBSCAN function takes two parameters, eps and min_samples. The first parameter, eps, corresponds to the bandwidth parameter h in the course notes. The second parameter, min_samples, is proportional to the level λ from the course notes; more precisely, λ = min_samples / Vol(B(0, h)). You therefore first have to make appropriate choices of these parameters, as follows.

• Set eps to the 20% quantile of interpoint distances when using DBSCAN, i.e., the same setting as for MeanShift in Homework 1. You can use the function estimate_bandwidth() from sklearn.cluster.
• For min_samples over the range [100, 400, 700, 1000], pick the smallest value that gives more than one cluster.

(a) Plot the results of the clustering obtained using DBSCAN for the 3 datasets (as done in Homework 1).

(b) Plot the results of the clustering obtained using MeanShift (with parameters set as in Homework 1).
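The parameter-selection recipe above can be sketched as follows. The Homework 1 data generator is not reproduced here, so a hypothetical two-blob toy dataset from make_blobs stands in for one of the three datasets; the eps and min_samples logic is the part that carries over.

```python
# Sketch of the eps / min_samples selection described above.
# The Homework 1 data generator is not available here, so a toy
# two-blob dataset stands in for one of the three datasets.
from sklearn.cluster import DBSCAN, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=600, centers=[[-5, -5], [5, 5]],
                  cluster_std=0.6, random_state=0)

# eps = 20% quantile of interpoint distances, as for MeanShift in HW1
eps = estimate_bandwidth(X, quantile=0.2)

# Pick the smallest min_samples in the candidate range that yields more
# than one cluster (DBSCAN labels noise -1; it is not counted as a cluster).
labels, chosen = None, None
for min_samples in [100, 400, 700, 1000]:
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    if n_clusters > 1:
        chosen = min_samples
        break
```

On the actual Homework 1 datasets, the loop would be run once per dataset, and the resulting labels plotted as in Homework 1.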
If your results for Homework 1 were correct, you can simply provide those again here.

(c) DBSCAN should vastly outperform MeanShift on at least 2 of the datasets. Which 2 datasets? Give a simple explanation as to why DBSCAN might do better there.

Problem 4: Regularization - Gradient Boosting

We will work on the classical digits dataset for this question. The dataset contains 1797 handwritten digits, 0 to 9, as 8 × 8 gray-scale images. Each datapoint (or feature vector) x is of dimension 64, encoding the 64 pixel values of an image; the class label y is simply the digit itself. For instance, Figure 1 shows the image feature of an observation with label 0. For more details, please refer to https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html.

Figure 1: An example of the digit image with label 0

You are asked to implement gradient boosting in Python to return a classifier from images to digits. Code for loading the dataset:

# Load the dataset
from sklearn import datasets
d = datasets.load_digits()
X, y = (d.data, d.target)

# Split into training and test sets
X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]

Complete the following tasks:

(a) Fit sklearn.ensemble.GradientBoostingClassifier objects with all combinations (four in total) of the parameter choices below:

• use deviance as the fitting criterion;
• total number of iterations (i.e., total number of weak learners): T = 1500;
• set max_leaf_nodes = 4, i.e., the weak learners are trees with at most four leaf nodes;
• learning rate: try both η = 0.1 and η = 0.01;
• the fraction of training data to use for gradient fitting: try both p = 0.5 and p = 1;
• leave the other parameters at their defaults.

For each of the 4 parameter settings, plot the test deviance against the boosting iterations t = 1, . . . , T. (Hint: GradientBoostingClassifier.staged_decision_function() and GradientBoostingClassifier.loss_ are useful here.)
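A minimal sketch of part (a) follows, assuming GradientBoostingClassifier is the intended class. Two deviations from the hint, both labeled here: the loss_ attribute has been removed in recent scikit-learn versions, so the per-iteration test deviance is computed with log_loss on staged_predict_proba instead; and T is reduced from 1500 so the sketch runs quickly.

```python
# Sketch of part (a): fit the four (learning_rate, subsample) settings
# and record the test deviance at every boosting iteration.
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

d = datasets.load_digits()
X, y = d.data, d.target
X_train, X_test = X[:1000], X[1000:]
y_train, y_test = y[:1000], y[1000:]

T = 50  # the problem asks for T = 1500; reduced here so the sketch runs quickly
curves = {}
for eta in [0.1, 0.01]:
    for p in [0.5, 1.0]:
        clf = GradientBoostingClassifier(
            n_estimators=T, learning_rate=eta, subsample=p,
            max_leaf_nodes=4, random_state=0)  # default loss is the deviance
        clf.fit(X_train, y_train)
        # Multinomial deviance on the test set after each of the T stages
        curves[(eta, p)] = [log_loss(y_test, proba)
                            for proba in clf.staged_predict_proba(X_test)]
```

Each of the four curves in `curves` can then be plotted against t = 1, . . . , T with matplotlib to produce the requested deviance plots.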
(b) How does stochastic gradient boosting (SGB, p = 0.5) compare to the non-stochastic version (p = 1), especially for the larger learning rate η = 0.1? Does SGB have any advantage other than performance on the test dataset?

(c) For the cases where η = 0.1, would early stopping be helpful? What about the cases where η = 0.01? Give a simple reason why.

(d) Consider the case where η = 0.1 and p = 1. Determine an optimal stopping time by 10-fold cross-validation. Plot the stopping time as a dashed vertical line and the resulting test deviance as a dashed horizontal line on the corresponding plot from part (a). The candidate stopping times are {20, 30, 40, 60, 80, 100, 200, 500}. (Hint: use the GridSearchCV() function from sklearn.model_selection to implement the cross-validation over the parameter n_estimators.)

(e) Now, let's explore the performance under 0-1 misclassification error, which is the most common criterion of classification performance. It is common folklore that under 0-1 loss, boosting procedures tend not to overfit as fast as under other performance measures.
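The cross-validation in part (d) can be sketched with GridSearchCV, as the hint suggests. For speed this sketch uses 3 folds and a truncated candidate list; the problem asks for 10 folds and the full grid {20, 30, 40, 60, 80, 100, 200, 500}, so those values are assumptions made here only to keep the example fast.

```python
# Sketch of part (d): choose a stopping time by cross-validating
# n_estimators for the eta = 0.1, p = 1 configuration.
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

d = datasets.load_digits()
X_train, y_train = d.data[:1000], d.target[:1000]

base = GradientBoostingClassifier(learning_rate=0.1, subsample=1.0,
                                  max_leaf_nodes=4, random_state=0)
# cv=3 and a short grid for speed; the assignment wants cv=10 and the full grid
search = GridSearchCV(base, {"n_estimators": [20, 40, 80]},
                      cv=3, scoring="neg_log_loss")  # score = held-out deviance
search.fit(X_train, y_train)
best_T = search.best_params_["n_estimators"]
```

best_T is the cross-validated stopping time, to be drawn as the dashed vertical line on the part (a) plot, with its test deviance as the dashed horizontal line.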