HW 4: Anomaly Detection
Question 1 (40 points)

In this question, you will model traffic counts in Pittsburgh using Gaussian process (GP) regression. The included dataset, "PittsburghTrafficCounts.csv", contains the average daily traffic counts computed by traffic sensors at over 1,100 locations in Allegheny County, PA. The data was collected from 2012-2014 and compiled by Carnegie Mellon University's Traffic21 Institute; we have the longitude, latitude, and average daily count for each sensor. Given this dataset, your goal is to learn a model of traffic count as a function of spatial location. To do so, fit a Gaussian process regression model to the observed data. While you can decide on the precise kernel specification, you should try to achieve a good model fit, as quantified by a log marginal likelihood value greater than (i.e., less negative than) -1400.

Here are some hints for getting a good model fit:
- We recommend that you take the logarithm of the traffic counts, and then subtract the mean of this vector, before fitting the model.
- Since the data is noisy, don't forget to include a noise term (WhiteKernel) in your model.
- When fitting a GP with an RBF kernel on multidimensional data, you can learn a separate length scale for each dimension, e.g., length_scale=(length_scale_x, length_scale_y).

Your Python code should provide the following five outputs:
1) The kernel after parameter optimization and fitting to the observed data. (10 pts)
2) The log marginal likelihood of the training data. (5 pts)
3) A 2-D plot of the model's predictions over a mesh grid of longitude/latitude (with color corresponding to the model's predictions), overlaid with a 2-D scatter plot of sensor locations (with color corresponding to the observed values). (10 pts)
4) The percentage of sensors whose average traffic counts are more than two standard deviations higher or lower than the model predicts given their spatial location. (5 pts)
5) A 2-D scatter plot of the sensor locations, with three colors corresponding to observed values (a) more than two standard deviations higher than predicted, (b) more than two standard deviations lower than predicted, and (c) within two standard deviations of the predicted values. (10 pts)

MLC HW 4

import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/gdrive')
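A minimal sketch of one way to fit the GP with scikit-learn, assuming the file name above and the column names shown in the data preview below; the initial kernel hyperparameters are arbitrary starting values that the optimizer tunes, and the plots (outputs 3 and 5) are omitted for brevity.

import numpy as np
import pandas as pd
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

df = pd.read_csv('PittsburghTrafficCounts.csv')
X = df[['Longitude', 'Latitude']].values
y_log = np.log(df['AvgDailyTrafficCount'].values)
y = y_log - y_log.mean()                      # log-transform, then center

# RBF with a separate length scale per spatial dimension, plus a WhiteKernel
# noise term; the initial values below are rough guesses for the optimizer.
kernel = ConstantKernel(1.0) * RBF(length_scale=(0.1, 0.1)) + WhiteKernel(noise_level=0.5)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5).fit(X, y)

print(gp.kernel_)                         # output 1: optimized kernel
print(gp.log_marginal_likelihood_value_)  # output 2: log marginal likelihood

# Output 4: percentage of sensors more than two standard deviations
# above or below the model's prediction at their location
mu, sd = gp.predict(X, return_std=True)
outliers = np.abs(y - mu) > 2 * sd
print(100.0 * outliers.mean())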
Data1

        Longitude   Latitude  AvgDailyTrafficCount       log
0      -80.278366  40.468606                  84.0  4.430817
1      -80.162117  40.384598                  95.0  4.553877
2      -80.221205  40.366778                  97.0  4.574711
3      -80.142455  40.622084                 111.0  4.709530
4      -80.131975  40.544915                 125.0  4.828314
...           ...        ...                   ...       ...
1110   -79.843684  40.498619               13428.0  9.505097
1111   -79.926842  40.425383               13713.0  9.526100
1112   -80.065730  40.397582               13822.0  9.534017
1113   -79.863848  40.429878               14172.0  9.559023
1114   -79.848609  40.479233               14891.0  9.608512

1115 rows × 4 columns

Question 2: Cluster-based anomaly detection (10 points)

Given an unlabeled dataset with two real-valued attributes, we perform cluster-based anomaly detection by running k-means, choosing the number of clusters k automatically using the Schwarz criterion. Four clusters are formed:

A: 100 points, center (0, 0), standard deviation 0.1
B: 150 points, center (35, 5), standard deviation 5
C: 2 points, center (15, 20), standard deviation 1
D: 200 points, center (10, 10), standard deviation 1

Given the four points below, which of these points are, and are not, likely to be anomalies? Choose "Anomaly" or "Not Anomaly", and provide a brief explanation for each point. (Hint: your answers should take into account the size and standard deviation of each cluster as well as the distances to cluster centers.)

(1, 0) Anomaly / Not Anomaly
(35, 2) Anomaly / Not Anomaly
(15, 19) Anomaly / Not Anomaly
(10, 11) Anomaly / Not Anomaly

Your answer here
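A minimal sketch of the quantities the hint asks you to weigh (not required code): for each test point, the distance to its nearest cluster center expressed in units of that cluster's standard deviation, alongside the cluster size. The cluster centers, standard deviations, and sizes are taken directly from the problem statement above; the interpretation is left to your written answer.

import numpy as np

clusters = {                       # center, std. dev., size (from the problem)
    'A': ((0, 0),   0.1, 100),
    'B': ((35, 5),  5.0, 150),
    'C': ((15, 20), 1.0, 2),
    'D': ((10, 10), 1.0, 200),
}
points = [(1, 0), (35, 2), (15, 19), (10, 11)]

for p in points:
    # nearest cluster by Euclidean distance to the center
    name, (c, s, n) = min(clusters.items(),
                          key=lambda kv: np.linalg.norm(np.subtract(p, kv[1][0])))
    d = np.linalg.norm(np.subtract(p, c))
    print(p, '-> nearest cluster', name,
          f'distance = {d:.1f} ({d / s:.1f} std devs), cluster size = {n}')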
Question 3: Anomaly detection (50 points)

For this question, use the "County Health Indicators" dataset provided to identify the most anomalous counties. Please list the top 5 most anomalous counties computed using each of the following models. (We recommend that, as a pre-processing step, you drop NA values and make sure all numeric values are treated as floats, not strings.)

Part 1: Learn a Bayesian network structure using only the six features ["'% Smokers'", "'% Obese'", "'Violent Crime Rate'", "'80/20 Income Ratio'", "'% Children in Poverty'", "'Average Daily PM2.5'"]. Use pd.cut() to discretize each feature into 5 categories: 0, 1, 2, 3, 4.
(a) Use HillClimbSearch and BicScore to learn the Bayesian network structure. (5 pts)
(b) Which 5 counties have the lowest (most negative) log-likelihood values? Please show a ranked list of the top counties' names and log-likelihood values. (10 pts)

Part 2: Cluster-based anomaly detection. Use all numeric features for this part, and do not discretize.
(a) Clustering with k-means. Please use k=3 clusters. Compute each record's distance to the nearest cluster center and report the five counties with the largest distances. (10 pts)
(b) Clustering with a Gaussian mixture. Please repeat 2(a), but use the log-likelihood of each record (rather than distance) as the measure of anomalousness. (10 pts)

Part 3: Choose one more anomaly detection model you prefer and report the top 5 most anomalous counties by the model you chose. (10 pts)
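A hedged sketch of Parts 1 and 2, not a definitive solution: the file name 'CountyHealthIndicators.csv' and the county-name column 'Name' are placeholders to adjust to the actual dataset, and the per-record log-likelihood for Part 1(b) is computed one possible way, by summing log P(node | parents) over the learned CPDs. Uses pgmpy for Part 1 and scikit-learn for Part 2.

import numpy as np
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore
from pgmpy.models import BayesianNetwork   # BayesianModel in older pgmpy versions
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

df = pd.read_csv('CountyHealthIndicators.csv').dropna()   # placeholder file name

# ---- Part 1: Bayesian network on six discretized features ----
feats = ["'% Smokers'", "'% Obese'", "'Violent Crime Rate'",
         "'80/20 Income Ratio'", "'% Children in Poverty'",
         "'Average Daily PM2.5'"]
disc = df[feats].astype(float).apply(lambda c: pd.cut(c, bins=5, labels=False))

# (a) structure learning with hill climbing + BIC score
dag = HillClimbSearch(disc).estimate(scoring_method=BicScore(disc))
model = BayesianNetwork(dag.edges())
model.fit(disc)                              # maximum-likelihood CPDs
print(sorted(dag.edges()))

# (b) per-county log-likelihood: sum of log P(node | parents) over all CPDs
def record_loglik(row):
    ll = 0.0
    for node in model.nodes():
        cpd = model.get_cpds(node)
        ll += np.log(cpd.get_value(**{v: row[v] for v in cpd.variables}))
    return ll

ll_bn = disc.apply(record_loglik, axis=1)
print(pd.DataFrame({'county': df['Name'], 'loglik': ll_bn}).nsmallest(5, 'loglik'))

# ---- Part 2: cluster-based anomaly detection on all numeric features ----
Xnum = df.select_dtypes(include=[np.number]).astype(float)

# (a) k-means with k=3: distance to the nearest cluster center
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Xnum)
dist = km.transform(Xnum).min(axis=1)
print(pd.DataFrame({'county': df['Name'], 'dist': dist}).nlargest(5, 'dist'))

# (b) Gaussian mixture: per-record log-likelihood (lower = more anomalous)
gmm = GaussianMixture(n_components=3, random_state=0).fit(Xnum)
ll_gmm = gmm.score_samples(Xnum)
print(pd.DataFrame({'county': df['Name'], 'loglik': ll_gmm}).nsmallest(5, 'loglik'))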